Documentation

PII Firewall Developer Guide

This documentation is structured for fast onboarding and deep implementation work. Start with Install and Quickstart, then move on to profiles, detection backends, integrations, and the Vault & GDPR section. The API Reference at the end lists key methods and return fields.

Section 01

Install

The package is published on PyPI as pii-firewall. Import it as privacy_firewall.

Pick your install

Start with the recommended option. Add extras only when you need them.

Single-language apps

Skip [langdetect] and pass a fixed language code to create_firewall, for example language="es". This saves a dependency and removes auto-detection overhead.

Command	What you get
pip install pii-firewall	Regex-only. No ML. Good for IDs, emails, phones.
pip install "pii-firewall[presidio,langdetect]"	Recommended. Named entities + 55-language auto-detect.
pip install "pii-firewall[all]"	Every backend: GLiNER, Transformers, OPF, Nemotron.
pip install "pii-firewall[gliner]"	Zero-shot NER, no fine-tuning.
pip install "pii-firewall[transformers]"	Biomedical NER (BioBERT, d4data).

Example

# Recommended
pip install "pii-firewall[presidio,langdetect]"

# Download spaCy models for the languages you use
python -m spacy download en_core_web_sm
python -m spacy download es_core_news_sm
python -m spacy download fr_core_news_sm

Section 02

Quickstart

One import, one call. The firewall anonymizes input, calls the model, and restores real values in the response.

Minimal example

process() runs the full detect → anonymize → LLM → rehydrate cycle.

Example

from privacy_firewall import create_firewall

firewall = create_firewall("healthcare")

result = firewall.process(
    text="Ana García, 43 años, hipertensión. Prescripción: enalapril 10mg.",
    context={
        "tenant_id": "hospital-001",
        "case_id":   "patient-123",
        "thread_id": "consultation-1",
        "actor_id":  "doctor-456",
    },
)

print(result.sanitized_text)
# → "[PERSON_001], 40-49, hipertensión. enalapril 10mg."
#   Name pseudonymized. Age generalized. Medical terms KEPT.

print(result.final_text)
# → LLM response with real names restored.
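
To make the cycle concrete, the following is a toy, standard-library-only sketch of detect, anonymize, LLM call, and rehydrate. It is not the library's implementation: it only handles emails, and toy_round_trip, EMAIL_RE, and the [EMAIL_001] token format are names invented for this illustration.

```python
import re
from typing import Callable

# Deliberately simple email pattern, good enough for a demo only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def toy_round_trip(text: str, llm: Callable[[str], str]) -> dict:
    """Toy detect -> anonymize -> LLM -> rehydrate cycle (emails only)."""
    vault: dict[str, str] = {}   # token -> original value
    tokens: dict[str, str] = {}  # original value -> token

    def replace(match: re.Match) -> str:
        original = match.group(0)
        if original not in tokens:
            token = f"[EMAIL_{len(tokens) + 1:03d}]"
            tokens[original] = token
            vault[token] = original
        return tokens[original]

    sanitized = EMAIL_RE.sub(replace, text)   # detect + anonymize
    model_output = llm(sanitized)             # the model only sees tokens
    final = model_output
    for token, original in vault.items():     # rehydrate
        final = final.replace(token, original)
    return {
        "sanitized_text": sanitized,
        "model_output": model_output,
        "final_text": final,
    }


result = toy_round_trip(
    "Contact ana@example.com about the refill.",
    llm=lambda prompt: f"Noted: {prompt}",  # stand-in for a real model call
)
print(result["sanitized_text"])  # Contact [EMAIL_001] about the refill.
print(result["final_text"])      # Noted: Contact ana@example.com about the refill.
```

The point of the sketch is the data flow: the original value never reaches the model, and the mapping needed to restore it stays on your side.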

Context fields

All calls require these four fields. thread_id drives token-mapping continuity across turns.

Field	Role
tenant_id	Hard isolation boundary between customers.
thread_id	Maps tokens consistently across a conversation. [PERSON_001] always means the same person.
case_id	Groups related threads (e.g. one patient). Used for GDPR forget().
actor_id	Audit trail. Does not affect token logic.
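
Since every call needs all four fields, it can help to fail fast before the firewall is invoked. The helper below is a minimal sketch of that check. build_context is a hypothetical function written for this guide, not part of the package.

```python
REQUIRED_CONTEXT_FIELDS = ("tenant_id", "case_id", "thread_id", "actor_id")


def build_context(**fields: str) -> dict:
    """Return a context dict, raising if a required field is missing or empty."""
    missing = [name for name in REQUIRED_CONTEXT_FIELDS if not fields.get(name)]
    if missing:
        raise ValueError(f"context is missing required fields: {', '.join(missing)}")
    # Copy only the known fields so stray keys do not leak into the context.
    return {name: fields[name] for name in REQUIRED_CONTEXT_FIELDS}


ctx = build_context(
    tenant_id="hospital-001",
    case_id="patient-123",
    thread_id="consultation-1",
    actor_id="doctor-456",
)
```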

Section 03

Domain Profiles

Profiles define what survives the anonymization pass — what the model is allowed to see — and what action is taken on everything else.

Built-in presets

Pass the preset name to create_firewall(). Treat presets as starting points, then copy and customize the profile or backend mix. See the custom profile example below.

Profile	Keeps	Pseudonymizes	Transforms
healthcare	Diagnoses, medications, procedures, lab values	Names, IDs, addresses	Ages → decade range, dates → month/year
finance	Company names; amounts pass through as non-PII	Names, account numbers, IBANs, tax IDs	Credit cards masked (************1111)
legal	Court/firm names, statutes, case citations (public record)	Party names, strong identifiers	All dates → month/year
generic	Safe defaults for any domain	Names, emails, phones, IDs	—

Example

firewall = create_firewall("healthcare")        # or "finance", "legal", "generic"
firewall = create_firewall("healthcare", detector_backend="presidio")
firewall = create_firewall("healthcare", detector_backend="presidio", language="es")

Custom profile

Create a profile from scratch and set per-entity actions.

Example

from privacy_firewall import (
    create_custom_profile,
    EntityDisposition,
    DispositionAction,
    create_firewall,
)

profile = create_custom_profile("my_domain")

profile.add_disposition(EntityDisposition(
    entity_type="EMPLOYEE_ID",
    action=DispositionAction.PSEUDONYMIZE,
    confidence_threshold=0.9,
))
profile.add_disposition(EntityDisposition(
    entity_type="CASE_NUMBER",
    action=DispositionAction.KEEP,
    confidence_threshold=0.8,
))

firewall = create_firewall("generic", profile=profile)

Disposition actions

Only PSEUDONYMIZE is reversible, because the vault stores the original value. KEEP leaves the text unchanged. GENERALIZE, MASK, HASH, and REDACT permanently discard the original.

Action	Example	Reversible?
KEEP	hipertensión → hipertensión	N/A
PSEUDONYMIZE	Ana García → [PERSON_001]	Yes — vault mapping
GENERALIZE	43 años → 40-49	No
MASK	4111 1111 1111 1111 → ************1111	No
HASH	SHA-256 digest	No
REDACT	span removed entirely	No
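
The irreversible actions in the table can be pictured as plain functions. The sketch below reproduces the table's examples with the standard library only. The function names are illustrative, and the library's own transforms may differ in detail.

```python
import hashlib
import re


def generalize_age(age: int) -> str:
    """GENERALIZE: replace an exact age with its decade range."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def mask_card(number: str, visible: int = 4) -> str:
    """MASK: keep the last `visible` digits and star out the rest."""
    digits = re.sub(r"\D", "", number)  # drop spaces and dashes
    return "*" * (len(digits) - visible) + digits[-visible:]


def hash_value(value: str) -> str:
    """HASH: one-way SHA-256 hex digest of the value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


print(generalize_age(43))                # 40-49
print(mask_card("4111 1111 1111 1111"))  # ************1111
print(len(hash_value("Ana García")))     # 64
```

None of these outputs carries enough information to recover the input, which is why only PSEUDONYMIZE needs a vault entry.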

Section 04

Detection Backends

Switch backends with a single parameter. Start from a preset, then customize the profile or backend mix with custom regex, recognizers, or model IDs.

Backend comparison

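presidio is the recommended backend for production. Use regex for zero-dependency paths and hybrid when you want maximum coverage. Note that create_firewall() itself defaults to regex (see the API Reference), so pass detector_backend explicitly.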

Backend	Extra install	Best for	Latency
regex	(none)	Structured IDs, emails, phones	< 1 ms
presidio	[presidio] + spaCy model	Named entities — best balance	50–200 ms
gliner	[gliner]	Zero-shot NER	100–400 ms
transformers	[transformers]	Biomedical NER (d4data, BC5CDR)	100–500 ms
opf	[opf]	Language-agnostic token classifier	50–200 ms
nemotron	[opf]	High recall on free text	100–300 ms
hybrid	[presidio,langdetect]	Regex + Presidio combined	50–250 ms

Example

firewall = create_firewall("healthcare", detector_backend="presidio")   # recommended
firewall = create_firewall("healthcare", detector_backend="regex")      # zero deps
firewall = create_firewall("healthcare", detector_backend="gliner")      # zero-shot NER
firewall = create_firewall("healthcare", detector_backend="transformers", transformer_model_id="d4data/biomedical-ner-all")
firewall = create_firewall("healthcare", detector_backend="opf")          # token classifier
firewall = create_firewall("healthcare", detector_backend="hybrid")        # max coverage
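
If the same code runs in environments with different extras installed, one option is to fall back to regex when the preferred backend's dependency is missing. The sketch below is an assumption-laden illustration: pick_backend is a hypothetical helper, and the backend-to-module mapping covers only the three import names listed (presidio_analyzer, gliner, transformers).

```python
import importlib.util
from typing import Callable

# Import name each backend depends on (only backends with a well-known module).
BACKEND_MODULES = {
    "presidio": "presidio_analyzer",
    "gliner": "gliner",
    "transformers": "transformers",
}


def _installed(module: str) -> bool:
    """True if the top-level module can be imported in this environment."""
    return importlib.util.find_spec(module) is not None


def pick_backend(preferred: str = "presidio",
                 installed: Callable[[str], bool] = _installed) -> str:
    """Return `preferred` if its dependency is present, otherwise "regex"."""
    module = BACKEND_MODULES.get(preferred)
    if module is None or installed(module):
        # Backends without a listed module (e.g. "regex") pass through unchecked.
        return preferred
    return "regex"


backend = pick_backend("presidio")
print(backend)  # "presidio" if presidio_analyzer is installed, else "regex"
```

The result can then be passed as detector_backend. Silent fallback trades coverage for availability, so logging the chosen backend is a sensible addition.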

Section 05

Integrations

Pass any callable(prompt: str) -> str as llm_client. Works with OpenAI, Anthropic, LangChain, local models — anything. For FastAPI, use the microservice pattern below.

OpenAI

Example

from openai import OpenAI
from privacy_firewall import create_firewall

client     = OpenAI()
firewall   = create_firewall("healthcare", detector_backend="presidio")
ctx        = {"tenant_id": "t1", "case_id": "c1", "thread_id": "th1", "actor_id": "u1"}
user_input = "Ana García, 43 años, hipertensión."   # raw text from your application

def llm(prompt: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

result = firewall.secure_call(text=user_input, context=ctx, llm_client=llm)
print(result.final_text)   # real names restored

Anthropic

Example

import anthropic
from privacy_firewall import create_firewall

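ac       = anthropic.Anthropic()
firewall = create_firewall("healthcare", detector_backend="presidio")
ctx      = {"tenant_id": "t1", "case_id": "c1", "thread_id": "th1", "actor_id": "u1"}
# user_input is the raw text from your application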

def llm(prompt: str) -> str:
    return ac.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ).content[0].text

result = firewall.secure_call(text=user_input, context=ctx, llm_client=llm)

LangChain

Example

from langchain_openai import ChatOpenAI
from privacy_firewall import create_firewall

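llm_chain = ChatOpenAI(model="gpt-4o")
firewall   = create_firewall("generic", detector_backend="presidio")
ctx        = {"tenant_id": "t1", "case_id": "c1", "thread_id": "th1", "actor_id": "u1"}
# user_input is the raw text from your application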

result = firewall.secure_call(
    text=user_input,
    context=ctx,
    llm_client=lambda prompt: llm_chain.invoke(prompt).content,
)

Streaming (SSE / WebSocket)

Yields rehydrated tokens as they arrive from the model. You do not need to buffer the full response before forwarding it.

Example

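# secure_call_stream yields tokens with real names already restored.
# yield is only valid inside a function, so wrap the loop in a generator.
def stream_response(user_input, ctx):
    for token in firewall.secure_call_stream(
        text=user_input,
        context=ctx,
        llm_client=your_streaming_llm,
    ):
        yield token   # send to SSE / WebSocket immediately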
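
A streaming rehydrator has one tricky case: a placeholder such as [PERSON_001] can be split across two chunks. This guide does not show how the library handles it, so the sketch below is only one possible approach: hold back text from the last unclosed "[" until its "]" arrives. rehydrate_stream and the mapping argument are hypothetical names.

```python
from typing import Iterable, Iterator


def rehydrate_stream(chunks: Iterable[str], mapping: dict[str, str]) -> Iterator[str]:
    """Yield text with placeholders restored, even when one spans two chunks."""
    pending = ""
    for chunk in chunks:
        pending += chunk
        # Hold back everything from the last "[" that has no closing "]" yet.
        cut = pending.rfind("[")
        if cut != -1 and "]" not in pending[cut:]:
            ready, pending = pending[:cut], pending[cut:]
        else:
            ready, pending = pending, ""
        for token, original in mapping.items():
            ready = ready.replace(token, original)
        if ready:
            yield ready
    if pending:
        yield pending  # stream ended inside an unclosed "[": emit it unchanged


chunks = ["Hello [PER", "SON_001], your ", "results are in."]
out = list(rehydrate_stream(chunks, {"[PERSON_001]": "Ana García"}))
print("".join(out))  # Hello Ana García, your results are in.
```

The only text delayed is a partial placeholder, so latency stays close to that of the raw stream. A literal "[" that is never closed would be held until the stream ends, which a production version would cap with a maximum hold-back length.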

Section 06

Microservice Pattern

Run the firewall as a sidecar HTTP service. Any language with an HTTP client can call it: Node.js, Go, Java, and so on.

Python microservice

Configure via env vars. Expose /sanitize and /rehydrate.

Example

# main.py
import os
from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from privacy_firewall import PrivacyFirewallSDK

SDK = None

@asynccontextmanager
async def lifespan(app):
    global SDK
    SDK = PrivacyFirewallSDK.create(
        domain=os.getenv("PII_DOMAIN", "healthcare"),
        language=os.getenv("PII_LANGUAGE") or None,
        detector_backend=os.getenv("PII_BACKEND", "presidio"),
    )
    yield

app = FastAPI(lifespan=lifespan)

class Req(BaseModel):
    text: str
    context: dict

@app.post("/sanitize")
async def sanitize(req: Req):
    if not SDK: raise HTTPException(503)
    return {"sanitized_text": SDK.anonymize_text(req.text, req.context).sanitized_text}

@app.post("/rehydrate")
async def rehydrate(req: Req):
    if not SDK: raise HTTPException(503)
    return {"final_text": SDK.rehydrate_text(req.text, req.context)}

@app.get("/health")
async def health(): return {"ok": SDK is not None}

Calling from TypeScript / Node.js

Example

const BASE = process.env.PII_URL ?? "http://localhost:8000";

// Shared POST helper: throws on non-2xx responses instead of parsing an error body.
async function post(path: string, text: string, ctx: Record<string, string>) {
  const r = await fetch(`${BASE}${path}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text, context: ctx }),
  });
  if (!r.ok) throw new Error(`PII firewall ${path} failed: ${r.status}`);
  return r.json();
}

export async function sanitize(text: string, ctx: Record<string, string>) {
  return (await post("/sanitize", text, ctx)).sanitized_text as string;
}

export async function rehydrate(text: string, ctx: Record<string, string>) {
  return (await post("/rehydrate", text, ctx)).final_text as string;
}

Environment variables

These are read by your bootstrap code — not by the library itself.

Variable	Default	Values
PII_DOMAIN	healthcare	healthcare, finance, legal, generic
PII_LANGUAGE	(auto)	es, en, fr, de, it, pt — or omit for auto-detect
PII_BACKEND	presidio	presidio, regex, hybrid, gliner, transformers, opf, nemotron

Section 07

Vault & GDPR

The vault stores token ↔ original-value mappings per tenant + case + thread. Use forget() to satisfy Art. 17 GDPR erasure requests.

Vault backends

Default is in-memory. Switch to SQLite for persistence across restarts.

Example

# In-memory (default) — wiped on restart
firewall = create_firewall("healthcare")

# SQLite — survives restarts
from privacy_firewall import SQLiteMappingVault
vault    = SQLiteMappingVault("privacy_vault.db")
firewall = create_firewall("healthcare", vault=vault)

Manual anonymize / rehydrate

Split the round-trip when your LLM call lives outside the firewall process.

Example

# Step 1 — Anonymize before sending to LLM
anon       = firewall.anonymize(text=raw_text, context=ctx)
clean_text = anon.sanitized_text

# Step 2 — Call LLM with sanitized text
llm_out = my_llm(clean_text)

# Step 3 — Rehydrate on the way back
final = firewall.rehydrate(text=llm_out, context=ctx)
print(final)   # original names restored

GDPR right to be forgotten

forget() removes all vault mappings for the given scope. After this call, rehydration no longer restores values for that thread; any tokens left in the text stay as tokens.

Note

Only vault mappings are deleted. LLM responses or logs your application has already stored are outside the library's scope.

Example

deleted = firewall.forget(
    tenant_id="hospital-001",
    case_id="patient-123",
    thread_id="consultation-1",
)
print(f"Deleted {deleted} mappings")
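
The scoping behavior can be illustrated with a toy in-memory vault keyed by tenant, case, and thread. This is a sketch for intuition only. ToyVault is not the library's vault class, and its method signatures are assumptions.

```python
from collections import defaultdict


class ToyVault:
    """In-memory token vault scoped by (tenant_id, case_id, thread_id)."""

    def __init__(self) -> None:
        self._scopes: dict = defaultdict(dict)  # scope -> {token: original}

    def store(self, tenant_id, case_id, thread_id, token, original) -> None:
        self._scopes[(tenant_id, case_id, thread_id)][token] = original

    def rehydrate(self, tenant_id, case_id, thread_id, text: str) -> str:
        mapping = self._scopes.get((tenant_id, case_id, thread_id), {})
        for token, original in mapping.items():
            text = text.replace(token, original)
        return text

    def forget(self, tenant_id, case_id, thread_id) -> int:
        """Delete one scope and return how many mappings were removed."""
        return len(self._scopes.pop((tenant_id, case_id, thread_id), {}))


vault = ToyVault()
scope = ("hospital-001", "patient-123", "consultation-1")
vault.store(*scope, "[PERSON_001]", "Ana García")

print(vault.rehydrate(*scope, "Hello [PERSON_001]"))  # Hello Ana García
print(vault.forget(*scope))                           # 1
print(vault.rehydrate(*scope, "Hello [PERSON_001]"))  # Hello [PERSON_001]
```

After forget(), the token is still present in any stored text but can no longer be linked back to a person through the vault.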

Section 08

Custom Entities

Register your own entity types at runtime — no config files. Option A (regex) works with any backend. Option B (recognizer) requires presidio.

Option A — Regex (any backend)

Fastest path. Pass a regex string and a disposition action.

Example

# One-liner
firewall.add_custom_regex(
    entity_type="EMPLOYEE_ID",
    regex=r"\bEMP-\d{6}\b",
    locales=["GLOBAL"],          # or ["US"], ["ES"]...
    confidence=0.95,
    context_words=["employee", "staff"],
    disposition_action="pseudonymize",   # keep / pseudonymize / generalize / mask / redact
)

# Full EntityPattern for more control
import re
from privacy_firewall.patterns.catalog import EntityPattern

firewall.add_custom_pattern(EntityPattern(
    entity_type="CASE_NUMBER",
    locale="ES",
    pattern=re.compile(r"\bEXP-\d{4}/\d{6}\b"),
    confidence=0.98,
    context_words=("expediente", "exp"),
    description="Spanish legal case number",
))
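
Before registering a pattern, it is worth checking it against sample text with the standard re module. The snippet below exercises the two regexes from the example above. The sample sentence is made up for illustration.

```python
import re

EMPLOYEE_ID = re.compile(r"\bEMP-\d{6}\b")
CASE_NUMBER = re.compile(r"\bEXP-\d{4}/\d{6}\b")

sample = "Employee EMP-004217 opened expediente EXP-2024/001234; EMP-42 is too short."

print(EMPLOYEE_ID.findall(sample))  # ['EMP-004217']
print(CASE_NUMBER.findall(sample))  # ['EXP-2024/001234']
```

The \b anchors and the fixed digit counts keep near-misses such as EMP-42 from matching, which reduces false positives at high confidence values.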

Option B — Presidio recognizer

Use create_custom_recognizer() as a shortcut, or subclass EntityRecognizer for full control.

Example

from privacy_firewall import create_firewall
from privacy_firewall.presidio_integration import create_custom_recognizer

recognizer = create_custom_recognizer(
    entity_type="EMPLOYEE_ID",
    patterns=[r"\bEMP-\d{6}\b"],   # same EMP-NNNNNN format as the Option A example
    context_words=["employee", "badge"],
    score=0.9,
)

firewall = create_firewall(
    "generic",
    detector_backend="presidio",
    custom_recognizers=[recognizer],
)

# ── Full ML-based recognizer ──────────────────────────────────
from presidio_analyzer import EntityRecognizer, RecognizerResult

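class MyRecognizer(EntityRecognizer):
    def load(self): ...   # load model weights here if needed
    def analyze(self, text, entities, nlp_artifacts=None):
        # my_model is a placeholder for your own model. Its predict() is
        # assumed to return spans with .start, .end, and .score attributes.
        return [
            RecognizerResult("CUSTOM", s.start, s.end, s.score)
            for s in my_model.predict(text)
        ]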

firewall = create_firewall(
    "generic",
    detector_backend="presidio",
    custom_recognizers=[MyRecognizer(supported_entities=["CUSTOM"])],
)

Section 09

Custom HuggingFace Models

Pass any HuggingFace model ID to the transformers backend. The model is downloaded automatically on first call.

Install

Example

pip install "pii-firewall[transformers]"

Usage

Swap transformer_model_id for any HF NER model.

Example

from privacy_firewall import create_firewall

# Any HuggingFace NER model ID
firewall = create_firewall(
    "healthcare",
    detector_backend="transformers",
    transformer_model_id="dslim/bert-base-NER",
)

# GPU (0 = first card, -1 = CPU default)
firewall = create_firewall(
    "healthcare",
    detector_backend="transformers",
    transformer_model_id="d4data/biomedical-ner-all",
    transformer_device=0,
)

# Use the built-in curated catalog
from privacy_firewall.transformers_ner.models import get_model_for_domain
config   = get_model_for_domain("medical", "en")
firewall = create_firewall("healthcare", detector_backend="transformers", transformer_model_id=config.model_id)

Curated models

Pre-vetted models shipped in transformers_ner/models.py.

Domain	Language	Model ID
General	en	dslim/bert-base-NER
General	multilingual	Davlan/xlm-roberta-base-ner-hrl
General	fr	Jean-Baptiste/camembert-ner
Medical	en	d4data/biomedical-ner-all
Medical	es	PlanTL-GOB-ES/bsc-bio-ehr-es

Combine HF model with regex (Presidio hybrid)

Wrap the HF model as a Presidio recognizer so you can mix it with locale regex patterns in one pipeline.

Example

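from presidio_analyzer import EntityRecognizer, RecognizerResult
from transformers import pipeline

# dslim/bert-base-NER emits PER / ORG / LOC / MISC. Map them to the entity
# names declared below; other models may need a different map.
LABEL_MAP = {"PER": "PERSON", "ORG": "ORG", "LOC": "LOC"}

class HFRecognizer(EntityRecognizer):
    def __init__(self, model_id: str):
        super().__init__(supported_entities=["PERSON", "ORG", "LOC"])
        self._pipe = pipeline("ner", model=model_id, aggregation_strategy="simple")
    def load(self): ...   # nothing to do: the pipeline is built in __init__
    def analyze(self, text, entities, nlp_artifacts=None):
        return [
            RecognizerResult(LABEL_MAP[s["entity_group"]], s["start"], s["end"], float(s["score"]))
            for s in self._pipe(text)
            if s["entity_group"] in LABEL_MAP   # skip labels such as MISC
        ]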

firewall = create_firewall(
    "healthcare",
    detector_backend="presidio",
    custom_recognizers=[HFRecognizer("dslim/bert-base-NER")],
)

Section 10

Language Support

Auto-detects 55+ languages. The detected language is cached per thread, so detection adds no overhead after the first request in a thread. Six locales have dedicated country-document patterns.

Locale patterns

Other languages fall back to global patterns (email, phone, credit card, IBAN).

Language	Code	Country patterns	spaCy model
Spanish	es	DNI, NIE, IBAN-ES	es_core_news_sm
English (US)	en	SSN, EIN, ZIP	en_core_web_sm
French	fr	INSEE, SIREN	fr_core_news_sm
German	de	Steuernummer, IBAN-DE	de_core_news_sm
Italian	it	Codice Fiscale	it_core_news_sm
Portuguese	pt	NIF, NIS	pt_core_news_sm

Example

# Force a single language (skips auto-detect)
firewall = create_firewall("healthcare", language="es")

# Pre-warm spaCy models at startup — avoids lazy-load latency
firewall.preload_languages(["es", "en", "fr"])
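
The spaCy column of the table can double as a deployment checklist. The sketch below turns a list of language codes into the download commands to run at build time. SPACY_MODELS mirrors the table, and download_commands is a hypothetical helper.

```python
# Language code -> spaCy model, copied from the locale table above.
SPACY_MODELS = {
    "es": "es_core_news_sm",
    "en": "en_core_web_sm",
    "fr": "fr_core_news_sm",
    "de": "de_core_news_sm",
    "it": "it_core_news_sm",
    "pt": "pt_core_news_sm",
}


def download_commands(languages: list[str]) -> list[str]:
    """Return one `spacy download` command per language that has a listed model."""
    return [
        f"python -m spacy download {SPACY_MODELS[lang]}"
        for lang in languages
        if lang in SPACY_MODELS  # languages without a listed model are skipped
    ]


for command in download_commands(["es", "en", "ja"]):
    print(command)
# python -m spacy download es_core_news_sm
# python -m spacy download en_core_web_sm
```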

Section 11

API Reference

Key methods and return fields.

create_firewall()

Parameter	Type	Default	Notes
domain	str	"generic"	"healthcare", "finance", "legal", "generic"
profile	DomainProfile	None	Custom profile — overrides domain.
language	str \| None	None (auto)	"es", "en", "fr", "de", "it", "pt"
detector_backend	str	"regex"	"regex", "presidio", "hybrid", "gliner", "transformers", "opf", "nemotron"
vault	MappingVaultProtocol	InMemory	Pass SQLiteMappingVault for persistence.
custom_recognizers	list	[]	Presidio EntityRecognizer instances.
transformer_model_id	str	None	HF model ID (transformers backend).
transformer_device	int	-1	GPU index. -1 = CPU.

PrivacyFirewall methods

Method	Returns	Description
process(text, context)	ProcessResult	anonymize + call LLM + rehydrate.
secure_call(text, context, llm_client)	ProcessResult	Explicit LLM client version of process().
secure_call_stream(text, context, llm_client)	Iterator[str]	Streaming — yields rehydrated tokens.
anonymize(text, context)	ProcessResult	Detect + replace only. No LLM call.
rehydrate(text, context)	str	Restore vault values in text.
forget(tenant_id, case_id, thread_id)	int	Delete vault scope. Returns deleted count.
add_custom_regex(...)	None	Register regex entity at runtime.
add_custom_pattern(EntityPattern)	None	Register full EntityPattern at runtime.
preload_languages(list)	None	Pre-warm spaCy models at startup.

ProcessResult fields

Field	Description
sanitized_text	Anonymized text sent to the LLM.
model_output	Raw LLM response (tokens not yet replaced).
final_text	Rehydrated LLM output — real values restored.
trace.detected_entities	All detected entities with type, span, confidence.
trace.entities_kept	Entities that received KEEP disposition.
trace.replacements	Applied substitutions: original → token.
trace.language	Detected or forced language code.
trace.cleanup_warnings	Residual PII warnings after cleanup passes.
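
The trace fields lend themselves to audit logging that records what was detected without recording the values themselves. The sketch below is heavily hedged: ToyTrace is a stand-in written for this guide, and it assumes detected_entities is a list of dicts with a "type" key, which the real trace object may not match.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ToyTrace:
    """Hypothetical stand-in for result.trace; the real shape may differ."""
    detected_entities: list = field(default_factory=list)  # assumed: dicts with "type"
    language: str = "en"
    cleanup_warnings: list = field(default_factory=list)


def audit_summary(trace) -> dict:
    """Summarize a trace by entity type counts only, never by original value."""
    counts = Counter(entity["type"] for entity in trace.detected_entities)
    return {
        "language": trace.language,
        "entity_counts": dict(counts),
        "warnings": len(trace.cleanup_warnings),
    }


trace = ToyTrace(
    detected_entities=[{"type": "PERSON"}, {"type": "AGE"}, {"type": "PERSON"}],
    language="es",
)
print(audit_summary(trace))
# {'language': 'es', 'entity_counts': {'PERSON': 2, 'AGE': 1}, 'warnings': 0}
```

Logging counts rather than trace.replacements keeps original values out of application logs, which forget() cannot reach.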