PII Firewall Developer Guide

This documentation is structured for fast onboarding and deep implementation work. Start with Install and Quickstart, then move on to profiles, detection backends, integrations, and the vault and GDPR sections. The API Reference closes the guide.

Section 01

Install

The package is published on PyPI as pii-firewall. The import name is privacy_firewall.

Pick your install

Start with the recommended option. Add extras only when you need them.

Single-language apps

Skip [langdetect] and pass language="es" to create_firewall. Saves a dependency and removes auto-detection overhead.

| Command | What you get |
| --- | --- |
| pip install pii-firewall | Regex-only. No ML. Good for IDs, emails, phones. |
| pip install "pii-firewall[presidio,langdetect]" | Recommended. Named entities + 55-language auto-detect. |
| pip install "pii-firewall[all]" | Every backend: GLiNER, Transformers, OPF, Nemotron. |
| pip install "pii-firewall[gliner]" | Zero-shot NER, no fine-tuning. |
| pip install "pii-firewall[transformers]" | Biomedical NER (BioBERT, d4data). |

Example
# Recommended
pip install "pii-firewall[presidio,langdetect]"

# Download spaCy models for the languages you use
python -m spacy download en_core_web_sm
python -m spacy download es_core_news_sm
python -m spacy download fr_core_news_sm
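
Because the extras are optional, an application may want to choose a backend based on what is actually installed. The sketch below is one possible approach using only the standard library. The helper name pick_backend is hypothetical, and it assumes the [presidio] extra makes the presidio_analyzer module importable.

```python
from importlib.util import find_spec


def pick_backend() -> str:
    """Return "presidio" when Presidio is importable, else fall back to "regex"."""
    # find_spec returns None for a missing top-level module instead of raising.
    if find_spec("presidio_analyzer") is not None:
        return "presidio"
    return "regex"


backend = pick_backend()
print(backend)
# Hypothetical use: create_firewall("generic", detector_backend=backend)
```

Falling back to regex keeps the service usable on a minimal install, at the cost of losing named-entity detection.
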
Section 02

Quickstart

One import, one call. The firewall anonymizes input, calls the model, and restores real values in the response.

Minimal example

process() runs the full detect → anonymize → LLM → rehydrate cycle.

Example
from privacy_firewall import create_firewall

firewall = create_firewall("healthcare")

result = firewall.process(
    text="Ana García, 43 años, hipertensión. Prescripción: enalapril 10mg.",
    context={
        "tenant_id": "hospital-001",
        "case_id":   "patient-123",
        "thread_id": "consultation-1",
        "actor_id":  "doctor-456",
    },
)

print(result.sanitized_text)
# → "[PERSON_001], 40-49, hipertensión. Prescripción: enalapril 10mg."
#   Name pseudonymized. Age generalized. Medical terms KEPT.

print(result.final_text)
# → LLM response with real names restored.

Context fields

All calls require these four fields. thread_id drives token-mapping continuity across turns.

| Field | Role |
| --- | --- |
| tenant_id | Hard isolation boundary between customers. |
| thread_id | Maps tokens consistently across a conversation. [PERSON_001] always means the same person. |
| case_id | Groups related threads (e.g. one patient). Used for GDPR forget(). |
| actor_id | Audit trail. Does not affect token logic. |

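
To illustrate why thread_id matters, here is a minimal sketch of scope-bound token assignment. It is not the library's implementation: the TokenMap class, its storage layout, and the numbering scheme are assumptions made for illustration, with the token shape taken from the examples in this guide.

```python
from collections import defaultdict
from typing import Dict, Tuple

Scope = Tuple[str, str, str]  # (tenant_id, case_id, thread_id)


class TokenMap:
    """Toy mapping that hands out stable placeholders per scope."""

    def __init__(self) -> None:
        self._tokens: Dict[Scope, Dict[str, str]] = defaultdict(dict)

    def token_for(self, context: Dict[str, str], entity_type: str, value: str) -> str:
        scope = (context["tenant_id"], context["case_id"], context["thread_id"])
        seen = self._tokens[scope]
        key = f"{entity_type}:{value}"
        if key not in seen:
            # Number tokens per entity type within the scope: 001, 002, ...
            count = sum(1 for k in seen if k.startswith(f"{entity_type}:")) + 1
            seen[key] = f"[{entity_type}_{count:03d}]"
        return seen[key]


ctx = {"tenant_id": "hospital-001", "case_id": "patient-123",
       "thread_id": "consultation-1", "actor_id": "doctor-456"}
tokens = TokenMap()
print(tokens.token_for(ctx, "PERSON", "Ana García"))   # [PERSON_001]
print(tokens.token_for(ctx, "PERSON", "Luis Pérez"))   # [PERSON_002]
print(tokens.token_for(ctx, "PERSON", "Ana García"))   # [PERSON_001] again
```

In this sketch actor_id is deliberately left out of the scope key, which mirrors the statement above that it does not affect token logic.
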
Section 03

Domain Profiles

Profiles define what survives the anonymization pass — what the model is allowed to see — and what action is taken on everything else.

Built-in presets

Pass the preset name to create_firewall(). Treat presets as starting points, then copy and customize the profile or backend mix. See the custom profile example below.

| Profile | Keeps | Pseudonymizes | Transforms |
| --- | --- | --- | --- |
| healthcare | Diagnoses, medications, procedures, lab values | Names, IDs, addresses | Ages → decade range, dates → month/year |
| finance | Company names; amounts pass through as non-PII | Names, account numbers, IBANs, tax IDs | Credit cards masked (************1111) |
| legal | Court/firm names, statutes, case citations (public record) | Party names, strong identifiers | All dates → month/year |
| generic | Safe defaults for any domain | Names, emails, phones, IDs | |

Example
firewall = create_firewall("healthcare")        # or "finance", "legal", "generic"
firewall = create_firewall("healthcare", detector_backend="presidio")
firewall = create_firewall("healthcare", detector_backend="presidio", language="es")

Custom profile

Create a profile from scratch and set per-entity actions.

Example
from privacy_firewall import (
    create_custom_profile,
    EntityDisposition,
    DispositionAction,
    create_firewall,
)

profile = create_custom_profile("my_domain")

profile.add_disposition(EntityDisposition(
    entity_type="EMPLOYEE_ID",
    action=DispositionAction.PSEUDONYMIZE,
    confidence_threshold=0.9,
))
profile.add_disposition(EntityDisposition(
    entity_type="CASE_NUMBER",
    action=DispositionAction.KEEP,
    confidence_threshold=0.8,
))

firewall = create_firewall("generic", profile=profile)

Disposition actions

Only PSEUDONYMIZE is reversible, because the vault stores the original value. GENERALIZE, MASK, HASH, and REDACT permanently discard it. KEEP leaves the text unchanged, so there is nothing to restore.

| Action | Example | Reversible? |
| --- | --- | --- |
| KEEP | hipertensión → hipertensión | N/A |
| PSEUDONYMIZE | Ana García → [PERSON_001] | Yes (vault mapping) |
| GENERALIZE | 43 años → 40-49 | No |
| MASK | 4111 1111 1111 1111 → ************1111 | No |
| HASH | SHA-256 digest | No |
| REDACT | span removed entirely | No |

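
The irreversible actions in the table can be sketched with a few standard-library helpers. These functions are illustrative stand-ins, not the library's code. Their names and signatures are hypothetical, and the outputs are chosen to match the examples in the table.

```python
import hashlib
import re


def generalize_age(age: int) -> str:
    """Map an age to its decade range, e.g. 43 -> "40-49"."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"


def mask_card(number: str, visible: int = 4) -> str:
    """Keep the last `visible` digits and star out the rest."""
    digits = re.sub(r"\D", "", number)
    return "*" * (len(digits) - visible) + digits[-visible:]


def hash_value(value: str) -> str:
    """One-way SHA-256 hex digest of the value."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def redact(text: str, start: int, end: int) -> str:
    """Remove the span text[start:end] entirely."""
    return text[:start] + text[end:]


print(generalize_age(43))                 # 40-49
print(mask_card("4111 1111 1111 1111"))   # ************1111
print(len(hash_value("Ana García")))      # 64
print(redact("ID 12345 on file", 3, 9))   # ID on file
```

None of these helpers keeps the input anywhere, which is why the corresponding actions cannot be rehydrated. An unsalted digest of a short identifier can still be guessed by brute force, so a production HASH step would normally use a salt or key.
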
Section 04

Detection Backends

Switch backends with a single parameter. Start from a preset, then customize the profile or backend mix with custom regex, recognizers, or model IDs.

Backend comparison

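presidio is the recommended backend for production. Use regex for zero-dependency paths; per the API Reference it is also the value used when detector_backend is omitted. hybrid combines regex and Presidio when you want maximum coverage.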

| Backend | Extra install | Best for | Latency |
| --- | --- | --- | --- |
| regex | (none) | Structured IDs, emails, phones | < 1 ms |
| presidio | [presidio] + spaCy model | Named entities; best balance | 50–200 ms |
| gliner | [gliner] | Zero-shot NER | 100–400 ms |
| transformers | [transformers] | Biomedical NER (d4data, BC5CDR) | 100–500 ms |
| opf | [opf] | Language-agnostic token classifier | 50–200 ms |
| nemotron | [opf] | High recall on free text | 100–300 ms |
| hybrid | [presidio,langdetect] | Regex + Presidio combined | 50–250 ms |

Example
firewall = create_firewall("healthcare", detector_backend="presidio")   # recommended
firewall = create_firewall("healthcare", detector_backend="regex")      # zero deps
firewall = create_firewall("healthcare", detector_backend="gliner")      # zero-shot NER
firewall = create_firewall("healthcare", detector_backend="transformers", transformer_model_id="d4data/biomedical-ner-all")
firewall = create_firewall("healthcare", detector_backend="opf")          # token classifier
firewall = create_firewall("healthcare", detector_backend="hybrid")        # max coverage
Section 05

Integrations

Pass any callable(prompt: str) -> str as llm_client. Works with OpenAI, Anthropic, LangChain, local models — anything. For FastAPI, use the microservice pattern below.
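
Since llm_client is just a callable, a stand-in can replace the real model in tests to inspect exactly what text would leave the process. The sketch below shows one way to do that. RecordingLLM and the LLMClient alias are hypothetical names, not part of the library.

```python
from typing import Callable, List

LLMClient = Callable[[str], str]


class RecordingLLM:
    """Callable stub that stores every prompt and returns a canned reply."""

    def __init__(self, reply: str = "ok") -> None:
        self.reply = reply
        self.prompts: List[str] = []

    def __call__(self, prompt: str) -> str:
        self.prompts.append(prompt)
        return self.reply


stub: LLMClient = RecordingLLM(reply="Noted, [PERSON_001].")
print(stub("[PERSON_001] reports chest pain."))
# Hypothetical use: firewall.secure_call(text=..., context=..., llm_client=stub),
# then assert that no raw name appears in stub.prompts.
```

A test built this way can check the sanitized prompt without network access or API keys.
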

OpenAI

Example
from openai import OpenAI
from privacy_firewall import create_firewall

client   = OpenAI()
firewall = create_firewall("healthcare", detector_backend="presidio")
ctx      = {"tenant_id": "t1", "case_id": "c1", "thread_id": "th1", "actor_id": "u1"}

def llm(prompt: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

result = firewall.secure_call(text=user_input, context=ctx, llm_client=llm)
print(result.final_text)   # real names restored

Anthropic

Example
import anthropic
from privacy_firewall import create_firewall

ac       = anthropic.Anthropic()
firewall = create_firewall("healthcare", detector_backend="presidio")
ctx      = {"tenant_id": "t1", "case_id": "c1", "thread_id": "th1", "actor_id": "u1"}

def llm(prompt: str) -> str:
    return ac.messages.create(
        model="claude-opus-4-5",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ).content[0].text

result = firewall.secure_call(text=user_input, context=ctx, llm_client=llm)

LangChain

Example
from langchain_openai import ChatOpenAI
from privacy_firewall import create_firewall

llm_chain = ChatOpenAI(model="gpt-4o")
firewall   = create_firewall("generic", detector_backend="presidio")
ctx        = {"tenant_id": "t1", "case_id": "c1", "thread_id": "th1", "actor_id": "u1"}

result = firewall.secure_call(
    text=user_input,
    context=ctx,
    llm_client=lambda prompt: llm_chain.invoke(prompt).content,
)

Streaming (SSE / WebSocket)

Yields rehydrated tokens as they arrive from the model, so the caller does not need to buffer the response.

Example
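# secure_call_stream yields tokens with real names already restored.
# Wrapped in a generator function because a bare `yield` is only valid inside one.
def stream_response(user_input, ctx):
    for token in firewall.secure_call_stream(
        text=user_input,
        context=ctx,
        llm_client=your_streaming_llm,
    ):
        yield token   # send to SSE / WebSocket immediately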
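
A streamed response can split a placeholder such as [PERSON_001] across two chunks, so any streaming rehydrator has to hold back a partial placeholder until it completes. The sketch below shows one way to do that with the standard library. It is not the library's implementation: rehydrate_stream, PLACEHOLDER, and MAX_HOLD are hypothetical names, and the placeholder shape is assumed from the examples in this guide.

```python
import re
from typing import Dict, Iterable, Iterator

PLACEHOLDER = re.compile(r"\[[A-Z][A-Z_]*_\d+\]")   # assumed token shape, e.g. [PERSON_001]
MAX_HOLD = 32   # never hold back more than this many characters


def rehydrate_stream(chunks: Iterable[str], mapping: Dict[str, str]) -> Iterator[str]:
    pending = ""
    for chunk in chunks:
        pending += chunk
        # Swap every complete placeholder currently in the buffer.
        pending = PLACEHOLDER.sub(lambda m: mapping.get(m.group(0), m.group(0)), pending)
        # Hold back an unclosed "[" tail: it may be the start of a placeholder.
        cut = pending.rfind("[")
        tail_open = cut != -1 and "]" not in pending[cut:]
        if tail_open and len(pending) - cut <= MAX_HOLD:
            ready, pending = pending[:cut], pending[cut:]
        else:
            ready, pending = pending, ""
        if ready:
            yield ready
    if pending:
        yield pending   # flush whatever is left when the stream ends


vault = {"[PERSON_001]": "Ana García"}
stream = ["Hello [PER", "SON_001], your results ", "are ready."]
print("".join(rehydrate_stream(stream, vault)))
# Hello Ana García, your results are ready.
```

Only the unclosed tail is delayed, so ordinary text still reaches the client as soon as it arrives.
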
Section 06

Microservice Pattern

Run the firewall as a sidecar HTTP service. Any language with an HTTP client can call it: Node.js, Go, Java, and so on.

Python microservice

Configure via env vars. Expose /sanitize and /rehydrate.

Example
# main.py
import os
from contextlib import asynccontextmanager
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from privacy_firewall import PrivacyFirewallSDK

SDK = None

@asynccontextmanager
async def lifespan(app):
    global SDK
    SDK = PrivacyFirewallSDK.create(
        domain=os.getenv("PII_DOMAIN", "healthcare"),
        language=os.getenv("PII_LANGUAGE") or None,
        detector_backend=os.getenv("PII_BACKEND", "presidio"),
    )
    yield

app = FastAPI(lifespan=lifespan)

class Req(BaseModel):
    text: str
    context: dict

@app.post("/sanitize")
async def sanitize(req: Req):
    if not SDK:
        raise HTTPException(status_code=503)
    return {"sanitized_text": SDK.anonymize_text(req.text, req.context).sanitized_text}

@app.post("/rehydrate")
async def rehydrate(req: Req):
    if not SDK:
        raise HTTPException(status_code=503)
    return {"final_text": SDK.rehydrate_text(req.text, req.context)}

@app.get("/health")
async def health():
    return {"ok": SDK is not None}

Calling from TypeScript / Node.js

Example
const BASE = process.env.PII_URL ?? "http://localhost:8000";

// Shared POST helper. It throws on non-2xx responses so that a failed
// sanitize call can never be mistaken for sanitized text.
async function post(path: string, text: string, ctx: Record<string, string>) {
  const r = await fetch(`${BASE}${path}`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text, context: ctx }),
  });
  if (!r.ok) throw new Error(`PII firewall ${path} failed: HTTP ${r.status}`);
  return r.json();
}

export async function sanitize(text: string, ctx: Record<string, string>) {
  return (await post("/sanitize", text, ctx)).sanitized_text as string;
}

export async function rehydrate(text: string, ctx: Record<string, string>) {
  return (await post("/rehydrate", text, ctx)).final_text as string;
}
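
For a Python caller running in a separate process, the same two endpoints can be reached with the standard library alone. This is a minimal sketch under stated assumptions: BASE is an assumed sidecar address, and build_request and call are hypothetical helper names.

```python
import json
import urllib.request
from typing import Dict

BASE = "http://localhost:8000"  # assumed address of the sidecar


def build_request(path: str, text: str, ctx: Dict[str, str]) -> urllib.request.Request:
    """Build the JSON POST request expected by /sanitize and /rehydrate."""
    body = json.dumps({"text": text, "context": ctx}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE}{path}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def call(path: str, text: str, ctx: Dict[str, str], field: str) -> str:
    """Send the request and return one field of the JSON response."""
    # urlopen raises urllib.error.HTTPError on non-2xx responses.
    with urllib.request.urlopen(build_request(path, text, ctx), timeout=10) as resp:
        return json.loads(resp.read())[field]


req = build_request("/sanitize", "Ana García, 43 años", {"tenant_id": "t1"})
print(req.get_method(), req.full_url)   # POST http://localhost:8000/sanitize
```

With the service running, call("/sanitize", text, ctx, "sanitized_text") and call("/rehydrate", text, ctx, "final_text") would mirror the two TypeScript functions.
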

Environment variables

These are read by your bootstrap code — not by the library itself.

| Variable | Default | Values |
| --- | --- | --- |
| PII_DOMAIN | healthcare | healthcare, finance, legal, generic |
| PII_LANGUAGE | (auto) | es, en, fr, de, it, pt; omit for auto-detect |
| PII_BACKEND | presidio | presidio, regex, hybrid, gliner, transformers, opf, nemotron |

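
Since the bootstrap code owns these variables, it can also validate them before the firewall is created. The sketch below is one way to do so. load_settings is a hypothetical helper, and the allowed values are copied from the table above rather than queried from the library.

```python
import os
from typing import Dict, Mapping, Optional

DOMAINS = {"healthcare", "finance", "legal", "generic"}
BACKENDS = {"presidio", "regex", "hybrid", "gliner", "transformers", "opf", "nemotron"}
LANGUAGES = {"es", "en", "fr", "de", "it", "pt"}


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Optional[str]]:
    """Read the PII_* variables, apply the documented defaults, and validate."""
    domain = env.get("PII_DOMAIN", "healthcare")
    backend = env.get("PII_BACKEND", "presidio")
    language = env.get("PII_LANGUAGE") or None   # empty or unset means auto-detect
    if domain not in DOMAINS:
        raise ValueError(f"Unsupported PII_DOMAIN: {domain!r}")
    if backend not in BACKENDS:
        raise ValueError(f"Unsupported PII_BACKEND: {backend!r}")
    if language is not None and language not in LANGUAGES:
        raise ValueError(f"Unsupported PII_LANGUAGE: {language!r}")
    return {"domain": domain, "detector_backend": backend, "language": language}


print(load_settings({"PII_DOMAIN": "legal", "PII_LANGUAGE": "es"}))
# {'domain': 'legal', 'detector_backend': 'presidio', 'language': 'es'}
```

Failing at startup on a typo such as PII_BACKEND=presido is usually preferable to discovering it on the first request.
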
Section 07

Vault & GDPR

The vault stores token ↔ original-value mappings per tenant + case + thread. Use forget() to satisfy Art. 17 GDPR erasure requests.

Vault backends

Default is in-memory. Switch to SQLite for persistence across restarts.

Example
# In-memory (default) — wiped on restart
firewall = create_firewall("healthcare")

# SQLite — survives restarts
from privacy_firewall import SQLiteMappingVault
vault    = SQLiteMappingVault("privacy_vault.db")
firewall = create_firewall("healthcare", vault=vault)

Manual anonymize / rehydrate

Split the round-trip when your LLM call lives outside the firewall process.

Example
# Step 1 — Anonymize before sending to LLM
anon       = firewall.anonymize(text=raw_text, context=ctx)
clean_text = anon.sanitized_text

# Step 2 — Call LLM with sanitized text
llm_out = my_llm(clean_text)

# Step 3 — Rehydrate on the way back
final = firewall.rehydrate(text=llm_out, context=ctx)
print(final)   # original names restored

GDPR right to be forgotten

forget() removes all vault mappings for the given scope. After this call, rehydration no longer restores values for that thread.

Note

Only vault mappings are deleted. LLM responses or logs your application has already stored are outside the library's scope.

Example
deleted = firewall.forget(
    tenant_id="hospital-001",
    case_id="patient-123",
    thread_id="consultation-1",
)
print(f"Deleted {deleted} mappings")
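
To make the scoped deletion concrete, the following sketch models a vault as a single SQLite table and deletes by tenant, case, and thread. The schema and the standalone forget function are assumptions for illustration and do not describe how SQLiteMappingVault stores its data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mappings ("
    " tenant_id TEXT, case_id TEXT, thread_id TEXT,"
    " token TEXT, original TEXT)"
)
conn.executemany(
    "INSERT INTO mappings VALUES (?, ?, ?, ?, ?)",
    [
        ("hospital-001", "patient-123", "consultation-1", "[PERSON_001]", "Ana García"),
        ("hospital-001", "patient-123", "consultation-1", "[ID_001]", "12345678Z"),
        ("hospital-001", "patient-999", "consultation-7", "[PERSON_001]", "Luis Pérez"),
    ],
)


def forget(tenant_id: str, case_id: str, thread_id: str) -> int:
    """Delete every mapping in the scope and return how many rows were removed."""
    cur = conn.execute(
        "DELETE FROM mappings WHERE tenant_id = ? AND case_id = ? AND thread_id = ?",
        (tenant_id, case_id, thread_id),
    )
    conn.commit()
    return cur.rowcount


deleted = forget("hospital-001", "patient-123", "consultation-1")
print(deleted)                                                       # 2
print(conn.execute("SELECT COUNT(*) FROM mappings").fetchone()[0])   # 1
```

Mappings that belong to another case stay untouched, which is the behaviour the scope arguments are meant to guarantee.
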
Section 08

Custom Entities

Register your own entity types at runtime — no config files. Option A (regex) works with any backend. Option B (recognizer) requires presidio.

Option A — Regex (any backend)

Fastest path. Pass a regex string and a disposition action.

Example
# One-liner
firewall.add_custom_regex(
    entity_type="EMPLOYEE_ID",
    regex=r"\bEMP-\d{6}\b",
    locales=["GLOBAL"],          # or ["US"], ["ES"]...
    confidence=0.95,
    context_words=["employee", "staff"],
    disposition_action="pseudonymize",   # keep / pseudonymize / generalize / mask / redact
)

# Full EntityPattern for more control
import re
from privacy_firewall.patterns.catalog import EntityPattern

firewall.add_custom_pattern(EntityPattern(
    entity_type="CASE_NUMBER",
    locale="ES",
    pattern=re.compile(r"\bEXP-\d{4}/\d{6}\b"),
    confidence=0.98,
    context_words=("expediente", "exp"),
    description="Spanish legal case number",
))
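
Before registering a pattern, it can help to check what it matches on sample text. This standalone sketch exercises the two regular expressions from the example above with Python's re module. The sample sentences are made up for the test.

```python
import re

EMPLOYEE_ID = re.compile(r"\bEMP-\d{6}\b")
CASE_NUMBER = re.compile(r"\bEXP-\d{4}/\d{6}\b")

samples = [
    "Employee EMP-004512 opened expediente EXP-2024/000317.",
    "EMP-12345 is too short, and XEMP-123456 lacks a word boundary.",
]

for text in samples:
    print(EMPLOYEE_ID.findall(text), CASE_NUMBER.findall(text))
# ['EMP-004512'] ['EXP-2024/000317']
# [] []
```

The \b anchors are what prevent a match inside a longer identifier such as XEMP-123456.
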

Option B — Presidio recognizer

Use create_custom_recognizer() as a shortcut, or subclass EntityRecognizer for full control.

Example
from privacy_firewall import create_firewall
from privacy_firewall.presidio_integration import create_custom_recognizer

recognizer = create_custom_recognizer(
    entity_type="EMPLOYEE_ID",
    patterns=[r"\bEMP-\d{6}\b"],
    context_words=["employee", "badge"],
    score=0.9,
)

firewall = create_firewall(
    "generic",
    detector_backend="presidio",
    custom_recognizers=[recognizer],
)

# ── Full ML-based recognizer ──────────────────────────────────
from presidio_analyzer import EntityRecognizer, RecognizerResult

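class MyRecognizer(EntityRecognizer):
    def load(self): ...   # nothing to load eagerly in this example
    def analyze(self, text, entities, nlp_artifacts):
        # my_model is a placeholder for your own model; each prediction is
        # expected to expose .start, .end and .score.
        return [
            RecognizerResult("CUSTOM", s.start, s.end, s.score)
            for s in my_model.predict(text)
        ]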

firewall = create_firewall(
    "generic",
    detector_backend="presidio",
    custom_recognizers=[MyRecognizer(supported_entities=["CUSTOM"])],
)
Section 09

Custom HuggingFace Models

Pass any HuggingFace model ID to the transformers backend. The model is downloaded automatically on first call.

Install

Example
pip install "pii-firewall[transformers]"

Usage

Swap transformer_model_id for any HF NER model.

Example
from privacy_firewall import create_firewall

# Any HuggingFace NER model ID
firewall = create_firewall(
    "healthcare",
    detector_backend="transformers",
    transformer_model_id="dslim/bert-base-NER",
)

# GPU (0 = first card, -1 = CPU default)
firewall = create_firewall(
    "healthcare",
    detector_backend="transformers",
    transformer_model_id="d4data/biomedical-ner-all",
    transformer_device=0,
)

# Use the built-in curated catalog
from privacy_firewall.transformers_ner.models import get_model_for_domain
config   = get_model_for_domain("medical", "en")
firewall = create_firewall("healthcare", detector_backend="transformers", transformer_model_id=config.model_id)

Curated models

Pre-vetted models shipped in transformers_ner/models.py.

| Domain | Language | Model ID |
| --- | --- | --- |
| General | en | dslim/bert-base-NER |
| General | multilingual | Davlan/xlm-roberta-base-ner-hrl |
| General | fr | Jean-Baptiste/camembert-ner |
| Medical | en | d4data/biomedical-ner-all |
| Medical | es | PlanTL-GOB-ES/bsc-bio-ehr-es |

Combine HF model with regex (Presidio hybrid)

Wrap the HF model as a Presidio recognizer so you can mix it with locale regex patterns in one pipeline.

Example
from presidio_analyzer import EntityRecognizer, RecognizerResult
from transformers import pipeline

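class HFRecognizer(EntityRecognizer):
    # dslim/bert-base-NER emits PER / ORG / LOC / MISC groups; map them onto the
    # entity names this recognizer declares. Adjust the map for other models.
    LABEL_MAP = {"PER": "PERSON", "ORG": "ORG", "LOC": "LOC"}

    def __init__(self, model_id: str):
        super().__init__(supported_entities=list(self.LABEL_MAP.values()))
        self._pipe = pipeline("ner", model=model_id, aggregation_strategy="simple")

    def load(self): ...

    def analyze(self, text, entities, nlp_artifacts):
        return [
            RecognizerResult(self.LABEL_MAP[s["entity_group"]], s["start"], s["end"], float(s["score"]))
            for s in self._pipe(text)
            if s["entity_group"] in self.LABEL_MAP   # skip labels with no mapping, e.g. MISC
        ]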

firewall = create_firewall(
    "healthcare",
    detector_backend="presidio",
    custom_recognizers=[HFRecognizer("dslim/bert-base-NER")],
)
Section 10

Language Support

Auto-detects 55+ languages. The detected language is cached per thread, so detection adds no overhead after the first request in a thread. Six locales have dedicated country-document patterns.

Locale patterns

Other languages fall back to global patterns (email, phone, credit card, IBAN).

| Language | Code | Country patterns | spaCy model |
| --- | --- | --- | --- |
| Spanish | es | DNI, NIE, IBAN-ES | es_core_news_sm |
| English (US) | en | SSN, EIN, ZIP | en_core_web_sm |
| French | fr | INSEE, SIREN | fr_core_news_sm |
| German | de | Steuernummer, IBAN-DE | de_core_news_sm |
| Italian | it | Codice Fiscale | it_core_news_sm |
| Portuguese | pt | NIF, NIS | pt_core_news_sm |

Example
# Force a single language (skips auto-detect)
firewall = create_firewall("healthcare", language="es")

# Pre-warm spaCy models at startup — avoids lazy-load latency
firewall.preload_languages(["es", "en", "fr"])
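
The thread-level cache can be pictured as a dictionary keyed by thread_id that runs the detector only on the first request. The sketch below is an assumption about the general technique, not the library's code. ThreadLanguageCache and toy_detect are hypothetical names, and the toy detector is a crude stand-in for a real language detector.

```python
from typing import Callable, Dict


class ThreadLanguageCache:
    """Run the detector once per thread_id and reuse the result afterwards."""

    def __init__(self, detect: Callable[[str], str]) -> None:
        self._detect = detect
        self._by_thread: Dict[str, str] = {}
        self.detector_calls = 0

    def language_for(self, thread_id: str, text: str) -> str:
        if thread_id not in self._by_thread:
            self.detector_calls += 1
            self._by_thread[thread_id] = self._detect(text)
        return self._by_thread[thread_id]


def toy_detect(text: str) -> str:
    # Stand-in detector for the sketch; a real one would be statistical.
    return "es" if "ñ" in text or "ó" in text else "en"


cache = ThreadLanguageCache(toy_detect)
print(cache.language_for("consultation-1", "Ana García, 43 años"))   # es
print(cache.language_for("consultation-1", "Thanks, doctor"))        # es (cached)
print(cache.detector_calls)                                          # 1
```

One consequence of this design, at least in the sketch, is that a thread which switches language mid-conversation keeps the first detection. Forcing language explicitly avoids that ambiguity.
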
Section 11

API Reference

Key methods and return fields.

create_firewall()

| Parameter | Type | Default | Notes |
| --- | --- | --- | --- |
| domain | str | "generic" | "healthcare", "finance", "legal", "generic" |
| profile | DomainProfile | None | Custom profile; overrides domain. |
| language | str or None | None (auto) | "es", "en", "fr", "de", "it", "pt" |
| detector_backend | str | "regex" | "regex", "presidio", "hybrid", "gliner", "transformers", "opf", "nemotron" |
| vault | MappingVaultProtocol | InMemory | Pass SQLiteMappingVault for persistence. |
| custom_recognizers | list | [] | Presidio EntityRecognizer instances. |
| transformer_model_id | str | None | HF model ID (transformers backend). |
| transformer_device | int | -1 | GPU index. -1 = CPU. |

PrivacyFirewall methods

| Method | Returns | Description |
| --- | --- | --- |
| process(text, context) | ProcessResult | anonymize + call LLM + rehydrate. |
| secure_call(text, context, llm_client) | ProcessResult | Explicit LLM client version of process(). |
| secure_call_stream(text, context, llm_client) | Iterator[str] | Streaming; yields rehydrated tokens. |
| anonymize(text, context) | ProcessResult | Detect + replace only. No LLM call. |
| rehydrate(text, context) | str | Restore vault values in text. |
| forget(tenant_id, case_id, thread_id) | int | Delete vault scope. Returns deleted count. |
| add_custom_regex(...) | None | Register regex entity at runtime. |
| add_custom_pattern(EntityPattern) | None | Register full EntityPattern at runtime. |
| preload_languages(list) | None | Pre-warm spaCy models at startup. |

ProcessResult fields

| Field | Description |
| --- | --- |
| sanitized_text | Anonymized text sent to the LLM. |
| model_output | Raw LLM response (tokens not yet replaced). |
| final_text | Rehydrated LLM output with real values restored. |
| trace.detected_entities | All detected entities with type, span, confidence. |
| trace.entities_kept | Entities that received KEEP disposition. |
| trace.replacements | Applied substitutions: original → token. |
| trace.language | Detected or forced language code. |
| trace.cleanup_warnings | Residual PII warnings after cleanup passes. |