Implementation Plan

Milestone-sequenced build plan for the synthetic-personas library. Per the no-speculative-timelines rule, the plan is organized around gates, not dates.

Module placement

Code lives at scripts/synthetic-personas/ (top-level), mirroring Polaris's packages/scripts/personas/. It's a standalone TypeScript module, callable from:

- The backend test suite (via relative import).
- A CLI entrypoint: pnpm tsx scripts/synthetic-personas/cli.ts --seed=42 --size=500.
- Future seed scripts that want to populate a staging database with realistic-feeling fake families.

This placement keeps generator code out of the production runtime while leaving it importable wherever it's useful.
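As a rough illustration of the CLI entrypoint, here is a minimal sketch of flag parsing for the `--seed`, `--size`, and `--out` flags named in this plan. The `CliOptions` shape, the `parseCliArgs` name, and the strict "reject anything unrecognized" behavior are assumptions for illustration, not the actual contents of cli.ts.

```typescript
// Hypothetical option shape: --seed, --size and --out are the only flags the plan names.
interface CliOptions {
  seed: number;
  size: number;
  out: string | undefined;
}

// Parse `--key=value` flags; anything else is rejected rather than guessed at.
function parseCliArgs(argv: readonly string[]): CliOptions {
  const flags = new Map<string, string>();
  for (const arg of argv) {
    const match = /^--([a-z]+)=(.+)$/.exec(arg);
    if (match === null) {
      throw new Error(`Unrecognized argument: ${arg}`);
    }
    const [, key = "", value = ""] = match;
    flags.set(key, value);
  }

  // Read a required integer flag, failing loudly on missing or malformed input.
  const readInt = (name: string): number => {
    const raw = flags.get(name);
    const parsed = Number(raw);
    if (raw === undefined || !Number.isInteger(parsed)) {
      throw new Error(`--${name} must be an integer, got: ${String(raw)}`);
    }
    return parsed;
  };

  const size = readInt("size");
  if (size <= 0) {
    throw new Error("--size must be positive");
  }
  return { seed: readInt("seed"), size, out: flags.get("out") };
}
```

In a real entrypoint this would presumably be called with `process.argv.slice(2)`. Requiring an explicit integer seed keeps every run reproducible by construction.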

Gates

| Gate | Output | Exit criteria |
| --- | --- | --- |
| SP0 · Skeleton | `types.ts` + `data.ts` (catalogs only) + `generate.ts` shell + `index.ts` + `README.md` | Repo compiles. CLI prints "hello synthetic" with the seed echoed. |
| SP1 · Deterministic core | `mulberry32`, `weightedPick`, `pickRegion`, `pickFamilyArchetype`, basic `PatientSpec` generation. | A 100-family cohort generates and prints a JSON dump. Same seed → bit-identical output. |
| SP2 · Plausibility filters | Age-band-aware condition gating, vaccine-posture coherence, archetype × child-count plausibility, T1D ↔ insulin coemit. | All "structurally impossible" patterns absent from a 1,000-family cohort by inspection. |
| SP3 · Distribution invariants | `generate.test.ts` with the full invariant suite from Architecture. | Tests green on `pnpm test` against any seed in 1234. |
| SP4 · Cohort modes | `adversarialBias`, `newbornBias`, `ageBandWeights`, `conditionsAllow`, `archetypesAllow` knobs. | Each knob produces visibly skewed cohorts that still satisfy core invariants. |
| SP5 · Timeline derivation | `timeline.ts` that emits visit / vaccine / Rx / message-thread streams from a `FamilyPersonaSpec`. | A child has a coherent stream from birth to "today" — well-visits at proper cadence, vaccines at proper ages, sick visits scattered, Rx fills aligned to chronic conditions. |
| SP6 · LLM narrative texture | `llm.ts` — Claude-backed renderers for visit-note narratives, parent-app messages, document content. Conditioned on structured fields. | Sample outputs read as plausible clinical prose AND respect the structured fields (no LLM-invented diagnoses). |
| SP7 · Backend integration | Backend test fixtures import a fixed cohort. Seed script for staging environments. | A new engineer can `pnpm db:seed-synthetic --stage=dev` and have a realistic 500-family practice running locally. |
| SP8 · Adversarial cohort + escalation tests | Pre-built adversarial cohort that should trip every escalation path (acute emergency, abuse pattern, suicidality, drug interaction). | Each escalation path's test asserts the right alert fires for the right adversarial persona. |
| SP9 · Atlas-migration corpus | Synthetic incoming Atlas.MD export (mixed-quality PDFs, partial fields, free-text dump) for testing the Atlas migration epic. | The migration pipeline ingests and produces a clean Starlight chart. |
| Continuous | Cohort regenerated on schema change; invariants run on every CI build. | Drift detected immediately. |
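To make the SP1 "same seed → bit-identical output" criterion concrete, here is a minimal sketch of a seeded PRNG plus a weighted picker. `mulberry32` follows the widely circulated public-domain formulation. The `weightedPick` signature (an RNG plus `[item, weight]` pairs) is an assumption, since the plan names the function but not its shape.

```typescript
// mulberry32: a small 32-bit seeded PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let state = seed >>> 0; // normalize the seed to an unsigned 32-bit integer
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Assumed signature: pick one item with probability proportional to its weight.
// All randomness comes from the injected rng, so the result is seed-deterministic.
function weightedPick<T>(
  rng: () => number,
  entries: ReadonlyArray<readonly [T, number]>,
): T {
  let total = 0;
  for (const [, weight] of entries) {
    if (weight < 0) throw new Error("weights must be non-negative");
    total += weight;
  }
  if (total <= 0) throw new Error("at least one weight must be positive");

  let remaining = rng() * total;
  let fallback: T | undefined;
  for (const [item, weight] of entries) {
    if (weight === 0) continue; // zero-weight items are never eligible
    fallback = item;
    remaining -= weight;
    if (remaining < 0) return item;
  }
  // Only reachable through floating-point rounding at the upper edge.
  return fallback as T;
}
```

Passing the RNG in explicitly, rather than calling `Math.random()`, is what makes a whole cohort reproducible from one seed: every sampling helper draws from the same stream in the same order.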

SP0–SP3 are the body of this PR

This PR ships SP0 → SP3 so we have a working, deterministic, plausibility-checked, distribution-tested generator on day one. SP4 onward is follow-on work.

The reason: the load-bearing claim of synthetic personas is "engineers don't need real PHI to develop." That's true the moment SP3 is green. Everything after is improvement, not foundation.
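For a sense of what "SP3 is green" might assert, here is a small sketch of one structural check and one distribution check. The `PatientSketch` shape, the `"t1d"` and `"insulin"` slugs, and the helper names are all hypothetical; the actual invariant suite is the one defined in Architecture.

```typescript
// Hypothetical, pared-down shape; the real types live in types.ts per Architecture.
interface PatientSketch {
  ageYears: number;
  conditions: string[];
  medications: string[];
}

// SP2-style structural check: T1D and insulin appear together or not at all.
// The slugs "t1d" and "insulin" are assumed for illustration.
function violatesT1dInsulinCoemit(p: PatientSketch): boolean {
  return p.conditions.includes("t1d") !== p.medications.includes("insulin");
}

// SP3-style distribution check: the observed share of items matching a
// predicate must fall within `tolerance` of the `target` share.
function shareWithinTolerance<T>(
  items: readonly T[],
  predicate: (item: T) => boolean,
  target: number,
  tolerance: number,
): boolean {
  if (items.length === 0) return false;
  const share = items.filter(predicate).length / items.length;
  return Math.abs(share - target) <= tolerance;
}

// Collect human-readable violations so a failing test can report all of them.
function findViolations(cohort: readonly PatientSketch[]): string[] {
  const problems: string[] = [];
  cohort.forEach((p, index) => {
    if (violatesT1dInsulinCoemit(p)) {
      problems.push(`patient ${index}: T1D and insulin not co-emitted`);
    }
  });
  return problems;
}
```

A test file could then generate a cohort for each seed under test and assert that `findViolations` returns an empty array and that each distribution target holds within its tolerance.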

Dependencies on other workstreams

| Depends on | Why |
| --- | --- |
| Compliance · synthetic-data program (already documented) | This is its implementation. |
| Backend · `Family + Guardian + Patient` schema | The generator's `FamilyPersonaSpec` should project cleanly into the production schema. If the schema isn't yet final, the generator's types should be a superset that we narrow at projection time. |
| Backend · test-suite scaffolding | SP7 fixtures import here. |
| Future · Claude API integration in backend | SP6 reuses whatever Anthropic-SDK plumbing the AI scribe is using. |
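The "superset that we narrow at projection time" idea from the schema dependency can be sketched as follows. Every field name here is hypothetical: neither the generator's `FamilyPersonaSpec` nor the production `Family + Guardian + Patient` schema is specified in this plan, so treat the shapes as placeholders.

```typescript
// Hypothetical generator-side type: a superset carrying fields only the generator needs.
interface FamilyPersonaSketch {
  familyId: string;
  region: string;
  archetype: string; // generator-only: drives sampling, not stored in production
  seed: number; // generator-only: provenance for reproducibility
  guardians: { givenName: string; familyName: string; vaccinePosture: string }[];
}

// Assumed production-facing row shapes (placeholders, not the real schema).
interface FamilyRow {
  id: string;
  region: string;
}

interface GuardianRow {
  familyId: string;
  givenName: string;
  familyName: string;
}

// Narrow the superset by copying fields explicitly, so generator-only
// fields cannot leak into rows even if the spec type grows later.
function projectFamily(spec: FamilyPersonaSketch): {
  family: FamilyRow;
  guardians: GuardianRow[];
} {
  return {
    family: { id: spec.familyId, region: spec.region },
    guardians: spec.guardians.map((g) => ({
      familyId: spec.familyId,
      givenName: g.givenName,
      familyName: g.familyName,
    })),
  };
}
```

Copying field by field, rather than spreading the spec object, means a schema change surfaces as a compile error in one projection function instead of as stray columns at insert time.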

Out-of-scope for this workstream (deliberately)

- Synthetic medical imaging. X-ray / photo generation is a separate hard problem with diffusion models. Future system; not blocking anything in v1.
- Synthetic audio / TTS. The 6-clip ElevenLabs pattern in the launch-briefing prototype is hand-curated for now; full TTS pipeline is a follow-on.
- Production data anonymization. We never route real PHI through anonymization; we generate fresh. This generator is the first-class answer; anonymization is not on the roadmap.
- Human-in-the-loop persona authoring UI. A web admin to hand-craft a persona could be useful eventually but isn't required when the seed-deterministic CLI gives us the same reproducibility.

What the first PR ships

Docs at docs/docs/synthetic-personas/{intro,ideation,architecture,implementation-plan}.md — wired into the sidebar.
Code at scripts/synthetic-personas/:
- types.ts — full type system per Architecture.
- data.ts — catalogs (family archetypes, conditions, vaccine slugs, regions, names).
- generate.ts — mulberry32 PRNG, weightedPick, hierarchical sampling, plausibility filters, generateCohort().
- index.ts — public API.
- cli.ts — pnpm tsx scripts/synthetic-personas/cli.ts --seed=42 --size=500 --out=cohort.json.
- README.md.
- generate.test.ts — distribution + plausibility tests.

After merge, anyone on the team can run the CLI and inspect a cohort. The architecture is set; the rest is incremental fill-in.
