Implementation Plan
Milestone-sequenced build plan for the synthetic-personas library. Per the no-speculative-timelines rule, the plan is sequenced by gates, not dates.
Module placement
Code lives at scripts/synthetic-personas/ (top-level), mirroring Polaris's packages/scripts/personas/. It's a standalone TypeScript module, callable from:
- The backend test suite (via relative import).
- A CLI entrypoint: pnpm tsx scripts/synthetic-personas/cli.ts --seed=42 --size=500
- Future seed scripts that want to populate a staging database with realistic-feeling fake families.
This placement keeps generator code out of production runtime, yet importable wherever it's useful.
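
As a rough illustration of the CLI entrypoint, the flags shown above could be parsed along these lines. This is a minimal sketch, not the actual cli.ts: the CliOptions shape, the parseCliArgs name, and the default values are assumptions made for the example. Only the --seed, --size, and --out flags come from this plan.

```typescript
// Hypothetical option shape; the real cli.ts may differ.
interface CliOptions {
  seed: number;
  size: number;
  out: string | undefined;
}

// Parse `--key=value` flags from an argv slice such as process.argv.slice(2).
function parseCliArgs(argv: string[]): CliOptions {
  const flags = new Map<string, string>();
  for (const arg of argv) {
    const eq = arg.indexOf("=");
    // Require the form --name=value with a non-empty name.
    if (!arg.startsWith("--") || eq < 3) {
      throw new Error(`Unrecognized argument: ${arg}`);
    }
    flags.set(arg.slice(2, eq), arg.slice(eq + 1));
  }

  // Read a non-negative integer flag, falling back to an assumed default.
  const toInt = (name: string, fallback: number): number => {
    const raw = flags.get(name);
    if (raw === undefined) return fallback;
    const n = Number(raw);
    if (raw.trim() === "" || !Number.isInteger(n) || n < 0) {
      throw new Error(`--${name} must be a non-negative integer, got "${raw}"`);
    }
    return n;
  };

  return {
    seed: toInt("seed", 42), // assumed default
    size: toInt("size", 500), // assumed default
    out: flags.get("out"),
  };
}
```

Keeping the parser a pure function of its argv slice means the backend test suite can exercise it without spawning a process.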
Gates
| Gate | Output | Exit criteria |
|---|---|---|
| SP0 · Skeleton | types.ts + data.ts (catalogs only) + generate.ts shell + index.ts + README.md | Repo compiles. CLI prints "hello synthetic" with the seed echoed. |
| SP1 · Deterministic core | mulberry32, weightedPick, pickRegion, pickFamilyArchetype, basic PatientSpec generation. | A 100-family cohort generates and prints a JSON dump. Same seed → bit-identical output. |
| SP2 · Plausibility filters | Age-band-aware condition gating, vaccine-posture coherence, archetype × child-count plausibility, T1D ↔ insulin coemit. | All "structurally impossible" patterns absent from a 1,000-family cohort by inspection. |
| SP3 · Distribution invariants | generate.test.ts with the full invariant suite from Architecture. | Tests green on pnpm test against any seed in 1234. |
| SP4 · Cohort modes | adversarialBias, newbornBias, ageBandWeights, conditionsAllow, archetypesAllow knobs. | Each knob produces visibly skewed cohorts that still satisfy core invariants. |
| SP5 · Timeline derivation | timeline.ts that emits visit / vaccine / Rx / message-thread streams from a FamilyPersonaSpec. | A child has a coherent stream from birth to "today" — well-visits at proper cadence, vaccines at proper ages, sick visits scattered, Rx fills aligned to chronic conditions. |
| SP6 · LLM narrative texture | llm.ts — Claude-backed renderers for visit-note narratives, parent-app messages, document content. Conditioned on structured fields. | Sample outputs read as plausible clinical prose AND respect the structured fields (no LLM-invented diagnoses). |
| SP7 · Backend integration | Backend test fixtures import a fixed cohort. Seed script for staging environments. | A new engineer can run pnpm db:seed-synthetic --stage=dev and have a realistic 500-family practice running locally. |
| SP8 · Adversarial cohort + escalation tests | Pre-built adversarial cohort that should trip every escalation path (acute emergency, abuse pattern, suicidality, drug interaction). | Each escalation path's test asserts the right alert fires for the right adversarial persona. |
| SP9 · Atlas-migration corpus | Synthetic incoming Atlas.MD export (mixed-quality PDFs, partial fields, free-text dump) for testing the Atlas migration epic. | The migration pipeline ingests and produces a clean Starlight chart. |
| Continuous | Cohort regenerated on schema change; invariants run on every CI build. | Drift detected immediately. |
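
The SP1 exit criterion (same seed, bit-identical output) follows from routing every random choice through one seeded PRNG. The sketch below uses the widely circulated mulberry32 routine and a simple cumulative-weight picker. The weightedPick signature, the REGION_WEIGHTS catalog, and the sampleRegions helper are assumptions for illustration, not the library's actual API or data.

```typescript
type Rng = () => number;

// mulberry32: a small 32-bit seeded PRNG returning floats in [0, 1).
function mulberry32(seed: number): Rng {
  let state = seed >>> 0;
  return () => {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Pick one value with probability proportional to its weight.
// Signature is hypothetical; the real weightedPick may differ.
function weightedPick<T>(rng: Rng, entries: ReadonlyArray<readonly [T, number]>): T {
  const total = entries.reduce((sum, [, weight]) => sum + weight, 0);
  if (entries.length === 0 || total <= 0) {
    throw new Error("weightedPick needs at least one entry with positive total weight");
  }
  let remaining = rng() * total;
  let last: T | undefined;
  for (const [value, weight] of entries) {
    last = value;
    remaining -= weight;
    if (remaining < 0) return value;
  }
  return last as T; // guard against floating-point rounding at the upper edge
}

// Illustrative catalog; names and weights are placeholders, not real data.
const REGION_WEIGHTS: ReadonlyArray<readonly [string, number]> = [
  ["northeast", 3],
  ["south", 5],
  ["west", 2],
];

// Draw n regions from a fresh generator seeded with `seed`.
function sampleRegions(seed: number, n: number): string[] {
  const rng = mulberry32(seed);
  return Array.from({ length: n }, () => weightedPick(rng, REGION_WEIGHTS));
}
```

Determinism holds only if the generator never consults Math.random, the clock, or unordered iteration; calling sampleRegions(42, 100) twice should yield identical arrays.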
SP0–SP3 are the body of this PR
This PR ships SP0 → SP3 so we have a working, deterministic, plausibility-checked, distribution-tested generator on day one. SP4 onward is follow-on work.
The reason: the load-bearing claim of synthetic personas is "engineers don't need real PHI to develop." That's true the moment SP3 is green. Everything after is improvement, not foundation.
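
To make "SP3 is green" concrete, an invariant check can be written as a pure function that scans a cohort and reports every structural violation. The sketch below covers two of the SP2 rules (T1D and insulin co-emission, age-gated conditions). The PatientLike shape, the slugs, and the age threshold in the usage example are placeholders invented for illustration, not the library's types or clinical guidance, and the T1D rule is read here as bidirectional.

```typescript
// Hypothetical, simplified patient shape for the example.
interface PatientLike {
  ageYears: number;
  conditions: string[];
  medications: string[];
}

// Return a human-readable message for each violated invariant.
function findViolations(
  patients: PatientLike[],
  minAgeByCondition: Record<string, number>, // caller-supplied gating table
): string[] {
  const violations: string[] = [];
  patients.forEach((patient, index) => {
    // Assumed reading of "T1D <-> insulin": each implies the other.
    const hasT1d = patient.conditions.includes("t1d");
    const hasInsulin = patient.medications.includes("insulin");
    if (hasT1d !== hasInsulin) {
      violations.push(`patient ${index}: t1d and insulin must co-occur`);
    }
    // Age-band gating: a condition may not appear below its minimum age.
    for (const condition of patient.conditions) {
      const minAge = minAgeByCondition[condition];
      if (minAge !== undefined && patient.ageYears < minAge) {
        violations.push(`patient ${index}: ${condition} below minimum age ${minAge}`);
      }
    }
  });
  return violations;
}

// Usage with hand-written records; the threshold is a placeholder value.
const exampleGating = { "example-condition": 5 };
const exampleCohort: PatientLike[] = [
  { ageYears: 9, conditions: ["t1d"], medications: ["insulin"] },
  { ageYears: 2, conditions: ["example-condition"], medications: [] },
  { ageYears: 7, conditions: ["t1d"], medications: [] },
];
const exampleViolations = findViolations(exampleCohort, exampleGating);
```

A test in generate.test.ts could then assert that the returned list is empty for a generated cohort, which reports every offending persona rather than only the first failure.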
Dependencies on other workstreams
| Depends on | Why |
|---|---|
| Compliance · synthetic-data program (already documented) | This is its implementation. |
| Backend · Family + Guardian + Patient schema | The generator's FamilyPersonaSpec should project cleanly into the production schema. If the schema isn't yet final, the generator's types should be a superset that we narrow at projection time. |
| Backend · test-suite scaffolding | SP7 fixtures import here. |
| Future · Claude API integration in backend | SP6 reuses whatever Anthropic-SDK plumbing the AI scribe is using. |
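
The "superset that we narrow at projection time" idea from the schema dependency can be sketched as follows. All field names, both interfaces, and the projectPatient helper are hypothetical stand-ins; the real FamilyPersonaSpec and production schema are defined elsewhere.

```typescript
// Hypothetical generator-side type: carries extra fields the schema may not have.
interface GeneratedPatient {
  id: string;
  givenName: string;
  birthDate: string; // ISO date string, assumed for the example
  archetypeHint: string; // generator-only metadata
  plausibilityTrace: string[]; // generator-only debugging aid
}

// Hypothetical production row: a strict subset of the generated fields.
interface PatientRow {
  id: string;
  givenName: string;
  birthDate: string;
}

// Compile-time guard: fails to type-check if PatientRow gains a field
// that GeneratedPatient cannot supply.
type IsSuperset = GeneratedPatient extends PatientRow ? true : never;
const supersetCheck: IsSuperset = true;

// Narrow explicitly so generator-only fields never reach the database layer.
function projectPatient(patient: GeneratedPatient): PatientRow {
  const { id, givenName, birthDate } = patient;
  return { id, givenName, birthDate };
}

const projectedExample = projectPatient({
  id: "p-001",
  givenName: "Sample",
  birthDate: "2020-01-15",
  archetypeHint: "two-parent",
  plausibilityTrace: ["age-band ok"],
});
```

Copying fields by name, rather than passing the wider object through, keeps the projection the single place to update when the production schema is finalized.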
Out-of-scope for this workstream (deliberately)
- Synthetic medical imaging. X-ray / photo generation is a separate hard problem with diffusion models. Future system; not blocking anything in v1.
- Synthetic audio / TTS. The 6-clip ElevenLabs pattern in the launch-briefing prototype is hand-curated for now; full TTS pipeline is a follow-on.
- Production data anonymization. We never route real PHI through anonymization; we generate fresh. This generator is the first-class answer; anonymization is not on the roadmap.
- Human-in-the-loop persona authoring UI. A web admin to hand-craft a persona could be useful eventually but isn't required when the seed-deterministic CLI gives us the same reproducibility.
What the first PR ships
- Docs at docs/docs/synthetic-personas/{intro,ideation,architecture,implementation-plan}.md, wired into the sidebar.
- Code at scripts/synthetic-personas/:
  - types.ts: full type system per Architecture.
  - data.ts: catalogs (family archetypes, conditions, vaccine slugs, regions, names).
  - generate.ts: mulberry32 PRNG, weightedPick, hierarchical sampling, plausibility filters, generateCohort().
  - index.ts: public API.
  - cli.ts: pnpm tsx scripts/synthetic-personas/cli.ts --seed=42 --size=500 --out=cohort.json
  - README.md.
  - generate.test.ts: distribution + plausibility tests.
After merge, anyone on the team can run the CLI and inspect a cohort. The architecture is set; the rest is incremental fill-in.