Implementation Plan

Milestone-sequenced build plan for the synthetic-personas library. Per the no-speculative-timelines rule, progress is tracked by gates, not dates.

Module placement

Code lives at scripts/synthetic-personas/ (top-level), mirroring Polaris's packages/scripts/personas/. It's a standalone TypeScript module, callable from:

  • The backend test suite (via relative import).
  • A CLI entrypoint: pnpm tsx scripts/synthetic-personas/cli.ts --seed=42 --size=500.
  • Future seed scripts that want to populate a staging database with realistic-feeling fake families.

This placement keeps generator code out of production runtime, yet importable wherever it's useful.
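
As an illustration of the CLI entrypoint, the flags shown above could be parsed along these lines. This is a minimal sketch, not the actual cli.ts: the `--seed`, `--size`, and `--out` flag names come from this plan, while `parseCliArgs`, `CliOptions`, the default values, and the error messages are assumptions made for the example.

```typescript
// Hypothetical option shape; the real CLI may accept more flags.
interface CliOptions {
  seed: number;
  size: number;
  out?: string;
}

// Parses `--key=value` arguments. Defaults are assumed for illustration.
function parseCliArgs(argv: string[]): CliOptions {
  const opts: CliOptions = { seed: 1, size: 100 };
  for (const arg of argv) {
    const match = /^--(seed|size|out)=(.+)$/.exec(arg);
    if (!match) throw new Error(`Unrecognized argument: ${arg}`);
    const [, key, value] = match;
    if (key === "out") {
      opts.out = value;
      continue;
    }
    // seed and size must be non-negative integers so runs are reproducible.
    const n = Number(value);
    if (!Number.isInteger(n) || n < 0) {
      throw new Error(`--${key} must be a non-negative integer, got "${value}"`);
    }
    if (key === "seed") opts.seed = n;
    else opts.size = n;
  }
  return opts;
}

// In cli.ts this would typically receive process.argv.slice(2).
console.log(parseCliArgs(["--seed=42", "--size=500"]));
```

Keeping the parser a pure function of its argument array means the backend test suite can exercise it through the same relative import as the generator itself.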

Gates

| Gate | Output | Exit criteria |
| --- | --- | --- |
| SP0 · Skeleton | types.ts + data.ts (catalogs only) + generate.ts shell + index.ts + README.md | Repo compiles. CLI prints "hello synthetic" with the seed echoed. |
| SP1 · Deterministic core | mulberry32, weightedPick, pickRegion, pickFamilyArchetype, basic PatientSpec generation. | A 100-family cohort generates and prints a JSON dump. Same seed → bit-identical output. |
| SP2 · Plausibility filters | Age-band-aware condition gating, vaccine-posture coherence, archetype × child-count plausibility, T1D ↔ insulin coemit. | All "structurally impossible" patterns absent from a 1,000-family cohort by inspection. |
| SP3 · Distribution invariants | generate.test.ts with the full invariant suite from Architecture. | Tests green on pnpm test against any seed in 1234. |
| SP4 · Cohort modes | adversarialBias, newbornBias, ageBandWeights, conditionsAllow, archetypesAllow knobs. | Each knob produces visibly skewed cohorts that still satisfy core invariants. |
| SP5 · Timeline derivation | timeline.ts that emits visit / vaccine / Rx / message-thread streams from a FamilyPersonaSpec. | A child has a coherent stream from birth to "today" — well-visits at proper cadence, vaccines at proper ages, sick visits scattered, Rx fills aligned to chronic conditions. |
| SP6 · LLM narrative texture | llm.ts — Claude-backed renderers for visit-note narratives, parent-app messages, document content. Conditioned on structured fields. | Sample outputs read as plausible clinical prose AND respect the structured fields (no LLM-invented diagnoses). |
| SP7 · Backend integration | Backend test fixtures import a fixed cohort. Seed script for staging environments. | A new engineer can pnpm db:seed-synthetic --stage=dev and have a realistic 500-family practice running locally. |
| SP8 · Adversarial cohort + escalation tests | Pre-built adversarial cohort that should trip every escalation path (acute emergency, abuse pattern, suicidality, drug interaction). | Each escalation path's test asserts the right alert fires for the right adversarial persona. |
| SP9 · Atlas-migration corpus | Synthetic incoming Atlas.MD export (mixed-quality PDFs, partial fields, free-text dump) for testing the Atlas migration epic. | The migration pipeline ingests and produces a clean Starlight chart. |
| Continuous | Cohort regenerated on schema change; invariants run on every CI build. | Drift detected immediately. |
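
To make the SP1 exit criterion concrete, here is a minimal sketch of a seeded PRNG driving a weighted pick. `mulberry32` below follows the widely circulated public-domain formulation of that algorithm. The `weightedPick` signature, the `Weighted` type, and the region catalog are assumptions for illustration; the real generate.ts and data.ts may differ.

```typescript
// mulberry32: a small 32-bit seeded PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical catalog entry shape.
interface Weighted<T> {
  value: T;
  weight: number;
}

// Picks one value with probability proportional to its weight.
function weightedPick<T>(rng: () => number, items: readonly Weighted<T>[]): T {
  if (items.length === 0) throw new Error("weightedPick: empty catalog");
  const total = items.reduce((sum, item) => sum + item.weight, 0);
  let roll = rng() * total;
  for (const item of items) {
    roll -= item.weight;
    if (roll < 0) return item.value;
  }
  // Fallback for floating-point rounding at the upper edge.
  return items[items.length - 1].value;
}

// Made-up region catalog, for illustration only.
const REGIONS: Weighted<string>[] = [
  { value: "urban", weight: 5 },
  { value: "suburban", weight: 3 },
  { value: "rural", weight: 2 },
];

function sampleRegions(seed: number, n: number): string[] {
  const rng = mulberry32(seed);
  return Array.from({ length: n }, () => weightedPick(rng, REGIONS));
}

// Same seed, same sequence: the two dumps are identical strings.
console.log(
  JSON.stringify(sampleRegions(42, 100)) === JSON.stringify(sampleRegions(42, 100)),
);
```

Every random decision draws from the one `rng` closure, so reproducibility depends only on the seed and the order of draws. Calling `Math.random()` anywhere in the generator would break the bit-identical guarantee.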

SP0–SP3 are the body of this PR

This PR ships SP0 → SP3 so we have a working, deterministic, plausibility-checked, distribution-tested generator on day one. SP4 onward is follow-on work.

The reason: the load-bearing claim of synthetic personas is "engineers don't need real PHI to develop." That's true the moment SP3 is green. Everything after is improvement, not foundation.
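
The SP2 filters and SP3 invariants can be pictured as a repair step paired with a checker over the finished cohort. The sketch below is illustrative only: `PatientSketch`, the string slugs, and `findViolations` are hypothetical stand-ins for the real types in types.ts, and the minimum ages are placeholder numbers, not clinical guidance.

```typescript
// Hypothetical, simplified patient shape.
interface PatientSketch {
  ageMonths: number;
  conditions: string[];
  medications: string[];
}

const T1D = "type-1-diabetes";
const INSULIN = "insulin";

// Placeholder minimum ages in months (assumed values, illustration only).
const MIN_AGE_MONTHS: Record<string, number> = {
  [T1D]: 6,
  adhd: 48,
};

// Repairs one patient: drops age-implausible conditions, then enforces
// the T1D <-> insulin coemit rule in both directions.
function applyPlausibilityFilters(p: PatientSketch): PatientSketch {
  const conditions = p.conditions.filter(
    (c) => p.ageMonths >= (MIN_AGE_MONTHS[c] ?? 0),
  );
  const medications = p.medications.filter((m) => m !== INSULIN);
  if (conditions.includes(T1D)) medications.push(INSULIN);
  return { ...p, conditions, medications };
}

// Invariant check over a cohort; an empty result means no violations.
function findViolations(cohort: PatientSketch[]): string[] {
  const violations: string[] = [];
  cohort.forEach((p, i) => {
    const hasT1D = p.conditions.includes(T1D);
    const hasInsulin = p.medications.includes(INSULIN);
    if (hasT1D !== hasInsulin) {
      violations.push(`patient ${i}: T1D/insulin mismatch`);
    }
    for (const c of p.conditions) {
      if (p.ageMonths < (MIN_AGE_MONTHS[c] ?? 0)) {
        violations.push(`patient ${i}: ${c} below minimum age`);
      }
    }
  });
  return violations;
}

const raw: PatientSketch[] = [
  { ageMonths: 12, conditions: ["adhd"], medications: [INSULIN] },
  { ageMonths: 120, conditions: [T1D], medications: [] },
];
const filtered = raw.map(applyPlausibilityFilters);
console.log(findViolations(raw), findViolations(filtered));
```

Keeping the checker separate from the filter means a test can run it over cohorts from many seeds and fail on any non-empty result.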

Dependencies on other workstreams

| Depends on | Why |
| --- | --- |
| Compliance · synthetic-data program (already documented) | This is its implementation. |
| Backend · Family + Guardian + Patient schema | The generator's FamilyPersonaSpec should project cleanly into the production schema. If the schema isn't yet final, the generator's types should be a superset that we narrow at projection time. |
| Backend · test-suite scaffolding | SP7 fixtures import here. |
| Future · Claude API integration in backend | SP6 reuses whatever Anthropic-SDK plumbing the AI scribe is using. |
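
The "superset narrowed at projection time" idea from the schema dependency might look like the sketch below. Every name here (`FamilyPersonaSketch`, `FamilyRow`, `projectFamily`, and all fields) is hypothetical, since the production schema is not described in this plan. The sketch only shows the pattern of copying the persisted fields and dropping generation-only metadata.

```typescript
// Generator-side type: a superset that carries generation-only metadata.
interface FamilyPersonaSketch {
  familyId: string;
  surname: string;
  region: string;
  archetype: string; // generation-only (assumed not stored in production)
  seed: number; // generation-only provenance
}

// Hypothetical production row shape.
interface FamilyRow {
  id: string;
  surname: string;
  region: string;
}

// Narrowing happens in one place: only fields the schema knows are copied.
function projectFamily(spec: FamilyPersonaSketch): FamilyRow {
  return { id: spec.familyId, surname: spec.surname, region: spec.region };
}

const persona: FamilyPersonaSketch = {
  familyId: "fam-0001",
  surname: "Example",
  region: "urban",
  archetype: "two-parent",
  seed: 42,
};
console.log(projectFamily(persona));
```

Building the row field by field, rather than spreading `spec`, keeps generation-only fields from leaking into inserts if the generator type grows.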

Out-of-scope for this workstream (deliberately)

  • Synthetic medical imaging. X-ray / photo generation is a separate hard problem with diffusion models. Future system; not blocking anything in v1.
  • Synthetic audio / TTS. The 6-clip ElevenLabs pattern in the launch-briefing prototype is hand-curated for now; full TTS pipeline is a follow-on.
  • Production data anonymization. We never route real PHI through anonymization; we generate fresh. This generator is the first-class answer; anonymization is not on the roadmap.
  • Human-in-the-loop persona authoring UI. A web admin to hand-craft a persona could be useful eventually but isn't required when the seed-deterministic CLI gives us the same reproducibility.

What the first PR ships

  • Docs at docs/docs/synthetic-personas/{intro,ideation,architecture,implementation-plan}.md — wired into the sidebar.
  • Code at scripts/synthetic-personas/:
    • types.ts — full type system per Architecture.
    • data.ts — catalogs (family archetypes, conditions, vaccine slugs, regions, names).
    • generate.ts — mulberry32 PRNG, weightedPick, hierarchical sampling, plausibility filters, generateCohort().
    • index.ts — public API.
    • cli.ts — pnpm tsx scripts/synthetic-personas/cli.ts --seed=42 --size=500 --out=cohort.json.
    • README.md.
    • generate.test.ts — distribution + plausibility tests.

After merge, anyone on the team can run the CLI and inspect a cohort. The architecture is set; the rest is incremental fill-in.