Synthetic Personas
A first-class library that generates synthetic pediatric patients, parents, families, conditions, visits, and clinical artifacts for Starlight Practice. Used for HIPAA-safe development, tests, demos, AI evaluation, and (eventually) seeding production-like staging environments.
Why this exists
Three reinforcing reasons:
- HIPAA safety. Engineers should never need real patient PHI to develop, debug, or demo Starlight. The Compliance · Synthetic Data Program lists this as a gating principle.
- Test coverage on edge cases. Real patient datasets cluster around the mode (a 7-year-old with a viral cough is overrepresented; an unusual pediatric case is underrepresented). A generator can deliberately produce the long tail.
- Engine reuse. The case-generation infrastructure built here also powers the Diagnostic Game, CME Delivery, and the Parent Triage feature's synthetic conversation pipeline. Build it once, ship it three times.
Engineering inspiration
The pattern is adapted from the Polaris synthetic-IT-buyer / Solutions Explorer system (packages/scripts/personas/ in the polaris repo). Polaris's approach proved that a deterministic, hierarchically-sampled, distribution-tested persona library is feasible and useful:
- Deterministic PRNG (mulberry32) + cohort seeds → reproducible cohorts.
- Hierarchical sampling (industry → size → archetype) with plausibility filters at each step → no impossible combinations like "CIO at solopreneur."
- Population-weighted region sampling + industry-region affinity multipliers → realistic geographic distribution.
- Rich Slug enums project down to coarse Facet enums for downstream consumers.
- Distribution invariants tested in CI (no industry > 15%, every census region represented, etc.).
- Sentiment + voice + backstory fields for an LLM harness that runs the persona through a session.
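As a rough illustration of the first two bullets, the sketch below pairs a mulberry32 generator with a three-level weighted sampler that filters out implausible options before each draw. It is not the Polaris code: the category names, weights, and the helpers pickWeighted, samplePersona, and generateCohort are all hypothetical, chosen only to show the shape of the technique.

```typescript
// Sketch only: categories, weights, and helper names are invented for illustration.

/** mulberry32: a small seeded 32-bit PRNG that returns floats in [0, 1). */
function mulberry32(seed: number): () => number {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

type Weighted<T> = ReadonlyArray<readonly [T, number]>;

/** Draws one value with probability proportional to its weight. */
function pickWeighted<T>(rng: () => number, options: Weighted<T>): T {
  if (options.length === 0) throw new Error("no plausible options left");
  const total = options.reduce((sum, [, w]) => sum + w, 0);
  let r = rng() * total;
  let chosen: T | undefined;
  for (const [value, weight] of options) {
    chosen = value; // the last option doubles as a floating-point rounding fallback
    r -= weight;
    if (r < 0) break;
  }
  return chosen as T;
}

type Size = "solopreneur" | "mid-market" | "enterprise";
type Archetype = "owner-operator" | "it-manager" | "cio";
interface Persona { industry: string; size: Size; archetype: Archetype }

const INDUSTRIES: Weighted<string> = [["healthcare", 3], ["retail", 2], ["manufacturing", 1]];
const SIZES: Weighted<Size> = [["solopreneur", 5], ["mid-market", 3], ["enterprise", 2]];
const ARCHETYPES: Weighted<Archetype> = [["owner-operator", 4], ["it-manager", 3], ["cio", 1]];

/** Plausibility filter: rules out combinations like "CIO at solopreneur". */
const isPlausible = (size: Size, archetype: Archetype): boolean =>
  !(size === "solopreneur" && archetype === "cio");

/** Samples top-down; each level only sees options that survive the filter. */
function samplePersona(rng: () => number): Persona {
  const industry = pickWeighted(rng, INDUSTRIES);
  const size = pickWeighted(rng, SIZES);
  const archetype = pickWeighted(rng, ARCHETYPES.filter(([a]) => isPlausible(size, a)));
  return { industry, size, archetype };
}

/** One seed drives one PRNG stream, so a cohort is reproducible from (seed, size). */
function generateCohort(seed: number, size: number): Persona[] {
  const rng = mulberry32(seed);
  return Array.from({ length: size }, () => samplePersona(rng));
}
```

Filtering before the draw, rather than rejecting and retrying afterwards, keeps the number of PRNG calls per persona fixed, which makes reproducibility easier to reason about.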
Starlight adapts the same skeleton to a different domain: pediatric patients within families, with age-driven plausibility, condition profiles, family structures, and clinically-coherent comorbidities.
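One way age-driven plausibility could look in the pediatric domain is sketched below: each condition profile carries an age window, and sampling only considers profiles whose window contains the child's age. The ConditionProfile shape, the slugs, the age windows, and the weights are placeholders invented for this example; they are not clinical guidance and are not taken from the Starlight type system.

```typescript
// Sketch only: profile shape, slugs, age windows, and weights are hypothetical placeholders.

interface ConditionProfile {
  slug: string;
  minAgeMonths: number; // inclusive
  maxAgeMonths: number; // inclusive
  weight: number;       // relative sampling weight among plausible profiles
}

const EXAMPLE_PROFILES: readonly ConditionProfile[] = [
  { slug: "example-infant-only", minAgeMonths: 0, maxAgeMonths: 12, weight: 1 },
  { slug: "example-any-age", minAgeMonths: 0, maxAgeMonths: 215, weight: 3 },
  { slug: "example-adolescent-only", minAgeMonths: 144, maxAgeMonths: 215, weight: 1 },
];

/** Keeps only the profiles whose age window contains the child's age. */
function plausibleConditions(
  ageMonths: number,
  profiles: readonly ConditionProfile[],
): ConditionProfile[] {
  return profiles.filter((p) => ageMonths >= p.minAgeMonths && ageMonths <= p.maxAgeMonths);
}

/**
 * Weighted draw restricted to age-plausible profiles.
 * `rng` is any function returning floats in [0, 1), e.g. a seeded mulberry32 stream.
 * Returns undefined when no profile is plausible at that age.
 */
function sampleCondition(
  rng: () => number,
  ageMonths: number,
  profiles: readonly ConditionProfile[],
): ConditionProfile | undefined {
  const candidates = plausibleConditions(ageMonths, profiles);
  const total = candidates.reduce((sum, p) => sum + p.weight, 0);
  let r = rng() * total;
  let chosen: ConditionProfile | undefined;
  for (const p of candidates) {
    chosen = p; // the last candidate doubles as a rounding fallback
    r -= p.weight;
    if (r < 0) break;
  }
  return chosen;
}
```

Passing the generator in as a parameter keeps the filter testable with a stub such as `() => 0`, independent of whichever seeded PRNG the real library uses.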
What's in this section
📄️ Ideation
Posture: generate-many-then-prune. This page is intentionally over-imagined. Some of it is good. Some of it is wrong. The goal is to enumerate what could be before we anchor on what we'll build. Architecture and Implementation pages are the soberer drafts.
📄️ Architecture
The sober draft. Pruned from the Ideation page; targeted at "what we'll actually build." The pattern follows Polaris's packages/scripts/personas/ very closely; the domain mapping is what makes it ours.
📄️ Implementation Plan
Milestone-sequenced build plan for the synthetic-personas library. Per the no-speculative-timelines rule, milestones are defined by gates, not dates.
📄️ Sample Cohort
Reproducible verification artifact. Generated by walking a single deterministic cohort (seed=42, size=500) and picking the first child matching each of twenty archetypal predicates. Re-runnable via:
📄️ Verification
This page documents the end-to-end verification we ran against the generator, both as the audit trail for shipping SP0–SP3 and as a reproducible recipe anyone on the team can re-run.
What's in code
The generator code lives at scripts/synthetic-personas/ in this repo and is callable from the backend's test suite and seed scripts. See the Implementation Plan page for the full module layout and the Architecture page for the type system.
Quick start
Three pnpm scripts at the repo root:
# Pretty-printed cohort summary
pnpm synth -- --seed=42 --size=200
# Tiny HTTP server (zero new deps) for curl-driven exploration
pnpm synth:serve # then: curl http://127.0.0.1:7777/health
# Regenerate the published 20-sample verification page
pnpm synth:samples > docs/docs/synthetic-personas/sample-cohort.md
Verification
The Sample Cohort page is a reproducible verification artifact: 20 archetypal pediatric patients (newborn → adolescent, all major coemission rules, Spanish-primary single-mother, religious-exemption refuser, foster family with CPS history, adversarial adolescent with self-harm history, sibling pair with diverging conditions, etc.). Same seed → bit-identical output, so anyone on the team can re-run pnpm synth:samples and get the same artifact.
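The "first child matching each predicate" selection and the same-seed check can be sketched roughly as follows. Everything here is an assumption made for illustration: the Child shape, the two example predicates, and especially standInCohort, which is a trivial arithmetic stand-in for the real generator so that the example stays self-contained.

```typescript
// Sketch only: Child, the predicates, and standInCohort are hypothetical stand-ins.

interface Child { id: string; ageMonths: number; primaryLanguage: string }
interface Predicate { label: string; matches: (child: Child) => boolean }
interface Pick { label: string; child: Child | undefined }

/** Deterministic stand-in for the real generator: a pure function of (seed, size). */
function standInCohort(seed: number, size: number): Child[] {
  return Array.from({ length: size }, (_, i) => ({
    id: `child-${i}`,
    ageMonths: (seed * 7 + i * 13) % 216,
    primaryLanguage: i % 5 === 0 ? "es" : "en",
  }));
}

/** Walks the cohort in order and keeps the first child satisfying each predicate. */
function pickFirstMatches(cohort: readonly Child[], predicates: readonly Predicate[]): Pick[] {
  return predicates.map((p) => ({ label: p.label, child: cohort.find((c) => p.matches(c)) }));
}

/** Renders the picks as text so two runs can be compared character by character. */
function renderArtifact(picks: readonly Pick[]): string {
  return picks
    .map((p) => `${p.label}: ${p.child ? `${p.child.id} (${p.child.ageMonths} mo)` : "(no match)"}`)
    .join("\n");
}

const EXAMPLE_PREDICATES: readonly Predicate[] = [
  { label: "newborn", matches: (c) => c.ageMonths < 1 },
  { label: "spanish-primary adolescent", matches: (c) => c.primaryLanguage === "es" && c.ageMonths >= 144 },
];

// Two independent runs from the same seed should render identical artifacts.
const firstRun = renderArtifact(pickFirstMatches(standInCohort(42, 500), EXAMPLE_PREDICATES));
const secondRun = renderArtifact(pickFirstMatches(standInCohort(42, 500), EXAMPLE_PREDICATES));
console.log(firstRun === secondRun ? "artifacts match" : "artifacts differ");
```

Because selection depends only on cohort order and the predicates, any nondeterminism in the generator shows up as a diff in the rendered artifact.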
Cross-references
- Compliance · Synthetic Data Program — the regulatory anchor.
- Auxiliary · Diagnostic Game and CME Delivery — sibling consumers of this engine.
- Horizon · Parent Triage — needs this for safe testing of conversational flows.
- Launch Briefing · Hardcoded Data Models — the patient-roster shape the generator targets.