Synthetic Personas

A first-class library that generates synthetic pediatric patients, parents, families, conditions, visits, and clinical artifacts for Starlight Practice. It is used for HIPAA-safe development, tests, demos, and AI evaluation, and (eventually) for seeding production-like staging environments.

Why this exists

Three reinforcing reasons:

  1. HIPAA safety. Engineers should never need real patient PHI to develop, debug, or demo Starlight. The Compliance · Synthetic Data Program lists this as a gating principle.
  2. Test coverage on edge cases. Real patient datasets cluster around the mode (a 7-year-old with a viral cough is overrepresented; an unusual pediatric case is underrepresented). A generator can deliberately produce the long tail.
  3. Engine reuse. The case-generation infrastructure built here also powers the Diagnostic Game, CME Delivery, and the Parent Triage feature's synthetic conversation pipeline. Build it once, ship it three times.

Engineering inspiration

The pattern is adapted from the Polaris synthetic-IT-buyer / Solutions Explorer system (packages/scripts/personas/ in the polaris repo). Polaris's approach showed that a deterministic, hierarchically sampled, distribution-tested persona library is feasible and useful:

  • Deterministic PRNG (mulberry32) + cohort seeds → reproducible cohorts.
  • Hierarchical sampling (industry → size → archetype) with plausibility filters at each step → no impossible combinations like "CIO at solopreneur."
  • Population-weighted region sampling + industry-region affinity multipliers → realistic geographic distribution.
  • Rich Slug enums project down to coarse Facet enums for downstream consumers.
  • Distribution invariants tested in CI (no industry > 15%, every census region represented, etc.).
  • Sentiment + voice + backstory fields for an LLM harness that runs the persona through a session.
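
The deterministic-PRNG and hierarchical-sampling bullets above can be sketched in a few lines of TypeScript. This is a minimal illustration, not the Polaris code: mulberry32 follows the widely circulated public-domain implementation, while pickWeighted, plausible, generateCohort, and every enum value and weight are hypothetical names and numbers chosen for the example.

```typescript
type Rng = () => number;

// mulberry32: small 32-bit PRNG; the same seed always yields the same stream.
function mulberry32(seed: number): Rng {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // float in [0, 1)
  };
}

// Weighted choice; weights are relative and need not sum to 1.
function pickWeighted<T>(rng: Rng, items: readonly (readonly [T, number])[]): T {
  const total = items.reduce((sum, [, w]) => sum + w, 0);
  let r = rng() * total;
  for (const [item, w] of items) {
    r -= w;
    if (r < 0) return item;
  }
  return items[items.length - 1][0]; // guard against float rounding
}

// Hypothetical two-level hierarchy (size -> archetype) with made-up weights.
type Size = "solopreneur" | "smb" | "enterprise";
type Archetype = "owner" | "it-manager" | "cio";

const SIZES: [Size, number][] = [["solopreneur", 3], ["smb", 5], ["enterprise", 2]];
const ARCHETYPES: [Archetype, number][] = [["owner", 4], ["it-manager", 4], ["cio", 2]];

// Plausibility filter: rules out combinations such as "CIO at solopreneur".
function plausible(size: Size, archetype: Archetype): boolean {
  if (size === "solopreneur") return archetype === "owner";
  if (size === "enterprise") return archetype !== "owner";
  return true;
}

interface Persona {
  size: Size;
  archetype: Archetype;
}

// One PRNG stream per cohort seed; each level is filtered before sampling.
function generateCohort(seed: number, n: number): Persona[] {
  const rng = mulberry32(seed);
  const cohort: Persona[] = [];
  for (let i = 0; i < n; i++) {
    const size = pickWeighted(rng, SIZES);
    const allowed = ARCHETYPES.filter(([a]) => plausible(size, a));
    cohort.push({ size, archetype: pickWeighted(rng, allowed) });
  }
  return cohort;
}
```

Filtering the candidate list before sampling (rather than sampling and rejecting) keeps the number of PRNG draws per persona fixed, which makes cohorts easier to reproduce and debug.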

Starlight adapts the same skeleton to a different domain: pediatric patients within families, with age-driven plausibility, condition profiles, family structures, and clinically coherent comorbidities.
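
As a rough illustration of what age-driven plausibility could look like, the sketch below samples an age first and then draws a condition only from profiles whose age window contains it. Everything here is an assumption made for the example: the ConditionProfile shape, the slugs, the age windows, and the weights are placeholders, not clinical guidance and not the actual Starlight type system (see the Architecture page for that).

```typescript
type Rng = () => number;

// mulberry32, as named in the Polaris pattern; same seed gives the same stream.
function mulberry32(seed: number): Rng {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical profile shape: an inclusive age window plus a relative weight.
interface ConditionProfile {
  slug: string;
  minAgeMonths: number;
  maxAgeMonths: number;
  weight: number;
}

// Placeholder values for illustration only.
const CONDITIONS: ConditionProfile[] = [
  { slug: "neonatal-jaundice", minAgeMonths: 0, maxAgeMonths: 1, weight: 1 },
  { slug: "viral-uri", minAgeMonths: 0, maxAgeMonths: 215, weight: 6 },
  { slug: "acne", minAgeMonths: 120, maxAgeMonths: 215, weight: 2 },
];

function isPlausible(c: ConditionProfile, ageMonths: number): boolean {
  return ageMonths >= c.minAgeMonths && ageMonths <= c.maxAgeMonths;
}

// Filter by age first, then take a weighted draw from what remains.
function sampleCondition(rng: Rng, ageMonths: number): ConditionProfile {
  const eligible = CONDITIONS.filter((c) => isPlausible(c, ageMonths));
  const total = eligible.reduce((sum, c) => sum + c.weight, 0);
  let r = rng() * total;
  for (const c of eligible) {
    r -= c.weight;
    if (r < 0) return c;
  }
  return eligible[eligible.length - 1]; // guard against float rounding
}

interface Patient {
  ageMonths: number;
  condition: string;
}

function generatePatients(seed: number, n: number): Patient[] {
  const rng = mulberry32(seed);
  const patients: Patient[] = [];
  for (let i = 0; i < n; i++) {
    const ageMonths = Math.floor(rng() * 216); // 0..215 months, i.e. under 18 years
    patients.push({ ageMonths, condition: sampleCondition(rng, ageMonths).slug });
  }
  return patients;
}

// An invariant a CI test could assert: no patient has an age-implausible condition.
function allPlausible(patients: Patient[]): boolean {
  return patients.every((p) => {
    const profile = CONDITIONS.find((c) => c.slug === p.condition);
    return profile !== undefined && isPlausible(profile, p.ageMonths);
  });
}
```

A check like allPlausible(generatePatients(42, 500)) is the pediatric analogue of the "no impossible combinations" rule: it holds by construction here, so a failure would point at a broken filter rather than bad luck.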

What's in this section

What's in code

The generator code lives at scripts/synthetic-personas/ in this repo and is callable from the backend's test suite and seed scripts. See the Implementation Plan for the full module layout and the Architecture for the type system.

Quick start

Three pnpm scripts at the repo root:

# Pretty-printed cohort summary
pnpm synth -- --seed=42 --size=200

# Tiny HTTP server (zero new deps) for curl-driven exploration
pnpm synth:serve # then: curl http://127.0.0.1:7777/health

# Regenerate the published 20-sample verification page
pnpm synth:samples > docs/docs/synthetic-personas/sample-cohort.md

Verification

The Sample Cohort page is a reproducible verification artifact: 20 archetypal pediatric patients spanning newborn → adolescent and covering all major coemission rules. Examples include a Spanish-primary single mother, a religious-exemption refuser, a foster family with CPS history, an adversarial adolescent with a self-harm history, and a sibling pair with diverging conditions. The same seed produces bit-identical output, so anyone on the team can re-run pnpm synth:samples and get the same artifact.
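
One way to make the "same seed → bit-identical output" property checkable is to fingerprint the serialized cohort and compare fingerprints across runs. The sketch below is an assumption about how such a check might look, not the repo's actual test: generateSamples, SamplePersona, and cohortFingerprint are hypothetical stand-ins, and FNV-1a is used only because it needs no dependencies.

```typescript
type Rng = () => number;

// mulberry32: deterministic 32-bit PRNG, seeded per cohort.
function mulberry32(seed: number): Rng {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical stand-in for the real generator's output type.
interface SamplePersona {
  id: string;
  ageMonths: number;
}

// Hypothetical stand-in for the real generator: all randomness comes from the seed.
function generateSamples(seed: number, size: number): SamplePersona[] {
  const rng = mulberry32(seed);
  const samples: SamplePersona[] = [];
  for (let i = 0; i < size; i++) {
    samples.push({ id: "p-" + i, ageMonths: Math.floor(rng() * 216) });
  }
  return samples;
}

// 32-bit FNV-1a over UTF-16 code units, returned as 8 hex characters.
function fnv1a(text: string): string {
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return ("0000000" + (h >>> 0).toString(16)).slice(-8);
}

// Fingerprint of the serialized cohort; equal fingerprints for equal seeds
// is the property the Sample Cohort page relies on.
function cohortFingerprint(seed: number, size: number): string {
  return fnv1a(JSON.stringify(generateSamples(seed, size)));
}

const first = cohortFingerprint(42, 20);
const second = cohortFingerprint(42, 20);
console.log(first === second ? "deterministic" : "NON-DETERMINISTIC");
```

The property only holds if nothing in the generation path reads from Math.random, the clock, or unordered iteration, so a check of this kind is most useful when it runs against the real generator in CI.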

Cross-references