Architecture

The sober draft. Pruned from the Ideation page; targeted at "what we'll actually build." The pattern follows Polaris's packages/scripts/personas/ very closely; the domain mapping is what makes it ours.

Layered model

┌─────────────────────────────────────────────────────────────┐
│  L4 · CLINICAL ARTIFACT STREAM                              │
│      Visits · Vaccines · Notes · Rx · Docs · Messages       │
│      (derived from the persona's structural shape)          │
├─────────────────────────────────────────────────────────────┤
│  L3 · PERSONA SPEC                                          │
│      Family + Guardians + Children + Sentiment + Voice      │
├─────────────────────────────────────────────────────────────┤
│  L2 · SAMPLING (deterministic, hierarchical)                │
│      familyArchetype → demographics → children → conditions │
├─────────────────────────────────────────────────────────────┤
│  L1 · CATALOGS                                              │
│      familyArchetypes · conditions · regions · names · …    │
├─────────────────────────────────────────────────────────────┤
│  L0 · PRIMITIVES                                            │
│      mulberry32 PRNG · weighted-pick · plausibility filters │
└─────────────────────────────────────────────────────────────┘

L0–L2 are deterministic and pure-TypeScript. L3 is the materialized persona object. L4 is the clinical-artifact stream — partially deterministic (vaccines, scheduled well-visits) and partially LLM-generated (note narratives, message threads).
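As a rough illustration of the L0 primitives, the sketch below shows a mulberry32 generator and a weighted pick built on top of it. The function name `weightedPick` and its tuple-based signature are assumptions made for this example, not the names used in the Polaris code.

```typescript
// Sketch only. mulberry32 is the widely used 32-bit PRNG; `weightedPick`
// and its signature are hypothetical names chosen for illustration.
export function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // float in [0, 1)
  };
}

/** Picks one item with probability proportional to its weight. */
export function weightedPick<T>(
  rng: () => number,
  items: readonly (readonly [T, number])[],
): T {
  const total = items.reduce((sum, [, w]) => sum + w, 0);
  if (items.length === 0 || total <= 0) {
    throw new Error('weightedPick: need at least one positive weight');
  }
  let r = rng() * total;
  for (const [item, w] of items) {
    r -= w;
    if (r < 0) return item;
  }
  return items[items.length - 1][0]; // guard against floating-point drift
}
```

Because every draw goes through the injected `rng`, two generators created from the same seed yield the same sequence of picks, which is what the determinism requirement relies on.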

The fundamental object — `FamilyPersonaSpec`

Polaris generates one persona per "row." We generate one Family per row, with N children and M guardians underneath. This shape is deliberate — it's how the actual Starlight data model works (Family + Guardian + Patient).

export interface FamilyPersonaSpec {
  /** Stable id deterministic from cohort seed + index. */
  id: string;
  cohortSeed: number;

  /** Structural archetype that drove generation. */
  familyArchetype: FamilyArchetypeSlug;

  /** Household-level facts. */
  household: Household;

  /** One or more guardians. Order is significant: index 0 is "primary." */
  guardians: GuardianSpec[];

  /** One or more children. Each is a fully-formed PatientSpec. */
  children: PatientSpec[];

  /** Subscription + billing posture. */
  billing: BillingSpec;

  /** Generation timestamp. */
  generatedAt: string;
}

`Household`

export interface Household {
  region: Region;                        // state, metro, ZIP-area, urban/rural
  primaryLanguage: LanguageCode;         // 'en' | 'es' | 'vi' | 'zh' | …
  spokenLanguages: LanguageCode[];       // for translation needs
  sesProxy: 0 | 1 | 2 | 3;               // 0=very-low … 3=affluent (drives plan choice + adherence)
  transportationAccess: 'good' | 'limited' | 'rural';
  splitHousehold: boolean;               // shared custody → two physical homes
}

`GuardianSpec`

export interface GuardianSpec {
  firstName: string;
  lastName: string;
  relationshipToChildren: GuardianRelationship[]; // per-child relationship
  // 'mother' | 'father' | 'step-mother' | 'step-father' |
  // 'grandmother' | 'grandfather' | 'foster-parent' |
  // 'court-appointed' | 'aunt' | 'uncle' | 'sibling-guardian'

  contact: { mobile: string; email: string };
  preferredChannel: 'app' | 'sms' | 'email' | 'phone';

  /** Voice / sentiment for the LLM harness. */
  sentiment: GuardianSentiment;
  voice: GuardianVoice;

  /** Hard constraints for the harness + UX. */
  constraints: GuardianConstraint[];
}

export type GuardianConstraint =
  | 'limited-english'
  | 'health-literacy-low'
  | 'health-literacy-clinician'   // parent is themselves a clinician
  | 'high-anxiety'
  | 'vaccine-hesitant'
  | 'financially-strained'
  | 'evening-only-availability'
  | 'co-parent-conflict'          // parents disagree on care
  | 'cps-history'
  | 'has-court-order';

export interface GuardianSentiment {
  engagement: 0 | 1 | 2 | 3;       // 0=disengaged … 3=hyper-engaged
  trustInMedicine: 0 | 1 | 2 | 3;  // 0=actively-skeptical … 3=full-trust
  anxietyLevel: 0 | 1 | 2 | 3;
  financialPressure: 0 | 1 | 2 | 3;
}

export interface GuardianVoice {
  tone: 'warm' | 'direct' | 'anxious' | 'skeptical' | 'exhausted' | 'casual';
  readingLevel: 'plain' | 'professional' | 'medical';
  backstory: string;               // ~2 sentences for LLM-prompt preamble
}

`PatientSpec` (the child)

export interface PatientSpec {
  firstName: string;
  lastName: string;                 // not always same as guardian (step / divorced / adopted)
  sex: 'F' | 'M';
  dob: string;                      // ISO date, deterministic from cohort+index
  ageBand: AgeBand;                 // 'newborn' | 'infant' | 'toddler' | 'preschool' | 'school' | 'tween' | 'adolescent'

  /** Birth history — present always; richer for newborns/infants. */
  birth: BirthHistory;

  /** Active conditions / allergies. Coherent with ageBand. */
  conditionProfile: ConditionProfile;

  /** Vaccine status — what's complete, what's pending, what's declined. */
  vaccineStatus: VaccineStatus;

  /** Growth-percentile trajectory at the cohort's "today." */
  growth: GrowthSnapshot;

  /** Constraints that aren't conditions but matter for UX. */
  patientConstraints: PatientConstraint[];

  /** Per-child seed for deriving stochastic timeline events. */
  seed: number;
}

export type AgeBand =
  | 'newborn'      // 0–28 days
  | 'infant'       // 1–11 months
  | 'toddler'      // 1–2 years
  | 'preschool'    // 3–4 years
  | 'school'       // 5–10 years
  | 'tween'        // 11–13 years
  | 'adolescent';  // 14–17 years

export interface BirthHistory {
  gestationalAgeWeeks: number;      // 24–42
  deliveryMode: 'vaginal' | 'csection' | 'vacuum-assisted' | 'forceps';
  birthWeightKg: number;
  birthLengthCm: number;
  apgar1: number;
  apgar5: number;
  nicuStay: boolean;
  feeding: 'exclusive-breast' | 'mixed' | 'formula' | 'transitioned-solids';
  bilirubinPeak?: number;
}

export interface ConditionProfile {
  allergies: AllergySlug[];           // 'peanut-severe' | 'tree-nut-mod' | 'penicillin' | 'NKDA' | …
  chronicConditions: ChronicSlug[];   // 'asthma-mild' | 'asthma-moderate' | 'eczema' | 'adhd' | 't1d' | …
  developmental: DevelopmentalSlug[]; // 'speech-delay' | 'autism-mild' | 'iep-on-file' | …
  mentalHealth: MentalHealthSlug[];   // 'anxiety' | 'depression' | 'eating-disorder' | …
  /** Active medications, each paired with one of the chronic conditions above. */
  activeRx: RxSpec[];
}

export type PatientConstraint =
  | 'cii-rx-on-file'           // Schedule II (CII) controlled-substance Rx, e.g. an ADHD stimulant
  | 'epi-pen-on-file'
  | 'glucagon-on-file'
  | 'iep-on-file'
  | '504-on-file'
  | 'school-physical-due'
  | 'sport-physical-due'
  | 'transitioning-to-adult'
  | 'recent-er-visit'
  | 'prior-cps-flag'
  | 'gender-incongruence-conversation'
  | 'newborn-home-visit-due'
  | 'travel-vaccines-needed'
  | 'recent-international-arrival';

export interface VaccineStatus {
  posture:
    | 'on-schedule'
    | 'mild-delay'
    | 'partial-catch-up-needed'
    | 'religious-exemption'
    | 'philosophical-exemption'
    | 'partial-international';
  doses: VaccineDose[];        // structured list of given doses with date + lot (synth)
  declinedSlugs: VaccineSlug[]; // explicitly declined
  pendingSlugs: VaccineSlug[];  // due / overdue
}

export interface GrowthSnapshot {
  weightKg: number;
  heightCm: number;
  weightPercentile: number;     // 0–100
  heightPercentile: number;
  bmiPercentile?: number;
  trajectory: 'tracking' | 'rising' | 'falling' | 'volatile';
}
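The coherence between `dob` and `ageBand` can be checked mechanically. The following is a minimal sketch of one way to map a date of birth to an `AgeBand` at the cohort's "today"; the function names are hypothetical, and it assumes that a child older than 28 days but under 1 year counts as an infant, since the band comments leave days 29 to 31 unassigned.

```typescript
type AgeBand =
  | 'newborn' | 'infant' | 'toddler' | 'preschool'
  | 'school' | 'tween' | 'adolescent';

const DAY_MS = 86400000;

/** Completed years between two ISO dates (YYYY-MM-DD), compared in UTC. */
function ageInYears(dob: string, today: string): number {
  const b = new Date(`${dob}T00:00:00Z`);
  const t = new Date(`${today}T00:00:00Z`);
  let years = t.getUTCFullYear() - b.getUTCFullYear();
  const beforeBirthday =
    t.getUTCMonth() < b.getUTCMonth() ||
    (t.getUTCMonth() === b.getUTCMonth() && t.getUTCDate() < b.getUTCDate());
  if (beforeBirthday) years -= 1;
  return years;
}

/** Returns null for a future dob or for anyone 18+ (outside the panel). */
export function ageBandFor(dob: string, today: string): AgeBand | null {
  const days = Math.floor(
    (Date.parse(`${today}T00:00:00Z`) - Date.parse(`${dob}T00:00:00Z`)) / DAY_MS,
  );
  if (days < 0) return null;
  if (days <= 28) return 'newborn';
  const years = ageInYears(dob, today);
  if (years < 1) return 'infant'; // assumption: covers day 29 up to 11 months
  if (years <= 2) return 'toddler';
  if (years <= 4) return 'preschool';
  if (years <= 10) return 'school';
  if (years <= 13) return 'tween';
  if (years <= 17) return 'adolescent';
  return null;
}
```

A generator could draw the band first and then pick a `dob` inside it, using a function like this only as the test-side check.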

`BillingSpec`

export interface BillingSpec {
  plan: 'monthly' | 'quarterly' | 'annual';
  rateUSD: number;
  cardOnFile: boolean;
  status: 'active' | 'past-due-30' | 'past-due-60' | 'cancelled-pending';
  splitBilling?: { mom: number; dad: number };  // 0..100, must sum to 100
  employerSponsored: boolean;
}
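The "must sum to 100" rule on `splitBilling` is easy to enforce with a small predicate. This is a sketch under the assumption that shares are percentages and that a tiny floating-point tolerance is acceptable; `isValidSplit` is a hypothetical name.

```typescript
interface SplitBilling {
  mom: number;
  dad: number;
}

/** True when the split is absent, or both shares are 0..100 and sum to 100. */
export function isValidSplit(split: SplitBilling | undefined): boolean {
  if (split === undefined) return true; // the field is optional
  const inRange = (n: number) => Number.isFinite(n) && n >= 0 && n <= 100;
  const sumsTo100 = Math.abs(split.mom + split.dad - 100) < 1e-9;
  return inRange(split.mom) && inRange(split.dad) && sumsTo100;
}
```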

Cohort envelope

export interface FamilyCohort {
  seed: number;
  size: number;                    // # of families (children = ~1.6× this)
  generatedAt: string;
  families: FamilyPersonaSpec[];
  summary: CohortSummary;
}

export interface CohortSummary {
  byFamilyArchetype: Record<FamilyArchetypeSlug, number>;
  byPrimaryLanguage: Record<LanguageCode, number>;
  byRegion: Record<string, number>;
  byChildAgeBand: Record<AgeBand, number>;
  byChronicCondition: Record<ChronicSlug, number>;
  byVaccinePosture: Record<VaccineStatus['posture'], number>;
  byBillingStatus: Record<BillingSpec['status'], number>;
  totalFamilies: number;
  totalChildren: number;
  totalGuardians: number;
}
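To make the summary shape concrete, here is a hedged sketch of how a few of the `CohortSummary` counters might be tallied. It uses a trimmed-down family type (`MiniFamily`) and a hypothetical function name, and covers only the age-band breakdown and the totals.

```typescript
type AgeBand =
  | 'newborn' | 'infant' | 'toddler' | 'preschool'
  | 'school' | 'tween' | 'adolescent';

// Trimmed stand-in for FamilyPersonaSpec: only the fields this tally reads.
interface MiniFamily {
  guardians: unknown[];
  children: { ageBand: AgeBand }[];
}

export function summarizeCounts(families: MiniFamily[]) {
  // Pre-seed every band with 0 so the Record is total even for empty bands.
  const byChildAgeBand: Record<AgeBand, number> = {
    newborn: 0, infant: 0, toddler: 0, preschool: 0,
    school: 0, tween: 0, adolescent: 0,
  };
  let totalChildren = 0;
  let totalGuardians = 0;
  for (const family of families) {
    totalGuardians += family.guardians.length;
    for (const child of family.children) {
      byChildAgeBand[child.ageBand] += 1;
      totalChildren += 1;
    }
  }
  return { byChildAgeBand, totalFamilies: families.length, totalChildren, totalGuardians };
}
```

Pre-seeding the record matters for the coverage tests: a band with zero children shows up as `0` rather than as a missing key.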

Sampling (L2) — hierarchical and deterministic

The sampling cascade per family:

seed → mulberry32 RNG
  ↓
draw familyArchetype  (weighted, US distribution)
  ↓
draw region           (population × archetype affinity, see Region weighting)
  ↓
derive household      (region drives language probability, SES proxy)
  ↓
draw # of children    (Poisson-ish capped 1–5, weighted by archetype)
  ↓
draw guardians        (count + relationship from archetype)
  ↓
for each child:
   ↓
   draw ageBand        (uniform-ish across the practice's panel)
   ↓
   derive birth        (age-band-appropriate)
   ↓
   draw conditions     (age-band plausibility filter applied)
   ↓
   derive vaccines     (age + posture + condition-modifying)
   ↓
   derive growth       (age + condition + family pattern)
   ↓
draw billing
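The first few steps of the cascade can be sketched as follows. This is an illustration under stated assumptions rather than the actual generator: the sub-seed mixing constant, the two-row archetype table with its `meanChildren` values, and all function names are hypothetical, and the region, household, and guardian steps are omitted for brevity. "Poisson-ish capped 1–5" is interpreted here as a Knuth Poisson draw clamped to that range.

```typescript
// Sketch only: names, weights, and the seed-mixing step are assumptions.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface ArchetypeRow { slug: string; weight: number; meanChildren: number }

const ARCHETYPES: ArchetypeRow[] = [
  { slug: 'rural-multigenerational', weight: 1, meanChildren: 2.2 },
  { slug: 'recently-arrived-immigrant-family', weight: 1, meanChildren: 1.8 },
];

function drawArchetype(rng: () => number): ArchetypeRow {
  const total = ARCHETYPES.reduce((sum, a) => sum + a.weight, 0);
  let r = rng() * total;
  for (const a of ARCHETYPES) {
    r -= a.weight;
    if (r < 0) return a;
  }
  return ARCHETYPES[ARCHETYPES.length - 1];
}

/** Knuth's Poisson sampler, clamped to the 1..5 range the cascade calls for. */
function drawChildCount(rng: () => number, mean: number): number {
  const limit = Math.exp(-mean);
  let k = 0;
  let p = 1;
  do {
    k += 1;
    p *= rng();
  } while (p > limit);
  return Math.min(5, Math.max(1, k - 1));
}

export interface FamilySkeleton { index: number; archetype: string; childSeeds: number[] }

export function sampleFamilySkeleton(cohortSeed: number, index: number): FamilySkeleton {
  // Each family gets its own RNG stream derived from (cohortSeed, index).
  const familySeed = (cohortSeed ^ Math.imul(index + 1, 0x9e3779b1)) >>> 0;
  const rng = mulberry32(familySeed);
  const archetype = drawArchetype(rng);
  const childCount = drawChildCount(rng, archetype.meanChildren);
  // One uint32 seed per child, for the per-child timeline draws.
  const childSeeds = Array.from({ length: childCount }, () => Math.floor(rng() * 4294967296));
  return { index, archetype: archetype.slug, childSeeds };
}
```

Giving each family its own stream means family 17 comes out the same whether the cohort has 20 families or 2,000, which keeps individual rows reproducible when `size` changes.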

Plausibility filters at every step, mirroring Polaris's "no CIO at solopreneur" pattern:

A 2-year-old can't carry an ADHD diagnosis (AAP guidelines cover diagnosis from age 4; DSM-5 itself sets no minimum age).
A newborn can't have a school physical.
An adolescent can have HEEADSSS-confidential threads; a 5-year-old cannot.
A `religious-exemption` vaccine posture is incompatible with a complete, on-schedule dose record.
T1D requires a chronic Rx (insulin); the generator must co-emit the two.
An immunocompromised flag must defer live vaccines.
Recent international arrival is incompatible with "fully on US schedule."

Filters are encoded as TypeScript predicates and applied at draw time — exactly the Polaris approach.
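A few of the rules above, written as predicates, might look like the sketch below. The `ChildDraft` shape, the filter names, and the `'insulin'` Rx slug are assumptions for illustration; the ADHD rule is expressed at age-band granularity, so it only rejects toddlers and younger.

```typescript
type AgeBand =
  | 'newborn' | 'infant' | 'toddler' | 'preschool'
  | 'school' | 'tween' | 'adolescent';

// Hypothetical in-progress child record, checked as the draw proceeds.
interface ChildDraft {
  ageBand: AgeBand;
  chronicConditions: string[];
  activeRx: string[];
  patientConstraints: string[];
}

type PlausibilityFilter = (child: ChildDraft) => boolean;

const TOO_YOUNG_FOR_ADHD: AgeBand[] = ['newborn', 'infant', 'toddler'];

const noAdhdInToddlers: PlausibilityFilter = (c) =>
  !(c.chronicConditions.includes('adhd') && TOO_YOUNG_FOR_ADHD.includes(c.ageBand));

const noSchoolPhysicalForNewborns: PlausibilityFilter = (c) =>
  !(c.ageBand === 'newborn' && c.patientConstraints.includes('school-physical-due'));

// 'insulin' is an assumed Rx slug; the real catalog may name it differently.
const t1dRequiresInsulin: PlausibilityFilter = (c) =>
  !c.chronicConditions.includes('t1d') || c.activeRx.includes('insulin');

export const FILTERS: PlausibilityFilter[] = [
  noAdhdInToddlers,
  noSchoolPhysicalForNewborns,
  t1dRequiresInsulin,
];

/** A draft is plausible only if every filter accepts it. */
export function isPlausible(child: ChildDraft): boolean {
  return FILTERS.every((filter) => filter(child));
}
```

At draw time, a rejected draft can either be redrawn from the same RNG stream or repaired (for example by adding the missing Rx); which of the two applies is a per-rule decision and is not settled by this sketch.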

Region weighting

Identical to Polaris's strategy:

Base weight per state = 2020 Census population in millions. California (~39.5M) is ~68× more likely than Wyoming (~0.58M).
Family-archetype × region affinity — e.g., "recently-arrived-immigrant-family" has multipliers for TX/CA/FL/NY; "rural-multigenerational" has multipliers for the Mountain West and Appalachia. Affinity is a multiplier, not a hard restriction — every state can still produce every archetype, just at different probabilities.
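The weighting described above reduces to a product of two lookups. In the sketch below, the population figures are rounded 2020 Census values for a handful of states, while the affinity multipliers and function names are purely hypothetical placeholders.

```typescript
// 2020 Census resident population in millions (rounded); subset of states only.
const POPULATION_M: Record<string, number> = {
  CA: 39.5, TX: 29.1, FL: 21.5, NY: 20.2, WY: 0.58,
};

// Hypothetical multipliers; the real catalog values are not specified here.
const AFFINITY: Record<string, Record<string, number>> = {
  'recently-arrived-immigrant-family': { TX: 1.5, CA: 1.5, FL: 1.5, NY: 1.5 },
  'rural-multigenerational': { WY: 2.0 },
};

/** Unnormalized weight: base population times the archetype's affinity. */
export function regionWeight(archetype: string, state: string): number {
  const population = POPULATION_M[state];
  if (population === undefined) throw new Error(`unknown state: ${state}`);
  const multiplier = AFFINITY[archetype]?.[state] ?? 1; // no entry means neutral
  return population * multiplier;
}

/** Normalized draw probabilities across the states in the table. */
export function regionProbabilities(archetype: string): Record<string, number> {
  const states = Object.keys(POPULATION_M);
  const total = states.reduce((sum, st) => sum + regionWeight(archetype, st), 0);
  const out: Record<string, number> = {};
  for (const st of states) out[st] = regionWeight(archetype, st) / total;
  return out;
}
```

Because a missing affinity entry falls back to a multiplier of 1 rather than 0, every state keeps a non-zero probability for every archetype, matching the "multiplier, not a hard restriction" rule.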

LLM-generated content (L4)

Most of the persona is structurally generated. Only natural-language texture is a candidate for LLM generation, and some of it starts out hand-curated:

Field	Generation method
`voice.backstory` (per guardian)	Hand-curated templates, lightly randomized. (Polaris pattern.)
Visit-note narratives in the timeline stream	LLM (Claude), conditioned on the persona's structured fields, age, and the visit type.
Parent-app message thread content	LLM, conditioned on tone + reading level + topic + the child's actual record.
Document content (lab reports, imaging, school physicals)	LLM, conditioned on the structured panel/result.
Audio conversation lines (for AI scribe demos)	Hand-curated initially → LLM-generated + TTS later.

Crucial design rule: the LLM never invents structured fields. The LLM only renders structured fields into prose. This keeps the generator's distribution properties intact even when natural language is added — the structure is the source of truth, the prose is decoration.
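One way to hold that line in code is to make the prompt a pure function of the structured fields and to treat the model's output as an opaque string that is stored next to those fields, never parsed back into them. The sketch below assumes hypothetical names (`VisitNoteFacts`, `buildVisitNotePrompt`, `attachNarrative`) and leaves the actual model call out entirely.

```typescript
// Hypothetical slice of the structured record used to condition one note.
interface VisitNoteFacts {
  visitType: string;
  ageBand: string;
  chronicConditions: string[];
  allergies: string[];
  tone: string;
  readingLevel: string;
}

const listOrNone = (xs: string[]) => (xs.length > 0 ? xs.join(', ') : 'none recorded');

/** Pure function of the structured facts; no randomness, no I/O. */
export function buildVisitNotePrompt(facts: VisitNoteFacts): string {
  return [
    'Write a short pediatric visit-note narrative.',
    'Use ONLY the facts listed below. Do not add diagnoses, medications, doses, or dates.',
    `Visit type: ${facts.visitType}`,
    `Age band: ${facts.ageBand}`,
    `Chronic conditions: ${listOrNone(facts.chronicConditions)}`,
    `Allergies: ${listOrNone(facts.allergies)}`,
    `Guardian tone: ${facts.tone}`,
    `Reading level: ${facts.readingLevel}`,
  ].join('\n');
}

export interface RenderedNote {
  readonly facts: Readonly<VisitNoteFacts>;
  readonly narrative: string;
}

/** Stores the prose beside the facts; the prose is never read back as data. */
export function attachNarrative(facts: VisitNoteFacts, narrative: string): RenderedNote {
  // Shallow freeze of a copy: top-level fields of the stored facts cannot be reassigned.
  return { facts: Object.freeze({ ...facts }), narrative };
}
```

With this split, distribution tests run against `facts` alone, so swapping models or prompts cannot shift any cohort statistic.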

Distribution invariants (CI tests)

Mirroring Polaris's test suite. Any cohort of 1,000+ families must satisfy all of the following; the long-tail check applies to any cohort of 200 or more:

Coverage: every age band has ≥ 50 children. Every primary-language code in the catalog appears at least once. Every census region has at least 5 families.
Concentration: no single condition slug exceeds 30% of children. No family archetype exceeds 25% of families. No state exceeds 25% of families (California will be the binding case here).
Coherence: every child's age is within their age band's range. Every chronic Rx is paired with the corresponding chronic condition. Every vaccine posture is internally consistent (no `on-schedule` posture with 3-year gaps in doses). Every family with `splitHousehold: true` has more than one guardian, at least one of whom carries a custody-related constraint (such as `co-parent-conflict` or `has-court-order`).
Determinism: same (seed, size, options) produces bit-identical families[]. PRNG is mulberry32. No Math.random() allowed.
Long-tail presence: at least 1 newborn home-visit pattern, at least 1 adolescent HEEADSSS-confidential pattern, at least 1 vaccine-hesitant family, at least 1 split-custody family, at least 1 limited-English family in any cohort ≥ 200.
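The concentration and determinism checks lend themselves to small helper functions that a test file can call. The following sketch uses a trimmed family type and hypothetical helper names; the thresholds are the ones listed above.

```typescript
// Trimmed stand-in for FamilyPersonaSpec with only the fields checked here.
interface MiniFamily {
  archetype: string;
  state: string;
  children: { chronicConditions: string[] }[];
}

/** Largest share held by any single label, relative to `total`. */
function topShare(labels: string[], total: number): { label: string; share: number } | null {
  const counts = new Map<string, number>();
  for (const l of labels) counts.set(l, (counts.get(l) ?? 0) + 1);
  let best: { label: string; share: number } | null = null;
  for (const [label, n] of Array.from(counts)) {
    const share = total > 0 ? n / total : 0;
    if (best === null || share > best.share) best = { label, share };
  }
  return best;
}

/** Returns one message per violated concentration limit (empty when all pass). */
export function concentrationViolations(families: MiniFamily[]): string[] {
  const violations: string[] = [];
  const conditionLabels: string[] = [];
  let childCount = 0;
  for (const family of families) {
    for (const child of family.children) {
      childCount += 1;
      // Count each slug once per child, even if it is listed twice.
      conditionLabels.push(...Array.from(new Set(child.chronicConditions)));
    }
  }
  const checks: [string, { label: string; share: number } | null, number][] = [
    ['condition', topShare(conditionLabels, childCount), 0.3],
    ['archetype', topShare(families.map((f) => f.archetype), families.length), 0.25],
    ['state', topShare(families.map((f) => f.state), families.length), 0.25],
  ];
  for (const [kind, top, limit] of checks) {
    if (top !== null && top.share > limit) {
      violations.push(`${kind} "${top.label}" at ${(top.share * 100).toFixed(1)}% exceeds ${limit * 100}%`);
    }
  }
  return violations;
}

/** Runs the generator twice and compares the serialized output. */
export function isDeterministic<T>(generate: () => T): boolean {
  return JSON.stringify(generate()) === JSON.stringify(generate());
}
```

Comparing serialized output only yields "bit-identical" results if the generator itself avoids wall-clock values and unordered iteration; the helper detects such drift but does not prevent it.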

Cohort modes

Like Polaris's `industriesAllow` and related options, we'll support stress-testing modes:

export interface GenerateCohortOptions {
  seed: number;
  size: number;                                // # of families
  familyArchetypeWeights?: Partial<Record<FamilyArchetypeSlug, number>>;
  archetypesAllow?: FamilyArchetypeSlug[];
  conditionsAllow?: ChronicSlug[];
  ageBandWeights?: Partial<Record<AgeBand, number>>;
  /** Adversarial mode — bias toward edge cases, suicidality, abuse, drug-interaction, fraud-attempt patterns. */
  adversarialBias?: number;                    // 0..1 — share of cohort that is adversarial
  /** Newborn-heavy mode — generate a higher proportion of newborns (for home-visit testing). */
  newbornBias?: number;                        // 0..1
}
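How the weight overrides and allow-lists combine is not pinned down above. The sketch below shows one plausible reading, in which `familyArchetypeWeights` replaces individual base weights and `archetypesAllow` then filters the result; both that precedence and the function name are assumptions.

```typescript
type Weights = Record<string, number>;

/**
 * Resolves the archetype weights actually used for drawing.
 * Assumed semantics: an override replaces the base weight for that slug,
 * and an allow-list (when given) drops every slug not named in it.
 */
export function effectiveArchetypeWeights(
  base: Weights,
  overrides?: Partial<Weights>,
  allow?: string[],
): Weights {
  const out: Weights = {};
  for (const [slug, baseWeight] of Object.entries(base)) {
    if (allow !== undefined && !allow.includes(slug)) continue;
    out[slug] = overrides?.[slug] ?? baseWeight;
  }
  const hasPositive = Object.values(out).some((w) => w > 0);
  if (!hasPositive) {
    throw new Error('options leave no archetype with a positive weight');
  }
  return out;
}
```

For example, a custody-focused stress run might pass an allow-list of one or two archetypes and leave the weights untouched, while a demo cohort might keep every archetype and only nudge a few weights upward.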

Module layout (code)

Mirrors Polaris closely, in scripts/synthetic-personas/:

scripts/synthetic-personas/
├── README.md
├── package.json (or use root tsx)
├── types.ts                — All exported types (above)
├── data.ts                 — Catalogs: familyArchetypes, conditions, regions, names, vaccines
├── generate.ts             — Main generator: cohort + family + child
├── generate.test.ts        — Distribution + plausibility tests
├── timeline.ts             — Derive visit / vaccine / Rx / message stream from PersonaSpec
├── llm.ts                  — Optional Claude calls for narrative texture
├── cli.ts                  — Command-line entrypoint: `pnpm synth --seed=42 --size=500`
└── index.ts                — Public API surface

The first PR ships everything except llm.ts — natural-language rendering can come later. The structural cohort generator is the load-bearing piece.

What this gives us, immediately

A deterministic 500-family cohort the backend test suite can load on every CI run.
Stress-test modes for newborn flows, adolescent confidentiality, vaccine hesitancy, custody complexity.
A staging-environment seed that lets a new engineer spin up a realistic-feeling Starlight without ever needing access to real PHI.
An evaluation harness for AI features: run any new chart-side AI feature against the cohort and compute precision/recall by condition.
Demo material that's defensibly synthetic — every screenshot we put on the website or in marketing comes from a cohort row with a deterministic seed; we can reproduce any image and prove it's not a real patient.

The next page, Implementation Plan, sequences the build by gate.
