Data Sources

A comprehensive map of medical data sources Starlight Practice can ingest, license, or substitute around. Built to answer the question: "For pediatric DPC, what's the most rigorous set of data we could put into our data lakes: free, public, or paid?"

Erik's framing: "I just want to know what's the whole sources of data that I can get truly free and public, and notable paid sources that we could also consider."

This section is the answer.

How this section is organized

Reading order

  1. Government / Public-Sector Sources: start here. NIH/NLM, CDC, FDA, HRSA/MCHB, AHRQ, CMS, state registries. Free or near-free, often canonical for pediatric care. ~80 sources catalogued with license terms, ingestion priority, and pediatric relevance per source.
  2. Vocabularies, Guidelines & Standards: code sets (ICD-10-CM, SNOMED CT, LOINC, RxNorm, NDC, FHIR R4, USCDI, CVX), pediatric-specific structured content (Bright Futures, ACIP, growth charts, USPSTF), drug-interaction sources, and validated screeners. Critical licensing flags (CPT, DSM, SNOMED-Intl).
  3. Paid Commercial References: UpToDate, DynaMed, Lexicomp, FDB MedKnowledge, etc. Includes the license re-use cheat sheet (most paid sources cannot be ingested into LLM context without an AI addendum, which is load-bearing for our architecture).
  4. OpenEvidence Benchmark: what the leading clinical-AI platform draws from, as a reality check on our own mix.

Key cross-cutting findings

These show up repeatedly across the four pages and are worth surfacing here:

  • CPT codes (AMA-licensed) are the single biggest billing-side licensing risk. Recommendation: stay pure-cash-pay DPC for v1 and avoid CPT entirely; license from AMA only when an insurance-billing module is added.
  • SNOMED CT US Edition is free for US use only. Any non-US clinician or patient triggers a paid Affiliate License obligation, which becomes material if Starlight ever expands internationally.
  • Pediatric weight-based dosing must come from a paid commercial reference (Lexicomp Pediatric or Micromedex NeoFax) before any active prescribing module ships: wrong-dose-by-weight is the #1 patient-safety risk in pediatrics. Rolling our own from public sources is not acceptable.
  • DSM-5-TR can be sidestepped for now. ICD-10-CM F-codes plus freely available screeners (PHQ-9, GAD-7, Vanderbilt, M-CHAT-R, SWYC, PSC) cover MVP behavioral-health needs without an APA license.
  • Most paid clinical references prohibit ingestion into LLM context absent a separate AI addendum. Starlight's AI substrate must default to free/open sources (PubMed, FDA, CDC, AAP open guidelines, MedlinePlus). Paid sources should only feed clinician-direct UI, not RAG.
  • Synthea (Apache 2.0) is the right answer for HIPAA-safe dev/QA seed data with broad coverage. Our synthetic-personas library is purpose-built for pediatric specifics, and Synthea is the broader community baseline.
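
The AI-context rule above (open sources by default, paid sources only with an AI addendum) could be enforced as a default-deny check at ingestion time. The following is a minimal sketch, not an existing Starlight module: the `License` categories, the `Source` record, and the function names are hypothetical, and the license tags in the sample catalog are illustrative only and must be confirmed against each vendor's actual terms.

```python
from dataclasses import dataclass
from enum import Enum


class License(Enum):
    """Hypothetical license categories for this sketch."""
    OPEN = "open"                          # free/public, e.g. PubMed
    PAID = "paid"                          # commercial, no AI addendum on file
    PAID_AI_ADDENDUM = "paid_ai_addendum"  # commercial, AI addendum signed


@dataclass(frozen=True)
class Source:
    name: str
    license: License


def allowed_in_ai_context(source: Source) -> bool:
    """Default-deny: only open sources or paid sources with an AI addendum."""
    return source.license in (License.OPEN, License.PAID_AI_ADDENDUM)


def split_by_surface(sources):
    """Return (ai_context, clinician_ui_only) lists of source names."""
    ai_context, clinician_ui_only = [], []
    for source in sources:
        if allowed_in_ai_context(source):
            ai_context.append(source.name)
        else:
            clinician_ui_only.append(source.name)
    return ai_context, clinician_ui_only


# Illustrative tags only; verify real license terms before relying on them.
catalog = [
    Source("PubMed", License.OPEN),
    Source("MedlinePlus", License.OPEN),
    Source("UpToDate", License.PAID),
    Source("Lexicomp Pediatric", License.PAID),
]
ai_context, clinician_ui_only = split_by_surface(catalog)
```

Treating the addendum as an explicit license state, rather than a boolean on the paid category, keeps the check default-deny: a newly catalogued paid source stays out of the vector store until someone records a signed addendum.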

Verification status

Important: The cartography research that produced these pages was completed without live web access during the session. All URLs and pricing are flagged in-doc as "unverified during the research session." Before signing any contract or wiring an ingestion pipeline, the data team must:

  1. Confirm the canonical URL returns 200 and the expected content type.
  2. Snapshot the current ToS / license terms (these change).
  3. Confirm rate limits and robots.txt posture before crawling.
  4. Verify pricing directly with the vendor for "contact for pricing" sources.
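
Steps 1 to 3 of this checklist lend themselves to automation; step 4 stays manual. The sketch below uses only the Python standard library and is one possible shape, not an agreed pipeline. All function names are hypothetical. It assumes the server answers `HEAD` requests (some do not, in which case a `GET` is needed) and that the ToS text and `robots.txt` body have already been fetched by the caller.

```python
import hashlib
import urllib.request
from urllib.robotparser import RobotFileParser


def head(url: str, timeout: float = 10.0):
    """Issue a HEAD request and return (status, content type).

    Requires network access. urlopen raises HTTPError on 4xx/5xx responses,
    so callers should treat an exception as a failed check.
    """
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.headers.get_content_type()


def url_ok(status: int, content_type: str, expected_type: str) -> bool:
    """Step 1: the URL returned 200 with the expected content type."""
    return status == 200 and content_type == expected_type


def license_fingerprint(terms_text: str) -> str:
    """Step 2: SHA-256 of whitespace-normalized terms, stored with the snapshot
    so that a later change in the ToS text is detectable."""
    normalized = " ".join(terms_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def may_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Step 3: evaluate an already-fetched robots.txt body for one URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

A caller would combine these as `url_ok(*head(url), expected_type)`, store `license_fingerprint(...)` alongside the dated ToS snapshot, and refuse to crawl any URL for which `may_crawl(...)` is false. Rate limits are not covered by `robots.txt` alone and still need to be read from the vendor's documentation.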

Each of the three child pages opens with a verification disclaimer and lists open verification questions in its own § "Open Questions Requiring Verification."

Phased ingestion (gates, not dates)

| Gate | Output |
| --- | --- |
| DS0 · MVP foundation | Free vocabularies wired (ICD-10-CM, RxNorm, NDC, CVX/MVX, LOINC). FHIR R4 server scaffolded. Bright Futures + ACIP schedules ingested as structured derivatives. CDC growth charts. |
| DS1 · Active prescribing | Lexicomp Pediatric (or Micromedex NeoFax) licensed. Drug-interaction database (FDB MedKnowledge or substitute) wired into the eRx flow. |
| DS2 · AI grounding | PubMed Central, MedlinePlus, openFDA, AAP open-access guidelines ingested into the AI-context vector store. Synthea integrated for HIPAA-safe dev seed data. |
| DS3 · Clinical-quality reporting | HCUP-KID (with DUA), NSCH benchmarks, AHRQ measure specs ingested for the Reports surface. |
| DS4 · Insurance-billing module | CPT licensed via the AMA Vendor Program. CDT licensed if dental rolls in. |
| DS5 · International expansion | SNOMED CT International Affiliate License if any clinician/patient is outside the US. |
| Continuous | License-snapshot re-verification on vendor renewal and on regulatory change. |

The full per-source phasing detail is on each child page.
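
Because the phases are gates rather than dates, each one can be modelled as a set of outputs that must all be complete before the gate counts as passed. The sketch below is illustrative only: the `Gate` record, the output keys, and the helper names are hypothetical, and only two of the gates are shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gate:
    gate_id: str
    name: str
    required: frozenset  # hypothetical keys for the outputs this gate needs


GATES = [
    Gate(
        "DS1",
        "Active prescribing",
        frozenset({
            "pediatric_dosing_reference_licensed",
            "drug_interaction_db_wired",
        }),
    ),
    Gate("DS4", "Insurance-billing module", frozenset({"cpt_licensed"})),
]


def missing_outputs(gate: Gate, completed) -> list:
    """Return the required outputs not yet completed, sorted for stable display."""
    return sorted(gate.required - set(completed))


def gate_passed(gate: Gate, completed) -> bool:
    """A gate passes only when every required output is complete."""
    return not missing_outputs(gate, completed)


# Example: dosing reference licensed, interaction database not yet wired.
done = {"pediatric_dosing_reference_licensed"}
ds1_missing = missing_outputs(GATES[0], done)
```

With this shape, a feature flag for the prescribing module could be tied to `gate_passed(...)` for DS1 rather than to a calendar date, which keeps the licensing prerequisite from being skipped under schedule pressure.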