Data Sources
A comprehensive map of the medical data sources Starlight Practice can ingest, license, or replace with substitutes. It was built to answer the question: "For pediatric DPC, what's the most rigorous set of data we could put into our data lakes: free, public, or paid?"
Erik's framing: "I just want to know what's the whole sources of data that I can get truly free and public, and notable paid sources that we could also consider."
This section is the answer.
How this section is organized
Government & Public-Sector Data Sources
Map of US public-sector data sources for Starlight Practice (pediatric DPC EMR + parent app + AI substrate). Covers NIH/NLM, CDC, FDA, HRSA, AHRQ, CMS, state public-health, and standardized vocabularies. Paid commercial sources (UpToDate, Lexicomp, Nelson's, Harriet Lane, paywalled AAP Red Book) are tracked separately in /docs/data-sources/commercial.md.
Vocabularies & Guidelines
Comprehensive map of clinical vocabularies, ontologies, structured guideline content, and decision-support datasets relevant to Starlight Practice (pediatric DPC EMR + parent app + AI substrate).
Paid Clinical References & Datasets
The commercial counterpart to our free public-sector and standardized-vocabulary maps. This catalogs the paid clinical-reference products and licensed research datasets that a pediatric DPC EMR like Starlight Practice would either license, evaluate-then-substitute, or skip outright.
OpenEvidence Benchmark
Research on OpenEvidence's licensed corpus as a benchmark / reference for what a leading clinical-AI platform draws from. Used as a reality check on Starlight's own data-source mix; sister pages in this section catalog what's available to us (free / paid / standardized).
Reading order
- Government / Public-Sector Sources: start here. NIH/NLM, CDC, FDA, HRSA/MCHB, AHRQ, CMS, state registries. Free or near-free, often canonical for pediatric care. ~80 sources catalogued with license terms, ingestion priority, and pediatric relevance per source.
- Vocabularies, Guidelines & Standards: code sets (ICD-10-CM, SNOMED CT, LOINC, RxNorm, NDC, FHIR R4, USCDI, CVX), pediatric-specific structured content (Bright Futures, ACIP, growth charts, USPSTF), drug-interaction sources, and validated screeners. Critical licensing flags (CPT, DSM, SNOMED-Intl).
- Paid Commercial References: UpToDate, DynaMed, Lexicomp, FDB MedKnowledge, etc. Includes the license re-use cheat sheet (most paid sources cannot be ingested into LLM context without an AI addendum, which is load-bearing for our architecture).
- OpenEvidence Benchmark: what the leading clinical-AI platform draws from, as a reality check on our own mix.
Key cross-cutting findings
These show up repeatedly across the four pages and are worth surfacing here:
- CPT codes (AMA-licensed) are the single biggest billing-side licensing risk. Recommendation: stay pure-cash-pay DPC for v1 and avoid CPT entirely; license from AMA only when an insurance-billing module is added.
- SNOMED CT US Edition is free for US use only. Any non-US clinician or patient triggers a paid Affiliate License obligation, which becomes material if Starlight ever expands internationally.
- Pediatric weight-based dosing must come from a paid commercial reference (Lexicomp Pediatric or Micromedex NeoFax) before any active prescribing module ships, because wrong-dose-by-weight is the #1 patient-safety risk in pediatrics. Rolling our own from public sources is not acceptable.
- DSM-5-TR is sidesteppable for now. ICD-10-CM F-codes plus public-domain screeners (PHQ-9, GAD-7, Vanderbilt, M-CHAT-R, SWYC, PSC) cover MVP behavioral-health needs without an APA license.
- Most paid clinical references prohibit ingestion into LLM context absent a separate AI addendum. Starlight's AI substrate must default to free/open sources (PubMed, FDA, CDC, AAP open guidelines, MedlinePlus). Paid sources should only feed clinician-direct UI, not RAG.
- Synthea (Apache 2.0) is the right answer for HIPAA-safe dev/QA seed data with broad coverage. Our synthetic-personas library is purpose-built for pediatric specifics; Synthea complements it as the broader community baseline.
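The LLM-ingestion finding above can be expressed as a default-deny check at the point where sources are routed to the AI-context store. The following is a minimal sketch, not our actual registry: the `License` enum, the `Source` type, and the per-source classifications are hypothetical and would need to be confirmed against each vendor's current terms.

```python
from dataclasses import dataclass
from enum import Enum


class License(Enum):
    """Hypothetical license classes for routing decisions."""
    OPEN = "open"                  # free/public, usable in AI context
    PAID_NO_AI = "paid_no_ai"      # paid, no AI addendum signed
    PAID_WITH_AI = "paid_with_ai"  # paid, AI addendum signed


@dataclass(frozen=True)
class Source:
    name: str
    license: License


def allowed_in_llm_context(source: Source) -> bool:
    """Default-deny: only open sources, or paid sources with an AI addendum."""
    return source.license in (License.OPEN, License.PAID_WITH_AI)


def split_by_surface(sources):
    """Partition source names into RAG-eligible and clinician-UI-only lists."""
    rag = [s.name for s in sources if allowed_in_llm_context(s)]
    ui_only = [s.name for s in sources if not allowed_in_llm_context(s)]
    return rag, ui_only


# Illustrative entries only; each classification is an assumption.
SOURCES = [
    Source("PubMed", License.OPEN),
    Source("MedlinePlus", License.OPEN),
    Source("UpToDate", License.PAID_NO_AI),
    Source("Lexicomp Pediatric", License.PAID_NO_AI),
]

rag_sources, ui_only_sources = split_by_surface(SOURCES)
print("RAG:", rag_sources)
print("Clinician UI only:", ui_only_sources)
```

Keeping the check default-deny means a newly added paid source stays out of RAG until someone records that an AI addendum exists.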
Verification status
Important: The cartography research that produced these pages was completed without live web access during the session. All URLs and pricing are flagged in-doc as "unverified during the research session." Before signing any contract or wiring an ingestion pipeline, the data team must:
- Confirm the canonical URL returns 200 and the expected content type.
- Snapshot the current ToS / license terms (these change).
- Confirm rate-limit + robots posture before crawling.
- Verify pricing directly with the vendor for "contact for pricing" sources.
Each of the three child pages opens with a verification disclaimer and lists open verification questions in its own §"Open Questions Requiring Verification."
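Three of the checks above (status and content type, robots posture, license snapshot) lend themselves to small helpers. The sketch below uses only the Python standard library and works on data that has already been fetched, so the fetching layer is left open; the function names and the `starlight-bot` user agent are hypothetical. Rate limits and vendor pricing still need manual confirmation.

```python
import hashlib
from urllib.robotparser import RobotFileParser


def response_ok(status: int, content_type: str, expected_type: str) -> bool:
    """True if the status is 200 and the media type matches the expected one."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    media_type = content_type.split(";")[0].strip().lower()
    return status == 200 and media_type == expected_type.lower()


def crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt text that was downloaded separately."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def license_fingerprint(terms_text: str) -> str:
    """SHA-256 of whitespace-normalized terms text, for change detection."""
    normalized = " ".join(terms_text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Example with made-up inputs.
robots = "User-agent: *\nDisallow: /private/\n"
print(response_ok(200, "application/json; charset=utf-8", "application/json"))
print(crawl_allowed(robots, "starlight-bot", "https://example.org/public/page"))
print(license_fingerprint("Terms of use v1"))
```

Storing the fingerprint next to the dated snapshot makes it cheap to detect that a vendor's terms changed between re-verifications.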
Phased ingestion (gates, not dates)
| Gate | Output |
|---|---|
| DS0 · MVP foundation | Free vocabularies wired (ICD-10-CM, RxNorm, NDC, CVX/MVX, LOINC). FHIR R4 server scaffolded. Bright Futures + ACIP schedules ingested as structured derivatives. CDC growth charts. |
| DS1 · Active prescribing | Lexicomp Pediatric (or Micromedex NeoFax) licensed. Drug-interaction database (FDB MedKnowledge or substitute) wired into the eRx flow. |
| DS2 · AI grounding | PubMed Central, MedlinePlus, openFDA, AAP open-access guidelines ingested into the AI-context vector store. Synthea integrated for HIPAA-safe dev seed data. |
| DS3 · Clinical-quality reporting | HCUP-KID (with DUA), NSCH benchmarks, AHRQ measure specs ingested for the Reports surface. |
| DS4 · Insurance-billing module | CPT licensed via the AMA Vendor Program. CDT licensed if dental rolls in. |
| DS5 · International expansion | SNOMED CT International Affiliate License if any clinician/patient is outside the US. |
| Continuous | License-snapshot re-verification on vendor-renewal and on regulatory change. |
The full per-source phasing detail is on each child page.
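One way to make "gates, not dates" concrete is to treat each gate as a checklist that is passed only when every item is complete. The sketch below is illustrative: the `Gate` type and the checklist item names are hypothetical, loosely derived from the DS1 and DS4 rows of the table, and are not an agreed schema.

```python
from dataclasses import dataclass, field


@dataclass
class Gate:
    gate_id: str
    label: str
    required: set = field(default_factory=set)  # checklist item names

    def missing(self, completed: set) -> set:
        """Checklist items still outstanding for this gate."""
        return self.required - completed

    def is_passed(self, completed: set) -> bool:
        """A gate is passed only when nothing is missing."""
        return not self.missing(completed)


# Hypothetical checklist items based on the table rows.
GATES = [
    Gate("DS1", "Active prescribing",
         {"pediatric_dosing_reference_licensed", "drug_interaction_db_wired"}),
    Gate("DS4", "Insurance-billing module", {"cpt_licensed"}),
]


def passed_gates(gates, completed):
    """Return the IDs of gates whose checklists are fully complete."""
    return [g.gate_id for g in gates if g.is_passed(completed)]


completed = {"pediatric_dosing_reference_licensed"}
print(passed_gates(GATES, completed))
print(GATES[0].missing(completed))
```

Because a gate is defined by its checklist rather than a calendar date, a module such as eRx can query the gate state directly before it is enabled.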