Even with genomic sequencing, many people with rare diseases never receive a clear genetic diagnosis. Roughly half remain undiagnosed after extensive testing and specialist review. Their medical data may contain clues but finding them can require sifting through thousands to millions of possible genetic variants, fragmented clinical records, and rapidly changing scientific literature.
As new gene-disease relationships, case reports, and classification evidence accumulate, unsolved cases can become newly interpretable.
Researchers from Boston Children’s Hospital’s Manton Center for Orphan Disease Research, Harvard University, and OpenAI used the OpenAI o3 Deep Research reasoning model to analyze de-identified clinical and genomic information from 376 previously analyzed cases that remained unsolved. The model surfaced evidence-linked candidate explanations for researchers and clinicians to review. Following expert review, additional testing, and clinical confirmation, physicians established diagnoses in 18 cases—an additional diagnostic yield of 4.8% after earlier analysis by specialists. This study was published on June 18, 2026, in NEJM AI and shows how an AI-assisted research workflow can help experts generate leads when revisiting some of the most difficult cases.
Many of these cases had evaded years of expert analysis. In this study, OpenAI o3 Deep Research helped researchers identify leads that were later assessed through established clinical processes, suggesting that expert-led periodic reanalysis could become more scalable as knowledge evolves. The model did not diagnose any patient or make any clinical decision. It produced evidence-linked hypotheses for specialists to review and, where appropriate, investigate through additional testing and confirm in a clinical laboratory.
An inconclusive genetic test is not always a permanent finding. A patient’s phenotype descriptions, test results, and family history can be split across databases that use different identifiers, formats, and vocabularies. Linking those records is difficult, so even specialists can miss a diagnosis. Experts may also sequence a child’s genome before a relevant gene or its variants have been linked to disease. As scientific knowledge advances, the same data can reveal answers that were previously impossible to uncover.
Rare-disease reanalysis is both a scientific and a maintenance problem. The patient’s genome may stay the same, but the evidence around it keeps changing: researchers link new genes and variants to disease, labs reclassify old variants, and case databases and papers accumulate new observations. Each update can make an old inconclusive case worth revisiting, so many institutions inherit a growing backlog of genomes to keep in sync with a moving knowledge base.
In this study, researchers designed the workflow so that the model acted as an explanation-first reasoning layer on top of existing genomic pipelines. Instead of returning only a ranked gene, it was asked to connect the clinical features, inheritance pattern, variant evidence, and scientific literature into a justification that a human reviewer could interrogate.
For each case, the team assembled a de-identified packet containing standardized Human Phenotype Ontology terms to describe the patient’s clinical presentation, occasional clinician notes and any descriptive clinical diagnosis, metadata such as age and gender, and a filtered variant table. The table captured each variant’s rarity, its predicted effect on the encoded protein, ClinVar classification, and signal quality across available family members. Most cases included data from the child and both biological parents.
The team asked the model to propose the most plausible molecular explanation and to show its work. Researchers then reviewed the outputs using the same ACMG/AMP framework that clinical labs use to classify genetic variants. At least two team members reviewed each candidate, disagreements were resolved by consensus, and a model output was never treated as a diagnosis. A finding counted as a diagnosis only after qualified experts reviewed the evidence, the variant was classified as pathogenic or likely pathogenic, a CLIA-certified laboratory confirmed it, and the clinical team returned the result to the family.
Before analyzing unsolved cases, the team refined the workflow on cases with established diagnoses. It recovered the correct gene and variant in duplicate runs for 48 of 51 cases that included a variety of rare conditions. In a set of 57 neuromuscular cases, the workflow returned the correct diagnosis in duplicate runs for 45 of the cases. In a 15-case long-read genome set, it named the correct gene in every case and both disease-causing alleles in 12 cases. These evaluations aided in prompt development and showed where expert review remained essential.
The model’s self-reported confidence scores tracked with correct diagnoses in these previously solved cases: the mean minimum score was 85.6 for consistently correct calls and 42.1 for incorrect or unknown calls. The scores were not calibrated probabilities, and the team did not use them as a substitute for evidence or clinical adjudication. But they were helpful in guiding the expert reviewers to focus on the most promising candidate diagnoses.
The team then applied the workflow to four groups of previously unsolved cases: children with neurodevelopmental conditions, people with rare neuromuscular disease, children and adolescents with early psychosis, and cases of sudden unexpected death in pediatrics. These were not fresh cases awaiting a first review. Many had already been examined by multiple commercial or institutional pipelines and discussed by multidisciplinary teams.
Cohort
Cases
Diagnoses surfaced
Yield
Neurodevelopmental
100
10
10.0%
Neuromuscular disease
61
4
6.6%
Sudden unexpected death in pediatrics
200
2
1.0%
Early psychosis
15
2
13.3%
Total
376
18
4.8%
The early psychosis cohort was small, so its percentage has a wide confidence interval. Yield also reflects how likely each cohort was to have a single-gene explanation.
After the model surfaced candidates and experts completed review and clinical confirmation, physicians established diagnoses in 4.8% of the cases. That rate is modest but meaningful in this population because previous expert reviews had not resolved the cases. Similar reanalysis studies report single-digit gains in heavily reviewed cases; higher yields usually come from studies containing new cases or well-known disorders awaiting genetic confirmation.
Of the 18 diagnoses, 7 were rediscoveries: diagnoses established outside the local research workflow but absent from the record the team reviewed. In several cases, the variants were already listed as pathogenic or likely pathogenic in public databases, highlighting the operational challenge of synthesizing information across data sources.
In one early-psychosis case, the model inferred a structural event in the genome that was not listed in the input data. It connected a run of low-quality calls on chromosome 22 with the child’s cardiac, immune, neurodevelopmental, and psychiatric features, then hypothesized a 22q11.2 deletion associated with DiGeorge syndrome. This hypothesized variant was confirmed with follow-up genome sequencing.
Although the prompt asked for one monogenic cause, the model sometimes surfaced two genes that better explained a complex presentation. Variants in LAMA2 and FOXP1 together helped account for muscle and neurodevelopmental features in one case; another had a previously unrecognized digenic explanation involving TTN and SRPK3.
In addition to diagnoses, the model also identified a possible novel mechanistic explanation for a condition called vitiligo. In one neurodevelopmental case, the model highlighted an 11-amino-acid deletion in S1PR1 in a person with vitiligo. S1PR1 encodes a cell-surface receptor involved in signaling, immune-cell movement, and tissue biology. The model integrated evidence suggesting that the deletion could alter receptor structure and signaling in ways that reduce pigment production while also helping immune cells persist in the skin.
The proposed S1PR1-vitiligo relationship requires additional experimental validation but it illustrates a powerful role for AI in translating scattered findings from structural biology, immunology, and clinical genetics into concrete, testable hypotheses.
The team also saw possible phenotype expansion in the neuromuscular cohort. Damaging variants in HSPB8 and CDK13 did not perfectly match the genes’ best-known disorders, suggesting a broader clinical spectrum that more cases and laboratory work will need to test.
This study shows that a general-purpose reasoning model can contribute to retrospective genomic reanalysis by combining phenotype, inheritance, variant annotations, data-quality patterns, and scientific literature into reviewable hypotheses. It also shows why periodic reanalysis matters: some answers surface only after knowledge advances or fragmented records are brought together.
This research is not evidence that patients, clinicians, or customers should use OpenAI models to diagnose disease or make medical decisions. It does not describe or endorse an intended customer use of OpenAI o3 Deep Research, ChatGPT, or any other OpenAI product for diagnosis. The model did not diagnose any participant; physicians and other qualified clinical experts made every diagnosis through established review, testing, and clinical-confirmation processes.
The study was retrospective, the cohorts were heterogeneous, and reviewers were not blinded to model confidence. The researchers did not measure time saved, cost, clinician effort, false-positive workload, or changes in care. Nor did they systematically evaluate other forms of genetic variation such as structural variants, repeat expansions, deep-intronic changes, or mosaicism.
Large language models can misread context or produce plausible explanations that fail upon closer inspection. Therefore, every result passed through human adjudication and clinical confirmation. The model widened the search and focused the subsequent human-led analysis; it did not decide what information or diagnosis should be returned to a family.
This study used de-identified information, with no protected health information utilized or transmitted outside approved environments. Broader clinical deployment will require the same attention to privacy, security, auditability, and local regulation that applies to all medical care. Model access does not replace sequencing infrastructure, genetic counseling, confirmatory testing, or specialist judgment.
Alan Beggs, director of the Manton Center for Orphan Disease Research
Prospective, multi-center studies should compare LLM-assisted reanalysis with standard practice on diagnostic yield, time to a candidate, clinician effort, false-positive burden, cost, and effects on care. Versioned prompts, reference checks, audit logs, and calibrated uncertainty will be important for reproducibility and safety. Such studies would still require qualified clinicians to evaluate evidence, order appropriate tests, and make any diagnosis or treatment decision.
This study used OpenAI o3 Deep Research. Newer general-purpose models can search and synthesize more scientific material, while purpose-built systems such as GPT‑Rosalind are designed for deeper life-sciences work, including variant effects on protein structure and function. Those capabilities were not tested here and will require their own evaluations and access controls.
While OpenAI helped support this initial research study, the Manton Center will lead the next stage of the work through a grant from the OpenAI Foundation. The grant will support the Center's broader effort to develop a platform-agnostic, low-cost genetics AI copilot that helps clinical teams analyze rare disease cases more quickly and consistently.
The longer-term research opportunity is to explore whether expert-led AI-assisted reanalysis can help scientific understanding keep pace with discovery. The promise is not that AI replaces a doctor’s diagnosis, but that carefully evaluated research tools may help specialists identify evidence worth investigating. For thousands of families, today’s unanswered questions do not have to remain unanswered forever.