What is AI-assisted literature screening for Class II medical devices?
AI-assisted literature screening for medical devices is the application of artificial intelligence to identify, review, classify, and prioritize scientific literature relevant to a medical device, reducing the manual effort required for evidence generation and regulatory review workflows. In practice, the AI analyzes titles, abstracts, and sometimes full-text publications to determine whether a study meets predefined inclusion and exclusion criteria.
In the fast-paced world of health economics and outcomes research (HEOR), systematic literature reviews (SLRs) form the backbone of regulatory submissions and evidence-based decision-making. However, traditional title-and-abstract screening remains one of the biggest bottlenecks—labor-intensive, time-consuming, and subject to reviewer fatigue and variability.
At ISPOR 2026 in Philadelphia, MadeAi showcased promising interim results from an evaluation of its GenAI-enabled literature screening platform. The evaluation used a real-world Class II medical device dataset and the newly established Elevate-GenAI framework. The results highlight how AI can deliver both speed and reliability in regulated environments. Additional details on the methodology and findings are available through the official ISPOR conference program.
Why This Matters for HEOR Teams and Medical Device Sponsors
For HEOR teams and medical device sponsors, the stakes are high when generating regulatory-grade evidence. Missing a relevant study can undermine your submission. Over-including irrelevant ones wastes precious team hours during full-text review. AI-enabled tools promise efficiency. However, adoption in regulated settings demands more than raw accuracy metrics. This evaluation goes deeper.
Understanding the Elevate-GenAI Framework for Screening Class II Medical Device Literature
Developed by the ISPOR Working Group on Generative AI, the Elevate-GenAI framework offers a structured, multi-domain approach to evaluating large language models (LLMs) in HEOR. It moves beyond simple performance scores to assess accuracy, comprehensiveness, factuality, reproducibility, and operational readiness, including factors critical for transparency, auditability, and trust in regulated workflows.
MadeAi applied three key domains in this interim evaluation:
- Accuracy (classification performance)
- Comprehensiveness (evidence retention)
- Factuality (reasoning alignment)
Dataset and Approach of AI Literature Screening for Medical Devices
To evaluate real-world performance, the research team tested MadeAi on a golden dataset of 2,302 records focused on a Class II medical device. Subject matter experts curated and adjudicated these records at the title-and-abstract screening stage, providing a robust benchmark.
Human-validated decisions served as the reference standard, allowing direct comparison with the platform’s outputs.
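To make the comparison against a human reference standard concrete, the following is a minimal sketch of how recall, precision, and F1 can be computed for one decision class. The function name `screening_metrics` and the toy labels are illustrative assumptions; they are not MadeAi's code or records from the evaluation dataset.

```python
from collections import Counter

def screening_metrics(reference, predicted, positive):
    """Recall, precision, and F1 for one decision class against the human reference."""
    pairs = Counter(zip(reference, predicted))
    # True positives: both the human reviewer and the model chose the class.
    tp = sum(n for (ref, pred), n in pairs.items() if ref == positive and pred == positive)
    # False negatives: the human chose the class, the model did not.
    fn = sum(n for (ref, pred), n in pairs.items() if ref == positive and pred != positive)
    # False positives: the model chose the class, the human did not.
    fp = sum(n for (ref, pred), n in pairs.items() if ref != positive and pred == positive)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}

# Toy labels for illustration only (hypothetical, not evaluation data).
human = ["include", "exclude", "exclude", "include", "exclude", "exclude"]
model = ["include", "exclude", "include", "exclude", "exclude", "exclude"]

include_scores = screening_metrics(human, model, positive="include")
exclude_scores = screening_metrics(human, model, positive="exclude")
print(include_scores)
print(exclude_scores)
```

Scoring inclusion and exclusion as separate classes matters because most screened records are usually excluded, so a single pooled accuracy figure can hide weak inclusion performance.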
Strong Screening Performance Across Key Evaluation Metrics
The evaluation demonstrated strong and balanced performance across multiple evidence synthesis quality metrics, particularly in areas aligned with regulatory and HEOR expectations for accuracy, traceability, and consistency.
Inclusion Screening Performance
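For study inclusion decisions, the platform achieved a recall of 83%, a precision of 80.7%, and an F1 score of 81.9%. In other words, it identified about 83 of every 100 studies that the expert reviewers included, and about 81 of every 100 studies it flagged for inclusion matched an expert inclusion decision. These results indicate a strong ability to identify relevant studies while maintaining balanced screening precision.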
Exclusion Classification Accuracy
In addition, performance was particularly strong in exclusion workflows, where both recall and precision reached 96%, with an F1 score of 95.9%. This demonstrates a high level of reliability in identifying irrelevant studies while minimizing the risk of incorrect exclusions.
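F1 is the harmonic mean of precision and recall, so the reported figures can be sanity-checked against each other. The sketch below assumes the published percentages are rounded (recall to the nearest whole point, inclusion precision to the nearest 0.1 point); that rounding is an assumption, not something the evaluation states, and the helper names are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def f1_bounds(precision, recall, precision_half_step, recall_half_step):
    """Lowest and highest F1 compatible with rounded precision and recall.

    F1 increases with both inputs, so the extremes sit at the corners.
    """
    low = f1_score(precision - precision_half_step, recall - recall_half_step)
    high = f1_score(precision + precision_half_step, recall + recall_half_step)
    return low, high

# Inclusion: precision 80.7% (assumed rounded to 0.1 point), recall 83% (assumed rounded to 1 point).
inclusion_point = f1_score(0.807, 0.83)
inclusion_low, inclusion_high = f1_bounds(0.807, 0.83, 0.0005, 0.005)

# Exclusion: precision and recall both 96% (assumed rounded to 1 point).
exclusion_low, exclusion_high = f1_bounds(0.96, 0.96, 0.005, 0.005)

print(f"Inclusion F1 from rounded inputs: {inclusion_point:.3f}")
print(f"Inclusion F1 range: {inclusion_low:.3f} to {inclusion_high:.3f}")
print(f"Exclusion F1 range: {exclusion_low:.3f} to {exclusion_high:.3f}")
```

The rounded inclusion inputs give roughly 81.8%, and the reported 81.9% and 95.9% both fall inside the computed ranges, so the small gaps are consistent with rounding of the published figures.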
Conservative Evidence Retention Strategy
The evaluation also highlighted a conservative screening approach designed to prioritize evidence retention over aggressive filtering. This methodology aligns with evidence synthesis best practices and regulatory expectations, where preserving potentially relevant evidence is critical for reducing downstream review risk.
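One common way to express "retention over aggressive filtering" is an asymmetric decision rule: exclude a record only when the model is highly confident, and keep everything else for human review. The sketch below illustrates that general idea only. The source does not describe how MadeAi implements its conservative approach, and the `exclude_confidence` score, the `ScreeningResult` type, and the 0.9 threshold are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ScreeningResult:
    record_id: str
    exclude_confidence: float  # hypothetical model score in [0, 1]

def conservative_decision(result, exclude_threshold=0.9):
    """Exclude only on high confidence; otherwise retain the record for review."""
    if result.exclude_confidence >= exclude_threshold:
        return "exclude"
    return "retain"

# Toy records for illustration only (hypothetical).
results = [
    ScreeningResult("rec-001", 0.97),
    ScreeningResult("rec-002", 0.62),
    ScreeningResult("rec-003", 0.90),
]
decisions = {r.record_id: conservative_decision(r) for r in results}
print(decisions)
```

Raising the threshold retains more borderline records, which lowers the risk of premature exclusion at the cost of more full-text review work.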
Comprehensiveness and Coverage
Furthermore, the platform achieved 93% overall article relevance alignment against the human-reviewed reference standard. This level of comprehensiveness indicates strong performance in capturing relevant evidence during the early stages of literature screening, helping reduce the likelihood of evidence gaps in systematic reviews and HEOR workflows.
Factuality and Decision Transparency
Beyond binary classification accuracy, the evaluation assessed how closely AI-generated reasoning aligned with expert reviewer logic. Concordance between AI exclusion decisions and primary human exclusion rationales reached 84%, demonstrating substantial agreement between automated outputs and human judgment.
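A concordance figure of this kind can be sketched as the share of AI exclusions whose stated reason matches the primary human exclusion reason. The field names, reason codes, and choice of denominator below are assumptions made for illustration; the evaluation's exact definition is not given in this article.

```python
def reasoning_concordance(records):
    """Share of AI exclusions whose reason matches the primary human exclusion reason."""
    # Only compare records the AI excluded and for which a human exclusion reason exists.
    compared = [
        r for r in records
        if r["ai_decision"] == "exclude" and r["human_reason"] is not None
    ]
    if not compared:
        return 0.0
    matches = sum(1 for r in compared if r["ai_reason"] == r["human_reason"])
    return matches / len(compared)

# Toy records with hypothetical reason codes (not evaluation data).
records = [
    {"ai_decision": "exclude", "ai_reason": "wrong_population", "human_reason": "wrong_population"},
    {"ai_decision": "exclude", "ai_reason": "wrong_device", "human_reason": "wrong_device"},
    {"ai_decision": "exclude", "ai_reason": "wrong_study_design", "human_reason": "wrong_outcome"},
    {"ai_decision": "exclude", "ai_reason": "wrong_outcome", "human_reason": "wrong_outcome"},
    {"ai_decision": "include", "ai_reason": None, "human_reason": None},
]
concordance = reasoning_concordance(records)
print(f"Reasoning concordance: {concordance:.0%}")
```

Comparing reasons rather than decisions alone is what separates a factuality check from a plain accuracy check: two reviewers can both exclude a study for different reasons.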
Auditability and Regulatory Readiness
The ability to align AI-generated decisions with expert rationale supports stronger auditability, transparent decision-making, and reproducibility. These characteristics are increasingly important for regulatory-facing evidence synthesis workflows, where explainability and traceability are essential for quality assurance and compliance.
Tangible Benefits and Business Value – AI Literature Screening for Medical Devices
From an operational standpoint, the interim evaluation demonstrates how MadeAi can significantly reduce the time and effort required for literature screening while maintaining strong accuracy and evidence retention. By consistently identifying relevant studies and minimizing premature exclusions, the platform helps teams accelerate evidence generation without compromising scientific rigor, a critical advantage in regulated HEOR and medical device workflows.
For clients and sponsors, this delivers the following results:
- Faster screening without compromising quality
- Reduced reviewer burden and variability
- Stronger auditability and regulatory confidence
- Scalability for the large evidence bases common in medical device and pharmaceutical submissions
Performance Indicators
| Domain | Metric | Result | Interpretation |
|---|---|---|---|
| Accuracy (Inclusion) | Recall / Precision / F1 | 83% / 80.7% / 81.9% | Strong sensitivity, balanced precision |
| Accuracy (Exclusion) | Recall / Precision / F1 | 96% / 96% / 95.9% | Highly consistent non-relevant identification |
| Comprehensiveness | Overall Article Relevance | 93% | Excellent evidence coverage |
| Factuality | Reasoning Concordance | 84% | Strong alignment with human logic |
Connecting Research to Real Client Impact
In regulated HEOR, speed alone isn’t enough. Teams need tools that enhance consistency, reduce fatigue, and provide explainable outputs. MadeAi’s performance on a Class II medical device dataset demonstrates practical readiness for high-stakes projects. Clients can expect accelerated timelines, more predictable resource allocation, and evidence packages that stand up to scrutiny.
This evaluation is an important step toward broader, responsible adoption of GenAI in evidence synthesis. It shows how structured frameworks like Elevate-GenAI can bridge innovation and compliance.
Conclusion
The interim evaluation presented at ISPOR 2026 demonstrates the growing potential of AI literature screening for medical devices within regulated evidence-generation workflows. Across a human-verified dataset of 2,302 records, MadeAi delivered strong performance in inclusion screening, exclusion classification, evidence retention, and reasoning alignment.
While human oversight remains essential, these findings suggest that AI can significantly reduce screening effort while preserving transparency, auditability, and scientific rigor. As regulatory expectations evolve, structured evaluation frameworks such as Elevate-GenAI will play an important role in ensuring that AI-enabled evidence synthesis solutions remain both effective and trustworthy.
Author’s Note: This article was supported by AI-based research and writing, with Claude 4.5 assisting in the creation of text and images.
FAQs
What is the Elevate-GenAI framework?
The Elevate-GenAI framework is a structured reporting guideline developed by the ISPOR Working Group on Generative AI to evaluate generative AI tools in HEOR. It emphasizes multiple domains beyond basic accuracy, including comprehensiveness and factuality.
How accurate is AI for literature screening in medical devices?
In this evaluation, MadeAi achieved an F1 score of 81.9% for inclusions and 95.9% for exclusions on a 2,302-record Class II medical device dataset, demonstrating reliable performance at the title-and-abstract screening stage.
Can GenAI replace human reviewers in systematic literature reviews?
No. The best results come from human-AI collaboration. AI excels at initial screening and consistency, while humans provide final adjudication and nuanced judgment.
What does 93% comprehensiveness mean for my SLR project?
It indicates 93% overall alignment between the platform's article relevance decisions and the human-reviewed reference standard, which helps reduce the chance of missing critical studies early in the process.
Why is reasoning concordance or factuality important?
An 84% match with human rationales enhances transparency and auditability, which are key requirements for regulatory submissions and quality assurance.
Is MadeAi suitable for pharmaceutical and medical device evidence generation?
Yes. This interim evaluation on a Class II medical device dataset, combined with the Elevate-GenAI framework, supports its use in regulated HEOR workflows.
How can I learn more or pilot MadeAi for my next review?
Visit AI Platform for Life Sciences to explore the platform, request a demo, or discuss how it can streamline your specific evidence synthesis needs.