AI Literature Screening for Medical Devices Explained

What is AI-assisted literature screening for Class II medical devices?

AI Literature Screening for Medical Devices is the application of artificial intelligence to identify, review, classify, and prioritize scientific literature relevant to a medical device. This reduces the manual effort required for evidence generation and regulatory review workflows. For instance, AI analyzes titles, abstracts, and sometimes full-text publications to determine whether a study meets predefined inclusion and exclusion criteria.

In health economics and outcomes research (HEOR), systematic literature reviews (SLRs) form the backbone of regulatory submissions and evidence-based decision-making. However, traditional title-and-abstract screening remains one of the biggest bottlenecks: it is labor-intensive, time-consuming, and subject to reviewer fatigue and variability.

At ISPOR 2026 in Philadelphia, MadeAi presented promising interim results from an evaluation of its GenAI-enabled literature screening platform. The evaluation used a real-world Class II medical device dataset and the newly established Elevate-GenAI framework. The results highlight how AI can deliver both speed and reliability in regulated environments. Additional details on the methodology and findings are available through the official ISPOR conference program.

Why This Matters for HEOR Teams and Medical Device Sponsors

For HEOR teams and medical device sponsors, the stakes are high when generating regulatory-grade evidence. Missing a relevant study can undermine your submission. Over-including irrelevant ones wastes precious team hours during full-text review. AI-enabled tools promise efficiency. However, adoption in regulated settings demands more than raw accuracy metrics. This evaluation goes deeper.

Understanding the Elevate-GenAI Framework for Screening Class II Medical Devices

Developed by the ISPOR Working Group on Generative AI, the Elevate-GenAI framework offers a structured, multi-domain approach to evaluating large language models (LLMs) in HEOR. It moves beyond simple performance scores to assess accuracy, comprehensiveness, factuality, reproducibility, and operational readiness, including factors critical for transparency, auditability, and trust in regulated workflows.

MadeAi applied three key domains in this interim evaluation:

Accuracy (classification performance)
Comprehensiveness (evidence retention)
Factuality (reasoning alignment)

Dataset and Approach of AI Literature Screening for Medical Devices

To evaluate real-world performance, the research team tested MadeAi on a gold-standard dataset of 2,302 records focused on a Class II medical device. These records were curated and adjudicated by subject matter experts at the title-and-abstract screening stage, providing a robust benchmark.

Human-validated decisions served as the reference standard, allowing direct comparison with the platform’s outputs.
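A comparison against a human reference standard can be sketched as a simple tally of agreements and disagreements. The snippet below is a minimal illustration, not the evaluation's actual pipeline: the function names, the "include"/"exclude" labels, and the toy data are all assumptions made for the example.

```python
from collections import Counter


def confusion_counts(ai_decisions, human_decisions, positive="include"):
    """Tally TP/FP/FN/TN for one class, treating human decisions as the reference."""
    if len(ai_decisions) != len(human_decisions):
        raise ValueError("decision lists must be the same length")
    counts = Counter()
    for ai, human in zip(ai_decisions, human_decisions):
        if ai == positive and human == positive:
            counts["tp"] += 1  # both chose the positive class
        elif ai == positive:
            counts["fp"] += 1  # AI chose it, reviewers did not
        elif human == positive:
            counts["fn"] += 1  # reviewers chose it, AI did not
        else:
            counts["tn"] += 1  # neither chose it
    return counts


def precision_recall(counts):
    """Return (precision, recall); a zero denominator yields 0.0."""
    tp, fp, fn = counts["tp"], counts["fp"], counts["fn"]
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy illustrative labels, not taken from the 2,302-record dataset.
human = ["include", "exclude", "exclude", "include", "exclude"]
ai = ["include", "exclude", "include", "exclude", "exclude"]

counts = confusion_counts(ai, human)
precision, recall = precision_recall(counts)
```

Passing positive="exclude" to the same function gives the counts for the exclusion class, which is how separate inclusion and exclusion metrics can be derived from one set of decisions.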

Strong Performance Across Key Evaluation Metrics

The evaluation demonstrated strong and balanced performance across multiple evidence synthesis quality metrics, particularly in areas aligned with regulatory and HEOR expectations for accuracy, traceability, and consistency.

Inclusion Screening Performance

For study inclusion decisions, the platform achieved a recall of 83%, a precision of 80.7%, and an F1 score of 81.9%. These results indicate a strong ability to identify relevant studies while maintaining balanced screening precision across large evidence datasets.

Exclusion Classification Accuracy

In addition, performance was particularly strong in exclusion workflows, where both recall and precision reached 96%, with an F1 score of 95.9%. This demonstrates a high level of reliability in identifying irrelevant studies while minimizing the risk of incorrect exclusions.
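The F1 scores reported for inclusion and exclusion are the harmonic mean of precision and recall. As a quick sanity check, the sketch below recomputes F1 from the rounded percentages quoted in the text. The function name is illustrative, and because the inputs are already rounded, the recomputed values can differ from the reported ones by about a tenth of a percentage point.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall, both given as fractions in [0, 1]."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded figures as quoted in the text (not the underlying raw counts).
inclusion_f1 = f1_score(precision=0.807, recall=0.83)
exclusion_f1 = f1_score(precision=0.96, recall=0.96)

print(f"inclusion F1 = {inclusion_f1:.3f}")  # prints "inclusion F1 = 0.818"
print(f"exclusion F1 = {exclusion_f1:.3f}")  # prints "exclusion F1 = 0.960"
```

Both recomputed values sit within rounding distance of the reported 81.9% and 95.9%, which suggests the published figures are internally consistent.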

Conservative Evidence Retention Strategy

The evaluation also highlighted a conservative screening approach designed to prioritize evidence retention over aggressive filtering. This methodology aligns with evidence synthesis best practices and regulatory expectations, where preserving potentially relevant evidence is critical for reducing downstream review risk.
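The source does not describe how the platform implements this conservative behavior. One common way to favor retention, shown here purely as a hypothetical sketch, is to exclude a record only when the model's confidence in exclusion clears a high threshold and to keep everything else for human review. The function name, the probability input, and the 0.9 threshold are all assumptions.

```python
def screen_decision(exclude_probability, exclude_threshold=0.9):
    """Exclude only on high confidence; otherwise retain the record for review.

    Both the probability input and the default threshold are hypothetical;
    they are not taken from the evaluation described in the text.
    """
    if not 0.0 <= exclude_probability <= 1.0:
        raise ValueError("exclude_probability must be between 0 and 1")
    if exclude_probability >= exclude_threshold:
        return "exclude"
    return "include"  # uncertain records are retained rather than filtered out


confident = screen_decision(0.95)
uncertain = screen_decision(0.60)
```

Raising the threshold trades some inclusion precision for fewer incorrect exclusions, which is the trade-off a retention-first strategy accepts.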

Comprehensiveness and Coverage

Furthermore, the platform achieved 93% overall article relevance alignment against the human-reviewed reference standard. This level of comprehensiveness indicates strong performance in capturing relevant evidence during the early stages of literature screening, helping reduce the likelihood of evidence gaps in systematic reviews and HEOR workflows.

Factuality and Decision Transparency

Beyond binary classification accuracy, the evaluation assessed how closely AI-generated reasoning aligned with expert reviewer logic. Concordance between AI exclusion decisions and primary human exclusion rationales reached 84%, demonstrating substantial agreement between automated outputs and human judgment.
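Reasoning concordance of this kind can be sketched as the share of exclusions where the AI's stated reason matches the reviewer's primary reason. The exact denominator used in the evaluation is not stated, so the version below assumes it is the set of records that both the AI and the reviewers excluded. The record fields and reason codes are hypothetical.

```python
def reasoning_concordance(records):
    """Share of jointly excluded records where the AI reason matches the human primary reason.

    Returns None when no record was excluded by both the AI and the reviewers.
    """
    both_excluded = [
        r for r in records
        if r["ai_decision"] == "exclude" and r["human_decision"] == "exclude"
    ]
    if not both_excluded:
        return None
    matches = sum(
        1 for r in both_excluded if r["ai_reason"] == r["human_primary_reason"]
    )
    return matches / len(both_excluded)


# Toy illustrative records with made-up reason codes.
records = [
    {"ai_decision": "exclude", "human_decision": "exclude",
     "ai_reason": "wrong_population", "human_primary_reason": "wrong_population"},
    {"ai_decision": "exclude", "human_decision": "exclude",
     "ai_reason": "wrong_device", "human_primary_reason": "wrong_study_design"},
    {"ai_decision": "exclude", "human_decision": "exclude",
     "ai_reason": "wrong_device", "human_primary_reason": "wrong_device"},
    {"ai_decision": "include", "human_decision": "exclude",
     "ai_reason": None, "human_primary_reason": "wrong_device"},
]

concordance = reasoning_concordance(records)  # 2 of 3 joint exclusions match
```

Restricting the comparison to joint exclusions keeps the measure separate from classification accuracy, so it reflects agreement on why a study was excluded rather than whether it was.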

Auditability and Regulatory Readiness

The ability to align AI-generated decisions with expert rationale supports stronger auditability, transparent decision-making, and reproducibility. These characteristics are increasingly important for regulatory-facing evidence synthesis workflows, where explainability and traceability are essential for quality assurance and compliance.

Tangible Benefits and Business Value – AI Literature Screening for Medical Devices

From an operational standpoint, the interim evaluation demonstrates how MadeAi can significantly reduce the time and effort required for literature screening while maintaining strong accuracy and evidence retention. By consistently identifying relevant studies and minimizing premature exclusions, the platform helps teams accelerate evidence generation without compromising scientific rigor, a critical advantage in regulated HEOR and medical device workflows.

For clients and sponsors, this delivers the following results:

Faster screening without compromising quality
Reduced reviewer burden and variability
Stronger auditability and regulatory confidence
Scalability for the large evidence bases common in medical device and pharmaceutical submissions

Performance Indicators

Domain	Metric	Result	Interpretation
Accuracy (Inclusion)	Recall / Precision / F1	83% / 80.7% / 81.9%	Strong sensitivity, balanced precision
Accuracy (Exclusion)	Recall / Precision / F1	96% / 96% / 95.9%	Highly consistent non-relevant identification
Comprehensiveness	Overall Article Relevance	93%	Excellent evidence coverage
Factuality	Reasoning Concordance	84%	Strong alignment with human logic

Connecting Research to Real Client Impact

In regulated HEOR, speed alone isn’t enough. Teams need tools that enhance consistency, reduce fatigue, and provide explainable outputs. MadeAi’s performance on a Class II medical device dataset demonstrates practical readiness for high-stakes projects. Clients can expect accelerated timelines, more predictable resource allocation, and evidence packages that stand up to scrutiny.

This evaluation is an important step toward broader, responsible adoption of GenAI in evidence synthesis. It shows how structured frameworks like Elevate-GenAI can bridge innovation and compliance.

Conclusion

The interim evaluation presented at ISPOR 2026 demonstrates the growing potential of AI literature screening for medical devices within regulated evidence-generation workflows. Across a human-verified dataset of 2,302 records, MadeAi delivered strong performance in inclusion screening, exclusion classification, evidence retention, and reasoning alignment.

While human oversight remains essential, these findings suggest that AI can significantly reduce screening effort while preserving transparency, auditability, and scientific rigor. As regulatory expectations evolve, structured evaluation frameworks such as Elevate-GenAI will play an important role in ensuring that AI-enabled evidence synthesis solutions remain both effective and trustworthy.

Author’s Note: This article was supported by AI-based research and writing, with Claude 4.5 assisting in the creation of text and images.

ISPOR 2026: Evaluating MadeAi for Literature Screening of a Class II Medical Device

Viji Queen