Forecasting target markets using epidemiology data: how Clarivate does it
Developing a deep understanding of a target market is a crucial first step when deciding whether to invest in drug research or commercialization. Clarivate is committed to providing precise forecasts of incidence, prevalence, and drug-treatable populations across global populations. Our methodology gives clients the insights into the evolving disease landscape that they need to:
- Size market potential
- Profile patient segments
- Make confident business decisions based on accurate forecast models.
All our epidemiological forecast models are developed under the supervision of expert epidemiologists from various backgrounds, including clinical medicine, public health medicine, and biostatistics. Backed by these experts, our robust patient-level data details disease trends over 10- to 20-year outlook periods to help you validate investments and identify growth opportunities in your markets of interest (Fig.1).
Fig.1. Clarivate Epidemiology – what we do:

Rigorous literature review, data selection and analysis
At the core of Clarivate’s methodology is an exhaustive literature review process. Our epidemiologists adopt a systematic approach to identify and analyze data from peer-reviewed journals, registries, hospital discharge datasets, national health surveys, insurance claims (medical and prescription), electronic health records, grey literature and Clarivate’s extensive data sources library. To enhance precision and efficiency, we leverage state-of-the-art tools, including AI/ML models and curated search strategies.
Our primary objective as epidemiologists is to understand and describe the epidemiology of a disease within a specific geography or region. This includes analyzing aspects such as incidence, prevalence, mortality, severity, hospitalizations, disease events, disease stage, survival, progression, recurrence, symptoms, comorbidities, risk factors, diagnostic criteria, natural disease history, treatments, treatment prognosis, new drug launches, and changes in disease detection methods, disease classification and public health policies.
The Clarivate epidemiology literature review process is built on a rigorous framework that ensures only high-quality data sources feed our disease incidence and prevalence estimates. We search multiple databases, including PubMed and Web of Science, to gather comprehensive data from peer-reviewed literature and conference abstracts (Fig.2).
Fig.2. Data sources reviewed by epidemiologists:

This process is further enriched through consultations with subject matter experts and therapy specialists within Clarivate. By evaluating published literature, online registries, and surveys, we identify the most representative country-specific epidemiological data, applying standardized inclusion and exclusion criteria across our epidemiology team. Inclusion criteria for data selection include:
- Representative, population-based studies
- Recent studies, conducted within the past three to four years, all else being equal
- Adequate sample sizes to ensure statistical validity
- Detailed methodologies, including age- and gender-specific data.
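The inclusion criteria above can be read as a simple screening filter. The sketch below is illustrative only: the `Study` fields, the four-year recency window, and the minimum sample size of 1,000 are assumptions made for the example, not Clarivate's actual thresholds.

```python
from dataclasses import dataclass


@dataclass
class Study:
    """Minimal metadata for one candidate source (all fields are hypothetical)."""
    title: str
    year: int
    sample_size: int
    population_based: bool
    age_sex_specific: bool


def meets_inclusion_criteria(study, current_year, max_age_years=4, min_sample_size=1000):
    """Return True when a study passes all four screening checks."""
    return (
        study.population_based
        and current_year - study.year <= max_age_years
        and study.sample_size >= min_sample_size
        and study.age_sex_specific
    )


# Made-up candidates, for illustration only.
candidates = [
    Study("National health survey", 2023, 25000, True, True),
    Study("Single-clinic case series", 2024, 180, False, False),
    Study("Older regional registry", 2015, 90000, True, True),
]

selected = [s.title for s in candidates if meets_inclusion_criteria(s, current_year=2025)]
print(selected)
```

In practice a filter like this would only produce a first-pass shortlist; an older but methodologically stronger study could still be preferred on expert review.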
Clarivate’s epidemiological research tackles a wide range of critical questions tailored to specific disease types. For chronic diseases, we explore prevalence, incidence, risk factors, survival rates, treatment outcomes, and common comorbidities. For infectious diseases, we focus on incidence, risk factors, hospitalization rates, and diagnostic and preventive measures. In oncology, our studies delve into disease incidence, risk factors, staging, progression, recurrence, survival rates, treatment efficacy, and limited-duration prevalence. By addressing these key questions, we provide actionable insights into disease incidence, prevalence, events, drug-treatable populations, burden, prognosis, and treatment dynamics, helping clients make informed decisions.
Our epidemiologists critically appraise these peer-reviewed studies and other data sources to maintain reliability, avoiding the inclusion of low-quality studies that might compromise results. When evaluating country-specific estimates, we account for variations in diagnostic practices, lifestyle, and genetics across regions. While recent studies are often preferred, study quality and methodology remain paramount. Our team avoids extrapolating historical data trends without considering factors like public health interventions, changes in exposure to protective or risk factors, improvements in survival, improvements in disease therapy, risk of disease by gender and age group, and demographic changes. Truncated estimates, such as those limited to specific age groups, are carefully adjusted to provide a comprehensive understanding of disease risk across populations.
Following the literature review, highly trained and experienced epidemiologists perform analyses using validated processes and, where epidemiology data are scarce, proprietary models. These include incidence-to-prevalence, prevalence-to-incidence, survival models, and extrapolation methods. The analyses also account for risk factors, population changes, and cohort effects.
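One textbook form of an incidence-to-prevalence calculation sums the incident cohorts of recent years, each weighted by the share of patients still alive. The sketch below shows that idea only. It is not Clarivate's proprietary model, the function name and inputs are hypothetical, and the numbers are made up.

```python
def limited_duration_prevalence(incident_cases, survival):
    """Estimate prevalent cases at the end of the latest year.

    incident_cases: new cases per calendar year, oldest first.
    survival: survival[k] is the fraction of a cohort still alive
              k years after diagnosis; its length sets the look-back window.
    """
    # Pair the newest cohort with survival[0], the one before with survival[1], ...
    # zip() stops at the shorter input, so older cohorts outside the window drop out.
    return sum(cases * s for cases, s in zip(reversed(incident_cases), survival))


incident = [800, 900, 1000, 1100, 1200]   # illustrative yearly incident cases
surv = [1.0, 0.5, 0.25]                   # simplification: newest cohort counted in full
prevalent = limited_duration_prevalence(incident, surv)
print(prevalent)  # 1200*1.0 + 1100*0.5 + 1000*0.25 = 2000.0
```

A fuller model would typically stratify by age and sex and adjust survival for the timing of diagnosis within the year.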
We benchmark and validate Clarivate Epidemiology forecast estimates by comparing them with publicly available sources, and by comparing prevalent, diagnosed, and drug-treated cases with published sales or other treatment data. As a last step in the process, the Clarivate expert epidemiology team provides a thorough forecast assessment, including the preliminary forecast model, supporting evidence and rationale for the choice of data sources, and model assumptions.
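The comparison against published sources can be illustrated with a simple percent-difference check. This is a minimal sketch: the 20% tolerance, the function names, and the figures are assumptions chosen for the example, not values from Clarivate's validation process.

```python
def percent_difference(estimate, benchmark):
    """Signed difference of the estimate relative to the benchmark, in percent."""
    if benchmark == 0:
        raise ValueError("benchmark must be non-zero")
    return (estimate - benchmark) / benchmark * 100.0


def flag_outliers(estimates, benchmarks, tolerance_pct=20.0):
    """Return the keys whose estimate deviates from its benchmark by more than the tolerance."""
    return sorted(
        key
        for key in estimates.keys() & benchmarks.keys()  # only compare shared keys
        if abs(percent_difference(estimates[key], benchmarks[key])) > tolerance_pct
    )


# Made-up case counts, for illustration only.
forecast = {"Country A": 120_000, "Country B": 45_000, "Country C": 8_000}
published = {"Country A": 110_000, "Country B": 30_000}

flagged = flag_outliers(forecast, published)
print(flagged)
```

A flagged estimate is not necessarily wrong. It marks where the choice of data source or a model assumption deserves a closer look.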
The metrics reported (depending on the analysis) are incidence, prevalence, and proportions (sub-populations and drug-treatable populations). We stratify patient populations by diagnosed and drug-treated status, as well as by relevant clinical variables such as stage of disease at diagnosis and severity. Estimates are presented as both rates and cases, available at age-, gender- and country-specific levels by population for all countries analyzed. The reports also include detailed methods chapters for analyses and population estimates, glossaries of epidemiological terms, and interactive and downloadable graphs and tables.
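Rates and cases are linked through the population of each stratum: cases equal the rate multiplied by the population. The sketch below works through that arithmetic for age- and gender-specific strata. The strata, rates, and population figures are made up for illustration, and the function names are hypothetical.

```python
def cases_from_rates(rates_per_100k, population):
    """Multiply each stratum's rate (per 100,000) by that stratum's population."""
    return {
        stratum: rates_per_100k[stratum] * population[stratum] / 100_000
        for stratum in rates_per_100k
    }


def crude_rate_per_100k(cases, population):
    """Collapse stratum-level cases back into one overall rate per 100,000."""
    total_population = sum(population[stratum] for stratum in cases)
    return sum(cases.values()) / total_population * 100_000


# Keys are (age group, gender); all values are illustrative.
rates = {("40-59", "F"): 50.0, ("40-59", "M"): 80.0, ("60+", "F"): 200.0, ("60+", "M"): 300.0}
pop = {("40-59", "F"): 2_000_000, ("40-59", "M"): 2_000_000,
       ("60+", "F"): 1_000_000, ("60+", "M"): 1_000_000}

cases = cases_from_rates(rates, pop)
total_cases = sum(cases.values())
overall_rate = crude_rate_per_100k(cases, pop)
print(total_cases, round(overall_rate, 1))
```

Because the overall rate depends on the age mix, applying the same stratum-specific rates to an older population yields a higher crude rate. This is one reason demographic change matters in long-range forecasts.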
This systematic approach guarantees that our clients receive the most reliable and actionable epidemiological insights.
Leveraging technology for proactive monitoring
Clarivate has been incorporating emerging technology to conduct annual literature reviews for all diseases. This proactive approach ensures updates when novel studies indicate significant changes in incidence or prevalence. Through this initiative, our clients gain timely access to emerging trends, empowering rapid adjustments to R&D and market strategies.
AI/ML models are an essential tool for conducting targeted searches. Because these searches return a large volume of papers, we employ an AI/ML relevance model API to rank the papers by relevance.
Clarivate’s epidemiology machine learning (EPI ML) project automates the scanning of epidemiology studies, focusing on model selection and the development of an ML service platform. Six classifier ML models were evaluated, and performance reports were carefully studied.
Since our datasets are imbalanced, with a larger number of irrelevant samples compared to relevant ones, the use of an ensemble learning technique (ELT) is required to improve accuracy. An ensemble model combines several individual models to produce more accurate predictions than a single model alone.
The multinomial naive Bayes (NB) model with an easy ensemble classifier (EEC) performed best for Epidemiology, with a weighted accuracy of around 70% (Fig.3).
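The idea of pairing a multinomial naive Bayes classifier with an easy-ensemble strategy can be sketched in plain Python. This is a simplified stand-in, not the EPI ML implementation: each ensemble member is trained on all minority-class samples plus an equally sized random draw from the majority class, and the members' probabilities are averaged. The class and function names, the toy token lists, and the ensemble size of 10 are all assumptions made for the example. Libraries such as scikit-learn and imbalanced-learn offer fuller versions of both components.

```python
import math
import random
from collections import Counter


class MultinomialNB:
    """Tiny multinomial naive Bayes over token lists, with Laplace smoothing."""

    def fit(self, docs, labels):
        self.classes = sorted(set(labels))
        self.vocab = {tok for doc in docs for tok in doc}
        self.prior = {c: math.log(labels.count(c) / len(labels)) for c in self.classes}
        self.counts = {c: Counter() for c in self.classes}
        for doc, label in zip(docs, labels):
            self.counts[label].update(doc)
        self.totals = {c: sum(self.counts[c].values()) for c in self.classes}
        return self

    def predict_proba(self, doc):
        """Return {class: probability} for one tokenized document."""
        v = len(self.vocab)
        logp = {}
        for c in self.classes:
            logp[c] = self.prior[c] + sum(
                math.log((self.counts[c][tok] + 1) / (self.totals[c] + v))
                for tok in doc if tok in self.vocab  # unseen tokens are ignored
            )
        top = max(logp.values())  # subtract the max for numerical stability
        exp = {c: math.exp(lp - top) for c, lp in logp.items()}
        norm = sum(exp.values())
        return {c: e / norm for c, e in exp.items()}


def fit_easy_ensemble(docs, labels, n_models=10, seed=0):
    """Train one NB model per balanced subset of a two-class dataset."""
    rng = random.Random(seed)
    freq = Counter(labels)
    minority = min(freq, key=freq.get)
    minor_idx = [i for i, y in enumerate(labels) if y == minority]
    major_idx = [i for i, y in enumerate(labels) if y != minority]
    models = []
    for _ in range(n_models):
        # All minority samples plus an equally sized random majority sample.
        idx = minor_idx + rng.sample(major_idx, len(minor_idx))
        models.append(MultinomialNB().fit([docs[i] for i in idx], [labels[i] for i in idx]))
    return models


def relevance_score(models, doc, positive="relevant"):
    """Average the positive-class probability across the ensemble."""
    return sum(m.predict_proba(doc)[positive] for m in models) / len(models)


# Toy, imbalanced training set: 2 relevant vs 6 irrelevant "abstracts".
relevant = [["incidence", "cohort", "registry"], ["prevalence", "survey", "population"]]
irrelevant = [
    ["mouse", "model", "receptor"], ["protein", "binding", "assay"],
    ["mouse", "assay", "gene"], ["receptor", "gene", "pathway"],
    ["assay", "cell", "culture"], ["mouse", "tissue", "staining"],
]
docs = relevant + irrelevant
labels = ["relevant"] * len(relevant) + ["irrelevant"] * len(irrelevant)

models = fit_easy_ensemble(docs, labels)
score_epi = relevance_score(models, ["prevalence", "registry", "population"])
score_lab = relevance_score(models, ["mouse", "receptor", "assay"])
print(round(score_epi, 2), round(score_lab, 2))
```

Balancing each subset keeps the majority class from dominating the learned priors, and averaging over several random draws reduces the information lost through undersampling.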
Fig.3. Performance report for the six classifier ML models tested:

Launched in August 2024, the EPI ML service platform utilizes APIs for prediction, training, model management, and system tracking, facilitating integration into the Epi-Intelligence platform. Through these APIs, users can predict the relevancy of epidemiology studies, upload training data, and manage ML models. The targeted search exercise uses sources like PubMed, Web of Science, and cited literature for additional valuable sources. Data undergoes rigorous validation, and the AI/ML model then assigns relevance scores for efficient curation.
By focusing on papers with a relevance score of 80% or higher, the process becomes more efficient, typically narrowing the list for detailed evaluation. This approach saves considerable time while ensuring the inclusion of optimal and impactful research studies.
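The thresholding step can be sketched as a filter followed by a sort. The 0.80 cut-off mirrors the 80% figure above. The paper records, their `id` and `score` fields, and the function name are hypothetical.

```python
def shortlist(scored_papers, threshold=0.80):
    """Keep papers at or above the threshold, highest score first."""
    kept = [p for p in scored_papers if p["score"] >= threshold]
    return sorted(kept, key=lambda p: p["score"], reverse=True)


# Made-up relevance scores, for illustration only.
papers = [
    {"id": "paper-1", "score": 0.93},
    {"id": "paper-2", "score": 0.41},
    {"id": "paper-3", "score": 0.80},
    {"id": "paper-4", "score": 0.79},
]

top = [p["id"] for p in shortlist(papers)]
print(top)
```

Papers just below the cut-off, such as `paper-4` here, are not reviewed in detail, so the choice of threshold trades reviewer time against the risk of missing a relevant study.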
The next steps for the EPI ML service integration include adding an automated feedback loop for continuous learning, creating a user-friendly Singularity Dashboard, and conducting deep learning tests for improving model accuracy.
Meanwhile, the AI/ML-mediated targeted search exercise will focus on annual updates for all indications, documenting recent articles, and integrating findings into the Epi-Intelligence platform to provide clients with up-to-date, comprehensive epidemiology data. The Epi-Intelligence platform integrates legacy system content and offers enhanced features such as a user-friendly search box, quick loading times, redesigned summary tables, a new, visual and customizable data application, and a heatmap view of epidemiology numbers across countries. These features collectively provide a comprehensive, efficient, and user-friendly experience for accessing and analyzing epidemiology data.
Furthermore, by integrating AI/ML technology, Clarivate continues to set the benchmark for precision and efficiency in epidemiological research.
A crystal ball built from robust data sets
The Clarivate Epi Intelligence Platform provides:
- data across 45 countries, with extrapolation capabilities extending to 171 countries
- over 200 diseases and key populations spanning dermatology, oncology, cardiovascular, infectious diseases, and more
- 10–20-year forecasts, enabling long-term strategic planning for our clients.
This extensive coverage equips pharmaceutical companies and researchers with actionable insights tailored to diverse markets. Whether it is understanding disease burden, identifying key risk factors, or forecasting future trends, Clarivate’s epidemiology team delivers insights that matter.
For a comprehensive overview of the diseases and methodologies we cover, visit our Epi Intelligence Platform, or contact our team today. Together, we can shape the future of healthcare research.
This post was written by Narendra Parihar, Director, Epidemiology, and Shyama Ghosh, Senior Principal STEM Content Analyst.