Evaluating the quality of generative AI output: Methods, metrics and best practices

By Christine Stohn and Marta Enciso, Clarivate
As generative AI (GenAI) tools become more embedded in academic workflows—from research discovery to learning support—questions about output quality are moving to the forefront. Whether an AI is used for literature review, extracting insights from data sources, or enriching metadata, institutions need confidence that the results are trustworthy, accurate and appropriate for academic use.
Evaluating AI-generated output isn’t straightforward. Traditional quality assurance methods don’t quite fit, and newer approaches are still evolving. This blog shares some of the thinking behind how Clarivate approaches that challenge and outlines the broader methods, dimensions, and metrics involved in evaluating GenAI output effectively.
Why evaluating AI output is so challenging
Unlike traditional systems where there’s usually a clear “right” answer, generative AI often produces a range of possible responses—all slightly different but potentially valid. That variability is part of its power, but it also makes evaluation more complex, particularly in academic contexts, where nuance, interpretation, and subjectivity often come into play.
Even when human review is possible, it doesn’t scale well. Testing across multiple prompts, data sets and user scenarios quickly becomes unmanageable without automation.
Because AI solutions rely on third-party large language models (LLMs), the challenge grows. These models evolve constantly, so ensuring output quality requires more than just testing responses: it requires frameworks that can monitor, benchmark and adapt over time.
What we measure: Key dimensions
At Clarivate, we are working with academic partners and customers to define “quality” in context. Based on both industry research and real-world use cases, we focus on a set of core dimensions:
- Relevance: Does the AI response directly address the user’s query?
- Accuracy / Faithfulness: Does the source material support the answer? Are there signs of hallucination?
- Clarity and structure: Is the response easy to read and logically organized?
- Bias or offensive content: Does the output include offensive or inappropriate content? Are relevant perspectives excluded?
- Comprehensiveness: Does the answer consider multiple perspectives or angles, especially in academic contexts?
- Behavior when information is lacking: Does the answer acknowledge uncertainty or produce misleading content? (Also known as noise reduction and negative rejection.)
The choice of metrics should reflect the specific goals and user needs. For example, what counts as a “relevant” answer may differ depending on whether the user is an undergraduate student, a faculty member or a researcher.
These quality dimensions help guide AI evaluation and shape the future development of AI-powered features across our solutions.
How we test: Methods and tools
Testing GenAI outputs usually involves a mix of manual and semi-automated methods, depending on the stage of development and the nature of the use case.
Manual review is essential early in development. It helps clarify use cases, surface subtle issues, and lay the groundwork for automation. As solutions move toward deployment, semi-automated testing, using workflows that simulate real-world usage, becomes indispensable.
For example, in the Clarivate AI-powered Research Assistants, we test across a large set of prompts, as sketched below, to evaluate:
- Answer consistency across different iterations
- Response quality across content types and languages
- Alignment with expected behaviors
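To make that workflow concrete, here is a minimal sketch of a semi-automated harness in Python. The prompt set, the `ask` callable and the similarity check are illustrative placeholders rather than the production setup; a real harness would plug in the assistant under test and a more robust similarity or rubric-based metric.

```python
# Minimal sketch of a semi-automated prompt test harness (illustrative only).
# `ask` stands in for a call to the assistant under test.
from difflib import SequenceMatcher
from typing import Callable

TEST_PROMPTS = [
    "What are the main drivers of coral bleaching?",
    "Summarize recent findings on CRISPR off-target effects.",
]

def consistency(prompt: str, ask: Callable[[str], str], runs: int = 3) -> float:
    """Average pairwise similarity of answers to the same prompt across runs."""
    answers = [ask(prompt) for _ in range(runs)]
    pairs = [(a, b) for i, a in enumerate(answers) for b in answers[i + 1:]]
    scores = [SequenceMatcher(None, a, b).ratio() for a, b in pairs]
    return sum(scores) / len(scores) if scores else 1.0

def run_suite(ask: Callable[[str], str]) -> None:
    """Report a consistency score per prompt; thresholds would flag regressions."""
    for prompt in TEST_PROMPTS:
        print(f"{prompt[:45]:<45} consistency={consistency(prompt, ask):.2f}")
```

The same loop can be extended to run each prompt against different content types and languages, or to compare outputs against expected behaviors.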
The following are examples of quality testing frameworks we’re applying and evaluating:
Using LLMs to evaluate LLMs
One increasingly common approach to scaling quality testing is using an LLM to evaluate the output of another LLM. In this setup, one model generates the answer and a second model evaluates its quality against predefined criteria. This method can be useful at scale, but it is not without limitations: LLMs can replicate each other's blind spots, which is why human oversight is still needed, particularly for complex or high-stakes scenarios.
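As a rough illustration, an LLM-as-judge setup can be as simple as prompting a second model with the question, the retrieved sources and the answer, and asking for structured scores. The `call_llm` wrapper, the rubric and the JSON format below are assumptions made for the sketch, not the criteria used in any particular product.

```python
# Sketch of LLM-as-judge scoring. `call_llm` is a hypothetical wrapper around
# whatever chat-completion API is in use; it takes a prompt and returns text.
import json
from typing import Callable

JUDGE_TEMPLATE = """You are grading an AI-generated answer.
Question: {question}
Source passages: {context}
Answer under review: {answer}

Return JSON with integer scores from 1 (poor) to 5 (excellent) for
"relevance", "faithfulness" and "clarity", plus a one-sentence "rationale"."""

def judge_answer(question: str, context: str, answer: str,
                 call_llm: Callable[[str], str]) -> dict:
    """Ask a second model to score the first model's answer against a rubric."""
    prompt = JUDGE_TEMPLATE.format(question=question, context=context, answer=answer)
    raw = call_llm(prompt)
    return json.loads(raw)  # in practice: validate and retry on malformed JSON
```

Because the judge model shares the blind spots noted above, scores like these are best treated as a triage signal, with human reviewers taking the low-scoring or high-stakes cases.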
Retrieval-Augmented Generation Assessment
Another framework gaining traction is Retrieval-Augmented Generation Assessment (RAGAS), which evaluates answers along dimensions such as answer relevance, context relevance and faithfulness.
RAGAS assigns scores to each dimension, making it easier to benchmark and track changes over time. A response might get a faithfulness score of 1.0 if every point in the answer is clearly supported by the documents provided. But the same answer could get a context relevance score of 0.8 if one of the supporting documents isn’t on topic. Web of Science is the first Clarivate solution to use RAGAS, and we plan to use it more widely as we expand our evaluation capabilities across products.
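For readers who want to experiment, the open-source ragas Python package exposes metrics along these lines. The snippet below follows the usage pattern of earlier ragas releases; the exact API has changed across versions and an evaluator LLM (for example, an OpenAI key) must be configured, so treat this as an orientation sketch rather than a drop-in recipe. The sample question and answer are invented for illustration.

```python
# Orientation sketch using the open-source ragas package (API varies by version).
# Requires an evaluator LLM to be configured, e.g. via OPENAI_API_KEY.
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision, faithfulness

samples = {
    "question": ["What is the main finding of the cited study?"],
    "answer": ["The study reports a 20% improvement in recall."],
    "contexts": [["The experiments showed a 20% improvement in recall over the baseline."]],
    "ground_truth": ["A 20% recall improvement over the baseline."],  # needed by some metrics
}

results = evaluate(
    Dataset.from_dict(samples),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
print(results)  # per-metric scores, typically on a 0-1 scale
```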
In addition to RAGAS, we apply other evaluation methods and metrics, depending on the use case and context. For example, we are exploring task-specific metrics such as BLEU (Bilingual Evaluation Understudy) scores for translation or summarization, which can provide insight where clear reference outputs exist.
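Where a gold reference exists, a score such as BLEU can be computed directly. The example below uses the sacrebleu package purely to illustrate the mechanics; the choice of library and the sample sentences are assumptions, not part of any production pipeline.

```python
# Illustrative BLEU computation with the sacrebleu package.
# BLEU is only meaningful when a reference output (e.g. a human translation
# or gold summary) is available to compare against.
import sacrebleu

hypotheses = ["The model accurately summarizes the key findings of the abstract."]
# Outer list: one entry per reference set; inner list: one reference per hypothesis.
references = [["The model summarizes the abstract's key findings accurately."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.1f}")  # 0-100 scale; higher means closer to the reference
```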
Example: How a faithfulness score is calculated
The faithfulness score measures how accurately an AI-generated response reflects the source content it’s based on, such as abstracts or full-text documents. It is calculated by checking how many of the claims made by the AI are supported by that source material: the number of supported claims divided by the total number of claims in the response.
For example, if an LLM output contains 4 claims and 3 of them are supported by the source material, the faithfulness score would be:
3 supported claims ÷ 4 total claims = 0.75
This means that 75% of the claims in the response are faithful to the original content.
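In code, the arithmetic is trivial; the hard part, deciding whether each individual claim is actually supported by the source, is done by an evaluator model or a human reviewer. The small helper below simply assumes those per-claim judgments have already been made.

```python
# Toy illustration of the faithfulness arithmetic described above.
def faithfulness_score(claim_supported: list[bool]) -> float:
    """Fraction of claims in the response that are supported by the sources."""
    return sum(claim_supported) / len(claim_supported) if claim_supported else 0.0

# 3 of 4 claims supported by the source material:
print(faithfulness_score([True, True, True, False]))  # 0.75
```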
Key takeaways for institutions
While the bulk of testing is done by AI solution providers, such as Clarivate, institutions still play a vital role in defining expectations and providing real-world feedback. Here are some recommended guidelines:
- Start simple: Focus on core risks like hallucinations, inappropriate content, and failure to cite sources.
- Push for transparency: Stay informed about how AI tools are evaluated and how quality considerations are integrated into product development.
- Match evaluation to use case: Different AI applications (search vs. document insights vs. tutoring) require different testing approaches.
- Expect iteration: Quality evaluation practices will—and should—evolve as AI models mature.
As AI becomes a standard part of academic infrastructure, quality evaluation needs to evolve alongside it as part of responsible development and deployment. At Clarivate, we’re committed to making this process transparent and collaborative.
Watch this webinar on evaluating the quality of generative AI output to learn more.