Megapatents: Why expert-curated biological sequence data is critical for effective IP strategy

Today, a single patent can contain tens of thousands of biological sequences, many of which are hidden in plain sight. These so-called ‘megapatents’ raise the stakes for how biological sequence data is captured, interpreted and used in decision-making.

Relying solely on electronic sequence listings can lead to costly blind spots for patent professionals and researchers. That’s why curated, context-rich data, like that provided by GENESEQ, is no longer a luxury but a necessity.

This blog explores how megapatents expose the limitations of automated sequence capture and why GENESEQ’s human-curated approach is essential for accurate freedom-to-operate (FTO) assessments, competitive intelligence and strategic intellectual property (IP) planning.

Read more about how GENESEQ can strengthen your IP strategy in our factsheet.

What is a megapatent?

Before the mid-2000s, biological sequence data in patents, when present, was typically embedded within the patent specification, either in the body text or in a formal sequence listing that formed part of the specification. In both cases, the GENESEQ editorial team manually extracted and annotated the sequences to ensure accuracy and completeness.

Around that time, the World Intellectual Property Organization (WIPO) began publishing electronic sequence listings, a format that has since been gradually adopted by other patent authorities. These listings are intended to capture all sequences referenced in a patent, whether in the claims, examples or disclosure sections. However, in practice, they are often incomplete or entirely absent. As a result, manual curation is still required to ensure a full and accurate representation of the sequence content.

This challenge is amplified by the emergence of so-called megapatents — single patent filings that disclose vast numbers of genetic sequences. These patents typically:

Cover many variants of a DNA or RNA sequence that are similar in structure or function.
Use percent identity thresholds (e.g., 80%, 90% or 95% identity) to extend the scope of protection to sequences that are not explicitly described in the patent.
Are often filed early in the discovery process, before the full function or utility of the sequences is known.
Seek broad control over a genetic domain, potentially blocking others from using a wide array of related sequences.
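
To make the percent identity idea concrete, here is a minimal Python sketch. It assumes the two sequences are already aligned and of equal length, which is a simplification: real analyses rely on alignment tools and on the identity definition given in the patent itself. The function name and the example sequences are hypothetical.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of matching positions between two pre-aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    matches = sum(1 for a, b in zip(seq_a.upper(), seq_b.upper()) if a == b)
    return 100.0 * matches / len(seq_a)


# Hypothetical 20-nucleotide reference and a variant with two substitutions.
reference = "ATGGCCATTGTAATGGGCCG"
variant = "ATGGCCATTGTTATGGGCCA"

identity = percent_identity(reference, variant)
for threshold in (80, 90, 95):
    status = "within" if identity >= threshold else "outside"
    print(f"{threshold}% threshold: {status} claimed scope")
```

With these example sequences the identity is 90%, so the variant falls within an 80% or 90% claim but outside a 95% one, even though the variant itself is never written out in the patent.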

This practice has raised concerns among scientists and legal experts alike. Megapatents can create legal uncertainty, stifle innovation and limit access to foundational genetic information, especially when the underlying data is incomplete or poorly annotated.

Megapatents and GENESEQ: Why manual work matters

When we talk about a ‘vast number’ of sequences disclosed by a single patent, we mean it. Take, for example, WO2025059390A2, which discloses upward of 34,000 sequence records. Yet, the formal sequence listing comprises only 2,757. The remaining sequences were manually captured by the GENESEQ editorial team – real experts, not algorithms – ensuring that every relevant sequence was accounted for and searchable within Derwent SequenceBase.
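
The size of that gap is easy to quantify. The short Python sketch below uses the figures quoted above, treating 34,000 as a lower bound because the patent discloses "upward of" that number. The variable names are illustrative.

```python
total_records = 34_000   # lower bound: the patent discloses "upward of 34,000"
listed_records = 2_757   # sequences in the formal sequence listing

missing_records = total_records - listed_records
listing_coverage = 100 * listed_records / total_records

print(f"Sequences absent from the listing: at least {missing_records:,}")
print(f"Share captured by the listing: at most {listing_coverage:.1f}%")
```

On these figures, a search restricted to the formal listing would see roughly 8% of the disclosed records.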

While electronic sequence data — digitally encoded representations of DNA, RNA or protein sequences — plays a vital role in bioinformatics, it often lacks the legal and scientific context needed for accurate patent analysis. This is especially problematic when dealing with megapatents. Relying solely on electronic listings introduces several critical risks:

Misinterpretation of legal scope

Electronic sequence data typically includes only the raw nucleotide or amino acid sequences. However, patent claims define the legal boundaries of protection, which may:

Include percentage identity thresholds (e.g., sequences with ≥90% identity to a reference sequence).
Be limited to specific uses, organisms, or structural contexts.
Be subject to exceptions or disclaimers not visible in the sequence data.

Without reading the full patent text, it’s easy to misjudge what is and isn’t protected.

Overlooking functional or structural limitations

The patent specification may contain important information about the sequences that does not appear in the sequence listing.

Megapatents often claim sequences in combination with specific functions (e.g., encoding a therapeutic protein) or structural features (e.g., motifs, domains). These details are typically found in the patent specification, not the sequence listing. Ignoring them can lead to incorrect assumptions about infringement or freedom to operate (FTO).

Inaccurate FTO assessments

A sequence might appear unclaimed in the listing, but the patent could cover a broader class of sequences that includes it. Conversely, a sequence might seem protected, but the claim could be narrower than it appears, or the patent might have expired or be invalid. Without full context, FTO assessments can be dangerously flawed.
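
Both failure modes can be sketched in code. The following Python example is illustrative only: the `Claim` fields, the `assess` function and all values are hypothetical, and the identity percentage is assumed to have been computed elsewhere. It is not legal analysis, just a picture of why a sequence-only check can mislead in either direction.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    """Hypothetical, simplified view of one sequence claim."""
    min_identity_pct: float          # identity threshold stated in the claim
    limited_to_use: Optional[str]    # None means no use limitation
    in_force: bool                   # False if expired or invalidated


def assess(identity_pct: float, intended_use: str, claim: Claim) -> str:
    """Return a coarse flag for one query sequence against one claim."""
    if not claim.in_force:
        return "no current risk: patent not in force"
    if identity_pct < claim.min_identity_pct:
        return "outside claimed identity range"
    if claim.limited_to_use is not None and claim.limited_to_use != intended_use:
        return "within identity range, but claim limited to another use"
    return "potential risk: review with counsel"


claim = Claim(min_identity_pct=90.0, limited_to_use="therapeutic", in_force=True)

# Not an exact match to any listed sequence, yet inside the claimed class.
print(assess(92.5, "therapeutic", claim))
# An exact match, but the claim is limited to a different use.
print(assess(100.0, "diagnostic", claim))
```

The threshold, use limitation and legal status in this sketch all come from the claims and the wider patent record, not from the sequence listing.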

Lack of contextual metadata

While formats like WIPO ST.26 include basic metadata (e.g., organism, function), they don’t replace the rich legal and scientific context found in the full patent document. That context is essential for accurate interpretation and strategic decision-making.

How do other sequence search tools treat megapatents?

Most sequence search tools rely exclusively on electronic sequence listings, automated data that often omits critical context, especially in megapatents. This creates a significant blind spot for organizations that depend on accurate, comprehensive sequence intelligence for IP strategy.

By contrast, GENESEQ combines manual sequence capture with expert annotation, ensuring that even sequences buried in figures, tables or unstructured text are included and searchable. Moreover, the search capabilities available via Derwent SequenceBase reduce the risk of relying only on electronic sequence listing data and provide a more reliable foundation for decision-making.

Manual vs. electronic sequence capture: A comparison

Aspect	Electronic sequence capture	Manual sequence capture
Data source	Automated extraction (OCR, etc.)	Human curation
Accuracy	May miss nuances or misinterpret context	Very high, extremely nuanced
Contextual info	Limited to metadata only	Very rich, includes claims
Legal interpretation	Very limited	Detailed and comprehensive
Human expertise	None	Expert analysis

Conclusion: In the age of megapatents, precision is power

As biological sequence data grows in volume and complexity, the risks of relying solely on electronic listings become harder to ignore. Megapatents expose the limitations of automated tools and highlight the need for curated, context-rich data that supports confident decision-making.

With GENESEQ, you’re not just searching sequences; you’re uncovering the full legal and scientific picture. Backed by expert curation and integrated into Derwent SequenceBase, GENESEQ empowers IP professionals, researchers and legal teams to navigate the genomic IP landscape with clarity and confidence.

Ready to see the difference GENESEQ can make? Explore the product or contact our team to learn more.

