Why is life sciences data integration so hard?

A new analysis and guide from the Life Sciences data integration team at Clarivate Analytics asks the question, “Why is data integration so hard?”

The paper looks at the many reasons biopharma companies might want to integrate their vast and often mixed storehouses of data – revisiting an old drug program in light of new knowledge, for example – as well as the pitfalls typically encountered when embarking on such a project:

Not knowing what data you have
Not understanding its use, or not having a clear vision of what success looks like: “There is a world of difference between aggregating data and putting search and/or dashboards on top and building a data repository with some kind of programmatic access, e.g. via API, that users can interrogate.”
Not realizing just how ugly your data really is

“Drug programs accrete mini-ecosystems of experimental results, reports and analyses that vary in size depending on how far along the pipeline they get,” the paper notes, adding: “A modicum of data forensics at the start of a project can pay dividends later on.”

Text mining, artificial intelligence, curation

“Why is data integration so hard?” looks at the anatomy of a data integration application, starting with “bringing the content in,” which includes discussion of connectors and the Extract/Transform/Load (ETL) process; text mining; artificial intelligence (AI); and curation.
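To make the Extract/Transform/Load idea concrete, here is a minimal sketch in Python. It is not taken from the paper: the CSV columns, the sample values, the `potency` table and the unit conversion are all invented for illustration, and an in-memory SQLite database stands in for whatever target store a real project would use.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from an assay system; columns and values are invented.
RAW_CSV = """compound_id,target,ic50,unit
CMP-001,EGFR,12.5,nM
cmp-002,egfr,0.8,uM
CMP-003,BRAF,,nM
"""

# Conversion factors to nanomolar.
TO_NM = {"nM": 1.0, "uM": 1000.0}


def extract(text):
    """Extract: read rows from a CSV source into dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalize identifiers and units, skip rows with no measurement."""
    cleaned = []
    for row in rows:
        if not row["ic50"]:
            continue  # missing value; a real pipeline would flag this for curation
        cleaned.append(
            (
                row["compound_id"].upper(),
                row["target"].upper(),
                float(row["ic50"]) * TO_NM[row["unit"]],
            )
        )
    return cleaned


def load(records, conn):
    """Load: write the cleaned records into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS potency "
        "(compound_id TEXT, target TEXT, ic50_nm REAL)"
    )
    conn.executemany("INSERT INTO potency VALUES (?, ?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
rows = conn.execute("SELECT * FROM potency ORDER BY compound_id").fetchall()
print(rows)
```

Even in this toy case, the transform step has to deal with inconsistent casing, mixed units and a missing measurement, which is the kind of ugliness the paper warns about.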

It examines the differences between heavyweight and lightweight integration, recommending the middle ground: “There is a natural inclination to address the opportunity by integrating everything. That is a dangerous road!”

And the report notes in conclusion:

Having clear goals for your project, going for middleweight data integration, choosing the right ontologies and actively managing the entities you care about will serve as a good starting point in order to achieve a successful outcome.

The goals will dictate the technologies that will help you along the way, such as choosing which database engine to use or whether to build an API around your data. Remember that technologies are not mutually exclusive, and for example you can use triple store, document storage and relational database management systems (RDBMS) in one project to represent different facets of data.
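As a rough illustration of mixing storage models in one project, the Python sketch below keeps three facets of the same compound side by side. Everything in it is an assumption made for the example: the identifiers, the `assay` table, the report fields and the `describe` helper are hypothetical, SQLite stands in for an RDBMS, and a plain dictionary and a set of tuples stand in for a document store and a triple store.

```python
import json
import sqlite3

# Relational facet: structured assay results (sqlite3 stands in for an RDBMS).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE assay (compound_id TEXT, target TEXT, ic50_nm REAL)")
db.execute("INSERT INTO assay VALUES ('CMP-001', 'EGFR', 12.5)")

# Document facet: a semi-structured report keyed by compound
# (a dict of JSON strings stands in for a document store).
documents = {
    "CMP-001": json.dumps(
        {"title": "Program review", "sections": ["summary", "safety"]}
    )
}

# Graph facet: subject-predicate-object triples
# (a set of tuples stands in for a triple store).
triples = {
    ("CMP-001", "inhibits", "EGFR"),
    ("EGFR", "is_a", "kinase"),
}


def describe(compound_id):
    """Collect what each facet knows about one compound, joined on a shared identifier."""
    assays = db.execute(
        "SELECT target, ic50_nm FROM assay WHERE compound_id = ?",
        (compound_id,),
    ).fetchall()
    report = json.loads(documents.get(compound_id, "{}"))
    relations = sorted(t for t in triples if t[0] == compound_id)
    return {"assays": assays, "report": report, "relations": relations}


summary = describe("CMP-001")
print(summary)
```

The point of the sketch is the shared identifier: each store answers the kind of question it is good at, and a consistently managed entity key is what lets the answers be combined.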

AI and text mining can help you to groom and enrich your data, but use the tools wisely. As powerful as machine learning methods can be, in many instances they are still inferior to, or offer very little advantage over, more traditional algorithms.

And finally, even with all that automation, manual curation is still necessary in most data integration projects, even if it’s just for quality control purposes.

Eugene Rakhmatulin is one of the authors of “Why is data integration so hard?”, the whitepaper on which this article is based.

