A new analysis and guide from the Life Sciences data integration team at Clarivate Analytics asks the question, “Why is data integration so hard?”
The paper looks at the many reasons biopharma companies might want to integrate their vast and often mixed storehouses of data – revisiting an old drug program in light of new knowledge, for example – as well as the pitfalls typically encountered when embarking on such a project:
- Not knowing what data you have
- Not understanding its use or having a clear vision of what success looks like: “There is a world of difference between aggregating data and putting search and/or dashboards on top, and building a data repository with some kind of programmatic access, e.g. via API, that users can interrogate.”
- Not realizing just how ugly your data really is
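The distinction the paper draws between search-on-top aggregation and programmatic access can be made concrete with a small sketch. Everything here (the `Repository` class, the record fields) is an invented illustration, not part of any Clarivate product:

```python
# Toy contrast between the two access modes the paper distinguishes.
# All names and fields below are hypothetical illustrations.

class Repository:
    """Minimal in-memory data repository."""

    def __init__(self, records):
        self._records = records

    def search(self, text):
        # Dashboard-style free-text search: returns whole records.
        return [r for r in self._records if text.lower() in str(r).lower()]

    def query(self, **filters):
        # Programmatic access (what an HTTP API would expose):
        # structured filters that other systems can build on.
        return [r for r in self._records
                if all(r.get(k) == v for k, v in filters.items())]

records = [
    {"id": "CMP-1", "target": "EGFR", "phase": 2},
    {"id": "CMP-2", "target": "EGFR", "phase": 1},
]
repo = Repository(records)
print(repo.query(target="EGFR", phase=2))  # structured, machine-usable
```

The point is that `query` returns structured, filterable results another program can consume, whereas `search` only supports a human browsing hits.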
“Drug programs accrete mini-ecosystems of experimental results, reports and analyses that vary in size depending on how far along the pipeline they get,” the paper notes, adding: “A modicum of data forensics at the start of a project can pay dividends later on.”
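The “data forensics” the paper recommends can start as simply as profiling a sample of records for missing fields and mixed types before integration begins. This is a minimal sketch; the field names and records are invented for illustration:

```python
from collections import Counter

# Profile a sample of records: how often are key fields missing,
# and which Python types show up in each field? Mixed types and
# empty values are early signals of "ugly" data.

def profile(records):
    missing = Counter()
    types = {}
    for r in records:
        for field in ("compound_id", "assay", "ic50_nM"):
            if r.get(field) in (None, "", "N/A"):
                missing[field] += 1
        for k, v in r.items():
            types.setdefault(k, set()).add(type(v).__name__)
    return missing, types

records = [
    {"compound_id": "CMP-1", "assay": "kinase", "ic50_nM": 12.5},
    {"compound_id": "CMP-2", "assay": "kinase", "ic50_nM": "12.5"},  # str!
    {"compound_id": "CMP-3", "assay": "", "ic50_nM": None},
]
missing, types = profile(records)
print(missing)           # which fields are absent or empty, and how often
print(types["ic50_nM"])  # mixed types surface before they break a pipeline
```

A few dozen lines like this at the start of a project can flag the inconsistencies that otherwise surface much later, mid-integration.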
Text mining, artificial intelligence, curation
“Why is data integration so hard?” looks at the anatomy of a data integration application, starting with “bringing the content in,” which includes discussion of connectors and the Extract/Transform/Load (ETL) process; text mining; artificial intelligence (AI); and curation.
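The Extract/Transform/Load pattern the paper discusses can be sketched in a few lines. This is a toy pipeline under invented assumptions (an inline CSV source, made-up column names, an in-memory SQLite target); real connectors would replace the extract step:

```python
import csv
import io
import sqlite3

# Minimal Extract/Transform/Load sketch. Source data, column names,
# and target schema are all invented for illustration.

RAW_CSV = """compound,ic50
CMP-1,12.5
CMP-2,n/a
"""

def extract():
    # Extract: pull rows from the source (here, an inline CSV).
    return list(csv.DictReader(io.StringIO(RAW_CSV)))

def transform(rows):
    # Transform: normalize identifiers and coerce values to clean types.
    out = []
    for row in rows:
        ic50 = row["ic50"].strip().lower()
        out.append({
            "compound": row["compound"].strip().upper(),
            "ic50": float(ic50) if ic50 not in ("", "n/a") else None,
        })
    return out

def load(rows, conn):
    # Load: write the cleaned rows into the target store.
    conn.execute("CREATE TABLE IF NOT EXISTS assay (compound TEXT, ic50 REAL)")
    conn.executemany("INSERT INTO assay VALUES (:compound, :ic50)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(conn.execute("SELECT * FROM assay").fetchall())
```

Most of the real effort in practice lands in the transform step, which is where the text mining, AI, and curation the paper goes on to discuss come into play.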
It examines the differences between heavyweight and lightweight integration, recommending the middle ground: “There is a natural inclination to address the opportunity by integrating everything. That is a dangerous road!”
And the report notes in conclusion:
Having clear goals for your project, going for middleweight data integration, choosing the right ontologies and actively managing the entities you care about will serve as a good starting point for achieving a successful outcome.
The goals will dictate the technologies that will help you along the way, such as choosing which database engine to use or whether to build an API around your data. Remember that technologies are not mutually exclusive, and for example you can use triple store, document storage and relational database management systems (RDBMS) in one project to represent different facets of data.
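The point that these technologies are not mutually exclusive can be illustrated by representing one drug-program fact three ways. The entities and values below are invented, and plain Python stands in for real triple-store and document-database engines:

```python
import sqlite3

# One invented fact about compound "CMP-1", held in three facets.

# 1. Triples (subject, predicate, object): the relationship-centric facet,
#    as a triple store would hold it.
triples = [("CMP-1", "inhibits", "EGFR"), ("EGFR", "implicated_in", "NSCLC")]

# 2. Document: the full nested experimental record, as document storage
#    would hold it.
document = {"id": "CMP-1", "assays": [{"target": "EGFR", "ic50_nM": 12.5}]}

# 3. Relational: tabular results for reporting and joins (RDBMS).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE result (compound TEXT, target TEXT, ic50 REAL)")
conn.execute("INSERT INTO result VALUES ('CMP-1', 'EGFR', 12.5)")

# Each facet answers a different kind of question about the same entity:
targets = [o for s, p, o in triples if s == "CMP-1" and p == "inhibits"]
row = conn.execute("SELECT ic50 FROM result WHERE compound='CMP-1'").fetchone()
print(targets, document["assays"][0]["ic50_nM"], row[0])
```

Graph traversal, full-record retrieval, and tabular aggregation each favor a different representation, which is why one project may legitimately use all three.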
AI and text mining can help you to groom and enrich your data, but use the tools wisely. As powerful as machine learning methods can be, in many instances they are still inferior to, or offer little advantage over, more traditional algorithms.
And finally, even with all that automation, manual curation is still necessary in most data integration projects, even if it’s just for quality control purposes.
Eugene Rakhmatulin is one of the authors of “Why is data integration so hard?”, the whitepaper on which this article is based.
To download the full paper, please click here.