Facilitating Biomedical Data Reuse Using Natural Language Processing



Biomedical research produces complex datasets ranging from the molecular level to individuals and populations. Many biomedical data repositories have been created to serve the community by housing these datasets and making them available for reuse. These repositories greatly improve the availability and utility of biomedical datasets. However, the metadata of such datasets are often not standardized, which makes the datasets less compliant with the FAIR principles (Findable, Accessible, Interoperable, and Reusable).


In a project funded by the National Institute of Allergy and Infectious Diseases (NIAID), Melax Tech proposed a solution to this challenge for immunology research: natural language processing (NLP) and ontology-based methods and tools that extract and normalize the metadata of immunology datasets, thus improving their discoverability by both general and specialized search engines. We extended our flagship NLP product, CLAMP, to extract and normalize biomedical entities such as genes, diseases, and drugs in the descriptions of biomedical datasets, making them interoperable via standard biomedical ontologies. Over 20,000 immunological datasets, as well as their linked publications, were processed by CLAMP and indexed through a public search engine. The resulting framework automatically extracts and normalizes the metadata of biomedical datasets and can feed them into any search engine, which will greatly improve the reuse of existing datasets and increase biomedical research productivity and reproducibility.
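
To make the extract-and-normalize step concrete, the following is a minimal Python sketch of the general idea, not CLAMP's actual API or method. It assumes a simple dictionary lookup in place of a real NLP pipeline; the lexicon, the concept identifiers (the `EX:` prefix is a placeholder, not a real ontology), and the function names are all hypothetical and chosen only for illustration.

```python
import re
from dataclasses import dataclass

# Hypothetical mini-lexicon: surface form -> (entity type, placeholder concept ID).
# A real system would map to identifiers from standard biomedical ontologies.
LEXICON = {
    "il-6": ("gene", "EX:GENE_0001"),
    "interleukin 6": ("gene", "EX:GENE_0001"),
    "asthma": ("disease", "EX:DIS_0001"),
    "dexamethasone": ("drug", "EX:DRUG_0001"),
}


@dataclass(frozen=True)
class Mention:
    text: str         # surface form as it appears in the description
    start: int        # character offset where the mention begins
    end: int          # character offset just past the mention
    entity_type: str  # "gene", "disease", or "drug"
    concept_id: str   # normalized identifier shared by all synonyms


def _build_pattern(lexicon):
    # Try longer terms first so multi-word names win over shorter overlaps.
    terms = sorted(lexicon, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in terms)
    # Require non-word characters (or string boundaries) around each match.
    return re.compile(r"(?<!\w)(" + alternatives + r")(?!\w)", re.IGNORECASE)


PATTERN = _build_pattern(LEXICON)


def extract_and_normalize(description):
    """Find lexicon terms in free text and attach their concept IDs."""
    mentions = []
    for match in PATTERN.finditer(description):
        entity_type, concept_id = LEXICON[match.group(1).lower()]
        mentions.append(
            Mention(match.group(1), match.start(1), match.end(1),
                    entity_type, concept_id)
        )
    return mentions


def to_index_record(dataset_id, description):
    """Build a record that a search engine could index by concept ID."""
    mentions = extract_and_normalize(description)
    return {
        "dataset_id": dataset_id,
        "description": description,
        "concept_ids": sorted({m.concept_id for m in mentions}),
    }
```

For example, calling `to_index_record("DS1", "Serum IL-6 and Interleukin 6 expression in asthma patients treated with dexamethasone.")` yields a record whose `concept_ids` field holds one identifier per distinct concept. Because both gene synonyms collapse to the same identifier, a search by concept would find the dataset regardless of which name its description used, which is the interoperability benefit described above.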