Cancer Information Extraction from Pathology Reports Using CLAMP

Biospecimen repositories are collections of both diseased and normal tissue, as well as specimens of blood, urine, and other bodily fluids. They represent a critical resource for researchers trying to understand the etiology of diseases and for the development of new diagnostics and therapies. Because the specimens are donated by patients and others who wish to contribute to medical research, sometimes via invasive medical procedures, they are regarded as precious. Consequently, both medical ethics and organizational policy demand that they not languish unused. Even in mid-sized biorepositories, indexing and categorizing specimens so that those meeting specific criteria for research on a particular disease or condition can be found is time-consuming and costly. Specimen requests usually target specific patient attributes and often rely on information contained in pathology reports, such as diagnosis, staging, grade, disease progression, or immunohistochemistry (IHC) biomarker status.

Only in recent years have pathology reports become integrated into electronic health record (EHR) systems. Most EHRs still include surgical pathology reports only as an attachment in an external media section. Because the reports lack structured data, costly staff hours are spent extracting this information manually to add value to the donated specimens. Consequently, many donated specimens go unused, failing to fulfill the donors’ intended purpose while costing the hosting organization money and space to store and maintain them properly.

For these reasons, the biorepository at Baylor College of Medicine (BCM) in Houston wished to improve its ability to find and utilize specimens. As shown in Figure 1, the BCM biospecimen repository contains over 693,000 specimens collected from over 58,000 participants at five affiliate hospitals in the Houston area. Most specimen details (including the anatomic site of collection, the stage and grade of the tumors, and the molecular and histological characterization of the specimens) are contained in pathology reports as either narrative or semi-structured text. Consequently, the BCM team wanted to construct a database that supports their mission and is populated automatically with data derived from the pathology reports. Melax Tech is applying our CLAMP NLP technology to support the creation and maintenance of this repository.

Figure 1: BCM biorepository summary statistics dashboard

To automate the extraction of this information for the BCM biorepository, we developed a set of customizable modules that extract a comprehensive range of cancer-related information from pathology reports (e.g., tumor size, tumor stage, and biomarkers). The modules build on the existing CLAMP cancer information extraction system, which provides user-friendly interfaces for building customized NLP solutions. The default Melax Tech CLAMP NLP pipeline for cancer data extraction is based on prior work on pathology reports presented at Medinfo 2019 (see Stud Health Technol Inform. 2019 Aug 21; 264: 1041–1045). The resulting pipeline extracts cancer information from pathology reports according to College of American Pathologists (CAP) recommendations using a hybrid approach consisting of machine learning models, dictionaries, and rule sets. Data elements are normalized (mapped) to ICD-10 codes using the NLM’s Unified Medical Language System (UMLS). The default pipeline achieves F-measures of 0.87 to 0.99 for named entity recognition.
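To give a feel for the rule-set portion of such a hybrid approach, the sketch below uses plain regular expressions to pull tumor size and biomarker status out of report text. It is a minimal illustration only: the patterns, function names, and sample sentence are assumptions made for this example and are not CLAMP's actual rules or API. A production pipeline would combine rules like these with dictionaries and trained models.

```python
import re

# Hypothetical rules for illustration; these are NOT CLAMP's actual patterns.
SIZE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(cm|mm)\b", re.IGNORECASE)
BIOMARKER_RE = re.compile(
    r"\b(ER|PR|HER2)\b\s*(?:is|:)?\s*(positive|negative)\b", re.IGNORECASE
)


def extract_tumor_sizes_cm(text):
    """Return every size mention in the text, converted to centimeters."""
    sizes = []
    for value, unit in SIZE_RE.findall(text):
        size = float(value)
        if unit.lower() == "mm":
            size /= 10.0
        sizes.append(size)
    return sizes


def extract_biomarkers(text):
    """Map each biomarker mentioned in the text to 'positive' or 'negative'."""
    return {
        name.upper(): status.lower()
        for name, status in BIOMARKER_RE.findall(text)
    }


# Made-up sample sentence, not taken from a real report.
report = "Invasive ductal carcinoma, 2.3 cm. ER: positive. PR negative. HER2 is negative."
sizes = extract_tumor_sizes_cm(report)
markers = extract_biomarkers(report)
```

Rules of this kind are easy to audit but brittle against wording variation, which is one reason a hybrid design pairs them with machine learning models.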

Since BCM wished to extract additional data elements, and in a greater degree of granularity than our default cancer pipeline, we extended the information model of the CLAMP cancer extraction system, as well as the number and types of information that can be retrieved from the pathology reports. A high-level diagram of the basic concepts and their relationships is summarized in Figure 2.

Figure 2: Basic concepts and their relationships

Identifying the cancer status of a patient depends on these information elements as they are described in the pathology reports. The concepts were used during the annotation process, in which trained annotators (such as nurses or pathologists) marked concepts and the relationships between them. The annotated reports were then used in CLAMP to train extended and enhanced machine learning and deep learning models for the final NLP pipelines. The resulting NLP system was run on 20 years’ worth of surgical pathology reports drawn from the multiple EHR systems in use at Baylor and its affiliates during that time frame. The system now processes current reports in near real time as specimens are accrued to the biorepository.

Because CLAMP was designed to make NLP functionality easy to extend, we were able to apply the system readily to the different types of pathology reports from Baylor and the four affiliate hospitals contributing to the project, which originate from diverse electronic health record (EHR) and pathology systems including Epic (Beaker), Cerner (CoPath), and CPRS (VistA).

Our overall method in applying CLAMP to the Baylor use case was as follows:
  1. Assess the accuracy of the default CLAMP pathology pipeline on reports from the Baylor project.

  2. Identify gaps in the information model of the required clinical entities.

  3. Extend the information model to include the necessary elements, as shown above in Figure 2.

  4. Have two annotators add the new annotations to a sample of pathology reports from the BCM systems.

  5. Retrain the CLAMP deep learning and machine learning models using the additional annotated pathology reports.

  6. Reassess performance, repeating rounds of annotation, model retraining, and parameter adjustment until suitable F1 scores are obtained for critical data elements.
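The assessment steps above rely on F1 scores. As a minimal sketch of how such a score can be computed for one report, the function below compares predicted entities against annotator-provided gold entities using exact matching on (start, end, label) tuples. The tuple layout, the matching criterion, and the sample values are assumptions for illustration; the source does not describe how CLAMP's evaluation is implemented.

```python
def precision_recall_f1(gold, predicted):
    """Compute exact-match precision, recall, and F1 for two entity lists.

    Each entity is a (start_offset, end_offset, label) tuple. This layout
    is an assumption made for this example.
    """
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up annotations: three of the four predictions match the gold standard.
gold = [(0, 5, "SIZE"), (10, 14, "STAGE"), (20, 23, "BIOMARKER"), (30, 35, "GRADE")]
predicted = [(0, 5, "SIZE"), (10, 14, "STAGE"), (20, 23, "BIOMARKER"), (40, 44, "GRADE")]
precision, recall, f1 = precision_recall_f1(gold, predicted)
```

Computing the score separately for each data element makes it clear which elements need another round of annotation and retraining.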

The system now supports extraction of 38 data elements aligned with AJCC staging criteria and the CAP recommendations. We are currently achieving F1 measures exceeding 0.90 on most of these data elements and continue to improve our accuracy. Baylor anticipates that the system will provide substantial cost savings and a measurable increase in overall specimen utilization.

The results of the NLP process allow the biobank staff to identify specimens based on markers of interest. Searches can now be run and filters applied based on data extracted from the unstructured text. Common examples are finding specimens associated with triple-negative breast cancer (ER-, PR-, HER2-) or with a particular KRAS mutation status in colon or pancreatic cancer. Figure 3 shows an example of these biomarker report filters applied to the NLP output using the BCM dashboard.
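Once biomarker status is available as structured data, a cohort search of this kind reduces to a simple filter. The sketch below shows one way to select triple-negative breast cancer specimens. The `Specimen` record, its field names, and the sample data are hypothetical and do not reflect the BCM database schema or dashboard.

```python
from dataclasses import dataclass, field


@dataclass
class Specimen:
    """Hypothetical structured record produced from NLP output."""
    specimen_id: str
    anatomic_site: str
    biomarkers: dict = field(default_factory=dict)


def is_triple_negative(specimen):
    """True when a breast specimen is negative for ER, PR, and HER2."""
    required = ("ER", "PR", "HER2")
    return specimen.anatomic_site == "breast" and all(
        specimen.biomarkers.get(marker) == "negative" for marker in required
    )


def find_triple_negative(specimens):
    """Return the IDs of all triple-negative breast specimens."""
    return [s.specimen_id for s in specimens if is_triple_negative(s)]


# Made-up records for illustration only.
inventory = [
    Specimen("S-001", "breast", {"ER": "negative", "PR": "negative", "HER2": "negative"}),
    Specimen("S-002", "breast", {"ER": "positive", "PR": "negative", "HER2": "negative"}),
    Specimen("S-003", "colon", {"KRAS": "mutated"}),
    Specimen("S-004", "breast", {"ER": "negative", "PR": "negative"}),
]
cohort = find_triple_negative(inventory)
```

A specimen with a missing marker (S-004 here) is excluded, which is the conservative choice: an untested biomarker should not be treated as a negative result.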

Figure 3: Biomarker Summary Report showing gender, grade, stage, smoking status, and number of samples in the selected cohort by anatomic site and biomarker

The biobank is now able to identify specimens affected by prior treatment (indicated by the “y” prefix on the staging classification in a pathology report) and specimens of advanced stage and grade. Figure 4 demonstrates how the dashboards generated for a selected cohort can be explored to examine the distribution of anatomic sites, biomarkers, and other characteristics of the cohort.
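In TNM staging notation, a "y" prefix (as in ypT2N1a) marks a classification assigned after therapy. The sketch below shows one simple way to detect that prefix in a staging string with a regular expression. The pattern is a deliberately simplified assumption that covers only common forms, and it is not the rule CLAMP uses.

```python
import re

# Simplified, hypothetical pattern: optional y/r/a prefixes, a clinical (c)
# or pathologic (p) basis, then the T and N categories and an optional M.
TNM_RE = re.compile(
    r"\b(?P<prefix>[yra]*)(?P<basis>[cp])T(?P<t>[0-4][a-d]?|is|X)"
    r"\s*N(?P<n>[0-3][a-c]?|X)(?:\s*M(?P<m>[01]|X))?"
)


def parse_tnm(text):
    """Return the parsed TNM fields from the first match, or None."""
    match = TNM_RE.search(text)
    if match is None:
        return None
    fields = match.groupdict()
    # A "y" prefix indicates the stage was assigned after treatment.
    fields["post_therapy"] = "y" in fields["prefix"]
    return fields


treated = parse_tnm("Pathologic stage: ypT2N1a")
untreated = parse_tnm("Pathologic stage: pT1cN0M0")
```

Flagging `post_therapy` as its own field lets a search include or exclude previously treated specimens without re-reading the report text.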

Figure 4: BCM biorepository summary statistics dashboard

Using NLP to create additional specimen annotations has proven to be a low-cost method that provides easy access to patient and specimen annotations. It has greatly enhanced the utility of the biospecimens and resulted in important specimen distributions for unique medical research that may not have been feasible in the past.

To learn more about cancer information extraction from pathology reports using CLAMP, request a demo today!