
WHITE PAPER: The Use of Natural Language Processing in Oncology

Executive Summary

Over the past two decades, Electronic Health Record (EHR) systems have been increasingly implemented at US hospitals. Substantial amounts of detailed, longitudinal patient information have been accumulated and are available electronically in these EHRs, leading to a rich source of data for a variety of primary and secondary applications related to the healthcare enterprise.

However, a well-known challenge in analyzing EHR data is that an estimated 80% of detailed patient information remains embedded in unstructured data rather than in discrete EHR data elements. Extracting such information manually is both costly and time-consuming, yet this is routinely done when building disease registries and other analytic resources. Natural Language Processing (NLP) technologies represent an opportunity to unlock this information from clinical narratives, and healthcare organizations are consequently turning to NLP for solutions.

In this white paper, we examine one specific use case for NLP-based applications, demonstrating how the Melax Tech NLP system, CLAMP (Clinical Language Annotation, Modeling, and Processing), has been effectively used to extract highly detailed oncology information from the clinical narrative text of pathology reports.

Extracting oncology information is a particularly difficult task for NLP systems because the information is complex and because important diagnostic and treatment details are often not expressed as discrete data elements in an EHR system. As a result, accessing this data to support clinical care and subsequent secondary-use applications, such as population health and reporting to public health registries, can be difficult.

The Complexity of Oncology Information

Cancer is not a single disease; rather, it is a heterogeneous group of related diseases with a high degree of diagnostic and therapeutic complexity. Modern oncology practice relies heavily on a variety of laboratory tests (including genetic tests) and imaging procedures to provide information regarding cancer severity, spread in the body, speed of progression, treatment options, and clinical prognosis. Multiple types of information are required for these tasks, and many different laboratory tests are possible depending on the specific location within the body and cell types from which laboratory specimens are collected.

As discussed below, information on tumor stage and grade helps clinicians determine treatment options and prognosis. Unfortunately, this information is not routinely captured as discrete data elements in current EHR and clinical laboratory systems. The reasons vary from institution to institution, but they often involve an interaction of several factors, including:

  • the current design of EHRs

  • the current state of best practice guidelines and certification requirements by professional societies such as the College of American Pathologists (CAP)

  • the level of coordination required for adopting these guidelines by the multiple clinical practitioners involved in treating cancer patients within a given institution

Consequently, NLP is often necessary to extract staging and other clinically important information from clinical reports. The criteria for staging and grading depend on the precise organ system involved as well as on the cell types in the tumor sample, which illustrates how complex information extraction is for these specimens.

Considerations for Using NLP in Oncology

To be successful at clinical NLP, a system needs to perform the following three types of functions at a minimum:

  1. Named entity recognition (NER), sometimes termed clinical entity recognition (CER), to extract mentions of clinical concepts, such as diseases, medications, lab tests, and results, from free text.

  2. Concept encoding to map extracted mentions to codes in standard terminologies, such as SNOMED-CT, RxNorm, LOINC, ICD-10, and others.

  3. Relation extraction (RE) to identify the relations between concepts, such as temporal relations.
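The three functions listed above can be illustrated with a minimal Python sketch. It is not how CLAMP or any production system works: the lexicon, the placeholder codes (deliberately not real SNOMED-CT, RxNorm, or LOINC codes), the sample note, and the same-sentence date rule are all assumptions chosen only to show how the stages feed one another.

```python
import re
from dataclasses import dataclass

# Hypothetical mini-lexicon: surface form -> (entity type, placeholder code).
# The codes are made-up placeholders, not real SNOMED-CT, RxNorm, or LOINC codes.
LEXICON = {
    "adenocarcinoma": ("disease", "EXAMPLE:0001"),
    "cisplatin": ("medication", "EXAMPLE:0002"),
}

DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")


@dataclass(frozen=True)
class Mention:
    text: str
    start: int
    end: int
    label: str
    code: str


def recognize_and_encode(text):
    """Steps 1 and 2: dictionary-based entity recognition plus concept encoding."""
    mentions = []
    for term, (label, code) in LEXICON.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, re.IGNORECASE):
            mentions.append(Mention(m.group(0), m.start(), m.end(), label, code))
    return sorted(mentions, key=lambda mention: mention.start)


def extract_temporal_relations(text, mentions):
    """Step 3: naive rule linking each mention to the first date in its sentence."""
    relations = []
    for sentence in re.finditer(r"[^.]+\.?", text):
        date = DATE.search(sentence.group(0))
        if date is None:
            continue
        for mention in mentions:
            if sentence.start() <= mention.start and mention.end <= sentence.end():
                relations.append((mention.text, "occurred_on", date.group(0)))
    return relations


# Invented sample note, for illustration only.
note = "Adenocarcinoma diagnosed on 2019-03-04. Cisplatin started 2019-04-01."
mentions = recognize_and_encode(note)
relations = extract_temporal_relations(note, mentions)
```

In practice each stage would typically be replaced by a trained or curated component; the sketch only fixes the shape of the data passed between them.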

Oncology, however, presents additional challenges due to the chronic nature of the disease, the genetically modulated fashion in which it arises and progresses, and the complex nature of the treatment processes involved. Because oncology is a specialty area, the information needed to successfully diagnose, treat, and manage patients is not routinely captured in most EHR systems. Indeed, not all EHR systems fully support all the components of oncology care. Further, because of the specific diagnostic and treatment practices used in oncology, a successful NLP system must have sufficient and accurate models of these practices.

The NLP System as it Relates to Oncology

An NLP system targeted at oncology should be able to recognize and represent the kinds of information listed below. The list is illustrative, not exhaustive.

  • The stage of the disease or tumor. Cancer is staged according to a classification system, currently in its 8th edition, developed by the American Joint Committee on Cancer (AJCC). The system is sometimes referred to as TNM staging. The goals of the system are to record the location of the original tumor, its size, dissemination and lymph node involvement, and the presence or absence of metastasis. There are four different types of staging: clinical, pathologic, post-therapy, and restaging (in the case of a disease recurrence). Staging is complicated by the fact that it varies by cancer type. Ideally an NLP system should be able to properly recognize these elements and report them accurately in its output.

  • The grade of the tumor refers to how the tumor cells appear to be organized when examined by a trained pathologist. Grading systems are often defined and standardized by professional societies of clinicians and pathologists working in a particular disease subspecialty. The Gleason score used with prostate cancer is one such example. Its dissemination and standardization are overseen by the International Society of Urological Pathology. Cancer grade should be recognized and accurately reported.

  • Biomarkers are now routinely used to predict disease progression and outcomes, as well as to guide therapeutic selection. Well-known biomarkers in breast cancer are human epidermal growth factor receptor 2 (HER2), estrogen receptor (ER), and progesterone receptor (PR). Increasingly, biomarker panels that incorporate genetic test results are being designed and used. Consequently, a wide and rapidly increasing number of genes and gene products is being examined. NLP systems should be able to recognize all the common biomarkers for the diseases of interest and to reliably recognize gene names mentioned in a diagnostic or treatment context. Ideally, the system should also provide a means to extend gene name recognition as the field expands.

  • A robust temporal model. Recognition of time-based sequences of events, including medication administration, and the placement of events within those sequences is an important consideration in cancer therapy. Treatments often occur in multiple rounds of varying therapy, called cycles. Effective NLP of oncology clinical notes must accurately and robustly represent these events and place them in time.
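As a hedged illustration of the stage and grade items above, the Python sketch below pulls a TNM string and a Gleason score out of report text with regular expressions. The patterns are simplifying assumptions: real T, N, and M categories vary by cancer type and AJCC edition, and the function names and sample strings are invented for this example. The lowercase prefixes (c, p, yc/yp, r) follow the usual convention for clinical, pathologic, post-therapy, and recurrence staging.

```python
import re

# Simplified TNM pattern (assumption): real category values differ by cancer type.
TNM = re.compile(
    r"\b(?P<prefix>yc|yp|c|p|r)?"
    r"T(?P<t>X|is|[0-4][a-d]?)\s*"
    r"N(?P<n>X|[0-3][a-c]?)\s*"
    r"(?:M(?P<m>X|[01][a-c]?))?\b"
)

PREFIX_MEANING = {
    "c": "clinical",
    "p": "pathologic",
    "yc": "post-therapy",
    "yp": "post-therapy",
    "r": "restaging",
}

# Matches forms such as "Gleason score 3+4=7" or "Gleason 4 + 3".
GLEASON = re.compile(r"Gleason(?:\s+score)?\s*:?\s*(\d)\s*\+\s*(\d)", re.IGNORECASE)


def parse_tnm(text):
    """Return the first TNM stage found in the text, or None."""
    m = TNM.search(text)
    if m is None:
        return None
    return {
        "staging_type": PREFIX_MEANING.get(m.group("prefix"), "unspecified"),
        "T": m.group("t"),
        "N": m.group("n"),
        "M": m.group("m"),  # None when no M category was written
    }


def parse_gleason(text):
    """Return (primary pattern, secondary pattern, total score), or None."""
    m = GLEASON.search(text)
    if m is None:
        return None
    primary, secondary = int(m.group(1)), int(m.group(2))
    return primary, secondary, primary + secondary
```

For example, `parse_tnm("Final stage: pT2a N1 M0.")` yields a pathologic stage with T "2a", N "1", and M "0". A rule-based extractor like this is brittle against local phrasing, which is one reason annotated corpora and trained models are used in practice.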

Roadblocks to Robust NLP Configurations

Identification of this type of information is not straightforward. For example, many abbreviations exist in cancer reports as shorthand for chemotherapy drugs and treatment regimens. Cancer therapy is often administered in multiple rounds of chemotherapeutic drugs, sometimes combined with other interventions. Often, multi-drug combinations given for specific time periods are referred to by abbreviations or other shorthand names. These can be difficult to recognize because they are not necessarily tied directly to National Drug Code (NDC) names. For example, the abbreviation “AC” refers to a regimen consisting of doxorubicin hydrochloride (Adriamycin) and cyclophosphamide, often given in a 21-day treatment cycle. Similarly, “TCH” stands for a regimen of docetaxel (Taxotere) and carboplatin combined with trastuzumab (Herceptin). It is not uncommon for locally generated abbreviations for treatment regimens to be used at a given health system, further complicating the problem. NLP systems used with text referring to chemotherapeutic regimens must be aware of these nuances and must be extensible and context-aware.
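The requirement that regimen recognition be extensible and context-aware can be sketched in Python as follows. The two regimens come from the paragraph above; the context cue words, the function name, and the site-specific abbreviation in the usage note are hypothetical, and a real system would rely on a curated vocabulary and a trained disambiguation model rather than a keyword check.

```python
import re

# Regimens named in the text; a real table would be far larger and curated.
REGIMENS = {
    "AC": ["doxorubicin", "cyclophosphamide"],
    "TCH": ["docetaxel", "carboplatin", "trastuzumab"],
}

# Hypothetical cue words suggesting that the sentence is about chemotherapy.
CONTEXT = re.compile(
    r"\b(cycles?|regimen|chemo(?:therapy)?|received|started|completed)\b",
    re.IGNORECASE,
)

# Candidate abbreviations: runs of two or more capitals, digits, or hyphens.
TOKEN = re.compile(r"\b[A-Z][A-Z0-9-]+\b")


def expand_regimens(sentence, local_abbreviations=None):
    """Map regimen abbreviations in a sentence to their component drugs.

    local_abbreviations lets a site add its own shorthand without code changes.
    Sentences without a treatment cue are skipped to limit false positives.
    """
    table = {**REGIMENS, **(local_abbreviations or {})}
    if not CONTEXT.search(sentence):
        return {}
    return {tok: table[tok] for tok in TOKEN.findall(sentence) if tok in table}
```

With this sketch, `expand_regimens("Patient completed 4 cycles of AC.")` expands "AC" to its two drugs, while a sentence such as "AC joint tenderness noted." is left alone because it contains no treatment cue. A site could pass something like `{"XYZ1": ["drug-a", "drug-b"]}` (an invented entry) to register local shorthand.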

When attempting to implement NLP in oncology or any clinical discipline, there are additional factors to consider that may become barriers within an organization. These may include the following:

  • Research has shown that general clinical NLP systems cannot achieve optimal performance for all tasks in the medical domain. NLP systems built for a specific purpose often show good results at a given task, but performance tends to drop when these tools are moved between clinical note types, such as from pathology notes to discharge summaries. Similar problems occur when moving NLP algorithms from one organizational setting to another, such as between different hospitals, or when changing the application domain, such as moving a system developed for quality measurement to staging and grading cancer specimens.

  • The complexity of the data and its resulting distribution across multiple clinical reports from different departments (oncology and hematology, pathology, imaging, etc.) can present barriers to organizations.

Evolution of NLP Use Within the Field

Nonetheless, there is increasing interest within the cancer community in using NLP for a wide array of purposes. For example, the National Cancer Institute’s Surveillance, Epidemiology, and End Results (SEER) Program is currently working to support and maintain national cancer registry reporting through the use of NLP. The program deems this necessary because the cost of maintaining a national-level disease registry using human cancer registrars is becoming too high. Meanwhile, the American Society of Clinical Oncology’s CancerLinQ project uses NLP to retrospectively analyze oncology charts, with the aims of developing new treatment protocols and improving outcomes by giving clinicians tools to gauge adherence to clinical best practices for oncology.

Implementing NLP for an Oncology Use Case

Here we describe the steps that an organization should take to apply NLP successfully for a given application. Even if an institution already has an NLP system, using it for a new use case typically involves a series of steps to achieve the desired degree of performance on information extraction tasks. These steps are developing an information model for the data to be extracted, developing and annotating a sample corpus of text for use in training machine learning-based models, and finally, training and refining the model.

In the examples below, we show how we used the Melax CLAMP NLP engine to develop an information extraction system for a specific oncology use case: identifying diverse types of cancer diagnosis information in the pathology reports of lung cancer patients. While the information displays shown are from the CLAMP tool, the overall process would be similar for any machine learning-based NLP tool.

Steps Involved in Developing Information Extraction

We begin by developing a model to represent the various types of information we want to extract. The model shown in Figure 1 (below) is designed for the diagnosis section of cancer pathology notes and is based on the CAP cancer protocols. This model provides the backbone of NLP-based information extraction. The concepts and the relations defined in it are annotated and extracted in later steps.

Next, an annotation guideline is written. It contains detailed definitions of the concepts and relations in the above information model, with specific instructions and examples that clarify how to determine the boundary of each entity and what should and should not be annotated.

Figure 1 – Diagnosis of Cancer Pathology – Notes used for NLP Extraction
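To make the idea of an information model concrete, the Python sketch below expresses one as a set of entity types plus the relations allowed between them. The specific entity and relation names are hypothetical stand-ins and are not taken from Figure 1 or from the CAP protocols; the point is only that annotation and model output can both be checked against such a schema.

```python
# Hypothetical entity types; the actual model in Figure 1 may differ.
ENTITY_TYPES = {
    "histologic_type",
    "histologic_grade",
    "tumor_size",
    "tumor_site",
}

# Allowed relations as (head entity type, relation name, tail entity type).
RELATION_TYPES = {
    ("histologic_type", "has_grade", "histologic_grade"),
    ("histologic_type", "has_size", "tumor_size"),
    ("histologic_type", "located_at", "tumor_site"),
}


def is_valid_relation(head_type, relation, tail_type):
    """Check a proposed relation against the information model."""
    return (head_type, relation, tail_type) in RELATION_TYPES


def unknown_entity_types(annotations):
    """Return entity types used in (start, end, type) annotations but absent from the model."""
    return sorted({label for _, _, label in annotations} - ENTITY_TYPES)
```

Keeping the schema in one place means the annotation guideline, the annotation tool configuration, and the evaluation scripts can all refer to the same definitions.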

Then, a corpus of cancer pathology notes is annotated, usually by two or more annotators. This is an iterative process usually done in multiple rounds of annotation, with differences between the annotators being resolved between rounds. The annotation guidelines are usually updated during this process. Figure 2 (below) shows some examples of the annotated information.

Figure 2. Annotated information - Prostate Cancer
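One way to track progress across annotation rounds is to measure how closely two annotators agree and to list the spans that need adjudication. The Python sketch below is an assumption about how this could be done, not a description of CLAMP's tooling: annotations are modeled as (start, end, label) tuples, and agreement is the F1 score of exact matches between the two sets.

```python
def annotator_agreement(annotations_a, annotations_b):
    """Compare two annotators' entity annotations on the same document.

    Each annotation is a (start, end, label) tuple. Returns the exact-match
    F1 agreement and the sorted list of annotations made by only one annotator.
    """
    a, b = set(annotations_a), set(annotations_b)
    if not a and not b:
        return 1.0, []  # both annotators marked nothing: full agreement
    shared = a & b
    f1 = 2 * len(shared) / (len(a) + len(b))
    return f1, sorted(a ^ b)


# Invented example: the annotators disagree on the boundary of the grade span.
first = [(0, 14, "histology"), (20, 25, "grade")]
second = [(0, 14, "histology"), (20, 27, "grade")]
score, to_resolve = annotator_agreement(first, second)
```

Here the score is 0.5, and both versions of the grade span are returned for discussion. Boundary disagreements like this are typically what prompts an update to the annotation guideline.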

Putting Annotation to the Test

Half of the resulting annotated corpus is then used as input to develop models for named entity recognition and relation extraction. The other half of the corpus is set aside for use in evaluating the final system. Named entity recognition models can be developed using regular expressions, dictionary-based lookup, or machine learning-based methods. The relation extraction models are also developed using hybrid approaches (combining machine learning and rules) for optimized performance. Machine learning NLP models are generally built by submitting the annotated corpus to a machine learning kernel. NLP systems vary in how the resulting trained models are deployed into production, but in practice deployment is not difficult. Within the workbench of the Melax CLAMP system, the various hybrid named entity and relation extraction models, deployed in a sequential process, are termed a “pipeline.”
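The half-and-half division of the corpus described above can be sketched in a few lines of Python. The function name, the seed value, and the document identifiers are assumptions; the design choice worth noting is that the split is made at the document level and is reproducible, so the held-out half never leaks into model development.

```python
import random


def split_corpus(document_ids, seed=13):
    """Split document IDs into development and held-out halves, reproducibly."""
    ids = sorted(document_ids)          # fixed starting order
    random.Random(seed).shuffle(ids)    # seeded shuffle, independent of global state
    midpoint = len(ids) // 2
    return ids[:midpoint], ids[midpoint:]


# Invented identifiers for illustration.
development, held_out = split_corpus([f"note_{i:03d}" for i in range(10)])
```

Running the function again with the same seed returns the same two halves, which makes later evaluation results comparable across model revisions.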

Before deploying the resulting NLP pipeline, it should be rigorously evaluated for performance. For evaluation purposes, the part of the annotated corpus set aside earlier is run through the NLP pipeline and the results of the machine annotation are compared against the previous annotation. Table 1 lists the detailed performance on entity recognition and the end-to-end performance of recognizing both entities and relations. By reviewing these metrics, a data scientist can easily see where performance needs to be improved.
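The comparison of machine annotations against the held-out human annotations is commonly summarized as precision, recall, and F1. The Python sketch below assumes exact-match scoring over (start, end, label) tuples and uses invented sample data; it does not reproduce the figures in Table 1 or the exact scoring rules used by CLAMP.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall, and F1 between two collections of annotations."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def scores_by_label(gold, predicted):
    """Break the scores down per entity type to show where performance is weak."""
    labels = sorted({g[2] for g in gold} | {p[2] for p in predicted})
    return {
        label: precision_recall_f1(
            [g for g in gold if g[2] == label],
            [p for p in predicted if p[2] == label],
        )
        for label in labels
    }


# Invented example: the pipeline misplaces the tumor site span.
gold = [(0, 14, "histology"), (20, 25, "grade"), (30, 35, "site")]
predicted = [(0, 14, "histology"), (20, 25, "grade"), (40, 45, "site")]
overall = precision_recall_f1(gold, predicted)
per_label = scores_by_label(gold, predicted)
```

The same functions can score relations end to end if each relation is represented as a tuple that includes both of its entities, so that a relation counts as correct only when its entities are also correct.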

Table 1. Evaluation results of information extraction from cancer pathology notes


While the field of oncology provides additional challenges to developing successful NLP applications, by understanding the data and by employing sufficiently robust NLP tools, useful, targeted applications can be developed and deployed.

Care must be taken to understand the components of oncology that are unique to the field and to develop appropriate data models. All NLP models must be evaluated properly so that their performance and limitations are understood before they are deployed into a production environment.

For More Information

Methods/results for the cancer diagnoses information extraction use case: Proceedings of MedInfo 2020

The CLAMP NLP toolkit features and architecture: The Melax website and Journal of the American Medical Informatics Association, Vol. 25, No. 3, March 2018, pp. 331–336.

CLAMP is available for commercial use at . A non-commercial version is available for academic use at
