Text Mining for Pre-screening of Cancer Clinical Trials

Sogeti Labs
August 17, 2016

Big data / NoSQL Cassandra / SOLR / Natural Language Processing – Text Mining for Pre-screening of Cancer Clinical Trials.

Cancer clinical trials are research studies that test the effectiveness of a new medical treatment on cancer patients. They are key to medical progress, and their success depends largely on the number of patients enrolled.

Pre-screening patients manually requires lengthy investigation and repeated matching against patients’ records within a limited time frame.

Given the large amount of money also spent on this phase, automating the eligibility pre-screening process is a promising and beneficial approach for cancer treatment.

Automating this process is essentially an information retrieval task. Medical records, which originate mainly from surgical pathology laboratories, are a rich source of unstructured data. They are written in natural (human) language, which is complex and difficult for a machine to process. Handling this type of data requires a structuring phase that extracts useful information and gives the machine the knowledge it needs, in effect translating human language into a machine-readable form.

Text Mining and Natural Language Processing (NLP), combined, are a solid solution for representing the valuable information stored in medical records. Both deal with free text, and their main objective is to extract non-trivial knowledge from it. NLP encompasses everything from information retrieval to terminology extraction, text classification, spelling correction and sentiment analysis. NLP methods rely heavily on probability theory, statistics and machine learning. They also draw on linguistic concepts, grammatical structure and the lexicon.

Cancer research has recently been benefiting from advances in Text Mining and applying them to clinical decisions. More precisely, automating the matching of patients to cancer clinical trials has been the subject of many studies and solutions dealing with information retrieval from medical records. Working with cancer data means covering hundreds of cancer diseases with a very large lexicon. Many medical terminologies have been built to group medical concepts and thus provide a unified lexicon for the medical field. These terminologies, mainly UMLS, SNOMED and CIMO, are a major component of Natural Language systems designed for the medical field. They serve as a link between patient data and the Text Mining system, enriching clinical records and supplying synonyms for medical concepts.


To get a clear view of how Text Mining and NLP can help automate clinical trial pre-screening, let’s look more closely at the most widely used methods for processing the natural language stored in clinical data. First, recall that the objective is to extract medical concepts and semantic types from both the clinical trial criteria and the patient data. NLP provides a semantic representation of natural language sentences in order to map them to their intended meaning. It uses either rule-based algorithms or, for more complex language processing, machine learning.

Most automated patient pre-screening systems are rule-based. They are simple, fast and generally preferred for deployment. Such methods perform well on simple types of information, but for complex data, Machine Learning (ML) algorithms, although a black box for clinicians, are more robust and perform well. Rule-based models are mainly used for pre-processing medical text: tokenization, sentence parsing, redundancy removal, etc. After the free text has been pre-processed, an assertion detection phase follows in order to detect negation. The NLP system also tries to detect medical terms using different medical terminologies. The other approach is to use Machine Learning models for the same purpose, by analysing a set of documents or individual sentences that have been hand-annotated with the correct values to be learned. The main ML algorithms used for NLP include Naïve Bayes, Support Vector Machines and Random Forests. They take as input a large set of features derived from patients’ records and try to learn rules from the annotated examples. ML can also be used to learn from previously selected patients’ data, by detecting the features that explain enrollment in previous clinical trials.
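To make the rule-based steps more concrete, here is a minimal sketch of sentence splitting, tokenization, term detection and a simple window-based negation check. It is not the code used in the project: the trigger list, the term list, the window size and every function name are assumptions chosen for illustration, and the sample text is in English for readability.

```python
import re

# Hypothetical, deliberately tiny lexicons. A real system would take its terms
# from a terminology such as UMLS or SNOMED and use a fuller trigger list.
NEGATION_TRIGGERS = {"no", "not", "without", "denies", "negative"}
MEDICAL_TERMS = {"metastasis", "carcinoma", "chemotherapy", "tumor"}
WINDOW = 4  # assumed number of tokens after a trigger that count as negated


def split_sentences(text):
    """Naive sentence splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokenize(sentence):
    """Lowercase the sentence and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", sentence.lower())


def detect_assertions(text):
    """Return (term, is_negated) pairs in order of appearance."""
    findings = []
    for sentence in split_sentences(text):
        remaining = 0  # tokens still covered by the most recent trigger
        for token in tokenize(sentence):
            if token in NEGATION_TRIGGERS:
                remaining = WINDOW
                continue
            if token in MEDICAL_TERMS:
                findings.append((token, remaining > 0))
            if remaining > 0:
                remaining -= 1
    return findings


report = (
    "Invasive carcinoma of the left breast. "
    "No evidence of metastasis. Patient received chemotherapy."
)
for term, negated in detect_assertions(report):
    print(term, "NEGATED" if negated else "affirmed")
```

The negation scope is reset at every sentence boundary, which is why the sentence splitter runs first. A fixed token window is a crude stand-in for real scope rules, but it shows why assertion detection has to come after pre-processing.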

Once all useful information has been retrieved from the unstructured data and expanded with all possible medical hyponyms from medical ontologies, it serves as an information retrieval data source for matching patients against the inclusion and exclusion criteria of clinical trials. Given a cancer clinical trial and the patients encountered, the Text Mining system supplies clinicians with a restricted list of eligible patients, significantly reducing the time and effort of manual pre-screening.
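The matching step can be pictured as set operations over extracted concepts. The sketch below is only an illustration under simplifying assumptions: each patient is reduced to a set of affirmed (non-negated) concepts, the synonym groups are a hypothetical stand-in for an ontology lookup, and a patient is kept when every inclusion concept is present and no exclusion concept is. All names and data are invented for the example.

```python
# Hypothetical synonym groups standing in for an ontology lookup.
SYNONYM_GROUPS = [
    {"breast carcinoma", "breast cancer", "mammary carcinoma"},
    {"metastasis", "metastatic disease"},
]


def expand(concepts):
    """Add every member of each synonym group that the concepts touch."""
    original = set(concepts)
    expanded = set(original)
    for group in SYNONYM_GROUPS:
        if group & original:
            expanded |= group
    return expanded


def is_eligible(patient_concepts, inclusion, exclusion):
    """All inclusion concepts present and no exclusion concept present."""
    concepts = expand(patient_concepts)
    return set(inclusion) <= concepts and not (set(exclusion) & concepts)


def prescreen(patients, inclusion, exclusion):
    """Return the sorted ids of patients who pass the criteria."""
    return sorted(
        patient_id
        for patient_id, concepts in patients.items()
        if is_eligible(concepts, inclusion, exclusion)
    )


# Invented patients, each described by affirmed concepts only.
patients = {
    "P1": {"breast cancer"},
    "P2": {"mammary carcinoma", "metastatic disease"},
    "P3": {"lung carcinoma"},
}
shortlist = prescreen(
    patients, inclusion={"breast carcinoma"}, exclusion={"metastasis"}
)
print(shortlist)
```

Real criteria are rarely pure concept lists (they include ages, dates and lab values), so this covers only the concept-matching part of the problem. The output is a shortlist for clinicians to confirm, not a final eligibility decision.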


Because of the large volume of data to be managed, we selected and designed a big data architecture based on Datastax. Why Datastax? Because it supports Hadoop, Spark, Cassandra and SOLR out of the box. Deploying it through the MS AZURE portal, it took around one hour to get several nodes up and ready to use.

We imported all the data into a Cassandra database, and SOLR then indexed it, so we were quickly able to explore and search the data.

We added synonyms from SNOMED and UMLS in order to use SOLR’s synonym search feature. With dedicated NLP developments in PYTHON, we implemented Natural Language Processing features (negation, semantic improvements, medical term identification, stemming, etc.) to improve the performance of the pre-screening process.
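As an illustration of how terminology synonyms might be fed to SOLR, the sketch below turns groups of equivalent terms into lines in the comma-separated format that Solr synonym files use for equivalent terms. The post does not say how the synonyms were actually loaded, so the helper name, the sample groups and the choice to skip terms containing a comma or "=>" (both have special meaning in that file format) are all assumptions.

```python
def to_solr_synonym_lines(synonym_groups):
    """Build one 'a, b, c' line per group of equivalent terms.

    Terms are lowercased and de-duplicated. Terms containing ',' or '=>' are
    skipped here because they would need escaping in a Solr synonym file.
    Groups left with fewer than two terms are dropped.
    """
    lines = []
    for group in synonym_groups:
        terms = sorted(
            {
                term.strip().lower()
                for term in group
                if term.strip() and "," not in term and "=>" not in term
            }
        )
        if len(terms) > 1:
            lines.append(", ".join(terms))
    return lines


# Invented sample groups; in practice they would be exported from a terminology.
groups = [
    ["Breast cancer", "breast carcinoma", "Mammary carcinoma"],
    ["tumor"],
    ["Neoplasm", "tumour", "neoplasm "],
]
synonym_lines = to_solr_synonym_lines(groups)
print("\n".join(synonym_lines))
```

The resulting lines would typically be written to the synonyms file referenced by the field type's synonym filter in the SOLR schema, so that a search for one term also matches its equivalents.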


By the end of 2016, we will complete the test phase and add some improvements that take user feedback into account.

A new post will be published then, with the final conclusions and results.


Drawing on the scientific literature, we were able to design a cancer clinical trial pre-screening solution in French. Several products exist in English, but no solution is available for France or French-speaking countries.

The business benefits of our solution are already obvious: by suggesting a list of patients to the clinical trials team in a few minutes instead of several days of manual screening, it lets the team focus on confirming the proposed results instead of screening tons of patient records and data.

Contributor: Bilal AZENNOUD, Data Scientist, SOGETI France

About the author

SogetiLabs gathers distinguished technology leaders from around the Sogeti world. It is an initiative explaining not how IT works, but what IT means for business.


    2 thoughts on “Text Mining for Pre-screening of Cancer Clinical Trials”

    1. Hello Xavier, great post. This topic is at the top of the agenda of every healthcare organization working on it. Could you provide us with some indicators regarding the time saved?

    2. Hi Jacques,
      We expect to save at least 90% of the time. Where they used to spend one week with 3 people on pre-screening, we hope to provide good results in a few hours. At first, they will then have to refine the results. Later, as we get more data, we will be able to improve the supervised machine learning algorithm and save more time. The key to success is heavy use of the application: the more they use it, the more data they provide, which allows us to improve the machine learning algorithm.
