In our previous post, NLP & Genomics: Play with a full deck, we covered the reasons we want to use Natural Language Understanding to extract information locked away in unstructured genomics publications. We also demonstrated the kinds of information we’re able to extract with our own NLP.
In this post, we’re going to cover some of the linguistic hurdles AI needs to overcome in order to understand publications the way humans do.
NER or Named Entity Recognition
Named Entity Recognition is just what it sounds like: finding a named entity in a text. Sounds simple enough, doesn’t it? Feed your computer a list of words and it will find them within the text. In data science, we call this technique Bag of Words.
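A bag-of-words matcher can be sketched in a few lines of Python. The gene list and the sentence below are illustrative stand-ins, not a real gene dictionary:

```python
import re

# Hypothetical mini-dictionary of gene symbols (illustrative only)
GENE_LIST = {"BRCA1", "TP53", "MCAT"}

def bag_of_words_ner(text, vocabulary):
    """Return every token of `text` that appears in `vocabulary`."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [tok for tok in tokens if tok in vocabulary]

hits = bag_of_words_ner("TP53 regulates the cell cycle.", GENE_LIST)
# hits == ["TP53"]
```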
But not so fast! Bag of Words will run into several types of problems when running on genomics publications.
Gene or Gene
Let’s take the example of finding publications about genes. Gene symbols are often just a few letters, which can also happen to be a well-known acronym. For example, MCAT is a gene, but also the Medical College Admission Test. If you searched for all publications containing MCAT, how many of the results do you think would actually refer to the gene?
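To see the problem concretely, here is the same naive matching applied to a sentence about the admission test; it flags MCAT even though no gene is being discussed. The gene list is an invented stand-in:

```python
import re

GENE_LIST = {"MCAT", "BRCA1"}  # illustrative mini-dictionary

def bag_of_words_ner(text, vocabulary):
    """Return every token of `text` that appears in `vocabulary`."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [tok for tok in tokens if tok in vocabulary]

sentence = "Students spent months preparing for the MCAT exam."
hits = bag_of_words_ner(sentence, GENE_LIST)
# hits == ["MCAT"] -- a false positive: the sentence is about the test, not the gene
```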
State your full name
Another problem we commonly run into is that entities can have several commonly used names. For example, breast cancer is also referred to as breast and ovarian cancer. If you’re using a bag of words, it’s unlikely your list will contain every possible variation of an entity’s name. This gets even more complex with phenotypes, which clinicians can write up in a variety of ways.
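One common mitigation is a synonym table that maps every known surface form to a canonical identifier. The names and IDs below are our own illustration, not a real ontology, and the catch is exactly the one described above: the table is never complete.

```python
# Illustrative synonym table: every surface form maps to one canonical ID
SYNONYMS = {
    "breast cancer": "DISEASE:breast_cancer",
    "breast carcinoma": "DISEASE:breast_cancer",
    "breast and ovarian cancer": "DISEASE:breast_cancer",
}

def normalize(mention):
    """Map a mention to its canonical ID, or None if it is unknown."""
    return SYNONYMS.get(mention.lower())

normalize("Breast Carcinoma")  # "DISEASE:breast_cancer"
normalize("mammary tumour")    # None -- a variant missing from the table
```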
Who are you?
M protein is a protein, right? Not so fast. Run a quick search on Wikipedia for M protein and you’ll find:
M protein – the virulence factor produced by strep.
Protein M – the protein you expected.
Myeloma protein – also sometimes referred to as M protein.
MYOM2 – the gene that encodes Protein M, and which is itself also sometimes referred to as Protein M.
A single name referring to multiple entities is more common than you would expect, and a simple bag of words type solution won’t be able to identify the right entity.
That’s where disambiguation comes in. Disambiguation determines the most probable meaning of an ambiguous name: which of several entities a given mention actually refers to. This is especially relevant in genetics, since there is rarely a unified naming standard for genetic or medical entities.
In all of the examples above, disambiguation is needed in order to understand which entity is discussed in the publication. If MCAT is mentioned, is it the gene, in which case we’ll incorporate it into our database, or the test, in which case we won’t?
In the M protein example, disambiguation gets even more difficult, as a much deeper understanding of the context is needed in order to work out which entity is being discussed.
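A toy way to disambiguate is to score each candidate sense by keyword overlap with the surrounding sentence. The senses and context keywords below are invented for illustration; a real system would learn context signals from labeled data rather than use hand-picked word lists.

```python
import re

# Invented context keywords for each sense of "M protein" (illustrative only)
SENSES = {
    "M protein (strep virulence factor)": {"streptococcus", "virulence", "bacterial"},
    "myeloma protein": {"myeloma", "immunoglobulin", "serum"},
    "Protein M (encoded by MYOM2)": {"myomesin", "sarcomere", "muscle"},
}

def disambiguate(sentence):
    """Pick the sense whose keywords overlap most with the sentence."""
    words = set(re.findall(r"[a-z]+", sentence.lower()))
    return max(SENSES, key=lambda sense: len(SENSES[sense] & words))

s = "Serum electrophoresis detected an M protein spike typical of myeloma."
disambiguate(s)  # "myeloma protein"
```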
There is another category of words that needs disambiguating in publications: linguistic anaphora. Words like ‘it’, ‘he’, and ‘the gene’ only have meaning in the context of a publication. We need to understand which entity an anaphor is referring to in order to extract relevant textual information, even when the specific entity isn’t named.
This remains an open problem in NLP, and anybody wanting to solve it for a specific domain will have to develop their own tools and train domain-specific models.
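To sketch why anaphora resolution matters, here is a deliberately naive heuristic that resolves ‘it’ to the most recently mentioned entity. Real coreference resolution needs trained models; the entity list here is a hypothetical stand-in.

```python
import re

ENTITIES = {"BRCA1", "TP53"}  # hypothetical entity dictionary

def resolve_it(sentences):
    """Naively replace 'it' with the most recently seen entity."""
    last_entity = None
    resolved = []
    for sent in sentences:
        if last_entity:
            sent = re.sub(r"\bit\b", last_entity, sent)
        for tok in re.findall(r"[A-Za-z0-9]+", sent):
            if tok in ENTITIES:
                last_entity = tok
        resolved.append(sent)
    return resolved

text = ["BRCA1 was sequenced in all patients.",
        "We found that it carried a pathogenic variant."]
resolve_it(text)[1]  # "We found that BRCA1 carried a pathogenic variant."
```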
Relation extraction
Going back to our goal of holding a complete representation of genomics knowledge, we also need to understand the relationship between any two entities in a publication.
The easy way to do this is to decide that if a publication mentions two entities, they are related.
However, the relationships between any two entities can be quite varied: they may interact with, regulate, or block each other. The fact that two entities are mentioned in the same publication doesn’t mean we understand their relationship.
Some publications study multiple diseases and multiple genes, where the research supports only one or two connection mechanisms and refutes the rest. If you use the simplified method of tracking mentions in the same publication, every disease and gene in the paper will be assigned a positive relationship.
Another case where simply tracking mentions fails is when a paper refutes a gene-disease connection. Here the simplified method will classify this as a connection, when in fact it is the opposite.
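This failure mode is easy to reproduce: a co-occurrence rule reports a gene-disease pair even when the text explicitly refutes the link. The gene list, disease list, and abstract below are invented for illustration:

```python
import re

GENES = {"MYOM2", "BRCA1"}                 # illustrative gene list
DISEASES = {"myeloma", "cardiomyopathy"}   # illustrative disease list

def cooccurrence_relations(text):
    """Naive rule: every gene-disease pair in the text is 'related'."""
    words = set(re.findall(r"[a-z0-9]+", text.lower()))
    genes = {g for g in GENES if g.lower() in words}
    diseases = {d for d in DISEASES if d in words}
    return {(g, d) for g in genes for d in diseases}

abstract = "Our data refute any association between MYOM2 and cardiomyopathy."
cooccurrence_relations(abstract)  # {("MYOM2", "cardiomyopathy")} despite the refutation
```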
So in order to extract the key relationship in a publication, we must first understand all the relationships. This is what a relationship map between variant, gene and disease could look like in a single paper: quite a few possible connections!
The relation extraction algorithms must identify the key relationship out of all of these. We also extract secondary relationships, which will be useful in interpreting the case and building out our knowledge graph, as we showed in last week’s example:
In this article, there is a single key relationship, but we added complexity and depth to it. We were able to understand that this is a causal relationship, and not a regulation, is-a, or other type. We’re also able to describe additional data layers, like inheritance mode, body part, etc.
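One simple way to store such an enriched relationship is a typed triple with a bag of qualifiers for the extra data layers. The field names and example values here are our own illustration, not a published schema or a real curated assertion:

```python
from dataclasses import dataclass, field

@dataclass
class Relationship:
    subject: str      # e.g. a gene ID
    predicate: str    # e.g. "causes", "regulates", "is_a"
    obj: str          # e.g. a disease ID
    qualifiers: dict = field(default_factory=dict)  # extra data layers

# Illustrative example only -- invented identifiers
rel = Relationship(
    subject="GENE:EXAMPLE1",
    predicate="causes",
    obj="DISEASE:example_disease",
    qualifiers={"inheritance_mode": "autosomal recessive",
                "body_part": "muscle"},
)
```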
There are many other uses for NLP in genomics, for example around phenotypes, but we’ll cover those another day.