Epidemiological Text Mining

Normalization method

Normalization of identified key characteristics

A normalization procedure is applied to recognize descriptive attributes that help in understanding the epidemiological information related to a health problem. Before normalization, similar or identical mentions are eliminated (with the exception of effect size mentions).

It was hypothesized that the longest span is the most informative for each characteristic. Normalization is therefore performed only on mentions considered unique, which are determined by running a string comparison module between the longest mention and each of the remaining mentions of that characteristic. Mentions similar to the longest one are ignored; those that are not similar are kept for normalization. Effect size mentions are excluded from this procedure, as epidemiological abstracts report the same effect size only once.
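
The filtering step can be sketched roughly as follows. This is a minimal illustration, not the module used in the project: the functions edit_distance, similarity and select_unique_mentions are hypothetical names, the score is assumed to be one minus the edit distance divided by the longer string's length, and the 0.4 cut-off is borrowed from the exposure/outcome step described further down rather than stated for this step.

```python
from typing import List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: inserts, deletes and substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution = previous[j - 1] + (char_a != char_b)
            current.append(min(previous[j] + 1,      # delete char_a
                               current[j - 1] + 1,   # insert char_b
                               substitution))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed scoring: 1.0 for identical strings, 0.0 for nothing in common."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def select_unique_mentions(mentions: List[str], threshold: float = 0.4) -> List[str]:
    """Keep the longest mention plus every mention that is not similar to it."""
    if not mentions:
        return []
    longest_index = max(range(len(mentions)), key=lambda i: len(mentions[i]))
    longest = mentions[longest_index]
    kept = [longest]
    for i, mention in enumerate(mentions):
        # The threshold value is an assumption made for this sketch.
        if i != longest_index and similarity(longest, mention) < threshold:
            kept.append(mention)
    return kept


mentions = ["overweight or obese adolescents",
            "overweight and obese adolescents",
            "smokers"]
print(select_unique_mentions(mentions))
```

In this example the near-duplicate "overweight or obese adolescents" is dropped, while "smokers" is kept as a separate mention for normalization.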

The extracted mentions of each characteristic are normalized using a variety of methods:

1. study design - mapping to the adapted Ontology of Clinical Research (OCRe).

Study design mentions are normalized using an adapted version of OCRe (Tu et al. 2009). OCRe's study design branch was expanded to include experimental research, observational types such as "correlational", and secondary research types such as "meta-analyses". Each identified study design is mapped to one of the ontology nodes (23 in total). To match the identified study design mentions to the ontology, we used a string comparison module (Tresoldi, 2009), which is based on previous work by Yang et al. (2001). The module compares two strings and estimates their similarity through edit distance (insertions, deletions, substitutions); the similarity score is based on the number of edit operations performed. Since the aim is to represent the key epidemiological characteristics in each abstract, it was assumed that the longer a mention is, the more information it includes, and therefore the more detailed the normalization that can be performed. The match that returns the highest score is chosen as the normalized version of the input study design. The normalized span is then classified into the higher-level nodes of the ontology, and any more detailed information (where available) is stored as a study attribute, e.g., "prospective cohort study" is mapped to "cohort study", with "prospective" as the additional attribute.
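
A rough sketch of this mapping is given below. Several things in it are assumptions made for illustration: NODES is a small hypothetical subset of the 23 node labels, MODIFIERS is a hypothetical attribute vocabulary, and the sketch sets modifiers aside before scoring (so that they do not dilute the edit-distance score), whereas the text above describes attributes being stored after the match. The scoring formula is likewise assumed.

```python
from typing import List, Tuple

# Hypothetical subset of node labels; the adapted ontology has 23 nodes.
NODES = ["cohort study", "case-control study", "cross-sectional study",
         "randomized controlled trial", "meta-analysis"]
# Hypothetical vocabulary of modifiers stored as study attributes.
MODIFIERS = {"prospective", "retrospective"}


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed scoring: one minus edit distance over the longer length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def normalize_study_design(mention: str) -> Tuple[str, List[str], float]:
    """Return (best-matching node, attributes, similarity score)."""
    tokens = mention.lower().split()
    attributes = [t for t in tokens if t in MODIFIERS]
    core = " ".join(t for t in tokens if t not in MODIFIERS)
    # The node with the highest similarity score is taken as the normalized form.
    score, node = max((similarity(core, node), node) for node in NODES)
    return node, attributes, score


print(normalize_study_design("prospective cohort study"))
print(normalize_study_design("Retrospective case control studies"))
```

The first call returns "cohort study" with "prospective" as its attribute; the second shows that a plural, unhyphenated variant still resolves to "case-control study".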

2. population - identification of age, gender, nationality and ethnicity.

Identified population mentions are normalized according to specific attributes (age, gender, nationality, ethnicity). Before normalization, the longest span in each abstract is chosen. Each attribute takes specific values: age (juvenile, 0-19 years old; early adulthood, 20-39 years old; middle adulthood, 40-59 years old; late adulthood, 60+ years old), gender (male, female, mixed), nationality (229 nationalities obtained from here and here) and ethnicity (26 ethnicities obtained from here and here). Nationality, ethnicity and gender are detected using the respective dictionaries, while age is recognised by applying regular expressions. If any attribute is not detected in the chosen longest span, a script is applied to the other identified mentions to look for the missing information.
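
The attribute extraction and the fallback to shorter mentions might look roughly like the sketch below. The age bands come from the text above; everything else is assumed for illustration: the tiny GENDER_TERMS and NATIONALITIES dictionaries stand in for the real ones, the two regular expressions cover only a couple of common age phrasings, and an age range that spans several bands is reported as a list of bands. Ethnicity would be handled in the same way as nationality and is left out for brevity.

```python
import re
from typing import Dict, List, Optional

AGE_BANDS = [(0, 19, "juvenile"), (20, 39, "early adulthood"),
             (40, 59, "middle adulthood"), (60, None, "late adulthood")]
# Hypothetical miniature dictionaries standing in for the real ones.
GENDER_TERMS = {"male": {"man", "men", "male", "males", "boy", "boys"},
                "female": {"woman", "women", "female", "females", "girl", "girls"}}
NATIONALITIES = {"british", "chinese", "greek"}

AGE_RANGE = re.compile(r"(\d{1,3})\s*(?:-|to)\s*(\d{1,3})\s*(?:years?|yrs?)", re.I)
AGE_OPEN = re.compile(
    r"(\d{1,3})\s*(?:\+|(?:years?|yrs?)\s*(?:and|or)\s*(?:older|over|above))", re.I)


def age_bands(low: int, high: Optional[int]) -> List[str]:
    """Return every band that overlaps [low, high]; high=None means open-ended."""
    bands = []
    for band_low, band_high, label in AGE_BANDS:
        upper = float("inf") if band_high is None else band_high
        if low <= upper and (high is None or high >= band_low):
            bands.append(label)
    return bands


def extract_attributes(mention: str) -> Dict[str, object]:
    tokens = set(re.findall(r"[a-z]+", mention.lower()))
    has_male = bool(tokens & GENDER_TERMS["male"])
    has_female = bool(tokens & GENDER_TERMS["female"])
    gender = "mixed" if has_male and has_female else (
        "male" if has_male else "female" if has_female else None)
    age = None
    match = AGE_RANGE.search(mention)
    if match:
        age = age_bands(int(match.group(1)), int(match.group(2)))
    else:
        match = AGE_OPEN.search(mention)
        if match:
            age = age_bands(int(match.group(1)), None)
    nationality = next(iter(sorted(tokens & NATIONALITIES)), None)
    return {"age": age, "gender": gender, "nationality": nationality}


def normalize_population(mentions: List[str]) -> Dict[str, object]:
    """Start from the longest mention; fill missing attributes from the others."""
    result: Dict[str, object] = {"age": None, "gender": None, "nationality": None}
    for mention in sorted(mentions, key=len, reverse=True):
        for key, value in extract_attributes(mention).items():
            if result[key] is None and value:
                result[key] = value
    return result


print(normalize_population([
    "postmenopausal women living in urban areas of the country",
    "Greek participants aged 60 years and older",
]))
```

Here the longest mention supplies only the gender, and the age band and nationality are filled in from the shorter mention.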

3. exposures, outcomes and covariates - mapping to UMLS.

Mentions of exposures, outcomes and covariates that appear more than once in the same abstract are filtered out using the string comparison module. Any span with a similarity score below 40.0% is considered a different concept and is incorporated into the document-level representation of the abstract's information. Normalization here is essentially the normalization of biomedical concepts, so the state-of-the-art software for this task, MetaMap (Aronson and Lang 2010), was applied. Each concept is classified into one of 135 UMLS semantic categories and then clustered into one of the 15 UMLS semantic groups (McCray et al. 2001). The greatest challenge is resolving ambiguity when two or more UMLS concepts share a common synonym; for this reason, the Word Sense Disambiguation (WSD) option in MetaMap was used.

4. effect size - identification of effect size value, concept, confidence interval and effect size type.

Effect size mentions are normalized through the application of regular expressions that focus on their individual attributes. Effect sizes follow a relatively structured format composed of different types of data, which makes their processing relatively straightforward. Usually, the extracted effect size mentions contain the following attributes (not necessarily in this order):

a) the effect size measure type (adjusted odds ratio, odds ratio, hazard ratio, relative risk, prevalence, incidence, adjusted relative risk, adjusted hazard ratio);
b) the respective (numeric) value of the effect size measure (usually a percentage);
c) confidence interval (an observational numeric interval indicating the reliability of an estimate);
d) the concept the effect size is linked to (either as an exposure or as an outcome).

Since the same abstract is not likely to report the same effect size more than once, the process of eliminating any spans that are similar or identical was not followed.
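
As an illustration of attributes a) to c) listed above, the following sketch parses the measure type, value and confidence interval from an effect size span. The regular expressions and the parse_effect_size function are assumptions written for this example, not the project's actual patterns; they handle only spelled-out measure names and "N% CI low-high" style intervals. Linking the effect size to its exposure or outcome concept (attribute d) is not covered.

```python
import re
from typing import Dict, Optional

MEASURE_TYPES = ["adjusted odds ratio", "odds ratio", "hazard ratio",
                 "relative risk", "prevalence", "incidence",
                 "adjusted relative risk", "adjusted hazard ratio"]
# Longer names first, so "adjusted odds ratio" wins over "odds ratio".
TYPE_RE = re.compile(
    "|".join(re.escape(t) for t in sorted(MEASURE_TYPES, key=len, reverse=True)),
    re.I)
NUMBER = r"\d+(?:\.\d+)?"
CI_RE = re.compile(
    rf"(\d{{1,2}})\s*%\s*(?:CI|confidence interval)\D{{0,5}}"
    rf"({NUMBER})\s*(?:-|to|,)\s*({NUMBER})", re.I)
VALUE_RE = re.compile(rf"({NUMBER})\s*(%)?")


def parse_effect_size(span: str) -> Dict[str, Optional[object]]:
    result: Dict[str, Optional[object]] = {
        "type": None, "value": None, "is_percentage": False,
        "ci_level": None, "ci": None}
    type_match = TYPE_RE.search(span)
    if type_match:
        result["type"] = type_match.group(0).lower()
    remaining = span
    ci_match = CI_RE.search(span)
    if ci_match:
        result["ci_level"] = int(ci_match.group(1))
        result["ci"] = (float(ci_match.group(2)), float(ci_match.group(3)))
        # Remove the interval so its numbers are not mistaken for the value.
        remaining = span[:ci_match.start()] + span[ci_match.end():]
    value_match = VALUE_RE.search(remaining)
    if value_match:
        result["value"] = float(value_match.group(1))
        result["is_percentage"] = value_match.group(2) is not None
    return result


print(parse_effect_size("adjusted odds ratio 1.45 (95% CI 1.10-1.92)"))
print(parse_effect_size("prevalence of 12.5% (95% confidence interval: 10.1 to 14.9)"))
```

Removing the confidence interval before searching for the value is a design choice of this sketch: it lets the attributes appear in any order within the span.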

References

McCray AT, Burgun A, Bodenreider O. Aggregating UMLS semantic types for reducing conceptual complexity. Studies in health technology and informatics 1 (2001): 216-220.
Yang XQ, Yuan SS, Chun L, Zhao L, Peng S. Faster Algorithm of String Comparison. 2001.
Tresoldi T. [http://code.google.com/p/nltk/source/browse/trunk/nltk_contrib/nltk_contrib/stringcomp.py]. 2009.
Tu SW, Carini S, Rector A, Maccallum P, Toujilov I, Harris S. OCRe: an ontology of clinical research. 2009.
Aronson AR, Lang FM. An Overview of MetaMap: Historical Perspective and Recent Advances. J Am Med Inform Assoc. 2010 May-Jun;17(3):229-36.