Extraction method

Extraction of key characteristics from epidemiological literature

We designed and implemented a rule-based approach, combined with relevant dictionaries, to recognise potential mentions of key epidemiological characteristics in an epidemiological corpus related to obesity. The approach was implemented in the MinorThird environment (Cohen, 2004). The rule-based methodology was selected for the following reasons:

  1. the lack of large annotated corpora required for training machine learning methods
  2. the need to select appropriate features for training machine learning techniques
  3. limited human resources for manual curation of the data
  4. the relatively regular structure of epidemiological texts, which allows rules to identify more efficiently the common syntactic expressions that suggest the presence of a key characteristic

More specifically, the methodology includes:

1. Creation of vocabularies

A number of semantic classes are identified using custom-made vocabularies that include both unique and synonymous terms in order to detect key characteristics. These vocabularies contain mostly generic terms describing the role of a concept in epidemiological studies, so they can potentially be reused in other tasks. A total of fourteen vocabularies were created and used. The dictionaries can be downloaded from here.
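
As a minimal sketch of how such a vocabulary might be organised, the Python snippet below maps each semantic class to canonical terms and their synonyms, then builds a lookup index. The class names, the terms and the `lookup` helper are illustrative assumptions; the project's fourteen vocabularies are not reproduced here.

```python
# Illustrative semantic classes and terms only (assumptions, not the project's
# actual vocabularies). Each canonical term lists its accepted surface forms.
VOCABULARIES = {
    "study_design": {
        "cross-sectional study": [
            "cross-sectional study",
            "cross sectional study",
            "cross-sectional survey",
        ],
        "cohort study": ["cohort study", "cohort analysis"],
    },
    "association_cue": {
        "associated with": ["associated with", "related to", "linked to"],
    },
}


def build_index(vocabularies):
    """Map every lower-cased surface form to (semantic class, canonical term)."""
    index = {}
    for semantic_class, entries in vocabularies.items():
        for canonical, variants in entries.items():
            for variant in variants:
                index[variant.lower()] = (semantic_class, canonical)
    return index


TERM_INDEX = build_index(VOCABULARIES)


def lookup(term):
    """Return (semantic class, canonical term) for a surface form, or None."""
    return TERM_INDEX.get(term.strip().lower())
```

Keeping synonyms under one canonical term means a rule can refer to a semantic class by name without listing every surface form.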

2. Identification of biomedical concepts

The Specialist lexicon was applied to recognise biomedical concepts that potentially belong to exposures, outcomes and covariates. To expand the dictionary resources, the related corpus of epidemiological abstracts was processed with the C-value automatic term recognition (ATR) method to extract multi-word candidate concepts (Frantzi and Ananiadou, 1997). This method was chosen because it has been applied successfully, with encouraging performance, to various biomedical text mining problems (Nenadić et al., 2004). The candidates were then filtered to remove concepts of a non-biomedical nature, which improved the resulting dictionary. A common stop-word list created by Fox (1989) was used and was manually expanded through empirical validation on the training set. The Specialist lexicon and the ATR dictionary can be downloaded from here and here, respectively.
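
To illustrate the term extraction step, the Python sketch below scores multi-word candidates with the commonly cited C-value formulation: a candidate's frequency is weighted by the log of its length and discounted by the average frequency of the longer candidates that contain it. The stop-word filter, the candidate frequencies and all function names are assumptions for illustration; the linguistic filter and thresholds actually used in the project are not shown.

```python
import math

# Hypothetical stop words; the project used an expanded version of Fox's list.
STOP_WORDS = {"the", "of", "and", "in"}


def drop_stopword_candidates(freqs):
    """Assumed filter: discard candidates that contain any stop word."""
    return {t: f for t, f in freqs.items() if not STOP_WORDS.intersection(t)}


def is_nested(short, long):
    """True if `short` occurs as a contiguous token sequence inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def c_value(freqs):
    """Score candidates (token tuples mapped to corpus frequency) by C-value."""
    scores = {}
    for term, freq in freqs.items():
        # Longer candidate terms in which this term appears nested.
        containers = [
            other for other in freqs
            if len(other) > len(term) and is_nested(term, other)
        ]
        weight = math.log2(len(term))  # zero for single-word terms
        if not containers:
            scores[term] = weight * freq
        else:
            nested = sum(freqs[other] for other in containers)
            scores[term] = weight * (freq - nested / len(containers))
    return scores


# Made-up frequencies, for illustration only.
candidates = {
    ("body", "mass", "index"): 10,
    ("mass", "index"): 12,
    ("high", "body", "mass", "index"): 3,
    ("results", "of", "the", "study"): 7,
}
scores = c_value(drop_stopword_candidates(candidates))
```

In this toy input, "mass index" is penalised because most of its occurrences fall inside longer candidates, while "body mass index" keeps a high score.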

3. Rules for epidemiological characteristics extraction

A set of text-based rules is applied to the corpus in combination with the respective dictionaries. The rules were designed on the basis of semantic patterns observed in epidemiological text. The semantic patterns are specific combinations of lexical expressions and semantic classes (identified through the vocabularies) that indicate the presence of a key characteristic in text. More than one semantic pattern can exist in an epidemiological study, referring either to one characteristic or to several.

e.g., “X is associated with Y”, “after adjusting for confounders such as X, Y, Z”

The lexical expressions range from verbs to prepositions and noun phrases, and they are translated into rules using regular expressions and vocabularies. By inspecting the training set, semantic patterns for each characteristic were identified and incorporated into the rule design. After analysing the development set, the rules were expanded to cover further similar semantic patterns for each characteristic. The generated rules can be downloaded from here.
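
The two example patterns can be sketched as regular expressions built from a small cue vocabulary. This Python version is only an approximation under stated assumptions: the actual rules were written for MinorThird, and the cue list, the crude character-class stand-in for noun phrases and the helper names here are hypothetical.

```python
import re

# Hypothetical cue vocabulary for the "X is associated with Y" pattern.
ASSOCIATION_CUES = ["associated with", "related to", "linked to"]
CUE_ALTERNATION = "|".join(re.escape(cue) for cue in ASSOCIATION_CUES)

# X and Y are approximated by runs of word characters, hyphens and spaces.
ASSOCIATION_RULE = re.compile(
    rf"(?P<x>[\w\- ]+?)\s+(?:is|was|are|were)\s+(?:{CUE_ALTERNATION})\s+"
    rf"(?P<y>[\w\- ]+)",
    re.IGNORECASE,
)

# "after adjusting for confounders such as X, Y, Z"
ADJUSTMENT_RULE = re.compile(
    r"adjust(?:ing|ed|ment)\s+for\s+(?:confounders\s+such\s+as\s+)?"
    r"(?P<items>[\w\- ,]+)",
    re.IGNORECASE,
)


def find_association(sentence):
    """Return the (X, Y) spans of the first association pattern, or None."""
    match = ASSOCIATION_RULE.search(sentence)
    if match is None:
        return None
    return match.group("x").strip(), match.group("y").strip()


def find_covariates(sentence):
    """Return the listed items following an adjustment cue, or an empty list."""
    match = ADJUSTMENT_RULE.search(sentence)
    if match is None:
        return []
    parts = re.split(r",\s*|\s+and\s+", match.group("items"))
    return [part.strip() for part in parts if part.strip()]
```

For example, `find_association("Physical inactivity was associated with obesity.")` yields the two spans around the cue, and `find_covariates` splits a list such as "age, sex and smoking" into separate candidate covariates.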

4. Pattern matching

To identify a mention, a semantic pattern has to be matched by the related rule and the respective vocabularies. The concept within the matched pattern is then recognised through the Specialist lexicon and the ATR dictionary resources. Consequently, candidate mentions of epidemiological concepts are tagged in the text when they match any of the concepts in the applied dictionaries.
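
The two-stage matching described above can be sketched as follows: a rule first locates the pattern, and a span is tagged only if it contains a concept from the dictionary. The concept set stands in for the Specialist lexicon and ATR dictionary, and the single rule, role names and `tag_mentions` function are hypothetical simplifications.

```python
import re

# Stand-in for the Specialist lexicon plus ATR dictionary (hypothetical entries).
CONCEPT_DICTIONARY = {"obesity", "physical inactivity", "body mass index"}

# One simplified rule; the role names are assumptions for this sketch.
RULE = re.compile(
    r"(?P<exposure>[\w\- ]+?)\s+(?:is|was|are|were)\s+associated\s+with\s+"
    r"(?P<outcome>[\w\- ]+)",
    re.IGNORECASE,
)


def tag_mentions(sentence):
    """Return (role, concept) pairs for rule-matched spans found in the dictionary."""
    mentions = []
    match = RULE.search(sentence)
    if match is None:
        return mentions  # stage 1 failed: no semantic pattern in this sentence
    for role in ("exposure", "outcome"):
        span = match.group(role).lower()
        # Stage 2: keep the longest dictionary concept that occurs in the span.
        hits = [
            concept for concept in CONCEPT_DICTIONARY
            if re.search(rf"\b{re.escape(concept)}\b", span)
        ]
        if hits:
            mentions.append((role, max(hits, key=len)))
    return mentions
```

A sentence that matches the pattern but contains no dictionary concept produces no mentions, which mirrors the requirement that both the rule and the dictionaries must agree before a mention is tagged.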

    References

  • Cohen WW. MinorThird: Methods for Identifying Names and Ontological Relations in Text using Heuristics for Inducing Regularities from Data, http://github.com/TeamCohen/MinorThird/, 2004.
  • Fox C. A stop list for general text. In ACM SIGIR Forum, vol. 24, no. 1-2, pp. 19-21. ACM, 1989.
  • Frantzi K, Ananiadou S. Automatic Term Recognition using Contextual Cues. In Proceedings of the 3rd DELOS Workshop, Zurich, Switzerland, 1997.
  • Nenadić G, Ananiadou S, McNaught J. Enhancing automatic term recognition through recognition of variation. In Proceedings of COLING 2004, Geneva, pp. 604-610, 2004.