Information for Prospective Postgraduate Students (2011/12)
We are always keen to have postgraduate research students in various areas of text mining and natural language processing. As a rule of thumb, you will need to have an excellent first degree in computer science or a related area (e.g. computational linguistics, mathematics, physics, bioinformatics), with very good programming experience and some experience in natural language processing (e.g. final year project, summer internship, an ad-hoc project). An MSc or publications in a related area will also be a distinct advantage.
The main theme of our research is feature engineering from unstructured documents written in natural languages. We investigate methodologies for the extraction of both explicit and implicit features from large collections of textual documents. Features can be terms, names, relations, co-occurrences, events, etc. Once engineered from text, the features can be used to provide understanding and reasoning over knowledge (e.g. by applying machine learning or data mining) - this discipline is referred to as text analytics, text mining or more generally natural language processing (NLP).
Here are some core text mining themes (please see below for details) that are currently the focus in our TEAM:
- Text analytics and sentiment analysis: identification of subjective opinion and sentiment features from user-generated content (e.g. blog mining, tweets, etc.);
- Extracting negations, contrasts and contradictions: identification of utterances that are negated, or contrast or contradict some other expressions (both explicit and implicit);
- Concept mining and structuring: learning and identification of concepts and terminology from text, including their structuring (internal and external);
- Temporal text analytics: identification of temporal expressions and their scope in text;
- Integrated text and data mining: combining the results from different perspectives using various methods from machine learning;
- Text processing middleware for the Semantic Web: building an infrastructure to support text mining solutions for the Semantic Web (identification of concepts, links, etc.);
and these are preferred application areas:
- Biology and biomedicine (molecular interactions, cancer studies, characterisation of molecular events, etc.)
- Bioinformatics and computational biology (tools, services, resources, methods)
- Clinical medicine and health-care (clinical decision support, quality of life monitoring)
- E-science, e-commerce and e-government (e.g. monitoring, tracking, dissemination of information)
- Engineering (knowledge management)
Application steps
You will be expected to have a passion for text processing, in addition to an excellent first degree in computer science or a related area. Some experience in natural language processing is very useful, whereas very good programming experience (in a combination of programming languages) is a must. If you believe you have all of these, send an email to Goran Nenadic (see below) with a full CV and a brief note on why you would like to do a PhD in our TEAM. Please allow some time for us to reply. Contact email: .
Funding
PhD studies are between 3 and 4 years, typically closer to 4 than to 3 years. There is only one route for securing funding: the candidate needs to be outstanding. There are 3 possible sources of funding:
- specific, pre-defined projects (NONE CURRENTLY),
- funding from the School of Computer Science (see here for details), and
- external funding (private, external bodies - e.g. foreign governments, etc.).
Note that because of conditions associated with some funding sources, studentships might be open to students "eligible for home fees only"; this includes UK and EU nationals; non-EU students should always check this page.
Environment
The School of Computer Science is one of the leading Schools in the UK, renowned for the excellence of its research. The world's first computer with internal memory was built in the School, and Alan Turing laid the foundations of Computer Science and Artificial Intelligence while in Manchester. The international reputation of our research is reflected in its high ranking in the last national Research Assessment Exercise (RAE), which places the School among the best five Computer Science departments in the UK and top in England for research power. The School has a vibrant research environment with more than 150 PhD students, 90 research staff and 70 academic staff.
Our research TEAM is part of the Text Mining/NLP research group, which hosts the UK National Centre for Text Mining. We are also affiliated to the Manchester Interdisciplinary BioCentre. The team is vibrant, diverse and very much international.
Selection of research topics
Text mining for biology, biomedicine and health-care
In general, the main objective of these topics is to develop solutions to locate, extract and present useful information and knowledge buried in various biomedical textual resources.
- Meta-data annotation of biomedical documents.
The goal of this project would be to develop a system that will generate various types of meta-data
from text automatically.
The main idea is to make use of both domain terms recognised in documents and existing databases.
Also, the project may include extraction of lexical, syntactic and contextual associations from
documents, and their further incorporation in meta-annotation.
- Integrated and contrastive text and data mining for biological research.
The aim is to use various results of text and data mining in order to integrate or contrast findings drawn from heterogeneous sources. A combination of machine learning approaches will be explored to optimise the feature engineering from both text and data resources.
- Understanding terminological coordinations.
We have shown that morpho-syntactic information is not sufficient for recognition and
identification of terminological coordination. The goal of this project is to investigate
alternative methods (such as background knowledge and statistics) to improve both precision
and recall of coordination extraction.
- Design and evaluation metrics for bio-text mining.
Text mining scenarios are small-scale, but real-world problems that are defined in close cooperation with
domain specialists in order to support solving a specific set of problems by text mining.
This project will design and evaluate a framework for text mining scenarios in various subdomains.
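As a point of reference for the evaluation-oriented topics above, precision and recall of an extraction system are usually computed by comparing its output with a manually annotated gold standard. The following is only a minimal sketch under that assumption; the function name and the example items are hypothetical and do not describe any existing tool used in the group.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) of predicted items against a gold standard.

    Both arguments are iterables of hashable items (e.g. extracted terms).
    Duplicates are ignored because the comparison is set-based.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets to avoid division by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical example: four extracted items, three gold items, two in common.
extracted = ["a", "b", "c", "d"]
annotated = ["a", "b", "e"]
p, r = precision_recall(extracted, annotated)
```

In this example two of the four extracted items are correct (precision 0.5) and two of the three gold items are found (recall 2/3).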
Text mining for e-science, e-commerce, e-health and e-government
Apart from biomedicine, other domains also generate huge document repositories. The main objective of these topics is to address specific application areas of e-science, e-commerce and e-government.
- Automatic terminology processing for e-science, e-commerce and e-government.
The aim is to investigate methods for automatic identification of terms in
these domains. This will include term recognition, term classification and
mapping of terms into existing term-databases, or population of knowledge bases and ontologies.
- Terminology and ontology driven text mining.
The goal of this project would be to investigate possibilities for text mining
in the domains of e-commerce, engineering and legislation (e-government) using
existing, manually produced terminologies and ontologies, as well as resources
that have been automatically mined from documents.
- Integration of text mining into business intelligence applications for e-commerce.
The goal of this project would be to investigate possibilities for integrating text and Web mining
into systems that provide business intelligence for various sectors, including e-commerce.
- Design and evaluation metrics for text mining.
Text mining scenarios are small-scale, but real-world problems that are defined in close cooperation with
domain specialists in order to support solving a specific set of problems by text mining.
This project will design and evaluate a framework for text mining scenarios in various domains (e.g. biomedical,
engineering, legislation).
Text processing for the Semantic Web
The Semantic Web is one of the main research directions and concepts for improving (automated) accessibility of the Web. The main objective of these topics is to develop text processing methods that will automatically generate knowledge that can be used as a basis for Semantic Web applications, including searching and querying repositories using Semantic Web languages.
- Mining bioinformatics services, resources and workflows from documents.
There are a number of services and resources available to the bioinformatics community, but meta-data that describe them are typically scarce. This project
aims to develop text mining techniques to automatically describe, locate, retrieve and reason about bioinformatics services and resources. We investigate
methods that extract descriptions from various document types (articles, reviews, application notes, email archives, discussion forums, etc), and map them to
service descriptions using both general service ontologies and domain-specific ontologies. As a working and target environment, the project uses the
myGRID/Taverna infrastructure.
- Trustworthiness of information presented on the Web.
The main issue is to investigate to what extent the information that is presented
on the Web can be trusted.
Text analytics and sentiment analysis
Sentiment analysis is the extraction of attitudes and opinions from human-authored documents. The capture and analysis of such attitudes and opinions in an automated and structured fashion might offer a powerful technology to a number of problem domains, including business intelligence, marketing, national security, and crime prevention. This project would aim to develop technologies for extraction and analysis of sentiment from free text using a combination of natural language processing (NLP), text mining and machine learning techniques. An interesting experimental area would be blog mining. The work will involve building models of sentiment from which suitable templates for extraction will be designed. Apart from the domains mentioned above, the approach will be tested in the scientific domain (testing the hypothesis that scientific articles involve less sentiment than other genres).
Multi-lingual text mining
- Terminology driven multi-lingual information retrieval in digital libraries.
The goal of this project is to investigate possibilities for cross-lingual text mining
that is driven by domain terminologies that are acquired in parallel.
- Text categorisation. The scope of this project is to develop techniques for multi-lingual categorisation of documents, in particular in a dynamic Web environment using a set of ontological and terminological resources.
NLP for Serbian
The main idea is to provide standards-based solutions to basic NLP problems for a morphologically rich language like Serbian.
- POS-tagging for Serbian.
The idea is to investigate various POS-tagging methods (including rule-based,
probabilistic and machine-learning) for different text types. Also, a challenging
problem could be to design a POS-tagger based on a voting system.
- Named-entity recognition.
The scope of this project would be to develop methods (rule-based, probabilistic
or machine-learning) to recognise various classes of named entities in Serbian.
- Shallow parsing for Serbian.
The aim is to develop a set of local grammars to support identification of
basic chunks in text.
- Information retrieval for Serbian.
The idea is to investigate various indexing methods for information retrieval in Serbian, and to produce
a simple search engine. The project will also include a language identification module.
- Development of domain-specific WordNets.
The aim is to develop basic ontologies for specific domains and to integrate
them into the Serbian WordNet. The project would also include validation of the developed WordNets.
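To illustrate the voting-based POS-tagger mentioned under the POS-tagging topic above, the sketch below combines the outputs of several taggers by simple majority vote per token. It is only one possible reading of "a voting system": the function name, the tie-breaking rule (the earliest tagger in the list wins) and the example tags are all assumptions made for illustration.

```python
from collections import Counter


def vote_tags(tag_sequences):
    """Combine per-token tags from several taggers by majority vote.

    tag_sequences is a list of tag lists, one per tagger, all aligned to
    the same tokens. Ties are resolved in favour of the tagger that comes
    first in the list (an assumed, arbitrary tie-breaking rule).
    """
    if not tag_sequences:
        return []
    length = len(tag_sequences[0])
    if any(len(seq) != length for seq in tag_sequences):
        raise ValueError("all taggers must tag the same number of tokens")

    combined = []
    for token_tags in zip(*tag_sequences):
        counts = Counter(token_tags)
        best = max(counts.values())
        # Pick the first tag, in tagger order, that reaches the top count.
        combined.append(next(t for t in token_tags if counts[t] == best))
    return combined


# Hypothetical outputs of three taggers for the same three tokens.
rule_based = ["N", "V", "A"]
probabilistic = ["N", "N", "A"]
learned = ["V", "V", "P"]
result = vote_tags([rule_based, probabilistic, learned])
```

Weighted voting (for example, by each tagger's accuracy on a held-out set) would be a natural refinement, but it is not shown here.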
