Proposal received 2006-09-18

Integration of data sources in bioinformatics: Use of knowledge bases for query rewriting

Scientists in bioinformatics often have to retrieve data from multiple data sources(*) to solve their research problems. The large number of data sources, with heterogeneous data, data formats and access methods, makes data retrieval a difficult task. Accomplishing this task successfully requires a lot of effort and knowledge from the user. She has to decide which data sources to access and in which order, how to retrieve the data and how to combine the results. Although some information integration systems are available in the area, a number of problems remain unsolved.

The thesis is part of a larger project that builds a system enabling transparent access to multiple heterogeneous biological data sources [1]. The user of the system does not need to know about the integrated data sources. She formulates a query in a uniform query language, using the terms of a mediated schema that uniformly describes the content of the underlying data sources. The system then performs query processing: it reformulates the user query expressed over the mediated schema into queries over the relevant data sources, creates a query plan that specifies how the queries should be executed, executes the query plan and returns the retrieved results to the user.
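
The reformulation and planning steps can be illustrated with a minimal Java sketch. Everything in it is an assumption made for illustration and not part of the project's actual design: the class names, the source names (RelationalSource, FlatFileSource), the query templates and the example value are hypothetical, and a real system would derive the mappings from source descriptions instead of a hard-coded list.

```java
import java.util.ArrayList;
import java.util.List;

class MediatorSketch {
    // Hypothetical mapping: a term of the mediated schema can be answered
    // by one source through a source-specific query template.
    static final class Mapping {
        final String term;
        final String source;
        final String template;

        Mapping(String term, String source, String template) {
            this.term = term;
            this.source = source;
            this.template = template;
        }
    }

    // Assumed source descriptions; names and templates are invented.
    static final List<Mapping> MAPPINGS = List.of(
        new Mapping("Protein", "RelationalSource",
                    "SELECT * FROM protein WHERE name = '%s'"),
        new Mapping("Protein", "FlatFileSource",
                    "grep 'NAME %s' proteins.txt"),
        new Mapping("Pathway", "RelationalSource",
                    "SELECT * FROM pathway WHERE name = '%s'"));

    // Reformulates a query over the mediated schema into an ordered list of
    // source-level queries, which serves as a very simple query plan.
    // Plain string formatting is used only to keep the sketch short.
    static List<String> plan(String term, String value) {
        List<String> steps = new ArrayList<>();
        for (Mapping m : MAPPINGS) {
            if (m.term.equals(term)) {
                steps.add(m.source + ": " + String.format(m.template, value));
            }
        }
        return steps;
    }

    public static void main(String[] args) {
        // Print one plan step per relevant source.
        for (String step : plan("Protein", "hexokinase")) {
            System.out.println(step);
        }
    }
}
```

In this sketch the user only names a mediated-schema term and a value; which sources are contacted, and with which source-specific query, is decided entirely by the mappings.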

The focus of the thesis is to investigate how domain knowledge (knowledge specific to the application area) can be used to expand user queries so that they find a larger set of relevant results, and to rewrite the user queries into queries over the data sources. For example, it is known that enzymes are proteins. Based on this knowledge, a user query searching for proteins will be extended to search for enzymes as well. To this end, the student will model the domain knowledge and describe the content of the data sources in OWL [2], use a reasoning system to infer the relevant knowledge, analyze the user queries and extend them when possible, and rewrite the user queries into queries over multiple data sources. To select the reasoning system, a few systems will need to be studied and compared. Java will be used as the implementation language.
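
The enzyme example can be made concrete with a minimal Java sketch of query expansion over a class hierarchy. It rests on stated assumptions: the hierarchy is a hand-written map standing in for the subclass relations that a reasoning system would infer from the OWL model, the class and method names are hypothetical, and the extra class Kinase is added only to show that indirect subclasses are included.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class QueryExpansionSketch {
    // Assumed hierarchy: class -> direct subclasses. In the thesis this
    // information would come from a reasoner, not from a literal map.
    static final Map<String, List<String>> DIRECT_SUBCLASSES = Map.of(
        "Protein", List.of("Enzyme"),
        "Enzyme", List.of("Kinase"));

    // Returns the queried term together with all of its direct and
    // indirect subclasses, found by a breadth-first traversal.
    static Set<String> expand(String term) {
        Set<String> expanded = new LinkedHashSet<>();
        Deque<String> pending = new ArrayDeque<>();
        pending.add(term);
        while (!pending.isEmpty()) {
            String current = pending.remove();
            // add() returns false for terms already seen, so each class
            // is processed at most once.
            if (expanded.add(current)) {
                pending.addAll(
                    DIRECT_SUBCLASSES.getOrDefault(current, List.of()));
            }
        }
        return expanded;
    }

    public static void main(String[] args) {
        // A query for proteins is widened to cover enzymes as well.
        System.out.println(expand("Protein")); // prints [Protein, Enzyme, Kinase]
    }
}
```

The expansion only goes downwards in the hierarchy: a query for proteins also covers enzymes, while a query for enzymes is not widened to all proteins.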

The student is expected to have a background in, or an interest in learning, knowledge representation languages (Description Logics and OWL). Basic knowledge of biology would be helpful but is not necessary.

(*) The term "data sources" refers to different types of sources of data, e.g. databases, text files storing semi-structured information, and applications.


[1] BioTrifu. http://www.ida.liu.se/~patla/research/ceniit.html
[2] OWL. Web Ontology Language. http://www.w3.org/2004/OWL


