Thesis project proposals from companies

This is a thesis proposal retrieved from Nationella Exjobb-poolen.

The proposal was received on 2006-09-18

Integration of data sources in bioinformatics: Exploration of possibilities to join data sources

NOTE: THE APPLICATION PERIOD FOR THIS THESIS PROJECT HAS EXPIRED.
Scientists in bioinformatics often have to retrieve data from multiple data sources (*) to solve their research problems. The large number of data sources, with heterogeneous data, data formats and access methods, makes data retrieval a difficult task. Accomplishing this task requires considerable effort and knowledge from the user. She has to decide which data sources to access and in which order, how to retrieve the data and how to combine the results. Although some information integration systems are available in the area, a number of problems remain unsolved.

The thesis is part of a larger project that builds a system enabling transparent access to multiple heterogeneous biological data sources [1]. The user of the system does not need to know about the integrated data sources. She formulates a query in a uniform query language, using the terms of a mediated schema that uniformly describes the content of the underlying data sources. The system then performs query processing: it reformulates the user query expressed over the mediated schema into queries over the relevant data sources, creates a query plan that specifies how the query should be executed, executes the query plan and returns the retrieved results to the user.

One of the steps in query processing is to find a way to integrate (join) data coming from different data sources. For biological data sources available on the Internet, there are a few ways to identify connection points between them: a) Data sources in bioinformatics are rich in explicit links (hyperlinks or cross-references) between entries in different data sources. b) Different data sources may contain semantically equivalent concepts (e.g. object identifiers or names). c) Some data sources annotate their data with terms from an ontology, and these terms can be used to identify semantically equivalent concepts. In some cases, two data sources have no direct link between them; intermediate data sources can then be used to connect them. Often, several alternative paths through intermediate data sources exist between the two data sources being analyzed.
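
To make the three kinds of connection points more concrete, the following is a minimal sketch of how such links might be declared in Java. It is not taken from the project described in [1]: the class LinkCatalog, the enum LinkKind, the placeholder source names (SourceA, SourceB, SourceC) and the "via" labels are all hypothetical, and treating every link as usable in both directions is a simplifying assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical catalogue of declared links between data sources. */
class LinkCatalog {

    /** The three kinds of connection points listed above (names are assumptions). */
    enum LinkKind { CROSS_REFERENCE, EQUIVALENT_CONCEPT, ONTOLOGY_ANNOTATION }

    /** One declared link between two data sources. */
    static final class Link {
        final String from;
        final String to;
        final LinkKind kind;
        final String via; // field, identifier or ontology used for the join

        Link(String from, String to, LinkKind kind, String via) {
            this.from = from;
            this.to = to;
            this.kind = kind;
            this.via = via;
        }

        @Override
        public String toString() {
            return from + " -> " + to + " [" + kind + " via " + via + "]";
        }
    }

    private final List<Link> links = new ArrayList<>();

    /** Registers a link; nothing is validated in this sketch. */
    void declare(String from, String to, LinkKind kind, String via) {
        links.add(new Link(from, to, kind, via));
    }

    /** Returns the links that connect the two sources directly, in either direction. */
    List<Link> directLinks(String a, String b) {
        List<Link> found = new ArrayList<>();
        for (Link link : links) {
            boolean forward = link.from.equals(a) && link.to.equals(b);
            boolean backward = link.from.equals(b) && link.to.equals(a);
            if (forward || backward) {
                found.add(link);
            }
        }
        return found;
    }

    public static void main(String[] args) {
        LinkCatalog catalog = new LinkCatalog();
        // Placeholder sources and join fields, for illustration only.
        catalog.declare("SourceA", "SourceB", LinkKind.CROSS_REFERENCE, "entry hyperlink");
        catalog.declare("SourceB", "SourceC", LinkKind.ONTOLOGY_ANNOTATION, "ontology term");
        System.out.println(catalog.directLinks("SourceA", "SourceB"));
        // SourceA and SourceC have no direct link; SourceB could act as an intermediate source.
        System.out.println(catalog.directLinks("SourceA", "SourceC"));
    }
}
```

Keeping the kind of link next to each declaration would let a later path search prefer, for example, explicit cross-references over name matching, although the proposal does not prescribe any such ranking.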

The focus of the thesis is to explore the links available between biological data sources, specify them declaratively, and implement algorithms that find all alternative links and paths between selected data sources. The work will include a literature study of systems that analyze links between data sources and an implementation of path search approaches. Java will be used as the implementation language. The student is expected to have good programming skills.
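
As an illustration of what a path search could look like, the sketch below enumerates all simple paths (paths that visit no data source twice) between two data sources using depth-first search with backtracking. It is one possible approach, not the algorithm chosen by the project. The class PathFinder, the source names A to D, the undirected treatment of links and the maxLinks bound are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical path search over an undirected graph of data sources. */
class PathFinder {

    // Adjacency sets: each data source maps to the sources it is linked with.
    private final Map<String, Set<String>> neighbours = new LinkedHashMap<>();

    /** Declares a link usable in both directions (a simplifying assumption). */
    void addLink(String a, String b) {
        neighbours.computeIfAbsent(a, k -> new LinkedHashSet<>()).add(b);
        neighbours.computeIfAbsent(b, k -> new LinkedHashSet<>()).add(a);
    }

    /** Returns every simple path from start to goal that uses at most maxLinks links. */
    List<List<String>> allPaths(String start, String goal, int maxLinks) {
        List<List<String>> result = new ArrayList<>();
        if (!neighbours.containsKey(start) || !neighbours.containsKey(goal)) {
            return result; // unknown data source: no paths
        }
        List<String> current = new ArrayList<>();
        current.add(start);
        Set<String> visited = new HashSet<>();
        visited.add(start);
        search(start, goal, maxLinks, current, visited, result);
        return result;
    }

    private void search(String node, String goal, int maxLinks, List<String> current,
                        Set<String> visited, List<List<String>> result) {
        if (node.equals(goal)) {
            result.add(new ArrayList<>(current)); // store a copy of the finished path
            return;
        }
        if (current.size() - 1 >= maxLinks) {
            return; // link budget used up without reaching the goal
        }
        for (String next : neighbours.get(node)) {
            if (visited.add(next)) { // true only if next is not yet on the current path
                current.add(next);
                search(next, goal, maxLinks, current, visited, result);
                current.remove(current.size() - 1); // backtrack
                visited.remove(next);
            }
        }
    }

    public static void main(String[] args) {
        PathFinder finder = new PathFinder();
        finder.addLink("A", "B");
        finder.addLink("B", "C");
        finder.addLink("A", "C");
        finder.addLink("C", "D");
        // A and D are not linked directly; B and C act as intermediate sources.
        System.out.println(finder.allPaths("A", "D", 3));
    }
}
```

The number of simple paths can grow exponentially with the number of data sources, which is why the sketch bounds the path length. Whether such a bound is appropriate for real biological data sources is a question for the thesis itself.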

(*) "Data sources" refers to different kinds of sources of data, e.g. databases, text files storing semi-structured information, and applications.

References

[1] BioTrifu. http://www.ida.liu.se/~patla/research/ceniit.html

