Cloud-based solutions supporting data and knowledge integration in bioinformatics

Milia, Gabriele

In recent years, computer advances have changed the way the science progresses and have boosted studies in silico; as a result, the concept of “scientific research” in bioinformatics has quickly changed shifting from the idea of a local laboratory activity towards Web applications and databases provided over the network as services. Thus, biologists have become among the largest beneficiaries of the information technologies, reaching and surpassing the traditional ICT users who operate in the field of so-called "hard science" (i.e., physics, chemistry, and mathematics). Nevertheless, this evolution has to deal with several aspects (including data deluge, data integration, and scientific collaboration, just to cite a few) and presents new challenges related to the proposal of innovative approaches in the wide scenario of emergent ICT solutions. This thesis aims at facing these challenges in the context of three case studies, being each case study devoted to cope with a specific open issue by proposing proper solutions in line with recent advances in computer science. The first case study focuses on the task of unearthing and integrating information from different web resources, each having its own organization, terminology and data formats in order to provide users with flexible environment for accessing the above resources and smartly exploring their content. The study explores the potential of cloud paradigm as an enabling technology to severely curtail issues associated with scalability and performance of applications devoted to support the above task. Specifically, it presents Biocloud Search EnGene (BSE), a cloud-based application which allows for searching and integrating biological information made available by public large-scale genomic repositories. BSE is publicly available at: http://biocloud-unica.appspot.com/. The second case study addresses scientific collaboration on the Web with special focus on building a semantic network, where team members, adequately supported by easy access to biomedical ontologies, define and enrich network nodes with annotations derived from available ontologies. The study presents a cloud-based application called Collaborative Workspaces in Biomedicine (COWB) which deals with supporting users in the construction of the semantic network by organizing, retrieving and creating connections between contents of different types. Public and private workspaces provide an accessible representation of the collective knowledge that is incrementally expanded. COWB is publicly available at: http://cowb-unica.appspot.com/. Finally, the third case study concerns the knowledge extraction from very large datasets. The study investigates the performance of random forests in classifying microarray data. In particular, the study faces the problem of reducing the contribution of trees whose nodes are populated by non-informative features. Experiments are presented and results are then analyzed in order to draw guidelines about how reducing the above contribution. With respect to the previously mentioned challenges, this thesis sets out to give two contributions summarized as follows. First, the potential of cloud technologies has been evaluated for developing applications that support the access to bioinformatics resources and the collaboration by improving awareness of user's contributions and fostering users interaction. Second, the positive impact of the decision support offered by random forests has been demonstrated in order to tackle effectively the curse of dimensionality.

UNICA IRIS Institutional Research Information System