Please use this identifier to cite or link to this item: http://hdl.handle.net/2307/40921
Title: Big biomedical data modeling for knowledge extraction with machine learning techniques
Authors: Cappelli, Eleonora
Advisor: Torlone, Riccardo
Referees: Elloumi, Mourad; Swiercz, Aleksandra
Keywords: BIOINFORMATICS; DATA STANDARDIZATION
Issue Date: 20-Apr-2020
Publisher: Università degli studi Roma Tre
Abstract: Background. Over the last ten years, the volume of biomedical data produced daily by Next Generation DNA Sequencing (NGS) techniques has doubled every seven months. Genomics now plays a prominent role in the field of Big Data because of the large amount of biomedical data being produced, analyzed, and stored in many public databases. This data is currently stored by many different organizations, whose acquisition methods are highly distributed and involve heterogeneous formats.

Methods. This dissertation addresses the problem of biomedical data heterogeneity by proposing new standardization methods and pipelines that allow genomic and clinical data of cancer from different NGS experiments to be integrated easily. Moreover, novel methods for querying these data are defined: (i) use cases of the GenoMetric Query Language, a high-level domain-specific query language, are presented to demonstrate the efficiency of the data standardization in terms of information retrieval; (ii) a new data model that minimizes redundant information is defined, enabling the creation of an Application Programming Interface (API) for data retrieval; (iii) methods for discovering and querying large datasets through taxonomy-based methodologies are proposed. Finally, thanks to biomedical data standardization, machine learning techniques can be applied easily to the analysis and interpretation of genomic data. In particular, knowledge extraction experiments on big biomedical cancer datasets are presented, with promising performance and models.

Results. The main results of the dissertation are new software tools and methods: (i) OpenGDC, which automatically standardizes and extends genomic and clinical data of cancer; the OpenGDC software is freely available at http://geco.deib.polimi.it/opengdc/, and a publicly accessible repository containing homogenized and enhanced data (more than 1.5 TB) is also released; (ii) OpenOmics, which provides a flexible collection of Application Programming Interfaces (APIs), with a set of implemented endpoints available at http://bioinformatics.iasi.cnr.it/openomics/api/routes, and an ontological software layer that allows users to interact with experimental data and metadata without knowledge of their representation schema; (iii) new software pipelines for gene-oriented data preprocessing, together with a large knowledge base of classification results (datasets, logic formulas, performance metrics, and statistics) obtained by applying different machine learning algorithms to a big repository of publicly available RNA sequencing and DNA methylation data of cancer; (iv) CamurWeb, a web service that makes the CAMUR machine learning software easily accessible and usable.

Conclusions. The dissertation provides tools for the management and analysis of Big Biomedical Data and defines a framework for standardization, querying, and knowledge extraction from clinical and genomic data. The experimental results obtained confirm the soundness of the proposed approaches.
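As an illustration of how the OpenOmics endpoints mentioned in the abstract might be consumed, the following is a minimal Python sketch. Only the base routes URL comes from this record; the JSON response format and the commented-out query parameters are assumptions made for illustration, not documented features of the API.

    import requests

    # Base URL of the OpenOmics API; /routes is the only endpoint
    # explicitly mentioned in this record.
    BASE_URL = "http://bioinformatics.iasi.cnr.it/openomics/api"

    def list_routes():
        """Fetch the listing of implemented endpoints.

        Assumes the service answers with JSON; if it returns HTML or
        plain text instead, inspect resp.text rather than resp.json().
        """
        resp = requests.get(f"{BASE_URL}/routes", timeout=30)
        resp.raise_for_status()
        return resp.json()

    if __name__ == "__main__":
        print(list_routes())

        # Hypothetical follow-up request: the route name and the query
        # parameters below are illustrative assumptions only.
        # data = requests.get(f"{BASE_URL}/annotation",
        #                     params={"gene": "BRCA1"},
        #                     timeout=30).json()

Such a route listing is the natural starting point for a client: once the implemented endpoints are known, data-retrieval calls take the general shape of the commented request above.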
URI: http://hdl.handle.net/2307/40921
Access Rights: info:eu-repo/semantics/openAccess
Appears in Collections: T - Tesi di dottorato

Files in This Item:
File: Tesi_Cappelli_Eleonora.pdf (4.22 MB, Adobe PDF)



