Use this identifier to cite or link to this document:
http://hdl.handle.net/2307/40921
Title: | Big biomedical data modeling for knowledge extraction with machine learning techniques |
Authors: | Cappelli, Eleonora |
Advisor: | Torlone, Riccardo |
Referees: | Elloumi, Mourad; Swiercz, Aleksandra |
Keywords: | BIOINFORMATICS; DATA STANDARDIZATION |
Date of publication: | 20-Apr-2020 |
Publisher: | Università degli studi Roma Tre |
Abstract: | Background. Over the last ten years, the volume of biomedical data produced daily by Next Generation DNA Sequencing (NGS) techniques has doubled every seven months. Genomics now plays a relevant role in the field of Big Data because of the large amount of biomedical data being produced, analyzed, and stored in many public databases. This data is currently stored by many different organizations, and its acquisition methods are highly distributed and involve heterogeneous formats. Methods. This dissertation addresses the problem of biomedical data heterogeneity by proposing new standardization methods and pipelines that make it easy to integrate genomic and clinical cancer data from different NGS experiments. Moreover, novel methods for querying the data are defined: (i) use cases of the GenoMetric Query Language, a high-level domain-specific query language, are presented to demonstrate the efficiency of the data standardization in terms of information retrieval; (ii) a new data model that minimizes the amount of redundant information is defined, allowing the creation of an Application Programming Interface (API) for data retrieval; (iii) methods for discovering and querying large datasets through taxonomy-based methodologies are proposed. Finally, thanks to biomedical data standardization, machine learning techniques can be easily applied to the analysis and interpretation of genomic data. In particular, knowledge extraction experiments on big biomedical cancer datasets are shown, with promising performance and models. Results. The main results of the dissertation are new software tools and methods: (i) OpenGDC, which automatically standardizes and extends genomic and clinical cancer data; the OpenGDC software is freely available at http://geco.deib.polimi.it/opengdc/, and a publicly accessible repository containing the homogenized and enhanced data (more than 1.5 TB) is released; (ii) OpenOmics, which provides a flexible collection of Application Programming Interfaces (APIs), with a set of implemented endpoints available at http://bioinformatics.iasi.cnr.it/openomics/api/routes; and an ontological software layer that allows users to interact with experimental data and metadata without knowing their representation schema; (iii) new software pipelines for gene-oriented data preprocessing, and a large knowledge base of classification results (datasets, logic formulas, performance, and statistics) obtained by applying different machine learning algorithms to a large repository of publicly available RNA sequencing and DNA methylation data of cancer; (iv) CamurWeb, a web service that aims to make the CAMUR machine learning software easily accessible and usable. Conclusions. The aim of the dissertation is to provide tools for the management and analysis of Big Biomedical Data and to allow the definition of a framework for standardization, querying, and knowledge extraction from clinical and genomic data. The experimental results confirm the soundness of the proposed approaches. |
URI: | http://hdl.handle.net/2307/40921 |
Access rights: | info:eu-repo/semantics/openAccess |
Appears in collections: | T - Tesi di dottorato (Doctoral theses) |
Files in this document:
File | Description | Size | Format | |
---|---|---|---|---|
Tesi_Cappelli_Eleonora.pdf | | 4.22 MB | Adobe PDF | View/open |
Page views: 178 (checked on 23-Nov-2024)
Downloads: 85 (checked on 23-Nov-2024)
All documents archived in DSpace are protected by copyright. All rights reserved.