Big biomedical data modeling for knowledge extraction with machine learning techniques

Cappelli, Eleonora

Please use this identifier to cite or link to this item: http://hdl.handle.net/2307/40921

Title:	Big biomedical data modeling for knowledge extraction with machine learning techniques
Authors:	Cappelli, Eleonora
Advisor:	Torlone, Riccardo
Referees:	Elloumi, Mourad; Swiercz, Aleksandra
Keywords:	BIOINFORMATICS; DATA STANDARDIZATION
Issue Date:	20-Apr-2020
Publisher:	Università degli studi Roma Tre
Abstract:	Background. Over the last ten years, the volume of biomedical data produced daily by Next Generation DNA Sequencing (NGS) techniques has doubled every seven months. Genomics now plays a significant role in the field of Big Data because of the large amount of biomedical data being produced, analyzed, and stored in many public databases. This data is currently stored by many different organizations, and its acquisition is highly distributed and involves heterogeneous formats. Methods. This dissertation addresses the problem of biomedical data heterogeneity by proposing new standardization methods and pipelines that make it easy to integrate genomic and clinical cancer data from different NGS experiments. Moreover, novel methods for querying the data are defined: (i) use cases of the GenoMetric Query Language, a high-level domain-specific query language, are presented to demonstrate the efficiency of the data standardization in terms of information retrieval; (ii) a new data model that minimizes the amount of redundant information is defined, allowing the creation of an Application Programming Interface (API) for data retrieval; (iii) methods for discovering and querying large datasets through taxonomy-based methodologies are proposed. Finally, thanks to biomedical data standardization, machine learning techniques can be readily applied to the analysis and interpretation of genomic data. In particular, knowledge extraction experiments on big biomedical cancer datasets are presented, yielding promising performance and models. Results. The main results of the dissertation are new software tools and methods: (i) OpenGDC, which automatically standardizes and extends genomic and clinical cancer data; the OpenGDC software is freely available at http://geco.deib.polimi.it/opengdc/, and a publicly accessible repository containing the homogenized and enhanced data (more than 1.5 TB) has been released; (ii) OpenOmics, which provides a flexible collection of Application Programming Interfaces (APIs), with a set of implemented endpoints available at http://bioinformatics.iasi.cnr.it/openomics/api/routes, together with an ontological software layer that allows users to interact with experimental data and metadata without knowledge of their representation schema; (iii) new software pipelines for gene-oriented data preprocessing, and a large knowledge base of classification results (datasets, logic formulas, performance, and statistics) obtained by applying different machine learning algorithms to a large repository of publicly available RNA sequencing and DNA methylation cancer data; (iv) CamurWeb, a web service that aims to make the CAMUR machine learning software easily accessible and usable. Conclusions. The aim of the dissertation is to provide tools for the management and analysis of Big Biomedical Data and to define a framework for standardization, querying, and knowledge extraction from clinical and genomic data. The experimental results confirm the soundness of the proposed approaches.
URI:	http://hdl.handle.net/2307/40921
Access Rights:	info:eu-repo/semantics/openAccess
Appears in Collections:	T - Tesi di dottorato

Files in This Item:

File	Description	Size	Format
Tesi_Cappelli_Eleonora.pdf		4.22 MB	Adobe PDF
