Please use this identifier to cite or link to this item: http://hdl.handle.net/2307/4557
Title: Logic mining techniques for biological data analysis and classification
Authors: Weitschek, Emanuel
Advisor: Bertolazzi, Paola
Keywords: bioinformatics
data mining
software engineering
Issue Date: 4-Jun-2013
Publisher: Università degli studi Roma Tre
Abstract: Advances in molecular biology, supported by computer science, have led to an exponential growth of biological data. The primary sequence database GenBank doubles in size every 18 months and currently contains more than 160 billion sequences. The 1000 Genomes Project released whole DNA sequences of a large number of individuals, producing more than 3000 billion DNA base pairs. Analyzing this enormous amount of data is becoming very important in order to shed light on biological and biomedical questions. The challenges lie in managing this huge amount of data, in discovering its interactions, and in integrating biological know-how. The analysis of biological data requires new methods to extract compact and relevant information; effective and efficient computer science algorithms are needed to support the analysis of complex biological data sets. The interdisciplinary field of data mining, which guides the automated knowledge discovery process, is a natural way to approach the task of biological data analysis. In this dissertation, new data mining methods are presented and shown to be effective in many biological data analysis problems. The particular field of logic data mining, where a data classification model is extracted in the form of propositional logic formulas, is investigated, and a new system for performing a complete knowledge discovery process is described. The system presents new methods for discretization, clustering, feature selection, and classification. All methods have been integrated into three tools: BLOG, MALA, and DMiB. The first is dedicated to the classification of species, the second to the analysis of gene expression profiles, and the third to multipurpose use.
These tools were applied to species classification with DNA Barcode sequences, virus identification, gene expression profile analysis, clinical patient characterization, tag SNP classification, non-coding DNA identification, and whole genome analysis. A comparison with other data mining methods was performed. The analysis results were all very positive and contributed important additional knowledge in biology and medicine, such as the detection of nine core genes able to distinguish Alzheimer-diseased from control experimental samples, and the identification of the distinguishing nucleotide positions in the five currently known human polyomaviruses. Moreover, a new technique based on alignment-free sequence analysis and logic data mining is presented; it classifies sequences without the strict requirement of computing an alignment between them. This is a major advantage, as the alignment problem is computationally hard and many biological sequences, such as non-coding regions, cannot be aligned because of their intrinsic nature. In this case too, the experiments performed on whole genomes and on conserved non-coding elements show the success of the approach. The models extracted from the analyses are logic formulas able to characterize the different classes of the data set in a clear and compact way. Such a model is a strong advantage for the domain expert, who gains valuable and directly interpretable knowledge.
URI: http://hdl.handle.net/2307/4557
Access Rights: info:eu-repo/semantics/openAccess
Appears in Collections: X_Dipartimento di Ingegneria
T - Tesi di dottorato

Files in this item:
File: EWeitschekPhDThesis.pdf
Size: 4.7 MB
Format: Adobe PDF

All items in DSpace are protected by copyright, with all rights reserved.