Please use this identifier to cite or link to this item: http://hdl.handle.net/2307/4557
Title: Logic mining techniques for biological data analysis and classification
Authors: Weitschek, Emanuel
Advisor: Bertolazzi, Paola
Keywords: bioinformatics
data mining
software engineering
Issue Date: 4-Jun-2013
Publisher: Università degli studi Roma Tre
Abstract: Advances in molecular biology, supported by computer science, have led to an exponential growth of biological data. The primary sequence database GenBank doubles in size every 18 months and currently holds more than 160 billion sequences. The 1000 Genomes Project released whole DNA sequences of a large number of individuals, producing more than 3,000 billion DNA base pairs. Analyzing this enormous amount of data is becoming very important in order to shed light on biological and biomedical questions. The challenges lie in managing this huge amount of data, in discovering its interactions, and in integrating biological know-how. The analysis of biological data requires new methods to extract compact and relevant information; effective and efficient computer science algorithms are needed to support the analysis of complex biological data sets. The interdisciplinary field of data mining, which guides the automated knowledge discovery process, is a natural way to approach the task of biological data analysis. In this dissertation, new data mining methods are presented and shown to be effective in many biological data analysis problems. The particular field of logic data mining, where a data classification model is extracted in the form of propositional logic formulas, is investigated, and a new system for performing a complete knowledge discovery process is described. The system presents new methods for discretization, clustering, feature selection and classification. All methods have been integrated into three different tools: BLOG, MALA and DMiB, the first dedicated to the classification of species, the second to the analysis of gene expression profiles, and the third to multipurpose use.
These tools were applied to species classification with DNA Barcode sequences, virus identification, gene expression profile analysis, clinical patient characterization, tag SNP classification, non-coding DNA identification, and whole genome analysis. A comparison with other data mining methods was performed. The results were all very positive and contributed important additional knowledge in biology and medicine, such as the detection of nine core genes able to distinguish Alzheimer's disease samples from control samples, or the identification of the distinguishing nucleotide positions in the five currently known human polyomaviruses. Moreover, a new technique based on alignment-free sequence analysis and logic data mining is presented; it classifies sequences without the strict requirement of computing an alignment between them. This is a major advantage, as the alignment problem is computationally hard and many biological sequences, e.g. non-coding regions, cannot be aligned because of their intrinsic nature. In this case too, the experiments performed on whole genomes and on conserved non-coding elements show the success of this approach. The models extracted from these analyses are logic formulas able to characterize the different classes of the data set in a clear and compact way. Such a model is a strong asset for the domain expert, who gains valuable and directly interpretable knowledge.
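As an illustration only, the idea of combining alignment-free features with a logic formula can be sketched as follows. This is a minimal sketch under stated assumptions, not the method implemented in BLOG, MALA or DMiB: the choice of k-mer relative frequencies as features, the function names `kmer_frequencies` and `rule_class_a`, the thresholds, and the two toy sequences are all hypothetical and chosen purely for demonstration.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=2):
    """Return the relative frequency of every DNA k-mer in seq.

    No alignment is computed: each sequence is described only by
    how often each k-mer occurs in it (an assumed feature choice).
    """
    seq = seq.upper()
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    # Ignore windows containing ambiguous symbols such as N.
    counts = Counter(w for w in windows if set(w) <= set("ACGT"))
    total = sum(counts.values())
    return {
        "".join(p): (counts["".join(p)] / total if total else 0.0)
        for p in product("ACGT", repeat=k)
    }


def rule_class_a(freqs):
    """Hypothetical propositional rule over discretized features:

    (f(GC) >= 0.10) AND NOT (f(AT) >= 0.20)
    The thresholds are illustrative, not learned from data.
    """
    return freqs["GC"] >= 0.10 and not freqs["AT"] >= 0.20


if __name__ == "__main__":
    # Toy sequences, invented for this example.
    for name, seq in [("s1", "GCGCGCATGCGC"), ("s2", "ATATATATGCAT")]:
        label = "class A" if rule_class_a(kmer_frequencies(seq)) else "not class A"
        print(name, label)
```

A rule of this shape can be read directly by a domain expert, which is the interpretability property the abstract emphasizes; in a real setting the thresholds and the selected k-mers would be learned from labeled data rather than fixed by hand.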
URI: http://hdl.handle.net/2307/4557
Access Rights: info:eu-repo/semantics/openAccess
Appears in Collections: X_Dipartimento di Ingegneria
T - Tesi di dottorato

Files in This Item:
File: EWeitschekPhDThesis.pdf (Size: 4.7 MB, Format: Adobe PDF)