Logic mining techniques for biological data analysis and classification

Weitschek, Emanuel

Please use this identifier to cite or link to this item: http://hdl.handle.net/2307/4557

DC Field	Value	Language
dc.contributor.advisor	Bertolazzi, Paola	-
dc.contributor.author	Weitschek, Emanuel	-
dc.date.accessioned	2015-05-26T15:10:25Z	-
dc.date.available	2015-05-26T15:10:25Z	-
dc.date.issued	2013-06-04	-
dc.identifier.uri	http://hdl.handle.net/2307/4557	-
dc.description.abstract	Advances in molecular biology, supported by computer science, have led to an exponential growth of biological data. The primary sequence database GenBank doubles in size every 18 months and currently contains more than 160 billion sequences. The 1000 Genomes Project released whole DNA sequences of a large number of individuals, producing more than 3,000 billion DNA base pairs. Analyzing this enormous amount of data is becoming very important in order to shed light on biological and biomedical questions. The challenges lie in managing this huge amount of data, in discovering its interactions, and in integrating biological know-how. The analysis of biological data requires new methods to extract compact and relevant information; effective and efficient computer science algorithms are needed to support the analysis of complex biological data sets. The interdisciplinary field of data mining, which guides the automated knowledge discovery process, is a natural way to approach the task of biological data analysis. In this dissertation, new data mining methods are presented and shown to be effective in many biological data analysis problems. The particular field of logic data mining, where a data classification model is extracted in the form of propositional logic formulas, is investigated, and a new system for performing a complete knowledge discovery process is described. The system presents new methods for discretization, clustering, feature selection and classification. All methods have been integrated in three different tools: BLOG, MALA and DMiB, the first dedicated to the classification of species, the second to the analysis of gene expression profiles, and the third for multipurpose use.
These tools were applied to species classification with DNA Barcode sequences, virus identification, gene expression profile analysis, clinical patient characterization, tag SNP classification, non-coding DNA identification, and whole genome analysis. A comparison with other data mining methods was performed. The analysis results were all very positive and contributed important additional knowledge in biology and medicine, such as the detection of nine core genes able to distinguish Alzheimer-diseased from control experimental samples, and the identification of the distinguishing nucleotide positions in the five currently known human polyomaviruses. Moreover, a new technique based on alignment-free sequence analysis and logic data mining is presented; it classifies sequences without the strict requirement of computing an alignment between them. This is a major advantage, as the alignment problem is computationally hard and many biological sequences, e.g. non-coding regions, are not alignable because of their intrinsic nature. In this case too, the experiments performed on whole genomes and on conserved non-coding elements show the success of this approach. The models extracted from the several analyses are logic formulas able to characterize the different classes of the data set in a clear and compact way. Such a model is a strong plus for the domain expert, who gains valuable and directly interpretable knowledge.	it_IT
dc.language.iso	en	it_IT
dc.publisher	Università degli studi Roma Tre	it_IT
dc.subject	bioinformatics	it_IT
dc.subject	data mining	it_IT
dc.subject	software engineering	it_IT
dc.title	Logic mining techniques for biological data analysis and classification	it_IT
dc.type	Doctoral Thesis	it_IT
dc.subject.miur	Settori Disciplinari MIUR::Ingegneria industriale e dell'informazione::SISTEMI DI ELABORAZIONE DELLE INFORMAZIONI	it_IT
dc.subject.miur	Ingegneria industriale e dell'informazione	-
dc.subject.isicrui	Categorie ISI-CRUI::Ingegneria industriale e dell'informazione::Information Technology & Communications Systems	it_IT
dc.subject.isicrui	Ingegneria industriale e dell'informazione	-
dc.subject.anagraferoma3	Ingegneria industriale e dell'informazione	it_IT
dc.rights.accessrights	info:eu-repo/semantics/openAccess	-
dc.description.romatrecurrent	Dipartimento di Ingegneria	*
item.languageiso639-1	other	-
item.fulltext	With Fulltext	-
item.grantfulltext	restricted	-
Appears in Collections:	X_Dipartimento di Ingegneria T - Tesi di dottorato

Files in This Item:

File	Description	Size	Format
EWeitschekPhDThesis.pdf		4.7 MB	Adobe PDF


