Pattern recognition and data mining in graphs and strings

Drovandi, Guido

Please use this identifier to cite or link to this item: http://hdl.handle.net/2307/4551

DC Field	Value	Language
dc.contributor.advisor	Bertolazzi, Paola	-
dc.contributor.author	Drovandi, Guido	-
dc.date.accessioned	2015-05-25T14:44:53Z	-
dc.date.available	2015-05-25T14:44:53Z	-
dc.date.issued	2011-03-26	-
dc.identifier.uri	http://hdl.handle.net/2307/4551	-
dc.description.abstract	The objective of the dissertation is to study methods to manage and retrieve relevant information in graphs and strings, contributing to the advancement of research in this promising field. In particular, we focus our attention on two types of data: graphs and strings. The first part of the dissertation deals with the problem of counting subgraphs, also called motifs. It was conjectured that, in some cases, the problem could be solved in polynomial time. However, we show that this is not the case, unless P = NP, by giving proofs of NP-completeness. We also propose a method to approximate the number of occurrences of topological colored motifs, that is, motifs in which each node has a different color. In the second part, we consider the problem of compressing graphs and strings, in particular those related to sequences of integers. A new compression scheme for graphs is introduced. To obtain a compact representation of a graph, some particular subgraphs are exploited; however, we do not search explicitly for these subgraphs. The proposed scheme is suitable for the Web Graph, the graph of the URLs of the World Wide Web. In any case, the proposed method can also compress other kinds of graphs, in contrast with many previous works that need the URLs for the compression phase. We also introduce an extension of the proposed graph compression method that exploits the high number of reciprocal edges that characterizes social networks. After compression, the graph is represented by a string of characters that must be stored on disk. A convenient way to do so is to use prefix codes, that is, variable-length codes in which no codeword is a prefix of any other. We define a new prefix code suitable for compressing a stream of integers that follows a power-law distribution. 
We show analytically that this encoding follows the entropy of a zeta distribution (one of the best-known power-law probability distributions) better than codes proposed in the literature. In addition, we propose a method to build prefix codes for an arbitrary probability distribution of the integers. To deal with this problem, we introduce a general prefix code and study its properties. We test the method by creating prefix codes for different distributions and comparing the compression results with codes presented in the literature. In this case the results are as good as, or better than, those available in the literature. Moreover, we show that, using the general code proposed in the previous chapter, it is possible to give a common description of many of the prefix codes proposed in the literature. This result is made possible by the introduction of a new method to encode a finite set of integers. The last part deals with the problem of finding information in huge sets of strings. A logic mining method is applied to generate logic formulas that distinguish among different classes of strings. The method is applied successfully to the analysis of biological data, such as DNA sequences; in particular, it is able to find relevant genes that distinguish between mice with Alzheimer's disease and healthy mice.	it_IT
dc.language.iso	en	it_IT
dc.publisher	Università degli studi Roma Tre	it_IT
dc.title	Pattern recognition and data mining in graphs and strings	it_IT
dc.type	Doctoral Thesis	it_IT
dc.subject.miur	Settori Disciplinari MIUR::Ingegneria industriale e dell'informazione::SISTEMI DI ELABORAZIONE DELLE INFORMAZIONI	it_IT
dc.subject.miur	Ingegneria industriale e dell'informazione	-
dc.subject.isicrui	Categorie ISI-CRUI::Ingegneria industriale e dell'informazione::Information Technology & Communications Systems	it_IT
dc.subject.isicrui	Ingegneria industriale e dell'informazione	-
dc.subject.anagraferoma3	Ingegneria industriale e dell'informazione	it_IT
dc.contributor.referee	Apostolico, Alberto	-
dc.contributor.referee	Blazewicz, Jacek	-
dc.rights.accessrights	info:eu-repo/semantics/openAccess	-
dc.description.romatrecurrent	X_Dipartimento di Informatica e automazione	*
item.languageiso639-1	other	-
item.fulltext	With Fulltext	-
item.grantfulltext	restricted	-
Appears in Collections:	X_Dipartimento di Informatica e automazione T - Tesi di dottorato

Files in This Item:

File	Description	Size	Format
Pattern Recognition and Data Mining in Graphs and Strings.pdf		856.72 kB	Adobe PDF
