Please use this identifier to cite or link to this item: http://hdl.handle.net/2307/4551
Title: Pattern recognition and data mining in graphs and strings
Authors: Drovandi, Guido
Advisor: Bertolazzi, Paola
metadata.dc.contributor.referee: Apostolico, Alberto
Blazewicz, Jacek
Issue Date: 26-Mar-2011
Publisher: Università degli studi Roma Tre
Abstract: The objective of the dissertation is to study methods to manage and to retrieve relevant information in graphs and strings, contributing to the advancement of research in this promising field. In particular, we focus our attention on two types of data: graphs and strings. The first part of the dissertation deals with the problem of counting subgraphs, also called motifs. It was conjectured that, in some cases, the problem could be solved in polynomial time. However, we show that this is not true giving proofs of NP-completeness. We also propose a method to obtain an approximation of the number of occurrences of topological colored motifs, that are motifs in which each node has a different color. In the second part, we consider the compression problem of graphs and strings, in particular those related with sequences of integers. A new compression scheme for graphs is introduced. To obtain a compact representation of a graph, some particular subgraphs are exploited; however, we do not search explicitly for these subgraphs. The proposed scheme is suitable for the Web Graph, the graph of the URLs over the World Wide Web. To any extent, the proposed method allows to compress other kind of graphs in contrast with many previous works that need the URLs for the compression phase. We also introduce an extension of the graph compression method proposed that exploits the high number of reciprocal edges that characterize social networks. After the compression, the graph is represented by a string of characters that must be stored on the disk. A convenient way is to use prefix codes, that are variable length codes in which a code is not prefix of any other. We give the definition of a new prefix code that is suitable for the compression of a stream of integers following a power law distribution. We show analytically that this encoding follows the entropy of a zeta distribution (one of the most famous power law probability distribution) better than codes proposed in the literature. Besides, we propose a method to build prefix codes for an arbitrary probability distribution of the integers. To deal with this problem, we introduce a general prefix code and we study its properties. We test the method creating prefix codes for different distributions and comparing the compression results with codes presented in the literature. In this case the results are as good as, orbetter than those available in the literature. Moreover, we show that using the general code proposed in the previous chapter it is possible to give a common description to many of the prefix codes proposed in the literature. This result is possible thanks to the introduction of a new method to encode a finite set of integers. The last part deals with the problem of find information in huge sets of strings. A logic mining method is applied to generate logic formulas to distinguish among different classes of strings. The method is applied with success in the analysis of biological data, such as DNA sequences; in particular, it is able to find relevant genes that distinguish among Alzheimer diseased and healthy mice.
URI: http://hdl.handle.net/2307/4551
Access Rights: info:eu-repo/semantics/openAccess
Appears in Collections:X_Dipartimento di Informatica e automazione
T - Tesi di dottorato

Files in This Item:
File Description SizeFormat
Pattern Recognition and Data Mining in Graphs and Strings.pdf856.72 kBAdobe PDFView/Open
Show full item record Recommend this item

Page view(s)

58
Last Week
0
Last month
0
checked on Mar 29, 2024

Download(s)

41
checked on Mar 29, 2024

Google ScholarTM

Check


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.