INSTANCE-LEVEL ATTRIBUTE ALIGNMENT FOR HETEROGENEOUS PRODUCT SOURCES

PIAI, FEDERICO

Please use this identifier to cite or link to this item: http://hdl.handle.net/2307/40818

DC Field	Value	Language
dc.contributor.advisor	MERIALDO, PAOLO	-
dc.contributor.author	PIAI, FEDERICO	-
dc.date.accessioned	2022-06-17T10:56:30Z	-
dc.date.available	2022-06-17T10:56:30Z	-
dc.date.issued	2020-04-20	-
dc.identifier.uri	http://hdl.handle.net/2307/40818	-
dc.description.abstract	This thesis focuses on Big Data integration, a foundational area of data management research. In particular, we describe the integration of product specifications from multiple data sources, with the final goal of building a complete and reliable product graph. Exploiting multiple data sources has the advantage of providing information about rare and niche products and uncommon properties, and of offering enough redundancy to resolve potential conflicts. On the other hand, it poses several challenges due to the heterogeneity of Web sources. We describe a complete pipeline for product data integration, comprising Web extraction and integration steps, which, unlike traditional approaches, performs the record linkage step (grouping specifications by product) before the attribute alignment step (grouping attributes with equivalent semantics and defining mappings). Indeed, record linkage in the product context is simplified by the presence of general product identifiers, while attribute alignment is a very complex task because a product has many properties, some rare and some common, with many different representations. We provide an extensive analysis of the state of the art on these two tasks. We formulate the novel problem of computing attribute alignment at the instance level. Traditional schema-level alignment methods, which critically rely on local homogeneity within a source, are unable to solve this problem effectively because of the significant heterogeneity exhibited by product specifications, both across and within sources. 
We take advantage of the opportunities arising from the richness and redundancy of information across sources, and propose an iterative solution, called RaF-AIA, that consists of three key steps: (i) it uses a Bayesian model to analyze overlapping information across sources to match the most locally homogeneous attributes; (ii) inspired by NLP techniques, it uses a tagging approach to create (virtual) homogeneous attributes from tagged portions of heterogeneous attribute values; (iii) it makes creative use of classical alignment techniques based on matching attribute names and domains. We developed a publicly available benchmark (Alaska Benchmark) for the tasks of attribute alignment and record linkage, which we also used to run experiments evaluating the RaF-AIA approach, demonstrating its effectiveness and efficiency, and its superiority over alternative approaches adapted from the literature.	en_US
dc.language.iso	en	en_US
dc.publisher	Università degli studi Roma Tre	en_US
dc.subject	DATA INTEGRATION	en_US
dc.subject	TEXT MINING	en_US
dc.subject	BIG DATA	en_US
dc.title	INSTANCE-LEVEL ATTRIBUTE ALIGNMENT FOR HETEROGENEOUS PRODUCT SOURCES	en_US
dc.type	Doctoral Thesis	en_US
dc.subject.miur	Settori Disciplinari MIUR::Ingegneria industriale e dell'informazione::SISTEMI DI ELABORAZIONE DELLE INFORMAZIONI	en_US
dc.subject.isicrui	Categorie ISI-CRUI::Ingegneria industriale e dell'informazione::Information Technology & Communications Systems	en_US
dc.subject.anagraferoma3	Ingegneria industriale e dell'informazione	en_US
dc.rights.accessrights	info:eu-repo/semantics/openAccess	-
dc.description.romatrecurrent	Dipartimento di Ingegneria	*
item.grantfulltext	restricted	-
item.languageiso639-1	other	-
item.fulltext	With Fulltext	-
Appears in Collections:	X_Dipartimento di Ingegneria T - Tesi di dottorato

Files in This Item:

File	Description	Size	Format
Piai_Federico___PhD_thesis.pdf		3.08 MB	Adobe PDF


Page view(s): 158 (checked on Nov 21, 2024)

Download(s): 121 (checked on Nov 21, 2024)
