User-assisted synergic crawling and wrapping for entity discovery in vertical domains

Badr, Celine

Please use this identifier to cite or link to this item: http://hdl.handle.net/2307/4555

DC Field	Value	Language
dc.contributor.advisor	Merialdo, Paolo	-
dc.contributor.author	Badr, Celine	-
dc.date.accessioned	2015-05-26T15:05:01Z	-
dc.date.available	2015-05-26T15:05:01Z	-
dc.date.issued	2014-06-09	-
dc.identifier.uri	http://hdl.handle.net/2307/4555	-
dc.description.abstract	Large data-intensive web sites publish considerable quantities of information stored in their structured repositories. Data is usually rendered in numerous data-rich pages using templates or automatic scripts. This wealth of information is of wide interest to many applications and online services that do not have direct access to the structured data repositories. Therefore, there is a great need to locate such pages, accurately and efficiently extract the data on them, and store it in a structured format better suited to automatic processing than HTML. In this context, we exploit intra- and inter-web site information redundancy to address the problem of locating relevant data-rich pages and inferring wrappers on them, while incurring minimal user overhead. In the first part, we propose to model large data-intensive web sites, to crawl only the subset of pages pertaining to one vertical domain, and then to build effective wrappers for attributes of interest on them, with minimum user effort. Our methodology for synergic specification and execution of crawlers and wrappers is supported by a working system devoted to non-expert users, built over an active-learning inference engine. In the second part, we use the information gathered during inference on the training site to automatically discover new similar sources on the same type of entities of the vertical domain, which can be useful to complement, enrich, or verify the collected data. Our proposed approach performs an automated search and filter operation by generating specific queries and analyzing the returned search engine results. It combines the use of existing attribute, template, and page information with a semantic, syntactic, and structural evaluation of newly discovered pages to identify relevant semi-structured sources. Both techniques are validated with extensive testing on a variety of sources from different vertical domains.	it_IT
dc.language.iso	en	it_IT
dc.publisher	Università degli studi Roma Tre	it_IT
dc.subject	active learning	it_IT
dc.subject	entity discovery	it_IT
dc.subject	data extraction	it_IT
dc.subject	semi structured	it_IT
dc.title	User-assisted synergic crawling and wrapping for entity discovery in vertical domains	it_IT
dc.type	Doctoral Thesis	it_IT
dc.subject.miur	Settori Disciplinari MIUR::Ingegneria industriale e dell'informazione::SISTEMI DI ELABORAZIONE DELLE INFORMAZIONI	it_IT
dc.subject.miur	Ingegneria industriale e dell'informazione	-
dc.subject.isicrui	Categorie ISI-CRUI::Ingegneria industriale e dell'informazione::Information Technology & Communications Systems	it_IT
dc.subject.isicrui	Ingegneria industriale e dell'informazione	-
dc.subject.anagraferoma3	Ingegneria industriale e dell'informazione	it_IT
dc.rights.accessrights	info:eu-repo/semantics/openAccess	-
dc.description.romatrecurrent	Dipartimento di Ingegneria	*
item.fulltext	With Fulltext	-
item.languageiso639-1	other	-
item.grantfulltext	restricted	-
Appears in Collections:	X_Dipartimento di Ingegneria T - Tesi di dottorato

Files in This Item:

File	Description	Size	Format
cbadr-phd-thesis.pdf		3.06 MB	Adobe PDF	View/Open


