Please use this identifier to cite or link to this item:
http://hdl.handle.net/2307/5036
DC Field | Value | Language |
---|---|---|
dc.contributor.advisor | Merialdo, Paolo | - |
dc.contributor.author | Qiu, Disheng | - |
dc.date.accessioned | 2016-07-05T09:13:22Z | - |
dc.date.available | 2016-07-05T09:13:22Z | - |
dc.date.issued | 2015-06-08 | - |
dc.identifier.uri | http://hdl.handle.net/2307/5036 | - |
dc.description.abstract | The Web is a rich source of data that represents a valuable resource for many organizations. Data on the Web is usually encoded in HTML pages and is therefore not directly processable; a data extraction process, performed by software modules called wrappers, is required to use these data. Several attempts have been made to reduce the effort of generating wrappers. Supervised approaches, based on annotated pages, achieve high accuracy, but the cost of the training data, i.e., the annotations, limits their scalability. Unsupervised approaches have been developed to achieve high scalability, but the diversity of the data sources can drastically limit the accuracy of the results. Overall, obtaining both high accuracy and high scalability is challenging because of the scale of the Web and the heterogeneity of the published information. In this dissertation we describe a solution to address these challenges: to scale to the Web, we define an unsupervised approach built on several wrapper inference techniques; to control the quality, we define a quality model that determines at runtime whether human feedback is required; feedback is provided by workers enrolled from a crowdsourcing platform. Crowdsourcing is an effective way to reduce the cost of the annotation process, but previous proposals were designed for experts and are not suitable for the crowd, since workers on crowdsourcing platforms are typically non-experts. An open issue in scaling wrapper generation is the collection of the pages to wrap; we describe an end-to-end pipeline that discovers and crawls relevant websites in a case study on product specifications. An extensive evaluation with real data confirms that: (i) we can generate accurate wrappers with few simple interactions from the crowd; (ii) we can accurately estimate workers' error rates and select at runtime the number of workers to enroll for a task; (iii) we can effectively start with unsupervised approaches and switch to the crowd to increase quality; and (iv) we can discover thousands of websites from a small initial seed. | it_IT |
dc.language.iso | en | it_IT |
dc.publisher | Università degli studi Roma Tre | it_IT |
dc.title | Crowdsourcing large scale data extraction from the web: bridging automatic and supervised approaches | it_IT |
dc.type | Doctoral Thesis | it_IT |
dc.subject.miur | Settori Disciplinari MIUR::Ingegneria industriale e dell'informazione::SISTEMI DI ELABORAZIONE DELLE INFORMAZIONI | it_IT |
dc.subject.isicrui | Categorie ISI-CRUI::Ingegneria industriale e dell'informazione::Information Technology & Communications Systems | it_IT |
dc.subject.anagraferoma3 | Ingegneria industriale e dell'informazione | it_IT |
dc.rights.accessrights | info:eu-repo/semantics/openAccess | - |
dc.description.romatrecurrent | Dipartimento di Ingegneria | * |
item.languageiso639-1 | other | - |
item.grantfulltext | restricted | - |
item.fulltext | With Fulltext | - |
Appears in Collections: X_Dipartimento di Ingegneria T - Tesi di dottorato
Files in This Item:
File | Description | Size | Format
---|---|---|---
Tesi - Crowdsourcing Large scale Data Extraction from the Web Bridging Automatic and Supervised Approaches.pdf | | 1.25 MB | Adobe PDF
Page view(s): 151 (last week: 0, last month: 0; checked on Nov 23, 2024)
Download(s): 56 (checked on Nov 23, 2024)
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
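The abstract describes a runtime quality model that estimates workers' error rates and decides how many crowd workers to enroll before accepting an automatically extracted value. The dissertation's actual model is not reproduced in this record; the following is a minimal Python sketch of that general idea only, assuming an agreement-based error estimate and an independent-error majority vote (all function names, parameters, and thresholds are hypothetical).

```python
from collections import Counter


def estimate_error_rate(worker_answers, majority_answers):
    """Estimate a worker's error rate as the fraction of past tasks on which
    the worker disagreed with the majority answer (a simple assumption,
    not the dissertation's estimator)."""
    disagreements = sum(1 for w, m in zip(worker_answers, majority_answers) if w != m)
    return disagreements / max(len(worker_answers), 1)


def majority_confidence(votes, error_rates):
    """Crude confidence that the current majority answer is correct, treating
    each worker as independently wrong with their estimated error rate."""
    majority, _ = Counter(votes).most_common(1)[0]
    p_right, p_wrong = 1.0, 1.0
    for vote, err in zip(votes, error_rates):
        if vote == majority:
            p_right *= (1.0 - err)
            p_wrong *= err
        else:
            p_right *= err
            p_wrong *= (1.0 - err)
    total = p_right + p_wrong
    return p_right / total if total > 0 else 0.0


def collect_until_confident(ask_worker, worker_error_rates, target=0.95):
    """Enroll one worker at a time, stopping as soon as the estimated
    confidence in the majority answer reaches the target quality."""
    votes, errs = [], []
    for worker_id, err in enumerate(worker_error_rates):
        votes.append(ask_worker(worker_id))
        errs.append(err)
        if len(votes) >= 2 and majority_confidence(votes, errs) >= target:
            break
    majority, _ = Counter(votes).most_common(1)[0]
    return majority, majority_confidence(votes, errs)


# Example: three simulated workers labelling one value extracted by a wrapper.
simulated_answers = ["42 in", "42 in", "40 in"]
value, confidence = collect_until_confident(
    lambda worker_id: simulated_answers[worker_id],
    worker_error_rates=[0.10, 0.15, 0.30],
)
print(value, round(confidence, 3))  # stops after two agreeing workers
```

In this sketch the loop stops after two agreeing low-error workers, illustrating how fewer workers can be enrolled when their estimated error rates are low; higher error rates or disagreement would lead to enrolling more workers before the target confidence is reached.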