Data Extraction and Alignment of Search Results by Combining Tag Value Structure

Tushar Jadhav; Santosh Chobe

Abstract

The number of databases accessible on the web through HTML
form-based search interfaces has increased tremendously over the
years. For every query submitted to such a website, the results
retrieved from the underlying database are dynamically embedded
into result pages for human browsing. To make the embedded data
machine-processable, which is necessary for applications such as
comparison shopping and deep web data collection, the data must be
extracted and assigned relevant labels. A multi-annotator approach
is implemented that first aligns the corresponding data on result
pages into groups, then annotates those groups from different
perspectives, and finally combines the various annotations to
predict a final label for each group of data. Lastly, an annotation
wrapper is constructed automatically for the search website. A new
technique is applied to handle the case where the search results
are not contiguous, which happens due to ads, comments, etc., and
to handle nested tag structures that may be present in the search
results. The experimental results show that the implemented
algorithm performs considerably better than existing methods.
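
The align, annotate, and combine steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the record format, the tag paths, the three annotators, their confidence scores, and the highest-confidence combination rule are all assumptions chosen for illustration, and the handling of non-contiguous results and nested tags is not modeled.

```python
import re
from collections import defaultdict

# Hypothetical input: each search result record is a list of
# (tag_path, text) pairs. Paths and values are invented for illustration.
records = [
    [("div/a", "Deep Web Mining"), ("div/span", "$25.00"), ("div/i", "2011")],
    [("div/a", "Web Data Extraction"), ("div/span", "$31.50"), ("div/i", "2009")],
]

def align(records):
    """Group data units that share the same tag path across records."""
    groups = defaultdict(list)
    for record in records:
        for tag_path, text in record:
            groups[tag_path].append(text)
    return dict(groups)

# Each annotator returns (label, confidence) or None.
# The confidence values are assumed, not taken from the paper.
def price_annotator(values):
    if all(re.fullmatch(r"\$\d+(\.\d{2})?", v) for v in values):
        return ("price", 0.9)
    return None

def year_annotator(values):
    if all(re.fullmatch(r"(19|20)\d{2}", v) for v in values):
        return ("year", 0.8)
    return None

def text_annotator(values):
    # Weak fallback: any group whose values all contain letters.
    if all(re.search(r"[A-Za-z]", v) for v in values):
        return ("title", 0.3)
    return None

ANNOTATORS = [price_annotator, year_annotator, text_annotator]

def annotate(groups):
    """Combine annotators by keeping the highest-confidence label per group."""
    labels = {}
    for key, values in groups.items():
        candidates = [a(values) for a in ANNOTATORS]
        candidates = [c for c in candidates if c is not None]
        labels[key] = max(candidates, key=lambda c: c[1])[0] if candidates else None
    return labels

groups = align(records)
labels = annotate(groups)
print(labels)
```

Grouping by an exact tag path is the simplest possible alignment key; a group that no annotator recognizes is left with the label `None` rather than being forced into a category.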
