Data Extraction and Alignment of Search Results by Combining Tag Value Structure

Tushar Jadhav, Santosh Chobe

Abstract


The number of databases accessible on the Web through HTML form-based
search interfaces has increased tremendously over the years. For
every query submitted to such a website, the results retrieved from
the underlying database are dynamically embedded into result pages
for human browsing. To make the embedded data machine-processable,
which is necessary for applications such as comparison shopping and
deep web data collection, the data must be extracted and assigned
relevant labels. A multi-annotator approach is implemented that
first aligns the corresponding data units on the result pages into
groups, then annotates those groups from different aspects, and
finally combines the various annotations to predict a final label
for each group. Lastly, an annotation wrapper is constructed
automatically for the search website. A new technique is applied to
handle the case in which the search results are not contiguous,
which happens due to ads, comments, etc., and to handle nested tag
structures that might be present in the search results. The
experimental results show that the implemented algorithm performs
considerably better than existing methods.
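
The align, annotate, and combine steps described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: data units are aligned simply by their position within each record, the three annotators (`price_annotator`, `year_annotator`, `fallback_annotator`) and their confidence values are hypothetical, and the annotations are merged with a noisy-or rule as one plausible way to combine them.

```python
from collections import defaultdict


def align_by_position(records):
    """Group data units that occupy the same position in every record.

    Assumption: positional alignment stands in for the real alignment step.
    """
    groups = defaultdict(list)
    for record in records:
        for position, unit in enumerate(record):
            groups[position].append(unit)
    return [groups[i] for i in sorted(groups)]


# Hypothetical annotators: each returns (label, confidence) or None.
def price_annotator(group):
    if all(unit.startswith("$") for unit in group):
        return ("price", 0.9)
    return None


def year_annotator(group):
    if all(unit.isdigit() and len(unit) == 4 for unit in group):
        return ("year", 0.8)
    return None


def fallback_annotator(group):
    # Always proposes a generic label with low confidence.
    return ("text", 0.3)


def combine(group, annotators):
    """Merge annotator votes with a noisy-or rule and return the best label."""
    miss = defaultdict(lambda: 1.0)  # probability that every vote for a label is wrong
    for annotate in annotators:
        result = annotate(group)
        if result is not None:
            label, confidence = result
            miss[label] *= 1.0 - confidence
    scores = {label: 1.0 - m for label, m in miss.items()}
    return max(scores, key=scores.get)


def annotate_records(records, annotators):
    """Align the records into groups, then assign one label per group."""
    return [combine(group, annotators) for group in align_by_position(records)]


ANNOTATORS = [price_annotator, year_annotator, fallback_annotator]
```

For example, calling `annotate_records([["Dune", "1965", "$9.99"], ["Emma", "1815", "$7.50"]], ANNOTATORS)` assigns one label to each of the three aligned groups. A real system would replace the positional alignment with one that tolerates missing fields, nested tags, and non-contiguous records.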




Copyright © IJETT, International Journal on Emerging Trends in Technology