An Experimental Evaluation of Adaptive Real Time Web Crawler

Swapnil M Phalak, Sandip M Walunj

Abstract


The internet is a vast collection of web pages containing a
vast amount of information spread across many servers.
The sheer size of this collection is a daunting obstacle to
finding necessary and relevant information. This is where
search engines come in: they strive to retrieve
relevant information and serve it to the user. A web crawler is
one of the basic building blocks of a search engine. It is a program
that browses the World Wide Web for the purpose of web
indexing and stores the data in a database for further analysis
and arrangement. This paper aims to
build an adaptive real time web crawler (ARTWC) that
retrieves web links from a dataset and then achieves fast
in-site searching by extracting the most relevant links with a
flexible and dynamic link re-ranking scheme. Our results
indicate that ARTWC is more effective than existing baseline
crawlers and achieves increased coverage.
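
The abstract does not spell out the re-ranking scheme, so the following is only a minimal sketch of what dynamic link re-ranking could look like, not the ARTWC method itself. It assumes links are scored by learned term weights over the URL and anchor text, and that the weights are adjusted whenever a crawled page is judged relevant or not. The names `LinkFrontier`, `feedback`, `pop_best`, the seed terms, and the step size are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Split a URL or anchor text into lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


class LinkFrontier:
    """Hypothetical crawl frontier that re-ranks pending links on feedback."""

    def __init__(self, seed_terms):
        # Assumption: every seed topic term starts with weight 1.0.
        self.weights = Counter({term: 1.0 for term in seed_terms})
        self.links = {}  # url -> anchor text of links not yet visited

    def add(self, url, anchor_text=""):
        """Queue a link unless it is already pending."""
        self.links.setdefault(url, anchor_text)

    def score(self, url):
        """Sum the current weights of the distinct tokens in URL and anchor."""
        tokens = set(tokenize(url + " " + self.links[url]))
        return sum(self.weights[token] for token in tokens)

    def feedback(self, url, anchor_text, relevant, step=0.5):
        """Raise or lower term weights after a page has been judged."""
        delta = step if relevant else -step
        for token in set(tokenize(url + " " + anchor_text)):
            self.weights[token] += delta

    def pop_best(self):
        """Return the pending link with the highest score under the
        current weights, or None when the frontier is empty."""
        if not self.links:
            return None
        best = max(self.links, key=self.score)
        del self.links[best]
        return best


frontier = LinkFrontier(["crawler"])
frontier.add("http://example.com/crawler/intro", "web crawler intro")
frontier.add("http://example.com/sports/news", "sports news")
frontier.add("http://example.com/search/index", "search index")

first = frontier.pop_best()
# A relevant page about search shifts weight toward the term "search".
frontier.feedback("http://example.com/search/ranking", "search ranking", True)
second = frontier.pop_best()
print(first)
print(second)
```

Because scores are recomputed at selection time, every feedback call effectively re-ranks all pending links. A production crawler would more likely use a priority queue with periodic re-scoring, which this sketch leaves out for brevity.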

Copyright © IJETT, International Journal on Emerging Trends in Technology