Enhancing Crawler Performance for Deep Web Information Extraction

Parigha V. Suryawanshi, D. V. Patil


As the web changes rapidly and the volume of web resources grows,
crawling such data efficiently has become a challenging problem.
Deep web content is data that search engines cannot index because it
stays behind searchable web interfaces. The proposed system aims to
develop a framework for a focused crawler that efficiently harvests
hidden web interfaces. Initially, the crawler performs site-based
searching for center pages with the assistance of web search tools,
which avoids visiting a large number of pages. To get more precise
results, the proposed focused crawler ranks websites, giving higher
priority to those more relevant to a given search. The crawler
achieves fast in-site searching by looking for the most relevant
links with adaptive link ranking. We have incorporated the
Breadth-First Search (BFS) algorithm into incremental site
prioritizing for broad coverage of deep web sites.
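
The two-stage idea described above (rank candidate sites first, then explore each site breadth-first) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `rank_sites` and `bfs_in_site`, the keyword-overlap score, the in-memory link dictionary, and the `is_form_page` predicate are all assumptions made for illustration, and adaptive link ranking is left out.

```python
import heapq
from collections import deque


def rank_sites(sites, topic_terms):
    """Order site URLs by descending keyword overlap with the topic.

    `sites` is a list of (url, description) pairs. The overlap score is a
    stand-in (assumption) for whatever site-relevance measure a real crawler uses.
    """
    heap = []
    for order, (url, description) in enumerate(sites):
        words = set(description.lower().split())
        score = len(words & topic_terms)
        # heapq is a min-heap, so the score is negated; `order` breaks ties.
        heapq.heappush(heap, (-score, order, url))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


def bfs_in_site(start, links, is_form_page, max_pages=50):
    """Breadth-first walk over one site's link graph.

    `links` maps a page to its outgoing in-site links (a stand-in for
    fetching and parsing pages). Returns pages accepted by `is_form_page`.
    """
    seen = {start}
    queue = deque([start])
    found = []
    visited = 0
    while queue and visited < max_pages:
        page = queue.popleft()  # FIFO order gives breadth-first traversal
        visited += 1
        if is_form_page(page):
            found.append(page)
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return found
```

In this sketch a driver loop would call `rank_sites` once and then run `bfs_in_site` on each site in the returned order, so the most relevant sites are explored first while the page cap keeps the number of visited pages bounded.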


Copyright © IJETT, International Journal on Emerging Trends in Technology