Plagiarism Detection in MEDLINE Using Multiple Query Expansion and Approximate Phrase Match Techniques

Vijay S Mhaskar, Prof. Thombre B. H


Plagiarism is using others' work and passing it off as one's own without appropriately acknowledging the source. Identifying duplicated and plagiarized passages of text is becoming a popular area of research. Plagiarism is an important issue in every academic and research institute, and the situation is becoming worse with easily available online resources. MEDLINE has a dataset of more than 26 million publications from 5639 different publication sources in medicine and related fields. New publications are added continuously, which makes it difficult to keep track of the information it contains. In this paper we propose a method for plagiarism detection that identifies the most relevant sources of plagiarism in MEDLINE. We use a set of document matching rules, based on query expansion and approximate phrase match techniques, to match relevant documents. The system we propose for candidate document selection is built on highly scalable open source technologies. Finally, when ranking suspected documents, we rank them by the combined weight of the various match types. We evaluated the proposed approach on a recently available MEDLINE corpus, which gave very promising results.
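
As a rough illustration of ranking by the combined weight of match types, the following minimal Python sketch scores candidate documents using exact phrase matches, approximate phrase matches (phrase tokens in order, with a bounded number of intervening words), and expanded query terms. It is not the system described in this paper: the weight values, the `slop` parameter, the tokenizer, and all function names are assumptions made only for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical weights per match type; the abstract does not give values.
MATCH_WEIGHTS = {"exact_phrase": 3.0, "approx_phrase": 2.0, "expanded_term": 1.0}


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, and strip basic punctuation."""
    tokens = (t.strip(".,;:()").lower() for t in text.split())
    return [t for t in tokens if t]


def approx_phrase_match(phrase: List[str], doc: List[str], slop: int = 2) -> bool:
    """True if the phrase tokens occur in order in doc with at most
    `slop` extra tokens between them in total (slop=0 is an exact match)."""
    if not phrase:
        return False
    max_span = len(phrase) + slop
    for start, tok in enumerate(doc):
        if tok != phrase[0]:
            continue
        matched = 1
        # Only look inside a window of max_span tokens from this start.
        for j in range(start + 1, min(len(doc), start + max_span)):
            if matched == len(phrase):
                break
            if doc[j] == phrase[matched]:
                matched += 1
        if matched == len(phrase):
            return True
    return False


def score_document(phrases: List[str], expanded_terms: List[str],
                   doc_text: str, slop: int = 2) -> float:
    """Sum the weights of every match type found in the document."""
    doc = tokenize(doc_text)
    doc_terms = set(doc)
    score = 0.0
    for phrase in phrases:
        p = tokenize(phrase)
        if approx_phrase_match(p, doc, slop=0):
            score += MATCH_WEIGHTS["exact_phrase"]
        elif approx_phrase_match(p, doc, slop=slop):
            score += MATCH_WEIGHTS["approx_phrase"]
    for term in expanded_terms:
        if term.lower() in doc_terms:
            score += MATCH_WEIGHTS["expanded_term"]
    return score


def rank(candidates: Dict[str, str], phrases: List[str],
         expanded_terms: List[str]) -> List[Tuple[str, float]]:
    """Return (doc_id, score) pairs, highest combined weight first."""
    scored = [(doc_id, score_document(phrases, expanded_terms, text))
              for doc_id, text in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    # Toy candidate set; not real MEDLINE records.
    candidates = {
        "a": "Aspirin reduces the risk of heart attack in adults.",
        "b": "Aspirin reduces risk of myocardial infarction.",
        "c": "Unrelated text about surgery.",
    }
    print(rank(candidates, ["aspirin reduces risk"], ["myocardial", "heart"]))
```

In this toy setup an exact phrase hit outweighs an approximate one, so a verbatim copy ranks above a lightly reworded passage, while expanded terms (for example synonyms of a query term) contribute a smaller amount.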
