Plagiarism Detection in MEDLINE Using Multiple Query Expansion and Approximate Phrase Match Techniques

Vijay S Mhaskar, Prof. Thombre B. H


Plagiarism is the use of others' work, passed off as one's own, without appropriately acknowledging the source. Identifying duplicated and plagiarized passages of text is becoming a popular area of research. Plagiarism is an important issue in every academic and research institute, and the situation is worsening as online resources become easily available. MEDLINE holds a dataset of more than 26 million publications drawn from 5639 different sources in medicine and related fields. New publications are added continuously, which makes it difficult to keep track of the information it contains. In this paper we propose a method for plagiarism detection that identifies the most relevant sources of plagiarism in MEDLINE. We use a set of document matching rules based on query expansion and approximate phrase match techniques to retrieve relevant documents. The system we propose for candidate document selection is built on highly scalable open source technologies. Finally, when ranking suspected documents, we order them by the combined weight of the various match types. We evaluated the proposed approach on a recently available MEDLINE corpus, and it gave very promising results.
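As a rough illustration of the ranking idea described in the abstract, the following Python sketch scores candidate documents by a weighted sum over match types. It is not the authors' implementation: the weight values, the `slop` parameter, the function names, and the toy documents are all assumptions made for this example, and the abstract does not specify how the match types are actually weighted or combined.

```python
# Hypothetical weights per match type; the abstract gives no actual values.
MATCH_WEIGHTS = {"exact_phrase": 3.0, "approx_phrase": 2.0, "expanded_term": 1.0}


def approx_phrase_match(phrase, text, slop=1):
    """Return True if the phrase words occur in order within a window
    of len(phrase) + slop tokens of the text (assumed definition)."""
    p = phrase.lower().split()
    t = text.lower().split()
    if not p:
        return False
    window = len(p) + slop
    for start in range(len(t)):
        if t[start] != p[0]:
            continue
        matched = 0
        # Greedy in-order scan of the window starting at this position.
        for tok in t[start:start + window]:
            if tok == p[matched]:
                matched += 1
                if matched == len(p):
                    return True
    return False


def score_document(query_phrase, expansions, doc_text):
    """Combine the weights of every match type found in one document."""
    text = doc_text.lower()
    score = 0.0
    if query_phrase.lower() in text:
        score += MATCH_WEIGHTS["exact_phrase"]
    elif approx_phrase_match(query_phrase, doc_text):
        score += MATCH_WEIGHTS["approx_phrase"]
    tokens = set(text.split())
    hits = sum(1 for term in expansions if term.lower() in tokens)
    score += MATCH_WEIGHTS["expanded_term"] * hits
    return score


def rank_candidates(query_phrase, expansions, docs):
    """Return (doc_id, score) pairs, highest combined weight first."""
    scored = [(doc_id, score_document(query_phrase, expansions, text))
              for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy documents, invented for illustration only.
docs = {
    "a": "heart attack risk rises with age",
    "b": "heart attack and risk factors in myocardial infarction",
    "c": "unrelated text",
}
ranking = rank_candidates("heart attack risk", ["myocardial", "infarction"], docs)
```

In this toy setup, document `b` has only an approximate phrase match but also contains both expansion terms, so it can outrank document `a`, which has an exact phrase match alone. A production system would more plausibly delegate these matches to a search engine rather than scan text directly.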



Copyright © IJETT, International Journal on Emerging Trends in Technology