DPD (Dummy Plagiarism Detector)

version 0.1

A small (and ugly, less-than-one-hour hack) Python script that searches the web (using Google web search) for phrases taken from a file, in order to ease plagiarism detection. Released under the GNU GPL. Developed and tested on Ubuntu 6.10 (Edgy).


Requirements: Python >= 2.4, wget (wget package), pdftotext (only for PDF support; in Ubuntu it is in the poppler-utils package) and GNU recode (recode package), plus an Internet connection.
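A quick way to confirm that the external tools are installed is to look them up on the PATH. The snippet below is only a sketch and is not part of DPD: it assumes Python 3 (for `shutil.which`), and the names `REQUIRED_TOOLS`, `OPTIONAL_TOOLS` and `missing_tools` are made up for illustration.

```python
import shutil

# Tool names come from the requirements above; pdftotext is only needed for PDF input.
REQUIRED_TOOLS = ["wget", "recode"]
OPTIONAL_TOOLS = ["pdftotext"]


def missing_tools(names, which=shutil.which):
    """Return the tools from `names` that cannot be found on the PATH."""
    return [name for name in names if which(name) is None]


if __name__ == "__main__":
    for tool in missing_tools(REQUIRED_TOOLS):
        print("missing required tool:", tool)
    for tool in missing_tools(OPTIONAL_TOOLS):
        print("missing optional tool (PDF support):", tool)
```

The `which` parameter exists only so the lookup can be replaced in a test; in normal use it can be left at its default.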


Provided that the file already has executable permissions (chmod +x) and the python executable is in /usr/bin:

./ inputfile [minwords] [maxwords]

…if it doesn't work

python inputfile [minwords] [maxwords]

The input file can be plain text (UTF-8) or PDF (provided that pdftotext is available). PDF files are converted to text (a new file inputfile.txt is created).

minwords (default 7) is the minimum number of words per phrase: phrases with fewer than minwords words are ignored.

maxwords (default 20) is the maximum number of words per phrase: phrases with more than maxwords words are ignored.

Modify minwords and maxwords to tune the analysis. The script treats . and ; as phrase separators.

The result is a list of links to pages that contain the considered phrases, ordered from the least significant to the most significant (the one in which the most phrases were found).
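The phrase selection and the final ordering described above can be sketched roughly as follows. This is not the actual DPD source: it assumes Python 3 rather than the original Python 2.4, and the function names `extract_phrases` and `rank_links`, as well as the `hits` mapping (phrase to list of links returned by the search), are hypothetical.

```python
import re
from collections import Counter


def extract_phrases(text, minwords=7, maxwords=20):
    """Split text on '.' and ';' and keep phrases within the word limits."""
    phrases = []
    for chunk in re.split(r"[.;]", text):
        words = chunk.split()
        # Phrases shorter than minwords or longer than maxwords are ignored.
        if minwords <= len(words) <= maxwords:
            phrases.append(" ".join(words))
    return phrases


def rank_links(hits):
    """Order links from least to most significant.

    `hits` maps each phrase to the links found for it; a link's
    significance is the number of distinct phrases it was found for.
    """
    counts = Counter()
    for links in hits.values():
        for link in set(links):  # count each link once per phrase
            counts[link] += 1
    # Ascending by count; ties are broken alphabetically for stable output.
    return sorted(counts.items(), key=lambda item: (item[1], item[0]))
```

Lowering minwords lets shorter, more generic phrases through, which tends to produce more (and noisier) matches; raising it has the opposite effect.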


This tool is far from perfect. It merely searches the web for pieces of text (word OR word OR word, etc.). I assume no responsibility for its usage, and I do not guarantee that it works as described. Use it only to get some hints. Getting many matches does not indicate plagiarism: the script merely looks for pages where the same words appear, so manually verify every result before claiming plagiarism!
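To make the "word OR word OR word" form concrete, here is a minimal sketch of how such a query string could be built from one phrase and encoded for use in a URL. It assumes Python 3, and the helper names `build_or_query` and `encode_query` are illustrative only; how DPD itself assembles and sends the request is not shown here.

```python
from urllib.parse import quote_plus


def build_or_query(phrase):
    """Join the words of a phrase with OR, as in 'word OR word OR word'."""
    return " OR ".join(phrase.split())


def encode_query(query):
    """Percent-encode a query string so it can be used as a URL parameter."""
    return quote_plus(query)


if __name__ == "__main__":
    query = build_or_query("the quick brown fox")
    print(query)
    print(encode_query(query))
```

Because the terms are joined with OR, a page can match without containing the phrase itself, which is why every hit needs a manual check.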


dpd/start.txt · Last modified: 2024/05/27 12:22 by