THE HAUGHTY JACKDAW AND THE PEACOCK

There once lived a jackdaw, full of vain pride, who stole feathers that had fallen from the peacock and adorned herself with them. Despising her own jackdaw folk, she joined the ranks of the beautiful peacocks. There the feathers are torn from the impudent bird, and she is driven off with beaks. And the jackdaw, badly battered, now wants to return, dejected, to her own folk. But they push her away with harsh scorn. And one of those she had previously despised said to her: “Had our way of life suited you before, had you accepted what nature gave you, then neither would that disgrace have befallen you, nor would you now have to be cast out to your misfortune.”

This version of Aesop’s fable is from Wilfried Stroh’s collection of translations of Jan Novák’s “Aesopia”, which is based on stories by Phaedrus.

Paper 2008

On the Utility of Plagiarism Detection Software

Debora Weber-Wulff

HTW Berlin[*]
weberwu@htw-berlin.de


Note: Presented at the 3rd International Plagiarism Conference in Newcastle, England, June 2008. Submitted for publication with updated results February 2009. Since the editor does not respond to emails, I am publishing it here under a Creative Commons license:
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 Unported License.


Abstract

There are many software systems that suggest that they can reliably determine if a submitted text or an online document is plagiarized or not. This paper discusses the problems associated with such software and reports on a test of plagiarism detection and collusion detection systems conducted in 2007. There were no systems that achieved a mark of very good and there was only one system, Ephorus, which was just barely given a mark of good. Eight systems were graded acceptable, with the popular Turnitin only reaching eighth place, and four systems were found to be completely unacceptable. Details of the test methodology and the test cases are given, as well as results for each software system. The paper concludes that plagiarism detection systems cannot be relied on to find all instances of plagiarism, and many cannot even find simple plagiarisms.

1. Introduction

Plagiarism is a problem that has been affecting authors for centuries. The term “plagiarism” goes back to the Roman epigrammatist Martial (Martialis), who, in anger over another author, Fidentinus, publishing Martial’s poems under his own name, accused him of plagiarium, the stealing of children. Martial considered his poems to be the children of his mind, and felt that they had been stolen by Fidentinus.

Many authors have been accused of plagiarism, rightly or wrongly, throughout the ages, but only since the concept of author’s rights arose around the time of the first encyclopedists in the 18th century has there been the clear notion that it is not ethical to use someone else’s words without proper citation.

With the spread of the Internet and the vast amount of texts available, there appears to be an increase in the amount of plagiarism, although there are very few investigations that offer exact numbers. It is clear that in educational settings many pupils and students do not understand the meaning of plagiarism and why it is a problem, and thus some may submit other people’s words as their own.

Many administrators and teachers, looking for an easy way to solve this apparently growing problem, turn to plagiarism detection software for help. This flourishing industry, sometimes even offering the stamp of approval “100% original!” that can never actually be determined, purports to make life easier for teachers. This paper will first look into the possibility of mechanically determining that plagiarism has occurred, briefly discuss collusion problems especially as they occur with program code, and then present the results of a large-scale test of plagiarism detection software.

2. Plagiarism Detection

What does it mean to detect plagiarism? We first have to define plagiarism, no easy task. This is similar to the decision of whether a man is bald or not – it is very clear when he is bald, and very clear when he is not, but there is a vast grey area in which a clear decision is not easy to make.

2.1 Definition of Plagiarism

One definition of plagiarism, as given in the German Wikipedia, is “representing the intellectual property of others or the works of a third party as one’s own work, totally or in part. This can mean an exact copy as well as any adaptation (changing the order of words or replacing them with synonyms) or use of the structure or argumentation structure or a translation. Some sources classify cooked data and incomplete footnoting as plagiarism.” This is, actually, my own definition of plagiarism as I have formulated it there, and I reference it here so that a supposed plagiarism on my own part can be cleared up right away.

We have classified the various types of plagiarism in [1]. They include copy&paste; copy, shake&paste; patchwriting (rewording); structural plagiarism; translations.

Which of these can software hope to discover? As has been shown in the tests, the detection of translations is currently not possible. Human beings have an advantage here, as they can see that a text appears to have been written by a native speaker of German or Chinese, but a machine cannot determine the source language, only the language that the text is in [2].

Structural plagiarism is the next most difficult to determine. One can see the ideas being presented in order, or perhaps the footnote citations also in the identical order to some other work, but this happens on the level of the semantics of a text, not at a syntactical level. This kind of plagiarism can only be detected by a human reader.

Patchwriting is defined by Howard in [3] as “copying from a source text and then deleting some words, altering grammatical structures, or plugging in one synonym for another.” The plagiarist takes one or more text passages as a basis, and then does some editing: adjectives are removed or replaced with synonyms, verb tenses are adjusted, lists are reordered, phrases are deleted or inserted. Many students are actually of the opinion that this is scientific writing! They feel that since they have put some “work” into the report, it is now their own version, without having checked a single fact or done any of their own research. Howard controversially contends that patchwriting is an effective tool for teaching students how to write scientifically. The problem is that if one indeed uses patchwriting to teach writing, at some point one must wean the students from this method and get them to find their own voice.

Patchwriting can be found by some systems that use a “fuzzy” matching algorithm and distance functions on the word set to determine similarity. Systems that only do character-by-character comparisons can be fooled by this kind of writing.
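
As an illustration of what a distance function on the word set can look like, here is a minimal sketch using the Jaccard similarity. This is one possible measure chosen for illustration; it is an assumption, not the method of any of the systems tested, and the function names and sample sentences are made up.

```python
import re

def word_set(text: str) -> set:
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"\w+", text.lower()))

def jaccard_similarity(a: str, b: str) -> float:
    """Share of distinct words common to both texts (0.0 to 1.0)."""
    words_a, words_b = word_set(a), word_set(b)
    if not words_a and not words_b:
        return 1.0  # two empty texts are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)

# Hypothetical example: a sentence and a lightly reworded version of it.
source = "The djembe is a rope-tuned drum played with bare hands"
patchwritten = "The djembe is a rope-tuned drum that is played using bare hands"

print(source == patchwritten)                    # a character comparison fails
print(jaccard_similarity(source, patchwritten))  # the word sets still overlap heavily
```

A character-by-character comparison sees two different strings, while the word sets still share 10 of 13 distinct words.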

A kind of plagiarism that is often observed can be called “copy, shake & paste”. There is an American product called Shake&Bake™ in which you put spices in a bag, add pieces of chicken, shake well, and then take the pieces out in random order, placing them on a baking sheet. This process seems to be similar to what some students do. They find paragraphs that have something to do with their topic, copy them, put them in a virtual bag and shake them well before putting them in random order in their paper. When reading such a paper one is struck by marked and abrupt changes in style and paragraphs not flowing from one to the next. Occasionally a student will include a paragraph of their own writing, and the difference between this writing, which often includes grammatically incorrect structures, and the polished writing of the rest are signs of a plagiarist at work.

If one of the plagiarized paragraphs is included in the portion of text tested by a plagiarism detection system, there is a good chance that it may be caught. But if, for example, only one paragraph is taken, only 8 out of 13 plagiarism detection systems were successful in finding the source.

The simplest kind of plagiarism, and the one that is theoretically easiest to detect, is the exact copy, the copy&paste plagiarism. These tend to be the plagiarisms that teachers take action on, although students will defend themselves by saying that they did give a source, perhaps in a footnote with a link to the site. But in a word-for-word copy, there should be quotation marks around the portions used.

2.2 Finding Plagiarism with Software

Software can only hope to compare the syntax, on a character or word level, and determine the similarity between texts. There is some experimental work being done in the area of semantic recognition, but this only seems successful for highly structured text such as programming language code. These algorithms will be briefly discussed here.

The question is how similarity between two texts, the suspected plagiarism T1 and potential source T2, can be determined. The easiest case is when they are exact copies of each other. They will be the same length, and character for character identical. Calculating a hash value on each text will result in the same value. But if any changes have been made, they are no longer identical, although highly similar.
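
The hash comparison can be sketched in a few lines. SHA-256 is chosen here only for illustration (the text does not name a hash function), and the sample sentences are made up.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

t1 = "Plagiarism is a problem that has been affecting authors for centuries."
t2_exact = "Plagiarism is a problem that has been affecting authors for centuries."
t2_edited = "Plagiarism is a problem that has affected authors for centuries."

print(fingerprint(t1) == fingerprint(t2_exact))   # identical texts, identical hash
print(fingerprint(t1) == fingerprint(t2_edited))  # a small edit changes the hash
```

The second comparison shows the limitation: the hash says nothing about how similar two non-identical texts are.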

A text is a sequence of characters, traditionally grouped into words or subsequences of characters that are separated from each other by blanks or punctuation symbols. One measure of similarity can be the determination of the longest subsequence of characters shared by two texts.
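
One way to make this measure concrete is the longest common substring, that is, the longest contiguous run of characters that appears in both texts. The following is a minimal dynamic-programming sketch with a hypothetical function name; it is quadratic in the text lengths and is meant only to illustrate the idea.

```python
def longest_common_substring(t1: str, t2: str) -> str:
    """Return the longest contiguous run of characters found in both texts."""
    best_len = 0   # length of the best match so far
    best_end = 0   # end position (exclusive) of that match in t1
    prev = [0] * (len(t2) + 1)
    for i in range(1, len(t1) + 1):
        curr = [0] * (len(t2) + 1)
        for j in range(1, len(t2) + 1):
            if t1[i - 1] == t2[j - 1]:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return t1[best_end - best_len:best_end]

print(repr(longest_common_substring("the quick brown fox", "a quick brown dog")))
```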

In order to do this, a portion of T1 is selected and compared to T2 using variations on algorithms such as the Knuth-Morris-Pratt [4] or the Boyer-Moore string-searching algorithm [5]. In order to test the entire document T1 against T2, one would have to test all reasonable substrings of T1 against T2. And this is just one possible source file: search engines and databases store billions of files that are candidates for T2. It is possible to first do a search on keywords from T1 in order to obtain a manageable subset of files as candidate source files.
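
For readers unfamiliar with the Knuth-Morris-Pratt algorithm, the following is a textbook sketch of it, not the variation used by any particular detection system. The function names are illustrative.

```python
def build_failure_table(pattern: str) -> list:
    """For each prefix of the pattern, the length of its longest proper
    prefix that is also a suffix."""
    table = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = table[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        table[i] = k
    return table

def kmp_find_all(text: str, pattern: str) -> list:
    """Return the start indices of all (possibly overlapping) occurrences."""
    if not pattern:
        return []
    table = build_failure_table(pattern)
    matches = []
    k = 0  # number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = table[k - 1]  # fall back without re-reading the text
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = table[k - 1]
    return matches

print(kmp_find_all("abababa", "aba"))
```

The search reads each character of the text once, which matters when a passage from T1 has to be checked against many candidate sources.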

Another method involves a database with extra storage for indexing files. The text can be split into words, stop words can be removed, and documents are retrieved which contain many of the words left. The proximity of the words to each other in both T1 and T2 is calculated and documents with similar proximities of some words are scored higher. Often a combination of sharing subsequences of characters and proximities is used for scoring.
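
The index-based retrieval step can be sketched as follows. The stop word list, document names and texts are made up for illustration, and the sketch only counts shared words; the proximity scoring described above is left out.

```python
import re

# Tiny illustrative stop word list (an assumption, not taken from any system).
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "in", "to", "with"}

def content_words(text: str) -> list:
    """Split the text into lowercase words and drop the stop words."""
    return [w for w in re.findall(r"\w+", text.lower()) if w not in STOP_WORDS]

def build_index(documents: dict) -> dict:
    """Map each content word to the set of document ids that contain it."""
    index = {}
    for doc_id, text in documents.items():
        for word in set(content_words(text)):
            index.setdefault(word, set()).add(doc_id)
    return index

def rank_candidates(suspect: str, index: dict) -> list:
    """Rank documents by the number of content words shared with the suspect."""
    counts = {}
    for word in set(content_words(suspect)):
        for doc_id in index.get(word, ()):
            counts[doc_id] = counts.get(doc_id, 0) + 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

documents = {
    "drum": "The djembe is a drum from West Africa",
    "syrup": "Maple syrup is made from the sap of maple trees",
}
index = build_index(documents)
print(rank_candidates("A djembe drum is played in West Africa", index))
```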

There are some recent papers on attempts to use natural language processing methods to detect plagiarism, but they are so computing intensive that they are currently not feasible for plagiarism detection in general ([5], [6]).

2.3. Collusion Detection

Collusion detection is a somewhat easier task, as the corpus of documents to be tested is known and bounded. Collusion is when two or more students hand in the same or similar papers. This happens in larger classes where the chances of being found out are considered to be minimal.

The algorithms used in collusion detection are similar to those used for plagiarism detection, but since the test does not have to be made with an extremely large corpus, a more thorough testing can be done using more complicated algorithms. Still, each paper must be compared with each of the other ones, so n(n-1)/2 comparisons need to be made for n papers. In our test, one system, JPlag [7], could not handle more than 50 program files without breaking down.
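
Comparing every paper with every other paper once means one comparison per unordered pair of papers, which is n(n-1)/2 for n papers. A short sketch with made-up file names:

```python
from itertools import combinations

def comparison_count(n: int) -> int:
    """Number of unordered pairs among n papers."""
    return n * (n - 1) // 2

def all_pairs(papers: list) -> list:
    """Every pair of papers that has to be compared once."""
    return list(combinations(papers, 2))

papers = ["paper_a.txt", "paper_b.txt", "paper_c.txt", "paper_d.txt"]
print(all_pairs(papers))      # 6 pairs for 4 papers
print(comparison_count(50))   # 1225 pairs for 50 papers
```

The count grows quadratically, so doubling the class size roughly quadruples the work.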

3. Software Test

Seventeen software systems for plagiarism and collusion detection were tested in Berlin, Germany, in September 2007 to see how well they can determine if short papers are plagiarisms or not. The methodology had previously been used in 2004 for a similar test; this time twice the number of test cases was used. This section discusses the test methodology and the results obtained.

3.1 Test Methodology

The test of the plagiarism detection software systems was done using 20 short papers in German (i.e. that have non-standard characters such as the German umlauts) that were specially prepared for this test. Ten of the papers were used in the 2004 test. Since this material has been available on the Internet since 2004, it was supposed that they would be in many databases, despite requesting that there be no indexing of these files. Ten new papers were prepared using different types of plagiarism. In particular, the following test cases were used:

  • Leap year: This is an original paper.
  • Djembe: This is a translation of an English essay about the Djembe drum that was done using Babelfish. There exist numerous plagiarisms of the original English essay, and many teachers participating in a course on plagiarism detection using this material have easily found either the original or another plagiarism. However, no software system has ever come close to finding the source, although many “untranslatable” words are kept in the original English.
  • Atwood: This book report copies from the Amazon site, with only some cosmetic changes made. Some plagiarisms of this site exist.
  • IETF: This paper is taken from a technical report; at least one plagiarism in an online exhibition catalogue exists.
  • Döner: This essay is carefully crafted patchwriting of three sources, a scientific one, a popular scientific one, and the German Wikipedia.
  • Telnet: This paper is one that was actually submitted by a student to a colleague. It plagiarizes a bootleg PDF copy of a hacker’s book that circulates on the Internet. The student was aware that the plaintext dates looked odd (i.e. they were too old), so these values were changed. The timestamps on the telnet commands, however, were not changed, which struck my colleague as being very strange and induced her to search for the timestamps – which was quickly successful.
  • Friðrik Þór Friðriksson: This is an original biography written about the Icelandic director that was included in the German Wikipedia with the correct author named in the history. Some students do this – put their reports online before they are graded – and it can cause a false positive, especially if a teacher does not check the authorship of the Wikipedia article given as the source by the plagiarism detection software.
  • Maple Syrup: This report is a patchwriting of a children’s TV show script available online and the Wikipedia.
  • Reinhard Lettau: This original biography was placed by the author in both the English and German Wikipedia and noted as such. There exist a number of legal copies (copies of the Wikipedia) and also plagiarisms of this online.
  • Grass frogs: This essay was purchased from a paper mill and is used by permission. Human searchers have found the schoolbook from which this paper was cribbed; there is a PDF of the book available online.
  • Fraktur: This paper about the German type family is taken from a PDF that is itself written in Fraktur. That means that all ligatures are encoded (e.g. # encodes ff, sz encodes ß), and since every second or third word includes a ligature, it is highly unlikely that this will be found by a software match. Paragraphs from the PDF are mixed as a shake&paste with paragraphs from a book about Fraktur.
  • Henning Mankell: This book report about a detective story by the Swedish author is an exact copy from the Internet (including typos).
  • Microbreweries: This report is a translation of the English Wikipedia.
  • Allspice: This report is a translation into German from an English translation of a Swedish original. It is taken from a book using shake&paste, but sticks out for a teacher because it talks about the Danish and Swedish names and uses of allspice.
  • Max Schmeling: This biography of the German boxing legend is original, but the footnotes are made up. The information was found in a tourist brochure, so since this was not quotable, a scientific journal of local history was made up as the source.
  • Public toilets: This report is taken from a DVD version of an encyclopedia that was published in 1910 and is now in the public domain. Even though this is not technically a copyright problem, it is still plagiarism, as the source is not given. The dates in the footnotes had 100 years added to them to look more modern. The pictures illustrating the work are the copperplates found in the encyclopedia – obvious to a teacher, but invisible to software.
  • Elfriede Jelinek: This biography of the Nobel Prize winner is a shake&paste plagiarism from three sources, one translated by hand, an official book report, and a newspaper article.
  • Square dancing: This paper is almost original, except for one paragraph that was taken verbatim from the Internet.
  • Vikings: This paper is highly adapted from the online version of a magazine article. Almost every sentence had some sort of change done to it – word order changed, synonyms used.
  • Blogs: This is a structural plagiarism of a PDF. Sentences and paragraphs were used in ascending order, as well as the footnotes. The “glue” between the paragraphs is entirely made up.

These reports are intended to be representative of a wide variety of plagiarisms – and variations of original authorship, which could be falsely identified as plagiarism. Because they are so short, they do give an advantage for the software, as the chances of a random sample actually being a plagiarism are much higher than they might be in a 40-page thesis. All 20 reports were prepared in different formats, as well as placed online in HTML for one system, which only tests online texts.

We requested and obtained free access to all of the systems that were ultimately tested. The German system Plagiarism-Finder refused to give us access as they were in the process of updating, and when the test of MyDropBox (now SafeAssign) kept hitting internal system errors we discontinued the test of this system.

All 20 papers were submitted to the system at once, if possible. Since some systems only permit three tests a day or have some other limitation, we followed the submission rules. The time of the beginning of the test was recorded as well as the time that the results were available. Any problems with the submissions were also recorded on paper, and screenshots of the different stages of use were taken.

After the results came in, they were scored on a scale of 0 to 3. If there were three sources, 3 points were given for finding all sources, 2 for finding 2, 1 for 1, 0 for no sources found. For the originals, 3 points were given if no plagiarisms were returned. Reports that could be found in the Wikipedia were also scored with 3 points, if the plagiarism detection software returned the Wikipedia in a top scoring position.

Some systems did find the source, but it was far down the list returned. This was not given full points. The final score was given only as the sum of the scores for all 20 reports. There were no points given or taken off for usability, although this should be a factor in future tests. Some systems were tested twice, as they delivered a new system as the test was ending. We averaged the results of both tests for the final value for each system.
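
The scoring rules that are stated explicitly can be written down as a small sketch. The function names are hypothetical, and two points are assumptions: an original that is flagged as plagiarism is scored 0 here, and the reduced score for a source found far down the list is not modelled, since neither value is specified.

```python
REPORTS = 20
MAX_PER_REPORT = 3
MAX_TOTAL = REPORTS * MAX_PER_REPORT  # 60 points

def score_three_source_case(sources_found: int) -> int:
    """One point per source found for a test case with three sources."""
    if not 0 <= sources_found <= 3:
        raise ValueError("sources_found must be between 0 and 3")
    return sources_found

def score_original(reported_as_plagiarism: bool) -> int:
    """Full points if an original paper is not reported as plagiarism.
    The 0 for a false positive is an assumption, not stated in the text."""
    return 0 if reported_as_plagiarism else MAX_PER_REPORT

def total_score(report_scores: list) -> int:
    """The final score is simply the sum over all reports."""
    return sum(report_scores)

print(score_three_source_case(2))  # 2
print(score_original(False))       # 3
print(MAX_TOTAL)                   # 60
```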

We also tested three collusion detection systems, but the results were not comparable to each other (two tested software programs, the other one text documents) so they are not included in the discussion here.

3.2 Test results

The results were classified into very good, good, acceptable and not acceptable. A discussion of these results is given here, along with a brief discussion of the systems that were not part of the test.

3.2.1 Very good software

No system was classified as very good. We felt that a system needed to obtain 85% of the possible points (51/60 points) in order to be classified as very good.

3.2.2 Good software

We set the level of 60% (36/60 points) for this category, although 60% is actually the lowest passing grade in Germany. Still, there was only one system, the Dutch Ephorus system, which reached this mark and is the system in first place. The system was tested twice, an old version (which was graded with 42 points) and a new, “improved” version that only obtained 36 points, just barely making this grade. We averaged the scores to 38.

Both versions have some very grave usability problems. One can easily land in states from which the reports cannot be scrolled, and navigation through the reports is not intuitive. The results showed that the system has problems with umlauts and with plagiarisms from PDF sources. We found it problematic that the new version found fewer plagiarisms than the older version.

3.2.3 Acceptable Software

The point level for being deemed acceptable was set to be 40% of the total points, or 24/60. The term “acceptable” is used quite loosely, as for any system that had fewer than 30 points (50%), it would be just as effective to toss a coin to determine if a paper is a plagiarism or not.
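
The three thresholds (85%, 60% and 40% of 60 points) can be summarized in a short sketch. The function name is hypothetical; integer arithmetic is used so that the boundary values 51, 36 and 24 fall into the higher category.

```python
MAX_POINTS = 60

def classify(points: float) -> str:
    """Map a total score (out of 60) to the category used in this test."""
    if points * 100 >= 85 * MAX_POINTS:   # 51 points or more
        return "very good"
    if points * 100 >= 60 * MAX_POINTS:   # 36 points or more
        return "good"
    if points * 100 >= 40 * MAX_POINTS:   # 24 points or more
        return "acceptable"
    return "not acceptable"

for points in (38, 35, 26, 17):
    print(points, classify(points))
```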

The system in second place with 35 points is the German system docoloc that was developed at the TU Braunschweig. The reports are chaotic and the navigation could use some workflow-orientation. The symbols and labels are at times mysterious, the layout is problematic and it would be desirable to be able to upload a ZIP archive of all of the files to be tested. Uploading 200 papers is quite an exercise in frustration. This system is the best in this classification because it was able to cope with plagiarisms from PDF sources.

There is a tie for third place with 34 points with three systems:

  • The Swedish system Urkund was tested twice. The old system received 33 points, the new one 35. The newer system copes better with swapping out individual words, but breaks down when an entire sentence is inserted into a plagiarized paragraph. The interface for the new version is still under development and rather unusable as tested.
  • Copyscape Premium, offered by Google. This version is by paid subscription and without advertising. At a price of 5 US cents per test it is quite cheap, but the results of the free version are almost as good, so an occasional check is better done with that one.
  • PlagAware, a German system designed to track plagiarism of web sites. One must embed a logo and a link to their page in order for the page to be tested. This, of course, will greatly enhance their page rank for search machines, but the system did get relatively good results.

Sixth place, with 32 points, is awarded to Google’s free version of its plagiarism detection system, Copyscape free. It is limited to 10 documents per month, which must be accessible via URL. The results are fast and of average quality. Notable is that there are not a lot of extraneous hits – the sources named are quite to the point, if they are found at all.

The system XXX [**] came in seventh place with 29 points. It found many simple plagiarisms, but also marked silly parts of the text, at times beginning in the middle of a word, as a possible plagiarism. It was able to find plagiarisms from PDF sources. The system is almost unusable for more than one test: the result window is very hard to read and to compare with the source.

Two systems tied for eighth place with 26 points.

  • Turnitin, considered by many to be the industry leader because of its excellent navigation that fits well into the workflow of a university, was almost as bad as it was in the 2004 test. It still could not find the Wikipedia as a source and quit comparing text when it encountered an umlaut. It was only able to find one of the three PDF sources. In addition it was plagued by new problems: Top results were given to spam pages that just parroted the text searched, resulting in enormous confusion. One had to click away up to 19 of these spam sites in order to get to “real” pages that may or may not have been a source for the plagiarism. Even though we had to click away so much garbage, we still gave Turnitin points if they eventually found something useful. Turnitin responded to the test results immediately and said that they are now incorporating the Wikipedia into their search. They rescored the results themselves, now getting 35 points and tying for second place. But since many of the other systems have announced that they have made immediate changes to their systems as a result of our test, this is not a comparable result. The test will be repeated in September 2008. See the test of iPlagiarismCheck (below) for more on the Turnitin system.
  • ArticleChecker (from articlechecker.com) has an extremely bad user interface. Up to 5 files can be loaded at the same time, and the user selects Google, Yahoo and/or MSN. The results pages are completely unreadable. It was only possible to score the system because we knew the files tested. It does not show the offending passage in the file tested, so one must load it into an editor and use the search capabilities. The results are given as a number between 0 and 8+, which are clickable and result in the matching passages from the Internet. We were surprised that such an unusable system actually was as good as Turnitin in finding plagiarisms.

Last place in the acceptable systems was picapica with 25 points. This experimental system from the University of Weimar is supposed to analyze texts for their structure and to find differences in style as well as matches with sources on the Internet. The usability of this system was extremely bad and the reports were hard to decipher, but it was capable of finding some of the simpler plagiarisms.

3.2.4 Unacceptable Software

The following systems cannot be considered plagiarism detection systems; two even appear to be sham operations.

DocCop only achieved 17 points in the test, although it invests an enormous amount of time into obtaining useless results. It chooses a window on the text to be tested, tests the text underneath this window (without punctuation or umlauts), and then slides the window forward a character at a time, retesting the text now underneath the window. The reports take a very long time (the worst was 32 hours) and only one submission can be made at a time. The result is sent in an enormous E-Mail that contains all of the substrings tested and only links to the search machines with the relevant text. Since you have to search again and evaluate yourself, it would be quicker to do this without the software.
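
The sliding-window procedure described above can be sketched as follows. The window width and the exact cleaning rule are assumptions made for illustration (here everything except ASCII letters, digits and spaces is dropped); DocCop's actual parameters are not known to us.

```python
def clean(text: str) -> str:
    """Drop punctuation and non-ASCII characters such as umlauts."""
    return "".join(ch for ch in text if ch.isascii() and (ch.isalnum() or ch == " "))

def sliding_windows(text: str, width: int):
    """Yield every window of the given width, advancing one character at a time."""
    for start in range(len(text) - width + 1):
        yield text[start:start + width]

print(clean("Döner, bitte!"))
print(list(sliding_windows("abcde", 3)))
```

A text of n characters yields n - width + 1 windows, each of which has to be sent to a search engine, which helps explain the very long running times.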

The system iPlagiarismCheck only managed 12 points, although we were rather surprised to see that the results were very similar to Turnitin’s results. However, since it was not possible to click away the spam sites, iPlagiarismCheck lost many points. Additionally, the first 10 results gave our own papers as the possible source! With Turnitin, one only obtained results from the papers of other schools if one permitted them to keep a copy of the paper, and we did not want that. But as it turns out, this “system” was actually a scam that was using Turnitin under a false name and reselling the results. They submitted the papers in “keep a copy” mode – and thus found the 10 essays from the first test still in Turnitin’s database! We had explicitly requested that these be removed after the first test, as we are not the copyright holders of some of the material. We only had permission from the original authors to use their material for the plagiarism test and exercises, not as material for storage in a plagiarism database. Turnitin was able to trace this illegal use of their system and has issued cease-and-desist warnings. They assure us that our papers have been removed from the system.

The Polish system StrikePlagiarism also only reached 12 points. The only reason it got even this many points was the original works, which it correctly reported as not being plagiarisms. But many papers that were indeed plagiarisms were also reported as not plagiarized. It did find 2 plagiarisms correctly, but at 2 € a test, these are very expensive false negatives.

We decided to give the online system CatchItFirst zero points,  as they did not find any plagiarisms at all. Instead, they took a lot of time to report “100% Original!”, a completely misleading statement that a user would have to pay a lot of money for. Since one cannot prove originality, only demonstrate evidence of plagiarism, we gave no points for the original papers in the test.

4. Conclusions

It is quite sobering to contrast the results of this test to the promises tendered by the marketing departments of the various plagiarism detection system companies. It was not expected that the systems would be able to find plagiarisms that were copied from books not available online or that are the results of translation efforts. For this kind of plagiarism it is clear that a human reader is capable of seeing the changes in style or the strained structure of the translation.

But we had expected that plagiarism detection systems would be much better at finding things that are actually available on the Internet. Typical problems seemed to be not looking at the Wikipedia, stopping work upon encountering a special character such as a German umlaut, and ignoring PDFs. It was surprising that such seemingly trivial plagiarisms were not detected. We were unable to determine the reasons for this; we hope that the companies are able to fix their algorithms so that our test in 2008 can show some improvements.

In general, however, we have to say that there is no “Silver Bullet” that works magically to root out plagiarisms. Social problems cannot be solved by software: we have to educate the students on the definition of plagiarism, on how to avoid plagiarism, and we have to do spot checks with search machines to ensure compliance. Universities should invest in teaching teachers how to recognize plagiarism[***] and in making sure that they have a clear policy on how to handle the plagiarisms found.

5. References

[1] Weber-Wulff, D. and Wohnsdorf, G. Strategien der Plagiatsbekämpfung. In: Information: Wissenschaft & Praxis 57 (2006) 2, pp. 90-98.

[2] Haase, M. Linguistic Hacking. How to know what a text in an unknown language is about? 24th Chaos Communication Congress, 2007, http://events.ccc.de/congress/2007/Fahrplan/attachments/1025_LingHack-Paper.pdf

[3] Howard, R. M., Standing in the Shadow of Giants – Plagiarists, Authors, Collaborators. Ablex Publishing : Stamford, CT. 1999.

[4] Knuth, D.; Morris, Jr, J.H.; Pratt, V. (1977). “Fast pattern matching in strings”. SIAM Journal on Computing, Vol. 6, Nr. 2, pp. 323–350

[5] Leung, C. and Chan, Y. 2007. A natural language processing approach to automatic plagiarism detection. In Proceedings of the 8th ACM SIGITE Conference on Information Technology Education. ACM, New York, NY, pp. 213-218.

[6] Gruner, S. and Naven, S. 2005. Tool support for plagiarism detection in text documents. In Proceedings of the 2005 ACM Symposium on Applied Computing. L. M. Liebrock, Ed. SAC ’05. ACM, New York, NY, 776-781.

[7] Prechelt, L.; Malpohl, G.; Philippsen, M. (2002) Finding Plagiarisms among a Set of Programs with JPlag. In: Journal of Universal Computer Science, Vol. 8, Nr. 11, pp. 1016-1038


[*] I am indebted to Martin Pomerenke, my research assistant working on a grant from  the FHTW Berlin, who assisted me in conducting this research and preparing the German version of the results in September 2007, published online at http://plagiat.htw-berlin.de/software/
[**] The owner of the company for this software has threatened legal action if we publish the results of his system on our pages. So we have taken the precaution of anonymization.
[***] Our German-language E-Learning unit “Fremde Federn Finden” (Finding False Feathers) is available under Creative Commons license and can be translated into other languages. See http://plagiat.htw-berlin.de/ff/