Similarity Texter

Additional Information

Test Cases 2010

This is a short description of the test cases that were used in the 2010 test of plagiarism detection systems.

0. Leap year: This is an original paper about the history of leap years, but with a properly footnoted table that is often flagged as plagiarism by software.

1. Djembe: This is a translation of an English-language essay about the Djembe drum that was done using the online translation tool Babelfish. There exist numerous plagiarisms of the English original essay, and many teachers participating in a course on plagiarism detection using this material have easily found either the original or another plagiarism. However, no software system has ever even come close to finding the source, although many “untranslatable” words are kept from the original. The picture included is also taken from the original work.

2. Atwood: This book report copies some paragraphs from the official review at the Amazon site with two paragraphs (932 characters / 141 words) from an anonymous review cut in. A typo (capital “I” in the middle of a sentence) is one of the key markers that something is wrong. Some plagiarisms of the site exist.

3. IETF: This paper is taken from a technical report about the structure of the Internet. Four pages were copied, the quotations removed, two spelling errors fixed, and the lead sentence rewritten. There exists at least one plagiarism in an online exhibition catalogue that was discovered during the first test. Although the authors have requested takedown, the material was still to be found in the 2008 test.

4. Döner: This essay about the popular Turkish street food, also known as Döner Kebap, is a carefully crafted shake & paste from three sources: a scientific one, a popular one, and the German Wikipedia. A second version of this paper was stored with a German umlaut in the file name.

5. Telnet: This paper is one that was actually submitted by a student to a colleague. It plagiarizes a bootleg PDF copy of a hacker’s book that circulates on the Internet. The PDF was scanned and has typical character recognition errors, such as “~” instead of “-”. The student was aware that the plaintext dates in the telnet protocol looked odd (i.e. they were far too old), so these values were changed. The time stamps on the telnet commands, however, were not changed, which struck my colleague as being quite odd. A search on the time stamps is quickly successful.

6. Friðrik Þór Friðriksson: This is an original biography written about the Icelandic film director containing 21 Icelandic characters and 2 Danish characters in the names. After the 2004 test the paper was included in the German Wikipedia with the correct author named in the history tab. Some students do this – put their reports online before they are graded – and it can cause a false positive, especially if a teacher does not check the authorship of the Wikipedia article given as the source by the plagiarism detection software.

7. Maple Syrup: This report is a clause quilt from a children’s TV show script available online (as well as a few plagiarisms) and an article from the Wikipedia.

8. Reinhard Lettau: This original biography was placed by the author in both the English and the German Wikipedia and noted as such. There exist an enormous number of legal and illegal copies of the Wikipedia online, making test cases 6 and 8 appear to have very many sources to some systems.

9. Grass frogs: This essay was purchased from a paper mill and is used by permission. Human searchers (biology teachers) have found the schoolbook from which this paper was cribbed, and there is now a PDF version of the book available online. It can be seen by teachers to be a plagiarism, as it uses German spellings from before the last reform. The unreferenced picture is from a public domain animal picture database.

10. Fraktur: This paper about a German type family is taken from a PDF that is itself written in a Fraktur typeface. That means that all ligatures are encoded (for example, a hash (#) encodes ff, sz encodes ß, and so on), and since every second or third word includes a ligature, it is highly unlikely for this to be found by a software match. There are also 6 Scandinavian ligatures in the text. Paragraphs from the PDF are mixed as a shake & paste collection with paragraphs from a book about Fraktur. It includes 10 pictures of words with Fraktur ligatures.

11. Henning Mankell: This book report about a detective story by the Swedish author is an exact copy from the Internet (including typographical errors). During the 2007 test an online student plagiarism was found; the author of the original book report was successful in having that site taken down.

12. Microbreweries: This test case is a hand translation from an article in the English Wikipedia about small breweries.

13. Allspice: This paper is a translation into German from an English translation of a Swedish original chapter in a book about spices. It is a shake & paste of paragraphs. It sticks out for a German-speaking teacher reading it because it discusses the Danish and Swedish names and uses of piment, instead of German ones.

14. Max Schmeling: This biography of the German boxing legend is original, but the footnotes are made up. The information was found in a tourist brochure, so since this was not quotable, a scientific journal of local history (that does not exist) was made up as the source.

15. Public toilets: This report is taken from a DVD version of a German technology encyclopedia that was published in 1910 and is now in the public domain. Even though this is not technically a copyright problem, it is still plagiarism, as the source is not given. The dates in the footnotes have 100 years added to them to look more modern. The five pictures illustrating the work are the copperplates found in the encyclopedia – obvious to a teacher, but invisible to software.

16. Elfriede Jelinek: This biography of the 2004 Nobel Prize laureate for literature is a shake & paste plagiarism from three sources. One was translated by hand from an English-language blog, one is an official book report, and the third a print newspaper article available online. The English blog is no longer available online.

17. Square dancing: This paper is almost original, except for one paragraph about the clothing that was taken verbatim from the home page of a club. There exists at least one plagiarism of the text on the pages of another club.

18. Vikings: This paper is a highly adapted clause quilt based on the online version of a scholarly journal article about the Vikings. Almost every sentence had some sort of change done to it – word order changed, synonyms used, etc. Only the quote of a rune stone text was left unchanged.

19. Blogs: This is a structural plagiarism of a PDF about the digital revolution. Sentences and paragraphs were used in ascending order, as well as the footnotes. Each sentence was manipulated so that it was not identical to the source.

20. Volleyball: Two sentences of an otherwise original work about the sport were taken from a web page.

21. Tibet: Three sources were used for this shake & paste plagiarism: the Wikipedia, an article in a German daily newspaper, and an article from a weekly computer newspaper. There are a number of sources referenced, but the reference numbering scheme contains gaps – caused by sentences being removed in the middle and the references not being adjusted, something that is glaringly obvious to a human reader.
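The numbering gaps described for test case 21 can also be checked mechanically. The following is a minimal sketch, assuming that citations appear as bracketed numbers such as [3] and that numbering starts at 1; neither detail is stated above, and the function name is hypothetical.

```python
import re

def missing_reference_numbers(text):
    """Return reference numbers absent from the bracketed citations in text.

    Assumes citations look like [1], [2], ... and start at 1 (an assumption,
    not something the test case description specifies).
    """
    cited = {int(n) for n in re.findall(r"\[(\d+)\]", text)}
    if not cited:
        return []
    return [n for n in range(1, max(cited) + 1) if n not in cited]

sample = "Tibet lies on a plateau [1]. Its capital is Lhasa [2]. Tourism grew [5]."
print(missing_reference_numbers(sample))
```

A non-empty result is only a hint that passages were removed; a paper may skip numbers for harmless reasons.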

22. Le Pont: This test case was prepared from a French original using Google Translate. The result was polished to make the German sentences read cleanly, because the sentence structure produced by the automatic translator was quite unintelligible.

23. Wok: This test case was prepared by using the Amazon “Search Inside” feature. A cookbook was found with an appropriate page describing a wok, and the page was typed up by hand.

24. Keyboard: This article about the Dvorak keyboard was prepared as a shake & paste plagiarism from an online article.

25. Surströmming: This plagiarism was copied completely from an online article that itself plagiarized the Wikipedia heavily. Then additional, original sentences were added so that the originality quotient came to about 20%. During the test, however, the source disappeared from the Internet without a trace. We adjusted the scoring to only score hits on the Wikipedia. A second version of this article was prepared in which, in one paragraph, every letter “e” was replaced with “?”, a differently coded letter that looks similar to an “e” on a quick read-through.
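The letter substitution in test case 25 can be spotted by looking at the Unicode names of the characters. The sketch below is an illustration only: the description does not say which look-alike letter was used, so the Cyrillic letter in the sample is an assumption, and the function name is hypothetical.

```python
import unicodedata

def suspicious_chars(text):
    """Return (index, char, Unicode name) for letters outside the Latin script."""
    hits = []
    for i, ch in enumerate(text):
        name = unicodedata.name(ch, "UNKNOWN")
        if ch.isalpha() and not name.startswith("LATIN"):
            hits.append((i, ch, name))
    return hits

# Assumption for illustration: \u0435 (CYRILLIC SMALL LETTER IE) stands in for "e".
sample = "Surstr\u00f6mming is f\u0435rmented herring"
for index, char, name in suspicious_chars(sample):
    print(index, repr(char), name)
```

Umlauts and other accented Latin letters pass this check, so ordinary German or Swedish text is not flagged.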

26. Ajax: An article from the online journal database of Springer was taken (with permission) as the basis for this copy & paste plagiarism. During the test we discovered that Ciando and Google Books also have the entire article indexed – although the link delivered by Google is just to a page for purchasing an electronic copy of the article or for obtaining the paper if a login is available.

27. Codfish: This is a shake & paste plagiarism taken from a German weekly newspaper, an online special edition of another weekly newspaper, and the Wikipedia.

28. Brantenberg: This test case about the Norwegian author Gerd Brantenberg was translated by hand from an online source in Norwegian. It contains many place names, so it should be discoverable.

29. Facebook: Half of this test case is taken as a copy & paste with permission from a student blog; the rest – for the most part complete sentences – is original.

30. Smoking ban: This is an original paper about the smoking ban in public places recently introduced in Germany.

31. Pickles: This essay is a shake & paste from the Wikipedia and a site called WiseGeek and is all about pickles.

32. Zakumi: This essay is a shake & paste from the Wikipedia and the FIFA home pages about the mascot of the soccer world championships held in 2010 in South Africa and is used by permission.

33. Eyjafjallajökull: This is from the Wikipedia article on the Icelandic volcano that shook the world in 2010; it includes many special characters.

34. Stieg Larsson: This is an original essay about the Swedish novelist.

35. Perl: This essay about Perl is original and includes Perl code that would have inserted a blinking red statement into the reports if a system had been written in Perl and did not sanitize its input. No system showed this behavior.
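The Perl payload itself is not reproduced here. As a rough illustration of the general principle behind test case 35, treating submitted text as data rather than as markup or code, the following sketch escapes a submission before it is placed in an HTML report. It shows output escaping only, not the Perl injection path the test case targeted; the function name and the sample payload are hypothetical.

```python
import html

def render_excerpt(submitted_text):
    """Escape markup characters so submitted text is shown literally in an HTML report."""
    return "<pre>" + html.escape(submitted_text) + "</pre>"

# Hypothetical payload, not the one used in the test case.
payload = '<blink style="color:red">This report was tampered with</blink>'
print(render_excerpt(payload))
```

With the escaping in place, the report displays the tags as plain text instead of rendering them.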

36. Champagne: This is a portion of the article about champagne in the French Wikipedia that was translated to English by Google Translate. With the list of the names of the bottles it should have been easier for systems to find. It wasn’t.

37. Mosque: This shake & paste essay is taken from the Wikipedia and a site “All about Turkey” and is used by permission.

38. Voltavoltaic: This is a plagiarism of a student site at our school about using solar energy. It turned out that, for some reason, the students had blocked their entire site from search engine crawlers. Thus, no search engine could find this.
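How the student site in test case 38 kept crawlers out is not stated. One common mechanism is a robots.txt file, and the sketch below assumes that mechanism purely for illustration; the robots.txt content, the crawler name, and the URL are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

def is_blocked(robots_txt, user_agent, url):
    """Return True if the given robots.txt text forbids user_agent from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)

# Hypothetical robots.txt that shuts out every well-behaved crawler.
robots_txt = """\
User-agent: *
Disallow: /
"""

print(is_blocked(robots_txt, "ExampleBot", "https://example.org/solar/index.html"))
```

A detection system that relies on a search engine index cannot find pages excluded this way, even though a person with the link can still read them.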

39. Agassiz: This essay is taken from a book scanned into Google Books and used by permission.

40. Barbarians: This is taken from Machiavelli, The Prince, as it appears in Project Gutenberg, and is about the Barbarians. It is edited for clarity.