11

I've finished writing my thesis and have run it through an online plagiarism check. I used https://www.check-plagiarism.com/ and it reported a lot of plagiarism. I don't understand why. You can find an example below. For instance, I don't see where the plagiarism is in the first sentence, which is marked in red. When looking at the suggested URL, I don't even find the key words from that sentence there. What should I do?

[screenshot of the plagiarism checker's report]

6
  • 2
    The two links shown in the picture are financemagnates.com/cryptocurrency/news/… and www2.deloitte.com/us/en/pages/financial-services/articles/… I also do not see how the highlighted sentences were identified at those URLs. It even seems unlikely that older versions of those articles would contain the sentences in question.
    – Anyon
    Commented Jun 2, 2022 at 15:59
  • 2
I believe if you are a college student, you can access better plagiarism-checking software for free, like Turnitin and iThenticate, which are recommended by my university for checking research integrity. These two give you better and more reliable results. Commented Jun 3, 2022 at 4:32
  • 1
That seems like a particularly bad, and even dishonestly presented, plagiarism checker. Commented Jun 3, 2022 at 13:44
  • 1
    Surely the accuracy varies wildly based on how well the checker is programmed? I don't think you can make a blanket statement about all of them.
    – Brady Gilg
    Commented Jun 3, 2022 at 17:12
  • 2
Standard dot-com bitcoin-mining site with free downloads of research publications and 2500+ research tools; with spell checker, link maker, citation maker; Grammarly to write concisely. Sigh, what a cluster fuck the internet has become! Commented Jun 4, 2022 at 6:23

4 Answers

45

The questions in the title and in the body of the question are different; I'll answer the question in the title, which I cite here in case of a later edit:

Online checker for plagiarism: How accurate are they?

Answer: They are inaccurate, but moreover, they are simply not "checker[s] for plagiarism", even if some of these programs have names which appear to claim the contrary.

The reason is that plagiarism is a subtle concept; determining whether some piece of writing P is plagiarized or not requires, in many cases, a proper understanding of the contents of P as well as of the context of P. Understanding those things is far beyond what software can currently (2022) do.

What software tools like the one you showed in your question really do is some kind of automated (probably statistical) analysis of texts, and highlighting of some parts of the text. The output of the software does not determine whether something is plagiarism. The fact that such a tool shows some absurd text message as a result (such as "33% plagiarized content" in your screenshot) tells us something about the software, not about the text it analyzed.
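
To give a concrete feel for what that kind of analysis might look like, here is a minimal sketch of a word n-gram overlap check in Python. It is purely illustrative and assumes nothing about what check-plagiarism.com actually does; the two passages are invented for the example:

    import re

    # Minimal sketch of the kind of word n-gram overlap a "similarity checker"
    # might compute. Illustrative only; not the algorithm of any particular tool.

    def ngrams(text, n=5):
        """Lowercase the text, keep word characters, return the set of word n-grams."""
        words = re.findall(r"[a-z0-9']+", text.lower())
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def overlap(document, source, n=5):
        """Fraction of the document's n-grams that also occur in the source."""
        doc_grams, src_grams = ngrams(document, n), ngrams(source, n)
        return len(doc_grams & src_grams) / len(doc_grams) if doc_grams else 0.0

    # Hypothetical passages, made up just to exercise the functions:
    thesis_passage = "The blockchain records every transaction in an immutable public ledger."
    candidate_source = "Every transaction is recorded in an immutable public ledger on the blockchain."
    print(f"overlap: {overlap(thesis_passage, candidate_source, n=3):.0%}")

Even a tool that computes something like this correctly is only measuring textual overlap; deciding whether any overlap is quotation, common phrasing, or actual plagiarism still requires a human reading the text in context.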

5
  • 24
    I use the term "similarity checker" since similarity is what they are actually supposed to check for.
    – kaya3
    Commented Jun 3, 2022 at 14:36
  • @kaya3: Indeed, this term makes much more sense. Commented Jun 3, 2022 at 14:51
  • 3
    I think you're being a bit harsh re: plagiarism detectors. They work for what they're designed to do, quickly comparing a large set of submissions against a much larger database of possible sources that may have been plagiarized, which can't be done by hand, and producing a ranked list of submissions worth reviewing. Commented Jun 4, 2022 at 13:39
@NicoleHamilton: Indeed, I agree with what you wrote (both in your answer and in your comment). I don't think, though, that I am harsh on those tools - I'm just harsh on misnaming those tools and on nonsensical output messages such as "33% plagiarized content" (which I translated from the picture in the question, where the phrase is stated in German). Commented Jun 5, 2022 at 9:45
(But admittedly I might be a bit biased concerning those things, since I have already had too many bad experiences in my (not yet particularly long) life with people who misused various software tools to do things that those tools were never designed to do - with corresponding consequences for the results...) Commented Jun 5, 2022 at 9:46
29

Jochen Glueck's answer provides a good description of the general problem with "plagiarism checkers."

Looking at this particular one, clearly its statistical analysis (or whatever it's doing) is failing pretty badly. In tests on my work it's often not able to identify my original sources (from which there may well be a few phrases I did take verbatim), it suggests I've plagiarised from sources I've never seen, it claims that trivially short phrases and even single words are plagiarised, and, perhaps most astonishingly, it claims that large portions of my text are unique when it's comparing against a copy of that same document.

To see how this particular plagiarism checker works, I took a couple of items from my personal technical notes and ran them through it. I picked this item on the technical details of video standards (dropping the last part to get it under 2000 characters) and this item on the design of the CMake language, because these notes happen to have taken a particularly large amount of work to write up properly. (The first one because it's a complex topic where I had to use a lot of different sources, and the second one because CMake documentation is just terrible at explaining the core principles of the language design.) I am quite sure that these notes bear little resemblance to the original sources, because I wouldn't have had to put so much work into them otherwise. (I am not particularly concerned about plagiarism in the notes I take for personal use in any event, but it's unlikely there would be much anyway, because if a source is already in the form I need, I would just link to it rather than re-typing it.)

These didn't come out as high as your example did, 16% and 22% respectively, but some characteristics of the output are interesting to note.

In the first example it identified 22 "plagiarised" bits. Almost all of them are from pages that were not sources I used. I checked only the first few before it started to seem rather pointless, because, without exception, in the ones I checked the text given in the plagiarism match was from my document and there was nothing resembling that text in the source to which they linked. (In the case of a link to a BBC page, the source had not only no technical information about video standards, but almost no text at all.) Several of the matches were for a single word and, in a particularly hilarious example, the word "this" from the Merriam Webster thesaurus:

Merriam Webster thesaurus: "this"

(There was more than one of those, actually.)

In the second example it correctly identified "plagiarism" from my on-line copy of the document linked above. (Given that it's from the same repo on GitHub as the first one, I have no idea why it failed to find the copy of the first one.) But somehow it decided that, though it was comparing against exactly the same document (albeit formatted slightly differently), the copy I'd given it was still "78% Unique Content."

For this one I couldn't even be bothered to chase down the sources given by the checker, but I am pleased to note that the letter "g" followed by a period is apparently quite original to me:

"g."

(I am guessing this is from one of the five uses of "e.g." I made in the text.)

In short, this "plagiarism checker" seems pretty rubbish.

7
  • 8
    Do we know that this is even a legitimate plagiarism checker and not a bit of malware?
    – shoover
    Commented Jun 3, 2022 at 3:22
  • @shoover They do have a paid tier that allows you to check documents greater than 2000 characters, so it seems that there's some attempt to make the plagiarism checking part of the site a legitimate business. (That of course doesn't mean that it doesn't also engage in distributing malware or other unsavoury activities.)
    – cjs
    Commented Jun 3, 2022 at 3:27
  • 2
    @cjs "They do have a paid tier [...]" that is a huge sign of being malware: not only they may affect your computer, they even get access to your creditcards, paypal account or the likes!
    – EarlGrey
    Commented Jun 3, 2022 at 10:13
  • 9
    @cjs It seems that all this "plagiarism checker" does is ingest your paper and flag some random sentences in it to be exact duplicates of nonexistent sentences in articles at random links. Maybe its purpose is to accumulate papers for a Chegg-like entity to sell?
    – shoover
    Commented Jun 3, 2022 at 14:36
  • 2
    @EarlGrey Having a credit card merchant account (or working with a service provider that does) seems to me a sign of not being malware; it gives them a contractual connection to the banking system and puts them under a lot more scrutiny via things like PCI audits, for example. It does not give access to things like PayPal account information.
    – cjs
    Commented Jun 4, 2022 at 5:30
17

No, plagiarism detectors are not accurate. They produce lots of false positives. All they can do is identify candidates for closer inspection and the bar is usually pretty low on what it takes to become a candidate. From there, it takes a human to carefully review the evidence (the matches) that were found and apply experience and judgement to decide if it's compelling.

They work well for what they're designed to do, quickly comparing a set of submissions against a vast database of possible sources that may have been plagiarized and producing a ranked list of submissions worth reviewing.

We used the MOSS (Measure Of Software Similarity) system hosted at Stanford to do cheat-checking for one of our large intro CS classes at Michigan, each time comparing 1000+ new submissions against each other and against our archive of roughly 10K prior submissions. This would have been impossible by hand. It scored and ranked the results and did a good job of identifying matching sections. For example, it wasn't fooled by common obfuscation, like variable renaming, simple expression rewriting, statement reordering, rewriting a for as a while, changes to comments, and so on. It really is very good at this.
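
As a toy illustration of why renaming variables or reformatting doesn't fool a token-based checker, here is a short Python sketch. This is not MOSS's actual algorithm (MOSS additionally fingerprints the normalized token stream and tolerates reordering), and the two snippets being compared are made up for the example:

    import io
    import keyword
    import tokenize

    # Toy illustration: identifiers, numbers and strings are normalized to
    # placeholders, and comments and whitespace are dropped, so renaming
    # variables or reformatting leaves the token sequence unchanged.

    def normalized_tokens(source):
        out = []
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            if tok.type == tokenize.NAME:
                out.append(tok.string if keyword.iskeyword(tok.string) else "ID")
            elif tok.type == tokenize.NUMBER:
                out.append("NUM")
            elif tok.type == tokenize.STRING:
                out.append("STR")
            elif tok.type == tokenize.OP:
                out.append(tok.string)
        return out

    original  = "for i in range(10):\n    total = total + i  # sum them up\n"
    disguised = "for counter in range(0xA):\n    acc = acc + counter\n"

    print(normalized_tokens(original) == normalized_tokens(disguised))  # prints True

On the normalized token streams the two "different-looking" submissions are identical, and that is the kind of signal such a tool scores and ranks across a whole class of submissions.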

If it flagged 30 submissions, we might commonly write up six to perhaps a dozen after a careful discussion with my cheat-checking team. We often agreed the rest were a little suspicious, but there just wasn't enough evidence to support a case. By these numbers, you might argue we saw a false positive rate of 40% to 80%. But as a practical matter, what we also experienced was that the cases we reported were usually the ones MOSS ranked highest.

While plagiarism detectors work well at comparing a large set of submissions against another far-larger set of possible sources of plagiarism to find possible matches, they do not work well on an individual submission except to produce a list of the closest matches it found, which may not be good matches at all.

The ranking score for a single result is completely meaningless. Ranking scores are only helpful for sorting a list of suspects (or similarly, search engine results) because rank values only have to obey ordering, not linearity. (A score of .8 may be better than .4, but it's probably not twice as good.) You need the context of other submissions it scored higher or lower to know what a score means.

In your case, you've reviewed the output and decided there's no real match. Good for you. It was a false positive, which is no surprise because these tools generate lots of false positives. (And presumably, you already knew you didn't plagiarize anyway.)

If you expect a plagiarism detector to be right most of the time, you're going to be wrong most of the time.

4
  • 7
    Checking for similarities between computer programming assignments in a single course is much more likely to produce valid matches than the OP's use case and still produces mostly false positives. Commented Jun 3, 2022 at 12:05
  • 1
    I'm planning to add "code similarity detection" to the system we currently use for preparing student submissions for programming exercises, but the way I was intending to use it was to decide the order in which I marked them, so that if there was plagiarism it would hopefully be spotted as I marked. Commented Jun 4, 2022 at 16:07
@DikranMarsupial Sure, you could do that. But be aware you won't get a full ranking. You'll only get the top suspects ranked. Also, in reviewing a suspect, it's helpful to have a fresh idea of what most students do with the assignment, especially since there's always a lot of similarity given the problem constraints and any starter code. Commented Jun 4, 2022 at 17:55
  • 1
    @NicoleHamilton yes. At the moment the way I deal with it is that I think "mmm, that bit of code is both odd and familiar" and then go through the ones I have already marked to find it. It would just make it easier to do if the more similar solutions were closer together in the pile. My plan was to look at pairwise similarities, rather than ranking by amount of similar code. I don't really trust the rankings, just want to make manual plagiarism detection more efficient. Commented Jun 4, 2022 at 18:01
3

Some similarity checkers are moderately good at finding text that is similar between two documents. If a reputable checker says two sentences are the same, it is almost certain that they actually are. Producing such a tool, with a sufficiently comprehensive database of sources to check against, is also very expensive. This is why universities and colleges pay very large sums for access to software like Turnitin. It's unsurprising that a free tool on the internet cannot do such a good job.

There is a reason that universities often don't allow students access to these tools (although this is not universal): doing so helps them to plagiarize. Reading a source and then rephrasing it sufficiently to pass a plagiarism check is still plagiarism; it's just plagiarism that can't be detected as easily. Plagiarism is using other people's thoughts without attribution, not just using their words.
