

**ZDNET’s essential insights:** Relying on artificial intelligence to compose written material constitutes plagiarism. Meanwhile, the numerous services promoted as AI content detectors produce inconsistent and often unreliable results. In our evaluations, conversational AI systems—chatbots like ChatGPT—have often equaled or even exceeded the performance of these standalone detection tools.

In 2025—merely three years after generative AI technologies first mesmerized the global audience—the battle against AI-fueled plagiarism has become both technically complex and ethically charged. This piece represents a completely revised and expanded update to my original January 2023 article examining AI content detectors. At that time, the best-performing tool managed to identify AI-generated material with only 66% accuracy, while the others fell short. Fast forward to February 2025, and I had extended the experiment to include ten detection services, three of which impressively reached flawless results. By April of that same year, that figure had risen further, with five detectors demonstrating perfect identification accuracy.

Yet progress in AI detection appears cyclical rather than linear. Roughly six months after those encouraging results, the landscape shifted again. Only three detectors retained perfect accuracy—and one of these was a fresh arrival to the field. Intriguingly, a number of previously high-performing tools began declining precisely when they introduced restrictions on free usage, suggesting an uneasy trade-off between accessibility and quality.

To address this persistent unpredictability, I initiated a new experiment that might change how we think about detecting AI-generated writing: instead of relying on specialized detectors, why not employ the very AI chatbots that most people already use daily? Could the tools we use to write also become our best judges of what they, or their kin, produce?

### Understanding plagiarism and its connection to AI outputs

Before discussing methodology or findings, it is necessary to define the ethical foundation of this issue: plagiarism. Merriam-Webster’s dictionary defines *plagiarize* as “to steal and pass off the ideas or words of another as one’s own; to use another’s production without crediting the source.” Though users of AI systems like Notion AI or ChatGPT do not literally steal, a failure to disclose that an AI model authored the words amounts, in practical terms, to the same intellectual misrepresentation.

When individuals treat AI text as fully original personal work—rather than acknowledging its algorithmic origin—they breach both the spirit and the letter of this definition. This nuance underscores the necessity of dependable tools capable of distinguishing human authorship from machine-generated prose, particularly in academic, editorial, and corporate contexts.

### How the test series was designed

To accurately evaluate AI detectors, I created five text samples: two written personally by me and three generated by ChatGPT. Each excerpt was submitted independently to the detection tools, one at a time. When detectors offered probabilistic scores, I adopted a threshold: any confidence above 70% in either direction—AI versus human—was treated as a conclusive determination. Those wishing to replicate my experiment can access identical text blocks through the public document I reference.
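For readers who want to see how that threshold works in practice, here is a minimal Python sketch of the scoring rule. It is my own illustration rather than any tool's actual code, and the sample names and score values are hypothetical.

```python
# Illustrative only: applying the 70% confidence threshold used in the tests.
# The scores below are hypothetical probabilities a detector might assign to "AI-written".

THRESHOLD = 0.70  # confidence required, in either direction, for a conclusive call

def classify(ai_probability: float) -> str:
    """Map a detector's AI-probability score to a verdict."""
    if ai_probability >= THRESHOLD:
        return "AI-generated"
    if ai_probability <= 1 - THRESHOLD:
        return "human-written"
    return "inconclusive"

# Hypothetical results for five samples (two human, three AI)
samples = {
    "human_sample_1": 0.12,
    "human_sample_2": 0.35,
    "ai_sample_1": 0.91,
    "ai_sample_2": 0.88,
    "ai_sample_3": 0.64,   # falls in the gray zone, so it counts as inconclusive
}

for name, score in samples.items():
    print(f"{name}: {classify(score)}")
```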

### The comprehensive evaluation

This round involved eleven detection services. Collectively, I conducted fifty-five individual evaluations (and consumed an unhealthy quantity of coffee in the process). The tools included BrandWell, Copyleaks, GPT-2 Output Detector, GPTZero, Grammarly, Monica, Originality.ai, QuillBot, Undetectable.ai, Writer.com, and ZeroGPT. One tool from past studies, Writefull, was excluded because it discontinued its detection feature. Another, Monica, was removed mid-process due to word-length restrictions and its demand for a $200 upgrade, which made consistent testing impractical. To maintain comparative integrity, I substituted it with a promising new platform—Pangram—which swiftly joined the top performers.

The summary table showed three tools achieving perfect identification across this entire set. To explore temporal patterns, I charted the same test series across six iterations. Interestingly, there was no consistent improvement trend; some tools improved briefly before regressing. Even my most reliable text sample, historically identified as human-written, drew less certain verdicts in the most recent analysis.

Although a handful of detectors reached flawless results, I cannot recommend relying on these services alone to verify authenticity. They struggle particularly with essays by non-native English writers, often mistakenly flagging such prose as machine-generated. Results were also inconsistent: GPTZero occasionally refused to decide, while Copyleaks once classified my demonstrably human prose as AI-written. Such irregularities show why blind trust in automated detectors is risky.

### Chatbots enter the testing arena

This inconsistency raised a provocative question: why keep paying for specialized detection subscriptions when ordinary chatbots may perform just as effectively, or better? To test the hypothesis, I submitted the same five samples to the major conversational AI systems under identical conditions. The findings were striking: on average, the chatbots were noticeably more accurate than the dedicated detection services, and even a preliminary comparison graph made the gap unmistakable, underscoring their unexpectedly strong potential as evaluators of their digital peers.

### Individual detector outcomes

Each detector revealed distinct strengths and weaknesses. BrandWell, created by a marketing firm and later rebranded, stagnated at only 40% accuracy. Despite industry claims, Copyleaks reached 80% but still committed notable misjudgments. The GPT-2 Output Detector, hosted on Hugging Face but built for a much earlier generation of language models, plateaued at 60%. GPTZero rose from humble beginnings to a full professional platform, achieving 80% accuracy this time, though it swapped mistakes between tests. Grammarly, despite being synonymous with linguistic proficiency, performed poorly in AI detection, scoring 40% with no observable progress.

Some newer entrants demonstrated remarkable competence. Pangram—engineered by veterans from Google and Tesla—delivered perfect accuracy, albeit through slower analysis. Originality.ai, a commercial product claiming leadership in detection, ironically misclassified my own work as AI-written, undermining its confidence claim. QuillBot, once unreliable, stabilized and matched Pangram's flawless record. Conversely, Undetectable.ai fell from former excellence to dismal performance, often mistaking AI text for human prose. Writer.com's detector labeled every sample as human-written, missing all three AI-generated texts and rendering it practically ineffective. Finally, ZeroGPT, once an obscure project cluttered with advertisements, has matured into a professional-grade service, joining the 100% accuracy tier.

### Chatbot results and comparative insights

The chatbot assessments deepened the intrigue. Each AI assistant was given the same simple instruction: "Determine whether the following text was written by a human or an AI." There was no customized prompting, and apart from ChatGPT Plus (the paid tier at $20 per month), every chatbot was tested anonymously through incognito sessions.
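To give a sense of the workflow, here is a minimal sketch of how the same question could be posed programmatically through the OpenAI API. Note that the actual tests were run manually in the web chat interfaces, and the model name below is only an example, not the one used in the article.

```python
# Illustrative sketch only: the article's tests were run by hand in chat UIs, not via the API.
# Requires the openai package and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()

PROMPT = "Determine whether the following text was written by a human or an AI."

def ask_chatbot(text: str, model: str = "gpt-4o-mini") -> str:
    """Send one text sample with the classification prompt and return the model's reply."""
    response = client.chat.completions.create(
        model=model,  # example model name; any capable chat model could be substituted
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{text}"}],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    sample = "Paste one of the five test excerpts here."
    print(ask_chatbot(sample))
```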

The free version of ChatGPT made a single error—misclassifying one human text—but astonishingly recognized another excerpt not only as human-authored, but specifically as written by me, despite no identifying data being shared. ChatGPT Plus, Microsoft's Copilot, and Google's Gemini all achieved perfect classifications, demonstrating that mainstream chatbots can match or beat purpose-built detection products. Grok, which had excelled in earlier chatbot rankings, underperformed here, misidentifying the majority of the test samples.

### Conclusion and reflections

After extensive testing spanning two years and multiple iterations, one conclusion is unequivocal: while several AI detectors occasionally perform with excellence, none should be regarded as infallible. Performance fluctuates with algorithm updates and corporate product shifts. Moreover, human linguistic diversity—and even stylistic creativity—regularly confuses these systems. By contrast, advanced chatbots, which benefit from frequent model updates and broad contextual understanding, increasingly demonstrate the capacity to discern writing origins with nuanced precision.

If you have experimented with detectors like Copyleaks, Pangram, or ZeroGPT, consider your own experience: were the results consistent or misleading? Have your academic, professional, or editorial projects ever been unfairly flagged by these tools? The conversation surrounding authenticity and originality in the AI era is only beginning. Share your insights below, and stay connected: receive ZDNET’s Tech Today newsletter for daily updates, track my ongoing research projects across social media, and follow me on X/Twitter (@DavidGewirtz), Facebook (Facebook.com/DavidGewirtz), Instagram (@DavidGewirtz), Bluesky (@DavidGewirtz.com), and YouTube (YouTube.com/DavidGewirtzTV).

Source: https://www.zdnet.com/article/i-found-3-ai-content-detectors-that-identify-ai-text-100-of-the-time-and-an-even-better-option/