For years, debates about whether generative AI is “creative” have been fuelled by anecdotes: a striking image here, a clever turn of phrase there, and the occasional hallucinated fact in between. The new wrinkle is scale. A large study reported this month compared outputs from large language models with responses from more than 100,000 human participants on standard creativity tasks, creating one of the biggest head‑to‑head datasets yet for judging idea generation (ScienceDaily report).
According to the accompanying write‑up, the researchers didn’t rely on a single novelty stunt. They used established “divergent thinking” tasks—tests designed to measure how many varied ideas a person can produce and how original those ideas are—then ran comparable prompts through AI systems. The core claim is attention‑grabbing: on several measures, AI systems matched or exceeded average human performance in producing original ideas, though the authors also caution that creativity is multi‑dimensional and context dependent.
That nuance matters. Creativity research can be prone to over‑interpretation because it sits at the intersection of culture, psychology and value judgement. However, large samples and clear scoring rules can help move the conversation from “I like it” to “here’s what was measured”.
What the tests actually asked people (and AI) to do
The study leans on long‑running creativity paradigms, most notably the Alternative Uses Task (AUT), which asks for as many novel uses as possible for a mundane object (for example: a brick, paperclip or shoe). Psychologists use it because it is simple to administer, easy to scale online, and tends to separate “fluent” ideation (lots of answers) from “original” ideation (uncommon answers). The American Psychological Association summarises the task and its place in divergent thinking research in accessible terms.
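To make the fluency/originality split concrete, here is a minimal scoring sketch. It is not the study’s rubric: the answer pool and the frequency‑based originality measure are illustrative assumptions, standing in for whatever scoring scheme the researchers actually used.

```python
# Illustrative scoring sketch, not the study's rubric: fluency counts distinct
# ideas, while originality rewards answers that are rare in a wider answer pool.
from collections import Counter

# Hypothetical pool of answers from many participants to "uses for a brick"
pool = [
    "doorstop", "paperweight", "doorstop", "build a wall", "heat battery",
    "doorstop", "paperweight", "garden edging", "build a wall", "bookend",
]
idea_counts = Counter(pool)
total_answers = len(pool)

def score_participant(ideas: list[str]) -> tuple[int, float]:
    """Return (fluency, mean originality) for one participant's answers."""
    distinct = set(ideas)
    fluency = len(distinct)
    # Originality per idea: 1 minus its relative frequency in the pool,
    # so rarer answers score closer to 1.
    originality = [1 - idea_counts[idea] / total_answers for idea in distinct]
    return fluency, sum(originality) / len(originality)

print(score_participant(["doorstop", "heat battery", "bookend"]))
```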
The researchers also used related prompts (often called “consequences” tasks) that ask participants to imagine outcomes of unusual scenarios—another way to elicit a spread of ideas rather than a single correct response. Crucially, the report indicates AI models were prompted under comparable constraints (for example, producing a set number of ideas within a defined format), so the comparison was not simply “humans on a stopwatch versus AI with unlimited drafting time”.
The study’s online administration matters too: when more than a hundred thousand people take part, you can begin to map creativity performance across a broad ability range rather than relying on small psychology cohorts. It also means the study can examine distributions: does AI only look strong against the median, or can it approach the performance of top human outliers? Those details sit in the primary article and methods material rather than the headlines.
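As a rough illustration of why distributions matter, the sketch below places a batch of AI scores within a large human baseline rather than comparing averages alone. The score arrays are hypothetical stand‑ins, not figures from the study.

```python
# Illustrative sketch (not the study's analysis code): locating AI scores
# within a large human baseline distribution, rather than comparing means alone.
import numpy as np

rng = np.random.default_rng(0)
human_scores = rng.normal(loc=50, scale=10, size=100_000)  # hypothetical human originality scores
ai_scores = rng.normal(loc=58, scale=6, size=500)          # hypothetical AI originality scores

# Percentile of the human distribution that the median AI response reaches
median_ai = np.median(ai_scores)
percentile = (human_scores < median_ai).mean() * 100
print(f"Median AI response sits at the {percentile:.1f}th percentile of human scores")

# How AI compares with top human outliers (here, the 99th percentile)
top_human = np.percentile(human_scores, 99)
share_above = (ai_scores > top_human).mean() * 100
print(f"{share_above:.1f}% of AI responses exceed the 99th-percentile human score")
```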
Scoring originality: from human judgements to “semantic distance”
The hardest part of creativity research isn’t collecting ideas—it’s scoring them without turning the whole exercise into a popularity contest. Traditionally, researchers have used trained raters to judge novelty and usefulness, but that can be slow, expensive and sensitive to rater bias. At large scale, many teams blend human ratings with computational metrics.
In this study, the authors report using quantitative measures designed to estimate how “far” an idea is from typical associations—often described as semantic distance—alongside human evaluation procedures. Semantic distance approaches aim to capture the intuition that “use a brick as a doorstop” is close to common usage, while “use a brick as a heat battery in an improvised thermal mass system” is conceptually further away (even if you would still want to check whether it is practical or safe). The technical details, including prompts and scoring decisions, are laid out in supplementary materials.
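For readers unfamiliar with the idea, a minimal semantic‑distance sketch might look like the following. The embedding model and the example texts are assumptions chosen for illustration; the authors’ actual pipeline and scoring decisions sit in their supplementary materials.

```python
# A minimal semantic-distance sketch, not the authors' pipeline. The embedding
# model ("all-MiniLM-L6-v2") and the prompt/response texts are illustrative
# assumptions, not details taken from the study.
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model, for illustration only

prompt = "uses for a brick"
responses = [
    "use a brick as a doorstop",
    "use a brick as a heat battery in an improvised thermal mass system",
]

# Embed the prompt and responses, then score each response by its cosine
# distance from the prompt: a larger distance suggests it is further from
# typical associations.
vectors = model.encode([prompt] + responses, normalize_embeddings=True)
prompt_vec, response_vecs = vectors[0], vectors[1:]

for text, vec in zip(responses, response_vecs):
    distance = 1.0 - float(np.dot(prompt_vec, vec))  # cosine distance on unit vectors
    print(f"{distance:.3f}  {text}")
```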
That’s where the “AI versus humans” story gets complicated. AI systems are, in effect, prediction engines trained on vast text corpora. When asked for unusual uses, they can recombine patterns from many domains and rapidly generate long lists. Computational metrics may reward that breadth—particularly when ideas are phrased in ways that appear less common in everyday language. Critics of semantic distance scoring argue that it can sometimes mistake “rare phrasing” for true originality, or reward technically elaborate nonsense. Supporters counter that, when paired with human checks, it provides a consistent way to compare huge numbers of responses.
The authors’ own caveats—reported in the releases—suggest they’re aware of this tension: originality is not the same as value, and “creative” output in a lab task does not automatically translate into creative work in a professional or artistic setting.
Where AI looked strong—and where the comparison gets slippery
In divergent thinking tasks, AI’s advantages are unsurprising: speed, stamina and breadth of exposure. A model can list 20 plausible alternatives in seconds, without the social inhibition or fatigue that can narrow human brainstorming. In that sense, the results may reflect what some managers already report seeing in practice: AI can act like an endlessly patient ideation partner, providing a first pass that humans then refine.
However, the comparison becomes slippery the moment we ask what “creativity” is for. Divergent thinking tasks measure one component: the ability to generate varied possibilities. They generally don’t measure whether an idea is emotionally resonant, ethically sound, culturally appropriate, or viable under real‑world constraints. They also don’t measure long‑horizon creativity—the kind that requires sticking with a problem, gathering new evidence, and revising a concept over weeks or months.
There’s also the question of training data leakage and implicit borrowing. If an AI suggests “use a brick as garden edging”, is that creative or merely common knowledge reproduced at scale? The study’s methods aim to compare outputs fairly, but no benchmark can perfectly separate recombination from invention. Some philosophers and cognitive scientists argue that human creativity also involves recombination—just with embodied experience, personal goals and social context layered on top. The point isn’t that one side is “really creative” and the other is “fake”, but that the mechanisms differ.
A related issue is human instruction. People interpret prompts through their own experiences and motivations. AI interprets prompts through statistical associations plus alignment constraints. That can lead to different kinds of “safe” creativity—models may avoid taboo or risky suggestions, while humans may be bolder (or, at times, more reckless). Depending on the scoring rubric, that difference could raise or lower apparent originality.
What this means for classrooms, workplaces and the next benchmarks
If you’re teaching or hiring, the most practical takeaway is not “AI is now more creative than humans”. It’s that idea generation is becoming cheaper, and that shifts what we should value. When a tool can produce dozens of concept sketches, marketing taglines or product variations quickly, the bottleneck can move to selecting, testing and improving the right ideas.
In education, that may push assessment away from “produce ten uses for an object” and towards tasks that require justification, iteration and reflection—explaining why an idea is appropriate, what trade‑offs it involves, and how it would be evaluated. The Alternative Uses Task is still useful for research, but it is not a full proxy for creative competence in a world where students can outsource fluency to a chatbot.
In workplaces, the implication is similar: teams may increasingly treat AI as a brainstorming junior—useful, tireless, and occasionally brilliant, but needing direction and review. The study’s scale helps quantify that intuition and gives organisations a more empirical basis for deciding where AI adds value, and where it mostly adds volume.
Finally, for researchers, the benchmark itself may prove a bigger contribution than any single headline result. With tens of thousands of human baselines and standardised prompts, future studies can test which model features (training data, prompting strategies, tool use) improve not just fluency but judged originality and usefulness. The publication and supporting materials provide a reference point for those follow‑ups.
A measured conclusion: competition, yes—replacement, not so fast
The most defensible reading of this new head‑to‑head is that AI is now a strong performer on a particular slice of creativity: fast, flexible ideation as measured by divergent thinking tasks. That’s significant, and the dataset’s scale makes it harder to dismiss as a parlour trick.
At the same time, creativity in the real world is more than generating unusual responses to a prompt. It includes taste, intent, social meaning, and the messy work of turning a novel thought into something that matters. The study’s authors and the releases reporting it caution against over‑generalising from lab‑style tasks to all creative domains.
Originality may now have competition—and, yes, more spreadsheets than usual—but the human role appears to be shifting rather than vanishing: from being the sole generator of options to being the editor, critic, and decision‑maker who can turn options into outcomes.
