Artificial-intelligence image generators are trained on vast, intricately structured libraries of visual data, millions of images that let these models produce novel combinations of forms, colors, and textures. Yet recent research indicates that when such systems generate pictures from a continuously evolving chain of prompts rather than from isolated instructions, they consistently revert to a remarkably small set of familiar motifs. The result is artwork that often feels generic and repetitive, revealing that even a model trained on immense variety drifts toward stylistic uniformity when asked to sustain coherence across iterative transformations.
A study recently published in the journal *Patterns* examined this phenomenon through an experimental exercise reminiscent of the childhood game of telephone, but translated into visual language. Researchers selected two powerful AI image-generation and interpretation models—Stable Diffusion XL and LLaVA—and set up a looping process designed to test how faithfully information could travel between them. In this setup, Stable Diffusion XL was first given a carefully written, imagery-rich prompt—for instance, a description of a solitary figure discovering an ancient eight-page book hidden in the wilderness, inscribed in a forgotten tongue. After generating a corresponding picture, that image was passed along to the LLaVA model, which was instructed to produce a detailed textual description of what it perceived. That textual output was in turn fed back into Stable Diffusion XL, prompting the system to create a new image based on LLaVA’s description. This dialogue between image and text continued in sequence for one hundred rounds, creating a chain of visual reinterpretations.
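The article does not reproduce the paper's exact harness, but the loop itself is easy to express in code. The sketch below, built on the Hugging Face diffusers and transformers libraries, alternates Stable Diffusion XL generation with LLaVA captioning; the checkpoints, chat template, seed-prompt wording, and sampling settings are illustrative assumptions rather than the study's configuration.

```python
# Sketch of the image-to-text-to-image "telephone" loop described above.
# Model checkpoints, the chat template, and sampling settings are
# illustrative assumptions, not the study's exact configuration.
import torch
from diffusers import StableDiffusionXLPipeline
from transformers import AutoProcessor, LlavaForConditionalGeneration

device = "cuda"

# Image generator: Stable Diffusion XL (text -> image).
sdxl = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to(device)

# Image describer: LLaVA (image -> text).
processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
llava = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16
).to(device)

def describe(image):
    """Ask LLaVA for a detailed caption of the current image."""
    chat = "USER: <image>\nDescribe this image in detail. ASSISTANT:"
    inputs = processor(text=chat, images=image,
                       return_tensors="pt").to(device, torch.float16)
    out = llava.generate(**inputs, max_new_tokens=200)
    text = processor.decode(out[0], skip_special_tokens=True)
    return text.split("ASSISTANT:")[-1].strip()

# A stand-in for the paper's imagery-rich seed prompt.
prompt = ("A lone figure in the wilderness discovers an ancient "
          "eight-page book written in a forgotten language.")
history = []
for round_idx in range(100):        # one hundred rounds, as in the study
    image = sdxl(prompt).images[0]  # text -> image
    prompt = describe(image)        # image -> text, fed back as the next prompt
    history.append((image, prompt))
```

Because each round hands one model's output to the other with no memory of the original prompt, any drift in the description compounds in the next image.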
As in the classic human version of the game, the original meaning, structure, and atmosphere of the initial prompt degraded rapidly. The image’s distinctive qualities dissolved as small descriptive changes accumulated over iterations, much like in those online time-lapse experiments where an AI is asked to reproduce the same picture repeatedly and inevitably drifts into abstraction or blandness. Yet what truly astonished the researchers was not simply the loss of fidelity, but the model’s tendency to collapse into one of a small collection of stylistic clichés. Out of a thousand such telephone sequences, the majority converged into just twelve dominant aesthetic themes—each one a polished yet ordinary visual template.
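How might one identify those twelve themes computationally? The study's own procedure isn't detailed here, but a plausible approach is to embed each chain's final image with CLIP and cluster the embeddings; the checkpoint and the use of k-means with k=12 in the sketch below are assumptions for illustration.

```python
# Hypothetical analysis sketch: embed each chain's final image with CLIP and
# cluster the embeddings to surface dominant visual themes. The CLIP
# checkpoint and the choice of k-means with k=12 are assumptions.
import torch
from sklearn.cluster import KMeans
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(images):
    """Return unit-normalized CLIP embeddings for a list of PIL images."""
    inputs = proc(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    return (feats / feats.norm(dim=-1, keepdim=True)).numpy()

# final_images: the last image from each of the 1,000 telephone chains.
labels = KMeans(n_clusters=12, n_init="auto").fit_predict(embed(final_images))
# Counting the labels shows how many chains collapsed into each theme.
```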
Though the transformation sometimes unfolded gradually, with images subtly morphing from specific to indistinct, there were also cases of abrupt stylistic shifts where the output suddenly took on an entirely different tone or color palette. In nearly every sequence, however, a homogenized look ultimately prevailed. The researchers expressed disappointment at the monotonous predictability of these results, coining the memorable phrase “visual elevator music” to describe the effect—a nod to the kind of unremarkable art one might find hanging in a corporate hallway or hotel lobby. Among the most frequently recurring motifs were maritime scenes dominated by lighthouses and seascapes, elegant yet impersonal interiors, nocturnal cityscapes drenched in artificial light, and rustic or weathered architectural forms.
Even when the experimenters substituted alternative generation and description models for either phase of the telephone process, the overall tendency remained consistent. Extending the game to 1,000 rounds revealed that, although stylistic convergence typically occurred around the hundredth iteration, further rounds did not restore diversity; the deviations that appeared later still originated from the same small group of overrepresented visual frameworks. Whatever the technical adjustments, the gravitational pull of these common motifs proved inescapable.
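A simple way to observe when a chain stabilizes, consistent with the reported convergence near round 100, is to track how similar each image is to its predecessor. The diagnostic below is an illustrative assumption, not the metric the researchers used.

```python
# Illustrative convergence diagnostic, not the authors' metric: cosine
# similarity between CLIP embeddings of consecutive images in one chain.
import numpy as np

def drift_curve(chain_images):
    """Similarity of each image to its predecessor; near 1.0 means settled."""
    feats = embed(chain_images)  # reuses embed() from the clustering sketch
    return np.sum(feats[1:] * feats[:-1], axis=1)  # rows are unit vectors

# A curve that flattens near 1.0 around round 100 would match the reported
# convergence point; later rounds merely orbit the same dominant styles.
```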
The implications of these findings are significant for understanding both the strengths and limitations of current generative AI. In a human game of telephone, errors and reinterpretations accumulate because each participant perceives and retells the message through the lens of personal bias, imagination, and imperfect memory, leading to unpredictable—and often humorous—variation. Artificial intelligence, by contrast, suffers the inverse limitation. Its outputs are shaped by statistical likelihood and optimization, not by creativity or intuition. Consequently, no matter how eccentric or unconventional the starting prompt, the model gravitates toward a narrow range of styles that it has internalized most confidently from its training data.
This raises important questions about human influence and aesthetic bias within the enormous training sets that feed these networks. Since the datasets are composed primarily of human-generated images, the model’s repetitive tendencies may reflect collective preferences for certain photographic subjects or compositional archetypes—those scenes and moods that people historically found visually compelling enough to capture and share. The researchers suggest a sobering conclusion: while algorithms can expertly mimic stylistic patterns, they lack the intangible discernment that constitutes taste. The study thus highlights a key limitation of machine creativity—imitating form is relatively easy, but cultivating genuine aesthetic judgment remains an exclusively human endeavor.
Source: https://gizmodo.com/ai-image-generators-default-to-the-same-12-photo-styles-study-finds-2000702012