Generative AI can reproduce an image when trained on as few as 200 copies

AI trained on as few as 200 images can produce passable imitations of popular artworks, according to a new study published on Cornell University’s preprint server arXiv, highlighting just how easy it can be for AI systems to mimic copyrighted work.

“Some people are surprised that it’s such a low number, and some people are also surprised that it’s a high number,” says Sahil Verma, lead author of the study and a computer science PhD at the University of Washington. Verma and his colleagues analyzed three versions of the Stable Diffusion model and the extent to which each could produce images that would be considered imitations of originals. The so-called imitation threshold was calculated algorithmically, based on whether a computer system recognized an image as imitative. The researchers also cross-checked the computerized results against human judgments and found a strong correlation between the two.

The number of images an AI model needs in its training data varies depending on the system, but falls between 200 and 600. It also depends on what the AI is trying to depict: those looking to mimic the brushstrokes of Vincent van Gogh might need as few as 112 images, while human faces can be replicated using as few as 234 images.

The research stemmed from Verma’s desire to learn more about the interpretability of machine learning systems: how models work, and how they reach their decisions. But along the way, the researchers came to more alarming conclusions. “As we were working on it, we realized this has a huge implication for the privacy and copyright stuff,” Verma says.

“We already knew that AI can sometimes memorize or imitate works in the training data,” says Andres Guadamuz, who researches intellectual property law at the University of Sussex, and was not involved in the study. “This is setting a threshold, which may be interesting for a machine learning perspective, but I don’t see it having legal implications, at least not for now.”

Part of the reason for that is that imitation isn’t a legal term, just like memorization isn’t. “Copyright infringement happens when a work has been substantially reproduced, and the term ‘substantial’ is qualitative, not quantitative, so something could resemble a character or a person in the training data without infringing,” Guadamuz says.

However, Guadamuz believes the benchmark set by the researchers of 200 images at a minimum could help set a threshold for AI companies beyond which their systems should not go, in order to avoid a lawsuit. “This could work for them to set lower standards by which they could ensure that there is likely not going to be any reproduction in the model,” he says.

It’s an approach that Verma also believes AI companies could follow. “In the future, once we have some legal precedents of what has happened, let’s say The New York Times vs. OpenAI case, which is currently still ongoing, these findings would have larger implications on future data collection practices.”
