In a world now dominated by easy access to generative artificial intelligence technology, data is king. But some UW–Madison professors are increasingly concerned about the risks of unauthorized data collection and use.
“Anytime a really new technology is introduced, there’s always a little bit of panic where the usage has to settle down and become normalized,” says Emilee Rader, an associate professor in the Information School. “People are trying to figure out what the legitimate and acceptable uses are for generative AI that align with our values.”
Conversations about the ethical use of personal data were thrown into the spotlight recently, when Hollywood actor Scarlett Johansson experienced the dark and difficult side of this issue. She alleges that OpenAI deliberately imitated her voice from the 2013 movie “Her” in its new conversational AI tool after she declined to lend her voice to the project.
As these technologies continue to evolve, there is no industry standard for how personal data can be collected and used, prompting questions about how average internet users can protect themselves from companies collecting and using their data.
“I think about privacy as the ability to control important aspects of our lives through data,” Rader says. “In some ways, AI has taken that choice away.”
According to Rader, it’s safe to assume that anything posted online has been incorporated into some AI or machine learning model.
Simply put, generative AI technologies such as ChatGPT, Gemini and Jasper identify patterns in data, such as social media posts, Reddit threads and blog posts, and then mimic those patterns to produce responses when prompted. Chatbots will always come back with an answer, regardless of whether it is right or wrong.
“It’s just making something up,” says Kyle Cranmer, a professor of physics who is also affiliated with the Departments of Computer Sciences and Statistics and serves as the director of the UW–Madison Data Science Institute.
But as companies seek to revamp and update the models to make them more accurate and efficient, they need more and more data.
“As you add more data, the results keep getting better,” says Cranmer. “All the big industry players have started throwing more money at data collection. They started scraping information off the Internet, sometimes without regard to what was allowed or not.”
The most valuable data used in AI models consists of human-derived content: social media posts, blogs, images, artwork, audio clips and videos. Cranmer underscores that a “gray area” has emerged as “data hungry” AI companies harvest vast swaths of data to refine their technologies.
"Many people have put text and other data up on the internet, and while it wasn't technically protected by copyright, they also never explicitly gave consent for it to be used this way," Cranmer says.
Dorothea Salo, a distinguished teaching faculty member in the Information School, believes a sense of entitlement ultimately governs companies’ approach to data collection, reflecting what in her view is a disregard for privacy.
“What do these high-tech companies think they are entitled to collect?” Salo asks. “What is it they think they’re entitled to do without consultation, with neither opt in nor opt out? The answer so far seems to be ‘whatever they please.’”
However, these questions about data privacy and company use are not new — nor are some of the technologies in question.
“There have been data abuses for a long time,” Salo says, underscoring that personal and behavioral data have long been particularly vulnerable to misuse by tech companies.
Digital assistants such as Google Home and Amazon’s Alexa, along with facial-recognition technologies, targeted online ads and Netflix recommendations, all fall under the umbrella of artificial intelligence and rely on large amounts of user data.
"AI has never worked so amazingly," Cranmer says.
Today, artificial intelligence has expanded to include technologies that can create art and generate text. But it has also been used extensively in the sciences.
AI can search through data to identify patterns and key findings, making scientific research and experimentation more efficient and accurate. According to Cranmer, it can assist in designing new materials that could sequester carbon from the atmosphere or prove key to a clean energy future.
“There’s a lot of potential for good,” he says. “AI is game changing for societal and scientific advances.”
However, as the technology matures, Cranmer, Rader and Salo each expressed concerns about potential abuses.
"The biggest near term threats are that these models are being used to create deep fakes, sophisticated phishing scams and other types of misinformation," Cranmer says.
The onset of generative AI technology has created a host of new privacy problems, according to Rader, such as false image creation, identity theft and phishing attempts.
“There is a separation between a privacy violation and the actual harms that can come from it,” Rader says.
For Salo, this is ultimately indicative of a decline in the trustworthiness of the web environment as a whole.
“We are seeing a steady degradation of the web as an information environment,” Salo says. “And that distresses me incredibly. I love the web. This would be a terrible way for it to die.”
Regardless, it’s an uphill battle to change perspectives and spur the average person to take action to protect themselves — especially if they haven’t felt a direct harm from a privacy violation.
“It’s hard to get traction,” Salo says. “For the most part, people have not felt any direct harm.”
Rader, Salo and Cranmer all agree that the future of data collection and use in AI technologies will ultimately come down to policymakers. But there are actions tech companies can take to increase transparency while boosting privacy and security today.
“Companies need to be a lot more specific about uses of data,” Rader says. “We also need more research to figure out how we can explain the data uses to users in a way that allows them to make the privacy decision they want to make.”