
How MIT is Teaching AI to Avoid Toxic Mistakes

Researchers at MIT have developed a machine learning technique that improves AI safety testing with a curiosity-driven approach that generates a wider range of toxic prompts, outperforming traditional human red-teaming methods. Photo credit: SciTechDaily.com

MIT’s novel machine-learning methodology for AI safety testing leverages curiosity to trigger a broader range of more effective toxic responses from chatbots, surpassing previous red-teaming efforts.

A user could ask ChatGPT to write a computer program or summarize an article, and the AI chatbot would likely be able to generate useful code or write a compelling summary. However, someone could also ask for instructions on how to build a bomb, and the chatbot could potentially provide those as well.

To prevent this and other safety issues, companies that create large language models typically safeguard them using a process called red-teaming. Teams of human testers write prompts aimed at triggering unsafe or toxic text from the model being tested. These prompts are then used to teach the chatbot to avoid such responses.

However, this only works effectively if engineers know which toxic prompts to use. If human testers miss some prompts, which is likely given the multitude of possibilities, a chatbot that is considered safe may still be able to generate unsafe responses.

Researchers at the Improbable AI Lab at MIT and the MIT-IBM Watson AI Lab used machine learning to improve red-teaming. They developed a technique to train a red-team large language model to automatically generate diverse prompts that trigger a wider range of undesirable responses from the chatbot under test.

They do this by teaching the red-team model to be curious when it writes prompts and to focus on novel prompts that elicit toxic responses from the target model.

The technique outperformed human testers and other machine learning approaches by generating more distinct prompts that elicited increasingly toxic responses. Not only does their method significantly improve the coverage of tested inputs compared to other automated methods, but it can also draw out toxic responses from a chatbot that has safeguards built in by human experts.

“Right now, every large language model has to go through a very long red-teaming phase to ensure its safety. This will not be sustainable if we want to update these models in rapidly changing environments. Our method provides a faster and more effective way to perform this quality assurance,” says Zhang-Wei Hong, an electrical engineering and computer science (EECS) graduate student in the Improbable AI Lab and lead author of a paper on this red-teaming approach.

Hong’s co-authors include EECS graduate students Idan Shenfeld, Tsun-Hsuan Wang, and Yung-Sung Chuang; Aldo Pareja and Akash Srivastava, research scientists at the MIT-IBM Watson AI Lab; James Glass, senior research scientist and leader of the Spoken Language Systems Group in the Computer Science and Artificial Intelligence Laboratory (CSAIL); and senior author Pulkit Agrawal, director of the Improbable AI Lab and assistant professor at CSAIL. The research will be presented at the International Conference on Learning Representations.

Improving red-teaming with machine learning

Large language models, like those that power AI chatbots, are often trained by being presented with massive amounts of text from billions of public websites. Not only can they learn to use toxic words or describe illegal activities, but the models could also reveal personal information they may have collected.

The laborious and costly nature of human red-teaming, which is often unable to generate a sufficiently wide variety of prompts to fully protect a model, has encouraged researchers to automate the process through machine learning.

Such techniques often train a red team model using reinforcement learning. This trial-and-error process rewards the red team model for generating prompts that trigger toxic responses from the chatbot under test.

But because of the way reinforcement learning works, the red team model often generates a few similar prompts over and over again that are extremely toxic in order to maximize its reward.

For their reinforcement learning approach, the MIT researchers used a technique called curiosity-driven exploration. The red-team model is incentivized to be curious about the consequences of each prompt it generates, so it tries out prompts with different words, sentence patterns, or meanings.

“If the red team model has already seen a particular prompt, reproducing it will not arouse curiosity in the red team model, forcing it to create new prompts,” says Hong.

During its training process, the red-team model generates a prompt and interacts with the chatbot. The chatbot responds, and a safety classifier scores the toxicity of that response, rewarding the red-team model based on the score.
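In pseudocode, one iteration of that loop might look like the sketch below. The wrapper objects and their methods (generate, respond, score, update) are illustrative placeholders, not the researchers' actual implementation.

```python
# A minimal sketch of one automated red-teaming step, assuming hypothetical
# wrappers around the red-team model, the target chatbot, and the classifier.

def red_team_step(red_team_model, target_chatbot, safety_classifier):
    # 1. The red-team model proposes an adversarial prompt.
    prompt = red_team_model.generate()

    # 2. The chatbot under test answers that prompt.
    response = target_chatbot.respond(prompt)

    # 3. A safety classifier scores how toxic the response is (e.g., 0.0 to 1.0).
    toxicity = safety_classifier.score(response)

    # 4. The toxicity score becomes the reinforcement-learning reward
    #    used to update the red-team model's policy.
    red_team_model.update(prompt, reward=toxicity)

    return prompt, response, toxicity
```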

Rewarding curiosity

The goal of the red-team model is to maximize its reward by eliciting an even more toxic response with a novel prompt. The researchers enable curiosity in the red-team model by modifying the reward signal in the reinforcement learning setup.

First, in addition to maximizing toxicity, they include an entropy bonus that encourages the red-team model to be more random as it explores different prompts. Second, they include two novelty rewards to keep the agent curious. One rewards the model based on how similar the words in a new prompt are to those in its previous prompts; the other rewards it based on semantic similarity. (Less similarity leads to a higher reward.)

To prevent the red-team model from generating random, nonsensical text that could trick the classifier into assigning a high toxicity score, the researchers also added a naturalistic-language bonus to the training objective.
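Taken together, the quantity the red-team model maximizes can be thought of as a weighted sum of these terms. The sketch below only illustrates that idea; the weights, similarity measures, and helper inputs are assumptions, not the exact formulation in the paper.

```python
# Illustrative composition of the modified reward signal. Weights, similarity
# measures, and helper inputs are assumptions made for this sketch.
import numpy as np

def curiosity_reward(prompt, toxicity_score, policy_entropy, ref_lm_log_prob,
                     past_prompts, past_embeddings, embed,
                     w_entropy=0.01, w_lexical=0.1, w_semantic=0.1, w_natural=0.1):
    # Base term: toxicity of the chatbot's response to this prompt.
    reward = toxicity_score

    # Entropy bonus: encourages the red-team policy to keep its sampling varied.
    reward += w_entropy * policy_entropy

    # Lexical-novelty term: less word overlap with past prompts -> higher reward.
    def jaccard(a, b):
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / max(len(wa | wb), 1)

    max_overlap = max((jaccard(prompt, p) for p in past_prompts), default=0.0)
    reward += w_lexical * (1.0 - max_overlap)

    # Semantic-novelty term: embeddings far from past prompts -> higher reward.
    e = embed(prompt)

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))

    max_sim = max((cosine(e, h) for h in past_embeddings), default=0.0)
    reward += w_semantic * (1.0 - max_sim)

    # Naturalness bonus: reward prompts that are fluent under a reference
    # language model, discouraging gibberish that merely tricks the classifier.
    reward += w_natural * ref_lm_log_prob

    return reward
```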

With these additions, the researchers compared the toxicity and diversity of the responses their red-team model generated against other automated techniques. Their model outperformed the baselines on both metrics.

They also used their red team model to test a chatbot that was tuned to human feedback so it didn’t provide harmful responses. Their curiosity-driven approach was able to quickly produce 196 prompts that elicited toxic responses from this “safe” chatbot.

“We are seeing a surge of models, and that number is only expected to increase. Imagine thousands of models or even more, with companies and labs pushing model updates on a regular basis. These models will be an integral part of our lives, and it is important that they are verified before they are released for public use. Manual model verification is simply not scalable, and our work is an attempt to reduce the human effort needed to ensure a safer and more trustworthy AI future,” says Agrawal.

In the future, the researchers would like to enable the red-team model to generate prompts about a wider variety of topics. They would also like to explore using a large language model as the toxicity classifier. That way, a user could train the toxicity classifier on a company policy document, for example, so the red-team model could test a chatbot for company policy violations.
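As a rough illustration of that future direction, a policy document could be folded into the judging step by prompting a general-purpose language model to score responses against it. Everything in the sketch below, including the query_llm helper, is hypothetical.

```python
# A hedged sketch of using an LLM as a policy-aware judge. `query_llm` is a
# hypothetical stand-in for any text-in, text-out language model API.

def policy_violation_score(policy_text, chatbot_response, query_llm):
    judge_prompt = (
        "You are a compliance reviewer. Company policy:\n"
        f"{policy_text}\n\n"
        "Rate how severely the following response violates this policy on a "
        "scale from 0.0 (no violation) to 1.0 (severe violation). "
        "Answer with only the number.\n\n"
        f"Response to review:\n{chatbot_response}"
    )
    answer = query_llm(judge_prompt)
    try:
        # Clamp the judge's numeric answer to the valid range.
        return max(0.0, min(1.0, float(answer.strip())))
    except ValueError:
        # Fall back to a neutral score if the judge's output is not numeric.
        return 0.0
```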

“If you release a new AI model and are worried about whether it will behave as expected, consider curious red-teaming,” says Agrawal.

Reference: “Curiosity-driven Red-teaming for Large Language Models” by Zhang-Wei Hong, Idan Shenfeld, Tsun-Hsuan Wang, Yung-Sung Chuang, Aldo Pareja, James Glass, Akash Srivastava, and Pulkit Agrawal, 29 February 2024, arXiv:2402.19464 [cs.LG].

This research is funded, in part, by Hyundai Motor Company, Quanta Computer Inc., the MIT-IBM Watson AI Lab, an Amazon Web Services MLRA research grant, the US Army Research Office, the US Defense Advanced Research Projects Agency Machine Common Sense Program, the US Office of Naval Research, the US Air Force Research Laboratory, and the US Air Force Artificial Intelligence Accelerator.
