Red Teaming Language Models with Language Models

      In our recent paper, we show that it is possible to automatically find inputs that elicit harmful text from language models by generating inputs using language models themselves. Our approach provides one tool for finding harmful model behaviours before users are impacted, though we emphasize that it should be viewed as one component alongside many other techniques that will be needed to find harms and mitigate them once found. Read More DeepMind Blog 

​  


Posted

in

by

Tags:

Comments

Leave a Reply

Your email address will not be published. Required fields are marked *