From the article: In a paper accompanying the release, Meta researchers write that the model “has a high propensity to generate toxic language and reinforce harmful stereotypes, even when provided with a relatively innocuous prompt.” This means it’s easy to get biased and harmful results even when you’re not trying. The system is also vulnerable to “adversarial prompts,” where small, trivial changes in phrasing can be used to evade the system’s safeguards and produce toxic content.
The researchers further warn that the system has an even higher risk of generating toxic results than its predecessors, writing that “OPT-175B has a higher toxicity rate than either PaLM or Davinci,” referring to two previous language models. They suspect this is in part due to the training data including unfiltered text taken from social media conversations, which increases the model’s tendency to both recognize and generate hate speech.