Content Moderation for Generative AI: Output Safety and Red-Teaming

When a platform moderates user content, the harmful material comes from people. When a platform ships a generative AI product, a new source of harmful material appears: the model itself. A chatbot can be talked into giving dangerous instructions, an image generator can produce content it should refuse, an assistant can confidently state something false […]