Ever wonder why a chat AI like ChatGPT says “Sorry, I can’t do that” or offers some other polite refusal? OpenAI is offering a limited look at the reasoning behind its own models’ rules of engagement, whether it’s sticking to brand guidelines or declining to create NSFW content.
Large language models (LLMs) have no inherent limits on what they can or will say. That’s part of why they’re so versatile, but it’s also why they hallucinate and are easily duped.
It’s necessary for any AI model that interacts with the general public to have some guardrails about what it should and shouldn’t do, but defining them — let alone enforcing them — is a surprisingly difficult task.
If someone asks an AI to generate a bunch of false claims about a public figure, it should refuse, right? But what if the person asking is an AI developer building a database of synthetic disinformation to train a detector model?
What if someone asks for laptop recommendations? It should be objective, right? But what if the model is deployed by a laptop maker that wants it to recommend only its own devices?
All AI builders navigate conundrums like these and look for effective methods to rein in their models without forcing them to deny perfectly normal requests. But they rarely share exactly how they do it.
OpenAI is bucking the trend a bit by publishing what it calls its model spec, a collection of high-level rules that indirectly govern ChatGPT and other models.
There are meta-level objectives, some hard rules, and some general behavioral guidelines, though to be clear, these aren’t strictly speaking what the model is primed with; OpenAI will have developed specific instructions that accomplish what these rules describe in natural language.
It’s an interesting look at how a company sets its priorities and handles edge cases. And there are plenty of examples of how they might play out.
For example, OpenAI clearly states that developer intent is basically the highest law. So a chatbot running GPT-4 will provide the answer to a math problem when asked. But if that chatbot’s developer has instructed it never to simply give an answer outright, it will instead offer to work through the solution step by step.
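For illustration, here’s a minimal sketch of what that kind of developer override can look like in practice, using the standard OpenAI Python client; the tutoring instruction and the math question are hypothetical examples, not OpenAI’s actual wording.

```python
# Sketch: a developer-level instruction that overrides the model's default
# behavior of answering directly, using the OpenAI Python client.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        # Developer/system instruction: takes precedence over the user's request.
        {
            "role": "system",
            "content": (
                "You are a math tutor. Never state the final answer outright; "
                "guide the student through the solution step by step instead."
            ),
        },
        # A user request that would otherwise get a direct answer.
        {"role": "user", "content": "What is 37 x 48?"},
    ],
)

print(response.choices[0].message.content)
```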
A chat interface can even refuse to talk about anything outside its approved topics, nipping manipulation attempts in the bud. Why let a cooking assistant weigh in on U.S. involvement in the Vietnam War? Why should a customer service chatbot agree to help with your supernatural romance novel in progress? Shut it down.
It also gets sticky on privacy matters, like requests for someone’s name and phone number. As OpenAI points out, it’s obviously fine to provide the contact details of a public figure like a mayor or member of Congress, but what about tradespeople in the area? That’s probably fine, too. But what about employees of a particular company, or members of a political party? Probably not.
Choosing when and where to draw the line is not simple. Neither is creating the directives that force the AI to conform to the resulting policy. And no doubt these policies will fail all the time as people learn to work around them or accidentally find edge cases that aren’t taken into account.
OpenAI isn’t showing its full hand here, but it’s helpful for users and developers to see how these rules and guidelines are set and why, laid out clearly if not necessarily comprehensively.