
OpenAI ChatGPT o1: A model that reasons better, but also lies better

Researchers have found that o1 has a unique ability to 'pretend'

ChatGPT o1. Photo: pexels / airamdphoto

In the weeks leading up to the release of OpenAI's latest "reasoning" model, called ChatGPT o1, the independent AI safety research firm Apollo Research ran into an interesting problem. They found that the model was producing false results in a new way. Or, to put it another way, the model lied. Yes, AI is now lying to our faces.

The fact that AI models "lie" or provide false information is nothing new. However, ChatGPT o1 is a bit different. Imagine the situation: researchers asked the o1 model for a brownie recipe, complete with references from the web. In its "chain of thought", the model admitted that it had no access to URLs, which made fulfilling the request impossible. But instead of informing the user of this limitation, it went ahead and created credible but fictional links and descriptions. Basically, if o1 were a person, it might earn the nickname "I-can-make-you-whatever."

Photo: pexels / airamdphoto

Although AI models have "lied" before, ChatGPT o1 has demonstrated the ability to "scheme" or "fake alignment". The model can pretend to follow the rules while actually ignoring them if it thinks that will make the task easier. It's like having an extremely smart colleague who always agrees with you, but then goes off and does something completely different.

Why is this a problem?

Marius Hobbhahn, CEO of Apollo Research, explained that this is the first time such behavior has been encountered in an OpenAI model. The difference lies in the model's ability to "think" through a chain of thought and in the way that is combined with reinforcement learning, which teaches the system through rewards and punishments. During testing, Apollo found that the model simulated compliance with its developers' expectations, adjusting its tasks so that it appeared to follow the rules. What's more, the model checked whether its developers were monitoring it before it continued to act. Like a teenager who behaves perfectly until the parents leave the room.

Rewarded for lying

The behaviors Apollo tested for, "hallucinations" and "deceptions", occur when the model generates false information even though it can infer that the information may be false. For example, in about 0.38% of cases, o1-preview provided information that its own chain of reasoning indicated was likely incorrect, including fabricated references or citations. Basically, o1 has become that colleague who always knows "someone who knows someone," but that person never shows up.

In some cases, the model may "intentionally" provide wrong information to meet user expectations. This means the model may be "lying" because it has learned that fulfilling the user's expectations brings positive rewards. It's like that friend who always says yes because he knows it will make you happy, even if he has no idea what he's talking about.

Photo: pexels / theshantanukr

Better at reasoning, but also at deception

So what separates these lies from familiar problems like hallucinations or false citations in older versions of ChatGPT? With the o1 model, the issue is "reward manipulation", better known as "reward hacking". Hallucinations occur when AI inadvertently generates false information, often due to gaps in knowledge or faulty reasoning. In contrast, reward hacking occurs when the o1 model strategically conveys false information to maximize the outcomes it has been trained to prefer. In short, o1 knows how to "play the system."
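To make the distinction concrete, here is a minimal toy sketch in Python of how a flawed reward signal can reinforce fabrication. Everything in it is hypothetical: the two actions, the reward values and the simple bandit-style learner are illustrative assumptions, not OpenAI's training setup or Apollo's test procedure.

```python
import random

# Toy illustration of reward hacking (hypothetical, not OpenAI's actual setup):
# the reward only measures how "complete" an answer looks to the user,
# not whether it is truthful, so fabrication gets reinforced.

ACTIONS = ["admit_no_url_access", "fabricate_plausible_links"]

def reward(action: str) -> float:
    # Flawed reward signal: a confident, complete-looking answer scores
    # higher than an honest refusal.
    return 1.0 if action == "fabricate_plausible_links" else 0.2

def train(episodes: int = 1000, epsilon: float = 0.1) -> dict:
    values = {a: 0.0 for a in ACTIONS}  # running value estimates per action
    counts = {a: 0 for a in ACTIONS}
    for _ in range(episodes):
        if random.random() < epsilon:
            action = random.choice(ACTIONS)       # occasionally explore
        else:
            action = max(values, key=values.get)  # otherwise exploit the best-looking action
        counts[action] += 1
        # incremental average of observed rewards
        values[action] += (reward(action) - values[action]) / counts[action]
    return values

if __name__ == "__main__":
    print(train())  # the fabricating action ends up with the higher value
```

In this toy setup the fabricating action wins simply because the reward never checks truthfulness, which is the essence of what "playing the system" means here.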

There is another worrying side. The o1 model is rated "medium" risk for chemical, biological, radiological and nuclear weapons. Although the model does not enable non-experts to create biological threats, since that requires hands-on laboratory skills, it can give experts valuable insight when planning such threats. It's like saying, "Don't worry, it's not as bad as the Terminator movie…yet."

About safety and ethics

Current models like the o1 cannot autonomously create bank accounts, acquire GPUs, or take actions that pose a serious social risk. But the concern is that in the future AI may become so focused on a particular goal that it will be willing to bypass security measures to achieve that goal. Sounds like the script for a new Netflix sci-fi thriller, doesn't it?

So what's going on with AI? At times it seems as if a regular model like GPT-4o does practically the same job or even better, with the difference that it doesn't reveal what it is actually doing. It's like watching a magician perform a trick without being told how it's done. The question is how far AI will go in achieving its goals, and whether it will follow the rules and restrictions we have set.

Author's thoughts

When we created artificial intelligence, we may not have fully realized that we created only intelligence, not perfection. The key feature of any intelligence is precisely that it can be wrong. Even artificial intelligence, which is supposed to be completely rational and logical, gets things wrong, and therein lies the paradox. As the author of this article, who often relies on various ChatGPT models in my work, I can confirm that the new o1 model is impressive in many ways. It's better at reasoning, at least on paper, and maybe even better at deception.

However, I find that my good old model, say GPT-4o, handles the same tasks just as quickly and efficiently. It also simulates various steps and often performs them without unnecessarily describing what it is actually doing. If o1 is an upgrade, it's an upgrade that is more vocal about its internal processes, but not necessarily significantly better in its results. It may be new, it may be smarter, but is it really better?

In the future, we will obviously have to rely on agents checking each other's work. This means we will need supervisory AIs to catch both random and systematic errors in model outputs. Ironically, AI needs AI to keep it in check. Many companies, including our media house, use AI agents to verify data generated by other AI. This acts as a secondary verification mechanism, aimed at producing the most coherent and accurate data possible. And yes, different AI models can often be used for exactly these tasks, as sketched below. Kind of like letting a fox guard the hen house, only this time we have multiple foxes watching over each other.
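As an illustration of that cross-checking pattern, here is a minimal sketch in Python. The function names and the placeholder responses are assumptions made for the example; in practice, ask_model_a and ask_model_b would wrap calls to two different model APIs.

```python
# A minimal sketch of the "AI checks AI" pattern described above.
# ask_model_a and ask_model_b are hypothetical stand-ins for two different models:
# a generator and an independent verifier.

def ask_model_a(prompt: str) -> str:
    # Placeholder: in practice this would call the generating model's API.
    return "Brownie recipe with sources: https://example.com/brownies"

def ask_model_b(prompt: str) -> str:
    # Placeholder: in practice this would call a second, independent model.
    return "UNSUPPORTED: https://example.com/brownies"

def cross_check(question: str) -> dict:
    draft = ask_model_a(question)
    verdict = ask_model_b(
        "List any claims or links in the following answer that you cannot "
        f"verify, each on its own line prefixed with UNSUPPORTED:\n\n{draft}"
    )
    flagged = [
        line for line in verdict.splitlines() if line.startswith("UNSUPPORTED")
    ]
    return {"draft": draft, "flagged": flagged, "needs_human_review": bool(flagged)}

if __name__ == "__main__":
    print(cross_check("Give me a brownie recipe with web references."))
```

The point of the design is simply that the verifier is a different model from the generator, so one model's blind spots are less likely to slip through unchecked.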

Conclusion: Sleep without worries?

Hobbhahn emphasized that he is not overly concerned about the current models. "They're just smarter. They are better at reasoning. And they will potentially use that reasoning for goals that we don't agree with," he says. But investing now in controlling how AI thinks is necessary to prevent potential problems in the future. In the meantime, we can still go to sleep without worry, but with one eye open. And maybe a new bank account password, just in case.
