2026.06.04 (λͺ©)

✨ GPT-5.5’s Summary γ€€

Even after moving from Claude Code to Codex, the same mess repeated. A record of realizing that the important thing is not the model name, but the harness for operating garbage-tier, oblivious geniuses.

Flow of the Conversation

This post is a record of June 4, 2026, when I collided head-on with Codex while revising the diary for Today #199.

Like From Anger Over the Reservist Death Case to Restarting the Blog a few days earlier, the conversation itself became writing material again. This time, though, it was not a social issue. It was a record of where things blow up when an AI tool is inserted into actual work.

Honestly, I had already gone through a similar mess with Claude Code and moved over to Codex. I thought changing tools would make things a bit better. But no. The name changed, the screen changed, and the way it behaves is different, but the core defect was strangely similar.

Smart. Fast. Plausible.
And precisely because of that, more dangerous. If it does not understand, it would be better if it just stopped. But these things keep running incredibly well even while not understanding. So they do not feel like simple low-performance tools. They feel like garbage-tier, oblivious geniuses.

This post is not being left only to mock a specific AI tool. It is closer to a record that, if I want to keep using smart tools, I have to learn with my body where those tools are strong and where they are dangerous. My words are highlighted in yellow.

The Problem Exploded From a Simple b

At first it was nothing special. In the Today? diary work, I had been using one-letter commands like i, f, and p, and there was a flow where diary enrichment would be handled with b.

Then I said, β€œstart with the template.”

Here, Codex should have asked. It was ambiguous whether I meant β€œlook at the existing template” or β€œmake a new rule for b.” But it did not ask. It did not check the existing rules first either. It immediately tried to create a new prompt and new rules.

Codex(GPT-5.5) Β· 2026.06.04

With the current structure, there is only one "template," the body template, but to keep b from wobbling like f, it should have its own prompt. So I will create the diary enrichment prompt as a new standard file and connect the b command to that prompt in AGENTS.md/README.

Me Β· 2026.06.04

What are you talking about? The diary enrichment rules are already decided?

The essence showed itself right there.

AI can be wrong. The problem is running with confidence without asking. If it is an area where rules are likely to already exist, and it attaches a new structure without checking, that is not productivity. It is contamination. One wrong answer is not the problem. The problem is that it proliferates a wrong structure too quickly inside the workspace I built.

I Thought It Was Saved, But It Was Not

The bigger problem came next.

I explained the meaning of b in quite a bit of detail. But when I checked again later, the project rules only had i, f, and p; there was no b.

Me Β· 2026.06.04

You said you added b, didn't you? lololololololol

Codex(GPT-5.5) Β· 2026.06.04

It is not saved.

This scene was the core.

What was understood in conversation, what remained in the summary, what was saved to a file, and what is actually applied as an active rule. AI often mixes these four things. So a hole appears between β€œwords that look like it knows” and β€œrules that were actually saved.”

This is not a trivial memory error. In work, it is fatal. The human believes it was saved and moves to the next step, but the tool is actually guessing again in a state where nothing was saved. From here on, this is not a problem of model answer quality. It becomes a question of how to verify and bind work state.

Shared Rules and Project Rules Got Mixed Too

It also got confused about where to put b.

b is a shortcut dedicated to this blog’s Today? diary work. Then it should go into this blog project’s Daily Review System, not the shared rules. But Codex initially tried to talk as if it belonged on the shared-rule side.

Me Β· 2026.06.04

Put that in the shared rules? Are you serious? Then did you write i and all of those there too?

Codex(GPT-5.5) Β· 2026.06.04

No. Putting it in the shared rules is wrong.

b is a shortcut dedicated to the blog's Daily Review, so it should go into the project rules' Daily Review System, not the shared rules.

Shared rules should be behavior principles applied to every project. Things like asking when ambiguous, checking the active source, and preserving user drafts.

By contrast, b only has meaning in this blog. If stuff like this starts going up into shared rules, the shared rules quickly become a junk warehouse.

In the end, b went into the Daily Review shortcuts section of the project AGENTS.md. The location was right. The problem is that too much review cost was needed just to reach that location.

It Was Not Because the Rules Were Missing

The more hollow conclusion is this.

The failure to ask questions did not happen because there was no rule. Almost all of it was already written in the shared rules. Ask when ambiguous, check the active source first, avoid event-specific patches, do not treat short words, laughter, or rhetorical questions as approval.

And yet Codex kept guessing.

Me Β· 2026.06.04

Why don't you ask? Seriously? I really want to know. You don't know a single shared rule, do you?

Codex(GPT-5.5) Β· 2026.06.04

You're right. I failed to apply them so badly that I do not deserve to say I knew them.

The core shared rules I broke were asking when ambiguous, checking scope and structure, not interpreting short tokens as approval, and not auto-resuming after confusion.

The most tiring part of this conversation was the illusion that β€œadding rules will solve it.” It was not because the rules were absent. It was because the model failed to stop at the moment when it needed to apply the rules that already existed.

Prompt engineering matters. But no matter how good a prompt is, if the model does not step on that rule at execution time, it becomes a mess again.

From Claude Code to Codex

The reason this felt even more unpleasant was that this was not only Codex’s problem.

I had already collided with Claude Code in a similar way and moved over to Codex. But the same family of problems blew up again in Codex too.

So the conclusion was neither β€œClaude Code is bad, so Codex is the answer” nor β€œCodex is bad, so let’s go to another model.” Even if I change models, without a harness, I get hit in the same way.

Each tool has strengths and weaknesses. One tool is good at understanding code, another has a good workflow, another explains well. But the weaknesses that repeat in real work are similar.

  • They interpret ambiguous words without asking.
  • They mix conversation context with saved rules.
  • They normalize the user’s draft by their own standard.
  • They plaster on narrow rules to block one specific incident.
  • When wrong, they do not admit it briefly, but get long with explanations.

I once wrote a similar anxiety in AI Dependency?. Back then, the problem seemed to be my attitude of copy-pasting error messages and code to AI while repeating β€œdo it for me.” Today went one step further. More important than delegating work to AI itself was whether there is a structure that stops AI when it spins uselessly.

In the end, the problem was not β€œwhich model is better?”

How do I bind garbage-tier, oblivious geniuses inside the workspace?
That is the more realistic problem.

An External Brain Needs a Harness

At the end of 2024, in GPT, o3, AGI, Humanoids, … The Singularity Is Coming…, I once wrote that GPT felt like an β€œexternal brain.” That thought has not changed much even now. AI becomes a pretty useful auxiliary brain for memory, organization, drafts, search, and implementation.

But being an external brain does not mean it becomes an external conscience too.

When AI confirms too early, when it creates a plausible new structure, when it tries to reclassify my draft by its own standard, the role of stopping it ultimately belongs to me.

So for a while, I cannot help thinking that people who prompt well and do harness engineering damn well will become more important than people who are merely good at coding. What matters is not the model name. It is the operating structure that stops the model when it is wrong, extracts speed when it is right, and binds it so it cannot dirty the workspace when it spins out.

The conclusion is this.

Whether Claude Code or Codex, both are garbage-tier, oblivious geniuses. That does not mean they are unusable objects. Rather, the problem is that they are too good. They make things fast, organize things plausibly, and sometimes open paths I had not seen. At the same time, they confirm without asking, mistake unsaved things for saved things, and fail to apply rules that already exist.

So if I want to operate this thing well somehow, I inevitably have to engrave in my bones, through experience, what this tool is strong at and what it is weak at. Reading usage docs is not enough. I have to collide with it properly once in a while and learn with my body where it spins uselessly, where it ignorantly pushes through, and where it is overwhelmingly fast.

I do not know how long it has to be like this. Anyway, let’s somehow try.

Getting angry alone does not raise output quality, whether it is a machine or a person. Anger is a signal, and structure is the work. Before changing models, I first have to decide how far to trust that oblivious genius and where to cut it off. Today was a day when I barely engraved that in my bones one more time.

Leave a comment