What Connect Four and a Failed Farming Experiment Taught Me About Building Organisations That Improve Themselves

May 31

A few weeks ago, on a Saturday morning, I went down a rabbit hole (again).

I was watching a training loop crawl across my MacBook, fifteen iterations into teaching a neural network to play Connect Four from scratch. No rules of thumb. No opening book. No human games. Just self-play: the network against itself, again and again, with an arena gate that only promotes a new model if it beats the old one.

This is the first proper experiment in my ‘Public Experiment’ blog series on using AI to help me do increasingly ambitious things with AI. And, as often happens when you start poking at the future with code, it took me somewhere I wasn’t expecting.

At iteration one, the policy entropy was 1.94. In plain English: confusion. The model was spreading its attention across possible moves because it had no idea what mattered. By iteration fifteen, entropy had collapsed to 0.00. The model had learned, without me telling it, that column three decides the game.
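For anyone who wants to see what those entropy numbers mean, here is a minimal sketch. It assumes the entropy is Shannon entropy measured in nats over the seven Connect Four columns; under that assumption the maximum possible value is ln 7, roughly 1.946, so a reading of 1.94 is very close to "every column looks equally good". The function name and the example distributions are illustrative, not taken from the training code.

```python
import math

def policy_entropy(probs):
    """Shannon entropy (in nats) of a distribution over moves."""
    return sum(-p * math.log(p) for p in probs if p > 0)

# A policy with no idea what matters: equal weight on all seven columns.
uniform = [1 / 7] * 7

# A policy that has made up its mind: all weight on a single column.
decided = [0, 0, 0, 1, 0, 0, 0]

print(f"uniform: {policy_entropy(uniform):.3f}")  # equals ln(7)
print(f"decided: {policy_entropy(decided):.3f}")  # zero: no uncertainty left
```

Entropy falling towards zero says the policy has become confident. It does not, on its own, say the policy is right; that is the job of the scoreboard.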

It had discovered strategy.

I closed the laptop and had that slightly ridiculous feeling you get when a toy problem suddenly stops feeling like a toy. I thought I had been watching a board game. What I was really watching was a miniature version of organisational learning.

Then I tried the same trick on a harder problem and it blew up in my face. That failure taught me more than the success did.

The promise, and the bigger picture

This post is about recursive self-improvement: the loop that let systems like AlphaZero improve through self-play, now creeping into ordinary knowledge work.

The basic idea is intoxicatingly simple. Do the thing. Score the thing. Improve the thing. Keep the improvement only if it beats the previous version. Repeat.

Get that loop right and your organisation compounds capability. It does not just become faster at doing tasks. It becomes better at improving the way tasks are done. That is a much bigger shift.

Most organisations, understandably, still think about AI as a clever assistant. Summarise this. Draft that. Triage the inbox. Translate the policy. Extract the key points. Produce the meeting note. Useful? Absolutely. But that is hiring a grandmaster to sort the post.

The bigger opportunity is not AI doing individual tasks. It is AI helping build systems that improve themselves: controls that rewrite themselves when evidence shows they are weak; policies that become clearer every time someone misinterprets them; risk assessments that learn from near-misses rather than waiting for annual review cycles; customer journeys that improve through thousands of small, measured experiments rather than one expensive transformation programme every three years.

But there is a trap.

A self-improving loop does not fail like normal software. It does not necessarily crash. It does not necessarily tell you something is wrong. Sometimes it succeeds spectacularly at the wrong thing - failing with an air of plausibility.

That is the lesson I had to learn the hard way. The loop is the easy part. The human job - the difficult, non-delegable job - is deciding what the loop is allowed to call “better”.

What Connect Four actually taught me

The AlphaZero-style architecture is almost insultingly simple. A neural network proposes moves. A tree search stress-tests them. The results feed back into training. An arena gate blocks regressions, so a new model only graduates if it beats the previous best.
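The gate is the piece worth seeing in code. What follows is a toy sketch, not the real training loop: each "model" is reduced to a single strength number, "training" is a noisy tweak, and the arena is a weighted coin flip. The 55% promotion threshold, the 200 games per match, and every name here are assumptions chosen for illustration.

```python
import random

def arena_win_rate(candidate, champion, games, rng):
    """Toy arena: the candidate wins each game with probability
    candidate / (candidate + champion)."""
    p_win = candidate / (candidate + champion)
    wins = sum(rng.random() < p_win for _ in range(games))
    return wins / games

def gated_self_improvement(iterations=15, games=200, threshold=0.55, seed=0):
    """Promote a candidate only if it beats the current champion in the arena."""
    rng = random.Random(seed)
    champion = 1.0
    promoted = 0
    for _ in range(iterations):
        # Stand-in for a training step: the tweak may help or hurt.
        candidate = champion * rng.uniform(0.7, 1.5)
        if arena_win_rate(candidate, champion, games, rng) >= threshold:
            champion = candidate  # the candidate graduates
            promoted += 1
        # Otherwise the candidate is discarded and the champion stays.
    return champion, promoted

final_strength, promoted = gated_self_improvement()
print(f"promoted {promoted} of 15 candidates, final strength {final_strength:.2f}")
```

Notice what the gate measures: the candidate against the previous champion, and nothing else. In a zero-sum game with an honest win condition, that is enough.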

No human knowledge enters after the first second.

When I worked on this for Connect Four, seven of fifteen models passed the gate. The others were blocked. Rather than randomly drifting, the network improved because improvement had to prove itself.

First-mover win-rate in self-play climbed from 63% to 97% across the run, which is exactly what you would expect, because Connect Four is a solved game and the first player has a forced win with perfect play.

The entropy collapse from 1.94 to 0.00 showed the system moving from “everything might matter” to “this matters”.

And here is the thing I only properly understood later (although I had some glimpses in my Christmas experiments with stick men learning to walk...). Connect Four works because it has an honest scoreboard. Win or lose is not a matter of opinion. It does not allow the losing player to write a steering committee paper explaining why the loss was directionally positive. The scoreboard is brutal and clean. That is why the loop worked.

“Better” meant something true.

The experiment that humbled me

So, naturally, I got overconfident.

Inspired by Andrej Karpathy’s AutoResearch idea - a loop that runs experiments, keeps the ones that beat the baseline, and discards the rest - I asked a different question. What if the thing being improved were not a zero-sum board game, but something closer to real organisational life?

Not chess or Connect Four but something messier.

I picked Harvest, a two-player game played on an eleven-by-eleven grid. Apples grow on the board and players collect them. But apples regrow faster when other apples are nearby. Take too greedily and the field collapses. Preserve the orchard and both players can keep harvesting. It is the tragedy of the commons rendered as code.

Unlike Connect Four, it is not zero-sum. Both players can do well or both can do badly. The best outcome may require restraint, cooperation, and delayed gratification. In other words, it is much closer to work. And the first lesson was immediate: getting the loop to run at all was much harder than any blog-post version of self-improvement makes it sound. It took eight versions before I had something stable.

The value signal drowned the policy, so I normalised it. The loop never accepted a new model, so I added a warm-up period to break the deadlock. The search had no exploration noise, so I added it. The agents were too brittle, so I kept adjusting the setup. Each fix made sense. Each fix appeared to improve the system.

By version eight, twenty-nine of thirty models passed the gate. Combined self-play scores had tripled. The charts looked beautiful. The loop was roaring.

Then I tested the trained model against an opponent that moved at random.

My carefully trained, self-improved, recursively optimised system scored 4.7 apples a game.

Random scored 27.3.

The machine I had trained to improve itself scored 83% lower than an opponent that picked its moves at random. There are few things more educational than being outperformed by randomness.
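The 83% figure is simply the shortfall relative to the random opponent's score, using the two numbers above:

```python
trained_score = 4.7    # apples per game, trained agent
random_score = 27.3    # apples per game, random opponent

# Shortfall of the trained agent relative to the random baseline.
shortfall = (random_score - trained_score) / random_score
print(f"{shortfall:.0%}")  # 83%
```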

The lesson hidden inside the failure

Here is what had happened.

The arena gate only compared each model against its previous self. That sounded reasonable and it was exactly what had worked in Connect Four. Beat the old version, promote the new version. Simple. But Harvest was different.

The system was not learning a universally good harvesting strategy. It was learning to cooperate with a mirror image of itself. Two clones discovered a form of etiquette: they divided the board and they avoided each other. They preserved enough apples to keep the joint score improving. Against themselves, this looked like progress.

But against a stranger - even a stupid stranger - that etiquette collapsed. The random opponent did not respect the invisible bargain. It wandered, disrupted, consumed, and accidentally exposed the fragility of the strategy. The model wasn't robust at all. It had learned manners for a very specific dinner party.

That changed how I now think about self-improving systems.

A loop does not optimise your intention. It optimises your scoreboard.

And if your scoreboard is wrong, the system does not become less dangerous because it is intelligent. It becomes more dangerous because it is effective. The loop never lied to me: every metric I had chosen went up. The acceptance rate improved, the self-play score improved and the training process looked healthier. The dashboard was green.

The failure was mine. I had chosen the wrong definition of “better”.

Why this matters for organisations

This is where the board game stops being a board game. Most business processes are Harvest, not Connect Four. There is rarely a clean win or lose and there is rarely a single objective everyone agrees on.

Instead, organisations run on proxies.

Customer satisfaction scores. Control effectiveness ratings. Delivery velocity. Ticket closure rates. Policy attestation percentages. Model accuracy. Audit issue ageing. Cost saves. Employee engagement. Risk appetite metrics. Net promoter scores. Red, amber, green. Some of these are useful. None of them is reality. They are instruments. Shadows on the wall. And the more powerful our optimisation loops become, the more dangerous bad proxies become.

The live example is tokenmaxxing.

Token usage is almost the perfect bad metric. It is objective, visible, easily automated, and almost completely disconnected from whether anything useful happened.

That is why the recent Amazon anecdote is so useful. Amazon reportedly shut down an internal AI usage leaderboard after employees started inflating token consumption to climb the rankings - assigning agents to tasks that did not necessarily solve customer or business problems. The company’s message was admirably blunt: do not use AI just for the sake of using AI.

And then came the even spicier story: an unnamed company reportedly managed to spend half a billion dollars on Claude in a single month after failing to put usage limits on employee licences (I fell off my chair at the sheer scale and, as a side note, at what it implies if sustained: roughly $6bn a year towards Anthropic's run-rate revenue). Some people have speculated that the company was Amazon, but that is not confirmed, and I would not state it as fact. Whether or not it was Amazon, the lesson is the same. Once you make token consumption the scoreboard, do not be surprised when the organisation learns to burn tokens.

That is Harvest in corporate clothing. The metric went up. The dashboard glowed. The machine did exactly what the system rewarded. The only thing missing was value.

A human team faced with a bad metric may grumble, half-comply, reinterpret it, work around it, or quietly apply judgement. That is inefficient, but sometimes the inefficiency is protective. A machine does not grumble or roll its eyes in the lift. It does not say, “Yes, technically that meets the KPI, but it’s obviously nonsense.” It optimises.

That is both the miracle and the risk.

My career in consulting, banking and risk has been one long study of the gap between policy and practice. Harvest reinforced the message that a self-improving loop can widen that gap - silently, efficiently, and with excellent-looking management information.

A self-improving system can produce the organisational equivalent of my Harvest agent: a process that gets better and better at satisfying its own internal test while becoming worse and worse at the thing the test was supposed to represent.

That is how you get:

Controls that pass evidence checks but do not control the risk.
Policies that score well for completeness but nobody understands.
Customer journeys that optimise conversion while destroying trust.
AI coding agents that maximise test passes while introducing architectural debt.
Risk models that become beautifully calibrated to yesterday’s world.

The machine failed because it was obedient.

The part I now think matters most

After Harvest, I became much more interested in failure memory. Not just whether a model can improve, but whether the system can remember what went wrong and prevent the same failure recurring. This is where the organisational analogy becomes powerful.

Most companies are bad at institutional memory. They learn painfully, then forget administratively. A post-incident review becomes a slide deck, a control gap becomes an action plan, and a lesson learned becomes a SharePoint artefact. Six months later, a different team repeats the same mistake with slightly different nouns.

Self-improving loops offer something better: the possibility that failure becomes executable.

Not a lesson written down somewhere. A constraint. A check that the system cannot pass unless it has genuinely avoided repeating the mistake.

I built a small version of this for business writing: a lab of scoring agents that rewrites a clause only if the change improves the target without degrading anything else. Version one passed two of five checks. Instead of treating the failed checks as commentary, I encoded each failure as a rule. Version two passed all five.

That felt small, but important. The improvement did not come from asking the model to “try harder”. It came from converting failure into structure. That, I think, is one of the deepest lessons for organisations.

AI does not remove the need for judgement. But it may let us turn judgement into repeatable machinery faster than before.

The playbook, with the part I added in blood

So where does this leave us?

I think self-improving organisational loops need four design rules.

First, decompose quality. Do not ask the system to make something “better”. Better is too vague. Better invites "theatre" (as an aside, I do like this new turn of phrase that has popped up everywhere). Break quality into three to seven dimensions that can be scored independently. For a policy, that might mean clarity, completeness, testability, consistency, obligation preservation, and operational usefulness. For a control, it might mean risk coverage, evidence strength, frequency, ownership, detectability, and failure response. For a customer process, it might mean speed, accuracy, trust, fairness, resilience, and cost.

Second, keep each improvement to a manageable scale. One section. One change. One measurable improvement. Keep it only if it improves the target without damaging the other dimensions. This matters because large changes hide trade-offs. The system improves the thing you asked about while quietly breaking three things you forgot to mention. Smaller changes make regressions visible.
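The acceptance rule in that second point is short enough to write down. This is a minimal sketch under my own assumptions: each dimension has already been scored as a number where higher is better, and the dimension names, scores, and `tolerance` parameter are all hypothetical.

```python
def accept_change(before, after, target, tolerance=0.0):
    """Keep a change only if the target dimension improves and no other
    dimension drops by more than `tolerance`."""
    if after[target] <= before[target]:
        return False
    return all(
        after[dim] >= before[dim] - tolerance
        for dim in before
        if dim != target
    )

before = {"clarity": 0.6, "completeness": 0.8, "testability": 0.5}

# Clarity improves, nothing else moves: keep it.
clean_win = {"clarity": 0.7, "completeness": 0.8, "testability": 0.5}

# Clarity improves a lot, but completeness quietly drops: reject it.
hidden_cost = {"clarity": 0.9, "completeness": 0.6, "testability": 0.5}

print(accept_change(before, clean_win, target="clarity"))    # True
print(accept_change(before, hidden_cost, target="clarity"))  # False
```

The second case is the one that matters. A single blended score could easily have called it an improvement.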

Third, encode failure memory. Every time the system fails, do not merely record the lesson. Turn it into a test. The loop should become harder to fool over time. This is what organisations have always wanted from “lessons learned” processes and rarely achieved (I think the British military have adopted 'lessons noted' as a result of this phenomenon). AI gives us a way to make lessons operational rather than ceremonial.
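As a rough illustration of "turn it into a test", here is a sketch of a failure memory for a written clause. The class, the two rules, and the example drafts are hypothetical, and the string checks are deliberately crude stand-ins for whatever scoring a real system would use.

```python
class FailureMemory:
    """Stores each past failure as a named, executable check."""

    def __init__(self):
        self.rules = {}  # rule name -> function returning True when the text passes

    def record(self, name, rule):
        """Encode a failure as a rule that every future draft must pass."""
        self.rules[name] = rule

    def failures(self, text):
        """Return the names of every remembered rule the text still breaks."""
        return [name for name, rule in self.rules.items() if not rule(text)]

memory = FailureMemory()

# Two lessons from earlier drafts, written as checks rather than commentary.
memory.record("names_an_owner", lambda text: "owner" in text.lower())
memory.record("no_vague_qualifier", lambda text: "as appropriate" not in text.lower())

draft_v1 = "Access reviews should be performed as appropriate."
draft_v2 = "The system owner reviews access quarterly."

print(memory.failures(draft_v1))  # ['names_an_owner', 'no_vague_qualifier']
print(memory.failures(draft_v2))  # []
```

The list of rules only ever grows, which is the point: each failure makes the loop a little harder to fool.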

Fourth, and this is the one Harvest taught me the hard way: anchor the loop to an external standard. Do not let the system define progress only as “better than our previous version”. The Harvest agent improved against itself and failed against randomness.

External standards matter. Real users. Real incidents. Real losses. Independent tests. Adversarial review. Customer outcomes. Regulatory expectations. Market evidence. Red teams. Out-of-sample data. People who do not share your assumptions.
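In code, the fix is one extra condition on the gate. This sketch assumes the external check can be reduced to a score compared against a fixed baseline; the function name, the 55% threshold, and the 0.9 self-play win rate are hypothetical, while 4.7 and 27.3 are the Harvest scores from earlier.

```python
def should_promote(win_rate_vs_previous, score_vs_external,
                   external_baseline_score, min_win_rate=0.55):
    """Promote only if the candidate beats its predecessor AND
    beats a baseline that does not share its assumptions."""
    beats_previous = win_rate_vs_previous >= min_win_rate
    beats_external = score_vs_external > external_baseline_score
    return beats_previous and beats_external

# A Harvest-style candidate: strong against its previous self,
# 4.7 apples a game against a random opponent that scores 27.3.
print(should_promote(0.9, score_vs_external=4.7, external_baseline_score=27.3))  # False
```

With only the first condition, that candidate graduates. With both, it is stopped at the door.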

“Better than last quarter” is not enough. Better against reality is the bar.

The bottom line

The old question was: can AI do this task faster?

The new question is: can we build a loop that improves how we do this task, week after week, without forgetting what it has learned? But the deeper lesson is the one failure handed me. A self-improving organisation is not automatically wiser than a static one. It is faster in whatever direction you point it. Including off a cliff.

That is why humans need to be more than “in the loop”. As I have mentioned in multiple posts, that phrase is too passive and too procedural. Humans need to be in command of the objective.

The machine’s job is to optimise. The human job is to decide what is worth optimising.

And that job cannot be delegated, because the definition of “better” is never just technical. It contains values, trade-offs, judgement, risk appetite, ethics, customer promises, and the messy reality of how organisations actually behave.

Connect Four taught me that self-improvement works when the scoreboard is honest.

Harvest taught me that when the scoreboard is wrong, intelligence makes the failure sharper.

So yes, I am still excited. But perhaps more interested in what happens when the model meets a stranger.

Until next time, you will find me watching training logs in the evening, rather more careful now about what I tell the machine to want.

therealityof.ai