Amazon built a leaderboard. The machines did exactly what they were told.

Last year, Amazon built a leaderboard.

Not for sales. Not for safety. Not for anything a customer would ever see or feel or care about. It ranked Amazon’s own engineers by how much artificial intelligence they used at work. The thing was called KiroRank, after the company’s internal developer platform, and the reasoning behind it was the sort that sounds unanswerable in a slide deck. AI is the future. We have sunk roughly two hundred billion dollars into that future. So let us reward the people leaning into it the hardest, and gently embarrass the ones who aren’t.

If you have ever worked anywhere with a KPI, you already know how this ends. You knew before I finished the sentence.

The staff gamed it. Naturally they gamed it. According to Financial Times reporting picked up by Business Insider, some employees simply set autonomous AI agents loose on pointless, invented busywork, churning through tokens for no reason other than to climb the board. The industry has a name for this now, which is how you know it has stopped being a glitch and started being a culture: tokenmaxxing. Burning compute as a performance of productivity. Looking busy, at scale, for an audience of one dashboard.

There is a cost to all that theatre, and it does not land on the employee. Every token is a small bill Amazon pays to its own machines. The writer Sean Kernan, in his piece on the affair, put the irony better than I could, so I will just hand him the credit: a company spending fortunes to replace people with AI had quietly engineered AI that cost more than the people. A senior vice-president, David Treadwell, eventually had to stand up and ask everyone to please stop using AI for the sake of using AI and get back to solving real problems for actual customers. Built with good intentions, he said. Then switched off.

Good intentions. The two most expensive words in management.

A Cambridge anthropologist saw this coming in 1997

Here is the part that should sting a little. Nobody at Amazon needed a model trained on the entire internet to predict this outcome. They needed a library card.

The economist Charles Goodhart noticed, back in the mid-1970s while watching British monetary policy, that the moment you turn a statistical regularity into a policy target, it tends to fall apart in your hands. The anthropologist Marilyn Strathern later sharpened it into the version everyone quotes, and it is worth quoting directly because the precision is the point: “When a measure becomes a target, it ceases to be a good measure” (Strathern, 1997, p. 308).

Read that again with KiroRank in mind. Token usage was a perfectly reasonable measure of AI adoption, right up until the second it became the target. Then it stopped describing reality and started manufacturing a fake one. The number went up. The work did not.

This is not a technology story. I want to be clear about that, because the comforting version of this tale is “AI made people do a silly thing,” and the comforting version lets every leader who has never heard of Amazon off the hook. It is a measurement story, and measurement has been corrupting good intentions since long before anyone owned a GPU. Donald Campbell, the social scientist, said much the same thing in 1979: lean hard enough on any single quantitative indicator to make your decisions, and you will warp the very thing you were trying to watch (Campbell, 1979). The metric does not just fail to capture the work. It actively pulls the work out of shape.

What it does to the humans, which is the bit leaders skip

I spent the better part of a decade as a social-media evangelist before the whole edifice and I both fell over. Everyone in that world measured followers. Reach. Impressions. Engagement rate, which is a phrase I would now like an hour of my life back for. What almost nobody measured was whether a single human being’s day had been improved by any of it. We were all tokenmaxxing. We just did it with vanity numbers instead of compute, and we called it strategy.

So I have some scar tissue here, and a bit of clinical reading to go with it.

The reading says this. When you stand over a number and reward it, you do not simply collect data. You send a message about what the job is. The decades of research on what drives people at work, gathered up under self-determination theory, point fairly consistently in one direction: tangible rewards bolted onto an activity have a nasty habit of crowding out the internal reasons a person did it in the first place (Deci, Koestner, & Ryan, 1999). Make the leaderboard the reason, and the leaderboard becomes the only reason. You have not motivated anyone. You have taught a roomful of clever, capable adults that the smart play is to feed the number and stop thinking about the customer. They learned it instantly, because they are clever. That is rather the problem.

And underneath that sits something quieter and more corrosive, which is trust. Being ranked by volume is a small insult. It says: I do not trust myself to recognise good work when I see it, so I am going to count something instead, and you will be the something I count. People feel that. They feel it long before they articulate it, and they respond exactly the way the anthropologist predicted.

So why do we keep building leaderboards

Because counting is easy and judgement is hard.

Activity is visible right now. Outcomes are slow, arguable, and tangled up in a dozen things you do not control. A leaderboard feels like a hand on the wheel. It produces a tidy chart you can show your own boss, and the chart goes up and to the right, and everybody feels briefly safe. The fact that the chart is measuring the wrong thing is a problem for next quarter, and next quarter is somebody else’s problem.

I am not above this. No leader is. The pull toward the countable thing is one of the most natural impulses in management, which is precisely why it needs watching.

What I would ask a client before they measure anything

If you run a team here in Vietnam, or anywhere people show up and try to do good work, the lesson is not “metrics are evil.” Metrics are fine. A leaderboard is a tool, and tools are not the issue. The issue is what you point them at.

Before you measure something, sit with three uncomfortable questions. What do I actually want more of, stated plainly, in terms of the customer or the mission rather than the dashboard? If a clever, slightly cynical person wanted to fake this number without doing the underlying work, could they, and how fast? And am I about to measure the work itself, or merely the appearance of the work, the visible exhaust of someone looking busy?

That last one is where most of us come unstuck. KiroRank measured the exhaust. It is very hard to count whether an engineer solved a real problem elegantly. It is trivially easy to count tokens. So Amazon, like all of us, counted the easy thing and hoped it stood in for the hard thing. It did not. It never does.

Amazon will be fine, of course. They can afford the tuition. The rest of us are quietly running the same experiment on far smaller budgets, usually without noticing we have enrolled. Worth checking, now and again, what your own leaderboard is actually rewarding. And whether the people on it have already, very sensibly, figured out how to win without doing the thing you wanted in the first place.

References

Campbell, D. T. (1979). Assessing the impact of planned social change. Evaluation and Program Planning, 2(1), 67–90.

Deci, E. L., Koestner, R., & Ryan, R. M. (1999). A meta-analytic review of experiments examining the effects of extrinsic rewards on intrinsic motivation. Psychological Bulletin, 125(6), 627–668.

Goodhart, C. A. E. (1984). Problems of monetary management: The U.K. experience. In Monetary theory and practice (pp. 91–121). Macmillan. (Original formulation 1975)

Kernan, S. (2026, June 1). Amazon’s cancelled AI employee leaderboard is a lesson to every business. Thrive by Sean Kernan. https://seanjkernan.substack.com/p/amazons-cancelled-ai-employee-leaderboard

Strathern, M. (1997). ‘Improving ratings’: Audit in the British University system. European Review, 5(3), 305–321.

Posted

02/06/2026

Artificial Intelligence, Decision-making, Leadership, Organisations

Lee Hopkins

Tags:

AI in the workplace, Amazon, Amazon CEO Andy Jassy, Goodhart’s Law, KPIs, leadership, metrics, organisational psychology, Vietnam leadership