If you've been following the AI conversation over the last couple of months, you've probably caught wind of the term 'Tokenmaxxing.'

It's pretty much what it sounds like. Max out your AI token consumption. Burn through as many as humanly possible. The premise, now being spouted by CEOs, VCs, and founders, is that token usage is the new indicator of productivity. Jensen Huang kicked it off a couple weeks back at NVIDIA's GTC, telling the All-In hosts, "If that $500,000 engineer did not consume at least $250,000 worth of tokens, I'm going to be deeply alarmed." Meta quietly built an internal leaderboard called Claudeonomics that ranked all 85,000+ employees by token consumption. The top user reportedly burned through 281 billion tokens in 30 days. They took the dashboard down once the news leaked, but the signal had already been sent.

And now, as you might imagine, the questions are starting. If tokens are a proxy for productivity at the company level, shouldn't we be tracking them at the team level? If we're tracking them at the team level, shouldn't they show up in IC performance reviews?

That's where this gets dangerous. Because that's where the real damage is going to happen.

This isn't a new critique

Plenty of people have already called out that tokenmaxxing, as a company-level metric, is flawed. Yamini Rangan at HubSpot put it cleanly on LinkedIn: "outcome maxxing >> token maxxing." Salesforce just rolled out an entirely new metric called Agentic Work Units, explicitly pitched as a rebuttal. And the absolute chef's-kiss of criticism came from Matt Calkins at Appian, comparing it to the Soviet Union evaluating chandelier quality by weight, which led to chandeliers so heavy they pulled down the ceiling. I'm not trying to re-litigate any of that here.

What I haven't seen many people name clearly is where this is predictably heading next. And if you're an HR leader, a People Ops partner, or anyone about to get asked to define "AI fluency" as a competency on next year's performance rubric, that next step is your problem.

This is no different from tracking lines of code produced

Every generation of knowledge work seems to go through a different flavor of the same mistake. IBM spent decades evaluating engineers on total lines of code written. The result was not better software. It was bloated software, written by engineers who'd figured out that the metric rewarded quantity, not quality. You could write two elegant lines that solved the problem, or twenty mediocre ones that hit your KPI. Guess which one got shipped under that policy?

Great code is elegant. Great writing is cut. Great carpentry is precise (measure twice, cut once!). In virtually every profession, mastery looks like less, not more.

Tokens are lines of code, disguised in a new startup hoodie. Easy to measure. Easy to game. Completely uncorrelated with the thing that actually matters.

Volume-based metrics incentivize activity, not productivity. Every time. Every generation. It's never not been true.

What you actually get: performative work

I remember my early agency recruiting days and our daily call sheets. As a 26-year-old who didn't love my job, to say the least, you can guarantee my 30-calls-a-day quota got gamed, daily. This is no different. You don't get more productivity. You get performative work. Activity staged for the leadership dashboard rather than aimed at the outcome. Prompts padded to look substantive. AI loops run on tasks that didn't need AI. Tokens burned on a Claude deep research project that could have been a Google search. Standups where people narrate how much they've been "leveraging AI" this week. The work that gets rewarded will no longer be the work itself. It will be a performance of work.

This isn't theoretical, and the data is already coming in:

The data so far
  • GitClear found that regular AI users averaged 9.4x higher code churn than their non-AI counterparts.
  • Faros AI reported code churn up 861% under high AI adoption in its March 2026 customer data.
  • Jellyfish, looking at 7,548 engineers in Q1 2026, found that the engineers with the largest token budgets produced twice the throughput at ten times the cost. That's five times the cost per unit of output. Those numbers should be inverted.

That churn data isn't a bug. That's what performative work looks like in practice. And here's the thing: this data isn't even showing us employees gaming the system for a better performance review yet. We're not there. What we're seeing is part learning curve, part early pressure to tokenmaxx. Both produce lower-quality output. Both come from rewarding the wrong thing. Wait until comp plans get involved.
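
If you want to run this audit on your own org before the comp plans arrive, churn is straightforward to compute: it's usually defined as the share of recently added lines that get rewritten or deleted within a short window (a couple of weeks is a common choice). Here's a minimal Python sketch under that definition. The LineChange record and its fields are hypothetical stand-ins for whatever your version control tooling exports, not any vendor's schema:

    from dataclasses import dataclass
    from datetime import datetime, timedelta

    @dataclass
    class LineChange:
        line_id: str                  # stable identity for an added line (hypothetical field)
        added_at: datetime            # when the line was first committed
        removed_at: datetime | None   # when it was rewritten or deleted, if ever

    def churn_rate(changes: list[LineChange], window_days: int = 14) -> float:
        """Share of added lines rewritten or deleted within the window."""
        if not changes:
            return 0.0
        window = timedelta(days=window_days)
        churned = sum(
            1 for c in changes
            if c.removed_at is not None and c.removed_at - c.added_at <= window
        )
        return churned / len(changes)

Compute that separately for your heavy AI users and everyone else, and you've reproduced the comparison those three reports are making.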

And for the HR leaders reading this, here's the most concerning part.

Meta's leaderboard already ranks 85,000 individuals. That's not a theoretical concern. It's a present one. The path from informal leaderboard to team OKR to formal competency on an IC's performance review is short. And it's already moving. I'd bet a dollar that within 12 months, "demonstrates AI fluency" shows up as a rated competency in a real share of enterprise performance rubrics, with some flavor of usage volume hiding underneath. Total prompts, prompts per day, whatever it is.

The moment that lands on a comp plan, performative work isn't a risk. It's the rational employee response. Your smartest people will figure out how to game the metric before your fancy dashboards catch up. That's not cynicism. That's Goodhart's law: once a measure becomes a target, it stops being a good measure.

What AI fluency actually looks like

Fluency in any craft is about getting to the right outcome with less motion. A great recruiter gets to the right candidate in twenty profiles, not two hundred. AI is no different. The employee who ships a hard thing with 10,000 tokens is more fluent than the one who burns a million getting to the same place. Real AI fluency is the right answer in one well-constructed prompt, not fifty prompts and a thread of corrections.

Here's the nuance worth keeping in mind, especially for anyone leading a non-engineering team. AI fluency for the average knowledge worker is still genuinely in its infancy. A big chunk of the workforce isn't using AI at work at all yet. At this stage, "does this person know how to use AI?" is a legitimate thing to care about, especially for leaders trying to move adoption from zero. Some signal on usage isn't crazy.

The trap is when that signal graduates into a formal performance metric on an annual review. "Are they using it?" is an adoption question. "Tokens per quarter" is a performance metric. Those aren't the same thing, and they shouldn't be evaluated the same way.

One small note, because nobody else seems to be making it: the loudest voices pushing tokenmaxxing (Jensen, Altman, Amodei) all sell tokens. I'll let the reader do the math on that one.

So what do you measure instead?

This is where most takes like this one fall apart, because it's easy to critique a bad metric and hard to replace it. But the replacement framework isn't actually that complicated.

  • Flip the arrow. Measure outcomes. Observe usage. Your top performers should also be heavy AI users, and that's the correlation you want. If you mandate usage and hope performance follows, you've built the system backwards.
  • The outcome metrics you already have work. Cycle time. Quality of hire. Customer impact. Revenue per employee. Candidate NPS. Time to resolution. Retention of shipped work. AI should move these. If it isn't, the token count doesn't matter.
  • If you need a usage-quality signal, make it about durability, not volume. Does the AI-assisted output stay shipped 30 days later? Does the AI-drafted document need heavy editing, or ship clean? Does the AI-supported decision hold up at review? Churn is the tell (a sketch of that 30-day check follows this list).
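
To make that last bullet concrete: the durability check is just the mirror image of the churn rate sketched above. Take everything shipped with AI assistance in a period and ask what fraction is still standing at the horizon. Again, the ShippedItem fields are hypothetical, and "still standing" means whatever your system of record can actually verify (code not reverted, document not rewritten, decision not overturned):

    from dataclasses import dataclass
    from datetime import date, timedelta

    @dataclass
    class ShippedItem:
        shipped_on: date         # when the work went live
        ai_assisted: bool        # flagged however your tooling records it
        undone_on: date | None   # revert, rewrite, or reversal date, if any

    def survival_rate(items: list[ShippedItem], horizon_days: int = 30) -> float:
        """Fraction of AI-assisted work still standing after the horizon."""
        horizon = timedelta(days=horizon_days)
        cohort = [i for i in items if i.ai_assisted]
        if not cohort:
            return 0.0
        survived = sum(
            1 for i in cohort
            if i.undone_on is None or i.undone_on - i.shipped_on > horizon
        )
        return survived / len(cohort)

A high survival rate among your heavy AI users is exactly the correlation the first bullet asks for. A low one is tokenmaxxing with extra steps.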

The question isn't whether your team is using AI enough. It's whether your best outcomes are getting better because of AI. Measure the house, not the nails.

Because what you'll have built, the moment you put tokens or prompt counts on a performance review, isn't an AI-fluent organization. It's a manufacturer of performative work.

There's still time to not repeat the lines-of-code mistake.