If you've been following the AI conversation over the last couple of months, you've probably caught wind of the term 'Tokenmaxxing.'

It's pretty much what it sounds like. Max out your AI token consumption. Burn through as many as humanly possible. The premise, now being spouted out loud by CEOs, VCs, and founders, is that token usage is the new indicator of productivity. Jensen Huang kicked it off a couple weeks back at NVIDIA's GTC, telling the All-In hosts, "If that $500,000 engineer did not consume at least $250,000 worth of tokens, I'm going to be deeply alarmed." Meta quietly built an internal leaderboard called Claudeonomics that ranked all 85,000+ employees by token consumption. The top user reportedly burned through 281 billion tokens in 30 days. They took the dashboard down once the news leaked, but the signal had already been sent.

And now, as you might imagine, the questions are starting. If tokens are a proxy for productivity at the company level, shouldn't we be tracking them at the team level? If we're tracking them at the team level, shouldn't they show up in IC performance reviews?

That's where this gets dangerous. Because that's where the real damage is going to happen.

This isn't a new critique

Plenty of people have already called out that tokenmaxxing, as a company-level metric, is flawed. Yamini Rangan at HubSpot put it cleanly on LinkedIn: "outcome maxxing >> token maxxing." Salesforce just rolled out an entirely new metric called Agentic Work Units, explicitly pitched as a rebuttal. And the absolute chef's-kiss of criticism came from Matt Calkins at Appian, comparing it to the Soviet Union evaluating chandelier quality by weight, which led to chandeliers so heavy they pulled down the ceiling. I'm not trying to re-litigate any of that here.

What I haven't seen many people name, clearly, is where this is predictably heading next. And if you're an HR leader, a People Ops partner, or anyone about to get asked to define "AI fluency" as a competency on next year's performance rubric, that next step is your problem to deal with.

This is no different from tracking lines of code produced

Every generation of knowledge work seems to go through a different flavor of the same mistake. IBM spent decades evaluating engineers on total lines of code written. The result wasn't better software. It was bloated software, written by engineers who'd figured out that the metric rewarded quantity, not quality. You could write two elegant lines that solved the problem, or twenty mediocre ones that hit your KPI. Guess which one would get shipped as a result of this policy?

Great code is simple. Great writing is cut. Great carpentry is precise (measure twice, cut once!). In virtually every profession, mastery tends to look like less, not more.

Tokens are lines of code, disguised in a new startup hoodie. Easy to measure. Easy to game. Completely uncorrelated with the thing that actually matters.

Volume-based metrics incentivize activity, not productivity. Every time. Every generation. It's never not been true.

What you actually get: performative work

I remember my early agency recruiting days and our daily call sheets. As a 26-year-old who didn't love my job, to say the least, I'll reluctantly admit that my 30-calls-per-day quota got gamed, daily. This isn't any different. You don't get more productivity. You get performative work. Activity conducted for the leadership dashboard rather than aimed at the outcome. Prompts padded to look substantive. Agents run on tasks that didn't need an agent. Tokens burned on a Claude deep research project that could have been a Google search. Standups where people humble-brag about how much they've been "leveraging AI" this week. The work that gets rewarded is no longer work. It's a performance of work.

This isn't theoretical, and there are early receipts:

The data so far
  • GitClear found that regular AI users averaged 9.4x higher code churn than their non-AI counterparts.
  • Faros AI reported code churn up 861% under high AI adoption in its March 2026 customer data.
  • Jellyfish, looking at 7,548 engineers in Q1 2026, found that the engineers with the largest token budgets produced twice the throughput at ten times the cost. Twice the throughput at 10x the cost. That works out to roughly five times the cost per unit of output. Those numbers should be reversed.

That churn data isn't a bug. That's what performative work looks like in practice. And guess what: this data isn't even showing us employees gaming the system for a better performance review yet. We're not there. What we're seeing is part learning curve, part early pressure to tokenmaxx. Both produce lower-quality output. Both come from rewarding the wrong thing. Just wait until comp plans get involved.

And for the HR leaders reading this, buckle up.

Meta's leaderboard ranked 85,000 individuals. That's not a theoretical concern. It's real. The path from informal leaderboard, to team OKR, to formal competency on an IC's performance review is short. And it's already moving. I'd bet a nickel that within 12 months, "demonstrates AI fluency" shows up as a rated competency in a meaningful share of enterprise performance rubrics, with some semblance of usage volume hiding underneath. Total prompts month over month, average daily token usage, whatever it is.

The moment that lands on a comp plan, performative work isn't a risk. It's the rational employee response. Your savviest people will figure out how to game the metric before your leadership dashboards catch up. That's not cynicism. That's just objectively how incentive systems work.

What AI fluency actually looks like

Fluency in any craft is about getting to the right outcome with less motion. A great recruiter gets to the right candidate in twenty profiles, not two hundred. AI is no different. The employee who ships a legit feature with 10,000 tokens is probably more fluent than the Meta dev who burned 281 billion tokens in a month. Real AI fluency is one well-constructed prompt, not fifty prompts and a thread of corrections.

But here's the nuance to keep in mind, especially for anyone leading a non-technical team. AI fluency for the average knowledge worker is still genuinely in its infancy. A huge chunk of the workforce isn't using AI at work at all yet. At this stage, "does this person know how to use AI?" is a legitimate thing to care about, especially for leaders trying to move adoption from zero. Some signal on usage isn't crazy.

The trap is when that signal graduates into a formal performance metric on an annual review. "Are they using it?" is an adoption question. "Tokens per quarter" is a performance metric. Those aren't the same thing, and they shouldn't be evaluated the same way.

One small note, because nobody else seems to be making it: the loudest voices pushing tokenmaxxing (Jensen, Altman, Amodei) all sell tokens. I'll let the reader do the math on that one.

So what do you measure instead

This is where most takes like this one fall apart, because it's easy to critique a bad metric and hard to replace it. But the replacement framework isn't actually that complicated.

  • Flip the arrow. Measure outcomes. Observe usage. Ideally your top performers also turn out to be heavy AI users, and that's the correlation you want. If you mandate usage and hope performance follows, you've built the system backwards.
  • The outcome metrics you already have work. Cycle time. Quality of hire. Customer impact. Revenue per employee. Candidate NPS. Time to resolution. Retention of shipped work. AI should move these. If it doesn't, the token count doesn't matter.
  • If you need a usage-quality signal, make it about durability, not volume. Does the AI-assisted output stay shipped 30 days later? Does the AI-drafted document need heavy editing, or ship clean? Does the AI-supported decision hold up at review? Churn is the tell.
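
For anyone who wants to see what a durability signal could look like in practice, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the record shape (shipped_on, reverted_on), the durability_rate helper, the 30-day window, and the sample data are all hypothetical, and your own systems will track "stayed shipped" differently.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class ShippedItem:
    """One piece of AI-assisted work (hypothetical record shape)."""
    name: str
    shipped_on: date
    # Date the work was reverted or substantially rewritten, if ever.
    reverted_on: Optional[date] = None


def durability_rate(items, as_of, window_days=30):
    """Share of items that were still standing a full window after shipping.

    Items shipped less than `window_days` before `as_of` are skipped,
    because it is too early to judge them. Returns None when nothing
    is old enough to evaluate.
    """
    window = timedelta(days=window_days)
    mature = [i for i in items if as_of - i.shipped_on >= window]
    if not mature:
        return None
    survived = [
        i for i in mature
        if i.reverted_on is None or i.reverted_on - i.shipped_on > window
    ]
    return len(survived) / len(mature)


# Made-up sample data, not drawn from any of the studies cited earlier.
items = [
    ShippedItem("pricing page copy", date(2026, 1, 5)),
    ShippedItem("onboarding email", date(2026, 1, 10), reverted_on=date(2026, 1, 20)),
    ShippedItem("search filter", date(2026, 1, 12), reverted_on=date(2026, 3, 1)),
    ShippedItem("new dashboard", date(2026, 3, 10)),  # too recent to judge
]

rate = durability_rate(items, as_of=date(2026, 3, 15))
print(f"30-day durability: {rate:.0%}")
```

With these made-up records, two of the three items old enough to judge are still standing after 30 days, so the script prints 67%. Notice that token counts never enter the calculation. The only question is whether the work held up.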

The question isn't whether your team is using AI enough. It's whether your outcomes are getting better because of AI. Measure the house, not the nails.

Because what you'll have built, the moment you put tokens or prompt counts on a performance review, isn't an AI-fluent organization. It's a theater of choreographed performative work.

There's still time to not repeat the lines-of-code mistake.