

Sustainability teams need defensible numbers for emissions that don't show up on any cloud sustainability dashboard, run on third-party infrastructure, and produce activity that's hard to attribute. AI coding agent emissions check all three boxes.
This walkthrough covers how we translate agent activity into kWh and kg CO2e, with every assumption made visible. It's one of three pieces on this topic. The audience-facing overview covers the why and what to do about it. The Claude Code launch analysis covers a related question about market dynamics. If you haven't read those yet, you can start with either.
Three things make precise per-developer numbers difficult.
Token consumption per task varies by an order of magnitude depending on model choice, task complexity, and how the agent is configured. A complementary study to the work cited below documented up to 30x run-to-run variance on identical tasks. Any single number you publish has wide error bars, and being honest about that is more credible than precision theater.
In the past year, the top of the leaderboard has shuffled more than once. Tools that led the category in mid-2025 have since dropped out of the top tier, and tools that didn't exist as agents a year ago now rank among the most active. A precise per-engineer number published today is a good way to be wrong by next quarter.
The Jegham et al. 2025 framework we use is the current best-available public methodology for LLM inference energy estimation. Newer frameworks can build on it, extending coverage to more recent models, time-of-day grid intensity, and, we hope, additional disclosure from model providers. What's defensible today won't be the final word.
The right response to all three is the same: capture raw inputs (tokens, model, provider, and time) that stay valid as methodology evolves, apply the current best framework at report time, and document your assumptions. That's what this walkthrough demonstrates.
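As an illustration, a minimal raw-input record might look like the sketch below. The field names are ours, not a published schema; the point is that energy and emissions are derived from these fields at report time rather than baked into the stored data.

```python
# A minimal sketch of a durable raw-input record per agent request.
# Field names are illustrative, not Carbonlog's actual schema.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class AgentUsageRecord:
    timestamp: datetime             # when the request completed (UTC)
    model: str                      # e.g. "claude-sonnet-4-5"
    provider: str                   # e.g. "anthropic" -> AWS US-East assumptions
    input_tokens: int               # prompt/context tokens
    output_tokens: int              # generated tokens
    region_hint: str | None = None  # inference region, if ever disclosed

record = AgentUsageRecord(
    timestamp=datetime.now(timezone.utc),
    model="claude-sonnet-4-5",
    provider="anthropic",
    input_tokens=42_000,
    output_tokens=8_500,
)
```

Because no kWh or CO2e figure is stored, the same records can be re-priced whenever the framework improves.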
The full chain we model: push → commits → tokens → GPU energy (kWh) → facility energy (PUE) → emissions (kg CO2e via grid carbon intensity).
Each step has assumptions. We make them visible so they can be challenged or improved.
A push is what a developer does (git push); a commit is what the agent produces.
We empirically derive 1.29 commits per push from an analysis of GitHub Archive PushEvent data from September 2025, the last full month before GitHub Archive removed the relevant data field on October 7, 2025. We winsorized the data because a number of outlier pushes appeared to be bulk imports of entire codebases rather than the working commits we were measuring.
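A sketch of that derivation in Python, assuming a local GH Archive hourly dump and that `distinct_size` is the since-removed per-push commit-count field; the winsorization percentile shown is illustrative:

```python
# Sketch: derive mean commits-per-push from a GH Archive hourly file
# (e.g. "2025-09-15-12.json.gz"). In practice we aggregate the full month.
import gzip
import json
import numpy as np

def commits_per_push(archive_path: str, upper_pct: float = 99.0) -> float:
    sizes = []
    with gzip.open(archive_path, "rt", encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            if event.get("type") == "PushEvent":
                n = event["payload"].get("distinct_size")
                if n is not None and n > 0:
                    sizes.append(n)
    sizes = np.asarray(sizes, dtype=float)
    # Winsorize: cap bulk-import outliers (whole-codebase pushes)
    # at the chosen percentile instead of dropping them outright.
    cap = np.percentile(sizes, upper_pct)
    return float(np.clip(sizes, None, cap).mean())  # ~1.29 in our data
```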
This is the assumption that varies most by task type, and the one most worth scrutinizing. Xiao et al. (2025), Reducing Cost of LLM Agents with Trajectory Reduction, measured agent token consumption directly on SWE-bench Verified, a benchmark of real-world GitHub issues that's the closest public proxy for "produce one commit that fixes one real problem." They report ~1.0 million tokens per issue on average across baseline runs, with ~40 reasoning steps per trajectory.
A complementary study, How Do AI Agents Spend Your Money?, confirms the order of magnitude across eight frontier LLMs but documents up to 30x run-to-run variance on identical tasks, and notes that Claude Sonnet 4.5 consumes over 1.5 million more tokens per task than GPT-5. We use 1 million tokens per commit as a midpoint of a wide distribution.
Claude runs on AWS H100/H200 nodes. Anthropic's models default to extended-reasoning mode, which Jegham et al. (2025) prices at 7.3x the per-token energy of standard inference. Applied to Claude's hardware class with shared-node concurrency (~6 requests per GPU) and 90% node utilization, the effective per-token energy is ~1.3 Wh per 1,000 tokens.
For 1 million tokens, that's ~1.3 kWh of GPU energy per commit.
Apply the AWS data-center power usage effectiveness (PUE = 1.14, per Jegham et al. Table 3) for cooling and overhead: 1.3 kWh × 1.14 ≈ 1.48 kWh per commit.
Apply the AWS carbon intensity factor for the US-East regions where most Claude inference runs (CIF = 0.287 kg CO2e/kWh, per Jegham et al. Table 3): 1.48 kWh × 0.287 kg CO2e/kWh ≈ 0.43 kg CO2e per commit.
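Here is the whole per-commit chain as one worked calculation, a sketch using only the constants stated above; the per-token figure already folds in the 7.3x reasoning multiplier, shared-node concurrency, and 90% utilization:

```python
# Worked per-commit chain using the constants from the text
# (Jegham et al. 2025, Table 3 values for AWS US-East).
TOKENS_PER_COMMIT = 1_000_000  # midpoint of a wide distribution
WH_PER_1K_TOKENS = 1.3         # extended reasoning on H100/H200, shared node
PUE = 1.14                     # AWS data-center cooling/overhead multiplier
CIF = 0.287                    # kg CO2e per kWh, AWS US-East grid

gpu_kwh = TOKENS_PER_COMMIT / 1_000 * WH_PER_1K_TOKENS / 1_000  # ~1.30 kWh
facility_kwh = gpu_kwh * PUE                                    # ~1.48 kWh
kg_co2e = facility_kwh * CIF                                    # ~0.43 kg CO2e

print(f"{gpu_kwh:.2f} kWh GPU -> {facility_kwh:.2f} kWh facility "
      f"-> {kg_co2e:.2f} kg CO2e per commit")
```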
GitClear's analysis of GitHub history from 2020 to 2024 finds the median full-time developer produces 673 commits per year. That works out to about 56 commits per month, or 2.8 commits per active workday. It's a conservative starting point, since it doesn't factor in recent productivity gains attributable to AI tools.
Drawing on Stack Overflow's 2025 Developer Survey, Faros AI's 2026 Engineering Report, and DORA's 2025 State of AI-Assisted Software Development, we estimate a hypothetical AI coding agent adopter might make about 50 agent-attributed commits per month from roughly 39 pushes per month.
Using EPA-referenced aviation emission factors and standard per-passenger intensities, that usage works out to about 21 kg CO2e per developer per month, or roughly 258 kg per year, on the order of 2,000 passenger-miles of medium-haul air travel.
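The conversion behind that comparison, as a sketch: we assume the EPA GHG Emission Factors Hub medium-haul air travel factor of ~0.129 kg CO2 per passenger-mile (CO2 only, ignoring CH4/N2O for simplicity), so the flight equivalence is illustrative rather than exact.

```python
# Per-developer footprint and a hedged flight equivalence.
KG_CO2E_PER_COMMIT = 0.43        # from the per-commit chain above
AGENT_COMMITS_PER_MONTH = 50     # hypothetical adopter, per the estimate above
EPA_MEDIUM_HAUL = 0.129          # kg CO2 per passenger-mile (EPA factors hub)

monthly_kg = KG_CO2E_PER_COMMIT * AGENT_COMMITS_PER_MONTH  # ~21.5 kg
annual_kg = monthly_kg * 12                                # ~258 kg
flight_miles = annual_kg / EPA_MEDIUM_HAUL                 # ~2,000 miles

print(f"{monthly_kg:.1f} kg CO2e/month, {annual_kg:.0f} kg/year, "
      f"~{flight_miles:,.0f} passenger-miles of medium-haul air travel")
```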
These numbers are small relative to emissions in most industries, but material for a digital-first SaaS company. They are also growing rapidly.
Accounting requires being explicit about what's outside the boundary. The numbers above leave several things uncounted.
Per GitHub's Octoverse reporting, roughly 80% of developer contribution volume happens in private repos, which our pipeline doesn't see. The volume and emission numbers should be read as the observable trajectory. It is also likely that agent usage is higher in private repositories where employers pay for expensive subscriptions.
Our token estimate covers the tokens the agent generates. It doesn't include input tokens used to prime the task — context windows, retrieved documents, multi-step reasoning prompts. That underestimates real-world consumption, especially for long-running agent tasks.
Model training is energy-intensive (the Stanford AI Index estimates Grok 4's training emissions at 72,816 tCO2e), but it's a one-time cost that's already been incurred and is amortized across all subsequent inference. We model only the inference cost of agent-attributed commits, which is the marginal footprint each new piece of AI-written code adds.
GitHub reports more than 20 million all-time Copilot users; none of that activity is visible in commit attribution because the human is still the committer. Tab completion, editor-mode suggestions, and similar tools have their own energy footprint that this methodology doesn't capture.
The Jegham framework supports water-use estimation, but our pipeline doesn't implement it yet. Hardware manufacturing emissions for the GPUs running this inference (a meaningful contribution to total lifecycle footprint) aren't included either.
We use AWS US-East values for Claude inference because that's where most of it runs, but the actual region for any given workload may differ. Per-provider grid intensity values used in this analysis: 0.287 kg CO2e/kWh (AWS / Anthropic), 0.35 kg CO2e/kWh (Azure / Microsoft), 0.287 kg CO2e/kWh (Google).
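As a sketch, the report-time lookup this implies; the keys and structure here are illustrative, not our pipeline's actual configuration:

```python
# Per-provider grid carbon intensity applied at report time (kg CO2e/kWh),
# values as used in this analysis. Keyed by provider so stored raw records
# can be re-priced if a workload's actual region later becomes known.
GRID_CIF = {
    "anthropic": 0.287,  # AWS US-East
    "microsoft": 0.35,   # Azure
    "google": 0.287,
}
```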
Net effect: the published numbers are a floor, not a ceiling. Real total footprint is meaningfully larger.
The methodology in this piece tells you how to estimate AI coding emissions in principle. Carbonlog is how you do it in practice. It's our open-source Claude Code plugin that tracks CO2 and energy consumption per AI coding session in real time, using the same Jegham et al. framework walked through above. It produces the raw inputs your sustainability team will need for Scope 3 reporting.
The full detection pipeline, methodology, and historical data behind this analysis are also open source. We welcome contributions, especially from teams testing this against their own telemetry.