posthog-fix-llm-gateway-add

Segment: Investigate-and-fix

Type: bug

## Task

The analytics team flagged some noise on our internal PostHog project (project 2). Basically Twig's traces are getting buried under junk. We're seeing thousands of near-identical $ai_generation events, all the same shape. The output is a single character ("I"), finish_reason is "length", and the $ai_input is always just [{"content": "count", "role": "user"}]. $ai_output_tokens stays at 1 but $ai_input_tokens jumps around. One trace can rack up hundreds of these.

Over the last seven days that totalled 74,765 events across 1,131 traces, with bursts of ~4,500 an hour. All this noise buries legitimate LLM activity, which makes Twig regressions hard to debug. Can you find what's causing these events and stop them from being captured?
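The signature described above can be written down as a small predicate, which may help when sampling events to confirm where the noise comes from. This is only a diagnostic sketch, not the fix: the `{"event": ..., "properties": ...}` layout and the helper names are assumptions for illustration, and it checks only the properties the task names explicitly (`$ai_input`, `$ai_output_tokens`).

```python
from typing import Any

# Input payload reported in the task description for every noisy event.
NOISE_INPUT = [{"content": "count", "role": "user"}]


def looks_like_noise(event: dict[str, Any]) -> bool:
    """Return True when an event matches the reported noise signature.

    Assumes a hypothetical export shape: {"event": str, "properties": dict}.
    $ai_input_tokens is deliberately ignored because it varies between events.
    """
    props = event.get("properties", {})
    return (
        event.get("event") == "$ai_generation"
        and props.get("$ai_input") == NOISE_INPUT
        and props.get("$ai_output_tokens") == 1
    )


def noise_summary(total_events: int, total_traces: int, days: int) -> tuple[float, float]:
    """Return (average events per trace, average events per hour)."""
    return total_events / total_traces, total_events / (days * 24)


if __name__ == "__main__":
    per_trace, per_hour = noise_summary(74_765, 1_131, 7)
    print(f"{per_trace:.1f} events per trace, {per_hour:.0f} events per hour on average")
```

With the figures from the task, the averages come out to roughly 66 events per trace and about 445 events per hour, so the reported bursts of ~4,500 an hour are around ten times the weekly average.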

## General instructions

- The code repo is at /repo/posthog.
- You are inside of a Docker container. You may not be able to perform all operations you would normally be able to do on a local machine. Dependencies have not been pre-installed, and you may need to install them yourself.
- You are expected to act autonomously as a software engineer to complete tasks you are given.
- Do not stop until you feel you have completed the task and your code changes can be merged.
- You may need to use software engineering skills like analyzing the codebase, researching technologies, running services, analyzing logs, etc. to complete the task. Not all tasks will be solvable by reading source code alone.

Agent Results

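| Agent | Verifier | Validation | Rubric | Bloat | Practice | Taste |
|---|---|---|---|---|---|---|
| Oracle | 8/8 | n/a | 1.00 | 1.0x | 5.0 | 4.0 |
| Sonnet 5 | 8/8 | n/a | 0.50 | 0.8x | 3.0 | 3.0 |
| Gemini 3.1 Pro | 2/8 | n/a | 0.00 | 0.3x | 3.0 | 2.0 |
| Gemini 3.5 Flash | 2/8 | n/a | 0.00 | 0.4x | 2.0 | 2.0 |
| GLM-5.2 | 2/8 | n/a | 0.00 | 0.3x | 4.0 | 2.0 |
| GPT-5.4 | 2/8 | n/a | 0.00 | 0.8x | 5.0 | 4.0 |
| GPT-5.5 | 2/8 | n/a | 0.00 | 0.1x | 5.0 | 4.0 |
| Kimi K2.6 | 2/8 | n/a | 0.00 | 0.1x | 4.0 | 3.0 |
| Opus 4.7 | 2/8 | n/a | 0.00 | 0.4x | 4.0 | 3.0 |
| Opus 4.8 | 2/8 | n/a | 0.00 | 0.5x | 3.0 | 1.0 |
| Sonnet 4.6 | 2/8 | n/a | 0.00 | 0.4x | 4.0 | 2.0 |
| No-Op | 2/8 | n/a | 0.00 | n/a | n/a | n/a |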

Agent details

Verifier Tests

Gemini 3.1 Pro: 2/8

Rubric Criteria

Gemini 3.1 Pro: 0/2

Fail → Pass

count_tokens_proxy_does_not_fake_response

fix_addresses_root_cause_not_symptom

Taste Scores

Patch Bloat: 0.3x

25 agent / 90 oracle SLOC, 2 / 2 files (raw: 1.2x)

Practice Alignment: 3.0/5

style consistency

pattern adherence

library usage

abstraction level

documentation fit

Relative Taste: 2.0/5

minimality

approach quality

hygiene

fluency

craftsmanship
