posthog-fix-llm-gateway-add

Based on: #49099
Segment: Investigate-and-fix
Type: bug
## Task

The analytics team flagged some noise on our internal PostHog project (project 2). Basically Twig's traces are getting buried under junk. We're seeing thousands of near-identical $ai_generation events, all the same shape. The output is a single character ("I"), finish_reason is "length", and the $ai_input is always just [{"content": "count", "role": "user"}]. $ai_output_tokens stays at 1 but $ai_input_tokens jumps around. One trace can rack up hundreds of these.

Over the last seven days that totalled 74,765 events across 1,131 traces. Bursts of ~4,500 an hour. The pain is that all this noise buries legitimate LLM activity, so it's hard to debug Twig regressions. Anyway, can you find what's causing these and stop them being captured?
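
The event shape described above can be turned into a simple filter for sizing the problem before hunting for its source. The following is a minimal sketch, not the fix itself. It assumes events are available as plain dicts with `event` and `properties` keys, and the helper name `is_noise_generation` is hypothetical. It checks only the properties named in the report (`$ai_input` and `$ai_output_tokens`), because the exact property paths for the output text and `finish_reason` are not given here.

```python
from typing import Any

# The constant input reported for every junk event.
NOISE_INPUT = [{"content": "count", "role": "user"}]


def is_noise_generation(event: dict[str, Any]) -> bool:
    """Return True if an event matches the reported junk shape.

    Assumption: `event` is a dict shaped like
    {"event": <name>, "properties": {...}}.
    """
    if event.get("event") != "$ai_generation":
        return False
    props = event.get("properties") or {}
    # $ai_input_tokens varies between events, so it is deliberately not checked.
    return (
        props.get("$ai_input") == NOISE_INPUT
        and props.get("$ai_output_tokens") == 1
    )


def count_noise(events: list[dict[str, Any]]) -> int:
    """Count how many events in a batch match the junk shape."""
    return sum(1 for e in events if is_noise_generation(e))
```

A filter like this only helps to measure or identify the junk events. The task asks for the source of the events to be found so that they stop being captured in the first place.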

## General instructions

- The code repo is at /repo/posthog.
- You are inside of a Docker container. You may not be able to perform all operations you would normally be able to do on a local machine. Dependencies have not been pre-installed, and you may need to install them yourself.
- You are expected to act autonomously as a software engineer to complete tasks you are given.
- Do not stop until you feel you have completed the task and your code changes can be merged.
- You may need to use software engineering skills like analyzing the codebase, researching technologies, running services, analyzing logs, etc. to complete the task. Not all tasks will be solvable by reading source code alone.

Agent Results

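| Agent | Tasteful | Basic | Verifier | Validation | Rubric | Bloat | Pract | Taste | Cheated |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Oracle | | | 8/8 | | 1.00 | 1.0x | 5.0 | 4.0 | |
| Sonnet 5 | | | 8/8 | | 0.50 | 0.8x | 3.0 | 3.0 | |
| Gemini 3.1 Pro | | | 2/8 | | 0.00 | 0.3x | 3.0 | 2.0 | |
| Gemini 3.5 Flash | | | 2/8 | | 0.00 | 0.4x | 2.0 | 2.0 | |
| GLM-5.2 | | | 2/8 | | 0.00 | 0.3x | 4.0 | 2.0 | |
| GPT-5.4 | | | 2/8 | | 0.00 | 0.8x | 5.0 | 4.0 | |
| GPT-5.5 | | | 2/8 | | 0.00 | 0.1x | 5.0 | 4.0 | |
| Kimi K2.6 | | | 2/8 | | 0.00 | 0.1x | 4.0 | 3.0 | |
| Opus 4.7 | | | 2/8 | | 0.00 | 0.4x | 4.0 | 3.0 | |
| Opus 4.8 | | | 2/8 | | 0.00 | 0.5x | 3.0 | 1.0 | |
| Sonnet 4.6 | | | 2/8 | | 0.00 | 0.4x | 4.0 | 2.0 | |
| No-Op | | | 2/8 | | 0.00 | | | | |
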
Agent details

Verifier Tests

Gemini 3.1 Pro: 2/8

Rubric Criteria

Gemini 3.1 Pro: 0/2
Fail → Pass
count_tokens_proxy_does_not_fake_response
fix_addresses_root_cause_not_symptom

Taste Scores

Patch Bloat: 0.3x
25 agent / 90 oracle SLOC, 2 / 2 files (raw: 1.2x)
Practice Alignment: 3.0/5
- style consistency: 4
- pattern adherence: 3
- library usage: 3
- abstraction level: 3
- documentation fit: 4
Relative Taste: 2.0/5
- minimality: 2
- approach quality: 1
- hygiene: 2
- fluency: 3
- craftsmanship: 1

Agent Patch