LLM evals & observability — what's accelerating
A hard truth up front: open-source LLM evaluation and observability is still thin. The velocity leaders in this keyword bucket are mostly off-theme repos swept in by words like "monitor," "detection," and "tests passed." The one genuine eval-and-monitor platform here is climbing slowly. That gap is the story this week.
Top mover
mlflow is the only repo in this set actually built to debug, evaluate, and monitor agents and LLMs. agentops and ai-governance are its own topic tags, not keyword accidents. Its velocity is modest, but it is the genuine article: experiment tracking, eval runs, and tracing that teams already trust in production.
---
The rest of the real signal
Full-stack observability — metrics, alerting, dashboards — now leaning into "AI-powered" monitoring. It is infrastructure observability rather than model-level evals, but if you are running agents on your own boxes, this watches the boxes.
The workflow orchestrator that schedules and monitors pipelines. Adjacent rather than an eval tool, but it is where a lot of eval and data-prep jobs actually get run and tracked.
---
The velocity leaders that aren't on-theme
Honest labelling — these out-climb every real eval tool above, but none of them measure or observe an LLM. They landed in this bucket on keyword overlap:
- CloakHQ/CloakBrowser · ⭐24,342 · ↑234.1/day · Python. A stealth Chromium that beats bot-detection tests. It is about evading someone else's evals, not running yours.
- tw93/Mole · ⭐54,954 · ↑214.7/day · Shell. A Mac cleanup-and-monitor CLI. "Monitor" your disk, not your model.
- sansan0/TrendRadar · ⭐59,049 · ↑146.2/day · Python. A news and public-opinion trend monitor. Useful, unrelated to agent observability.
- aaif-goose/goose · ⭐46,859 · ↑72.0/day · Rust. An extensible coding agent: a thing you would observe, not the tool that observes it.
The takeaway: when the fastest-moving repos tagged "observability" are a stealth browser and a disk cleaner, it tells you open-source LLM-eval tooling has room to run. mlflow is carrying the category largely alone.
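A rough way to picture the difference between a repo that carries the theme in its own topic tags and one that only brushes a keyword. This is a minimal sketch, not the curation code behind this edition: the function name, the record contents, and the example descriptions are all hypothetical, and only the keywords and the two topic tags come from the text above.

```python
def match_kind(description, topics, keywords, theme_topics):
    """Say how a repo matched the theme: by topic tag, by keyword only, or not at all."""
    if set(topics) & set(theme_topics):
        return "topic"
    text = description.lower()
    if any(keyword in text for keyword in keywords):
        return "keyword-only"
    return "none"


# Keywords and topic tags named in this edition; everything else is made up.
KEYWORDS = ("monitor", "detection", "tests passed")
THEME_TOPICS = ("agentops", "ai-governance")

# Hypothetical repos for illustration only.
examples = {
    "example/eval-platform": ("Debug, evaluate, and monitor agents", ["agentops"]),
    "example/stealth-browser": ("Beats bot-detection tests", ["chromium"]),
    "example/disk-cleaner": ("Cleanup-and-monitor CLI", ["macos"]),
    "example/game-engine": ("A 2D game engine", ["gamedev"]),
}

kinds = {
    name: match_kind(description, topics, KEYWORDS, THEME_TOPICS)
    for name, (description, topics) in examples.items()
}
```

Anything that comes back as "keyword-only" is a candidate for the off-theme list, which is the label the four repos above earned.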
---
How this was made
Live GitHub pull, bucketed by inference/local-runtime keywords, each repo verified not-archived and pushed within 45 days, ranked by stars/day, then curated for substance. Star counts pulled at publish — they move daily; re-verify before reposting.
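For readers who want to reproduce something similar, the steps above can be sketched roughly as follows. This is a sketch under stated assumptions, not the pipeline that produced this edition: the `Repo` record and its field names are hypothetical, the keyword match is a plain substring check, and stars/day is assumed to mean the average gain between two star-count snapshots, which the edition does not specify. The final curation pass is a human step and is not modelled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Repo:
    # Hypothetical record shape; these are not GitHub API field names.
    full_name: str
    description: str
    stars_now: int
    stars_before: int  # star count at an earlier snapshot (assumed available)
    archived: bool
    pushed_at: datetime


def stars_per_day(repo, window_days):
    """Average daily star gain between two snapshots (assumed definition)."""
    return (repo.stars_now - repo.stars_before) / window_days


def rank_bucket(repos, keywords, now, window_days=7, max_push_age_days=45):
    """Keep keyword matches that are live and recently pushed, fastest climbers first."""
    cutoff = now - timedelta(days=max_push_age_days)
    kept = []
    for repo in repos:
        text = repo.description.lower()
        if not any(keyword in text for keyword in keywords):
            continue  # not in this keyword bucket
        if repo.archived or repo.pushed_at < cutoff:
            continue  # archived, or no push within the freshness window
        kept.append(repo)
    return sorted(kept, key=lambda r: stars_per_day(r, window_days), reverse=True)


# Made-up data, for illustration only.
now = datetime(2025, 1, 31, tzinfo=timezone.utc)
repos = [
    Repo("example/eval-platform", "Evaluate and monitor LLM agents",
         1000, 930, False, now - timedelta(days=2)),
    Repo("example/disk-cleaner", "Clean and monitor your disk",
         5000, 3600, False, now - timedelta(days=1)),
    Repo("example/stale-monitor", "Monitor things",
         9000, 8000, False, now - timedelta(days=90)),
    Repo("example/archived-evals", "LLM evals",
         7000, 6000, True, now - timedelta(days=3)),
    Repo("example/game-engine", "A 2D game engine",
         4000, 3000, False, now - timedelta(days=1)),
]
ranked = rank_bucket(repos, ("monitor", "eval"), now)
ranked_names = [repo.full_name for repo in ranked]
```

Note that the made-up disk cleaner outranks the made-up eval platform here, which is exactly why a curation pass follows the ranking.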
Accelbrief · catch acceleration, not stars