EDITION 07 · LLM EVALS & OBSERVABILITY · 2026·06·06 · 5 min read · links verified live

LLM evals & observability — what's accelerating

A hard truth up front: open-source LLM evaluation and observability is still thin. The velocity leaders in this keyword bucket are mostly off-theme repos swept in by words like "monitor," "detection," and "tests passed." The one genuine eval-and-monitor platform here is climbing slowly. That gap is the story this week.

↑17/day · fastest climber in the edition
live · counts pulled at publish
5 min · to read the whole edition

Top mover

mlflow/mlflow · USE · Python · ▲ 9.0/day · ★ 26,328

The only repo in this set actually built to debug, evaluate, and monitor agents and LLMs — agentops and ai-governance are its own topic tags, not keyword accidents. Its velocity is modest, but it is the genuine article: experiment tracking, eval runs, and tracing that teams already trust in production.

Who needs it: anyone who needs to measure LLM/agent quality and watch it in prod, not just ship and hope.

---

The rest of the real signal

netdata/netdata · USE · C · ▲ 16.7/day · ★ 79,075

Full-stack observability — metrics, alerting, dashboards — now leaning into "AI-powered" monitoring. It is infrastructure observability rather than model-level evals, but if you are running agents on your own boxes, this watches the boxes.

Who needs it: lean teams who want host- and service-level visibility under their agent stack.

apache/airflow · USE · Python · ▲ 11.2/day · ★ 45,712

The workflow orchestrator that schedules and monitors pipelines. It is adjacent to evaluation rather than an eval tool itself, but it is where a lot of eval and data-prep jobs actually get run and tracked.

Who needs it: teams scheduling recurring eval or ingestion runs as part of a larger DAG.

---

The velocity leaders that aren't on-theme

Honest labelling — these out-climb every real eval tool above, but none of them measure or observe an LLM. They landed in this bucket on keyword overlap:

- CloakHQ/CloakBrowser · ⭐24,342 · ↑234.1/day · Python. A stealth Chromium that beats bot-detection tests. It is about evading someone else's evals, not running yours.
- tw93/Mole · ⭐54,954 · ↑214.7/day · Shell. A Mac cleanup-and-monitor CLI. "Monitor" your disk, not your model.
- sansan0/TrendRadar · ⭐59,049 · ↑146.2/day · Python. A news and public-opinion trend monitor. Useful, unrelated to agent observability.
- aaif-goose/goose · ⭐46,859 · ↑72.0/day · Rust. An extensible coding agent: a thing you would observe, not the tool that observes it.

The takeaway: when the fastest-moving repos tagged "observability" are a stealth browser and a disk cleaner, it tells you open-source LLM-eval tooling has room to run. mlflow is carrying the category largely alone.
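To make the keyword-overlap problem concrete, here is a minimal sketch of a first-pass triage that separates repos carrying theme topic tags from repos that only match on description keywords. It is an illustration, not the curation method used for this edition, which is done by hand. The keyword list comes from the intro above; the `THEME_TOPICS` set, the record layout, and the `triage` helper are assumptions, the descriptions are this edition's one-line summaries rather than the repos' own, and the empty `topics` lists are placeholders because the edition does not list those repos' tags.

```python
# Keywords named in this edition's intro as the source of off-theme matches.
KEYWORDS = ("monitor", "detection", "tests passed")

# Assumed theme tag set, built from the two mlflow topic tags cited above.
THEME_TOPICS = {"agentops", "ai-governance"}


def keyword_hits(description):
    """Return the bucket keywords that appear as substrings of the description."""
    text = description.lower()
    return [k for k in KEYWORDS if k in text]


def triage(repo):
    """Label one repo record 'on-theme', 'keyword-only', or 'no-match'."""
    if THEME_TOPICS & set(repo["topics"]):
        return "on-theme"
    if keyword_hits(repo["description"]):
        return "keyword-only"
    return "no-match"


# Illustrative records. Empty topic lists are placeholders, not real data.
repos = [
    {"name": "mlflow/mlflow",
     "topics": ["agentops", "ai-governance"],
     "description": "debug, evaluate, and monitor agents and LLMs"},
    {"name": "tw93/Mole",
     "topics": [],
     "description": "A Mac cleanup-and-monitor CLI"},
    {"name": "CloakHQ/CloakBrowser",
     "topics": [],
     "description": "A stealth Chromium that beats bot-detection tests"},
]

labels = {r["name"]: triage(r) for r in repos}
for name, label in labels.items():
    print(f"{name}: {label}")
```

A plain substring match is what lets "monitor" pull in a disk cleaner, which is why a tag check is only a pre-filter and the final call still needs a human read.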

---

How this was made

Live GitHub pull, bucketed by eval-and-observability keywords, each repo verified as not archived and pushed within 45 days, ranked by stars/day, then curated for substance. Star counts were pulled at publish and move daily; re-verify before reposting.

1 · pull the firehose, verify live
2 · bucket by keyword
3 · rank by stars/day
4 · separate signal from noise, by hand
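The liveness filter and the ranking step can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the pipeline behind this edition: the `RepoSnapshot` record, its field names, and the definition of stars/day as net stars gained over a fixed window are all hypothetical, and the sample repos are made up. Only the not-archived check and the 45-day push cutoff come from the description above.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class RepoSnapshot:
    """Hypothetical record; field names are assumptions, not a GitHub API schema."""
    full_name: str
    stars_now: int
    stars_before: int   # star count at the start of the measurement window
    window_days: int    # length of the measurement window, in days
    archived: bool
    last_push: date


def stars_per_day(repo):
    """Assumed definition: net stars gained divided by the window length."""
    if repo.window_days <= 0:
        raise ValueError("window_days must be positive")
    return (repo.stars_now - repo.stars_before) / repo.window_days


def rank_live_repos(repos, today, max_idle_days=45):
    """Drop archived or stale repos, then sort by stars/day, fastest first."""
    cutoff = today - timedelta(days=max_idle_days)
    live = [r for r in repos if not r.archived and r.last_push >= cutoff]
    return sorted(live, key=stars_per_day, reverse=True)


# Made-up sample data, for illustration only.
today = date(2026, 6, 6)
repos = [
    RepoSnapshot("example/slow", 5000, 4930, 7, False, date(2026, 5, 20)),
    RepoSnapshot("example/archived", 9000, 2000, 7, True, date(2026, 6, 5)),
    RepoSnapshot("example/fast", 1000, 300, 7, False, date(2026, 6, 1)),
    RepoSnapshot("example/stale", 800, 100, 7, False, date(2026, 3, 1)),
]

ranked = rank_live_repos(repos, today)
for r in ranked:
    print(f"{r.full_name}: {stars_per_day(r):.1f}/day")
```

Filtering before ranking keeps an archived repo with a large historical spike from topping the list; the hand-curation step still has to happen on whatever this returns.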

Accelbrief · catch acceleration, not stars · all editions
