How Tab Agent is evaluated
The current evaluation compares three conditions: a fixed rule baseline, the earlier assistant MVP, and the current autonomous browser agent. The goal is not just lower memory use but memory savings with as little user interruption as possible.
Grouping quality
Does Tab Agent's grouping align with how users mentally organize their tabs and working contexts?
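One way to make this question measurable is a pairwise agreement score (the Rand index) between the agent's groups and groups labeled by the user. The sketch below is illustrative only: the function name, the tab IDs, and the choice of the Rand index are assumptions, not Tab Agent's documented metric.

```python
from itertools import combinations


def pairwise_agreement(agent_groups: dict[str, str], user_groups: dict[str, str]) -> float:
    """Rand index over tabs that appear in both labelings.

    Each mapping is tab_id -> group label (hypothetical format). A pair of
    tabs counts as agreement when both labelings put the two tabs together,
    or both keep them apart.
    """
    tabs = sorted(set(agent_groups) & set(user_groups))
    pairs = list(combinations(tabs, 2))
    if not pairs:
        return 1.0  # nothing to disagree about
    agree = sum(
        (agent_groups[a] == agent_groups[b]) == (user_groups[a] == user_groups[b])
        for a, b in pairs
    )
    return agree / len(pairs)


# Hypothetical example: the agent splits four tabs 2/2, the user splits them 3/1.
agent = {"t1": "A", "t2": "A", "t3": "B", "t4": "B"}
user = {"t1": "x", "t2": "x", "t3": "x", "t4": "y"}
score = pairwise_agreement(agent, user)  # 3 of 6 pairs agree -> 0.5
```

A score of 1.0 means the agent's grouping matches the user's exactly; lower scores mean more tab pairs are grouped differently.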
Memory savings
Does autonomous sleep free meaningful browser memory while staying conservative enough to avoid obvious disruption?
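To illustrate what "conservative" could mean here, the sketch below only sleeps tabs that are idle past a threshold and are not active, pinned, or playing audio, then totals the memory that would be freed. The `Tab` fields, the 30-minute threshold, and the exclusion rules are all assumptions made for the example, not the agent's actual policy.

```python
from dataclasses import dataclass


@dataclass
class Tab:
    # Hypothetical record; field names are illustrative, not Tab Agent's schema.
    tab_id: int
    memory_mb: float
    idle_minutes: float
    active: bool = False
    pinned: bool = False
    audible: bool = False


def sleep_candidates(tabs: list[Tab], min_idle_minutes: float = 30.0) -> list[Tab]:
    """Return tabs that are safe to sleep under a conservative rule set."""
    return [
        t
        for t in tabs
        if not (t.active or t.pinned or t.audible)
        and t.idle_minutes >= min_idle_minutes
    ]


def memory_freed_mb(tabs: list[Tab], min_idle_minutes: float = 30.0) -> float:
    """Total memory that sleeping the candidate tabs would free."""
    return sum(t.memory_mb for t in sleep_candidates(tabs, min_idle_minutes))


tabs = [
    Tab(1, 200.0, 5.0, active=True),    # in use: never slept
    Tab(2, 150.0, 45.0),                # idle long enough: slept
    Tab(3, 300.0, 120.0, pinned=True),  # pinned: never slept
    Tab(4, 80.0, 60.0),                 # idle long enough: slept
    Tab(5, 120.0, 10.0),                # too recently used: kept awake
]
freed = memory_freed_mb(tabs)  # 150.0 + 80.0 = 230.0
```

Raising `min_idle_minutes` trades memory savings for a lower chance of sleeping a tab the user still needs.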
Workflow speed
Does the agent help users recover and manage tab context faster than manual or static-rule alternatives?
Comparison conditions
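The current benchmark is product-focused rather than model-brand-focused: it compares the three product variants (rule baseline, assistant MVP, autonomous agent), not competing model vendors.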
Primary metrics
Evaluation centers on the tradeoff between benefit (memory freed, faster tab recovery) and interruption cost (disrupting the user's work).
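A benefit-versus-interruption tradeoff can be written as a single score per session. The sketch below is a hypothetical formulation: the function name, the linear form, and the 50 MB cost per interruption are assumptions for illustration, not the reward Tab Agent actually uses.

```python
def session_reward(
    memory_saved_mb: float,
    interruptions: int,
    interruption_cost_mb: float = 50.0,
) -> float:
    """Benefit minus interruption cost, expressed in megabytes.

    Each interruption is charged a fixed cost (assumed here to be 50 MB of
    savings), so a session only scores well if memory is saved without
    repeatedly disturbing the user.
    """
    return memory_saved_mb - interruption_cost_mb * interruptions


quiet = session_reward(400.0, 2)  # 400 - 2 * 50 = 300.0
noisy = session_reward(100.0, 3)  # 100 - 3 * 50 = -50.0
```

Under this formulation a session that frees little memory but interrupts often scores negative, which matches the stated goal of saving memory while minimizing interruption.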
Live telemetry
The current web app also includes a live admin dashboard at /admin for inspecting submitted sessions, reward trends, regret rate, and policy-training signals over time.
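The dashboard's regret rate is not defined in this section. One plausible reading, assumed here, is the share of agent actions that the user later undoes (for example, waking a tab shortly after the agent put it to sleep). The record format and the `undone` field below are hypothetical.

```python
def regret_rate(actions: list[dict]) -> float:
    """Fraction of agent actions the user reversed.

    Each action is a dict with a boolean "undone" key (assumed schema).
    Returns 0.0 when there are no actions, to avoid dividing by zero.
    """
    if not actions:
        return 0.0
    undone = sum(1 for action in actions if action["undone"])
    return undone / len(actions)


# Hypothetical session log: four sleep actions, one reversed by the user.
session_actions = [
    {"tab_id": 2, "undone": False},
    {"tab_id": 4, "undone": True},
    {"tab_id": 7, "undone": False},
    {"tab_id": 9, "undone": False},
]
rate = regret_rate(session_actions)  # 1 of 4 undone -> 0.25
```

Tracked over time, a falling regret rate alongside stable memory savings would suggest the policy is becoming less disruptive.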