Evaluation framework

How Tab Agent is evaluated

The current evaluation compares three conditions: a fixed rule baseline, the earlier assistant MVP, and the current autonomous browser agent. The goal is not just lower memory use, but memory savings with as little user interruption as possible.

Claim 1

Grouping quality

Does Tab Agent's grouping align with how users mentally organize their tabs and working contexts?

Claim 2

Memory savings

Does autonomous sleep free meaningful browser memory while staying conservative enough to avoid obvious disruption?

Claim 3

Workflow speed

Does the agent help users recover and manage tab context faster than manual or static-rule alternatives?

Comparison conditions

The current benchmark compares product designs rather than the underlying model brands:

Baseline A: Static rule-based tab management
Fixed inactivity thresholds with no personalization.

Baseline B: Assistant MVP
AI grouping with user-triggered actions and no autonomous loop.

Experimental: Autonomous personalized agent
Local prediction, autonomous sleep, context wake, and feedback-driven learning.
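
For reference, the static rule in Baseline A can be pictured as a single fixed cutoff applied to every tab. The sketch below is illustrative only: the `Tab` record, the `tabs_to_sleep` helper, and the 30-minute threshold are assumptions made for this example, not values taken from the Tab Agent codebase.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Tab:
    """Hypothetical tab record used only for this illustration."""
    tab_id: int
    last_active_s: float  # timestamp of the tab's last activity, in seconds


def tabs_to_sleep(tabs: List[Tab], now_s: float,
                  idle_threshold_s: float = 1800.0) -> List[int]:
    """Return ids of tabs idle for longer than one fixed threshold.

    The same threshold applies to every tab and every user, which is
    what "no personalization" means for this baseline. The default of
    1800 seconds is an assumed value.
    """
    return [t.tab_id for t in tabs
            if now_s - t.last_active_s > idle_threshold_s]


tabs = [Tab(tab_id=1, last_active_s=0.0), Tab(tab_id=2, last_active_s=1000.0)]
sleeping = tabs_to_sleep(tabs, now_s=2000.0)
```

Because the rule ignores who the user is and what the tab is for, it gives the other two conditions a simple, reproducible point of comparison.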

Primary metrics

Evaluation is centered on the tradeoff between benefit and interruption cost:

Benefit: Memory saved
Estimated memory saved, autonomous sleep count, and reduced open-tab footprint.

Cost: Interruption and regret
Undo rate, quick reopen after auto-sleep, manual wake after sleep, and explicit bad feedback.

User outcome: Trust and usefulness
Perceived usefulness, trust, willingness to use, and clarity of explanations.
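
The benefit and cost metrics above could be derived from a log of autonomous sleep events. The following is a minimal sketch under stated assumptions: the `SleepEvent` fields, the `summarize` helper, and the 60-second "quick reopen" window are hypothetical choices for illustration, and the real telemetry schema may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class SleepEvent:
    """One autonomous sleep action (hypothetical schema)."""
    est_memory_mb: float                      # estimated memory freed
    undone: bool = False                      # user pressed undo
    reopened_after_s: Optional[float] = None  # seconds until reopen, if any
    manual_wake: bool = False                 # user woke the tab by hand
    bad_feedback: bool = False                # explicit bad feedback given


def summarize(events: List[SleepEvent],
              quick_reopen_window_s: float = 60.0) -> Dict[str, float]:
    """Aggregate benefit and regret metrics over a list of sleep events.

    Rates are fractions of all autonomous sleeps. The quick-reopen
    window is an assumed cutoff, and memory is summed over every event,
    including ones the user later undid.
    """
    n = len(events)
    if n == 0:
        return {"sleep_count": 0, "memory_saved_mb": 0.0, "undo_rate": 0.0,
                "quick_reopen_rate": 0.0, "manual_wake_rate": 0.0,
                "bad_feedback_rate": 0.0}
    quick = sum(1 for e in events
                if e.reopened_after_s is not None
                and e.reopened_after_s <= quick_reopen_window_s)
    return {
        "sleep_count": n,
        "memory_saved_mb": sum(e.est_memory_mb for e in events),
        "undo_rate": sum(e.undone for e in events) / n,
        "quick_reopen_rate": quick / n,
        "manual_wake_rate": sum(e.manual_wake for e in events) / n,
        "bad_feedback_rate": sum(e.bad_feedback for e in events) / n,
    }


events = [
    SleepEvent(est_memory_mb=100.0),
    SleepEvent(est_memory_mb=50.0, undone=True),
    SleepEvent(est_memory_mb=200.0, reopened_after_s=30.0),
    SleepEvent(est_memory_mb=50.0, reopened_after_s=600.0, manual_wake=True),
]
summary = summarize(events)
```

Reporting the regret signals as rates per autonomous sleep keeps conditions comparable even when one of them sleeps far more tabs than another.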

Live telemetry

The current web app also includes a live admin dashboard at /admin for inspecting submitted sessions, reward trends, regret rate, and policy-training signals over time.