Evaluation framework

How Tab Agent is evaluated

The current evaluation compares three conditions: a fixed rule baseline, the earlier assistant MVP, and the current autonomous browser agent. The goal is not just lower memory use but memory savings with as little user interruption as possible.

Claim 1

Grouping quality

Does Tab Agent's grouping align with how users mentally organize their tabs and working contexts?

Claim 2

Memory savings

Does autonomous sleep free meaningful browser memory while staying conservative enough to avoid obvious disruption?

Claim 3

Workflow speed

Does the agent help users recover and manage tab context faster than manual or static-rule alternatives?

Comparison conditions

The current benchmark compares product designs rather than model brands:

Baseline A

Static rule-based tab management

Fixed inactivity thresholds with no personalization.

Baseline B

Assistant MVP

AI grouping with user-triggered actions and no autonomous loop.

Experimental

Autonomous personalized agent

Local prediction, autonomous sleep, context wake, and feedback-driven learning.
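
To make the comparison concrete, Baseline A's fixed inactivity rule could look roughly like the sketch below. This is an illustration under stated assumptions, not the benchmark's actual implementation: the `Tab` fields, the `tabs_to_sleep` helper, and the 30-minute threshold are all hypothetical, since the source does not give the threshold value.

```python
from dataclasses import dataclass

# Hypothetical value; the source does not state the threshold Baseline A uses.
INACTIVITY_THRESHOLD_S = 30 * 60


@dataclass
class Tab:
    tab_id: int
    last_active_s: float  # time of the last user interaction, in seconds
    asleep: bool = False


def tabs_to_sleep(tabs, now_s, threshold_s=INACTIVITY_THRESHOLD_S):
    """Return IDs of awake tabs that have been idle for at least threshold_s.

    The same threshold applies to every tab and every user, which is what
    "no personalization" means for this baseline.
    """
    return [
        t.tab_id
        for t in tabs
        if not t.asleep and now_s - t.last_active_s >= threshold_s
    ]
```

The rule uses no signal beyond idle time, which is the contrast with the experimental condition's local prediction and feedback-driven learning.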

Primary metrics

Evaluation is centered on the tradeoff between benefit and interruption cost:

Benefit

Memory saved

Estimated memory saved, autonomous sleep count, and reduced open-tab footprint.

Cost

Interruption and regret

Undo rate, quick reopens after auto-sleep, manual wakes after sleep, and explicit negative feedback.

User outcome

Trust and usefulness

Perceived usefulness, trust, willingness to use, and clarity of explanations.
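
The benefit and cost metrics above could be computed from logged auto-sleep events along the following lines. This is a minimal sketch under assumptions: the `SleepEvent` fields, the 60-second quick-reopen window, and the rule that an event counts as regret when it was undone, quickly reopened, or flagged are all hypothetical choices made for illustration, not definitions taken from Tab Agent.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical window: a reopen this soon after an auto-sleep counts as "quick".
QUICK_REOPEN_WINDOW_S = 60.0


@dataclass
class SleepEvent:
    memory_saved_mb: float                    # estimated memory freed by this auto-sleep
    undone: bool = False                      # the user undid the action
    reopened_after_s: Optional[float] = None  # seconds until a manual wake, if any
    bad_feedback: bool = False                # the user explicitly flagged the action


def is_quick_reopen(event, window_s=QUICK_REOPEN_WINDOW_S):
    return event.reopened_after_s is not None and event.reopened_after_s <= window_s


def is_regret(event, window_s=QUICK_REOPEN_WINDOW_S):
    # Assumed definition: any one of the three signals marks the action as regretted.
    return event.undone or is_quick_reopen(event, window_s) or event.bad_feedback


def summarize(events, window_s=QUICK_REOPEN_WINDOW_S):
    """Aggregate benefit (memory, sleep count) and cost (rates) for one condition."""
    n = len(events)

    def rate(count):
        return count / n if n else 0.0

    return {
        "auto_sleep_count": n,
        "memory_saved_mb": sum(e.memory_saved_mb for e in events),
        "undo_rate": rate(sum(e.undone for e in events)),
        "quick_reopen_rate": rate(sum(is_quick_reopen(e, window_s) for e in events)),
        "manual_wake_rate": rate(sum(e.reopened_after_s is not None for e in events)),
        "regret_rate": rate(sum(is_regret(e, window_s) for e in events)),
    }
```

Running `summarize` once per comparison condition gives the memory saved and the regret rate side by side, so a condition that frees more memory only by interrupting users more often is visible as such.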

Live telemetry

The current web app also includes a live admin dashboard at /admin for inspecting submitted sessions, reward trends, regret rate, and policy-training signals over time.