WebChallenger: A Reliable and Efficient Generalist Web Agent

Jayoo Hwang, Xiaowen Zhang, Vedant Padwal

WebChallenger is a web agent framework built around PageMem, a structured page representation that supports selective observation, persistent site memory, and compound action workflows.

Without fine-tuning, WebChallenger sets a new open-model state of the art across multiple web navigation benchmarks, showing that scaffolding alone can drastically improve web agent performance.

Overview of WebChallenger. (left) Each webpage is decomposed along the DOM into sections corresponding to semantic regions. (middle) Sections are indexed by short summaries to form a PageMem, cached in per-website memory. The agent skims summaries and expands only task-relevant sections. (right) Specialized multi-step workflows are executed based on section type.

Method

Humans possess three major advantages over LLM agents when it comes to web navigation: persistent memory of website structure, selective attention to relevant page regions, and procedural fluency with common interaction patterns. At the core of WebChallenger is PageMem: a structured page representation deterministically constructed from the DOM that exposes each page as a hierarchy of semantic sections with short summaries. PageMem serves as the shared substrate for the three components below, which mirror these advantages in turn: offline exploration & memory gives the agent a persistent map of each site, divide-and-conquer observation lets it attend only to task-relevant page regions, and compound action workflows package routine multi-step interactions into single decisions.

Offline Exploration & Memory

Site graph being saved into a persistent WebsiteMem artifact

Deterministic crawl of unique clickable elements across a website with no LLM guidance or demonstrations required.
Records two click outcomes: clicks that navigate become new pages to explore; clicks that reveal new elements on the same page are saved as dropdown items on the trigger.
Template matching dedupes structurally identical pages (e.g., individual product pages), keeping the crawl tractable.
Reused across all tasks on a site, consumed token-efficiently as bookmarks and dropdown hints rather than retrieved passages.
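
The crawl described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy site dictionary, the `template_signature` fingerprint (here just the sorted set of clickable labels), and the shape of the resulting memory are all assumptions made for the example. A real crawler would drive a browser and compare DOM structure.

```python
from collections import deque

# Hypothetical toy site: page -> {clickable label: (outcome, payload)}.
# For "navigate" the payload is the destination page; for "reveal" it is the
# list of elements that appear on the same page after the click.
TOY_SITE = {
    "/": {
        "Shop": ("navigate", "/shop"),
        "Account": ("reveal", ["Orders", "Sign out"]),
    },
    "/shop": {
        "Product A": ("navigate", "/product/1"),
        "Product B": ("navigate", "/product/2"),
    },
    "/product/1": {"Add to cart": ("navigate", "/cart")},
    "/product/2": {"Add to cart": ("navigate", "/cart")},
    "/cart": {},
}


def template_signature(site, page):
    """Assumed structural fingerprint: the sorted clickable labels."""
    return tuple(sorted(site[page]))


def crawl(site, start="/"):
    """Breadth-first crawl that records links and dropdown items per page."""
    memory = {}
    seen_templates = set()
    visited = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        signature = template_signature(site, page)
        if signature and signature in seen_templates:
            continue  # structurally identical to a page already explored
        seen_templates.add(signature)
        entry = {"links": {}, "dropdowns": {}}
        for label, (outcome, payload) in site[page].items():
            if outcome == "navigate":
                entry["links"][label] = payload
                if payload not in visited:
                    visited.add(payload)
                    queue.append(payload)
            else:  # the click revealed new elements on the same page
                entry["dropdowns"][label] = list(payload)
        memory[page] = entry
    return memory


website_mem = crawl(TOY_SITE)
```

In this toy run `/product/2` is skipped because it shares a template with `/product/1`, while the "Account" click is stored as a dropdown on the home page rather than as a new page.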

Divide-and-Conquer Observation

Section selection: each page is initially presented to the LLM as a list of sections with short summaries, allowing the agent to choose its own context.
Per-section extraction: each selected section gets its own focused LLM call over its accessibility subtree, with VLM-described images inlined.
Long lists handled separately: items are chunked and selected chunk-by-chunk, with an early-termination check so that long tables never overflow the context window.
The extracted information is combined into a one-paragraph task-focused summary that drives action selection and history.
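
A rough sketch of the first and third steps is given below. The LLM calls are replaced by simple keyword and predicate checks, and the function names, chunk size, and stopping rule are assumptions chosen for illustration rather than details taken from the paper.

```python
def select_sections(summaries, task_keywords):
    """Stand-in for the LLM's section choice: keep sections whose summary
    mentions any task keyword."""
    return [
        name
        for name, summary in summaries.items()
        if any(keyword in summary.lower() for keyword in task_keywords)
    ]


def scan_long_list(items, is_relevant, chunk_size=3, needed=1):
    """Scan a long list chunk by chunk, stopping once enough items are found."""
    found, chunks_read = [], 0
    for start in range(0, len(items), chunk_size):
        chunk = items[start:start + chunk_size]
        chunks_read += 1
        found.extend(item for item in chunk if is_relevant(item))
        if len(found) >= needed:  # early-termination check
            break
    return found[:needed], chunks_read


# Hypothetical section summaries for one page.
summaries = {
    "header": "Site logo, search box, account menu",
    "order_table": "Table of past orders with dates and totals",
    "footer": "Legal links and newsletter signup",
}
chosen = select_sections(summaries, ["order"])

orders = [f"order-{i}" for i in range(1, 11)]
hits, chunks_read = scan_long_list(orders, lambda item: item == "order-5")
```

Here only `order_table` is expanded, and the ten-item list is abandoned after the second chunk because the target item has already been found.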

Compound Action Workflows

Index-based action selection: instead of issuing browser tool calls, the LLM selects an action from a numbered menu, and our harness handles execution automatically.
Form submission: LLM picks fields, fills each, then enters a review loop to edit, submit, or exit.
Dropdown selection: clicks the trigger, diffs the page, and the LLM picks one of the revealed options.
Search, file upload, select-option, and copy each have analogous short workflows; an end-task verification call gates final answers.
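
To make the index-based menu and the dropdown workflow concrete, here is a small sketch. `FakePage` is a hypothetical in-memory stand-in for a browser, the menu format is invented for the example, and the model's replies are hard-coded; none of these names come from the paper.

```python
def build_menu(actions):
    """Render actions as a numbered menu the model can answer with one index."""
    return "\n".join(f"[{i}] {action['label']}" for i, action in enumerate(actions))


def dropdown_workflow(list_elements, trigger, click, choose):
    """Click the trigger, diff the page, and pick one of the revealed options."""
    before = set(list_elements())
    click(trigger)
    revealed = sorted(set(list_elements()) - before)
    if not revealed:
        return None  # the click revealed nothing on this page
    choice = choose(revealed)
    click(choice)
    return choice


class FakePage:
    """Hypothetical in-memory page standing in for a real browser."""

    def __init__(self):
        self.elements = ["Sort by", "Search"]
        self.clicks = []

    def list_elements(self):
        return list(self.elements)

    def click(self, label):
        self.clicks.append(label)
        if label == "Sort by":
            self.elements += ["Price: low to high", "Newest"]


page = FakePage()
actions = [
    {
        "label": "Open dropdown 'Sort by'",
        "run": lambda: dropdown_workflow(
            page.list_elements, "Sort by", page.click,
            lambda options: options[0],  # stand-in for the model's pick
        ),
    },
    {"label": "Finish task", "run": lambda: "done"},
]
menu = build_menu(actions)
selected_index = 0  # stand-in for the model's reply to the menu
result = actions[selected_index]["run"]()
```

The model only ever emits an index; the two browser clicks and the page diff happen inside the harness.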

At each timestep the agent retrieves or refreshes the PageMem for the current page, runs the observation pipeline to produce a task-focused summary, and selects one high-level action — which may invoke a workflow that internally issues multiple LLM sub-calls and browser operations. Because all three components operate on the same PageMem substrate, the framework generalizes across websites with no site-specific adapters.
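
The per-timestep loop can be pictured with the sketch below. All three components are reduced to stubs over a hypothetical two-page site, so the function names, the `PAGES` dictionary, and the action format are assumptions for illustration only.

```python
# Hypothetical cached PageMem entries: section summaries plus the next link.
PAGES = {
    "/": {"sections": {"nav": "Links to Orders and Help"}, "next": "/orders"},
    "/orders": {"sections": {"table": "Latest order total: $42"}, "next": None},
}


def get_pagemem(url, cache):
    """Retrieve the PageMem for a page, building it on the first visit."""
    if url not in cache:
        cache[url] = PAGES[url]  # stand-in for constructing it from the DOM
    return cache[url]


def observe(pagemem, task):
    """Stand-in for the observation pipeline: join the section summaries."""
    return f"{task}: " + "; ".join(pagemem["sections"].values())


def choose_action(pagemem, summary):
    """Stand-in for the model's single high-level decision."""
    if pagemem["next"] is None:
        return {"type": "stop", "answer": summary}
    return {"type": "goto", "url": pagemem["next"]}


def run_agent(task, start_url, max_steps=10):
    """Loop: fetch PageMem, summarize, pick one action, repeat."""
    cache, history, url = {}, [], start_url
    for _ in range(max_steps):
        pagemem = get_pagemem(url, cache)
        summary = observe(pagemem, task)
        history.append((url, summary))
        action = choose_action(pagemem, summary)
        if action["type"] == "stop":
            return action["answer"], history
        url = action["url"]
    return None, history


answer, history = run_agent("Find latest order total", "/")
```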

Results

Benchmark success rates (%). WebChallenger sets new open-model SOTA on multiple web navigation benchmarks and performs comparably to agents built on proprietary models, despite using no training. Best proprietary and open-model results are bolded. WebArena, VWA: VisualWebArena, O-M2W: Online-Mind2Web, WoA: WorkArena.

	Method	Model(s)	WebArena	VWA	O-M2W	WoA L1
Proprietary Models	GenericAgent	GPT-4o	31.4	26.7	–	45.5
	GenericAgent	Claude 3.5 Sonnet	36.2	21.0	–	56.4
	GenericAgent	GPT-5	–	–	–	79.1
	GenericAgent	GPT-4o-mini	17.4	16.9	–	27.0
	WALT	GPT-5	50.1	52.9	–	–
	IBM CUGA	–	61.7	–	–	–
	OpenAI CUA	–	58.1	–	61.3	–
	ScribeAgent	GPT-4o + Qwen2.5-32B	53.0	–	–	–
	AgentSymbiotic	Claude 3.5 + Llama3 8B	48.5	–	–	–
	AgentOccam-Judge	GPT-4	45.7	–	–	–
	WebPilot	GPT-4o	37.2	–	–	–
	Agent Workflow Memory	GPT-4o	35.5	–	–	–
	SkillWeaver	GPT-4o	29.8	–	–	–
Open-Source Models (fine-tuned)	Agent-as-Annotators	Qwen3.5-9B	41.5	33.9	–	51.5
	Mobile-Agent-v3.5	Qwen3-VL-32B	48.4	46.6	–	–
	WebDreamer	Qwen2-VL-7B	–	21.9	35.0	–
	Fara-7B	Qwen2.5-VL-7B	–	–	34.1	–
	Learn-by-Interact	Codestral 22B	24.2	–	–	–
	AgentTrek	Qwen2.5-32B	22.4	–	–	–
	Go-Browse	Qwen2.5-7B	21.7	–	–	–
	AutoWebGLM	ChatGLM3 6B	18.2	–	–	–
	TTI	Gemma 3 12B	26.1	–	–	–
Open-Source Models (zero-shot)	GenericAgent	GPT-oss-120b	–	–	–	50.9
	Tree Search	Llama-3-70B-Instruct	10.1	16.7	–	–
	WebChallenger (Ours)	GLM-4-32B + Qwen2.5-VL-7B	56.3	48.7^†	51.0	70.9

^† Our VisualWebArena experiments use Qwen3-VL-4B-Instruct in place of Qwen2.5-VL-7B-Instruct.

We evaluate WebChallenger on four web navigation benchmarks chosen to span a diverse range of capabilities: WebArena (812 tasks across 6 simulated websites), VisualWebArena (910 tasks requiring visual reasoning), Online-Mind2Web (300 tasks on 136 real-world sites), and WorkArena (330 enterprise UI tasks). WebChallenger outperforms fine-tuned baselines despite using no training of its own and approaches the performance of systems using proprietary models at a fraction of the cost. Consistent gains across simulated and live websites, text-only and visually grounded tasks, and consumer and enterprise interfaces indicate that the framework's advantages come from structural patterns shared across the web rather than site-specific adaptation.

Analysis

We run additional experiments on the 165-task WebArena-lite subset to attribute credit across the three architectural components, measure their inference cost, and test sensitivity to the backbone model.

Component ablations

Removing each pillar in turn shows that all three contribute meaningfully and non-redundantly. The observation pipeline carries the most weight (−17.6 points), followed by compound action workflows (−9.7) and persistent memory (−7.6).

Component ablation table showing per-site WebArena-lite success rates for the full system and three single-component-removed variants.

Token and step efficiency

The divide-and-conquer observation pipeline decomposes large prompts into multiple smaller prompts; removing it reduces total token usage but increases average prompt size by 4.75× (1850 → 8793 tokens). Compound action workflows collapse multi-step interactions into single decisions; removing them grows total token usage by nearly 40% (47M → 65M).
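
The two ratios quoted above follow directly from the reported figures. The short check below uses only those numbers; the variable names are ours.

```python
# Average prompt size (tokens): full system vs. observation pipeline removed.
avg_prompt_full = 1850
avg_prompt_no_observation = 8793

# Total token usage (millions): full system vs. workflows removed.
total_tokens_full_m = 47
total_tokens_no_workflows_m = 65

prompt_ratio = avg_prompt_no_observation / avg_prompt_full
token_growth_pct = (
    (total_tokens_no_workflows_m - total_tokens_full_m) / total_tokens_full_m * 100
)

print(f"{prompt_ratio:.2f}x")       # 4.75x
print(f"{token_growth_pct:.1f}%")   # 38.3%
```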

Token and step usage table for the GLM-4-32B component ablations.

Backbone comparison

Finally, we vary the backbone to separate the contribution of the harness from that of the model. The same GLM-4-32B that reaches 58.8% in our framework scores just 19.4% in a vanilla single-prompt harness, a 39.4-point gap attributable purely to system architecture. Performance scales with backbone strength as expected (GPT-5: 68.7%, GPT-4o-mini: 46.7%), but the framework retains strong performance even with weaker models.

Backbone model comparison table on WebArena-lite, including GenericAgent baseline with the same GLM-4-32B backbone.

BibTeX

@misc{hwang2026webchallengerreliableefficientgeneralist,
      title={WebChallenger: A Reliable and Efficient Generalist Web Agent}, 
      author={Jayoo Hwang and Xiaowen Zhang and Vedant Padwal},
      year={2026},
      eprint={2606.10423},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.10423}, 
}