WebChallenger: A Reliable and Efficient Generalist Web Agent

Jayoo Hwang, Xiaowen Zhang, Vedant Padwal

WebChallenger is a web agent framework built around PageMem, a structured page representation that supports selective observation, persistent site memory, and compound action workflows.

Without fine-tuning, WebChallenger sets a new open-model state of the art across multiple web navigation benchmarks, showing that scaffolding alone can drastically improve web agent performance.

Overview of WebChallenger: webpage decomposition into PageSections, intermediate PageMem interface with section summaries, and selective focus with high-level workflows.

Overview of WebChallenger. (left) Each webpage is decomposed along the DOM into sections corresponding to semantic regions. (middle) Sections are indexed by short summaries to form a PageMem, cached in per-website memory. The agent skims summaries and expands only task-relevant sections. (right) Specialized multi-step workflows are executed based on section type.

Method

Humans possess three major advantages over LLM agents when it comes to web navigation: persistent memory of website structure, selective attention to relevant page regions, and procedural fluency with common interaction patterns. At the core of WebChallenger is PageMem: a structured page representation deterministically constructed from the DOM that exposes each page as a hierarchy of semantic sections with short summaries. PageMem serves as the shared substrate for the three components below, which mirror these advantages in turn: offline exploration & memory gives the agent a persistent map of each site, divide-and-conquer observation lets it attend only to task-relevant page regions, and compound action workflows package routine multi-step interactions into single decisions.
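
To make the idea of a section hierarchy with short summaries concrete, here is a minimal sketch in Python. The class names, field names, section kinds, and the `skim`/`expand` methods are illustrative assumptions, not the authors' actual data model.

```python
from dataclasses import dataclass, field


@dataclass
class PageSection:
    """One semantic region of a page (hypothetical schema)."""
    section_id: int
    kind: str            # e.g. "form", "list", "nav"; these labels are assumed
    summary: str         # short description shown when skimming
    content: str = ""    # full text, only read when the section is expanded
    children: list["PageSection"] = field(default_factory=list)


@dataclass
class PageMem:
    """A page exposed as a hierarchy of summarized sections."""
    url: str
    sections: list[PageSection]

    def skim(self) -> list[str]:
        """Return one line per top-level section: id, kind, summary."""
        return [f"[{s.section_id}] ({s.kind}) {s.summary}" for s in self.sections]

    def expand(self, section_id: int) -> str:
        """Return the full content of one top-level section and its descendants."""
        def walk(sec):
            yield sec.content
            for child in sec.children:
                yield from walk(child)

        for s in self.sections:
            if s.section_id == section_id:
                return "\n".join(part for part in walk(s) if part)
        raise KeyError(section_id)
```

In this sketch the agent would read only the output of `skim()` by default and pay for full content only on the sections it chooses to `expand()`.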

Offline Exploration & Memory

Site graph being saved into a persistent WebsiteMem artifact
  • Deterministic crawl of unique clickable elements across a website with no LLM guidance or demonstrations required.
  • Records two click outcomes: clicks that navigate become new pages to explore; clicks that reveal new elements on the same page are saved as dropdown items on the trigger.
  • Template matching dedupes structurally identical pages (e.g., individual product pages), keeping the crawl tractable.
  • Reused across all tasks on a site, consumed token-efficiently as bookmarks and dropdown hints rather than retrieved passages.
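
The crawl described in the bullets above can be sketched roughly as follows. This is a toy illustration over an in-memory site, not the authors' crawler: the `SITE` format, the `crawl` function, and the use of a precomputed `template` label as a stand-in for structural template matching are all assumptions.

```python
from collections import deque

# Toy site (assumed format): page -> {"template": str, "clicks": {element: outcome}}
# An outcome is ("navigate", target_page) or ("reveal", [new_elements]).
SITE = {
    "/": {"template": "home", "clicks": {
        "menu": ("reveal", ["orders", "settings"]),
        "item-1": ("navigate", "/p/1"),
        "item-2": ("navigate", "/p/2"),
    }},
    "/p/1": {"template": "product", "clicks": {"home": ("navigate", "/")}},
    "/p/2": {"template": "product", "clicks": {"home": ("navigate", "/")}},
}


def crawl(site, start="/"):
    """Breadth-first crawl that records navigations and dropdown reveals."""
    memory = {}                 # page -> {"links": {...}, "dropdowns": {...}}
    seen_templates = set()
    queue = deque([start])
    queued = {start}
    while queue:
        page = queue.popleft()
        template = site[page]["template"]
        if template in seen_templates:
            continue            # structurally identical page already explored
        seen_templates.add(template)
        entry = memory.setdefault(page, {"links": {}, "dropdowns": {}})
        # Sorted iteration keeps the crawl order deterministic.
        for element, (outcome, payload) in sorted(site[page]["clicks"].items()):
            if outcome == "navigate":
                entry["links"][element] = payload
                if payload not in queued:
                    queued.add(payload)
                    queue.append(payload)
            else:  # "reveal": same-page elements, stored on the trigger
                entry["dropdowns"][element] = list(payload)
    return memory
```

Calling `crawl(SITE)` explores the home page and one product page; the second product page is skipped because it shares a template with the first.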

Divide-and-Conquer Observation

Two red beams scanning highlighted page sections, extracting info into boxes that feed into a task summary
  • Section selection: each page is initially presented to the LLM as a list of sections with short summaries, allowing the agent to choose its own context.
  • Per-section extraction: each selected section gets its own focused LLM call over its accessibility subtree, with VLM-described images inlined.
  • Long lists are handled separately: items are chunked and selected chunk by chunk, with an early-termination check so that long tables never overflow the context window.
  • The extracted information is combined into a one-paragraph task-focused summary that drives action selection and history.
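
The long-list handling above can be illustrated with a small sketch. The function names and the `pick`/`done` callbacks are hypothetical stand-ins for the per-chunk LLM selection call and the early-termination check; the chunk size is an arbitrary assumption.

```python
from typing import Callable, Iterable


def chunked(items: list[str], size: int) -> Iterable[list[str]]:
    """Yield consecutive chunks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def select_from_long_list(
    items: list[str],
    pick: Callable[[list[str]], list[str]],
    done: Callable[[list[str]], bool],
    chunk_size: int = 20,
) -> list[str]:
    """Scan a long list chunk by chunk, stopping once `done` is satisfied."""
    selected: list[str] = []
    for chunk in chunked(items, chunk_size):
        selected.extend(pick(chunk))   # stand-in for one focused LLM call
        if done(selected):             # stand-in for the early-termination check
            break
    return selected
```

Because each call sees at most `chunk_size` items, the prompt size stays bounded regardless of how long the table is, and the scan stops as soon as enough has been found.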

Compound Action Workflows

A partially filled form with arrows tracing fill order, and a menu showing click-to-open then click-to-select
  • Index-based action selection: instead of issuing browser tool calls, the LLM simply selects an action from a numbered menu, and our harness handles execution automatically.
  • Form submission: LLM picks fields, fills each, then enters a review loop to edit, submit, or exit.
  • Dropdown selection: clicks the trigger, diffs the page, and the LLM picks one of the revealed options.
  • Search, file upload, select-option, and copy each have analogous short workflows; an end-task verification call gates final answers.
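
Two of the mechanisms above, the numbered action menu and the page diff used in dropdown selection, can be sketched as below. The helper names and the string-based element snapshots are assumptions made for illustration; the actual harness may represent actions and page state differently.

```python
def render_menu(actions: list[str]) -> str:
    """Format candidate actions as a numbered menu for the prompt."""
    return "\n".join(f"{i}. {a}" for i, a in enumerate(actions, start=1))


def parse_choice(reply: str, actions: list[str]) -> str:
    """Map the model's reply (expected to be a menu number) to an action."""
    index = int(reply.strip().rstrip("."))
    if not 1 <= index <= len(actions):
        raise ValueError(f"choice {index} is outside 1..{len(actions)}")
    return actions[index - 1]


def revealed_options(before: list[str], after: list[str]) -> list[str]:
    """Diff two element snapshots: keep elements that appear only after the click."""
    known = set(before)
    return [el for el in after if el not in known]
```

Restricting the model to a menu index means an invalid reply fails fast in `parse_choice` instead of producing a malformed browser command.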

At each timestep the agent retrieves or refreshes the PageMem for the current page, runs the observation pipeline to produce a task-focused summary, and selects one high-level action — which may invoke a workflow that internally issues multiple LLM sub-calls and browser operations. Because all three components operate on the same PageMem substrate, the framework generalizes across websites with no site-specific adapters.
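
The per-timestep loop can be summarized in a short sketch. Every callable here (`get_pagemem`, `observe`, `choose`, `execute`), the `"stop"` sentinel, and the step budget are hypothetical placeholders for the components described in this section, not the framework's real interfaces.

```python
from typing import Callable


def run_episode(
    task: str,
    get_pagemem: Callable[[], dict],
    observe: Callable[[str, dict], str],
    choose: Callable[[str, str, list[str]], str],
    execute: Callable[[str], None],
    max_steps: int = 10,
) -> list[str]:
    """Run retrieve -> observe -> act until the policy chooses 'stop'."""
    history: list[str] = []
    for _ in range(max_steps):
        pagemem = get_pagemem()               # retrieve or refresh for the current page
        summary = observe(task, pagemem)      # task-focused summary
        action = choose(task, summary, history)
        history.append(f"{summary} -> {action}")
        if action == "stop":
            break
        execute(action)                       # may run a multi-step workflow
    return history
```

Nothing in the loop refers to a particular website, which mirrors the claim that the components need no site-specific adapters.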

Results

Benchmark success rates (%). WebChallenger sets new open-model SOTA on multiple web navigation benchmarks and performs comparably to agents built on proprietary models, despite using no training. Best proprietary and open-model results are bolded. WebArena, VWA: VisualWebArena, O-M2W: Online-Mind2Web, WoA: WorkArena.

Method Model(s) WebArena VWA O-M2W WoA L1
Proprietary Models
GenericAgent GPT-4o 31.4 26.7 45.5
GenericAgent Claude 3.5 Sonnet 36.2 21.0 56.4
GenericAgent GPT-5 79.1
GenericAgent GPT-4o-mini 17.4 16.9 27.0
WALT GPT-5 50.1 52.9
IBM CUGA 61.7
OpenAI CUA 58.1 61.3
ScribeAgent GPT-4o + Qwen2.5-32B 53.0
AgentSymbiotic Claude 3.5 + Llama3 8B 48.5
AgentOccam-Judge GPT-4 45.7
WebPilot GPT-4o 37.2
Agent Workflow Memory GPT-4o 35.5
SkillWeaver GPT-4o 29.8
Open-Source Models (fine-tuned)
Agent-as-Annotators Qwen3.5-9B 41.5 33.9 51.5
Mobile-Agent-v3.5 Qwen3-VL-32B 48.4 46.6
WebDreamer Qwen2-VL-7B 21.9 35.0
Fara-7B Qwen2.5-VL-7B 34.1
Learn-by-Interact Codestral 22B 24.2
AgentTrek Qwen2.5-32B 22.4
Go-Browse Qwen2.5-7B 21.7
AutoWebGLM ChatGLM3 6B 18.2
TTI Gemma 3 12B 26.1
Open-Source Models (zero-shot)
GenericAgent GPT-oss-120b 50.9
Tree Search Llama-3-70B-Instruct 10.1 16.7
WebChallenger (Ours) GLM-4-32B + Qwen2.5-VL-7B 56.3 48.7 51.0 70.9

Our VisualWebArena experiments use Qwen3-VL-4B-Instruct in place of Qwen2.5-VL-7B-Instruct.

We evaluate WebChallenger on four web navigation benchmarks chosen to span a diverse range of capabilities: WebArena (812 tasks across 6 simulated websites), VisualWebArena (910 tasks requiring visual reasoning), Online-Mind2Web (300 tasks on 136 real-world sites), and WorkArena (330 enterprise UI tasks). WebChallenger outperforms fine-tuned baselines despite using no training of its own and approaches the performance of systems using proprietary models at a fraction of the cost. Consistent gains across simulated and live websites, text-only and visually grounded tasks, and consumer and enterprise interfaces indicate that the framework's advantages come from structural patterns shared across the web rather than site-specific adaptation.

Analysis

We run additional experiments on the 165-task WebArena-lite subset to attribute credit across the three architectural components, measure their inference cost, and test sensitivity to the backbone model.

Component ablations

Removing each pillar in turn shows that all three contribute meaningfully and non-redundantly. The observation pipeline carries the most weight (−17.6 points), followed by compound action workflows (−9.7) and persistent memory (−7.6).

Component ablation table showing per-site WebArena-lite success rates for the full system and three single-component-removed variants.

Token and step efficiency

The divide-and-conquer observation pipeline decomposes large prompts into multiple smaller prompts; removing it reduces total token usage but increases average prompt size by 4.75× (1850 → 8793 tokens). Compound action workflows collapse multi-step interactions into single decisions; removing them grows total token usage by nearly 40% (47M → 65M).
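
As a quick arithmetic check of the two ratios quoted above, the snippet below recomputes them from the reported figures. Since 47M and 65M are rounded totals, the resulting percentage is approximate.

```python
def ratio(before: float, after: float) -> float:
    """Return how many times larger `after` is than `before`."""
    return after / before


# Average prompt size with vs. without the observation pipeline (tokens).
prompt_growth = ratio(1850, 8793)

# Total token usage with vs. without compound workflows (millions of tokens).
total_growth_pct = (ratio(47, 65) - 1) * 100

print(f"{prompt_growth:.2f}x, {total_growth_pct:.1f}%")  # prints "4.75x, 38.3%"
```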

Token and step usage table for the GLM-4-32B component ablations.

Backbone comparison

Finally, we vary the backbone to separate the contribution of the harness from that of the model. The same GLM-4-32B that reaches 58.8% in our framework scores just 19.4% in a vanilla single-prompt harness, a 39.4-point gap attributable purely to system architecture. Performance scales with backbone strength as expected (GPT-5: 68.7%, GPT-4o-mini: 46.7%), but the framework retains strong performance even with weaker models.

Backbone model comparison table on WebArena-lite, including GenericAgent baseline with the same GLM-4-32B backbone.

BibTeX

@misc{hwang2026webchallengerreliableefficientgeneralist,
      title={WebChallenger: A Reliable and Efficient Generalist Web Agent}, 
      author={Jayoo Hwang and Xiaowen Zhang and Vedant Padwal},
      year={2026},
      eprint={2606.10423},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.10423}, 
}