omnibrowser-agent v0.2.35

Local-first browser AI automation library

OmniBrowser Agent plans and executes DOM actions entirely in the browser — no API keys, no cloud costs, no data leaving your machine. Wire in a WebLLM model and it reasons, remembers, and acts on any webpage.

Privacy-first · WebLLM + WebGPU · Reflection loop · Human-approved mode · Custom system prompt · Embeddable API
2 Agent Modes · 2 Planner Modes · 8 Action Types · MIT License

Use Cases

  • CRM profile lookup automation
  • Guided form-filling workflows
  • Assisted data extraction flows
  • Multi-step task automation

Core Engine

  • Observer: DOM snapshot + candidate elements
  • Planner: reflection → next action
  • Safety: safe / review / blocked gating
  • Executor: DOM actions with framework compat
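
The four stages compose into a single tick. As a rough sketch of how they fit together (all types and names here are illustrative, not the library's actual internals):

```typescript
// Illustrative types — the real AgentAction/PageSnapshot contracts are richer.
type Risk = "safe" | "review" | "blocked";
interface Action { type: string; selector?: string }

interface Engine {
  observe(): { candidates: string[] };
  plan(goal: string, snapshot: { candidates: string[] }, history: string[]): Action;
  assessRisk(action: Action): Risk;
  execute(action: Action): string;
}

// One tick: observe → plan → assess risk → execute (or pause / stop).
function tick(engine: Engine, goal: string, history: string[]): "executed" | "needs_approval" | "blocked" {
  const snapshot = engine.observe();
  const action = engine.plan(goal, snapshot, history);
  const risk = engine.assessRisk(action);
  if (risk === "blocked") return "blocked";          // never executes
  if (risk === "review") return "needs_approval";    // human-approved mode pauses here
  history.push(engine.execute(action));              // result feeds the next tick
  return "executed";
}
```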

See OmniBrowser Agent in Action

OmniBrowser Planner LLM

We fine-tuned a purpose-built planner model that runs entirely in your browser via WebLLM + WebGPU. No API keys, no cloud — the model weights download once and execute locally on your GPU.

Base Model

Qwen2.5-1.5B-Instruct — a compact, high-quality instruction-following LLM from Alibaba.

Fine-tuning

QLoRA fine-tuned on our custom OmniBrowser planner dataset — DOM snapshots paired with correct AgentAction JSON outputs.

Quantization

q4f16_1 via MLC-LLM — 4-bit weights, 16-bit activations. Optimized for WebGPU inference in the browser.

Size

~800 MB download. Runs on any device with WebGPU support (Chrome 113+, Edge, Safari 18+).
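
Because the model requires WebGPU, it is worth feature-detecting before triggering the download. A minimal check (the helper name is ours, not part of the library):

```typescript
// Hypothetical helper: detect the WebGPU API before downloading ~800 MB of weights.
// Takes the global object as a parameter so it stays testable outside the browser.
function hasWebGPU(g: { navigator?: { gpu?: unknown } }): boolean {
  return typeof g.navigator?.gpu !== "undefined";
}

// In the browser: if (!hasWebGPU(globalThis)) show an "unsupported device" message.
```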

What the model does

Given a user goal, page URL, visible DOM candidates, and action history — the model outputs a structured JSON response with evaluation, working memory, next goal, and the exact DOM action to execute:

{
  "evaluation": "Clicked the CRM tab — now on the contacts page.",
  "memory": "CRM tab active. Name field is #name, currently empty.",
  "nextGoal": "Type Jane Doe into the name field",
  "action": { "type": "type", "selector": "#name", "text": "Jane Doe", "clearFirst": true }
}

Training pipeline

The full training pipeline is open-source and reproducible:

  1. Generate a planner dataset from real DOM snapshots using notebook/scripts/generate_dataset.mjs
  2. Validate selector + action correctness with notebook/scripts/validate_dataset.mjs
  3. QLoRA fine-tune in Google Colab — see notebook/
  4. Merge adapters and quantize with mlc_llm convert_weight --quantization q4f16_1
  5. Upload to Hugging Face and load in WebLLM via custom appConfig

Load in your app

import * as webllm from "@mlc-ai/web-llm";
import { createWebLLMBridge } from "@akshayram1/omnibrowser-agent";

const appConfig = {
  model_list: [
    ...webllm.prebuiltAppConfig.model_list,
    {
      model: "https://huggingface.co/Akshayram1/omnibrowser-planner-1p5b-q4f16_1-MLC",
      model_id: "omnibrowser-planner-1p5b-q4f16_1",
      model_lib: webllm.modelLibURLPrefix + webllm.modelVersion
        + "/Qwen2-1.5B-Instruct-q4f16_1-ctx4k_cs1k-webgpu.wasm",
    },
  ],
};

const engine = await webllm.CreateMLCEngine(
  "omnibrowser-planner-1p5b-q4f16_1", { appConfig }
);
window.__browserAgentWebLLM = createWebLLMBridge(engine);

🤗 View on Hugging Face · Training Notebook · Model Test Demo

What's New in v0.2.35

This release implements the reflection-before-action pattern — the same loop used by leading browser agents — plus a new systemPrompt option so you can shape agent behaviour without rewriting the bridge.

Reflection Loop New

Before every action the agent now goes through a 4-step inner loop:

1 · Evaluate

What happened in the previous step? Did it succeed? What changed on the page?

2 · Remember

What key facts should be carried into the next step? Selector mappings, field values, task state.

3 · Plan

State the next goal in plain English before choosing an action.

4 · Act

Output the specific DOM action: click, type, navigate, scroll, etc.

The WebLLM bridge now returns the full reflection object:

{
  "evaluation": "The name field was filled successfully.",
  "memory":     "Name=#name done. Next: fill email at #email.",
  "next_goal":  "Type the email address into #email",
  "action":     { "type": "type", "selector": "#email", "text": "jane@example.com", "clearFirst": true }
}

The nextGoal field is surfaced in the live demo as a 💭 thought bubble before each action, so you can follow the agent's reasoning in real time.

Working Memory Across Steps New

The agent's memory string is automatically carried forward from one tick to the next inside AgentSession. The planner receives it as input.memory and can update it each step — giving the agent a scratchpad across the whole task.
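
In sketch form, the carry-forward looks roughly like this (hypothetical names — the real AgentSession handles this internally):

```typescript
// Illustrative sketch of memory threading across ticks — not the library's AgentSession.
interface PlannerResult { action: { type: string }; memory?: string }
interface Session { memory: string; history: string[] }

function runTick(session: Session, plan: (memory: string) => PlannerResult): void {
  // The planner receives the previous tick's memory as input...
  const result = plan(session.memory);
  // ...and whatever it returns becomes the scratchpad for the next tick.
  if (result.memory !== undefined) session.memory = result.memory;
  session.history.push(result.action.type);
}
```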

Custom System Prompt New

Pass your own system prompt directly in the planner config — no need to rewrite the bridge:

const agent = createBrowserAgent({
  goal: "Fill the checkout form",
  planner: {
    kind: "webllm",
    systemPrompt: "You are a careful checkout assistant. Never submit before all required fields are filled."
  }
});

New Exports New

  • parsePlannerResult(raw) — parse the full reflection+action JSON from raw LLM output, with fallback to bare AgentAction for backward compatibility.
  • PlannerResult type — { action, evaluation?, memory?, nextGoal? }

import { parsePlannerResult } from "@akshayram1/omnibrowser-agent";

const result = parsePlannerResult(llmRawOutput);
// result.action      → AgentAction
// result.evaluation  → string | undefined
// result.memory      → string | undefined
// result.nextGoal    → string | undefined

Backward Compatible

Existing bridges that return a bare AgentAction object still work without any changes. The library normalises both formats automatically.
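
A sketch of what that normalisation amounts to (illustrative — the library does the equivalent internally):

```typescript
// The two shapes a bridge may return, and a normaliser that accepts either.
interface AgentAction { type: string; selector?: string }
interface PlannerResult {
  action: AgentAction;
  evaluation?: string;
  memory?: string;
  nextGoal?: string;
}

function normalise(raw: AgentAction | PlannerResult): PlannerResult {
  // A full reflection object carries an `action` property;
  // a bare AgentAction has `type` at the top level instead.
  return "action" in raw ? raw : { action: raw };
}
```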

Docs

Everything you need to install, initialise, and run your first browser agent.

Installation

npm install @akshayram1/omnibrowser-agent

Quick Start

import { createBrowserAgent } from "@akshayram1/omnibrowser-agent";

const agent = createBrowserAgent(
  {
    goal: "Open CRM and find customer John Smith",
    mode: "human-approved",        // or "autonomous"
    planner: { kind: "heuristic" } // or "webllm"
  },
  {
    onStep:            (result, session) => console.log(result.message),
    onApprovalRequired:(action, session) => console.log("Needs approval:", action),
    onDone:            (result, session) => console.log("Done:", result.message),
    onError:           (err,    session) => console.error(err),
    onMaxStepsReached: (session)         => console.log("Max steps hit"),
  }
);

await agent.start();

// Resume after an approval prompt:
await agent.resume();

// Inspect state at any time:
console.log(agent.isRunning, agent.hasPendingAction);

// Stop:
agent.stop();

AbortSignal Support

const controller = new AbortController();
const agent = createBrowserAgent({ goal: "...", signal: controller.signal });
agent.start();

controller.abort(); // cancel from outside

Reading Reflection Fields

Every onStep result now includes optional reflection data from the planner:

onStep(result, session) {
  if (result.reflection?.nextGoal) {
    console.log("Agent thinking:", result.reflection.nextGoal);
  }
  if (result.reflection?.memory) {
    console.log("Agent memory:", result.reflection.memory);
  }
  console.log("Action:", result.message);
}

Agent Modes

human-approved

Pauses on review-rated actions and fires onApprovalRequired. Call agent.resume() to continue. Recommended for CRM, finance, and admin flows.

autonomous

Executes all safe and review actions without pausing. Best for rapid prototyping and demos.

Planner Modes

heuristic

Zero-dependency regex planner. Works fully offline. Best for simple, predictable goals: navigate, fill a field, click a button.
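
To make that concrete, a toy version of the pattern matching might look like this (the patterns here are simplified stand-ins, not the shipped planner's actual regexes):

```typescript
// Toy heuristic planner: match the goal string against a few regex patterns,
// fall back to signalling completion. Illustrative only.
interface Action { type: string; url?: string; selector?: string; text?: string }

function heuristicPlan(goal: string): Action {
  const nav = goal.match(/(?:open|go to|navigate to)\s+(https?:\S+)/i);
  if (nav) return { type: "navigate", url: nav[1] };

  const fill = goal.match(/(?:fill|type)\s+"?([^"]+?)"?\s+into\s+(\S+)/i);
  if (fill) return { type: "type", selector: fill[2], text: fill[1] };

  const search = goal.match(/search for\s+(.+)/i);
  if (search) return { type: "type", selector: "input[type=search], input", text: search[1] };

  // No pattern matched — report done rather than guess.
  return { type: "done" };
}
```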

webllm

On-device LLM via WebGPU through window.__browserAgentWebLLM. Fully private. Supports the reflection loop and custom system prompts.

Supported Actions

Action     Description                               Risk level
navigate   Navigate to a URL (http/https only)       safe
click      Click an element by CSS selector          safe / review
type       Type text into an input or textarea       safe / review
scroll     Scroll a container or the page            safe
focus      Focus an element (useful for dropdowns)   safe
wait       Pause for N milliseconds                  safe
extract    Extract text from an element              review
done       Signal task completion                    safe

Safety Model

  • safe — executes immediately in all modes.
  • review — pauses in human-approved mode; executes in autonomous. Triggered by actions on labels matching delete / submit / pay / confirm / transfer.
  • blocked — never executes. Triggered by javascript:, file:, or malformed URLs.
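
The gating above can be sketched as a small classifier (keyword list taken from the bullets; the function itself is illustrative, not safety.ts):

```typescript
// Keyword-based risk gating, as described above. Illustrative sketch only.
type Risk = "safe" | "review" | "blocked";

const REVIEW_KEYWORDS = /delete|submit|pay|confirm|transfer/i;

function assessRisk(action: { type: string; url?: string; label?: string }): Risk {
  if (action.type === "navigate") {
    try {
      const proto = new URL(action.url ?? "").protocol;
      // Only http/https navigations are allowed; javascript:, file:, etc. are blocked.
      return proto === "http:" || proto === "https:" ? "safe" : "blocked";
    } catch {
      return "blocked"; // malformed URL
    }
  }
  if (action.type === "extract") return "review";
  if (REVIEW_KEYWORDS.test(action.label ?? "")) return "review";
  return "safe";
}
```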
Chrome Extension
popup + background worker
Load the dist/ folder as an unpacked extension. Enter a goal in the popup, pick a mode, and hit Start. The background service worker drives the tick loop across tabs.
MV3 background worker
npm Library
createBrowserAgent()
Embed the agent directly into any web app. Import, configure, and wire up event callbacks. The same core engine powers both — zero duplication.
@akshayram1/omnibrowser-agent
both share the same core engine
one tick = observe → plan → assess risk → execute
01
Observe
observer.ts
Scans the live DOM. Filters invisible elements. Prioritises in-viewport candidates. Resolves ARIA labels. Returns a PageSnapshot.
02
Plan
planner.ts
Takes the goal, snapshot, and history. Returns the next AgentAction — plus optional reflection (evaluation, memory, nextGoal).
03
Assess risk
safety.ts
Every action gets a risk level: safe, review, or blocked. Risky actions pause for human approval in human-approved mode.
04
Execute
executor.ts
Performs the DOM action. Dispatches proper InputEvents for framework compat. Verifies success. Feeds errors back to the planner.
planner.ts
planNextAction(config, input)
  • URL, fill, search, click regex patterns
  • Fallback: first input → first button → done
  • WebLLM bridge via window.__browserAgentWebLLM
  • Returns PlannerResult with reflection fields
  • lastError fed back on retry
observer.ts
collectSnapshot()
  • Queries a, button, input, textarea, select…
  • Filters hidden & zero-dimension elements
  • In-viewport elements listed first
  • Resolves label via aria-labelledby, for/id, aria-label
  • Max 60 candidates, 1500-char text preview
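
The filtering and ordering can be modelled as a pure function over element geometry (a simplified sketch, not observer.ts itself):

```typescript
// Toy model of candidate selection: drop zero-dimension elements,
// list in-viewport candidates first, cap the total. Illustrative only.
interface Candidate { label: string; width: number; height: number; top: number }

function selectCandidates(all: Candidate[], viewportHeight: number, max = 60): Candidate[] {
  const visible = all.filter(c => c.width > 0 && c.height > 0); // hidden/zero-size dropped
  const inView = visible.filter(c => c.top >= 0 && c.top < viewportHeight);
  const offView = visible.filter(c => !(c.top >= 0 && c.top < viewportHeight));
  return [...inView, ...offView].slice(0, max); // in-viewport first, capped at `max`
}
```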
executor.ts
executeAction(action)
  • click — disabled-check before firing
  • type — InputEvent + change dispatch (React/Vue safe)
  • navigate — sets window.location.href
  • extract — verifies non-empty text
  • scroll, focus, wait actions
// safety.ts — assessRisk()
safe
navigate (http/s), click, scroll, wait, focus, done
review
extract, click/type on delete · pay · submit · transfer
blocked
javascript: · file: · malformed URLs
// planner modes
Heuristic — zero deps, works offline
Pure regex matching against your goal string. Handles navigate, fill, search, and click patterns. Falls back to the first visible input or button.
const agent = createBrowserAgent({
  goal: "search for John Smith",
  planner: { kind: "heuristic" }
});
WebLLM — on-device via WebGPU
Delegates to window.__browserAgentWebLLM. Fully private — no API calls, runs local inference via WebGPU. Wire in your own bridge implementation.
const agent = createBrowserAgent({
  goal: "open CRM and find customer",
  planner: { kind: "webllm", modelId: "Llama-3" }
}); // needs bridge on window
Executed
Action is safe. Runs immediately. Result message appended to session history. Next tick begins.
status: "executed"
Needs approval
Risk level is "review" in human-approved mode. Agent pauses, fires onApprovalRequired. Resume with agent.resume().
status: "needs_approval"
Blocked
Dangerous protocol or malformed URL. Action is rejected entirely. Session stops without executing anything.
status: "blocked"

Data Flow — One Tick

goal + history + memory
        │
        ▼
observer.collectSnapshot()   →  PageSnapshot (url, title, candidates[])
        │
        ▼
planner.planNextAction()     →  PlannerResult
                                  { action, evaluation?, memory?, nextGoal? }
        │
        ▼
safety.assessRisk(action)    →  safe | review | blocked
        │
   ┌────┴──────────────────────────┐
blocked                  review (human-approved mode)
   │                               │
  stop                   pause → user approves → resume()
                                   │
                              safe / approved
                                   │
                                   ▼
                executor.executeAction(action)  →  result string
                                   │
                                   ▼
                       session.history.push(result)
                       session.memory = plannerResult.memory
                       → next tick

WebLLM Bridge Contract

Attach an object to window.__browserAgentWebLLM before starting the agent. The bridge can return either the new PlannerResult format or a bare AgentAction (backward compatible).

window.__browserAgentWebLLM = {
  async plan(input, modelId) {
    // input.goal, input.snapshot, input.history,
    // input.lastError, input.memory, input.systemPrompt
    return {
      evaluation: "Previous step succeeded.",
      memory:     "Name field is #name.",
      next_goal:  "Fill the email field.",
      action: { "type": "type", "selector": "#email", "text": "jane@example.com", "clearFirst": true }
    };
  }
};

Current Limitations

  • No persistent long-term memory (IndexedDB) yet
  • No goal decomposition / multi-step task graphs yet
  • Risk scoring is keyword-based, not semantic
  • No selector healing or fallback strategy yet
@akshayram1/omnibrowser-agent
v0.2.35 · MIT · TypeScript
$ npm install @akshayram1/omnibrowser-agent
// Minimal setup — heuristic planner, autonomous mode
import { createBrowserAgent } from '@akshayram1/omnibrowser-agent';
 
const agent = createBrowserAgent({
  goal: "search for contact John Smith in CRM",
  mode: "autonomous",
  planner: { kind: "heuristic" }
});
 
await agent.start();
// Human-approved mode — agent pauses on risky actions
import { createBrowserAgent } from '@akshayram1/omnibrowser-agent';
 
const agent = createBrowserAgent({
  goal: "fill out the payment form",
  mode: "human-approved", // pauses on delete / submit / pay
  planner: { kind: "heuristic" },
  maxSteps: 20,
  stepDelayMs: 500
});
 
await agent.start();
 
// Later — after user reviews the pending action:
await agent.resume();
// Full event callback API
import { createBrowserAgent } from '@akshayram1/omnibrowser-agent';
 
const agent = createBrowserAgent(
  { goal: "open CRM and find customer", mode: "human-approved" },
  {
    onStart: (session) => console.log('started', session.id),
    onStep: (result, session) => console.log(result.message),
    onApprovalRequired: (action, session) => {
      console.log('Review:', action);
      // call agent.resume() after user confirms
    },
    onDone: (result) => console.log('Done:', result.message),
    onError: (err) => console.error(err),
    onMaxStepsReached: (session) => console.warn('max steps', session)
  }
);
 
await agent.start();
// AbortSignal — cancel from outside at any time
import { createBrowserAgent } from '@akshayram1/omnibrowser-agent';
 
const controller = new AbortController();
 
const agent = createBrowserAgent({
  goal: "extract all product prices",
  signal: controller.signal // wire in the abort signal
});
 
agent.start();
 
// Cancel externally (e.g. button click, timeout, unmount)
setTimeout(() => controller.abort(), 5000);
 
// Or call stop() directly:
agent.stop();
// WebLLM bridge with reflection loop
import * as webllm from "@mlc-ai/web-llm";
import { createBrowserAgent, parsePlannerResult } from "@akshayram1/omnibrowser-agent";
 
const engine = await webllm.CreateMLCEngine("Llama-3.2-3B-Instruct-q4f16_1-MLC");
 
window.__browserAgentWebLLM = {
  async plan(input, modelId) {
    const resp = await engine.chat.completions.create({
      messages: [{ role: "user", content: `Goal: ${input.goal}` }],
      temperature: 0, max_tokens: 200
    });
    return parsePlannerResult(resp.choices[0].message.content);
  }
};
exports
createBrowserAgent() · parseAction() · parsePlannerResult() · AgentAction · AgentMode · AgentSession · PlannerConfig · ContentResult · RiskLevel

Embedding Guide

Embed OmniBrowser Agent as a library in any web application. Full reference in docs/EMBEDDING.md.

Heuristic Planner (zero setup)

import { createBrowserAgent } from "@akshayram1/omnibrowser-agent";

const agent = createBrowserAgent(
  {
    goal: "Search contact Jane Doe and open profile",
    mode: "human-approved",
    planner: { kind: "heuristic" },
    maxSteps: 15,
    stepDelayMs: 400
  },
  {
    onStep:             (result) => console.log("step", result),
    onApprovalRequired: (action) => showApprovalModal(action),
    onDone:             (result) => console.log("done", result),
    onError:            (error)  => console.error(error)
  }
);

await agent.start();

// Approve a paused action:
await agent.approvePendingAction();

// Stop at any time:
agent.stop();

WebLLM Planner with Reflection

Load a WebLLM engine, wire the bridge, then start the agent. The bridge receives the full reflection input and should return the reflection+action object:

import * as webllm from "@mlc-ai/web-llm";
import { createBrowserAgent, parsePlannerResult } from "@akshayram1/omnibrowser-agent";

const engine = await webllm.CreateMLCEngine("Llama-3.2-3B-Instruct-q4f16_1-MLC");

window.__browserAgentWebLLM = {
  async plan(input, modelId) {
    const { goal, history, lastError, memory, systemPrompt } = input;

    const defaultSystem = `You are a browser automation agent.
Output ONLY a JSON object in this format:
{"evaluation":"...","memory":"...","next_goal":"...","action":{...}}`;

    const resp = await engine.chat.completions.create({
      messages: [
        { role: "system", content: systemPrompt || defaultSystem },
        { role: "user",   content: `Goal: "${goal}"\nHistory: ${history.slice(-4).join(" → ")}${memory ? "\nMemory: " + memory : ""}${lastError ? "\nLast error: " + lastError : ""}` }
      ],
      temperature: 0,
      max_tokens: 200
    });

    return parsePlannerResult(resp.choices[0].message.content);
  }
};

const agent = createBrowserAgent({
  goal: "Fill the checkout form with my details",
  planner: { kind: "webllm" }
}, {
  onStep(result) {
    if (result.reflection?.nextGoal) console.log("💭", result.reflection.nextGoal);
    console.log("✅", result.message);
  }
});

await agent.start();

Custom System Prompt

Shape the agent's personality or constraints without touching the bridge:

const agent = createBrowserAgent({
  goal: "Book a meeting room for tomorrow",
  planner: {
    kind: "webllm",
    systemPrompt: `You are a careful meeting room booking assistant.
Always confirm the room is available before clicking Book.
Never navigate away from the booking portal.`
  }
});

Bring Your Own Model (Colab Notebook)

Want a planner model tuned for OmniBrowser's DOM snapshots? Use the training + quantization assets in notebook/:

Flow: collect traces → fine-tune in Colab → run mlc_llm convert_weight --quantization q4f16_1 + gen_config → upload to HuggingFace → load via custom appConfig.

Custom WebLLM Model

To use a model that isn't in WebLLM's built-in list, compile it with MLC-LLM and register it via appConfig:

import * as webllm from "@mlc-ai/web-llm";

// 1. Describe your compiled model
const myModel = {
  model:    "https://huggingface.co/your-org/your-model-MLC/resolve/main/",
  model_id: "your-model-MLC",
  // point to the compiled .wasm lib (same arch as a similar built-in model)
  model_lib: webllm.modelLibURLPrefix + webllm.modelVersion
           + "/Mistral-7B-Instruct-v0.3-q4f16_1-ctx4k_cs1k-webgpu.wasm",
};

// 2. Merge with the prebuilt catalog so existing models still work
const engine = await webllm.CreateMLCEngine("your-model-MLC", {
  appConfig: {
    model_list: [
      ...webllm.prebuiltAppConfig.model_list,
      myModel,
    ],
  },
  initProgressCallback({ progress, text }) {
    console.log(Math.round(progress * 100) + "%", text);
  },
});

// 3. Wire it up as usual
window.__browserAgentWebLLM = {
  async plan(input) {
    const resp = await engine.chat.completions.create({
      messages: [{ role: "user", content: `Goal: ${input.goal}` }],
      temperature: 0,
      max_tokens: 200,
    });
    return parsePlannerResult(resp.choices[0].message.content);
  }
};

Steps to compile a custom model: (1) Install mlc-llm (llm.mlc.ai/docs/install/mlc_llm). (2) Run mlc_llm convert_weight + mlc_llm gen_config. (3) If your architecture has no compatible prebuilt kernel, compile with mlc_llm compile. (4) Upload weights + mlc-chat-config.json to a CORS-enabled host (e.g. HuggingFace).

Notes

  • The WebLLM bridge is not bundled — bring your own engine and attach it to window.__browserAgentWebLLM.
  • Use human-approved mode for CRM, finance, and admin actions.
  • Bridges returning a bare AgentAction still work — backward compatible.
  • For production apps, mount inside an authenticated shell and add your own permission checks.

Roadmap

Full roadmap in docs/ROADMAP.md.

v0.1

  • Extension runtime loop
  • Shared action contracts
  • Heuristic + WebLLM planner switch
  • Human-approved mode

v0.2 stable

  • New actions: scroll, focus
  • Improved heuristic planner with regex goal patterns
  • Better page observation (visibility filtering, up to 60 candidates)
  • Library API: resume(), isRunning, hasPendingAction, AbortSignal, onMaxStepsReached
  • CI pipeline with auto version bump on push to main

v0.2.35 current

  • Reflection-before-action pattern (evaluation → memory → next_goal → act)
  • Working memory carried across ticks via AgentSession.memory
  • parsePlannerResult() exported from library
  • systemPrompt option in PlannerConfig
  • Thought bubble (💭) messages in live demo
  • Chatbot UI redesign: tabs, typing indicator, right-aligned messages
  • Doc Viewer example with hidden tabs + side chat
  • Live Examples hub at /examples

v0.3

  • Expanded WebLLM model catalog (new 7B/8B options + compatibility matrix)
  • Improved model loading UX (recommended presets by speed/quality and device memory)
  • Enhanced default system prompts for safer, clearer multi-step planning
  • Prompt presets for common workflows (docs navigation, CRM form fill, task automation)

v1.0

  • Advanced prompt orchestration (goal-aware system prompt routing and contextual guardrails)
  • Functionality expansion: richer action toolkit and stronger extraction/navigation reliability
  • Adaptive planner behaviour (model-aware retries, fallback strategies, and recovery flows)
  • Evaluation suite for prompt and model quality across benchmark browser tasks

Contact

Maintainer: Akshay Chame

For feature requests or bugs, please open an issue on GitHub with reproduction steps.