
Proposal: BoxLite as VM infrastructure for hardware-isolated computer automation #99

@DorianZheng


Hi Google Gemini team! 👋

First off, excellent work on computer-use-preview. It is a strong demonstration of Gemini's computer use capabilities, and the abstraction over Playwright and Browserbase is particularly well designed: it makes swapping execution environments easy.

I've been working on BoxLite (github.com/boxlite-labs/boxlite), an embeddable VM runtime, and I think there's a natural opportunity to add a third backend: hardware-isolated, local, VM-based computer automation. It would let Gemini control entire desktop environments (not just browsers), which is especially useful for multi-tenant deployments, reproducible environments, and scenarios that need stronger isolation than a browser sandbox.

Demo: BoxLite with Claude Code (AI Agent Desktop Automation)

▶️ Watch the demo on YouTube


This shows BoxLite providing full desktop automation for Claude Code (an AI coding agent). The same approach could work for Gemini controlling computer-use-preview environments.


Current State & Opportunity

computer-use-preview currently provides two excellent execution backend options:

Playwright (Local):

  • ✅ Fast local execution
  • ✅ Simple setup
  • ✅ No cloud dependency
  • ⚠️ Runs on host system (less isolation)
  • ⚠️ Limited to browser context
  • ⚠️ OS dropdown rendering issues

Browserbase (Cloud):

  • ✅ Better OS element support
  • ✅ Managed infrastructure
  • ✅ Network isolation
  • ⚠️ Cloud dependency
  • ⚠️ Latency
  • ⚠️ Limited to browser context

There's an opportunity for a third option that enables full computer automation:

  • Hardware-isolated like Browserbase (but local)
  • No cloud dependency like Playwright (but more isolated)
  • Full desktop environment (control any GUI application, not just browsers)
  • Native OS support (VMs render OS elements correctly)

What is BoxLite?

BoxLite (github.com/boxlite-labs/boxlite) takes an "embeddable library" approach to sandboxing—think SQLite for VMs. Instead of requiring Docker Desktop or a daemon, it's a library that provides hardware-level isolated environments.

Core characteristics:

  • Hardware virtualization (KVM/Hypervisor.framework) — Real VMs, not just browser sandboxes
  • No daemon dependency — Just a Python library (pip install boxlite)
  • OCI-compatible — Uses standard Docker images from any registry
  • Cross-platform — macOS (Apple Silicon) and Linux (x86_64, ARM64)
  • Full desktop environment — Run any GUI application, not just browsers

Architecture:

BrowserAgent (existing)
└── Computer (abstraction)
    ├── PlaywrightComputer (existing - browser only)
    ├── BrowserbaseComputer (existing - browser only)
    └── BoxliteComputer (proposed) ← New backend
        └── Micro-VM (hardware virtualized)
            └── Full Desktop Environment
                ├── Browsers (Chrome/Firefox)
                ├── Terminal applications
                ├── GUI applications
                └── Any desktop software
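
To make the layering above concrete, here is a minimal sketch of why a shared interface lets the agent stay unchanged when a backend is added. The `Computer` class below is a trimmed-down, hypothetical stand-in, not the project's actual interface, and `RecordingComputer` and `run_actions` are illustrative names that only log calls instead of driving a real screen.

```python
from abc import ABC, abstractmethod


class Computer(ABC):
    """Hypothetical, trimmed-down stand-in for the project's Computer abstraction."""

    @abstractmethod
    def screenshot(self) -> bytes:
        """Return the current screen as PNG bytes."""

    @abstractmethod
    def click_at(self, x: int, y: int) -> None:
        """Click at coordinates (x, y)."""


class RecordingComputer(Computer):
    """Stub backend that records calls; a real backend would drive a browser or VM."""

    def __init__(self) -> None:
        self.log: list[tuple] = []

    def screenshot(self) -> bytes:
        self.log.append(("screenshot",))
        return b""  # a real backend would return image data

    def click_at(self, x: int, y: int) -> None:
        self.log.append(("click_at", x, y))


def run_actions(computer: Computer, actions: list[tuple]) -> bytes:
    """Apply (method_name, *args) actions, then return a final screenshot.

    The caller depends only on the Computer interface, so any backend
    implementing it can be passed in.
    """
    for name, *args in actions:
        getattr(computer, name)(*args)
    return computer.screenshot()
```

Under this assumption, a BoxLite backend would only need to implement the same methods for the agent loop to use it.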

Integration with Existing Architecture

BoxLite would fit naturally into the existing computers/ abstraction as a new backend:

New Implementation

# computers/boxlite/boxlite_computer.py
import base64

import boxlite
from computers.computer import Computer

class BoxliteComputer(Computer):
    """Computer implementation using BoxLite VMs for hardware isolation"""

    def __init__(self, image="ubuntu-desktop:latest"):
        # __init__ cannot be a coroutine, so the VM is started in start()
        self.image = image
        self.desktop = None

    async def start(self):
        """Start VM with desktop environment"""
        self.desktop = await boxlite.ComputerBox(
            cpu=2,
            memory=2048,
            image=self.image
        ).__aenter__()
        return self

    async def screenshot(self) -> bytes:
        """Capture screenshot from VM desktop as raw PNG bytes"""
        result = await self.desktop.screenshot()
        return base64.b64decode(result['data'])  # 'data' is a base64 encoded PNG

    async def click_at(self, x: int, y: int):
        """Click at normalized coordinates"""
        await self.desktop.mouse_move(x, y)
        await self.desktop.left_click()

    async def type_text_at(self, x: int, y: int, text: str):
        """Type text at coordinates"""
        await self.desktop.mouse_move(x, y)
        await self.desktop.left_click()
        await self.desktop.type(text)

    async def scroll_at(self, x: int, y: int, direction: str):
        """Scroll at coordinates"""
        await self.desktop.scroll(x, y, direction, amount=3)

    async def key_combination(self, keys: str):
        """Execute keyboard shortcut"""
        await self.desktop.key(keys)

    async def close(self):
        """Cleanup VM (safe to call even if the VM was never started)"""
        if self.desktop is not None:
            await self.desktop.__aexit__(None, None, None)
            self.desktop = None
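
One design point worth sketching is VM lifetime: if an action raises midway, the VM should still be shut down. Below is a minimal sketch of tying the desktop handle to an async context manager. `FakeDesktop` and `DesktopSession` are hypothetical names for illustration and do not represent BoxLite's API; the only assumption carried over from the proposal is that the desktop handle supports `__aenter__`/`__aexit__`.

```python
import asyncio
from contextlib import AsyncExitStack


class FakeDesktop:
    """Hypothetical stand-in for a VM desktop handle; not BoxLite's API."""

    def __init__(self):
        self.running = False
        self.events = []

    async def __aenter__(self):
        self.running = True
        return self

    async def __aexit__(self, exc_type, exc, tb):
        self.running = False
        return False  # do not swallow exceptions

    async def mouse_move(self, x, y):
        self.events.append(("move", x, y))

    async def left_click(self):
        self.events.append(("click",))


class DesktopSession:
    """Owns a desktop handle and guarantees cleanup when the block exits."""

    def __init__(self, desktop_factory):
        self._factory = desktop_factory
        self._stack = AsyncExitStack()
        self.desktop = None

    async def __aenter__(self):
        # Entering the desktop through the stack registers its cleanup.
        self.desktop = await self._stack.enter_async_context(self._factory())
        return self

    async def __aexit__(self, exc_type, exc, tb):
        return await self._stack.__aexit__(exc_type, exc, tb)

    async def click_at(self, x, y):
        await self.desktop.mouse_move(x, y)
        await self.desktop.left_click()


async def demo():
    fake = FakeDesktop()
    try:
        async with DesktopSession(lambda: fake) as session:
            await session.click_at(5, 7)
            raise RuntimeError("simulated action failure")
    except RuntimeError:
        pass  # the failure propagates, but the desktop is already stopped
    return fake
```

Running `asyncio.run(demo())` returns a desktop whose `running` flag is back to `False` even though the block raised, which is the behavior a real backend would want for VM cleanup.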

Usage

# Use BoxLite backend for browser automation
python main.py --computer=boxlite "Book a flight to Tokyo"

# Or any desktop automation task
python main.py --computer=boxlite "Open VS Code and create a new Python file"
python main.py --computer=boxlite "Take a screenshot of the desktop"

# Same agent, different infrastructure
# All actions happen in hardware-isolated VM with full desktop
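
For illustration, the command line shown above could be parsed roughly as follows. This is a sketch using the flag and positional query exactly as written in this proposal; the project's real `main.py` may name its arguments differently, and `build_parser` is a hypothetical helper.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the proposed CLI (flag names are assumptions)."""
    parser = argparse.ArgumentParser(description="Run a computer-use task")
    parser.add_argument(
        "--computer",
        choices=["playwright", "browserbase", "boxlite"],
        default="playwright",
        help="Execution backend to run the task in",
    )
    parser.add_argument("query", help="Natural-language task for the agent")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(f"backend={args.computer} query={args.query!r}")
```

Restricting `--computer` with `choices` means an unknown backend name is rejected at parse time rather than failing later during setup.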

Use Cases

1. Multi-Tenant SaaS 🏢

Challenge: Running multiple users' computer automation tasks safely

With BoxLite:

  • Each user gets a dedicated VM with a full desktop
  • Hardware isolation prevents cross-contamination
  • Safe for untrusted automation tasks (any application, not just browsers)
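
The per-user model above could be sketched as a small pool that hands each user their own VM and caps how many run at once. `FakeVM` and `TenantPool` are hypothetical names used only for illustration; a real version would start and stop actual VMs where the stub sets flags.

```python
import asyncio


class FakeVM:
    """Hypothetical stand-in for one isolated VM; not BoxLite's API."""

    def __init__(self, owner: str):
        self.owner = owner
        self.closed = False

    async def close(self) -> None:
        self.closed = True


class TenantPool:
    """Gives each user a dedicated VM and enforces a capacity limit."""

    def __init__(self, max_vms: int):
        self._max_vms = max_vms
        self._vms: dict[str, FakeVM] = {}
        self._lock = asyncio.Lock()

    async def acquire(self, user_id: str) -> FakeVM:
        async with self._lock:
            if user_id in self._vms:
                return self._vms[user_id]  # reuse the user's own VM only
            if len(self._vms) >= self._max_vms:
                raise RuntimeError("VM capacity reached")
            vm = FakeVM(user_id)
            self._vms[user_id] = vm
            return vm

    async def release(self, user_id: str) -> None:
        async with self._lock:
            vm = self._vms.pop(user_id, None)
        if vm is not None:
            await vm.close()


async def demo():
    pool = TenantPool(max_vms=2)  # created inside the running event loop
    alice = await pool.acquire("alice")
    bob = await pool.acquire("bob")
    same = await pool.acquire("alice")
    try:
        await pool.acquire("carol")
        over_capacity = False
    except RuntimeError:
        over_capacity = True
    await pool.release("alice")
    return alice, bob, same, over_capacity
```

The key property is that a VM is keyed by user and never handed to anyone else, so isolation between tenants follows from isolation between VMs.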

2. Reproducible Environments

Challenge: "Works on my machine" issues

With BoxLite:

# Everyone uses same OCI image
BoxliteComputer(image="mycompany/browser-env:v1.0")

# Guaranteed same:
# - Browser version
# - System libraries
# - Screen resolution
# - Installed fonts
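
Reproducibility of this kind depends on the image reference being pinned. As a rough sketch, a team could reject references that resolve to a moving target such as `latest`. The `is_pinned` helper below is hypothetical and uses simplified parsing rather than a full OCI reference parser. Note also that tags can be re-pushed, so a digest is the stricter pin.

```python
def is_pinned(image: str) -> bool:
    """Return True if the image reference names an explicit tag or digest.

    Simplified check: a digest always counts as pinned; otherwise the last
    path component must carry a tag other than "latest".
    """
    if "@sha256:" in image:
        return True
    last_component = image.rsplit("/", 1)[-1]
    if ":" not in last_component:
        return False  # no tag means the registry default, usually "latest"
    tag = last_component.rsplit(":", 1)[1]
    return tag != "latest"
```

Splitting on the last `/` first keeps a registry port such as `localhost:5000/env` from being mistaken for a tag.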

3. Native OS Elements 🎯

Challenge: Playwright can't capture OS-rendered dropdowns

With BoxLite:

  • Full VM desktop renders OS elements
  • Screenshot captures everything
  • No Browserbase cloud dependency

4. Local Development 💻

Challenge: Need Browserbase-like isolation locally

With BoxLite:

  • Browserbase-like isolation, enforced by a local VM
  • No cloud account needed
  • No network latency
  • Works offline

5. Full Computer Automation 🚀

True computer use: BoxLite's ComputerBox provides a full desktop environment

# Control any GUI application, not just browsers
await desktop.open_application("Firefox")
await desktop.open_application("VS Code")
await desktop.open_application("Terminal")
await desktop.type("git clone https://github.com/...")

# Gemini can automate entire desktop workflows
# - Code editing in IDEs
# - Terminal commands
# - File management
# - Cross-application workflows

Working Integration Example

BoxLite already has a working integration with Claude Code via MCP:

  • Repository: https://github.com/boxlite-labs/boxlite-mcp
  • Shows: How AI agents (Claude Code) use BoxLite for safe code execution and desktop automation
  • Proves: Integration pattern works well with AI agents controlling full desktop environments

The same pattern applies here—computer-use-preview using BoxLite's ComputerBox for hardware-isolated, full-desktop computer automation.


Potential Benefits

1. Enhanced Security

  • Hardware isolation - Separate VM per session
  • Kernel-level protection - Not just browser sandbox
  • Safe untrusted tasks - Run any automation safely

2. Reproducibility

  • OCI images - Same environment everywhere
  • Version control - mycompany/browser-env:v1.0
  • Consistent results - Dev matches production

3. Developer Experience

  • No Docker Desktop - Works without daemon on macOS
  • No cloud account - No Browserbase subscription needed
  • Fast startup - Micro-VMs in ~100-500ms
  • Offline capable - Local execution

4. Flexibility

  • Full desktop - Beyond just browsers
  • Custom images - Pre-configure everything
  • Any GUI app - Not limited to Chrome/Firefox

Trade-offs & Considerations

When Playwright is Better

  • Quick local testing - Fastest, simplest setup
  • Simple scripts - No VM overhead needed
  • Speed-critical - Direct execution, no VM layer

When Browserbase is Better

  • Managed infrastructure - No local setup
  • Enterprise scale - Managed service, support
  • Team collaboration - Shared cloud resources

When BoxLite Helps

  • Multi-tenant SaaS - Hardware isolation per user
  • Local VMs - Browserbase-like isolation locally
  • Reproducibility - OCI images, version control
  • Native OS elements - Full desktop rendering
  • Offline development - No cloud dependency
  • Full computer automation - Any GUI application, not just browsers

Recommendation: Offer all three options—users choose based on their needs.



Potential Next Steps

If this seems interesting, I'd be happy to:

  1. Implement computers/boxlite/ - Create BoxliteComputer backend following existing patterns
  2. Provide example OCI images - Pre-built desktop images with Chrome/Firefox, VS Code, Terminal for testing
  3. Share benchmarks - Show startup time, memory usage, performance comparisons
  4. Document integration - Add BoxLite setup instructions to README
  5. Demo full computer automation - Show Gemini controlling entire desktop (not just browsers)

No pressure—mainly wanted to share this as a potential third backend option for scenarios requiring hardware isolation, local VM infrastructure, or full desktop automation capabilities.


Feedback Welcome

I'd love to hear your thoughts on:

  • Whether hardware-isolated local VMs would be valuable for computer-use-preview users
  • If the computers/ abstraction makes this integration straightforward
  • What scenarios you see benefiting most from VM-based computer automation (full desktop, not just browsers)
  • Any concerns about the approach

And if you're interested in BoxLite for other projects, feel free to check it out—we're building in public and feedback helps! A ⭐ on GitHub would be appreciated if you find it useful.


Disclosure: I'm one of the BoxLite maintainers, but I genuinely think there's natural synergy here—BoxLite's ComputerBox was designed for exactly this use case (AI agents controlling desktop environments), and your abstraction layer makes integration straightforward. Looking forward to your thoughts!
