Proposal: BoxLite as VM infrastructure for hardware-isolated computer automation
Hi Google Gemini team! 👋
First off, excellent work on computer-use-preview—this is a fantastic demonstration of Gemini's computer use capabilities. The abstraction between Playwright and Browserbase is particularly well-designed, making it easy to swap execution environments.
I've been working on BoxLite (github.com/boxlite-labs/boxlite), an embeddable VM runtime, and I think there's a natural opportunity to add a third backend option: hardware-isolated, local VM-based computer automation. This would let Gemini control entire desktop environments (not just browsers), which is especially useful for multi-tenant deployments, reproducible environments, and scenarios requiring stronger isolation than browser sandboxes.
Demo: BoxLite with Claude Code (AI Agent Desktop Automation)
boxlite-mcp-demo-compressed.mp4
This shows BoxLite providing full desktop automation for Claude Code (an AI coding agent). The same approach could work for Gemini controlling computer-use-preview environments.
Current State & Opportunity
computer-use-preview currently provides two excellent execution backend options:
Playwright (Local):
- ✅ Fast local execution
- ✅ Simple setup
- ✅ No cloud dependency
- ⚠️ Runs on host system (less isolation)
- ⚠️ Limited to browser context
- ⚠️ OS dropdown rendering issues
Browserbase (Cloud):
- ✅ Better OS element support
- ✅ Managed infrastructure
- ✅ Network isolation
- ⚠️ Cloud dependency
- ⚠️ Latency
- ⚠️ Limited to browser context
There's an opportunity for a third option that enables full computer automation:
- Hardware-isolated like Browserbase (but local)
- No cloud dependency like Playwright (but more isolated)
- Full desktop environment (control any GUI application, not just browsers)
- Native OS support (VMs render OS elements correctly)
What is BoxLite?
BoxLite (github.com/boxlite-labs/boxlite) takes an "embeddable library" approach to sandboxing—think SQLite for VMs. Instead of requiring Docker Desktop or a daemon, it's a library that provides hardware-level isolated environments.
Core characteristics:
- Hardware virtualization (KVM/Hypervisor.framework) - Real VMs, not just browser sandboxes
- No daemon dependency - Just a Python library (pip install boxlite)
- OCI-compatible - Uses standard Docker images from any registry
- Cross-platform - macOS (Apple Silicon) and Linux (x86_64, ARM64)
- Full desktop environment - Run any GUI application, not just browsers
Architecture:
    BrowserAgent (existing)
    └── Computer (abstraction)
        ├── PlaywrightComputer (existing - browser only)
        ├── BrowserbaseComputer (existing - browser only)
        └── BoxliteComputer (proposed) ← New backend
            └── Micro-VM (hardware virtualized)
                └── Full Desktop Environment
                    ├── Browsers (Chrome/Firefox)
                    ├── Terminal applications
                    ├── GUI applications
                    └── Any desktop software
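
The layering above can be sketched as a small abstract interface with interchangeable backends. This is a minimal illustration only: the two methods shown are borrowed from the proposed BoxliteComputer in this issue rather than from the repository's actual Computer class, and `FakeDesktopComputer` and `run_step` are hypothetical names used so the example runs without a VM.

```python
import asyncio
from abc import ABC, abstractmethod


class Computer(ABC):
    """Simplified stand-in for the Computer abstraction (assumed shape)."""

    @abstractmethod
    async def screenshot(self) -> bytes: ...

    @abstractmethod
    async def click_at(self, x: int, y: int) -> None: ...


class FakeDesktopComputer(Computer):
    """Hypothetical backend that records actions instead of driving a VM."""

    def __init__(self) -> None:
        self.actions: list[tuple] = []

    async def screenshot(self) -> bytes:
        self.actions.append(("screenshot",))
        return b"\x89PNG"  # placeholder bytes, not a real image

    async def click_at(self, x: int, y: int) -> None:
        self.actions.append(("click_at", x, y))


async def run_step(computer: Computer) -> bytes:
    """Agent-side code depends only on the Computer interface."""
    await computer.click_at(100, 200)
    return await computer.screenshot()


backend = FakeDesktopComputer()
image = asyncio.run(run_step(backend))
print(backend.actions)
```

Because `run_step` only sees the interface, swapping in a VM-backed implementation would not require changes on the agent side, which is the point of the proposed third backend.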
Integration with Existing Architecture
BoxLite would fit naturally into the existing computers/ abstraction as a new backend:
New Implementation
    # computers/boxlite/boxlite_computer.py
    import base64

    import boxlite
    from computers.computer import Computer


    class BoxliteComputer(Computer):
        """Computer implementation using BoxLite VMs for hardware isolation"""

        def __init__(self, image="ubuntu-desktop:latest"):
            # __init__ cannot be async, so the VM is started in start()
            self.image = image
            self.desktop = None

        async def start(self):
            # Start VM with desktop environment
            self.desktop = await boxlite.ComputerBox(
                cpu=2,
                memory=2048,
                image=self.image
            ).__aenter__()
            return self

        async def screenshot(self) -> bytes:
            """Capture screenshot from VM desktop"""
            result = await self.desktop.screenshot()
            # result['data'] is a base64 encoded PNG; decode it to raw bytes
            return base64.b64decode(result['data'])

        async def click_at(self, x: int, y: int):
            """Click at normalized coordinates"""
            await self.desktop.mouse_move(x, y)
            await self.desktop.left_click()

        async def type_text_at(self, x: int, y: int, text: str):
            """Type text at coordinates"""
            await self.desktop.mouse_move(x, y)
            await self.desktop.left_click()
            await self.desktop.type(text)

        async def scroll_at(self, x: int, y: int, direction: str):
            """Scroll at coordinates"""
            await self.desktop.scroll(x, y, direction, amount=3)

        async def key_combination(self, keys: str):
            """Execute keyboard shortcut"""
            await self.desktop.key(keys)

        async def close(self):
            """Cleanup VM"""
            await self.desktop.__aexit__(None, None, None)

Usage
    # Use BoxLite backend for browser automation
    python main.py --computer=boxlite "Book a flight to Tokyo"

    # Or any desktop automation task
    python main.py --computer=boxlite "Open VS Code and create a new Python file"
    python main.py --computer=boxlite "Take a screenshot of the desktop"

    # Same agent, different infrastructure
    # All actions happen in hardware-isolated VM with full desktop

Use Cases
1. Multi-Tenant SaaS 🏢
Challenge: Running multiple users' computer automation tasks safely
With BoxLite:
- Each user gets dedicated VM with full desktop
- Hardware isolation prevents cross-contamination
- Safe for untrusted automation tasks (any application, not just browsers)
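
One way to picture the per-user isolation described here is a session helper that gives each user ID its own box and always tears it down when the task ends. The sketch below is illustrative only: `FakeBox` is a hypothetical stand-in for a VM handle (it is not BoxLite's API), and `user_session` and `run_task` are invented helper names.

```python
import asyncio
from contextlib import asynccontextmanager


class FakeBox:
    """Hypothetical stand-in for one VM; holds state private to a user."""

    def __init__(self, user_id: str) -> None:
        self.user_id = user_id
        self.files: dict[str, str] = {}
        self.running = False

    async def start(self) -> None:
        self.running = True

    async def stop(self) -> None:
        self.running = False


@asynccontextmanager
async def user_session(user_id: str):
    """Start a dedicated box for one user and always stop it afterwards."""
    box = FakeBox(user_id)
    await box.start()
    try:
        yield box
    finally:
        await box.stop()


async def run_task(user_id: str, secret: str) -> FakeBox:
    async with user_session(user_id) as box:
        box.files["secret.txt"] = secret  # written only inside this user's box
        return box


async def main() -> list[FakeBox]:
    # Two users run concurrently, each in a separate box.
    return await asyncio.gather(run_task("alice", "a-key"), run_task("bob", "b-key"))


alice_box, bob_box = asyncio.run(main())
```

In a real deployment the separation would come from the hypervisor rather than from separate Python objects; the sketch only shows the one-box-per-session lifecycle.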
2. Reproducible Environments ✅
Challenge: "Works on my machine" issues
With BoxLite:
    # Everyone uses same OCI image
    BoxliteComputer(image="mycompany/browser-env:v1.0")

    # Guaranteed same:
    # - Browser version
    # - System libraries
    # - Screen resolution
    # - Installed fonts

3. Native OS Elements 🎯
Challenge: Playwright can't capture OS-rendered dropdowns
With BoxLite:
- Full VM desktop renders OS elements
- Screenshot captures everything
- No Browserbase cloud dependency
4. Local Development 💻
Challenge: Need Browserbase-like isolation locally
With BoxLite:
- Browserbase-like isolation
- No cloud account needed
- No network latency
- Works offline
5. Full Computer Automation 🚀
True computer use: BoxLite ComputerBox provides full desktop environment
    # Control any GUI application, not just browsers
    await desktop.open_application("Firefox")
    await desktop.open_application("VS Code")
    await desktop.open_application("Terminal")
    await desktop.type("git clone https://github.com/...")

    # Gemini can automate entire desktop workflows
    # - Code editing in IDEs
    # - Terminal commands
    # - File management
    # - Cross-application workflows

Working Integration Example
BoxLite already has a working integration with Claude Code via MCP:
- Repository: https://github.com/boxlite-labs/boxlite-mcp
- Shows: How AI agents (Claude Code) use BoxLite for safe code execution and desktop automation
- Proves: Integration pattern works well with AI agents controlling full desktop environments
The same pattern applies here—computer-use-preview using BoxLite's ComputerBox for hardware-isolated, full-desktop computer automation.
Potential Benefits
1. Enhanced Security
- Hardware isolation - Separate VM per session
- Kernel-level protection - Not just browser sandbox
- Safe untrusted tasks - Run any automation safely
2. Reproducibility
- OCI images - Same environment everywhere
- Version control - mycompany/browser-env:v1.0
- Consistent results - Dev matches production
3. Developer Experience
- No Docker Desktop - Works without daemon on macOS
- No cloud account - No Browserbase subscription needed
- Fast startup - Micro-VMs in ~100-500ms
- Offline capable - Local execution
4. Flexibility
- Full desktop - Beyond just browsers
- Custom images - Pre-configure everything
- Any GUI app - Not limited to Chrome/Firefox
Trade-offs & Considerations
When Playwright is Better
- ✅ Quick local testing - Fastest, simplest setup
- ✅ Simple scripts - No VM overhead needed
- ✅ Speed-critical - Direct execution, no VM layer
When Browserbase is Better
- ✅ Managed infrastructure - No local setup
- ✅ Enterprise scale - Managed service, support
- ✅ Team collaboration - Shared cloud resources
When BoxLite Helps
- ✅ Multi-tenant SaaS - Hardware isolation per user
- ✅ Local VMs - Browserbase-like isolation locally
- ✅ Reproducibility - OCI images, version control
- ✅ Native OS elements - Full desktop rendering
- ✅ Offline development - No cloud dependency
- ✅ Full computer automation - Any GUI application, not just browsers
Recommendation: Offer all three options—users choose based on their needs.
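
Offering all three backends could come down to a single command-line choice. The sketch below assumes the `--computer` flag shown in the usage examples of this proposal; the `BACKENDS` table and the placeholder factory functions are illustrative and do not reflect the project's actual main.py.

```python
import argparse
from typing import Callable


# Placeholder factories; real code would construct the backend objects.
def make_playwright() -> str:
    return "PlaywrightComputer"


def make_browserbase() -> str:
    return "BrowserbaseComputer"


def make_boxlite() -> str:
    return "BoxliteComputer"


BACKENDS: dict[str, Callable[[], str]] = {
    "playwright": make_playwright,
    "browserbase": make_browserbase,
    "boxlite": make_boxlite,
}


def parse_args(argv: list[str]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(description="Run a computer-use task")
    parser.add_argument("--computer", choices=sorted(BACKENDS), default="playwright")
    parser.add_argument("query", help="Natural-language task for the agent")
    return parser.parse_args(argv)


def select_backend(argv: list[str]) -> tuple[str, str]:
    """Return the chosen backend (here just its name) and the task text."""
    args = parse_args(argv)
    return BACKENDS[args.computer](), args.query


print(select_backend(["--computer=boxlite", "Book a flight to Tokyo"]))
# → ('BoxliteComputer', 'Book a flight to Tokyo')
```

Keeping the mapping in one table means adding or removing a backend touches a single place, and `choices` makes argparse reject unknown backend names.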
BoxLite Status
- Current version: 0.4.4 on PyPI
- License: Apache 2.0 (same as this project)
- Platforms: macOS (Apple Silicon), Linux (x86_64, ARM64)
- GitHub: https://github.com/boxlite-labs/boxlite
- MCP Integration: https://github.com/boxlite-labs/boxlite-mcp (working with Claude Code)
- Python SDK: Stable, asyncio-native
- Production readiness: Early stage, used in production by some teams
Potential Next Steps
If this seems interesting, I'd be happy to:
- Implement computers/boxlite/ - Create BoxliteComputer backend following existing patterns
- Provide example OCI images - Pre-built desktop images with Chrome/Firefox, VS Code, Terminal for testing
- Share benchmarks - Show startup time, memory usage, performance comparisons
- Document integration - Add BoxLite setup instructions to README
- Demo full computer automation - Show Gemini controlling entire desktop (not just browsers)
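
For the benchmarks item above, startup time could be measured with a small harness like the following. It is only a sketch: `FakeBox` is a hypothetical stand-in whose boot is simulated with a sleep, so the numbers it prints say nothing about BoxLite itself; a real measurement would pass a factory for the actual VM object instead.

```python
import asyncio
import statistics
import time


class FakeBox:
    """Hypothetical async context manager simulating a VM with a fixed boot delay."""

    def __init__(self, boot_seconds: float = 0.01) -> None:
        self.boot_seconds = boot_seconds

    async def __aenter__(self) -> "FakeBox":
        await asyncio.sleep(self.boot_seconds)
        return self

    async def __aexit__(self, exc_type, exc, tb) -> None:
        return None


async def measure_startup(make_box, runs: int = 5) -> list[float]:
    """Return wall-clock seconds taken to enter the context, one value per run."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        async with make_box():
            samples.append(time.perf_counter() - start)
    return samples


samples = asyncio.run(measure_startup(FakeBox))
print(f"median startup: {statistics.median(samples) * 1000:.1f} ms over {len(samples)} runs")
```

Reporting the median over several runs reduces the influence of a single slow start, for example a first run that has to populate caches.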
No pressure—mainly wanted to share this as a potential third backend option for scenarios requiring hardware isolation, local VM infrastructure, or full desktop automation capabilities.
Feedback Welcome
I'd love to hear your thoughts on:
- Whether hardware-isolated local VMs would be valuable for computer-use-preview users
- If the computers/ abstraction makes this integration straightforward
- What scenarios you see benefiting most from VM-based computer automation (full desktop, not just browsers)
- Any concerns about the approach
And if you're interested in BoxLite for other projects, feel free to check it out—we're building in public and feedback helps! A ⭐ on GitHub would be appreciated if you find it useful.
Disclosure: I'm one of the BoxLite maintainers, but I genuinely think there's natural synergy here—BoxLite's ComputerBox was designed for exactly this use case (AI agents controlling desktop environments), and your abstraction layer makes integration straightforward. Looking forward to your thoughts!