Add mouse position into env observation #282

Open
wants to merge 5 commits into
base: main
from

Conversation

ryanhoangt

Hi, thanks for the project! I'm trying to implement and experiment with coordinate-based actions from browsergym, and it would be useful if the environment exposed the mouse position via the observation. Not sure what the team thinks about this?

One quirk: there seems to be no direct way to get the mouse position from Playwright, so I use a somewhat hacky way to get that info.

Collaborator

@gasse gasse left a comment

Nice feature! See my comments to make it robust to iFrames

browsergym/core/src/browsergym/core/env.py (outdated)
@@ -271,7 +268,7 @@ def override_property(task, env, property):
 window.addEventListener("focusin", () => {window.browsergym_page_activated();}, {capture: true});
 window.addEventListener("load", () => {window.browsergym_page_activated();}, {capture: true});
 window.addEventListener("pageshow", () => {window.browsergym_page_activated();}, {capture: true});
-window.addEventListener("mousemove", () => {window.browsergym_page_activated();}, {capture: true});
+window.addEventListener("mousemove", (event) => {window.browsergym_page_activated(); window.pageX = event.clientX; window.pageY = event.clientY;}, {capture: true});
Collaborator

Clean and simple, I like this

Returns:
An array of the x and y coordinates of the mouse location.
"""
position = page.evaluate("""() => {
Collaborator

This will work for simple pages, but I'm worried about iframes. Here is something that could work:

  • in the JS callback (mousemove), record the position on the window object in JS, and also record in Python which page / frame received the event, with a method similar to _activate_page_from_js().
  • to extract the mouse position in the browser viewport, take the latest mouse position (from the last iframe that received a mousemove event), and work your way up the frame hierarchy to reconstruct the current mouse position. See how we do that to get the coordinates of all elements in all iframes here:

https://github.com/ServiceNow/BrowserGym/blob/main/browsergym/core/src/browsergym/core/observation.py#L293-L377
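
The second step of the suggestion above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not BrowserGym's implementation: the function name `to_viewport_coords` and the `frame_offsets` argument are hypothetical, and the caller is assumed to have already obtained, for each enclosing iframe, the top-left position of that iframe's content area within its parent frame's viewport. The sketch ignores iframe borders, CSS transforms, and zoom.

```python
from typing import List, Sequence, Tuple


def to_viewport_coords(
    local_xy: Tuple[float, float],
    frame_offsets: Sequence[Tuple[float, float]],
) -> List[float]:
    """Translate a mouse position from the innermost frame to the top-level viewport.

    local_xy: (clientX, clientY) recorded in the frame that last received mousemove.
    frame_offsets: hypothetical input, one (left, top) pair per enclosing iframe,
        ordered from the innermost iframe outwards; each pair is the position of
        the iframe's content area inside its parent frame's viewport.
    """
    x, y = local_xy
    for offset_x, offset_y in frame_offsets:
        # Each hop up the hierarchy shifts the point by the iframe's position
        # in its parent frame.
        x += offset_x
        y += offset_y
    return [x, y]


# A pointer at (10, 20) inside an iframe placed at (100, 50), itself nested in
# an iframe placed at (5, 5) in the main page.
print(to_viewport_coords((10, 20), [(100, 50), (5, 5)]))  # [115, 75]
```

With no enclosing iframes the offset list is empty and the position is returned unchanged, which corresponds to the simple-page case.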

@@ -1141,6 +1142,7 @@ def get_checkbox_elem(obs):

obs, reward, term, trunc, info = env.step(action)
checkbox = get_checkbox_elem(obs)
assert obs['mouse_position'] == [x, y]
Collaborator

That's a good test. Can you do the same for other pages that contain iframes, and check that you get the correct coordinates when clicking on elements inside an iframe (both clicking with coordinates and with bid)?

@gasse gasse force-pushed the add-mouse-position branch from cd33d61 to a701498 on December 3, 2024 18:49
Collaborator

gasse commented Dec 3, 2024

BTW, a cool way to try this feature is to run an openended agent on a whiteboard and ask it to draw simple shapes, like we did for the demo video here:
https://github.com/ServiceNow/BrowserGym/

Collaborator

gasse commented Dec 3, 2024

Seems like there is pageX, pageY but also clientX, clientY
https://michaelwornow.net/2024/01/02/display-x-y-coords-chrome-debugger

https://developer.mozilla.org/en-US/docs/Web/API/MouseEvent/clientX
https://developer.mozilla.org/en-US/docs/Web/API/MouseEvent/pageX

Only way to know how / which one of these to use is to write some tests :)

@ryanhoangt
Author

Seems like there is pageX, pageY but also clientX, clientY
https://michaelwornow.net/2024/01/02/display-x-y-coords-chrome-debugger

From the blog, it seems like clientX/clientY is relative to the viewport, and pageX/pageY is relative to the whole web page. I think clientX/clientY is closer to what we want 🤔
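
The difference between the two coordinate pairs comes down to the scroll offset: per the MDN pages linked earlier in this thread, pageX/pageY include how far the document has been scrolled, while clientX/clientY do not. Below is a small sketch of that relationship. The helper name `page_to_client` is hypothetical, and the scroll arguments stand in for the browser's `window.scrollX` / `window.scrollY`.

```python
from typing import List


def page_to_client(page_x: float, page_y: float, scroll_x: float, scroll_y: float) -> List[float]:
    """Convert document-relative (page) coordinates to viewport-relative (client) ones.

    scroll_x / scroll_y are assumed to hold the current scroll offsets of the
    document, i.e. the values the browser reports as window.scrollX / scrollY.
    """
    return [page_x - scroll_x, page_y - scroll_y]


# After scrolling down by 1000 px, a point 1250 px from the top of the document
# sits 250 px from the top of the viewport.
print(page_to_client(300, 1250, 0, 1000))  # [300, 250]
```

On an unscrolled page both pairs coincide, so a test that only uses pages without scrolling cannot tell them apart.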
