
EQA merge #254


Open
wants to merge 92 commits into base: main

Commits (92)
63041b0
Add semnav-specific / hydra-related changes to stretch_ai
blakerbuchanan Nov 5, 2024
3706b02
commit changes used for in-home hardware experiments
SaumyaSaxena Jan 10, 2025
a105448
installation update
hello-peiqi Jan 16, 2025
daa6880
grapheqa
hello-peiqi Jan 16, 2025
0e2e7ae
grapheqa fix
hello-peiqi Jan 17, 2025
665ac75
update
hello-peiqi Jan 27, 2025
6d12fbf
restore rerun
hello-peiqi Jan 27, 2025
c9d93f6
restore branch
hello-peiqi Jan 27, 2025
4e31af9
dynamem fix
hello-peiqi Jan 31, 2025
f30ddc9
revert unexpected bugs
hello-peiqi Jan 31, 2025
c9eda02
revert unexpected bugs
hello-peiqi Jan 31, 2025
b7218f0
Merge branch 'main' into hello-peiqi/grapheqa
hello-peiqi Jan 31, 2025
216ddbb
grapheqa push
hello-peiqi Feb 4, 2025
7b37896
Merge branch 'main' into hello-peiqi/grapheqa
hello-peiqi Feb 4, 2025
c9feeba
Merge branch 'hello-peiqi/grapheqa' of https://github.com/hello-robot…
hello-peiqi Feb 4, 2025
b58cd69
debug to make scene grpah building work
hello-peiqi Feb 4, 2025
4a7ea35
update
hello-peiqi Feb 6, 2025
12dbed8
update
hello-peiqi Feb 7, 2025
15c8cef
update
hello-peiqi Feb 7, 2025
8032c56
checkpoint
hello-peiqi Feb 9, 2025
a384867
update
hello-peiqi Feb 11, 2025
a17cc36
update save image
hello-peiqi Feb 12, 2025
232da6e
update save best image
hello-peiqi Feb 15, 2025
b55a649
add planner
hello-peiqi Feb 17, 2025
a18fe80
debug
hello-peiqi Feb 18, 2025
051c4b8
debug planner
hello-peiqi Feb 18, 2025
7511362
write custom question
hello-peiqi Feb 19, 2025
1850efe
add discord support
hello-peiqi Feb 20, 2025
f8676a6
add captioner
hello-peiqi Feb 21, 2025
6d51d32
add deepseek
cpaxton Feb 22, 2025
8888ed8
add qwen deepseek and make support a little cleaner and more
cpaxton Feb 22, 2025
99855dc
updates
cpaxton Feb 22, 2025
1bede03
added qwen quantization and some cleanup
cpaxton Feb 22, 2025
c788f2e
update and add bitsandbytes
cpaxton Feb 22, 2025
8dd3991
update
cpaxton Feb 22, 2025
435b479
update
hello-peiqi Feb 23, 2025
bc4fcbc
suggestions on deepseek update
hello-peiqi Feb 27, 2025
91a1ffa
Merge branch 'cpaxton/deepseek' of https://github.com/hello-robot/str…
hello-peiqi Feb 27, 2025
285b5e9
find better captioner
hello-peiqi Mar 2, 2025
db15156
update installation
hello-peiqi Mar 2, 2025
c6ba6c8
update installation
hello-peiqi Mar 2, 2025
95abd77
fix setup
hello-peiqi Mar 2, 2025
a4d54d3
fix setup
hello-peiqi Mar 3, 2025
db09627
improve running speed
hello-peiqi Mar 3, 2025
5bb2f4a
exploration update
hello-peiqi Mar 7, 2025
5fce257
fix installation
hello-peiqi Mar 10, 2025
2ee6a53
fix installation
hello-peiqi Mar 10, 2025
09f6d84
update scene graph merging
hello-peiqi Mar 12, 2025
45b4a1f
improve performance
hello-peiqi Mar 14, 2025
9f6bac8
minor update
hello-peiqi Mar 14, 2025
1f572ca
pass unit test
hello-peiqi Mar 14, 2025
812d4b8
udpate owlsam
hello-peiqi Mar 14, 2025
b5b19b5
Merge branch 'main' into hello-peiqi/grapheqa_merge
hello-peiqi Mar 14, 2025
14bab14
fix bug
hello-peiqi Mar 14, 2025
7fcd248
Merge branch 'hello-peiqi/grapheqa_merge' of https://github.com/hello…
hello-peiqi Mar 14, 2025
70167f8
update docs
hello-peiqi Mar 16, 2025
9c7f227
add captioners
hello-peiqi Mar 17, 2025
7c16cfb
update
hello-peiqi Mar 19, 2025
718df94
update
hello-peiqi Mar 19, 2025
23a74c5
switch to qwen
hello-peiqi Mar 19, 2025
80d542b
upgrade
hello-peiqi Mar 20, 2025
b165b0b
update
hello-peiqi Mar 25, 2025
3293404
update
hello-peiqi Mar 25, 2025
d8e67bc
update
hello-peiqi Mar 25, 2025
c001ef1
update discord
hello-peiqi Mar 26, 2025
2b1bfa0
fix bug
hello-peiqi Mar 26, 2025
c18a121
minor update
hello-peiqi Mar 28, 2025
0a683c7
minor update in prompt
hello-peiqi Mar 28, 2025
33b229b
Stretch EQA (#263)
hello-peiqi May 2, 2025
c20b982
Merge branch 'main' into hello-peiqi/grapheqa_merge
hello-peiqi May 2, 2025
b1550bb
remove projects folder
hello-peiqi May 2, 2025
7d55bf6
add docs
hello-peiqi May 7, 2025
b27ec84
add video
hello-peiqi May 7, 2025
c6801ff
downsample image
hello-peiqi May 7, 2025
845f15f
small update
hello-peiqi May 7, 2025
d88bad3
switch to large image
hello-peiqi May 7, 2025
62f2272
minor update to documentation
hello-ck May 7, 2025
cae3e48
documentation update
hello-ck May 7, 2025
d342d98
more documentation edits
hello-ck May 7, 2025
a369db7
some docs edits
hello-peiqi May 7, 2025
96c8e4b
documentation edits
hello-ck May 7, 2025
c0d0118
resolve conflict
hello-peiqi May 7, 2025
d7d4e0c
Merge branch 'hello-peiqi/grapheqa_merge' of https://github.com/hello…
hello-peiqi May 7, 2025
4d10072
doc update
hello-peiqi May 7, 2025
0a63963
update docs
hello-peiqi May 9, 2025
0fff6db
update docs
hello-peiqi May 9, 2025
5e6b255
save docstring
hello-peiqi May 9, 2025
a318134
add docsstring
hello-peiqi May 9, 2025
ec7cd06
add docsstring
hello-peiqi May 9, 2025
e995d6a
add docsstring
hello-peiqi May 9, 2025
7b231f0
Hello peiqi/cleanup (#267)
hello-peiqi May 12, 2025
209d04a
update? (#268)
hello-peiqi May 15, 2025
2 changes: 2 additions & 0 deletions README.md
@@ -16,6 +16,7 @@
- LLM agents
- text to speech and speech to text
- visualization and debugging
- embodied question answering

Much of the code is licensed under the Apache 2.0 license. See the [LICENSE](LICENSE) file for more information. Parts of it are derived from the Meta [HomeRobot](https://github.com/facebookresearch/home-robot) project and are licensed under the [MIT license](META_LICENSE).

@@ -121,6 +122,7 @@ Check out additional documentation for ways to use Stretch AI:
- [Add a New LLM Task](docs/adding_a_new_task.md) -- How to add a new task that can be called by an LLM
- [DynaMem](docs/dynamem.md) -- Run the LLM agent in dynamic scenes, meaning you can walk around and place objects as the robot explores
- [Data Collection for Learning from Demonstration](docs/data_collection.md) -- How to collect data for learning from demonstration
- [Embodied Question Answering](docs/eqa.md) -- Allow the robot to explore the environment and answer user questions about it.
- [Learning from Demonstration](docs/learning_from_demonstration.md) -- How to train and evaluate policies with LfD
- [Open-Vocabulary Mobile Manipulation](docs/ovmm.md) -- Experimental code which can handle more complex language commands
- [Apps](docs/apps.md) -- List of many different apps that you can run
1 change: 1 addition & 0 deletions docs/apps.md
@@ -21,6 +21,7 @@ Finally:
- [Dex Teleop data collection](#dex-teleop-for-data-collection) - Dexterously teleoperate the robot to collect demonstration data.
- [Learning from Demonstration (LfD)](learning_from_demonstration.md) - Train SOTA policies using [HuggingFace LeRobot](https://github.com/huggingface/lerobot)
- [Dynamem OVMM system](dynamem.md) - Deploy open vocabulary mobile manipulation system [Dynamem](https://dynamem.github.io)
- [Embodied question answering (EQA) system](eqa.md) - Deploy an embodied question answering system based on the ideas of [GraphEQA](https://grapheqa.github.io)

There are also some apps for [debugging](debug.md).

113 changes: 113 additions & 0 deletions docs/eqa.md
@@ -0,0 +1,113 @@
# The Stretch AI EQA Module

The **Embodied Question Answering (EQA) Module** enables a robot to actively explore its environment, gather visual and spatial data, and answer user questions about what it sees. To do this, the module has the robot explore the environment to acquire useful information, builds a semantic representation of what it observes, and processes questions from the user. Systems like the EQA module have the potential to be used in a variety of applications. For example, they could help people, including people with visual or cognitive impairments, find objects in their home. They could also let people monitor their home while away by asking the robot to check on things.

## Demo Video

[The following video](https://youtu.be/6tHGBYFkyMU) shows Stretch AI EQA running in one of our developers' homes.

_Click this large image to follow the link to YouTube:_

[![A demonstration of the EQA module in action](images/eqa.png)](https://youtu.be/6tHGBYFkyMU)

## Motivation and Methodology

In the prior EQA work [GraphEQA](https://arxiv.org/abs/2412.14480), researchers provided a multimodal large language model (mLLM), such as Google's Gemini or OpenAI's GPT, with a prompt that includes an object-centric semantic scene graph and task-relevant robot image observations. GraphEQA relies on the third-party scene graph module [Hydra](https://arxiv.org/abs/2201.13360), which is built on ROS Noetic; installing it can be difficult due to OS and software version compatibility issues. To provide a more user-friendly alternative, we adapted the methods of GraphEQA for use with existing code in the Stretch AI repo.

In GraphEQA, the mLLM is expected to answer the question based on task-relevant image observations and to plan exploration based on a scene graph string. Stretch AI has capabilities that can serve similar roles. For example, the [DynaMem system](dynamem.md) finds task-relevant images, and VLMs such as [Qwen](../src/stretch/llms/qwen_client.py) and [OpenAI GPT](../src/stretch/llms/openai_client.py) extract visual clues from image observations by listing the featured objects in each image (beds, tables, and so on).

The Stretch AI EQA module builds on these existing capabilities, resulting in a pipeline that only requires a Stretch robot, a GPU machine with 12 GB of VRAM, and an Internet connection for a cloud-based mLLM. The pipeline uses the following models (a minimal loading sketch follows this list):
- A lightweight VLM running on the local workstation. We use `Qwen-VL-2.5-3B` here.
- A vision-language encoder trained in a contrastive manner. We use `SigLip-v1-so400m` here.
- A powerful cloud-based mLLM. We use `gemini-2.5-pro-preview-03-25` here.
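
The snippet below is a minimal sketch of how the two local models could be pulled from the Hugging Face hub with `transformers`. The hub identifiers are assumptions based on the model names above, and the repo's own wrappers (for example [`qwen_client.py`](../src/stretch/llms/qwen_client.py) and the SigLIP encoder class) handle this differently in practice.

```python
# Sketch only: load the two local models used by the EQA pipeline.
# The hub IDs below are assumptions; the Stretch AI clients wrap this logic.
import torch
from transformers import AutoModel, AutoProcessor, Qwen2_5_VLForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# Lightweight VLM for keyword extraction and per-image visual clues.
vlm = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-3B-Instruct", torch_dtype=torch.bfloat16
).to(device)
vlm_processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-3B-Instruct")

# Contrastive vision-language encoder used to build DynaMem's voxel memory.
encoder = AutoModel.from_pretrained("google/siglip-so400m-patch14-384").to(device)
encoder_processor = AutoProcessor.from_pretrained("google/siglip-so400m-patch14-384")
```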

When it receives a new question, the robot follows a recipe to find the answer (a simplified sketch of this loop follows the list):
- Extract keywords from the question using the lightweight VLM. For example, `is there a hand sanitizer near the sink?` yields the keywords `hand sanitizer` and `sink`.
- Pan the head to look around and use DynaMem to add images to a voxel-based semantic memory, which extracts pixel-level vision-language features with the vision-language encoder and projects the 2D pixels onto 3D points in the voxel map.
- Use the lightweight VLM to identify the names of featured objects in the image observations and add these visual clues to a list.
- Query DynaMem to identify a few task-relevant images.
- Gather the image observations selected as task-relevant along with image observations corresponding to unexplored frontiers, and use this information to augment the visual clues.
- Prompt the mLLM with the relevant images and the augmented visual clues to answer the question. Following GraphEQA, we also ask the mLLM to report its confidence in the answer. If the mLLM is not confident, it should also output an image ID indicating the area that should be explored next.
- If no confident answer can be provided, the robot navigates toward the location associated with the selected image ID.
- Iterate the above process until a confident answer can be provided.
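
The pseudocode below is a simplified sketch of this loop. All of the helper names (`extract_keywords`, `add_images`, `query_images`, `frontier_images`, `answer`, `navigate_to_image`) are hypothetical placeholders for illustration; the real APIs live in [`robot_agent_eqa.py`](../src/stretch/agent/robot_agent_eqa.py) and [`voxel_dynamem.py`](../src/stretch/mapping/voxel/voxel_dynamem.py).

```python
# Hypothetical sketch of the EQA loop described above. All helper names are
# placeholders; see src/stretch/agent/robot_agent_eqa.py for the real code.
def answer_question(robot, memory, vlm, mllm, question: str, max_steps: int = 10):
    keywords = vlm.extract_keywords(question)  # e.g. ["hand sanitizer", "sink"]
    visual_clues: list[str] = []
    answer = None

    for _ in range(max_steps):
        # Look around and fold the new observations into the voxel-based memory.
        images = robot.rotate_head_and_capture()
        memory.add_images(images)
        visual_clues += [vlm.list_featured_objects(img) for img in images]

        # Retrieve a few task-relevant images plus views of unexplored frontiers.
        relevant = memory.query_images(keywords)
        frontiers = memory.frontier_images()

        # Ask the cloud mLLM for an answer, a confidence flag, and, if unsure,
        # the ID of the image whose area should be explored next.
        answer, confident, explore_id = mllm.answer(
            question, relevant + frontiers, visual_clues
        )
        if confident:
            return answer
        robot.navigate_to_image(explore_id)

    return answer  # best guess after exhausting the step budget
```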


## Understanding EQA's code structure

This module shares or extends core dependencies (mapping, perception, LLMs) with other Stretch AI modules such as AI Pickup and DynaMem. The following code is relevant to this module:

| File locations | Purpose |
| ----------------------- | ---------------------------------------------------------------- |
| [`src/stretch/app/run_eqa.py`](../src/stretch/app/run_eqa.py) | Entry point for the EQA module |
| [`src/stretch/agent/task/dynamem/dynamem_task.py`](../src/stretch/agent/task/dynamem/dynamem_task.py?plain=1#L409) | An executor wrapper for the EQA module |
| [`src/stretch/agent/robot_agent_eqa.py`](../src/stretch/agent/robot_agent_eqa.py) | Robot agent class containing all useful APIs for question answering |
| [`src/stretch/mapping/voxel/voxel_dynamem.py`](../src/stretch/mapping/voxel/voxel_dynamem.py#928) | EQA utilities added to [DynaMem voxel.py](../src/stretch/mapping/voxel/voxel_dynamem.py) |

## Instructions

### Installation and preparation

The very first step is to install all necessary packages on both your Stretch robot and your workstation, following [these instructions](./install_details.md).

Next, install the Gemini SDK following [Google's docs](https://ai.google.dev/gemini-api/docs/quickstart?lang=python) and obtain a Google API key on Tier 1. Tier 1 is a pay-as-you-go tier.

**So be careful: you will be charged for Gemini model usage while running the EQA module!**

That said, Gemini usage in this module is fairly cheap. You can check the [pricing](https://ai.google.dev/gemini-api/docs/pricing) and [rate limits](https://ai.google.dev/gemini-api/docs/rate-limits) for `gemini-2.5-pro-preview-03-25`.
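
Before starting the robot, you can sanity-check that your key works with a minimal query through the `google-genai` package listed in `setup.py`. This is only a sketch; the prompt is arbitrary, and you can substitute any Gemini model you have access to.

```python
# Minimal sanity check for the Gemini API key (a sketch, not part of the EQA code).
import os

from google import genai

client = genai.Client(api_key=os.environ["GOOGLE_API_KEY"])
response = client.models.generate_content(
    model="gemini-2.5-pro-preview-03-25",
    contents="Reply with the single word: ready",
)
print(response.text)
```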

If you also want to try the Discord bot, which offers a friendlier interface than the plain terminal and command line, install its dependencies and obtain your Discord token by following [discord_bot.md](./discord_bot.md).

### Run EQA module

Launch the EQA agent via the `run_eqa` entry point. By default, the robot first rotates in place to scan its surroundings and a Rerun window opens (the Rerun contents are not saved automatically, so once you close the window you lose the visualization data); you are then asked to enter your questions in the terminal.

You need to know the IP address of your robot in order to send it commands. Once you know your `ROBOT_IP`, you can run the following commands to try the EQA module.

You also need to set your Gemini key before running the EQA scripts:

```bash
export GOOGLE_API_KEY=$YOUR_GEMINI_TOKEN
```

If you also want to try the Discord bot, set your Discord token as well:

```bash
export DISCORD_TOKEN=$YOUR_DISCORD_TOKEN
```

Then launch the EQA module:

```bash
python -m stretch.app.run_eqa --robot_ip $ROBOT_IP
```

Other options:

- `--not_rotate_in_place`, `-N`: skip the initial rotation-in-place scan
- `--discord`, `-D`: launch the Discord bot for a nicer interface than the terminal and command line
- `--save_rerun`, `--SR`: save Rerun logs to `dynamem_log/debug_*` as an `.rrd` file for offline replay (live streaming to the Rerun window is disabled)

**Example runs**:
Assume your robot IP is `192.168.1.42`.
* Skip initial rotation-in-place scan:

```bash
python -m stretch.app.run_eqa --robot_ip 192.168.1.42 -N
```
* Enable Discord for remote users:

```bash
python -m stretch.app.run_eqa --robot_ip 192.168.1.42 -D
```
* Skip the initial rotation-in-place scan, save the Rerun visualization, and enable Discord:

```bash
python -m stretch.app.run_eqa --robot_ip 192.168.1.42 -N -D --SR
```


## Contributing

This is an active component within the Stretch repository. Please follow the main [CONTRIBUTING.md](./CONTRIBUTING.md) guidelines for branching, testing, and pull requests.

---

*Last updated: May 2025*
3 changes: 3 additions & 0 deletions docs/images/eqa.png
Binary file not shown.
4 changes: 2 additions & 2 deletions install.sh
@@ -204,13 +204,13 @@ else
echo "Install detectron2 for perception (required by Detic)"
git submodule update --init --recursive
cd third_party/detectron2
pip install -e .
python -m pip install -e .

echo "Install Detic for perception"
cd ../../src/stretch/perception/detection/detic/Detic
# Make sure it's up to date
git submodule update --init --recursive
pip install -r requirements.txt
python -m pip install -r requirements.txt

# cd ../../src/stretch/perception/detection/detic/Detic
# Create folder for checkpoints and download
16 changes: 12 additions & 4 deletions src/setup.py
@@ -53,24 +53,32 @@
# From openai
"openai",
"openai-clip",
# For gemini
"google-genai",
# For Yolo
# "ultralytics",
# Hardware dependencies
"hello-robot-stretch-urdf",
"pyrealsense2",
"urchin",
# Visualization
"rerun-sdk>=0.18.0",
"rerun-sdk==0.18.0",
# For siglip encoder
"sentencepiece",
# For git tools
"gitpython",
# Configuration tools and neural networks
"hydra-core",
"timm>1.0.0",
"huggingface_hub[cli]",
"transformers>=4.39.2",
"accelerate",
"huggingface_hub[cli]>=0.24.7",
# "flash-attn",
"transformers>=4.50.0",
"retry",
"qwen_vl_utils",
"bitsandbytes",
"autoawq>=0.1.5",
"triton >= 3.0.0",
"accelerate >= 1.5.0",
"einops",
# Meta neural nets
"segment-anything",
82 changes: 32 additions & 50 deletions src/stretch/agent/robot_agent_dynamem.py
@@ -20,7 +20,6 @@
from typing import Any, Dict, List, Optional, Union
from uuid import uuid4

import cv2
import numpy as np
import rerun as rr
import rerun.blueprint as rrb
@@ -63,7 +62,10 @@


class RobotAgent(RobotAgentBase):
"""Basic demo code. Collects everything that we need to make this work."""
"""
Extends the basic demo agent in robot_agent.py with new functionality that implements DynaMem.
https://dynamem.github.io
"""

def __init__(
self,
@@ -76,7 +78,6 @@ def __init__(
show_instances_detected: bool = False,
use_instance_memory: bool = False,
realtime_updates: bool = False,
obs_sub_port: int = 4450,
re: int = 3,
manip_port: int = 5557,
log: Optional[str] = None,
@@ -108,12 +109,6 @@ def __init__(
# For placing
self.owl_sam_detector = None

# if self.parameters.get("encoder", None) is not None:
# self.encoder: BaseImageTextEncoder = get_encoder(
# self.parameters["encoder"], self.parameters.get("encoder_args", {})
# )
# else:
# self.encoder: BaseImageTextEncoder = None
self.device = "cuda" if torch.cuda.is_available() else "cpu"

if not os.path.exists("dynamem_log"):
@@ -170,13 +165,6 @@ def __init__(
self._manipulation_radius = parameters["motion_planner"]["goals"]["manipulation_radius"]
self._voxel_size = parameters["voxel_size"]

# self.image_processor = VoxelMapImageProcessor(
# rerun=True,
# rerun_visualizer=self.robot._rerun,
# log="dynamem_log/" + datetime.now().strftime("%Y%m%d_%H%M%S"),
# robot=self.robot,
# ) # type: ignore
# self.encoder = self.image_processor.get_encoder()
context = zmq.Context()
self.manip_socket = context.socket(zmq.REQ)
self.manip_socket.connect("tcp://" + server_ip + ":" + str(manip_port))
@@ -208,6 +196,9 @@ def __init__(
self._start_threads()

def create_obstacle_map(self, parameters):
"""
Creates the MaskSiglipEncoder, the Owlv2 detector, the voxel map utility class, and the voxel map navigation space utility class.
"""
if self.manipulation_only:
self.encoder = None
else:
@@ -249,7 +240,6 @@ def create_obstacle_map(self, parameters):
smooth_kernel_size=parameters.get("filters/smooth_kernel_size", -1),
use_median_filter=parameters.get("filters/use_median_filter", False),
median_filter_size=parameters.get("filters/median_filter_size", 5),
median_filter_max_error=parameters.get("filters/median_filter_max_error", 0.01),
use_derivative_filter=parameters.get("filters/use_derivative_filter", False),
derivative_filter_threshold=parameters.get("filters/derivative_filter_threshold", 0.5),
detection=self.detection_model,
@@ -267,6 +257,9 @@ def create_obstacle_map(self, parameters):
self.planner = AStar(self.space)

def setup_custom_blueprint(self):
"""
Defines the Rerun blueprint for the DynaMem module.
"""
main = rrb.Horizontal(
rrb.Spatial3DView(name="3D View", origin="world"),
rrb.Vertical(
@@ -285,35 +278,14 @@ def setup_custom_blueprint(self):
)
rr.send_blueprint(my_blueprint)

def compute_blur_metric(self, image):
def update_map_with_pose_graph(self):
"""
Computes a blurriness metric for an image tensor using gradient magnitudes.

Parameters:
- image (torch.Tensor): The input image tensor. Shape is [H, W, C].
Update our voxel map using a pose graph. Used for realtime updates.
By default DynaMem asks the robot to stop to take new observations, so this function will not be called.

Returns:
- blur_metric (float): The computed blurriness metric.
"""

# Convert the image to grayscale
gray_image = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Compute gradients using the Sobel operator
Gx = cv2.Sobel(gray_image, cv2.CV_64F, 1, 0, ksize=3)
Gy = cv2.Sobel(gray_image, cv2.CV_64F, 0, 1, ksize=3)

# Compute gradient magnitude
G = cv2.magnitude(Gx, Gy)

# Compute the mean of gradient magnitudes
blur_metric = G.mean()

return blur_metric

def update_map_with_pose_graph(self):
"""Update our voxel map using a pose graph"""

t0 = timeit.default_timer()
self.pose_graph = self.robot.get_pose_graph()

@@ -392,10 +364,7 @@ def update_map_with_pose_graph(self):
# if obs.is_pose_graph_node:
# self.voxel_map.add_obs(obs)
if len(self.obs_history) > 0:
obs_history = self.obs_history[-5:]
blurness = [self.compute_blur_metric(obs.rgb) for obs in obs_history]
obs = obs_history[blurness.index(max(blurness))]
# obs = self.obs_history[-1]
obs = self.obs_history[-1]
else:
obs = None

@@ -462,6 +431,10 @@ def update(self):
)

def look_around(self):
"""
Let the robot look around to check its surroundings.
Rotates the robot head to compensate for the narrow field of view of the RealSense head camera.
"""
print("*" * 10, "Look around to check", "*" * 10)
for pan in [0.6, -0.2, -1.0, -1.8]:
tilt = -0.6
@@ -482,6 +455,7 @@ def execute_action(
self,
text: str,
):
""" """
if not self._realtime_updates:
self.robot.look_front()
self.look_around()
@@ -497,6 +471,8 @@

if len(res) > 0:
print("Plan successful!")
# This means the robot has already finished all of its trajectories and should stop to manipulate the object.
# Two NaN values are appended to the trajectory to denote that the robot is reaching the target point.
if len(res) >= 2 and np.isnan(res[-2]).all():
if len(res) > 2:
self.robot.execute_trajectory(
@@ -507,9 +483,8 @@
)

return True, res[-1]
# The robot has not reached the object. Next it should look around and continue navigation
else:
# print(res)
# res[-1][2] += np.pi / 2
self.robot.execute_trajectory(
res,
pos_err_threshold=self.pos_err_threshold,
@@ -522,10 +497,12 @@
return None, None

def run_exploration(self):
"""Go through exploration. We use the voxel_grid map created by our collector to sample free space, and then use our motion planner (RRT for now) to get there. At the end, we plan back to (0,0,0).
"""
Go through exploration when the robot has not received any text query from the user.
We use the voxel_grid map created by our collector to sample free space, and then use the A* planner to get there.
"""

Args:
visualize(bool): true if we should do intermediate debug visualizations"""
# "" means the robot has not received any text query from the user and should conduct exploration just to better know the environment
status, _ = self.execute_action("")
if status is None:
print("Exploration failed! Perhaps nowhere to explore!")
@@ -668,6 +645,11 @@ def process_text(self, text, start_pose):
return traj

def navigate(self, text, max_step=10):
"""
The robot calls this function to navigate to the object.
It calls execute_action repeatedly until the robot is ready for manipulation.
"""
# Start a new rerun recording to avoid an overly large rerun video.
rr.init("Stretch_robot", recording_id=uuid4(), spawn=True)
finished = False
step = 0