Speechrobot

Specifically, by grounding objects and their spatial relations, we allow the specification of complex placement instructions, for example "place it behind the middle red bowl". Our results, obtained on a real-world PR2 robot, demonstrate the effectiveness of our method in understanding pick-and-place language instructions and sequentially composing them to solve tabletop manipulation tasks.

Our approach consists of two neural networks. The first network learns to segment objects in a scene and to comprehend and generate referring expressions. The second network estimates pixelwise object placement probabilities for a set of spatial relations given an input image and a reference object. The interplay between both networks allows for an effective grounding of object semantics and their spatial relationships, without assuming a predefined set of object categories.
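The interplay between the two networks can be pictured as a two-stage pipeline. The sketch below is illustrative only and is not the authors' implementation: `ground_reference`, `place`, `Candidate`, and `OFFSETS` are hypothetical names, and both learned networks are replaced by toy rules (word overlap for grounding, a fixed pixel offset per relation for placement) so that the control flow is runnable.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Candidate:
    """Hypothetical object candidate: a text description and a pixel centre (u, v)."""
    description: str
    center: Tuple[int, int]


# Toy stand-in for the placement network: one fixed pixel offset per relation.
# Assumes image coordinates in which v grows towards the camera.
OFFSETS: Dict[str, Tuple[int, int]] = {
    "left": (-40, 0), "right": (40, 0), "behind": (0, -40), "front": (0, 40),
}


def ground_reference(instruction: str, candidates: List[Candidate]) -> Candidate:
    """Toy stand-in for the grounding network: choose the candidate whose
    description shares the most words with the instruction."""
    words = set(instruction.lower().split())
    return max(candidates, key=lambda c: len(words & set(c.description.split())))


def place(instruction: str, candidates: List[Candidate]) -> Tuple[int, int]:
    """Stage 1 grounds the reference object, stage 2 maps the spatial
    relation to a placement location relative to that object."""
    words = set(instruction.lower().split())
    reference = ground_reference(instruction, candidates)
    relation = next(r for r in OFFSETS if r in words)
    u, v = reference.center
    du, dv = OFFSETS[relation]
    return (u + du, v + dv)


scene = [Candidate("red bowl", (100, 120)), Candidate("blue cup", (200, 120))]
print(place("place it behind the red bowl", scene))  # (100, 80)
```

In the system described here, the first stage works on open-ended referring expressions rather than a fixed vocabulary, and the second stage outputs a pixelwise probability map rather than a single offset.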

Figure: Overview of the system architecture. Our grounding network processes the input sentence and the visual object candidates detected with Mask R-CNN and performs referring expression comprehension. Additionally, it generates a referring expression for each object candidate to disambiguate unclear instructions. Once the reference object of a relative placement instruction has been identified, a second network predicts object placement locations for a set of spatial relations.

Once an object has been picked, our system needs to be able to place it in accordance with the instructions from the human operator. We combine referring expression comprehension with the grounding of spatial relations to enable complex object placement commands such as "place the ball inside the left box". Given an input image of the scene and the location of the reference object, identified with our aforementioned grounding module, we generate pixelwise object placement probabilities for a set of spatial relations by leveraging the Spatial-RelNet architecture we introduced in our previous work.
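To make the output format concrete, the following minimal sketch shows one way a placement pixel could be read off such per-relation probability maps. It assumes the maps are available as nested lists indexed `prob_map[v][u]`; the function name `best_placement` and the tiny 3x3 example maps are made up for illustration and are not taken from Spatial-RelNet.

```python
from typing import Dict, List, Tuple

ProbMap = List[List[float]]  # assumed layout: prob_map[v][u] (row, column)


def best_placement(maps: Dict[str, ProbMap], relation: str) -> Tuple[int, int]:
    """Return the (u, v) pixel with the highest probability for one relation."""
    prob_map = maps[relation]
    pixels = ((u, v) for v, row in enumerate(prob_map) for u in range(len(row)))
    return max(pixels, key=lambda uv: prob_map[uv[1]][uv[0]])


# Made-up 3x3 maps, one per spatial relation, for a reference object in the centre.
maps = {
    "inside": [[0.0, 0.1, 0.0], [0.1, 0.6, 0.1], [0.0, 0.1, 0.0]],
    "left":   [[0.0, 0.0, 0.0], [0.7, 0.2, 0.1], [0.0, 0.0, 0.0]],
}
print(best_placement(maps, "inside"))  # (1, 1)
print(best_placement(maps, "left"))    # (0, 1)
```

Taking the most likely pixel is only one possible choice; a real system might additionally check reachability or collisions before committing to a location.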

Figure: Our Spatial-RelNet network processes the input RGB image and an object attention mask to produce pixelwise probability maps over a set of spatial relations. During training, we sample locations (u, v) from the predicted distributions, implant high-level object features at the sampled locations inside an auxiliary classifier network, and classify the resulting hallucinated scene representation to obtain a learning signal for Spatial-RelNet. At test time, the auxiliary network is not used.

As natural language placement instructions do not uniquely identify a location in a scene, Spatial-RelNet predicts non-parametric distributions to capture the inherent ambiguity. A key challenge in learning such pixelwise spatial distributions is the lack of ground-truth data. Spatial-RelNet overcomes this problem with a novel auxiliary learning formulation. Concretely, it implants high-level features of objects at different locations and classifies the resulting hallucinated scene representations to obtain a learning signal.
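The value of a non-parametric output can be illustrated with a small sampling sketch. It assumes the predicted distribution is given as a nested list of non-negative weights indexed `prob_map[v][u]`; `sample_locations` is a hypothetical helper, and the two-mode map is invented to show that several distinct locations can satisfy the same instruction. The feature implanting and auxiliary classification steps are not reproduced here.

```python
import random
from typing import List, Tuple


def sample_locations(prob_map: List[List[float]], n: int, seed: int = 0) -> List[Tuple[int, int]]:
    """Draw n pixel locations (u, v) with probability proportional to prob_map[v][u]."""
    rng = random.Random(seed)
    width = len(prob_map[0])
    weights = [p for row in prob_map for p in row]  # row-major flattening
    indices = rng.choices(range(len(weights)), weights=weights, k=n)
    return [(i % width, i // width) for i in indices]


# Invented map with two equally valid placement regions in the top row.
bimodal = [
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 0.0, 0.0, 0.0],
]
samples = sample_locations(bimodal, n=200)
print(sorted(set(samples)))  # only the two non-zero pixels can appear
```

A single parametric mode, such as one Gaussian, would place most of its mass between the two regions, which is exactly where the map above assigns zero probability.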

Composing Pick-and-Place Tasks By Grounding Language
Oier Mees, Wolfram Burgard
International Symposium on Experimental Robotics (ISER) 2021

Learning Object Placements For Relational Instructions by Hallucinating Scene Representations
Oier Mees, Alp Emek, Johan Vertens, Wolfram Burgard
IEEE International Conference on Robotics and Automation (ICRA) 2020

Technical Approach

Relational Object Placement

Qualitative Results

Table Setting

Failure Cases

Videos

Dataset

Publications

People

Oier Mees

Wolfram Burgard