Controlling robots to perform tasks via natural language is one of the most challenging topics in human-robot interaction. In this work, we present a robot system that follows unconstrained language instructions to pick and place arbitrary objects and effectively resolves ambiguities through dialogues. Our approach infers objects and their relationships from input images and language expressions and can place objects in accordance with the spatial relations expressed by the user. Unlike previous approaches, we consider grounding not only for the picking but also for the placement of everyday objects from language.

Specifically, by grounding objects and their spatial relations, we allow specification of complex placement instructions, e.g. "place it behind the middle red bowl". Our results obtained using a real-world PR2 robot demonstrate the effectiveness of our method in understanding pick-and-place language instructions and sequentially composing them to solve tabletop manipulation tasks.

Technical Approach

Our approach consists of two neural networks. The first network learns to segment objects in a scene and to comprehend and generate referring expressions. The second network estimates pixelwise object placement probabilities for a set of spatial relations given an input image and a reference object. The interplay between both networks allows for an effective grounding of object semantics and their spatial relationships, without assuming a predefined set of object categories.

Network architecture
Figure: Overview of the system architecture. Our grounding network processes the input sentence and the visual object candidates detected with Mask R-CNN and performs referring expression comprehension. It also generates a referring expression for each object candidate, which is used to disambiguate unclear instructions. Once the reference object of a relative placement instruction has been identified, a second network predicts object placement locations for a set of spatial relations.
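The disambiguation step can be illustrated with a minimal sketch. It assumes the grounding network returns one comprehension score per object candidate and one generated description per candidate. The function names, the softmax normalization, and the `margin` threshold are hypothetical choices made for illustration, not the decision rule used in the paper.

```python
import numpy as np


def resolve_reference(scores, margin=0.2):
    """Return (index, None) for a clear match, or (None, candidates) if ambiguous.

    scores: one comprehension score per object candidate (higher = better match).
    margin: hypothetical probability gap below which candidates count as tied.
    """
    scores = np.asarray(scores, dtype=float)
    # Softmax turns raw scores into a distribution over the candidates.
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    best = int(np.argmax(probs))
    # Every candidate within `margin` of the best one is a plausible referent.
    tied = [i for i, p in enumerate(probs) if probs[best] - p < margin]
    if len(tied) == 1:
        return best, None
    return None, tied


def clarification_question(candidates, descriptions):
    """Build a question from the generated description of each tied candidate."""
    options = " or ".join(descriptions[i] for i in candidates)
    return f"Do you mean {options}?"
```

With scores such as `[2.0, 1.9, -3.0]`, the first two candidates are nearly tied, so the sketch returns both and the robot would ask the user to choose between their descriptions.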

Relational Object Placement

Once an object has been picked, our system needs to be able to place it in accordance with the instructions from the human operator. We combine referring expression comprehension with the grounding of spatial relations to enable complex object placement commands such as "place the ball inside the left box". Given an input image of the scene and the location of the reference item, identified with our aforementioned grounding module, we generate pixelwise object placement probabilities for a set of spatial relations by leveraging the Spatial-RelNet architecture we introduced in our previous work.
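As a rough illustration of how the pixelwise output could be consumed, the sketch below picks the most probable placement pixel for a requested relation. It assumes the network output is an array of shape `(num_relations, H, W)`. The relation names and their order, the function name, and the `(u, v) = (column, row)` convention are assumptions made for this example.

```python
import numpy as np

# Hypothetical relation vocabulary and ordering, for illustration only.
RELATIONS = ("left", "right", "behind", "in front", "inside", "on top")


def best_placement(prob_maps, relation, relations=RELATIONS):
    """Return the most probable pixel (u, v) for the given spatial relation.

    prob_maps: array of shape (num_relations, H, W), one map per relation.
    """
    heatmap = prob_maps[relations.index(relation)]
    # argmax works on the flattened map; unravel_index recovers (row, col).
    row, col = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return int(col), int(row)
```

Taking the argmax is only one option. Because the maps are distributions, a system could also sample from them or filter out pixels that are not reachable by the robot.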

Network architecture
Figure: Our Spatial-RelNet network processes the input RGB image and an object attention mask to produce pixelwise probability maps over a set of spatial relations. During training, we sample locations (u, v) from the predicted distributions, implant high-level object features at the sampled locations inside an auxiliary classifier network, and classify the resulting hallucinated scene representation to obtain a learning signal for Spatial-RelNet. At test time, the auxiliary network is not used.

As natural language placement instructions do not uniquely identify a location in a scene, Spatial-RelNet predicts non-parametric distributions to capture the inherent ambiguity. A key challenge to learning such pixelwise spatial distributions is the lack of ground-truth data. Spatial-RelNet overcomes this problem by leveraging a novel auxiliary learning formulation. Concretely, it classifies hallucinated scene representations by implanting high-level features of objects at different locations to get a learning signal.
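The two ingredients described above, sampling from a non-parametric distribution and implanting object features at the sampled location, can be sketched in NumPy. This is a simplified illustration under stated assumptions: the function names are hypothetical, the features are overwritten at a single pixel, and the sketch does not show how gradients are passed through the sampling step during training.

```python
import numpy as np


def sample_locations(heatmap, num_samples, rng):
    """Draw (u, v) pixel locations from a non-negative heatmap of shape (H, W)."""
    # Normalize so the flattened map is a valid categorical distribution.
    probs = heatmap.ravel() / heatmap.sum()
    flat = rng.choice(probs.size, size=num_samples, p=probs)
    rows, cols = np.unravel_index(flat, heatmap.shape)
    # Assumed convention: u is the column index, v is the row index.
    return np.stack([cols, rows], axis=1)


def implant_features(scene_features, object_features, u, v):
    """Return a copy of scene_features (C, H, W) with object_features (C,) at (u, v)."""
    hallucinated = scene_features.copy()
    hallucinated[:, v, u] = object_features
    return hallucinated
```

An auxiliary classifier would then take the hallucinated feature map as input and predict which spatial relation holds between the implanted object and the reference object.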



Composing Pick-and-Place Tasks By Grounding Language
Oier Mees, Wolfram Burgard
International Symposium on Experimental Robotics (ISER) 2021

Learning Object Placements For Relational Instructions by Hallucinating Scene Representations
Oier Mees, Alp Emek, Johan Vertens, Wolfram Burgard
IEEE International Conference on Robotics and Automation (ICRA) 2020