How Robots Can Identify Objects in Cluttered Spaces

By Colette Barron. Last updated Feb 19, 2024.

If a robot can’t “see” a target object amid clutter, there’s a chance it will be confused. Robots lack what psychologists call “object unity,” the ability to identify a thing even when not all of it is visible.

Researchers at the University of Washington have developed a method called THOR that teaches robots to identify objects on a cluttered shelf. In a recent paper published in IEEE Transactions on Robotics, they demonstrated that THOR outperformed current state-of-the-art models.

Robots sense their surroundings using one or more types of sensors and “see” things using standard color cameras or more complex stereo or depth cameras. However good these sensors are, they do not by themselves enable a robot to make “sense” of its surroundings. A robot needs a visual perception system to process the images and work out where objects are, estimate their orientation, identify them, and parse any text written on them.

Two challenges stand in the way. First, when viewing many objects of varying shapes and sizes, robots have difficulty distinguishing between the different object types. Second, when several objects sit close together, some of them block the robot’s view of others.

Humans know that a partially visible object is not broken or an entirely new object because we use the shapes of objects in a scene to build a 3D representation of each object in our minds. THOR mimics this approach, then uses topology to assign each object to a “most likely” object class by comparing its 3D representation to a library of stored representations.
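To make the idea of matching against a library concrete, here is a minimal Python sketch. It is not THOR’s implementation: the article does not spell out THOR’s topological descriptor, so the sketch swaps in a much simpler stand-in (a histogram of distances between random pairs of points on the object), and every name in it (shape_descriptor, classify, sample_ball, sample_rod) is hypothetical. It only illustrates the general pattern of describing an object’s 3D shape and then picking the closest stored description.

```python
import math
import random


def shape_descriptor(points, bins=8, pairs=2000, seed=0):
    """Stand-in descriptor (NOT THOR's): histogram of distances between
    randomly chosen point pairs, scaled by the largest sampled distance."""
    rng = random.Random(seed)
    dists = [math.dist(*rng.sample(points, 2)) for _ in range(pairs)]
    longest = max(dists) or 1.0  # guard against all-identical points
    hist = [0] * bins
    for d in dists:
        hist[min(int(d / longest * bins), bins - 1)] += 1
    return [count / pairs for count in hist]


def classify(points, library):
    """Return the label whose stored descriptor is nearest to the query's."""
    query = shape_descriptor(points)
    return min(library, key=lambda label: math.dist(query, library[label]))


def sample_ball(n, rng):
    """Random points on the surface of a unit sphere (toy 'ball' object)."""
    pts = []
    for _ in range(n):
        v = [rng.gauss(0, 1) for _ in range(3)]
        norm = math.hypot(*v)
        pts.append(tuple(c / norm for c in v))
    return pts


def sample_rod(n, rng):
    """Random points along a thin, slightly noisy rod of length 1."""
    return [(rng.random(), rng.gauss(0, 0.01), rng.gauss(0, 0.01))
            for _ in range(n)]


rng = random.Random(42)

# The "library of stored representations": one descriptor per known class.
library = {
    "ball": shape_descriptor(sample_ball(300, rng)),
    "rod": shape_descriptor(sample_rod(300, rng)),
}

# A fresh ball with one side hidden, as if another object blocked the camera.
occluded_ball = [p for p in sample_ball(400, rng) if p[0] < 0.6]
print(classify(occluded_ball, library))
```

The stand-in descriptor is chosen only because it fits in a few lines. A real system would need a representation that changes little when part of the object is hidden, which is the role the article attributes to THOR’s topology-based 3D representation.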

THOR does not rely on training machine learning models with images of cluttered rooms, nor does it require the robot to have specialized and expensive sensors or processors. As a result, THOR is easy to build and can be used right away in completely new spaces with diverse backgrounds, lighting conditions, object arrangements, and degrees of clutter. It also works better than existing 3D shape-based recognition methods because its 3D representation of the objects is more detailed, which helps it identify objects in real time.