Computer vision is the set of techniques that lets a computer identify objects in an image, for example, distinguishing a car from a bicycle. Training artificial intelligence systems with this capability normally requires labeling a large number of images, which takes many hours of work. That now changes with a new MIT algorithm.
MIT CSAIL researchers, in collaboration with Cornell University and Microsoft, developed STEGO (Self-supervised Transformer with Energy-based Graph Optimization). The recently unveiled algorithm can label every pixel of an image, which makes computer vision faster and easier.
The technique used in STEGO is called semantic segmentation, which applies a label to each set of similar pixels in the image to give artificial intelligence a more accurate view of the scene.
While an image can contain millions of pixels, STEGO cuts down on the labeling work by identifying similar objects across a visual dataset. “Thus, associating similar objects helps to build a consistent world view from multiple training images,” the researchers stated.
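The idea of semantic segmentation, assigning every pixel a class label so that similar pixels end up in the same group, can be illustrated with a toy sketch. This is not the STEGO algorithm itself (STEGO uses self-supervised transformer features and graph optimization); it is a minimal, hypothetical example that clusters raw pixel colors with a naive k-means loop so that similar pixels share a label:

```python
import numpy as np

def segment(image, k=2, iters=10):
    """Assign each pixel one of k labels by clustering its feature
    vector (here, just its color) with a naive k-means loop."""
    h, w, c = image.shape
    feats = image.reshape(-1, c).astype(float)
    # Deterministic init: pick k evenly spaced pixels as starting centers.
    idx = np.linspace(0, len(feats) - 1, k).astype(int)
    centers = feats[idx].copy()
    for _ in range(iters):
        # Assign each pixel to its nearest cluster center.
        d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # Move each center to the mean of its assigned pixels.
        for j in range(k):
            if (labels == j).any():
                centers[j] = feats[labels == j].mean(0)
    return labels.reshape(h, w)

# Synthetic 4x4 "image": left half dark, right half bright.
img = np.zeros((4, 4, 3))
img[:, 2:] = 1.0
seg = segment(img, k=2)
# Each half of the image ends up as one contiguous segment.
```

Real systems, STEGO included, replace the raw colors with learned deep features, so that "similar" means semantically similar rather than merely similar in color.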
In some scenarios, computer vision can see better than humans
“When we look at oncology scans, the surface of planets, or high-resolution images, it is difficult to know which objects to identify without specific knowledge. In emerging domains, even expert humans can’t tell what the relevant objects are,” says Mark Hamilton, an MIT doctoral student and Microsoft software engineer. “In these situations, in which you want to design a method for new frontiers in science, it is better to rely first on the eyes of machines rather than humans.”
STEGO has been tested on many types of scenes, from aerial photographs to the views a driver would see. In each set of images, the algorithm was able to identify and classify the relevant objects, coming close to human judgment. In one of its best results, on a highly diverse set of images, the algorithm impressed with its level of detail. Earlier systems saw humans as indistinct blobs, mistook a motorcycle for a person, and failed to recognize geese; STEGO recognized all of them, along with animals, buildings, furniture, and other objects.
The work is described in a paper by Hamilton, MIT CSAIL doctoral student Zhoutong Zhang, assistant professor Bharath Hariharan of Cornell University, associate professor Noah Snavely of Cornell Tech, and MIT professor William T. Freeman. The paper will be presented at the International Conference on Learning Representations (ICLR) 2022.