Understanding which objects are in a scene and where they are located is a standard task in computer vision. Training a 3D spatial recognition system usually involves capturing a scene with sensors and then manually annotating the spatial extent and location of each object with a 3D box. Manual labeling is a popular and powerful way to train AI models, but it is very time consuming: even for a small indoor 3D scene, drawing and annotating the boxes takes more than 20 minutes on average. Labeling would be faster and easier if a 3D scene needed no boxes at all, only a set of scene-level labels (such as a list of the objects present in the scene).
Therefore, researchers at Facebook AI Research and the University of Illinois at Urbana-Champaign propose performing 3D spatial recognition without spatially labeled 3D training data.
In the paper titled “Recognizing 3D spaces without spatial labels”, the team asks essentially the following question: is it possible to train with only scene-level labels (e.g., a list of the objects present in the scene) as supervision and then perform spatial recognition, such as detecting and segmenting objects, in 3D data (e.g., point clouds)?
The paper proposes WyPR, which the team shows can learn an efficient representation for this weakly supervised problem by jointly addressing segmentation and detection, two tasks that naturally constrain each other. WyPR combines advances in 2D weakly supervised learning with the unique properties of 3D point cloud data. Experiments show that on the challenging ScanNet dataset it achieves 6% higher mIoU than previous techniques. The researchers note that this establishes a new benchmark and baseline for future research.
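For readers unfamiliar with the metric, mIoU (mean intersection over union) averages, over classes, the overlap between predicted and ground-truth point labels. The sketch below is a minimal illustration on a toy list of per-point class indices; the function name `mean_iou` and the choice to skip classes that appear in neither list are assumptions made here, not details of the ScanNet evaluation protocol.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union across classes.

    pred, target: equal-length lists of integer class indices, one per point.
    Classes absent from both lists are skipped (an assumption of this sketch).
    """
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six points, three classes.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 0]
# Per-class IoU is 1/3, 2/3 and 1/2, so the mean is approximately 0.5.
print(mean_iou(pred, target, num_classes=3))
```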
Why it matters
Spatial understanding of 3D scenes is important for a variety of downstream tasks, such as using an AR device to project a colleague so that they appear to be sitting at your dining room table. WyPR gives a model this spatial 3D understanding while eliminating the need to label training scenes at the point level, a very time-consuming process. By lowering the training data barrier and enabling finer-grained understanding across a large number of classes, WyPR can make it easier for systems to understand 3D scenes.
Recognizing 3D objects (i.e., segmentation and detection) is a key step towards scene understanding. With the development of consumer-level depth sensors and advances in computer vision algorithms, 3D data acquisition has become more convenient and less expensive. However, existing 3D recognition systems often do not scale because they rely on strong supervision, such as point-level semantic labels or 3D bounding boxes, both of which are time consuming to acquire. For example, although the popular large-scale indoor 3D dataset ScanNet was collected by only 20 people, the annotation effort involved more than 500 annotators and took about 22.3 minutes per scan on average. In addition, because annotation is so costly, existing 3D object detection datasets are limited to a small number of object classes. This time-consuming labeling process is a major bottleneck preventing the community from scaling up 3D recognition.
How it works
WyPR first uses standard 3D deep learning techniques to extract point-level feature representations from the input. To obtain the object segmentation, it classifies each point into an object class. Instead of training this part of the network with point-level supervision, WyPR uses multiple instance learning (MIL) and self-supervised objectives.
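The MIL idea can be sketched roughly as follows: per-point class scores are pooled into a single scene-level score per class, and only that pooled score is compared with the scene-level label. The sketch below is an illustration under stated assumptions, not the paper's implementation: average pooling, the binary cross-entropy loss, and the name `mil_scene_loss` are all choices made here for clarity.

```python
import math


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def mil_scene_loss(point_logits, scene_labels):
    """Scene-level loss computed from per-point logits.

    point_logits: N lists of C class logits, one list per point.
    scene_labels: C values in {0, 1}; 1 means the class is present in the scene.
    Average pooling over points is an assumption of this sketch.
    """
    num_points = len(point_logits)
    num_classes = len(scene_labels)
    loss = 0.0
    for c in range(num_classes):
        # Pool point-level evidence into one scene-level logit for class c.
        scene_logit = sum(p[c] for p in point_logits) / num_points
        prob = min(max(sigmoid(scene_logit), 1e-7), 1.0 - 1e-7)
        y = scene_labels[c]
        # Binary cross-entropy against the scene-level label only.
        loss -= y * math.log(prob) + (1 - y) * math.log(1.0 - prob)
    return loss / num_classes


# Three points, two classes; the points mostly vote for class 0.
logits = [[4.0, -3.0], [2.0, -5.0], [3.0, -4.0]]
print(mil_scene_loss(logits, [1, 0]))  # small: scene label matches the votes
print(mil_scene_loss(logits, [0, 1]))  # large: scene label contradicts them
```

No individual point ever receives a label in this setup; the gradient of the scene-level loss is what pushes point-level scores toward the classes known to be present.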
Next, to obtain object bounding boxes, it uses a new 3D object proposal technique inspired by selective search, called geometric selective search (GSS). Each proposal is classified into an object class using MIL, as before. Finally, WyPR enforces consistency between the predictions of the segmentation and detection subsystems, so that the points within a detected bounding box agree with the box-level prediction. The following figure illustrates the whole process.
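To make the consistency idea concrete, the sketch below measures how many of the points inside a box carry the same class as the box itself. This is only an illustrative check: WyPR enforces consistency during training, whereas this is a simple after-the-fact score, and the axis-aligned box format and the function names are assumptions made here.

```python
def points_in_box(points, box):
    """Indices of points inside an axis-aligned box.

    box is ((xmin, ymin, zmin), (xmax, ymax, zmax)); this format is an
    assumption of the sketch.
    """
    (xmin, ymin, zmin), (xmax, ymax, zmax) = box
    return [
        i
        for i, (x, y, z) in enumerate(points)
        if xmin <= x <= xmax and ymin <= y <= ymax and zmin <= z <= zmax
    ]


def box_agreement(points, point_classes, box, box_class):
    """Fraction of points inside the box whose class matches the box class."""
    inside = points_in_box(points, box)
    if not inside:
        return 0.0
    agree = sum(1 for i in inside if point_classes[i] == box_class)
    return agree / len(inside)


points = [(0.1, 0.1, 0.1), (0.5, 0.5, 0.5), (0.9, 0.2, 0.4), (2.0, 2.0, 2.0)]
point_classes = ["chair", "chair", "table", "table"]
box = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
# Three points fall inside the box and two of them are labeled "chair".
print(box_agreement(points, point_classes, box, "chair"))
```

A score near 1 means the segmentation and detection outputs tell the same story for that box; a low score flags a disagreement that a consistency term would penalize.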
As the semantic segmentation results below show, WyPR detects and segments objects in a scene well, even though it has never seen a scene labeled at the point level. In addition, WyPR formalizes the weakly supervised 3D detection problem, including baselines and benchmarks, which the team believes will drive future research in this area.
The main contributions of the paper “Recognizing 3D spaces without spatial labels” include:
A new point cloud framework for jointly learning weakly supervised semantic segmentation and object detection, which significantly outperforms single-task baselines.
An unsupervised 3D proposal generation algorithm for point cloud data: geometric selective search (GSS).
State-of-the-art results in weakly supervised semantic segmentation, and a new benchmark for weakly supervised proposal generation and object detection.
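The article says only that GSS is inspired by selective search, which builds proposals by repeatedly merging similar regions from the bottom up. The sketch below shows that general grouping pattern on 3D points and should not be read as the GSS algorithm itself: the starting regions, the centroid-distance similarity, and the name `greedy_merge_proposals` are all assumptions made for illustration.

```python
def bounding_box(points):
    """Axis-aligned box ((xmin, ymin, zmin), (xmax, ymax, zmax)) of a region."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def centroid(points):
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(3))


def distance(a, b):
    return sum((u - v) ** 2 for u, v in zip(a, b)) ** 0.5


def greedy_merge_proposals(regions):
    """Selective-search-style grouping: merge the closest pair until one remains.

    regions: list of regions, each a list of (x, y, z) points.
    Every initial region and every merged region contributes one box proposal.
    Centroid distance stands in for a real similarity measure (assumption).
    """
    regions = [list(r) for r in regions]
    proposals = [bounding_box(r) for r in regions]
    while len(regions) > 1:
        best = None
        for i in range(len(regions)):
            for j in range(i + 1, len(regions)):
                d = distance(centroid(regions[i]), centroid(regions[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = regions[i] + regions[j]
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        proposals.append(bounding_box(merged))
    return proposals


regions = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(1.5, 0.0, 0.0), (2.0, 0.0, 0.0)],
    [(10.0, 10.0, 0.0), (11.0, 10.0, 0.0)],
]
for box in greedy_merge_proposals(regions):
    print(box)
```

Because each merge adds exactly one box, n starting regions yield 2n - 1 proposals at multiple scales, with no labels needed at any point.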
Recognizing 3D spaces without spatial labels
It is worth mentioning that the team will host the WyPR source code on GitHub; in the meantime, interested readers can visit the project's introductory page.
Posted by: CoinYuppie. Reprinted with attribution to: https://coinyuppie.com/facebook-proposes-wypr-a-3d-spatial-recognition-method-without-spatial-annotation/