What kind of technology is needed to build fully autonomous vehicles? Companies and researchers disagree on the answer. There are many proposed paths to autonomous driving, ranging from camera-only computer vision to computer vision combined with advanced sensors. Tesla has long been a proponent of the vision-only approach, and at this year's Conference on Computer Vision and Pattern Recognition (CVPR), the company's chief AI scientist Andrej Karpathy explained why.
Over the past few years, Karpathy has led Tesla's autonomous driving effort. He delivered a talk at the CVPR 2021 Workshop on Autonomous Driving, detailing how Tesla developed deep learning systems that need only video input to understand the car's surroundings. Karpathy also argued that Tesla is the company best positioned to make vision-based self-driving cars a reality.
General Computer Vision System
Deep neural networks are one of the main components of the autonomous driving stack; they analyze the roads, signs, cars, obstacles, and pedestrians that appear in on-board camera footage. But deep learning can also make mistakes when detecting objects in images. For this reason, most self-driving car companies (including Waymo, the self-driving subsidiary of Google's parent company Alphabet) use lidar, which emits laser beams in all directions to create a 3D map of the car's surroundings. Lidar provides additional information that can fill the gaps left by neural networks.
However, adding lidar to the autonomous driving stack comes with its own complications. Karpathy said: "You have to pre-map the environment with lidar, and then you have to create a high-definition map, including all the lanes and traffic lights, and figure out how they connect. At test time, you simply localize against that map and drive around. But it is extremely difficult to accurately map every location where an autonomous car might drive. The collection, construction, and maintenance of these high-definition lidar maps do not scale, and keeping this infrastructure up to date is also very difficult."
Tesla does not use lidar or high-definition maps in its autonomous driving stack. Karpathy explained: "Everything that happens, happens for the first time, in the car, based on the videos from the eight cameras that surround the car."
Autonomous driving technology must figure out where the lanes are, where the traffic lights are, what their status is, and which ones are relevant to the vehicle, and it must do all of this without any predefined information about the road it is navigating. Karpathy acknowledged that vision-based autonomous driving is technically harder because it requires neural networks that work from video feeds alone. But he said: "Once you put it to use, it's basically a general-purpose computer vision system that can be deployed anywhere on the planet."
With a general vision system, the car no longer needs any auxiliary sensing hardware, and Karpathy said Tesla is already moving in this direction. Previously, the company's cars used a combination of radar and cameras for autonomous driving, but it has recently begun shipping cars without radar. He said: "We removed the radar, and these cars drive on vision alone. Because Tesla's deep learning system has become 100 times better than the radar, the radar is now starting to hold things back."
The main argument against the pure computer-vision approach is whether neural networks can estimate distance, and the uncertainty of those estimates, without the help of lidar and high-definition maps. Karpathy said: "Obviously humans drive by vision, so our neural net is able to process visual input to understand the depth and velocity of objects around us. But the big question is whether artificial neural networks can do the same. I think in the last few months our internal answer to that question has become a clear yes."
Tesla's engineers wanted a deep learning system that could detect objects along with their depth, velocity, and acceleration. They decided to treat the challenge as a supervised learning problem, in which a neural network trained on annotated data learns to detect objects and their associated attributes.
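To make the supervised setup concrete, here is a minimal sketch in Python. The annotation schema and loss are hypothetical illustrations of my own (Tesla has not published its label format): each labeled object carries a bounding box plus the physical attributes the talk mentions, and a multi-task regression loss penalizes prediction error on all of them at once.

```python
from dataclasses import dataclass

@dataclass
class ObjectLabel:
    # Hypothetical annotation schema: a bounding box plus the
    # attributes mentioned in the talk (depth and velocity).
    bbox: tuple          # (x, y, w, h) in image coordinates
    depth_m: float       # distance from the ego camera, in metres
    velocity_mps: float  # radial velocity, metres per second

def supervised_loss(pred: ObjectLabel, truth: ObjectLabel) -> float:
    """Toy multi-task regression loss: sum of squared errors over the
    predicted attributes. A real detector would add classification and
    box-overlap terms; this only illustrates the supervised setup."""
    box_err = sum((p - t) ** 2 for p, t in zip(pred.bbox, truth.bbox))
    return (box_err
            + (pred.depth_m - truth.depth_m) ** 2
            + (pred.velocity_mps - truth.velocity_mps) ** 2)

truth = ObjectLabel((10, 20, 4, 2), depth_m=35.0, velocity_mps=-3.0)
pred  = ObjectLabel((11, 20, 4, 2), depth_m=33.5, velocity_mps=-2.5)
print(supervised_loss(pred, truth))  # 1 + 2.25 + 0.25 = 3.5
```

Training then amounts to minimizing this loss over millions of annotated examples, which is why the quality and coverage of the dataset matters so much.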
To train their deep learning architecture, the Tesla team needed a massive dataset of millions of videos, carefully annotated with the objects they contain and those objects' attributes. Creating datasets for self-driving cars is especially tricky, because engineers must make sure to capture less common road settings and edge cases. Karpathy said: "When you have a large, clean, diverse dataset and you train a large neural network on it, what we find is that success becomes achievable in practice."
Automatically label data sets
Tesla has sold millions of camera-equipped cars around the world, which puts it in a good position to collect the data needed to train its vision deep learning models. The Tesla Autopilot team has accumulated 1.5 PB of data, comprising one million 10-second videos and 6 billion objects annotated with bounding boxes, depth, and velocity. But labeling such a dataset is a huge challenge. One approach would be to annotate it manually through data-labeling companies or online platforms such as Amazon Mechanical Turk. But that would require a lot of labor, could be hugely expensive, and would be very slow.
Instead, the Tesla team used an auto-labeling technique that combines neural networks, radar data, and human review. Because the dataset is annotated offline, the neural networks can replay the videos, compare their predictions with the ground truth, and adjust their parameters. This contrasts with test-time inference, where everything happens in real time and the deep learning model cannot look back.
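The advantage of offline labeling can be sketched in a few lines of Python. The function below is a hypothetical illustration, not Tesla's actual labeler: because the whole clip is available after the fact, each frame's label can fuse the camera estimate with radar and average over past and future frames (non-causal smoothing), neither of which is possible during real-time inference.

```python
def offline_label_depth(camera_depths, radar_depths, window=1):
    """Hypothetical offline auto-labeler for per-frame depth.
    Fuses camera and radar estimates, then smooths each frame's
    label over a temporal window that includes FUTURE frames —
    the 'benefit of hindsight' that only offline replay allows."""
    labels = []
    n = len(camera_depths)
    for i in range(n):
        lo, hi = max(0, i - window), min(n, i + window + 1)
        # fuse the two sensors, then average across the window
        fused = [(camera_depths[j] + radar_depths[j]) / 2
                 for j in range(lo, hi)]
        labels.append(sum(fused) / len(fused))
    return labels

# A noisy camera reading at frame 1 gets pulled toward the
# radar measurement and its temporal neighbours.
print(offline_label_depth([10.0, 12.0, 14.0], [10.0, 10.0, 14.0]))
```

A real system would use far more sophisticated fusion and tracking, but the structural point is the same: offline, the labeler sees the whole clip at once.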
Offline annotation also let the engineers apply very powerful, compute-intensive object detection networks that could never be deployed in cars for real-time, low-latency use, and they used radar sensor data to further verify the networks' inferences. All of this improved the accuracy of the labeling network. Karpathy said: "If you're offline, you have the benefit of hindsight, so you can calmly fuse the different sensor data. And in addition, you can involve humans, who can do the cleaning, verification, editing, and so on."
Karpathy did not specify how much human effort is needed to make final corrections to the auto-labeling system, but human judgment clearly played a key role in steering it in the right direction.
While developing the dataset, the Tesla team created more than 200 triggers indicating that object detection needed adjustment. These included inconsistencies between the detections of different cameras, or between the cameras and the radar. They also identified scenarios that might need special attention, such as entering and exiting tunnels, and cars with objects on their roofs.
Developing and mastering all these triggers took four months. As the labeling network improved, it was deployed in "shadow mode": installed on consumer cars, running silently without issuing any commands to the car, while its output was compared against that of the legacy network, the radar, and the driver's behavior.
The Tesla team went through seven iterations of this data engineering loop. They started with an initial dataset and trained their neural networks on it. They then deployed the networks in shadow mode on real cars and used the triggers to detect inconsistencies, errors, and special situations. The errors were revised and corrected, and where necessary new data was added to the dataset. Karpathy said: "We spin this loop over and over again until the network becomes incredibly good."
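The loop just described can be sketched as a short function. All the names here (`train`, `deploy_shadow`, `mine`) are placeholders of my own, standing in for Tesla's internal systems; the point is only the structure: train, run silently in the fleet, mine failures with triggers, fold them back into the dataset, and repeat.

```python
def data_engine(dataset, train, deploy_shadow, mine, iterations=7):
    """Sketch of the iterative data-engine loop described above
    (hypothetical interface, not Tesla's). Each iteration:
      1. train a model on the current dataset,
      2. deploy it in shadow mode (no commands issued to the car),
      3. mine trigger-flagged failures from the shadow logs,
      4. add the corrected examples back into the dataset."""
    model = None
    for _ in range(iterations):
        model = train(dataset)
        logs = deploy_shadow(model)   # silent fleet deployment
        failures = mine(logs)         # trigger-flagged cases
        dataset = dataset + failures  # grow and clean the dataset
    return model, dataset

# Toy usage with stub stages: each pass grows the dataset by one
# mined example, so seven iterations add seven examples.
model, data = data_engine([0], train=len,
                          deploy_shadow=lambda m: m,
                          mine=lambda logs: [logs])
print(model, len(data))  # 7 8
```

The interesting property of this loop is that the dataset, not the model architecture, is what improves iteration over iteration.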
The architecture is therefore best described as a semi-automatic labeling system with a clever division of labor: neural networks handle the repetitive work, while humans handle the high-level cognitive problems and rare cases.
Interestingly, when a participant asked whether the creation of triggers could itself be automated, Karpathy said: "Automating triggers is a very tricky scenario, because you can have generic triggers, but they won't correctly represent the failure modes. It would be very hard, for example, to automatically come up with the trigger for entering and exiting tunnels. That's something you as a person have to intuit. It's a big challenge, and it's not obvious how it could be done."
Hierarchical deep learning architecture
Tesla's self-driving team needed very efficient, well-designed neural networks to make the most of the high-quality datasets it had collected. The company created a hierarchical deep learning architecture composed of different neural networks, each responsible for processing information and feeding its output to the next set of networks.
The deep learning model uses convolutional neural networks to extract features from the videos captured by the eight cameras installed around the car, and uses a transformer network to fuse them together. It then fuses them across time, which is important for tasks such as trajectory prediction and smoothing out inference inconsistencies. The spatial and temporal features are then fed into a hierarchy of neural networks, which Karpathy described as heads, trunks, and terminals. He said: "The reason you want this hierarchical structure is that there's a huge number of outputs you're interested in, but you can't afford a separate neural network for every one of them."
The hierarchical structure allows Tesla to reuse components for different tasks and allows features to be shared between different inference paths.
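The shared-backbone idea can be illustrated in a few lines. The class below is a deliberately simplified stand-in (plain functions instead of convolutional and transformer networks, and the class name is my own): one expensive backbone computation is done once per frame, and every task-specific head reuses it.

```python
class SharedBackboneNet:
    """Toy illustration of the hierarchy described above: a single
    shared backbone feeds many task-specific heads, so expensive
    feature extraction is computed once and reused by every task."""

    def __init__(self):
        self.heads = {}

    def backbone(self, frame):
        # Stand-in for the shared feature extractor; in the real
        # system this is the costly convolutional/transformer stack.
        return sum(frame)

    def add_head(self, name, fn):
        # Each team can plug its own output head into the shared trunk.
        self.heads[name] = fn

    def forward(self, frame):
        features = self.backbone(frame)  # computed exactly once
        return {name: head(features) for name, head in self.heads.items()}

net = SharedBackboneNet()
net.add_head("depth", lambda f: f / 10)   # toy regression head
net.add_head("is_far", lambda f: f > 5)   # toy classification head
print(net.forward([1, 2, 3]))  # {'depth': 0.6, 'is_far': True}
```

This also shows why the structure supports distributed development: adding a new output is just adding a head, without touching the shared trunk.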
Another benefit of the modular architecture is the possibility of distributed development. Tesla employs a large team of machine learning engineers working on the autonomous driving neural networks; each works on a small component and plugs their results into the larger network. Karpathy said: "We have a team of roughly 20 people who train neural networks full-time, all cooperating on the same larger network."
Vertical integration advantage
In his CVPR talk, Karpathy also shared details about the supercomputer Tesla uses to train and fine-tune its deep learning models. The compute cluster consists of 720 nodes, each containing 8 NVIDIA A100 GPUs with 80 GB of memory apiece, for a total of 5,760 GPUs and more than 450 TB of VRAM. The supercomputer also has 10 PB of NVMe ultra-high-speed storage and 640 Tbps of networking capacity connecting all the nodes, allowing efficient distributed training of neural networks.
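The headline totals follow directly from the per-node figures, as a quick arithmetic check shows (note that 5,760 GPUs at 8 per node implies 720 nodes):

```python
# Sanity-check the cluster totals from the per-node specification.
nodes = 720
gpus_per_node = 8
vram_per_gpu_gb = 80

total_gpus = nodes * gpus_per_node              # 720 * 8 = 5760 GPUs
total_vram_tb = total_gpus * vram_per_gpu_gb / 1000  # 460.8 TB ("more than 450 TB")

print(total_gpus, total_vram_tb)  # 5760 460.8
```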
Tesla also designs its own AI chips, which are installed in its cars. Karpathy said: "These chips are specifically designed for the neural networks we want to run for full self-driving applications."
One of Tesla's great advantages is its vertical integration. The company owns the entire self-driving stack: it builds the cars and the self-driving hardware, and it is uniquely positioned to collect a wide variety of telemetry and video data from the millions of cars it has sold. It also creates and trains its neural networks on proprietary datasets and in-house compute clusters, and validates and fine-tunes them through shadow testing on its cars. And, of course, it has a very talented team of machine learning engineers, researchers, and hardware designers to put all the pieces together.
Karpathy said: "You get to co-design at every layer of the stack. There's no third party holding you back. You're fully in charge of your own destiny, which I think is incredible."
This vertical integration, and the repeating cycle of creating data, tuning machine learning models, and deploying them across many cars, puts Tesla in a unique position to deliver vision-only self-driving capabilities. In his talk, Karpathy showed several examples in which the new neural network alone outperformed the legacy ML model working in combination with radar information. If the system keeps improving as Karpathy says it will, Tesla may be on a trajectory to drop radar entirely, and few other companies are in a position to replicate Tesla's approach.
But the question remains whether, in its current state, deep learning is sufficient to overcome all the challenges of autonomous driving. Object detection and speed and distance estimation certainly play an important role in driving. But human vision performs many other complex functions as well, which scientists call the "dark matter" of vision, and these are all important components of the conscious and subconscious analysis of visual input and of navigating different environments.
Deep learning models also struggle with causal inference, which can be a huge obstacle when a model faces situations it has never seen before. So although Tesla has succeeded in creating a very large and diverse dataset, open roads are highly complex environments, and the model may still encounter many unexpected, unseen situations.
The AI community is divided on whether causality and reasoning need to be explicitly built into deep neural networks, or whether the causality barrier can be overcome through "direct fit", in which a large, well-distributed dataset is enough to achieve general deep learning. Tesla's vision-based autonomous driving team appears to lean toward the latter, but the technique has yet to stand the test of time.
Posted by: CoinYuppie. Reprinted with attribution to: https://coinyuppie.com/why-dont-self-driving-cars-need-radar-teslas-chief-ai-scientist-gave-an-explanation-2/