Master the technology and get your ticket to the Metaverse

How did computer vision develop in 2021? What technology trends should we watch in 2022?

Looking back on 2021, the keyword I would choose to summarize the year is "evolution".

From a business perspective, our society has gradually evolved from relying on traditional carbon-based energy to absorbing digital energy: we mine data from the physical world, refine it into information, aggregate it into wisdom, and ultimately increase productivity.

On the other hand, the environment in which we humans live has also undergone drastic changes in recent years. The COVID-19 pandemic broke out suddenly at the beginning of 2020. The virus itself is evolving at extreme speed, and the corresponding vaccines are being developed just as rapidly. The virus will continue to evolve and mutate, and its changes and threats have pushed many technologies to develop quickly. It is as if a fast-forward button has been pressed on human civilization: events of magical realism are really happening around us.

This resembles the scenario described in "The Three-Body Problem". The Trisolaran planet endures a harsh environment of chaos and destruction year-round, yet its technology is thousands of years ahead of Earth's civilization; after humanity later fell under Trisolaran rule, the ceiling of theoretical physics was locked, yet various applied technologies advanced by leaps and bounds, surpassing previous technological levels. All of this is because the power of evolution drives technology to develop in directions better adapted to the objective environment; the only constant is change itself.

Standing at the end of 2021 and looking back on the past year, the author summarizes several developments in computer vision, driven by this power of evolution, that deserve the attention of both industry and academia:

Looking back at the year in computer vision 

Embodied intelligence: moving from passive AI to active AI

Embodied intelligence is a translation of the English term "embodied AI", literally artificial intelligence with a body. The emphasis is that the agent must interact with the real world through multimodal interaction: rather than merely learning to extract high-dimensional visual features that are passively "input" into its model of the world, the agent actively obtains real feedback from the physical world through the six senses of "eyes, ears, nose, tongue, body, and mind". Through this feedback, the agent can learn, become more "intelligent", and even "evolve".

In 1986, the renowned artificial-intelligence researcher Rodney Brooks proposed that intelligence is embodied and contextualized, that the traditional representation-centered approach of classical AI is wrong, and that the way to dispense with representation is to build behavior-based robots. This theory runs contrary both to classical cognitivism and to today's mainstream deep neural networks, which are complex systems built on neuron connections that represent and process information.

When it comes to embodied intelligence and evolution, we must mention a very new computing framework proposed by Fei-Fei Li this year: DERL (deep evolutionary reinforcement learning). She discussed the relationship between biological evolution and agent evolution, drawing on evolutionary theory and applying it to the evolutionary learning of hypothetical agents called UNIMALs ("universal animals").

In the paper, Li first demonstrated the Baldwin effect: behaviors and habits with no genetic basis (acquired within a lifetime, without genetic mutation), after being passed down for many generations, eventually evolve into behaviors and habits with a genetic basis (evolutionary reinforcement learning).

Li also drew on Lamarck's principle of "use it or lose it" when designing UNIMALs to traverse different complex terrains: organs that an organism uses frequently become more developed, while those used rarely gradually degenerate. UNIMALs evolve asexually through three mutation operations: (a) deleting a limb, (b) adjusting a limb's length, and (c) adding a limb.
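To make the loop concrete, here is a toy sketch of DERL-style tournament evolution using the three mutation operations above. Everything in it — the list-of-limb-lengths morphology, the noisy stand-in fitness function, and the population and generation counts — is our own simplification for illustration, not the actual DERL implementation (which trains each morphology with reinforcement learning in a physics simulator).

```python
import random

def mutate(morphology):
    """Apply one of the three asexual mutation operations from DERL."""
    m = list(morphology)
    op = random.choice(["delete", "adjust", "add"])
    if op == "delete" and len(m) > 1:
        m.pop(random.randrange(len(m)))          # (a) delete a limb
    elif op == "adjust":
        i = random.randrange(len(m))             # (b) adjust a limb's length
        m[i] = max(0.1, m[i] + random.uniform(-0.2, 0.2))
    else:
        m.append(random.uniform(0.1, 1.0))       # (c) add a limb
    return m

def fitness(morphology):
    """Stand-in for the inner RL loop: reward morphologies whose limbs
    are near a target length, with noise to mimic learning variance."""
    target = 0.5
    score = -sum(abs(l - target) for l in morphology) / len(morphology)
    return score + random.gauss(0, 0.01)

def evolve(pop_size=16, generations=30):
    population = [[random.uniform(0.1, 1.0)] for _ in range(pop_size)]
    for _ in range(generations):
        # Tournament selection: a random pair competes; the winner's
        # mutated child replaces the loser (steady-state evolution).
        a, b = random.sample(range(pop_size), 2)
        if fitness(population[a]) >= fitness(population[b]):
            population[b] = mutate(population[a])
        else:
            population[a] = mutate(population[b])
    return max(population, key=fitness)

best = evolve()
```

The key design point mirrored from the paper is that selection operates on morphologies, while fitness is measured only after (here, instead of) lifetime learning — which is exactly where the Baldwin effect enters.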

Facebook evolves into Meta, all in on the metaverse

Zuckerberg proposed that the metaverse needs the following eight elements: presence, avatars, home space, teleporting, interoperability, privacy and safety, virtual goods, and natural interfaces.

Among them, Presence is the basic metaverse development kit that Meta provides for Oculus VR headset developers. It offers toolsets based on computer vision and intelligent voice technology, namely the Insight SDK, the Interaction SDK, and the Voice SDK.

The Insight SDK is based on spatial anchors and scene understanding; it helps developers place virtual objects in real space while respecting the spatial and occlusion relationships between objects, similar to Google's ARCore and Apple's ARKit. The Interaction SDK is mainly for interaction via hand movements, with operations such as pointing, poking, pinching, and projection. The Voice SDK, backed by a natural-language platform, provides developers with functions such as voice navigation and search.

Entering the metaverse requires a ticket: intelligent perception and interaction technology. The vision and voice technologies in this ticket are its most important cornerstones.

Tesla’s trillion-dollar market value supported by autonomous driving and full vision solutions

2021 has been called the first year of autonomous driving.

China's Ministry of Transport issued the "Guiding Opinions on Promoting the Development and Application of Autonomous Driving Technology in Road Traffic", a policy favorable to the development of the autonomous-driving industry.

In the past year, we have witnessed the rapid rise of a number of self-driving unicorns and the myth of Tesla's trillion-dollar market value. At Tesla's 2021 AI Day, Senior Director of AI Andrej Karpathy presented Tesla's latest autonomous-driving progress.

As is well known, Tesla abandoned lidar in favor of a vision-only solution, completing spatial perception and modeling with eight RGB cameras. Multi-camera feature-to-prediction fusion is achieved with a Transformer, which integrates the positional information of the different cameras to obtain an accurate spatial mapping.
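As a rough illustration of the idea (our own simplification, not Tesla's code), the sketch below fuses feature tokens from eight cameras into a shared set of spatial queries using single-head cross-attention. The dimensions, the per-camera positional embedding, and the random features are all invented for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cams, tokens_per_cam, d = 8, 16, 32

# Per-camera feature tokens (stand-ins for CNN backbone outputs).
cam_feats = rng.normal(size=(n_cams * tokens_per_cam, d))
# Per-camera positional embedding so the fusion knows which camera
# each token came from; repeated across that camera's tokens.
cam_pos = np.repeat(rng.normal(size=(n_cams, d)), tokens_per_cam, axis=0)
kv = cam_feats + cam_pos  # keys/values for attention

# One learned query per output spatial grid cell (64 cells here).
bev_queries = rng.normal(size=(64, d))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Single-head cross-attention: each spatial query aggregates evidence
# from all cameras at once, weighting tokens by scaled similarity.
attn = softmax(bev_queries @ kv.T / np.sqrt(d))
bev_feats = attn @ kv  # (64, d) fused spatial features
```

The point of the design is that no single camera has to see the whole object: attention lets each output cell pull evidence from whichever cameras observed it.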

Visual information alone, however, lacks temporal context, so Tesla built a video network framework that integrates IMU data to improve positioning and tracking accuracy, and proposed a spatial RNN video module. Tesla also built its own labeling team and automatic labeling platform, progressing from 2D and 3D to today's 4D (space-time) labeling, in which a single annotation covers multiple cameras and multiple frames, and 3D/4D data can be converted, via target movement and orientation, into 2D images from different angles and fields of view. In addition, by simulating environments (lighting, weather, viewing angle) and scene elements such as vehicles, people, and roads, and by adjusting dynamic parameters, Tesla reconstructs an endless stream of data covering all kinds of scenes to continuously train and improve the model on edge cases.

Tesla also demonstrated its self-developed Dojo cluster, a symmetric distributed computing architecture that differs from mainstream asymmetric distributed architectures while retaining good programming flexibility. The three-wheel drive of "algorithm + data + computing power" has created Tesla's trillion-dollar market value and left its competitors far behind.

Combining this year's technological breakthroughs and innovations with future-oriented thinking, the author summarizes the following three major trends across the troika of artificial intelligence: algorithms, data, and computing power:

In 2022, three trends to watch 

AIGC for content generation (algorithmic level) 

We have gradually entered the metaverse age.

The biggest difference between the metaverse and a traditional game world is that the metaverse is a digital twin of the real universe and follows objective laws of the real world, such as the uniqueness of matter. The metaverse therefore needs to twin a vast number of real-world objects. These massive reconstructions cannot be hand-crafted one by one by CG engineers, as in traditional game production; that approach is far too inefficient for the needs of real scenarios.

Therefore, AIGC for content generation (at the algorithmic level) is necessary. Relevant technical directions include image super-resolution, domain transfer, extrapolation, implicit neural representations, and multimodal (CV+NLP) techniques such as CLIP (contrastive language-image pre-training, which learns visual models effectively from natural-language supervision) for generating images from text descriptions.
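The CLIP objective mentioned above can be sketched in a few lines: image and text embeddings are L2-normalized, a temperature-scaled cosine-similarity matrix is computed, and a symmetric cross-entropy pulls matched pairs together. The toy "encoders" below are just random vectors, so this illustrates only the loss, not a real model; the batch size and embedding dimension are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
batch, d = 4, 8
img = rng.normal(size=(batch, d))                 # fake image embeddings
txt = img + 0.1 * rng.normal(size=(batch, d))     # pretend encoders roughly agree

# L2-normalize, then compute the cosine-similarity logits matrix.
img = img / np.linalg.norm(img, axis=1, keepdims=True)
txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
logits = img @ txt.T / 0.07  # temperature 0.07, as in the CLIP paper

def xent(l):
    """Cross-entropy where the diagonal (matched pairs) is the label."""
    p = np.exp(l - l.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    return -np.log(p[np.arange(len(l)), np.arange(len(l))]).mean()

# Symmetric loss: image-to-text and text-to-image directions averaged.
loss = (xent(logits) + xent(logits.T)) / 2
```

Because supervision comes only from which caption goes with which image, this objective scales to web-sized data with no manual class labels — which is exactly what makes it attractive for metaverse-scale content generation.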

SCV: synthetic data (data level)

Virtual-reality engines have specialized components for generating synthetic data (e.g., NVIDIA Isaac Sim, Unity Perception) that is not only visually realistic but also helps train better algorithms.

Generated/synthesized data is not only an essential element of the metaverse but also an important raw material for training models. As mentioned earlier, Tesla uses virtual-reality technology to generate edge cases of driving scenes and to produce new viewpoints. With the right tools for building datasets, we can skip the tedious process of manual labeling and develop and train computer-vision algorithms more effectively.

What the human eye can see is far less rich than the real world, and the algorithms we build can only focus on the information humans understand and label. But it does not have to be this way: we can build algorithms for sensors that measure things beyond the range of human perception, and these algorithms can be trained efficiently and programmatically in virtual reality.

The well-known analyst firm Gartner believes that within the next three years, synthetic data will become more dominant than real data. In synthetic computer vision (SCV), we train a computer-vision model with a virtual-reality engine and deploy the trained model in the real world.
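As a minimal illustration of the SCV idea, the sketch below generates labeled images programmatically: random bright rectangles on randomized backgrounds, with pixel-perfect bounding boxes known by construction, so no human ever annotates anything. Real engines such as Isaac Sim or Unity Perception do this with full 3D rendering and domain randomization; the shapes, sizes, and image format here are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(42)

def synth_sample(size=32):
    """Render one synthetic image plus its ground-truth bounding box."""
    img = rng.uniform(0.0, 0.1, size=(size, size))  # randomized background
    w, h = rng.integers(4, 12, size=2)              # random object size
    x = rng.integers(0, size - w)                   # random position
    y = rng.integers(0, size - h)
    img[y:y + h, x:x + w] += rng.uniform(0.5, 1.0)  # the "object"
    label = (int(x), int(y), int(w), int(h))        # exact label, for free
    return img, label

# An endless stream of perfectly-labeled training data on demand.
dataset = [synth_sample() for _ in range(100)]
```

The design point is that labels are a by-product of generation rather than a separate annotation step, which is why synthetic pipelines scale so much better than manual labeling.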

Energy-efficient models (computing power level)

Many SOTA models from academia are too heavy to run on-device on mobile phones or wearables, and the heavier the model, the longer the corresponding latency. Running them entirely in the cloud instead introduces problems of cost, network latency, and privacy; it also consumes large amounts of cloud computing power and generates massive energy consumption, which runs against society's demands for carbon peaking and carbon neutrality.

Therefore, energy-efficient inference models are bound to become a mainstream trend. One approach is sparse training, that is, introducing zeros into the network's weight matrices, since not all dimensions are equally important; although this may affect accuracy, it greatly reduces the number of dot-product operations and thus shortens training and inference time. Quantization, pruning, and quantization-aware training can also sharply reduce model inference time and improve energy efficiency, with quantization-aware training largely avoiding the accuracy loss that quantization would otherwise cause. Knowledge distillation, in which a high-performance teacher model is used to train a smaller student model, can likewise improve model energy efficiency.


Descartes said "I think, therefore I am": consciousness determines my existence. Heidegger later criticized Descartes' view, proposing instead that "I am, therefore I think": people are conscious of and perceive the world only because they exist. If we were not human but lived as other organisms, such as butterflies or whales, our cognition of the world would be different.

The author believes that both traditional representation-based deep learning and the newly proposed embodied intelligence, grounded in existence and time, still have a long way to go.

But there is no doubt that if we want to achieve artificial general intelligence, a multimodal, embodied, active, and interactive artificial-intelligence agent is the only way forward.

Why so certain? Because artificial intelligence is an artificial, human-like intelligence trained according to our definition of advanced human intelligence. Should it not, then, have the characteristics of an advanced intelligent being like a human: not only high-level intelligence such as reasoning, deduction, and playing chess, but also low-level intelligence such as walking, talking, and perception, as well as the capacity to evolve like an organism? The direction of future AI products should move from traditional 2D artificial intelligence (tasks such as image classification, object detection, and segmentation) to 3D space and 4D (existence and time).

We have already seen this direction. Short videos and vlogs have developed enormously over the past few years compared with the original Weibo text-and-image posts, because they give users richer information grounded in time, space, and environment. The next step is a more immersive experience: AR/VR provides full-spectrum perception and experience based on space, environment, and time. It develops further into virtual digital humans and AI assistants, with humanoid robots such as the Tesla Bot interacting actively with users through multimodal vision and voice. And it develops into smart cars, which externally adapt to complex road and traffic conditions for intelligent driving, and internally provide drivers and passengers with a real "third space" that meets users' needs across different scenarios.

Although the human body evolves slowly, the technological evolution created by the human mind changes with each passing day. As a technology worker and an AI practitioner, I look forward to that day: I hope AI will create a world in which we small humans can withstand sudden changes in the external environment, a world that empowers human beings and human civilization.

Posted by: CoinYuppie. Reprinted with attribution to: