Hinton, LeCun and Bengio join forces in another 10,000-word article: Deep Learning Yesterday, Today and Tomorrow

Where does the road of deep learning lead?

2018 Turing Award winners Yoshua Bengio, Yann LeCun and Geoffrey Hinton were invited by ACM to come together again to review the basic concepts and some breakthrough results of deep learning, and to discuss its origins, development and future challenges.

In 2018, ACM (the Association for Computing Machinery) decided to award the Turing Award, the highest prize in computing, to Yoshua Bengio, Yann LeCun and Geoffrey Hinton for their contributions to deep learning.


This is also the third time that the Turing Award has been given to three winners at the same time.

Artificial neural networks for deep learning were proposed as early as the 1980s, but at the time they did not receive much attention from the scientific community, owing to a lack of theoretical support and limited computing power.


It was these three researchers who persisted with neural-network methods and carried out intensive research in the field. Their experiments produced many remarkable results and helped prove the practical advantages of deep neural networks.

So it is no exaggeration to say that they are the fathers of deep learning.

In the AI world, when Yoshua Bengio, Yann LeCun and Geoffrey Hinton appear together, something big is bound to happen.

Recently, the three giants of deep learning were invited by Communications of the ACM to discuss deep learning in depth, outlining its basic concepts, recent progress, and future challenges.

AI developers: after reading guidance from the field's leading figures, is the road ahead any clearer? Let's see what they had to say.

The Rise of Deep Learning
In the early 2000s, deep learning introduced elements that made it easier to train deeper networks and thus re-energized research in neural networks.


GPUs and large datasets were key enablers of deep learning, and their impact was amplified by open-source, flexible software platforms with automatic differentiation (e.g. Theano, Torch, Caffe, TensorFlow). These made it much easier to train complex deep networks and to reuse the latest models and their building blocks. Composing more layers also allows more complex nonlinearities to be expressed, which has produced unexpectedly good results on perceptual tasks.
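As a tiny illustration of what automatic differentiation provides (sketched here in PyTorch; the platforms named above offer the same capability), gradients of a loss with respect to parameters are obtained without any hand-derived calculus:

```python
# A minimal sketch of reverse-mode automatic differentiation in PyTorch.
# All data here are hypothetical placeholders.
import torch

w = torch.randn(3, requires_grad=True)      # parameters we want gradients for
x, y = torch.randn(5, 3), torch.randn(5)    # toy inputs and targets

loss = ((x @ w - y) ** 2).mean()            # a simple squared-error loss
loss.backward()                             # autodiff fills in w.grad
print(w.grad)                               # dLoss/dw, computed automatically
```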


So what makes deep learning "deep"? The idea that deeper neural networks might be more powerful predates modern deep learning techniques. What turned that idea into reality, however, was a series of continual advances in architectures and training procedures, and these brought about the remarkable progress associated with the rise of deep learning.

Deeper networks generalize better for the kinds of input-output relationships we care about, and not simply because they have more parameters: deep networks typically generalize better than shallow networks with the same number of parameters. For example, a popular family of convolutional architectures in computer vision is ResNet, whose most common variant, ResNet-50, has 50 layers.


Image source: Zhihu @stinky salty fish
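For a rough, hands-on illustration (assuming torchvision 0.13 or later is available for the `weights=None` argument), the sketch below instantiates ResNet-18 and ResNet-50 without pretrained weights and compares their sizes; the naming of the family reflects depth, not just parameter count.

```python
# A minimal sketch comparing two members of the ResNet family from torchvision.
import torch
from torchvision.models import resnet18, resnet50

for name, builder in [("ResNet-18", resnet18), ("ResNet-50", resnet50)]:
    model = builder(weights=None)                      # random init, no download
    n_params = sum(p.numel() for p in model.parameters())
    print(f"{name}: {n_params / 1e6:.1f}M parameters")  # roughly 11.7M vs 25.6M

# Forward pass on a dummy ImageNet-sized input to confirm the 1000-way output.
model = resnet50(weights=None).eval()
with torch.no_grad():
    print(model(torch.randn(1, 3, 224, 224)).shape)     # torch.Size([1, 1000])
```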

Deep networks stand out because they exploit a particular form of compositionality: the features in one layer are combined in many different ways to create more abstract features in the next layer.

Unsupervised pre-training. When the number of labeled training examples is small relative to the complexity of the neural network needed to perform the task, it is possible to use other sources of information to create layers of feature detectors and then fine-tune them with the limited labels. In transfer learning, that source of information is another supervised learning task with plentiful labels. But it is also possible to create multiple layers of feature detectors without using any labels at all, by stacking autoencoders.
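A minimal sketch of this idea in PyTorch (all sizes and data below are hypothetical placeholders): each layer is first trained, without labels, to reconstruct its input as an autoencoder, and the stacked encoders are then fine-tuned with a small labeled set.

```python
# Greedy layer-wise autoencoder pretraining followed by supervised fine-tuning.
import torch
import torch.nn as nn

def pretrain_layer(encoder, data, epochs=5, lr=1e-3):
    """Train encoder + a throwaway decoder to reconstruct its own input (no labels)."""
    decoder = nn.Linear(encoder.out_features, encoder.in_features)
    opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=lr)
    for _ in range(epochs):
        recon = decoder(torch.relu(encoder(data)))
        loss = nn.functional.mse_loss(recon, data)
        opt.zero_grad(); loss.backward(); opt.step()
    return encoder

unlabeled = torch.randn(1024, 784)                  # plenty of unlabeled examples
enc1 = pretrain_layer(nn.Linear(784, 256), unlabeled)
enc2 = pretrain_layer(nn.Linear(256, 64), torch.relu(enc1(unlabeled)).detach())

# Fine-tune the stacked encoders plus a classifier head on a few labeled examples.
labeled_x, labeled_y = torch.randn(64, 784), torch.randint(0, 10, (64,))
model = nn.Sequential(enc1, nn.ReLU(), enc2, nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
for _ in range(20):
    loss = nn.functional.cross_entropy(model(labeled_x), labeled_y)
    opt.zero_grad(); loss.backward(); opt.step()
```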


The mysterious success of rectified linear units (ReLUs). Early on, the success of deep networks came from unsupervised pre-training of hidden layers that used the logistic sigmoid nonlinearity or the closely related hyperbolic tangent.

Rectified linear units had long been hypothesized in neuroscience and had already been used in some variants of RBMs and convolutional neural networks. It came as a surprise to discover that rectifying nonlinearities make it easy to train deep networks by backpropagation and stochastic gradient descent, eliminating the need for layer-by-layer pretraining. This was one of the technical advances that allowed deep learning to outperform previous methods for object recognition.
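The contrast can be seen in a toy experiment like the one below (PyTorch, hypothetical sizes, random data): a deep network with rectifying nonlinearities is trained directly by backpropagation and stochastic gradient descent, with no layer-by-layer pretraining, alongside an otherwise identical sigmoid network.

```python
# Toy comparison of sigmoid vs. ReLU networks trained end-to-end with plain SGD.
import torch
import torch.nn as nn

def mlp(activation):
    return nn.Sequential(
        nn.Linear(100, 256), activation,
        nn.Linear(256, 256), activation,
        nn.Linear(256, 256), activation,
        nn.Linear(256, 10),
    )

x, y = torch.randn(512, 100), torch.randint(0, 10, (512,))
for name, act in [("sigmoid", nn.Sigmoid()), ("ReLU", nn.ReLU())]:
    model = mlp(act)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for step in range(200):
        loss = nn.functional.cross_entropy(model(x), y)
        opt.zero_grad(); loss.backward(); opt.step()
    # On toy problems like this, the ReLU network typically reaches a lower
    # training loss faster; the sigmoid network tends to train more slowly.
    print(f"{name}: final training loss {loss.item():.3f}")
```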

Breakthroughs in speech and object recognition. An acoustic model converts a sound wave into a probability distribution over phoneme fragments. Robinson (using a wafer-scale transputer system) and Morgan et al. (using DSP chips) both showed that, given enough processing power, neural networks could rival the state of the art in acoustic modeling.
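The sketch below shows the general shape of such an acoustic model (a simplified placeholder, not any of the historical systems): a window of acoustic feature frames is mapped to a probability distribution over phoneme classes.

```python
# A minimal acoustic-model sketch: feature frames -> phoneme posteriors.
import torch
import torch.nn as nn

N_FEATS, CONTEXT, N_PHONES = 40, 11, 61    # e.g. 40 mel bins, 11-frame window, 61 TIMIT phones

acoustic_model = nn.Sequential(
    nn.Flatten(),                           # (batch, CONTEXT, N_FEATS) -> (batch, CONTEXT*N_FEATS)
    nn.Linear(CONTEXT * N_FEATS, 512), nn.ReLU(),
    nn.Linear(512, 512), nn.ReLU(),
    nn.Linear(512, N_PHONES),               # unnormalized scores per phoneme class
)

frames = torch.randn(8, CONTEXT, N_FEATS)   # a batch of 8 feature windows
phone_probs = torch.softmax(acoustic_model(frames), dim=-1)
print(phone_probs.shape, phone_probs.sum(dim=-1))   # (8, 61), each row sums to 1
```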


In 2009, two graduate students using NVIDIA GPUs demonstrated that a pre-trained deep neural network could slightly outperform the state of the art on the TIMIT dataset. In 2012, Google dramatically improved voice search on Android. This was an early demonstration of the disruptive power of deep learning.

Around the same time, deep learning scored a dramatic victory in the 2012 ImageNet competition, nearly halving the error rate for recognizing a thousand different classes of objects in natural images. Key to this victory were the more than one million labeled training images collected by Fei-Fei Li and her collaborators, and Alex Krizhevsky's highly efficient use of multiple GPUs.


The winning deep convolutional network had some novelties, such as ReLUs to speed up learning and dropout to prevent overfitting, but it was basically just the kind of feed-forward convolutional network that Yann LeCun and his collaborators had been developing for years.
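A much smaller sketch of those ingredients (not the 2012 network itself) looks like this: a plain feed-forward convolutional network with ReLU nonlinearities and dropout before the classifier.

```python
# A toy convolutional classifier with ReLU and dropout (hypothetical sizes).
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Dropout(p=0.5),                      # randomly zeroes activations during training
    nn.Linear(64 * 56 * 56, 1000),          # 1000-way ImageNet-style classifier
)

x = torch.randn(2, 3, 224, 224)
print(net(x).shape)                         # torch.Size([2, 1000])
```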

The computer vision community has responded admirably to this breakthrough. The evidence for the superiority of convolutional neural networks was incontrovertible, and the community quickly abandoned the previous hand-designed approach in favor of deep learning.

Major recent achievements in deep learning
The three greats selectively discuss some recent advances in deep learning, such as soft attention and the Transformer architecture.

A major development in deep learning, especially in sequential processing, is the use of multiplicative interactions, especially in the form of soft attention. This is a transformative addition to the neural network toolbox, as it transforms neural networks from pure vector transformation machines to architectures that can dynamically choose which inputs to operate on and store the information in associative memory. The key property of such architectures is that they can efficiently operate on different types of data structures.

Soft attention allows modules in one layer to dynamically select which vectors from the previous layer they will combine to compute their outputs. This can make the output independent of the order in which the inputs are presented (treating them as a set) or make use of relationships between different inputs (treating them as a graph).


The Transformer architecture, which stacks many layers of "self-attention" modules, has become the dominant architecture in many applications. Each module in a layer uses a scalar product to compute the match between its query vector and the key vectors of the other modules in that layer. The matches are normalized so that they sum to 1, and the resulting coefficients are used to form a convex combination of the value vectors produced by the modules in the previous layer. The resulting vectors form the inputs to the modules of the next stage of computation.
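The core computation can be written in a few lines (a bare-bones, single-headed sketch in PyTorch, with hypothetical dimensions):

```python
# Scaled dot-product self-attention: query-key matches, softmax normalization,
# and a convex combination of value vectors.
import torch
import torch.nn as nn

d_model = 64
x = torch.randn(10, d_model)                # 10 "modules"/positions, one vector each

W_q, W_k, W_v = (nn.Linear(d_model, d_model, bias=False) for _ in range(3))
Q, K, V = W_q(x), W_k(x), W_v(x)

scores = Q @ K.T / d_model ** 0.5           # pairwise query-key scalar products, scaled
weights = torch.softmax(scores, dim=-1)     # each row is non-negative and sums to 1
output = weights @ V                        # convex combination of the value vectors
print(output.shape)                         # torch.Size([10, 64])
```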


Modules can be multi-headed, so that each module computes several different query, key and value vectors, making it possible for each module to have several distinct inputs, each selected from the previous stage in a different way. In this operation the order and number of modules does not matter, which makes it possible to operate on sets of vectors rather than on single vectors as in traditional neural networks. For instance, a language translation system, when generating a word in the output sentence, can choose to pay attention to the corresponding group of words in the input sentence, regardless of their position in the text.
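Both points can be checked with PyTorch's built-in multi-head attention module (a small demonstration, not the authors' code): several heads run in parallel, and because nothing in the computation depends on position, permuting the input set merely permutes the output.

```python
# Multi-head self-attention is permutation-equivariant when no positional
# information is added: permuting the inputs permutes the outputs.
import torch
import torch.nn as nn

attn = nn.MultiheadAttention(embed_dim=64, num_heads=8, batch_first=True).eval()
x = torch.randn(1, 5, 64)                   # a "set" of 5 vectors

out, _ = attn(x, x, x)                      # self-attention: queries = keys = values = x

perm = torch.randperm(5)
out_perm, _ = attn(x[:, perm], x[:, perm], x[:, perm])
print(torch.allclose(out[:, perm], out_perm, atol=1e-5))   # True
```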

Future challenges
The importance and applicability of deep learning keep being validated, and it is being adopted in more and more domains. For deep learning there is a simple, straightforward way to improve performance: increase the model size.

With more data and computation, it usually gets smarter. For example, GPT-3, a large model with 175 billion parameters (still a small number compared with the number of synapses in the human brain), achieves a significant improvement over GPT-2, which has only 1.5 billion parameters.


But the three also make clear in the discussion that deep learning still has shortcomings that cannot be fixed simply by scaling up parameters and computation.

Compared with human learning, today's machine learning still needs breakthroughs in the following directions:

1. Supervised learning requires too much labeled data, and model-free reinforcement learning requires far too much trial and error. Humans do not need anywhere near that much practice to learn a new skill.

2. Today's systems are much less robust to changes in distribution than humans, who can adapt quickly to such changes from only a few examples.

3. Today's deep learning is most successful at perception, the so-called System 1 tasks. Performing System 2 tasks with deep learning requires a deliberate sequence of general-purpose steps, and research in this area is promising.

From the early days, machine learning theorists have focused on the i.i.d. (independent and identically distributed) assumption: that test cases come from the same distribution as the training examples. Unfortunately, this assumption does not hold in the real world: the world keeps changing, for example because of the actions of various agents, and a learning agent whose intelligence keeps growing will always find new things to learn and discover.

In practice, then, even today's most powerful AI systems tend to perform much worse when they move from the lab into real-world applications.

One of the trio's hopes for the future of deep learning, therefore, is systems that adapt quickly and robustly when the distribution changes (so-called out-of-distribution generalization), so that far fewer examples are needed when facing a new learning task.

Today's supervised learning systems need many more examples than humans to learn something new, and the situation is even worse for model-free reinforcement learning, because a reward provides far less feedback than labeled data.

So, how can we design new machine learning systems that adapt better in the face of distributional changes?

From homogeneous layers to groups of neurons representing entities

Today's evidence suggests that groups of adjacent neurons may represent higher-level vector units, capable of conveying not just a scalar but a set of coordinate values. This idea is at the heart of the capsule architecture, where the elements of a unit are associated with a vector from which a key vector and a value vector (and sometimes also a query vector) can be read.
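A loose sketch of such a vector-valued unit (a hypothetical illustration, not the published capsule architecture) might look like this: a group of neurons emits a pose vector for one entity, and key and value vectors are read off it with learned maps.

```python
# A hypothetical "group of neurons as a vector unit" sketch.
import torch
import torch.nn as nn

class VectorUnit(nn.Module):
    def __init__(self, d_in=128, d_pose=16, d_key=8):
        super().__init__()
        self.to_pose = nn.Linear(d_in, d_pose)     # the group's activity: a vector, not a scalar
        self.read_key = nn.Linear(d_pose, d_key)   # key read off the pose vector
        self.read_value = nn.Linear(d_pose, d_key) # value read off the pose vector

    def forward(self, x):
        pose = torch.tanh(self.to_pose(x))         # a set of coordinate values for one entity
        return pose, self.read_key(pose), self.read_value(pose)

pose, key, value = VectorUnit()(torch.randn(4, 128))
print(pose.shape, key.shape, value.shape)          # (4, 16) (4, 8) (4, 8)
```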


Adaptation to multiple time scales

Most neural networks have only two timescales: the weights adapt slowly over many examples, while the activities adapt rapidly to each new input. Adding an overlay of rapidly adapting, rapidly decaying "fast weights" gives a network interesting new capabilities.

In particular, it provides a high-capacity short-term memory that allows a neural network to perform true recursion: the same neurons can be reused in a recursive call because their activity vectors in the higher-level call can be reconstructed later from the information stored in the fast weights.

Adaptation over multiple timescales is also gradually being adopted in meta-learning.
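A rough sketch of the fast-weights idea (an illustrative toy, not a specific published model): a slowly learned weight matrix is overlaid with a rapidly written, rapidly decaying outer-product memory that serves as a short-term store.

```python
# Slow weights plus a Hebbian-style, quickly decaying fast-weight memory.
import torch

d = 32
slow_W = torch.randn(d, d) * 0.1            # updated slowly, e.g. by gradient descent
fast_W = torch.zeros(d, d)                  # updated at every step, decays quickly
decay, write_rate = 0.9, 0.5

def step(h):
    global fast_W
    out = torch.tanh(slow_W @ h + fast_W @ h)                     # fast weights modulate the computation
    fast_W = decay * fast_W + write_rate * torch.outer(out, out)  # write recent activity into the memory
    return out

h = torch.randn(d)
for _ in range(5):
    h = step(h)                              # recent activity remains retrievable from fast_W
print(h.shape)
```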


A higher level of cognition

When we consider a new task, such as driving in a city with different traffic rules, or even imagine driving a vehicle on the moon, we can take the knowledge and general skills we already have and dynamically recombine them in new ways.

But when we reuse existing knowledge to adapt to a new setting, how do we keep irrelevant parts of that knowledge from interfering with the new task? (Transformer architectures and Recurrent Independent Mechanisms are a possible starting point.)

System 1 processing lets us guess at potential benefits or dangers while we plan or imagine. At a higher level, however, something like the value function that guides AlphaGo's Monte Carlo tree search may be needed.

Machine learning relies on inductive biases, or priors, to encourage learning in directions compatible with assumptions about the world. The nature of System 2 processing, and cognitive neuroscience theories of it, suggest several such inductive biases and architectures that could be used to design novel deep learning systems. How, then, can neural networks be trained to discover the causal properties of the world that underlie the data?

What research directions do the representative AI research programs of the 20th century point us toward? Clearly, they all aimed at System 2 capabilities, such as the ability to reason, to decompose knowledge into simple pieces that can be recombined in a sequence of computational steps, and to manipulate abstract variables and instances. This remains an important direction for AI technology to move forward.

After hearing the three of them out, does the road ahead for AI look a little brighter to you?
