Not all images are worth 16×16 words, Tsinghua and Huawei propose dynamic ViT

The Transformer, a runaway success in NLP, has since been carried over to image recognition, with its self-attention mechanism treated as something of a magic bullet.

The Vision Transformer (ViT) in particular has been widely adopted thanks to its strong performance when trained on large-scale image datasets.

However, the cost of self-attention grows quadratically with the number of tokens, so even a modest increase in token count causes a sharp rise in computation!

Recently, the research team of Gao Huang, an assistant professor in the Department of Automation at Tsinghua University, together with researchers at Huawei, took a different approach and proposed the Dynamic Vision Transformer (DVT), which automatically configures an appropriate number of tokens for each input image, cutting redundant computation and significantly improving efficiency.

The paper, titled “Not All Images are Worth 16×16 Words: Dynamic Vision Transformers with Adaptive Sequence Length,” has been posted on arXiv.

Proposing Dynamic ViT

It is clear that current ViT models face a challenge: their computational cost is tied directly to the number of tokens they process.

To strike the best balance between accuracy and speed, existing models typically represent an image with 14×14 or 16×16 tokens.

The research team observed that

In a typical dataset, many “easy” images can be predicted accurately with just 4×4 tokens, whereas the standard 14×14 representation costs roughly 8.5 times more computation; only a small fraction of “hard” images actually need the finer representation.

In other words, computation is currently spread evenly across “easy” and “hard” samples alike, which leaves a great deal of room for efficiency gains if the number of tokens is adjusted dynamically per image.
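For intuition: a 224×224 image cut into 16×16-pixel patches yields 14×14 = 196 tokens, while much coarser patches yield only 4×4 = 16, and self-attention cost grows quadratically in the token count. A back-of-the-envelope sketch (illustrative only; the 8.5× figure above comes from the paper's full-model measurements, not from this simplified formula):

```python
# Back-of-the-envelope comparison of token counts and self-attention cost.
# Illustrative only: real ViT FLOPs also include patch embedding, MLPs, etc.

def num_tokens(image_size: int, patch_size: int) -> int:
    """One token per non-overlapping patch."""
    return (image_size // patch_size) ** 2

def self_attention_cost(n: int, dim: int = 384) -> int:
    """Rough per-layer cost: n^2 * d for attention plus n * d^2 for projections."""
    return n * n * dim + n * dim * dim

fine = num_tokens(224, 16)    # 14 x 14 = 196 tokens
coarse = num_tokens(224, 56)  # 4  x 4  =  16 tokens

ratio = self_attention_cost(fine) / self_attention_cost(coarse)
print(f"{fine} vs. {coarse} tokens -> roughly {ratio:.1f}x more attention compute")
```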

Based on this, the team proposes the novel Dynamic Vision Transformer (DVT) framework, which aims to automatically configure a suitable number of tokens for each image and thereby achieve high computational efficiency.

DVT is designed as a generic framework: for a given architecture, a cascade of Transformers that use progressively more tokens is trained.

At test time, these Transformers are activated sequentially, starting with the one that uses the fewest tokens.

The inference process is terminated as soon as a prediction with sufficient confidence has been generated.
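A minimal sketch of this early-exit loop, assuming a list of PyTorch classifiers ordered from the fewest to the most tokens; the softmax-confidence criterion and the `threshold` knob mirror the description above, but the names are illustrative rather than the authors' code:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def dvt_predict(models, image, threshold=0.9):
    """Run the cascade coarse -> fine and stop at the first confident prediction.

    models    : Transformers ordered by increasing token count, each mapping an
                image batch of shape (1, 3, H, W) to class logits.
    image     : a single preprocessed image tensor of shape (1, 3, H, W).
    threshold : softmax confidence required to terminate early.
    """
    logits = None
    for model in models:
        logits = model(image)                      # (1, num_classes)
        confidence, prediction = F.softmax(logits, dim=-1).max(dim=-1)
        if confidence.item() >= threshold:         # confident enough -> early exit
            return prediction.item()
    return logits.argmax(dim=-1).item()            # otherwise trust the last model
```

Raising `threshold` pushes more images through the finer models (higher accuracy, more compute); lowering it lets more images exit early.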

[Figure: overview flowchart of the DVT framework]

The framework can be built on top of current state-of-the-art image-recognition Transformers, such as ViT, DeiT and T2T-ViT, and improves their efficiency.

This approach is also highly flexible.

This is because the computational budget of DVT can be tuned online simply by adjusting the early-termination criterion.

This makes DVT well suited to situations where the available computational resources vary dynamically, or where a target level of performance must be reached with minimal power consumption.

Both of these situations are common in real-world applications, as often seen in search engines and mobile applications.

Based on the flowchart above, the careful reader will also notice that

whenever an upstream model fails to yield a confident prediction and a downstream model has to be run, the computation already performed upstream would be wasted unless its intermediate results are reused.

Building on this, the research team further proposes a feature reuse mechanism and a relation reuse mechanism, both of which cut redundant computation and significantly improve test accuracy at minimal extra cost.

The former lets downstream models build on the deep features already extracted upstream, while the latter lets them exploit the upstream self-attention maps to learn more accurate attention.

The real-world effect of this dynamic allocation of “easy” and “hard” can be illustrated by the example in the figure below.

[Figure: examples of “easy” and “hard” images]

So how exactly do these two mechanisms work?

Feature Reuse Mechanism

All Transformers in DVT share a common goal: extracting the features needed for accurate recognition.

Therefore, the downstream model should learn on the basis of previously acquired deep features, instead of extracting features from scratch.

This way, the computation performed by an upstream model contributes not only to its own prediction but also to the subsequent models, making the whole cascade more efficient.

To implement this idea, the research team proposed a feature reuse mechanism.

Simply put, the image tokens output by the last layer of the upstream Transformer are used to learn a layer-wise context embedding, which is integrated into the MLP block of each layer of the downstream Transformer.

[Figure: the feature reuse mechanism]
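A minimal sketch of the idea in PyTorch, assuming a standard Transformer layer and ignoring the class token for simplicity; the layer names, dimensions, and nearest-neighbour upsampling are illustrative stand-ins, not the paper's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureReuseMLP(nn.Module):
    """Downstream MLP block that also consumes the upstream model's final tokens.

    The upstream tokens are projected to a small context embedding, upsampled
    from the coarse token grid to the fine one, and concatenated with the
    downstream tokens before the usual MLP (class token omitted for brevity).
    """

    def __init__(self, dim: int, ctx_dim: int = 48, hidden_mult: int = 4):
        super().__init__()
        self.to_context = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, ctx_dim))
        self.mlp = nn.Sequential(
            nn.Linear(dim + ctx_dim, hidden_mult * dim),
            nn.GELU(),
            nn.Linear(hidden_mult * dim, dim),
        )

    def forward(self, x, upstream_tokens, up_grid, down_grid):
        # x: (B, down_grid**2, dim); upstream_tokens: (B, up_grid**2, dim)
        ctx = self.to_context(upstream_tokens)                     # (B, N_up, ctx_dim)
        B, _, C = ctx.shape
        ctx = ctx.transpose(1, 2).reshape(B, C, up_grid, up_grid)  # back to a 2-D grid
        ctx = F.interpolate(ctx, size=(down_grid, down_grid), mode="nearest")
        ctx = ctx.flatten(2).transpose(1, 2)                       # (B, N_down, ctx_dim)
        return self.mlp(torch.cat([x, ctx], dim=-1))
```

Inside each downstream layer this block would replace the ordinary MLP, so the only extra cost is the small context projection.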

Relational Reuse Mechanism

One of the outstanding advantages of Transformer is that

Self-attention blocks can integrate information from across the entire image, effectively modeling long-range dependencies in the data.

Usually, the model has to learn a set of attention maps at each layer to describe the relationships between tokens.

In addition to the deep features mentioned above, the downstream model also has access to the self-attention maps produced by the upstream model.

The research team argues that these learned relationships can likewise be reused to help the downstream Transformer, specifically by adding the upstream attention logits to the downstream ones before the softmax.

[Figure: the relation reuse mechanism]
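A sketch of the corresponding attention computation; in the paper, the attention logits from all upstream layers are combined with a learned linear weighting and resized with a purpose-built upsampling operator, so the single map and plain bilinear resize below are simplifying assumptions:

```python
import torch
import torch.nn.functional as F

def relation_reuse_attention(q, k, v, upstream_logits, down_grid):
    """Self-attention whose pre-softmax logits are augmented with upstream ones.

    q, k, v         : (B, heads, N_down, head_dim) for the downstream layer.
    upstream_logits : (B, heads, N_up, N_up) attention logits saved from the
                      upstream model (class token ignored for brevity).
    The upstream map is resized to N_down x N_down and added to the downstream
    logits before the softmax, so the fine model starts from the coarse model's
    token-to-token relationships instead of learning them from scratch.
    """
    scale = q.shape[-1] ** -0.5
    logits = (q @ k.transpose(-2, -1)) * scale               # (B, H, N_down, N_down)

    n_down = down_grid ** 2
    B, H, n_up, _ = upstream_logits.shape
    r = upstream_logits.reshape(B * H, 1, n_up, n_up)
    r = F.interpolate(r, size=(n_down, n_down), mode="bilinear", align_corners=False)
    r = r.reshape(B, H, n_down, n_down)

    attn = torch.softmax(logits + r, dim=-1)                 # reuse upstream relations
    return attn @ v
```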

How does it work?

Let’s see how well it works in practice without further ado.

The Top-1 accuracy vs. throughput trade-off on ImageNet is shown below.

[Figure: Top-1 accuracy vs. throughput on ImageNet]

It can be seen that DVT is computationally far more efficient than T2T-ViT.

Within a budget of 0.5-2 GFLOPs, DVT needs 1.7-1.9 times less computation than T2T-ViT at the same performance.

In addition, all points on each DVT curve can be reached with a single trained model, simply by adjusting its confidence threshold.
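In code terms, one trained cascade traces out the whole accuracy-vs-compute curve just by re-running evaluation at different thresholds; a usage sketch reusing the hypothetical `dvt_predict` helper from earlier (with an assumed `models` list and `val_set` of image-label pairs):

```python
# Each threshold gives one point on the accuracy-vs-compute curve;
# no retraining is needed between points.
for threshold in (0.5, 0.7, 0.9, 0.99):
    correct = 0
    for image, label in val_set:                 # assumed (tensor, int) pairs
        correct += int(dvt_predict(models, image, threshold=threshold) == label)
    print(f"threshold={threshold:.2f}  top-1={correct / len(val_set):.3f}")
```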

The Top-1 accuracy vs. GFLOPs results on CIFAR are shown below.

[Figure: Top-1 accuracy vs. GFLOPs on CIFAR]

The Top-1 accuracy vs. throughput on ImageNet is shown in the table below.

[Table: Top-1 accuracy vs. throughput on ImageNet]

Visualizations of “easy” and “hard” samples in DVT are shown below.

[Figure: visualization of “easy” and “hard” samples in DVT]

The extensive empirical results above on ImageNet, CIFAR-10 and CIFAR-100 show that DVT significantly outperforms competing methods in both theoretical computational efficiency and actual inference speed.

Aren’t you excited to see such beautiful results?

If you are interested, please go to the original article.

Links

Paper address:

Research Team


Gao Huang

Currently only 33 years old, he is already an assistant professor and PhD supervisor in the Department of Automation at Tsinghua University.

He received the 2020 Alibaba DAMO Academy Young Fellow (Qingcheng) Award, and his research areas include machine learning, deep learning, computer vision, and reinforcement learning.

