A ConvNet for the 2020s

Article

Abstract

In this work, we reexamine the design spaces and test the limits of what a pure ConvNet can achieve. We gradually “modernize” a standard ResNet toward the design of a vision Transformer, and discover several key components that contribute to the performance difference along the way. The outcome of this exploration is a family of pure ConvNet models dubbed ConvNeXt. Constructed entirely from standard ConvNet modules, ConvNeXts compete favorably with Transformers in terms of accuracy and scalability, achieving 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO detection and ADE20K segmentation, while maintaining the simplicity and efficiency of standard ConvNets.

Name

Description

The "Roaring 20s" of visual recognition began with Vision Transformers (ViTs), which quickly overtook Convolutional Neural Networks (ConvNets) as the state of the art in image classification. Vanilla ViTs, however, struggled on general computer vision tasks such as object detection and semantic segmentation. Hierarchical Transformers such as Swin Transformers reintroduced several ConvNet priors, making Transformers practically viable as a generic vision backbone. This study reexamines pure ConvNets and develops the ConvNeXt family of models, which compete favorably with Transformers in accuracy and scalability, reaching 87.8% ImageNet top-1 accuracy and outperforming Swin Transformers on COCO object detection and ADE20K semantic segmentation while retaining the simplicity and efficiency of standard ConvNets.

Types

Article

Publish date

03/02/2022

Publisher

Cornell University

Web URL

https://arxiv.org/abs/2201.03545

Referenced by

ConvNeXT (large-sized model)

AI Model