PEM: Prototype-based Efficient MaskFormer for Image Segmentation


¹Politecnico di Torino, ²Focoos AI
*Equal Contribution

CVPR 2024
PEM architecture scheme.
Figure 1. Architecture of PEM with the three main components highlighted: backbone, pixel decoder and transformer decoder. The backbone extracts features from the input image; the pixel decoder upsamples these features to produce high-resolution features; the transformer decoder takes as input a set of learnable queries and the high-resolution features and produces refined queries for inference.


Abstract

Recent transformer-based architectures have shown impressive results in the field of image segmentation. Thanks to their flexibility, they obtain outstanding performance in multiple segmentation tasks, such as semantic and panoptic, under a single unified framework. To achieve such impressive performance, these architectures employ intensive operations and require substantial computational resources, which are often not available, especially on edge devices. To fill this gap, we propose Prototype-based Efficient MaskFormer (PEM), an efficient transformer-based architecture that can operate in multiple segmentation tasks. PEM proposes a novel prototype-based cross-attention which leverages the redundancy of visual features to restrict the computation and improve the efficiency without harming the performance. In addition, PEM introduces an efficient multi-scale feature pyramid network, capable of extracting features that have high semantic content in an efficient way, thanks to the combination of deformable convolutions and context-based self-modulation. We benchmark the proposed PEM architecture on two tasks, semantic and panoptic segmentation, evaluated on two different datasets, Cityscapes and ADE20K. PEM demonstrates outstanding performance on every task and dataset, outperforming task-specific architectures while being comparable to, and even better than, computationally expensive baselines.

Model

Prototype-based Masked Cross-Attention

The core component of PEM is our novel Prototype-based Masked Cross-Attention (PEM-CA). PEM-CA is a more efficient alternative to the standard cross-attention operation in image segmentation.
Two main enhancements are introduced in PEM-CA:

  • First, PEM-CA capitalizes on the intrinsic redundancy of visual features in segmentation to significantly reduce the number of input tokens in attention layers through a prototype selection mechanism. Indeed, during training, features related to the same segment naturally align and we can therefore exploit this redundancy to process only a subset of the visual tokens.
  • Second, inspired by recent advancements in the efficiency of attention modules, PEM-CA redesigns the cross-attention operation, modeling interactions by means of computationally cheap element-wise operations.

Prototype-based Masked Cross-Attention scheme image.
Figure 2. Scheme of the proposed Prototype-based Masked Cross-Attention. The prototype selection mechanism reduces the token dimension from HW to N, the number of queries, significantly reducing the computational burden.
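
The two enhancements above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released PEM code: the function names `select_prototypes` and `prototype_cross_attention`, the boolean mask convention (True marks pixels a query may attend to), the fallback for empty masks, and the projection matrices `w_q`, `w_k` and `w_o` are all hypothetical, and the normalization and projection layers of the actual model are omitted.

```python
import numpy as np


def select_prototypes(queries, feats, mask):
    """Pick one visual token (prototype) per query.

    queries: (N, C) query embeddings
    feats:   (HW, C) flattened pixel features
    mask:    (N, HW) boolean, True where a query may attend (assumed convention)
    Returns the (N, C) prototypes and their (N,) pixel indices.
    """
    # Assumption: a query whose mask is empty falls back to all pixels.
    has_any = mask.any(axis=1, keepdims=True)
    mask = mask | ~has_any
    # Similarity between every query and every pixel: (N, HW).
    sim = queries @ feats.T
    # Masked-out pixels can never be selected.
    sim = np.where(mask, sim, -np.inf)
    # Keep only the most similar pixel for each query.
    idx = sim.argmax(axis=1)
    return feats[idx], idx


def prototype_cross_attention(queries, feats, mask, w_q, w_k, w_o):
    """Hypothetical element-wise cross-attention over N prototypes."""
    protos, _ = select_prototypes(queries, feats, mask)
    q = queries @ w_q          # (N, C) projected queries
    k = protos @ w_k           # (N, C) projected prototypes
    interaction = q * k        # cheap element-wise interaction, no N x HW softmax
    return queries + interaction @ w_o  # residual update of the queries


# Toy shapes: N = 3 queries, HW = 16 pixels, C = 8 channels.
rng = np.random.default_rng(0)
n, hw, c = 3, 16, 8
queries = rng.normal(size=(n, c))
feats = rng.normal(size=(hw, c))
mask = rng.random(size=(n, hw)) > 0.5
w_q, w_k, w_o = (rng.normal(size=(c, c)) for _ in range(3))
refined = prototype_cross_attention(queries, feats, mask, w_q, w_k, w_o)
print(refined.shape)  # (3, 8)
```

In this sketch the selection step still compares each query with every pixel once, but the interaction that follows operates on N prototypes rather than HW tokens, which is the reduction described in Figure 2.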

Efficient Multi-scale pixel decoder

The pixel decoder plays a fundamental role in extracting multi-scale features, which allow a precise segmentation of the objects. Mask2Former implements it as a feature pyramid network (FPN) enhanced with deformable attention. However, using deformable attention on top of an FPN introduces a computational overhead that makes the pixel decoder inefficient and unsuitable for real-world applications. To maintain the performance while being computationally efficient, we use a fully convolutional FPN where we restore the three benefits of deformable attention, namely global context (i), dynamic weights (ii) and deformability (iii), by leveraging two key techniques.
First, to reintroduce the global context (i) and the dynamic weights (ii), we implement context-based self-modulation (CSM) modules that adjust the input channels using a global scene representation. Second, to enable deformability (iii), we adopt deformable convolutions that focus on relevant regions of the image by dynamically adapting the receptive field. This dual approach yields competitive performance while preserving computational efficiency.
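
As a rough illustration of the first technique, the sketch below modulates the channels of a feature map with a gate computed from a global scene representation. It assumes a squeeze-and-excitation-style design (global average pooling, one linear layer, sigmoid gate); the function name `context_self_modulation` and the parameters `w` and `b` are hypothetical, and the CSM module in PEM may be formulated differently. Deformable convolutions are not sketched here.

```python
import numpy as np


def context_self_modulation(feats, w, b):
    """Re-weight channels using a global summary of the scene.

    feats: (C, H, W) feature map
    w:     (C, C) weights of a hypothetical linear layer
    b:     (C,)   bias of that layer
    """
    # Global scene representation: average each channel over all positions.
    context = feats.mean(axis=(1, 2))                 # (C,)
    # Input-dependent gate in (0, 1), one value per channel.
    gate = 1.0 / (1.0 + np.exp(-(w @ context + b)))   # (C,)
    # Broadcast the gate over the spatial dimensions.
    return feats * gate[:, None, None]


# Toy example: C = 4 channels on an 8 x 8 grid.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8, 8))
w = rng.normal(size=(4, 4))
b = np.zeros(4)
modulated = context_self_modulation(feats, w, b)
print(modulated.shape)  # (4, 8, 8)
```

Because the gate depends on the input itself, the same layer applies different channel weights to different images, which is how this sketch stands in for the global context and dynamic weights mentioned above.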


Results

Panoptic Segmentation

Cityscapes

cityscapes panoptic results table.

ADE20K

ade20k panoptic results table.

Semantic Segmentation

Cityscapes

cityscapes semantic results table.

ADE20K

ade20k semantic results table.

BibTeX

@article{cavagnero2024pem,
    title   = {PEM: Prototype-based Efficient MaskFormer for Image Segmentation},
    author  = {Cavagnero, Niccol{\`o} and Rosi, Gabriele and Cuttano, Claudia and 
    Pistilli, Francesca and Ciccone, Marco and Averta, Giuseppe and Cermelli, Fabio},
    journal = {CVPR},
    year    = {2024}
}

Acknowledgements

This website is adapted from Nerfies, licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.