Enhancing Fine-grained Multi-modal Alignment via Adapters: A Parameter-Efficient Training Framework for Referring Image Segmentation

WANT @ ICML 2024

Zunnan Xu,  Jiaqi Huang,  Ting Liu,  Yong Liu,  Haonan Han,  Kehong Yuan,  Xiu Li 
SIGS, Tsinghua University


In computer vision, Parameter-Efficient Training (PET) is increasingly replacing the traditional paradigm of pre-training followed by full fine-tuning. PET is particularly attractive for large-scale models because it reduces transfer learning costs and makes better use of hardware. However, prevailing PET methods are designed primarily for single-modal optimization and lack mechanisms for fine-grained feature extraction; when applied to multi-modal dense prediction tasks, they typically fall short of full fine-tuning methods that consume far more resources. In this paper, we investigate efficient training for referring image segmentation. We introduce DenseCrossAdapter, a parameter-efficient module that enhances low-rank visual feature propagation by establishing dense interconnections between each layer and all preceding layers, facilitating robust cross-modal feature interaction. We further introduce text adapters to improve textual features. Our approach surpasses state-of-the-art methods on challenging benchmarks while updating only 0.9% to 1.8% of backbone parameters.
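The sketch below illustrates the dense-interconnection idea behind an adapter of this kind: each adapter applies a low-rank bottleneck to the current features fused with the outputs of all preceding adapter layers. This is a minimal illustration, not the released implementation; the class name, rank, sum-based fusion, and the omission of the cross-modal text branch are all assumptions.

```python
import torch
import torch.nn as nn


class DenseCrossAdapterSketch(nn.Module):
    """Low-rank adapter that fuses features from all preceding layers.

    Illustrative sketch only: the name, rank, and sum-fusion are assumptions,
    and the cross-modal (text) interaction of the paper is not shown.
    """

    def __init__(self, dim: int, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, rank)  # low-rank down-projection
        self.act = nn.GELU()
        self.up = nn.Linear(rank, dim)    # up-projection back to feature dim

    def forward(self, x: torch.Tensor, prior_states: list[torch.Tensor]) -> torch.Tensor:
        # Dense interconnection: fuse current features with every earlier layer's output.
        fused = x + sum(prior_states) if prior_states else x
        # Residual adapter update through the low-rank bottleneck.
        return x + self.up(self.act(self.down(fused)))


if __name__ == "__main__":
    dim, depth = 256, 4
    adapters = nn.ModuleList(DenseCrossAdapterSketch(dim) for _ in range(depth))
    x = torch.randn(2, 196, dim)   # (batch, tokens, channels) visual features
    prior = []                     # outputs of all preceding adapters
    for adapter in adapters:
        # Frozen backbone blocks would run between these adapter calls.
        x = adapter(x, prior)
        prior.append(x)
    print(x.shape)                 # torch.Size([2, 196, 256])
```

In this toy loop only the adapter parameters would be trained, while the backbone layers between adapter calls stay frozen, which is what keeps the trainable fraction of parameters small.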

BibTeX

@inproceedings{xu2024enhancing,
  title={Enhancing Fine-grained Multi-modal Alignment via Adapters: A Parameter-Efficient Training Framework for Referring Image Segmentation},
  author={Xu, Zunnan and Huang, Jiaqi and Liu, Ting and Liu, Yong and Han, Haonan and Yuan, Kehong and Li, Xiu},
  booktitle={2nd Workshop on Advancing Neural Network Training: Computational Efficiency, Scalability, and Resource Optimization (WANT @ ICML 2024)},
  year={2024}
}