In this work, we present a self-supervised dual constraint contrastive method for efficiently fine-tuning vision-language pre-trained (VLP) models, which have achieved great success on a variety of cross-modal tasks. Fully fine-tuning these pre-trained models is computationally expensive and, constrained by the size and quality of labeled datasets, tends to result in catastrophic forgetting. Our approach freezes the pre-trained VLP model as a fundamental, generalized, and transferable multimodal representation and incorporates lightweight parameters to learn domain- and task-specific features without labeled data. We demonstrate that our self-supervised dual contrastive model outperforms previous fine-tuning methods on the cross-modal retrieval task on the MS COCO and Flickr30K datasets, with an even more pronounced improvement in zero-shot performance. Furthermore, experiments on the MOTIF dataset show that our self-supervised approach remains effective when trained on a small, out-of-domain dataset without overfitting. As a plug-and-play technique, our method is agnostic to the underlying architecture and can be easily integrated with different VLP models, allowing it to benefit from future advances in VLP models.
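The abstract does not specify the exact form of the dual constraint objective or the lightweight modules, so the following is only a minimal sketch of the general setup it describes: a frozen dual-encoder VLP backbone, small trainable adapter layers, and a symmetric image-to-text / text-to-image contrastive loss over unlabeled pairs. The `Adapter` and `DualContrastiveTuner` classes, the bottleneck width, and the use of a CLIP-style InfoNCE loss are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Adapter(nn.Module):
    """Lightweight bottleneck adapter applied to a frozen encoder output (hypothetical)."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x):
        # Residual connection keeps the frozen VLP representation as the backbone.
        return x + self.up(F.relu(self.down(x)))


class DualContrastiveTuner(nn.Module):
    """Frozen VLP encoders plus trainable adapters, trained with a symmetric
    contrastive loss on unlabeled image-text pairs (a stand-in for the paper's
    dual constraint objective, whose exact form is not given in the abstract)."""

    def __init__(self, image_encoder: nn.Module, text_encoder: nn.Module,
                 dim: int, temperature: float = 0.07):
        super().__init__()
        self.image_encoder = image_encoder
        self.text_encoder = text_encoder
        # Freeze every pre-trained VLP parameter; only adapters are updated.
        for p in self.image_encoder.parameters():
            p.requires_grad_(False)
        for p in self.text_encoder.parameters():
            p.requires_grad_(False)
        self.image_adapter = Adapter(dim)
        self.text_adapter = Adapter(dim)
        self.logit_scale = nn.Parameter(torch.tensor(1.0 / temperature).log())

    def forward(self, images, texts):
        # The frozen backbone provides generic multimodal features.
        with torch.no_grad():
            img_feat = self.image_encoder(images)
            txt_feat = self.text_encoder(texts)
        # Adapters learn the domain/task-specific part without labels.
        img_emb = F.normalize(self.image_adapter(img_feat), dim=-1)
        txt_emb = F.normalize(self.text_adapter(txt_feat), dim=-1)
        logits = self.logit_scale.exp() * img_emb @ txt_emb.t()
        targets = torch.arange(img_emb.size(0), device=img_emb.device)
        # Symmetric ("dual") objective: image-to-text and text-to-image directions.
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return 0.5 * (loss_i2t + loss_t2i)
```

Because only the adapter parameters (and the temperature) receive gradients, the tuner can be trained on a small, out-of-domain corpus of paired images and captions while the underlying VLP model stays intact, which is the plug-and-play property the abstract emphasizes.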