This paper addresses cross-modal Image-Text Retrieval (ITR) in the Chinese-language setting. Building on Chinese CLIP, the model is pretrained on a large-scale Chinese image-text dataset and employs carefully selected vision and text encoders with a two-stage pretraining strategy, enabling it to develop a nuanced semantic alignment between images and texts in a Chinese context and a deep understanding of how images match Chinese texts. To further enhance the model's capabilities, the paper introduces a domain-specific dataset and applies fine-tuning strategies. Well-designed experiments compare popular models with ours on the public datasets Flickr30K-CN and COCO-CN and the domain-specific dataset AP-ID. Our model attains state-of-the-art (SOTA) results on Chinese text-image retrieval datasets, demonstrating its robustness and effectiveness on Chinese-language cross-modal ITR tasks.
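The retrieval step described above can be sketched as follows. This is a minimal, hypothetical illustration of CLIP-style retrieval only: the "embeddings" here are toy vectors, not outputs of the paper's Chinese CLIP encoders, and the function names are invented for this sketch.

```python
import numpy as np

# Toy stand-ins for encoder outputs; the paper's actual vision/text encoders
# (Chinese CLIP) are not reproduced here.
rng = np.random.default_rng(0)

def l2_normalize(x):
    # Project embeddings onto the unit sphere so a dot product
    # equals cosine similarity, as in CLIP-style retrieval.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve(text_emb, image_embs):
    """Rank image indices by cosine similarity to a text query embedding."""
    sims = l2_normalize(image_embs) @ l2_normalize(text_emb)
    return np.argsort(-sims)

# Toy data: 4 image embeddings; the text embedding is a slightly
# perturbed copy of image 2, so image 2 should rank first.
image_embs = rng.normal(size=(4, 8))
text_emb = image_embs[2] + 0.05 * rng.normal(size=8)

ranking = retrieve(text_emb, image_embs)
print(int(ranking[0]))  # index of the best-matching image
```

In practice the same similarity matrix also drives the contrastive pretraining objective: matched image-text pairs are pushed toward high cosine similarity and mismatched pairs toward low similarity.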