COSMIC: Clique-Oriented Semantic Multi-space Integration for Robust CLIP Test-Time Adaptation

1Tsinghua Shenzhen International Graduate School, Tsinghua University
2Shenzhen Technology University     3University of Science and Technology of China

CVPR 2025
(a) Previous Cache-based Methods

(b) COSMIC (Ours)

(a) In previous cache-based methods, the cache holds only coarse-grained CLIP visual features and is queried in a simple way: by similarity between test samples and cached visual class centers. (b) In our COSMIC, the cache carries diverse structural information through additional fine-grained DINOv2 visual features and is queried effectively via similarity between test samples and meticulously designed hyper-class centers.
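For reference, the query step in panel (a) amounts to scoring a test feature against cached class centers by cosine similarity. Below is a minimal sketch of that baseline; the function name, toy dimensions, and plain cosine scoring are illustrative assumptions, not the exact logits used by any of the compared methods.

```python
import numpy as np

def query_cache(test_feat, class_centers):
    """Score a test sample against cached visual class centers.

    test_feat: (d,) L2-normalized feature of the test image.
    class_centers: (C, d) L2-normalized cached class centers.
    Returns one cosine similarity per class.
    """
    return class_centers @ test_feat

# Toy example with 3 classes in a 4-d feature space.
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 4))
centers /= np.linalg.norm(centers, axis=1, keepdims=True)
x = centers[1] + 0.05 * rng.normal(size=4)  # sample perturbed from class 1
x /= np.linalg.norm(x)
sims = query_cache(x, centers)
pred = int(np.argmax(sims))  # x was built near center 1, so class 1 should win
```

COSMIC keeps this similarity-based query form but replaces the single-class centers with hyper-class centers aggregated from correlated classes.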

Abstract

Recent vision-language models (VLMs) face significant challenges in test-time adaptation to novel domains. While cache-based methods show promise by leveraging historical information, they suffer from two weaknesses: they cache unreliable feature-label pairs, and they indiscriminately rely on single-class information during querying, both of which significantly compromise adaptation accuracy.

To address these limitations, we propose COSMIC (Clique-Oriented Semantic Multi-space Integration for CLIP), a robust test-time adaptation framework that enhances adaptability through multi-granular, cross-modal semantic caching and graph-based querying mechanisms. Our framework introduces two key innovations: the Dual Semantics Graph (DSG) and the Clique Guided Hyper-class (CGH). The Dual Semantics Graph constructs complementary semantic spaces by incorporating textual features, coarse-grained CLIP features, and fine-grained DINOv2 features to capture rich semantic relationships. Building upon these dual graphs, the Clique Guided Hyper-class component leverages structured class relationships to enhance prediction robustness through correlated class selection. Extensive experiments demonstrate COSMIC's superior performance across multiple benchmarks, achieving significant improvements over state-of-the-art methods: a 15.81% gain on out-of-distribution tasks and 5.33% on cross-domain generalization with CLIP RN-50.

Overview


Overview of COSMIC. To refine the cache with cross-modal, multi-granular class features, we construct the Dual Semantics Graph with complementary semantics, incorporating both joint modalities and fine-grained visual information. To efficiently query across these diverse semantics, we propose the novel Clique Guided Hyper-class, which models the different communities in the cache as the test domain evolves, enabling adaptive querying of test samples.
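The clique step above can be sketched in simplified form: build a thresholded cosine-similarity graph over cached class centers, enumerate its maximal cliques, and average each clique into a hyper-class center. The snippet below does this in a single feature space with a minimal Bron–Kerbosch routine; the threshold `tau`, plain averaging, and the use of one space (rather than COSMIC's complementary CLIP and DINOv2 graphs) are simplifying assumptions of this sketch, not the paper's exact construction.

```python
import numpy as np

def maximal_cliques(adj):
    """Bron-Kerbosch enumeration of maximal cliques in an undirected
    graph given as a boolean adjacency matrix (no self-loops)."""
    n = adj.shape[0]
    cliques = []
    def bk(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))
            return
        for v in list(p):
            nbrs = {u for u in range(n) if adj[v, u]}
            bk(r | {v}, p & nbrs, x & nbrs)
            p.remove(v)
            x.add(v)
    bk(set(), set(range(n)), set())
    return cliques

def hyper_class_centers(centers, tau=0.5):
    """Group L2-normalized class centers whose pairwise cosine
    similarity exceeds tau into cliques, then average each clique
    into a normalized hyper-class center."""
    sims = centers @ centers.T
    adj = sims > tau
    np.fill_diagonal(adj, False)
    cliques = maximal_cliques(adj)
    hyper = np.stack([centers[c].mean(axis=0) for c in cliques])
    hyper /= np.linalg.norm(hyper, axis=1, keepdims=True)
    return cliques, hyper

# Toy example: two tight groups of class centers on the unit circle.
centers = np.array([[1.0, 0.0], [0.98, 0.199],
                    [0.0, 1.0], [0.199, 0.98]])
centers /= np.linalg.norm(centers, axis=1, keepdims=True)
cliques, hyper = hyper_class_centers(centers, tau=0.5)
```

In the toy example, classes {0, 1} and {2, 3} form two cliques, each yielding one hyper-class center; at test time a sample would be compared against these centers rather than against individual class centers.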


COSMIC Evaluation

We evaluated COSMIC alongside state-of-the-art methods on out-of-distribution and cross-domain benchmarks, using CLIP with a ViT-B/16 backbone. The tables below present our comprehensive results.

Top-1 accuracy (%) comparison on ImageNet and its OOD variants
Method ImageNet ImageNet-A ImageNet-V2 ImageNet-R ImageNet-S Average OOD Average
CLIP-ViT-B/16 66.73 47.87 60.86 73.98 46.09 59.11 57.20
TPT (NeurIPS'22) 68.98 54.77 63.45 77.06 47.94 62.44 60.81
DiffTPT (ICCV'23) 70.30 55.68 65.10 75.00 46.80 62.58 60.65
TDA (CVPR'24) 69.51 60.11 64.67 80.24 50.54 65.01 63.89
DMN (CVPR'24) 72.25 58.28 65.17 78.55 53.20 65.49 63.80
COSMIC (Ours) 78.19 73.32 69.62 85.60 62.79 73.90 72.83

Top-1 accuracy (%) comparison on 10 diverse cross-domain datasets
Method Aircraft Caltech101 Cars DTD EuroSAT Flower102 Food101 Pets SUN397 UCF101 Average
CLIP-ViT-B/16 23.22 93.55 66.11 45.04 50.42 66.99 82.86 86.92 65.63 65.16 64.59
TPT (NeurIPS'22) 24.78 94.16 66.87 47.75 42.44 68.98 84.67 87.79 65.50 68.04 65.10
DiffTPT (ICCV'23) 25.60 92.49 67.01 47.00 43.13 70.10 87.23 88.22 65.74 62.67 64.92
TDA (CVPR'24) 23.91 94.24 67.28 47.40 58.00 71.42 86.14 88.63 67.62 70.66 67.53
DMN (CVPR'24) 30.03 95.38 67.96 55.85 59.43 74.49 85.08 92.04 70.18 72.51 70.30
COSMIC (Ours) 31.44 96.80 71.31 58.23 58.82 82.14 86.60 94.19 72.33 76.20 72.81

BibTeX

@inproceedings{huang2024cosmic,
  title     = {COSMIC: Clique-Oriented Semantic Multi-space Integration for Robust CLIP Test-Time Adaptation},
  author    = {Huang, Fanding and Jiang, Jinyan and Jiang, Qinting and Li, Hebei and Khan, Faisal Nadeem and Wang, Zhi},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2025}
}