Recent vision-language models (VLMs) face significant challenges when adapting to novel domains at test time. While cache-based methods show promise by leveraging historical information, they both cache unreliable feature-label pairs and indiscriminately use single-class information during querying, which substantially compromises adaptation accuracy.
To address these limitations, we propose COSMIC (Clique-Oriented Semantic Multi-space Integration for CLIP), a robust test-time adaptation framework that enhances adaptability through multi-granular, cross-modal semantic caching and graph-based querying mechanisms. Our framework introduces two key innovations: the Dual Semantics Graph (DSG) and the Clique Guided Hyper-class (CGH). The Dual Semantics Graph constructs complementary semantic spaces by incorporating textual features, coarse-grained CLIP features, and fine-grained DINOv2 features to capture rich semantic relationships. Building upon these dual graphs, the Clique Guided Hyper-class component leverages structured class relationships to enhance prediction robustness through correlated class selection. Extensive experiments demonstrate COSMIC's superior performance across multiple benchmarks, with significant improvements over state-of-the-art methods: a 15.81% gain on out-of-distribution tasks and a 5.33% gain on cross-domain generalization with CLIP RN-50.
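To make the cache-based querying that COSMIC builds on concrete, here is a minimal sketch (our illustration, not the authors' implementation; `beta` and all helper names are assumptions): the cache stores L2-normalized image features as keys and one-hot pseudo-labels as values, and a test feature is classified by a similarity-weighted readout over the cached pairs.

```python
import numpy as np

def normalize(x):
    # L2-normalize feature vectors along the last axis
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cache_readout(test_feat, cache_keys, cache_values, beta=5.0):
    """Similarity-weighted readout over cached feature/pseudo-label pairs.

    test_feat:    (d,)    L2-normalized query feature
    cache_keys:   (n, d)  L2-normalized cached features
    cache_values: (n, c)  one-hot pseudo-labels
    beta:         sharpness of the affinity function (hypothetical value)
    """
    sims = cache_keys @ test_feat          # cosine similarities, in [-1, 1]
    weights = np.exp(beta * (sims - 1.0))  # sharpened affinities, in (0, 1]
    return weights @ cache_values          # (c,) class logits

# toy example with random features
rng = np.random.default_rng(0)
d, c = 8, 3
keys = normalize(rng.normal(size=(6, d)))
values = np.eye(c)[rng.integers(0, c, size=6)]
query = normalize(rng.normal(size=d))
logits = cache_readout(query, keys, values)
```

The sketch also shows the failure mode the abstract describes: if a cached key carries a wrong pseudo-label, its weight still contributes to the readout, which is why refining what enters the cache matters.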
Overview of COSMIC. To refine the cache with cross-modal, multi-granular class features, we construct the Dual Semantics Graph with complementary semantics, incorporating both joint modalities and fine-grained visual information. To efficiently query the compatibility of diverse semantics, we propose a novel Clique Guided Hyper-class that models the different communities emerging in the cache as the test domain evolves, enabling adaptive querying of test samples.
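The graph-and-clique idea can be illustrated with a small sketch (our simplification under stated assumptions: cosine affinities between class prototypes, a hypothetical `threshold`, and helper names of our own choosing). We build a thresholded affinity graph over class prototypes, enumerate its maximal cliques, and average each clique's members into a hyper-class prototype that groups correlated classes for querying.

```python
import numpy as np

def maximal_cliques(adj):
    """Bron-Kerbosch enumeration of maximal cliques in a boolean adjacency matrix."""
    n = adj.shape[0]
    cliques = []
    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))  # r is maximal: no vertex can extend it
            return
        for v in list(p):
            nbrs = {u for u in range(n) if adj[v, u]}
            expand(r | {v}, p & nbrs, x & nbrs)
            p = p - {v}
            x = x | {v}
    expand(set(), set(range(n)), set())
    return cliques

def hyper_class_prototypes(protos, threshold=0.5):
    """Group correlated classes into cliques and average their prototypes."""
    protos = protos / np.linalg.norm(protos, axis=1, keepdims=True)
    sim = protos @ protos.T
    adj = (sim > threshold) & ~np.eye(len(protos), dtype=bool)
    cliques = maximal_cliques(adj)
    return cliques, [protos[c].mean(axis=0) for c in cliques]

# toy prototypes: classes 0 and 1 are correlated, class 2 stands apart
protos = np.array([[1.0, 0.1, 0.0],
                   [0.9, 0.2, 0.0],
                   [0.0, 0.0, 1.0]])
cliques, hypers = hyper_class_prototypes(protos)
print(cliques)  # → [[0, 1], [2]]
```

A test feature can then be matched against the hyper-class prototypes first, so a query consults a community of correlated classes rather than a single cached class.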
We evaluated COSMIC alongside state-of-the-art methods on out-of-distribution and cross-domain benchmarks, using CLIP with a ViT-B/16 backbone. The tables below present our comprehensive results.
Method | ImageNet | ImageNet-A | ImageNet-V2 | ImageNet-R | ImageNet-S | Average | OOD Average |
---|---|---|---|---|---|---|---|
CLIP-ViT-B/16 | 66.73 | 47.87 | 60.86 | 73.98 | 46.09 | 59.11 | 57.20 |
TPT (NeurIPS'22) | 68.98 | 54.77 | 63.45 | 77.06 | 47.94 | 62.44 | 60.81 |
DiffTPT (ICCV'23) | 70.30 | 55.68 | 65.10 | 75.00 | 46.80 | 62.58 | 60.65 |
TDA (CVPR'24) | 69.51 | 60.11 | 64.67 | 80.24 | 50.54 | 65.01 | 63.89 |
DMN (CVPR'24) | 72.25 | 58.28 | 65.17 | 78.55 | 53.20 | 65.49 | 63.80 |
COSMIC (Ours) | 78.19 | 73.32 | 69.62 | 85.60 | 62.79 | 73.90 | 72.83 |
Method | Aircraft | Caltech101 | Cars | DTD | EuroSAT | Flower102 | Food101 | Pets | SUN397 | UCF101 | Average |
---|---|---|---|---|---|---|---|---|---|---|---|
CLIP-ViT-B/16 | 23.22 | 93.55 | 66.11 | 45.04 | 50.42 | 66.99 | 82.86 | 86.92 | 65.63 | 65.16 | 64.59 |
TPT (NeurIPS'22) | 24.78 | 94.16 | 66.87 | 47.75 | 42.44 | 68.98 | 84.67 | 87.79 | 65.50 | 68.04 | 65.10 |
DiffTPT (ICCV'23) | 25.60 | 92.49 | 67.01 | 47.00 | 43.13 | 70.10 | 87.23 | 88.22 | 65.74 | 62.67 | 64.92 |
TDA (CVPR'24) | 23.91 | 94.24 | 67.28 | 47.40 | 58.00 | 71.42 | 86.14 | 88.63 | 67.62 | 70.66 | 67.53 |
DMN (CVPR'24) | 30.03 | 95.38 | 67.96 | 55.85 | 59.43 | 74.49 | 85.08 | 92.04 | 70.18 | 72.51 | 70.30 |
COSMIC (Ours) | 31.44 | 96.80 | 71.31 | 58.23 | 58.82 | 82.14 | 86.60 | 94.19 | 72.33 | 76.20 | 72.81 |
@inproceedings{huang2024cosmic,
  title     = {COSMIC: Clique-Oriented Semantic Multi-space Integration for Robust CLIP Test-Time Adaptation},
  author    = {Huang, Fanding and Jiang, Jinyan and Jiang, Qinting and Li, Hebei and Khan, Faisal Nadeem and Wang, Zhi},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2025}
}