AdaBoosting Text Prompts
for Vision-Language Models

¹KT Corporation
²POSTECH
³National AI Research Lab

*Equal contribution. Corresponding author.

Abstract

The classification accuracy of pretrained Vision-Language Models (VLMs) relies on the quality of the text prompts. Handcrafted templates and Large Language Model (LLM)-generated descriptions not only make predictions more interpretable, but also enable reuse of the same prompts across heterogeneous VLMs. Recent works construct task-adapted text prompts with a small number of labeled images. However, existing few-shot text prompting methods do not explicitly focus on misclassified examples during prompt construction, leading to only marginal improvements even as more shots become available. To fully exploit few-shot supervision, we propose Text Prompt Boosting (TPB), an AdaBoost-inspired framework that treats each text-prompt-based classifier as a weak learner and sequentially aggregates them into a strong ensemble by explicitly targeting hard, misclassified examples. Extensive experiments show that TPB preserves task-intrinsic, model-agnostic cues in text space, enabling robust cross-model transfer. Across eleven classification benchmarks, TPB improves accuracy on the source model and preserves shot-driven gains when transferred to larger, more capable VLMs, where existing methods struggle to sustain such improvements.

Method Overview

Text Prompt Boosting treats each prompt-based classifier as a weak learner. At every boosting round, TPB reweights misclassified few-shot examples, selects a new prompt bank with Greedy Prompt Composition, and aggregates all weak classifiers into a final strong classifier.

Figure: Overview of the Text Prompt Boosting framework. TPB repeatedly focuses on hard examples and builds a strong natural-language prompt ensemble.
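
The boosting loop described above can be illustrated with a minimal sketch. It is not the authors' implementation and rests on several assumptions: image and text features are precomputed and L2-normalised, the weak learner is chosen by picking the lowest weighted-error bank from a fixed pool (a simple stand-in for Greedy Prompt Composition, which the page does not detail), and the weights follow the standard multi-class AdaBoost (SAMME) update. The names `boost_prompts`, `ensemble_predict`, and `candidate_banks` are hypothetical.

```python
import numpy as np


def weak_predict(image_feats, class_text_feats):
    """Predict the class whose text embedding is most similar to each image.

    Assumes both inputs are L2-normalised, so the dot product is cosine similarity.
    """
    return np.argmax(image_feats @ class_text_feats.T, axis=1)


def boost_prompts(image_feats, labels, candidate_banks, num_rounds=5):
    """SAMME-style boosting over a pool of candidate prompt banks.

    image_feats:     (N, D) few-shot image features
    labels:          (N,)   integer class labels
    candidate_banks: (P, C, D) one text embedding per class for each candidate
    Returns a list of (bank_index, alpha) pairs.
    """
    n = len(labels)
    num_classes = candidate_banks.shape[1]
    weights = np.full(n, 1.0 / n)  # start with uniform example weights
    ensemble = []
    for _ in range(num_rounds):
        # Misclassification mask of every candidate bank: shape (P, N).
        preds = np.stack([weak_predict(image_feats, b) for b in candidate_banks])
        miss = (preds != labels).astype(float)
        errors = miss @ weights  # weighted error per candidate
        best = int(np.argmin(errors))
        err = float(np.clip(errors[best], 1e-10, 1 - 1e-10))
        # SAMME needs a weak learner that beats random guessing.
        if err >= 1.0 - 1.0 / num_classes:
            break
        alpha = np.log((1 - err) / err) + np.log(num_classes - 1)
        ensemble.append((best, alpha))
        # Up-weight the examples this round got wrong, then renormalise.
        weights *= np.exp(alpha * miss[best])
        weights /= weights.sum()
    return ensemble


def ensemble_predict(image_feats, candidate_banks, ensemble):
    """Alpha-weighted vote over the selected weak classifiers."""
    num_classes = candidate_banks.shape[1]
    votes = np.zeros((len(image_feats), num_classes))
    for bank_idx, alpha in ensemble:
        preds = weak_predict(image_feats, candidate_banks[bank_idx])
        votes[np.arange(len(preds)), preds] += alpha
    return np.argmax(votes, axis=1)


# Toy usage on synthetic features (no real VLM involved).
rng = np.random.default_rng(0)
num_classes, dim = 3, 8
prototypes = np.eye(num_classes, dim)
labels = rng.integers(0, num_classes, size=60)
images = prototypes[labels] + 0.3 * rng.normal(size=(60, dim))
images /= np.linalg.norm(images, axis=1, keepdims=True)
banks = prototypes + 0.5 * rng.normal(size=(6, num_classes, dim))
banks /= np.linalg.norm(banks, axis=2, keepdims=True)
ensemble = boost_prompts(images, labels, banks)
accuracy = np.mean(ensemble_predict(images, banks, ensemble) == labels)
print(f"rounds used: {len(ensemble)}, training accuracy: {accuracy:.2f}")
```

The reweighting step is what separates this from plain prompt ensembling: each later round is scored mostly on the examples that earlier rounds misclassified.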

Shot Scalability

Unlike prior text-prompting methods that quickly saturate, TPB continues to benefit from additional few-shot supervision.

Table: Shot scalability across text-based methods. Average and per-dataset Top-1 accuracy on eleven classification benchmarks with OpenAI CLIP RN50.

Transfer Robustness

Because TPB keeps adaptation in natural-language prompt space, the learned prompt ensemble can be re-embedded and transferred to larger heterogeneous VLMs without model-specific tuning.

Table: Cross-model transfer robustness on ViT-L and ViT-H target models. Average Top-1 accuracy after transferring prompts optimized on OpenAI CLIP ViT-B/32 to larger heterogeneous target VLMs.
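
To make the transfer step concrete, the sketch below shows one way a prompt ensemble stored as plain strings could be re-embedded on a different model. It is an illustration under assumptions, not the paper's procedure: `encode_text` is a hypothetical callable standing in for the target VLM's text encoder, each weak learner is assumed to be a weight plus a list of prompts per class, prompts within a class are averaged in embedding space, and the weak learners are combined by weighted voting.

```python
import numpy as np
from typing import Callable, Dict, List, Tuple

# One weak learner: its ensemble weight and, for each class name, its prompt strings.
WeakLearner = Tuple[float, Dict[str, List[str]]]


def embed_learner(
    prompts_by_class: Dict[str, List[str]],
    class_names: List[str],
    encode_text: Callable[[List[str]], np.ndarray],
) -> np.ndarray:
    """Re-embed one weak learner's prompts with the target text encoder.

    Returns a (C, D) matrix with one unit-norm embedding per class,
    obtained by averaging the normalised prompt embeddings of that class.
    """
    rows = []
    for name in class_names:
        feats = encode_text(prompts_by_class[name])  # (num_prompts, D)
        feats = feats / np.linalg.norm(feats, axis=1, keepdims=True)
        mean = feats.mean(axis=0)
        rows.append(mean / np.linalg.norm(mean))
    return np.stack(rows)


def transfer_predict(
    image_feats: np.ndarray,
    ensemble: List[WeakLearner],
    encode_text: Callable[[List[str]], np.ndarray],
) -> List[str]:
    """Classify target-model image features with a re-embedded prompt ensemble.

    image_feats must come from the same target model as encode_text and be
    L2-normalised. Only the strings and weights are carried over; nothing is
    tuned on the target model.
    """
    class_names = list(ensemble[0][1].keys())
    votes = np.zeros((len(image_feats), len(class_names)))
    for alpha, prompts_by_class in ensemble:
        text_feats = embed_learner(prompts_by_class, class_names, encode_text)
        preds = np.argmax(image_feats @ text_feats.T, axis=1)
        votes[np.arange(len(preds)), preds] += alpha
    return [class_names[i] for i in np.argmax(votes, axis=1)]
```

Because the ensemble is stored as text and scalar weights, swapping the target model only means passing a different `encode_text` and image features from that model.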
