Abstract
The classification accuracy of pretrained Vision-Language Models (VLMs) depends on the quality of their text prompts. Handcrafted templates and Large Language Model (LLM)-generated descriptions not only make predictions more interpretable but also allow the same prompts to be reused across heterogeneous VLMs. Recent work constructs task-adapted text prompts from a small number of labeled images. However, existing few-shot text prompting methods do not explicitly focus on misclassified examples during prompt construction, so they yield only marginal improvements even as more shots become available. To fully exploit few-shot supervision, we propose Text Prompt Boosting (TPB), an AdaBoost-inspired framework that treats each text-prompt-based classifier as a weak learner and sequentially aggregates these learners into a strong ensemble by explicitly targeting hard, misclassified examples. Extensive experiments show that TPB preserves task-intrinsic, model-agnostic cues in text space, enabling robust cross-model transfer. Across eleven classification benchmarks, TPB improves accuracy on the source model and preserves shot-driven gains when transferred to larger, more capable VLMs, where existing methods struggle to sustain such improvements.
Method Overview
Text Prompt Boosting treats each prompt-based classifier as a weak learner. At every boosting round, TPB reweights misclassified few-shot examples, selects a new prompt bank with Greedy Prompt Composition, and aggregates all weak classifiers into a final strong classifier.
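The boosting loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard multi-class AdaBoost (SAMME) weight update, represents each prompt bank as a hypothetical matrix with one text embedding per class, and replaces Greedy Prompt Composition with a simple stand-in that picks the candidate bank with the lowest weighted error. The function names and the synthetic data are illustrative assumptions.

```python
import numpy as np


def weak_predict(image_feats, prompt_feats):
    """Prompt-based weak learner: pick the class whose text embedding scores highest."""
    return (image_feats @ prompt_feats.T).argmax(axis=1)


def boost_prompts(image_feats, labels, candidate_banks, n_rounds=5):
    """AdaBoost-style (SAMME) loop over candidate prompt banks.

    Assumption: each bank is an array of shape (n_classes, dim). Selecting the
    bank with the lowest weighted error is a stand-in for Greedy Prompt
    Composition, whose details are not given here.
    """
    n, k = len(labels), candidate_banks[0].shape[0]
    w = np.full(n, 1.0 / n)  # uniform example weights at the start
    ensemble = []
    for _ in range(n_rounds):
        # Weighted error of every candidate bank under the current weights.
        errs = [np.sum(w * (weak_predict(image_feats, b) != labels))
                for b in candidate_banks]
        best = int(np.argmin(errs))
        err = float(np.clip(errs[best], 1e-10, 1 - 1e-10))
        alpha = np.log((1 - err) / err) + np.log(k - 1)  # SAMME learner weight
        if alpha <= 0:
            break  # best candidate is no better than chance
        miss = weak_predict(image_feats, candidate_banks[best]) != labels
        w = w * np.exp(alpha * miss)  # upweight misclassified examples
        w /= w.sum()
        ensemble.append((best, float(alpha)))
    return ensemble


def ensemble_predict(image_feats, candidate_banks, ensemble):
    """Aggregate the selected weak learners by alpha-weighted voting."""
    k = candidate_banks[0].shape[0]
    votes = np.zeros((image_feats.shape[0], k))
    for idx, alpha in ensemble:
        pred = weak_predict(image_feats, candidate_banks[idx])
        votes[np.arange(len(pred)), pred] += alpha
    return votes.argmax(axis=1)


# Synthetic demo: noisy one-hot "image features" and random candidate banks.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=30)
image_feats = np.eye(3)[labels] + 0.3 * rng.normal(size=(30, 3))
banks = [np.eye(3) + 0.8 * rng.normal(size=(3, 3)) for _ in range(8)]
ensemble = boost_prompts(image_feats, labels, banks)
acc = (ensemble_predict(image_feats, banks, ensemble) == labels).mean()
print(f"rounds used: {len(ensemble)}, train accuracy: {acc:.2f}")
```

In a real setting the feature matrices would come from a VLM's image and text encoders rather than from random draws. The point of the sketch is the control flow: reweight, select, aggregate.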
Shot Scalability
Unlike prior text-prompting methods that quickly saturate, TPB continues to benefit from additional few-shot supervision.
Transfer Robustness
Because TPB keeps adaptation in natural-language prompt space, the learned prompt ensemble can be re-embedded and transferred to larger heterogeneous VLMs without model-specific tuning.
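A rough sketch of what re-embedding could look like is given below. It assumes the learned ensemble is stored as plain prompt strings (one per class) paired with a scalar weight, and that the weights are reused unchanged on the target model. The `encode_text` callable is a hypothetical placeholder for the target VLM's text encoder, and the toy encoder in the usage lines is invented purely for illustration.

```python
from typing import Callable, List, Sequence, Tuple

import numpy as np

# Assumed storage format: (one prompt string per class, ensemble weight).
PromptEnsemble = List[Tuple[List[str], float]]


def transfer_predict(
    image_feats: np.ndarray,
    ensemble: PromptEnsemble,
    encode_text: Callable[[Sequence[str]], np.ndarray],
) -> np.ndarray:
    """Re-embed stored prompts with a target model's text encoder and vote.

    encode_text is a hypothetical stand-in for the target VLM; it must map a
    list of n_classes prompts to an array of shape (n_classes, dim).
    """
    n_classes = len(ensemble[0][0])
    votes = np.zeros((image_feats.shape[0], n_classes))
    for prompts, alpha in ensemble:
        text_feats = encode_text(prompts)  # embeddings in the target model's space
        text_feats = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
        pred = (image_feats @ text_feats.T).argmax(axis=1)
        votes[np.arange(len(pred)), pred] += alpha  # weight reused as-is (assumption)
    return votes.argmax(axis=1)


# Toy usage with an invented two-dimensional "encoder".
toy_vectors = {
    "a photo of a cat": [1.0, 0.0],
    "a photo of a dog": [0.0, 1.0],
}


def toy_encoder(prompts):
    return np.array([toy_vectors[p] for p in prompts])


stored = [(["a photo of a cat", "a photo of a dog"], 1.5)]
print(transfer_predict(np.eye(2), stored, toy_encoder))
```

Because only strings and scalar weights are stored, nothing in this representation is tied to the source model's embedding dimension, which is what makes swapping in a different encoder straightforward.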