Until now, Sentence Transformers was mostly a text library, with image support limited to CLIP models. Version 5.4 changes that.
The library is the standard Python tool for creating embeddings: numerical representations of content that capture semantic meaning and power search that understands intent rather than just keywords. It now encodes text, images, audio, and video into the same shared space. That means you can compare across modalities: search a product image catalog with a text query, find video clips that match a description, or match screenshots to search terms. All of this goes through the same .encode() call the library has always used.
Four multimodal embedding models are ready to go: Qwen's Qwen3-VL-Embedding in 2B and 8B parameter sizes (handling text, image, and video), plus two Nvidia models - llama-nemotron-embed-vl-1b-v2 at 1.7B and omni-embed-nemotron-3b at 4.7B - both handling text and images.
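To make the shared-space idea concrete, here is a minimal sketch of cross-modal search. It assumes only that the model object has an .encode() method returning one vector per input. ToyEncoder, its hand-picked 2-D vectors, and the file names are hypothetical stand-ins so the example runs without downloading a model.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(model, query, items, top_k=3):
    """Rank `items` against `query` using one shared embedding space.

    `model` is anything with an .encode(list) method that returns one
    vector per input.
    """
    query_vec = model.encode([query])[0]
    item_vecs = model.encode(items)
    scored = [(cosine(query_vec, vec), item) for vec, item in zip(item_vecs, items)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


class ToyEncoder:
    """Hypothetical stand-in so the sketch runs without a model download.

    It maps hand-picked labels to 2-D vectors; a real model would produce
    high-dimensional vectors from actual text, image, or video inputs.
    """

    _vectors = {
        "red running shoes": [0.9, 0.1],
        "shoe_photo.jpg": [0.8, 0.3],
        "laptop_photo.jpg": [0.1, 0.9],
    }

    def encode(self, inputs):
        return [self._vectors[item] for item in inputs]


results = search(ToyEncoder(), "red running shoes", ["laptop_photo.jpg", "shoe_photo.jpg"])
for score, item in results:
    print(f"{score:.3f}  {item}")
```

With the real library you would pass a loaded SentenceTransformer model in place of ToyEncoder. How image or video inputs are supplied (file path, URL, or in-memory object) depends on the model, so check the model card before relying on any one form.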
Rerankers Get the Multimodal Treatment Too
v5.4 also ships multimodal reranker models. In a typical search pipeline, embeddings handle fast first-pass retrieval - you get back 50-100 candidates quickly. A reranker then scores each query-candidate pair more carefully, improving precision on those top results. Previously, that scoring step was text-only. Now it can handle image-text pairs, with models from Qwen (2B and 8B), Nvidia, and Jina AI available out of the box.
The release also adds new text-only reranker models, with options from mixedbread-ai (0.5B and 2B parameter models) and Qwen (0.6B and 4B). These are useful for anyone who wants better search precision on text-only pipelines without adding multimodal complexity.
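The retrieve-then-rerank flow described in this section can be sketched in a few lines. This is an illustration of the pattern, not the library's API: fast_score and careful_score are hypothetical callables standing in for embedding similarity and a reranker model, and the toy scorers below exist only so the example runs on its own.

```python
def retrieve_then_rerank(query, corpus, fast_score, careful_score, n_candidates=50, top_k=5):
    """Two-stage search: cheap scoring over everything, costly scoring over a shortlist.

    fast_score(query, doc) stands in for embedding similarity;
    careful_score(query, doc) stands in for a reranker model.
    Both are hypothetical callables returning a number (higher is better).
    """
    # Stage 1: score the whole corpus cheaply and keep a shortlist.
    shortlist = sorted(corpus, key=lambda doc: fast_score(query, doc), reverse=True)[:n_candidates]
    # Stage 2: rescore only the shortlist with the slower, more precise scorer.
    reranked = sorted(shortlist, key=lambda doc: careful_score(query, doc), reverse=True)
    return reranked[:top_k]


def word_overlap(query, doc):
    """Fast stand-in: number of words shared by the query and the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


def phrase_bonus(query, doc):
    """Careful stand-in: word overlap plus a bonus when the full query appears verbatim."""
    return word_overlap(query, doc) + (2 if query.lower() in doc.lower() else 0)


corpus = [
    "shoes red running sale",
    "red running shoes for trail use",
    "blue dress shoes",
    "running tips for beginners",
]
top = retrieve_then_rerank(
    "red running shoes", corpus, word_overlap, phrase_bonus, n_candidates=3, top_k=2
)
print(top)
```

The first stage cannot tell the two shoe listings apart, since both share all three query words. The second stage looks at each pair more closely and promotes the one containing the exact phrase, which is the kind of precision gain a reranker is meant to provide.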
GPU Required, CLIP Still Works on CPU
The new models are built on vision-language models (VLMs) and need GPU memory: around 8 GB VRAM for the 2B variants, roughly 20 GB for the 8B. The Hugging Face team notes CPU inference will be "extremely slow" - this isn't a laptop project. For lower-resource environments, the classic CLIP models (ViT-B-32 through ViT-L-14) remain supported and run on CPU. The multilingual CLIP variant covers 50-plus languages.
Installation uses optional extras to keep the base package light. Install image support alone, or image, video, and audio support together:
pip install "sentence-transformers[image]"
pip install "sentence-transformers[image,video,audio]"
One calibration note before you build: cross-modal similarity scores run lower than within-modal scores. A text-to-text exact match approaches 1.0; a correct text-to-image match might score 0.5 or 0.6. The relative ordering across results is what matters for retrieval, not the absolute number. Set your similarity thresholds accordingly, or you'll filter out valid results.
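A small sketch shows why a fixed cutoff can backfire. The scores below are illustrative numbers chosen to match the ranges quoted above, not measured results, and the 0.8 and 0.4 thresholds are assumptions made for the example.

```python
def filter_by_threshold(scored, threshold):
    """Keep only results whose similarity meets an absolute cutoff."""
    return [(item, score) for item, score in scored if score >= threshold]


def top_k(scored, k):
    """Keep the k best results by relative order, ignoring absolute values."""
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Illustrative text-to-image scores only; file names are hypothetical.
text_to_image = [
    ("similar_product.jpg", 0.51),
    ("correct_product.jpg", 0.58),
    ("unrelated.jpg", 0.12),
]

# A cutoff tuned on text-to-text scores discards every image, including the right one.
strict = filter_by_threshold(text_to_image, threshold=0.8)
print("threshold 0.8:", strict)

# Ranking by relative order still puts the correct image first.
ranked = top_k(text_to_image, k=2)
print("top 2:", ranked)

# If a cutoff is needed, choose it from cross-modal scores instead (0.4 is an assumption).
relaxed = filter_by_threshold(text_to_image, threshold=0.4)
print("threshold 0.4:", relaxed)
```

In practice, a sensible cross-modal cutoff comes from scoring a sample of known-good and known-bad pairs with the model you plan to deploy, since the usable range varies by model.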
A fine-tuning guide for multimodal models is coming from the Hugging Face team. The models themselves are live on the Hub today.