This week, Liquid AI released two new retrieval models: LFM2.5-ColBERT-350M and LFM2.5-Embedding-350M. Both have 350M parameters, and they are the first bidirectional members of the LFM family. They build on LFM2.5-350M-Base, released in March. The pair targets fast multilingual and cross-lingual search in 11 languages, with footprints small enough to run almost anywhere. Both are now available on Hugging Face under the LFM Open License v1.0.
LFM2.5 Retrievers
Both models share the same foundation but represent text in different ways. LFM2.5-Embedding-350M is a dense bi-encoder that turns each document into a single vector. Choose it when you want the fastest search and the smallest, cheapest index.
LFM2.5-ColBERT-350M is a late-interaction model. It converts each document into one vector per token instead of a single vector. This lets it match queries against documents token by token, for higher accuracy and better generalization. The trade-off is a larger index. Choose it when accuracy matters more than storage. Its query length is limited to 32 tokens. It can also re-rank the results of a first-stage retriever without building any index.
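To make the difference between the two scoring styles concrete, here is a minimal plain-Python sketch. The toy 2-dimensional vectors and the helper names `dense_score` and `maxsim_score` are illustrative assumptions, not part of either model's API; real embeddings are much wider.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def dense_score(query_vec, doc_vec):
    """Bi-encoder scoring: one vector per text, one dot product."""
    return dot(query_vec, doc_vec)


def maxsim_score(query_tokens, doc_tokens):
    """Late-interaction (MaxSim) scoring: for every query token, keep its
    best-matching document token, then sum those maxima."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy vectors (hypothetical values, chosen for easy arithmetic).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

late = maxsim_score(query_tokens, doc_tokens)  # each query token finds an exact match
dense = dense_score([0.5, 0.5], [0.5, 0.5])    # a single pooled vector per side
print(late, dense)
```

The late-interaction score needs every document token vector at query time, which is why its index grows with document length while the dense index does not.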
Both target short-context search. Good fits include product catalogs, FAQ knowledge bases, and support documentation. Liquid AI positions both as drop-in replacements for existing RAG pipelines.
Architecture change: bidirectional by design
Both models start from LFM2.5-350M-Base, a mid-range general-purpose checkpoint. Liquid AI applies a small set of bidirectional patches to the LFM2 architecture. These adapt it from a causal decoder to a bidirectional encoder.
In a causal setup, each token attends only to itself and the tokens before it. That suits left-to-right generation but is less natural for retrieval. The team replaces the causal attention mask with a bidirectional mask, so each token can now attend to both left and right context. They also make the LFM2 short convolutions non-causal. These now mix local information symmetrically around each token, not only from the past.
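The two changes can be illustrated with a small plain-Python sketch. It is a simplification under stated assumptions: the function names `attention_mask` and `short_conv` are hypothetical, the convolution is a bare 1-D filter with zero padding, and the real LFM2 blocks contain more than this.

```python
def attention_mask(n, bidirectional):
    """mask[i][j] is True when token i may attend to token j."""
    return [[bidirectional or j <= i for j in range(n)] for i in range(n)]


def short_conv(x, kernel, causal):
    """1-D short convolution with zero padding.

    causal=True:  output i mixes x[i-k+1 .. i] (past and current only).
    causal=False: output i mixes a window centred on i (odd kernel length).
    """
    k = len(kernel)
    left = k - 1 if causal else k // 2  # how far the window reaches back
    out = []
    for i in range(len(x)):
        total = 0.0
        for t, w in enumerate(kernel):
            j = i - left + t
            if 0 <= j < len(x):  # positions outside the sequence count as zero
                total += w * x[j]
        out.append(total)
    return out


causal_mask = attention_mask(3, bidirectional=False)  # lower-triangular
full_mask = attention_mask(3, bidirectional=True)     # every entry True

signal = [1, 2, 3, 4]
past_only = short_conv(signal, [1, 1, 1], causal=True)
symmetric = short_conv(signal, [1, 1, 1], causal=False)
```

With the causal filter the first output sees only the first input; with the symmetric filter it also sees its right-hand neighbour.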
This yields the full-context representations that retrieval requires while preserving the efficiency of the LFM2 backbone. Each model consists of 17 layers: 10 convolution, 6 attention, and 1 pooling or dense. Context length reaches 32,768 tokens, although documents are limited to 512 tokens. The two models share the encoder and differ only in the output. The embedding model uses CLS-style pooling to produce one 1024-dim vector. ColBERT keeps a 128-dim embedding per token for MaxSim late interaction.
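The output dimensions above translate directly into index size. The arithmetic below is a rough sketch that assumes uncompressed FP16 storage and 256 tokens per document; both are assumptions for illustration, and a compressed index would be smaller in practice.

```python
BYTES_PER_FP16 = 2  # assumption: vectors stored uncompressed in FP16


def dense_bytes(dim=1024):
    """One 1024-dim vector per document."""
    return dim * BYTES_PER_FP16


def colbert_bytes(num_tokens, dim=128):
    """One 128-dim vector per document token."""
    return num_tokens * dim * BYTES_PER_FP16


tokens_per_doc = 256  # assumption: a typical short document
dense = dense_bytes()                 # bytes per document, dense model
late = colbert_bytes(tokens_per_doc)  # bytes per document, late interaction
ratio = late / dense
print(dense, late, ratio)
```

Under these assumptions the late-interaction index is 32 times larger per document, which is the storage cost behind the accuracy gain.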
Training and data
Both models follow the same three-stage recipe:
- Stage one is large-scale contrastive pre-training in English.
- Stage two is multilingual and cross-lingual distillation from a strong teacher in all 11 languages.
- Stage three is final fine-tuning on hard negatives.
The embedding model sees slightly more cross-lingual data than ColBERT, since cross-lingual retrieval emerges more naturally in late-interaction setups. The training data combines curated internal data with open-source English retrieval datasets. LLM-based translation supplies the multilingual and cross-lingual pairs.
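The article does not say how the hard negatives for the last stage are chosen. One common approach, sketched here purely as an assumption, is to rank candidates with an existing retriever and keep the highest-scoring documents that are not labelled relevant. The function name `mine_hard_negatives` and the scores are hypothetical.

```python
def mine_hard_negatives(scores, positive_ids, num_negatives=2):
    """Pick the top-scoring documents that are NOT known positives.

    scores: dict mapping doc_id -> similarity to the query, as produced by
            some existing retriever (hypothetical values below).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [doc_id for doc_id in ranked if doc_id not in positive_ids][:num_negatives]


scores = {"d1": 0.91, "d2": 0.88, "d3": 0.35, "d4": 0.80}
hard = mine_hard_negatives(scores, positive_ids={"d1"})
print(hard)
```

Documents such as `d2` and `d4` look relevant to the retriever but are not, so training against them sharpens the decision boundary more than random negatives would.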
Benchmarks
Liquid AI evaluated two capabilities. The first is multilingual retrieval, measured with NanoBEIR. The second is cross-lingual open-domain QA, measured with MKQA-11. Both benchmarks report results across all 11 languages: Arabic, German, English, Spanish, French, Italian, Japanese, Korean, Norwegian, Portuguese, and Swedish.
On average, both models lead their class. Here are the comparative details:
| Model | Type | NanoBEIR ML (NDCG@10) | MKQA-11 (Recall@20) |
|---|---|---|---|
| LFM2.5-ColBERT-350M | late interaction | 0.605 | 0.694 |
| LFM2.5-Embedding-350M | dense | 0.577 | 0.691 |
| Qwen/Qwen3-Embedding-0.6B | dense | 0.556 | 0.638 |
| LFM2-ColBERT-350M | late interaction | 0.540 | 0.646 |
| Alibaba-NLP/gte-multilingual-base | dense | 0.528 | 0.675 |
| lightonai/GTE-ModernColBERT-v1 | late interaction | 0.489 | 0.459 |
| BAAI/bge-large-en-v1.5 | dense | 0.359 | 0.413 |
ColBERT leads on both averages. The embedding model is close behind on MKQA-11, at 0.691 versus 0.694. Both beat the larger Qwen3-Embedding-0.6B. The new ColBERT also improves on the earlier LFM2-ColBERT-350M, from 0.540 to 0.605 on NanoBEIR. Liquid AI also notes that NanoBEIR English tracks the more expensive full BEIR. The two remain highly correlated, with NanoBEIR scores running a near-constant ~15% higher. The research team therefore uses NanoBEIR as a practical proxy during training.
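For readers unfamiliar with the two metrics in the table, here is a minimal reference sketch. It uses the common linear-gain form of NDCG and defines recall as the fraction of relevant documents found; the exact variants used by NanoBEIR and MKQA-11 may differ, so treat these definitions as assumptions.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg_at_k(ranked_ids, relevance, k=10):
    """relevance: dict doc_id -> graded relevance (missing ids count as 0)."""
    gains = [relevance.get(doc_id, 0) for doc_id in ranked_ids[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


def recall_at_k(ranked_ids, relevant_ids, k=20):
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


relevance = {"a": 1, "c": 1}
imperfect = ndcg_at_k(["a", "b", "c"], relevance)  # relevant doc "c" ranked too low
perfect = ndcg_at_k(["a", "c", "b"], relevance)    # both relevant docs on top
recall = recall_at_k(["a", "b", "c"], {"a", "z"})  # "z" was never retrieved
```

NDCG rewards placing relevant documents early, while recall only asks whether they appear in the top k at all.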
Latency and edge deployment
Liquid AI also releases GGUF variants for llama.cpp. These let both models run on CPUs, laptops, and edge devices. The figures below were measured on a MacBook Pro M4 Max in FP16. Queries have 32 tokens; documents have 256 tokens.
| Model | Stage | Documents cached | p50 latency |
|---|---|---|---|
| LFM2.5-Embedding-350M | Query embedding | Yes | 7.3 ms |
| LFM2.5-ColBERT-350M | Query embedding + MaxSim | Yes | 8.2 ms |
| LFM2.5-ColBERT-350M | Query + document embedding + MaxSim | No | 34.3 ms |
When document embeddings are pre-computed, median (p50) query latency stays under 10 ms. Encoding documents at query time brings ColBERT to 34.3 ms. For enterprise scale, Liquid AI also built an internal GPU stack. On an H100 in FP16, it sees latency as low as 1 ms. Embedding query latency is 1.5 ms at p50.
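The article does not describe how these latencies were measured. A minimal way to collect a comparable p50 figure on your own hardware is sketched below; the helper `measure_p50_ms` and the stand-in workload `fake_encode` are hypothetical, and you would replace the workload with a real query-encoding call.

```python
import statistics
import time


def measure_p50_ms(fn, warmup=3, runs=50):
    """Return the median wall-clock latency of fn() in milliseconds."""
    for _ in range(warmup):  # discard cold-start effects
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def fake_encode():
    """Stand-in workload (hypothetical); swap in a real model call."""
    sum(i * i for i in range(10_000))


p50 = measure_p50_ms(fake_encode)
print(f"p50 latency: {p50:.3f} ms")
```

Reporting the median rather than the mean keeps a few slow outliers from distorting the number.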
Use cases with examples
- E-commerce: Search product listings in multiple languages with a single index. A shopper types a Korean query and the system surfaces an English product listing. Cross-lingual retrieval does this without per-language indexes.
- FAQ and support knowledge bases: Reliably surface the right answer on customer-facing surfaces. A French help question retrieves an English help article.
- On-device semantic search: Search files, emails, and notes locally on consumer hardware. GGUF builds keep data on the device at almost zero cost.
- Enterprise knowledge assistants: Retrieve internal legal, financial, and technical documents across languages. ColBERT fits when answer accuracy outweighs index size.
Code: getting started
The embedding model runs with sentence-transformers. Always pass the asymmetric prompts, query: and document:. Leaving them out degrades retrieval quality.
```python
from sentence_transformers import SentenceTransformer

# Load the dense bi-encoder from Hugging Face.
model = SentenceTransformer(
    "LiquidAI/LFM2.5-Embedding-350M",
    trust_remote_code=True,
)

queries = ["What is the capital of France?"]
documents = ["Paris is the capital and largest city of France."]

# prompt_name selects the model's asymmetric query / document prompt.
q_emb = model.encode(queries, prompt_name="query", normalize_embeddings=True)
d_emb = model.encode(documents, prompt_name="document", normalize_embeddings=True)

# Embeddings are normalized, so the dot product is cosine similarity.
scores = q_emb @ d_emb.T  # shape: (n_queries, n_documents)
```
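Once you have a score matrix, ranking is a sort per query. The sketch below works on plain lists so it runs without the model; the helper `top_k`, the document IDs, and the score values are hypothetical. With real output you could pass one row of the matrix converted via `.tolist()`.

```python
def top_k(score_row, doc_ids, k=3):
    """Return the k (doc_id, score) pairs with the highest scores."""
    ranked = sorted(zip(doc_ids, score_row), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical similarity scores for one query against three documents.
score_row = [0.12, 0.83, 0.47]
doc_ids = ["doc-a", "doc-b", "doc-c"]

best = top_k(score_row, doc_ids, k=2)
print(best)
```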
The ColBERT model runs with PyLate. Its PLAID index uses FastPLAID for efficient similarity search.
```python
from pylate import indexes, models, retrieve

model = models.ColBERT(
    model_name_or_path="LiquidAI/LFM2.5-ColBERT-350M",
    trust_remote_code=True,
)
# Reuse the EOS token for padding.
model.tokenizer.pad_token = model.tokenizer.eos_token

# Create a PLAID index on disk.
index = indexes.PLAID(index_folder="pylate-index", index_name="index", override=True)

# Encode the documents (is_query=False) and add them to the index.
docs_emb = model.encode(["document 1 text", "document 2 text"], is_query=False)
index.add_documents(documents_ids=["1", "2"], documents_embeddings=docs_emb)

# Encode the query and retrieve the top 10 matches.
retriever = retrieve.ColBERT(index=index)
q_emb = model.encode(["a search query"], is_query=True)
scores = retriever.retrieve(queries_embeddings=q_emb, k=10)
```
To re-rank the output of an existing first-stage pipeline, skip the index and use rank.rerank.
```python
from pylate import models, rank

model = models.ColBERT(
    model_name_or_path="LiquidAI/LFM2.5-ColBERT-350M",
    trust_remote_code=True,
)

# One list of candidate documents (and matching IDs) per query.
queries = ["query A"]
documents = [["candidate doc 1", "candidate doc 2"]]
documents_ids = [[1, 2]]

q_emb = model.encode(queries, is_query=True)
d_emb = model.encode(documents, is_query=False)

# Re-score the candidates directly; no index is needed.
reranked = rank.rerank(
    documents_ids=documents_ids,
    queries_embeddings=q_emb,
    documents_embeddings=d_emb,
)
```
You can also fine-tune either model on your own data. The embedding model card provides a snippet using sentence-transformers and MultipleNegativesRankingLoss.
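To show what that loss optimizes, here is a plain-Python sketch of the idea behind MultipleNegativesRankingLoss: each anchor should score its own positive above every other positive in the batch. This is an illustration, not the library's implementation; the function names are hypothetical, and `scale=20.0` reflects what I understand to be the library default, so treat it as an assumption.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mnr_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where anchor i must pick positive i;
    the other positives in the batch act as negatives."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        max_logit = max(logits)  # subtract the max for numerical stability
        log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy batch (hypothetical 2-dim embeddings).
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = mnr_loss(anchors, [[1.0, 0.0], [0.0, 1.0]])   # pairs match: loss near 0
shuffled = mnr_loss(anchors, [[0.0, 1.0], [1.0, 0.0]])  # pairs swapped: large loss
print(aligned, shuffled)
```

Because every other example in the batch serves as a negative, larger batches give the model more contrast per step.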
Key takeaways
- Liquid AI’s LFM2.5-ColBERT-350M and LFM2.5-Embedding-350M are the first bidirectional LFMs, built for multilingual search in 11 languages.
- Both 350M models lead their class on NanoBEIR and MKQA-11, outperforming the larger Qwen3-Embedding-0.6B.
- The embedding model gives the smallest, cheapest index; ColBERT trades a larger index for higher per-token accuracy.
- The GGUF builds run via llama.cpp on CPUs, laptops, and edge devices, with cached p50 query latency under 10 ms.
- They drop into existing RAG pipelines through sentence-transformers and PyLate, and are released under the LFM Open License v1.0.