Liquid AI has released LFM2-VL-3B, a 3B-parameter vision-language model for image-text-to-text tasks. It extends the LFM2-VL family beyond the 450M and 1.6B variants. The model targets higher accuracy while preserving the speed profile of the LFM2 architecture. It is available on LEAP and Hugging Face under the LFM Open License v1.0.
Model overview and interface
LFM2-VL-3B accepts interleaved image and text inputs and produces text outputs. The model exposes a ChatML-like template, and the processor inserts an image sentinel token that is replaced with encoded image tokens at run time. The default text context length is 32,768 tokens. These details help developers reproduce evaluations and integrate the model with existing multimodal pipelines.
Architecture
The stack pairs a language tower with a shape-aware vision tower and a projector. The language tower is LFM2-2.6B, a hybrid convolution-plus-attention backbone. The vision tower is SigLIP2 NaFlex at 400M parameters; it preserves native aspect ratios and avoids distortion. The connector is a 2-layer MLP with pixel unshuffle that compresses image tokens before fusion with the language space. This design lets users cap vision token budgets without retraining the model.
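The pixel-unshuffle step can be illustrated in a few lines. The sketch below is a minimal illustration, not Liquid AI's implementation; the unshuffle factor of 2 is an assumption consistent with the 4× token reduction implied by the documented mappings. It folds each 2×2 block of vision features into the channel dimension, so information is preserved while the number of spatial tokens drops:

```python
import numpy as np

def pixel_unshuffle(x: np.ndarray, r: int = 2) -> np.ndarray:
    """Fold each r x r spatial block of features into the channel axis,
    cutting the number of spatial tokens by a factor of r * r."""
    h, w, c = x.shape
    assert h % r == 0 and w % r == 0, "feature grid must divide evenly by r"
    x = x.reshape(h // r, r, w // r, r, c)
    x = x.transpose(0, 2, 1, 3, 4)  # gather each r x r block contiguously
    return x.reshape(h // r, w // r, r * r * c)

# A 16x24 grid of vision features (384 tokens) becomes an 8x12 grid
# (96 tokens) with 4x more channels per token.
feats = np.random.rand(16, 24, 8)
out = pixel_unshuffle(feats)
print(out.shape)  # (8, 12, 32)
```

The 16×24 example grid is chosen to echo the documented 256×384-pixel case, which ends up at 96 image tokens.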
The encoder processes native resolutions up to 512×512. Larger inputs are split into non-overlapping 512×512 patches, and a thumbnail pathway provides global context during tiling. The token mapping is documented with concrete examples: a 256×384 image maps to 96 tokens, and a 1000×3000 image maps to 1,020 tokens. The model card exposes user controls for minimum and maximum image tokens and the tiling switch. These controls tune speed and quality at inference time.
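For images at or below the native 512×512 resolution, the documented mapping is consistent with a simple back-of-the-envelope count: the number of vision patches divided by the projector's 4× compression. The sketch below assumes a 16-pixel patch size; tiled inputs above 512×512 add thumbnail tokens and a more involved layout, so they are deliberately not modeled here:

```python
import math

PATCH = 16     # assumed SigLIP2 patch size in pixels
COMPRESS = 4   # 2x2 pixel unshuffle in the projector

def estimate_image_tokens(width: int, height: int) -> int:
    """Rough image-token estimate for an input processed at native
    resolution (no tiling, i.e. both sides <= 512)."""
    patches = math.ceil(width / PATCH) * math.ceil(height / PATCH)
    return math.ceil(patches / COMPRESS)

print(estimate_image_tokens(384, 256))  # 96, matching the documented example
print(estimate_image_tokens(512, 512))  # 256, the recommended max_image_tokens
```

That the 512×512 case lands exactly on the recommended maximum of 256 image tokens suggests the cap corresponds to one full-resolution tile.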
Inference settings
The Hugging Face model card provides recommended parameters. Text generation uses temperature 0.1, min-p 0.15, and a repetition penalty of 1.05. Vision settings use a minimum of 64 image tokens, a maximum of 256 image tokens, and image splitting enabled. The processor applies the chat template and the image sentinel automatically. The example uses AutoModelForImageTextToText and AutoProcessor with bfloat16 precision.
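A minimal usage sketch under those settings might look as follows. This is an assumption-laden sketch, not the model card's verbatim example: the repo ID, `max_new_tokens`, and message layout follow common transformers conventions, and the heavy model download is wrapped in a function so only the configuration runs at import time:

```python
MODEL_ID = "LiquidAI/LFM2-VL-3B"  # assumed Hugging Face repo name

# Recommended decoding parameters from the model card
# (max_new_tokens is an illustrative addition, not from the card).
GEN_KWARGS = {
    "do_sample": True,
    "temperature": 0.1,
    "min_p": 0.15,
    "repetition_penalty": 1.05,
    "max_new_tokens": 256,
}

def describe_image(image, prompt: str) -> str:
    """Run one image+text turn. Needs `transformers`, `torch`, and the
    model weights, so nothing heavy happens until this is called."""
    import torch
    from transformers import AutoModelForImageTextToText, AutoProcessor

    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForImageTextToText.from_pretrained(
        MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto",
    )
    # The processor inserts the image sentinel and applies the
    # ChatML-like template automatically.
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }]
    inputs = processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    out = model.generate(**inputs, **GEN_KWARGS)
    return processor.batch_decode(out, skip_special_tokens=True)[0]
```

Keeping the decoding parameters in one dict makes it easy to reproduce the card's recommended settings across scripts.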
How is it trained?
Liquid AI describes a staged approach. The team performs joint mid-training that adjusts the text-to-image ratio over time. The model then undergoes supervised fine-tuning focused on image understanding. The data sources are large-scale open datasets plus in-house synthetic vision data for task coverage.
Benchmarks
The research team reports competitive results among lightweight open VLMs. On MM-IFEval the model reaches 51.83. On RealWorldQA it reaches 71.37. On MMBench-dev-en it reaches 79.81. The POPE score is 89.01. The table notes that scores for other systems were computed with VLMEvalKit. The table excludes Qwen3-VL-2B because that system was released one day earlier.
The language capability remains close to that of the LFM2-2.6B backbone. The research team cites 30% on GPQA and 63% on MMLU. This matters when perception tasks include knowledge queries. The team also reports expanded multilingual visual understanding across English, Japanese, French, Spanish, German, Italian, Portuguese, Arabic, Chinese, and Korean.
Why should edge users care?
The architecture keeps compute and memory within small-device budgets. Image tokens are compressible and user-constrained, so throughput is predictable. The SigLIP2 400M NaFlex encoder preserves aspect ratios, which helps fine-grained perception. The projector reduces tokens at the connector, which improves tokens per second. The research team has also published a GGUF build for on-device runtimes. These properties are useful for robotics, mobile, and industrial clients that need local processing and strict data boundaries.
Key Takeaways
- Compact multimodal stack: the 3B-parameter LFM2-VL-3B pairs an LFM2-2.6B language tower with a 400M SigLIP2 NaFlex vision encoder and a 2-layer MLP projector for image-token fusion. NaFlex preserves native aspect ratios.
- Resolution handling and token budgets: images run natively up to 512×512; larger inputs tile into non-overlapping 512×512 patches with a thumbnail pathway for global context. Documented token mappings include 256×384 → 96 tokens and 1000×3000 → 1,020 tokens.
- Inference interface: ChatML-like prompting with an image sentinel, a default text context of 32,768 tokens, recommended decoding settings, and processor-level controls for image splitting enable reproducible evaluation and easy integration in multimodal pipelines.
- Measured performance: reported results include MM-IFEval 51.83, RealWorldQA 71.37, MMBench-dev-en 79.81, and POPE 89.01. Language-only signals from the backbone are about 30% GPQA and 63% MMLU, useful for mixed perception-plus-knowledge workloads.
LFM2-VL-3B is a practical step for edge multimodal workloads. The 3B stack pairs LFM2-2.6B with a 400M SigLIP2 NaFlex encoder and an efficient projector, which lowers image-token counts for predictable latency. Native-resolution processing with 512×512 tiling and token caps gives deterministic budgets. Reported scores on MM-IFEval, RealWorldQA, MMBench, and POPE are competitive for this size. Open weights, a GGUF build, and LEAP access reduce integration friction. Overall, this is an edge-ready VLM release with clear controls and transparent benchmarks.
Check out the Model on Hugging Face and the technical details.
Michal Sutter is a data science professional with a Master of Science in Data Science from the University of Padova. With a solid foundation in statistical analysis, machine learning, and data engineering, Michal excels at transforming complex datasets into actionable insights.