
Vision language models (VLMs) combine text input with visual understanding. However, input image resolution is critical to VLM performance, particularly for text- and chart-rich data, and increasing it creates significant challenges. First, pretrained vision encoders often struggle with high-resolution images due to inefficient pretraining requirements, and running inference on high-resolution images increases computational cost and latency during visual token generation, whether through single high-resolution processing or multiple lower-resolution tile strategies. Second, high-resolution images produce more tokens, which increases the LLM prefilling time and hence the time-to-first-token (TTFT), defined as the sum of the vision encoder latency and the LLM prefilling time.
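To make that latency decomposition concrete, here is a toy Python sketch (not from the paper or its code) that models TTFT as the sum of encoder latency and a prefill term proportional to the visual token count; the function names, the linear prefill model, and the example numbers are all illustrative assumptions.

```python
# Illustrative only: a toy model of TTFT as defined above. The example
# numbers and the linear prefill assumption are not measurements from the paper.

def visual_token_count(image_size: int, downsample_factor: int) -> int:
    """Number of visual tokens for a square image when the encoder
    reduces spatial resolution by `downsample_factor` per side."""
    side = image_size // downsample_factor
    return side * side

def time_to_first_token(encoder_latency_s: float,
                        num_visual_tokens: int,
                        prefill_s_per_token: float) -> float:
    """TTFT = vision-encoder latency + LLM prefilling time,
    with prefill time modeled (simplistically) as linear in token count."""
    return encoder_latency_s + num_visual_tokens * prefill_s_per_token

# Doubling the input resolution quadruples the token count, so the prefill
# term (and hence TTFT) grows quickly with resolution.
for size in (336, 672, 1344):
    tokens = visual_token_count(size, downsample_factor=14)  # ViT-style patch size
    print(size, tokens, time_to_first_token(0.05, tokens, 1e-4))
```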
Large multimodal models such as Frozen and Florence used cross-attention to combine image and text embeddings within intermediate LLM layers. Auto-regressive architectures such as LLaVA, mPLUG-Owl, MiniGPT-4, and Cambrian-1 have proven effective. For efficient image encoding, CLIP-pretrained vision transformers are widely adopted, with variants such as SigLIP, EVA-CLIP, InternViT, and DFN-CLIP. Methods such as LLaVA-PruMerge and Matryoshka-based token sampling attempt dynamic token pruning, while hierarchical backbones such as ConvNeXT and FastViT reduce tokens through progressive downsampling. More recently, ConvLLaVA was introduced, which uses a pure-convolutional vision encoder to encode images for a VLM.
Apple researchers have proposed FastVLM, a model that achieves an optimized tradeoff between resolution, latency, and accuracy by analyzing how image quality, processing time, number of tokens, and LLM size affect each other. It uses FastViTHD, a hybrid vision encoder designed to output fewer tokens and reduce encoding time for high-resolution images. FastVLM attains an optimal balance between visual token count and image resolution solely by scaling the input image. It shows a 3.2× improvement in TTFT in the LLaVA-1.5 setup and, using the same 0.5B LLM, achieves better performance on key benchmarks than LLaVA-OneVision at its maximum resolution, with 85× faster TTFT and a 3.4× smaller vision encoder.
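As a rough illustration of why the encoder's output downsampling factor and the input resolution jointly set the visual token budget, the following back-of-the-envelope Python sketch (not FastVLM code) compares a factor-16 output grid with a factor-32 one; the helper name and example resolutions are assumptions.

```python
# Back-of-the-envelope sketch: how input resolution and the encoder's
# downsampling factor determine the number of visual tokens fed to the LLM.

def tokens_for(image_size: int, downsample: int) -> int:
    return (image_size // downsample) ** 2

for image_size in (512, 1024, 1536):
    grid16 = tokens_for(image_size, downsample=16)  # typical factor-16 output grid
    grid32 = tokens_for(image_size, downsample=32)  # FastViTHD-style factor-32 grid
    print(f"{image_size}px: factor-16 -> {grid16} tokens, "
          f"factor-32 -> {grid32} tokens ({grid16 / grid32:.0f}x fewer)")
```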
All FastVLM models are trained on a single node with 8× NVIDIA H100-80GB GPUs; stage-1 training of the VLM is fast, taking about 30 minutes with a Qwen2-7B decoder. FastViTHD extends the base FastViT architecture by adding an extra stage with a downsampling layer. This ensures that self-attention operates on tensors downsampled by a factor of 32 instead of 16, reducing image encoding latency while generating 4× fewer tokens for the LLM decoder. The FastViTHD architecture consists of five stages: the first three use RepMixer blocks for efficient processing, while the final two employ multi-headed self-attention blocks, striking an optimal balance between computational efficiency and high-resolution image understanding.
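The following is a minimal, hypothetical PyTorch sketch of that five-stage hybrid layout: three convolutional mixing stages followed by two self-attention stages operating on a 32×-downsampled grid. The block definitions, channel widths, and downsampling schedule are simplified stand-ins, not Apple's FastViTHD implementation.

```python
import torch
import torch.nn as nn

class ConvMixerBlock(nn.Module):
    """Stand-in for a RepMixer-style block: depthwise conv mixing + pointwise MLP."""
    def __init__(self, dim: int):
        super().__init__()
        self.mix = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.mlp = nn.Sequential(nn.Conv2d(dim, dim * 4, 1), nn.GELU(),
                                 nn.Conv2d(dim * 4, dim, 1))
    def forward(self, x):
        return x + self.mlp(x + self.mix(x))

class AttentionBlock(nn.Module):
    """Multi-headed self-attention over the (downsampled) spatial grid."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
    def forward(self, x):                       # x: (B, C, H, W)
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)      # (B, H*W, C)
        q = self.norm(seq)
        seq = seq + self.attn(q, q, q)[0]
        return seq.transpose(1, 2).reshape(b, c, h, w)

def downsample(cin: int, cout: int) -> nn.Module:
    return nn.Conv2d(cin, cout, kernel_size=2, stride=2)

class HybridEncoderSketch(nn.Module):
    """Five stages: convolutional mixing at /4, /8, /16 resolution,
    then self-attention on a /32 grid (widths and schedule are assumptions)."""
    def __init__(self, dims=(64, 128, 256, 512, 512)):
        super().__init__()
        self.stem = nn.Conv2d(3, dims[0], kernel_size=4, stride=4)  # /4
        self.stage1, self.down1 = ConvMixerBlock(dims[0]), downsample(dims[0], dims[1])  # /8
        self.stage2, self.down2 = ConvMixerBlock(dims[1]), downsample(dims[1], dims[2])  # /16
        self.stage3, self.down3 = ConvMixerBlock(dims[2]), downsample(dims[2], dims[3])  # /32
        self.stage4 = AttentionBlock(dims[3])
        self.proj = nn.Conv2d(dims[3], dims[4], kernel_size=1)
        self.stage5 = AttentionBlock(dims[4])
    def forward(self, x):
        x = self.stem(x)
        x = self.down1(self.stage1(x))
        x = self.down2(self.stage2(x))
        x = self.down3(self.stage3(x))
        x = self.stage5(self.proj(self.stage4(x)))
        return x.flatten(2).transpose(1, 2)     # (B, num_visual_tokens, C) for the LLM
```

With this layout, a 1024×1024 input yields a 32×32 grid, i.e. 1,024 visual tokens, versus 4,096 tokens if the encoder stopped at a factor-16 output, consistent with the 4× reduction described above.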
When compared with ConvLLaVA using the same LLM and similar training data, FastVLM achieves 8.4% better performance on TextVQA and a 12.5% improvement on DocVQA while running 22% faster. The advantage grows at higher resolutions, where FastVLM maintains 2× faster processing than ConvLLaVA across various benchmarks. Using intermediate pretraining with 15M samples for resolution scaling, FastVLM matches or surpasses MM1 performance across a range of benchmarks while generating 5× fewer visual tokens. Moreover, FastVLM not only outperforms Cambrian-1 but also runs 7.9× faster. With scaled instruction tuning, it delivers better results while using 2.3× fewer visual tokens.
In conclusion, the researchers introduced FastVLM, an advancement in VLMs that uses the FastViTHD vision backbone for efficient high-resolution image encoding. The hybrid architecture, pretrained on reinforced image-text data, reduces visual token output with minimal sacrifice in accuracy compared to existing approaches. FastVLM achieves competitive performance across VLM benchmarks while delivering notable efficiency gains in both TTFT and vision backbone parameter count. Rigorous benchmarking on M1 MacBook Pro hardware shows that FastVLM offers a state-of-the-art resolution-latency-accuracy trade-off compared to existing methods.
Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter, and don't forget to join our 100k+ ML SubReddit and subscribe to our Newsletter.
Sajjad Ansari is a final-year undergraduate at IIT Kharagpur. As a technology enthusiast, he delves into the practical applications of AI with a focus on understanding the impact of AI technologies and their real-world implications. He aims to articulate complex AI concepts in a clear and accessible way.