loved the article
clarifies a lot of my questions, thanks
Glad it was helpful!
great write-up
thank you!
Well written and to the point.
Thank you!
As always, excellent.
Thank you!
This is a really thorough breakdown of vLLMs and their integration with visual data. Your explanation of cross-modality attention and the challenges of efficiently merging image and text embeddings is spot on.
Do you think reinforcement fine-tuning could reduce the need for massive multi-modal datasets when aligning visual and textual modalities, especially for specialized enterprise use cases? I think it could and does, but I'm curious to hear your opinion.
Also, with models like LLaMA-3.2 Vision becoming more sophisticated, there seems to be a growing need for optimization techniques that balance speed and accuracy. Would love to hear your thoughts on where fine-tuning fits into the vLLM roadmap.
Hi! There is nothing unique to vLLMs about this question. Post-training is generally the same for these models; e.g., the LLaMA-3.2 vision models are post-trained basically identically to the LLaMA-3.1 text models. So, we can apply all the same findings from the textual modality (e.g., reinforcement fine-tuning) to vision models as well. The main difference is the underlying data (i.e., the data will have images as input instead of just text).
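To make the "only the data differs" point concrete, here is a minimal, hypothetical sketch in Python. The field names, file path, and record layout below are made up for illustration and are not tied to any specific library's data format; the idea is just that a vision post-training example follows the same schema as a text one, with images added to the prompt.

# Hypothetical toy records sketching the point above: the post-training recipe
# (e.g., preference or reinforcement fine-tuning) stays the same across
# modalities; only the data changes by carrying image inputs in the prompt.
text_example = {
    "prompt": [{"type": "text", "text": "Summarize this contract clause: ..."}],
    "chosen": "The clause caps liability at ...",
    "rejected": "Sorry, I can't help with that.",
}
vision_example = {
    # identical schema, except the prompt now includes an image alongside the text
    "prompt": [
        {"type": "image", "path": "assembly_line_frame.png"},
        {"type": "text", "text": "Which component in this frame is misaligned?"},
    ],
    "chosen": "The second bracket from the left is rotated roughly 15 degrees.",
    "rejected": "Everything looks correctly aligned.",
}
for name, example in [("text", text_example), ("vision", vision_example)]:
    has_image = any(part["type"] == "image" for part in example["prompt"])
    print(f"{name}: image input = {has_image}, fields = {sorted(example)}")

Everything downstream of the data (reward modeling, preference optimization, etc.) can then treat both kinds of records the same way.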
Thanks! Really detailed and interesting stuff. I think it is interesting to explore the ways these VLMs process information and fuse modalities. I recently worked on such a project; pasting it here in case you find it interesting: https://vision-of-vlm.github.io/
Thanks for sharing!