11 Comments
Anwesha Chowdhury:

Loved the article. It clarifies a lot of my questions, thanks!

Cameron R. Wolfe, Ph.D.:

Glad it was helpful!

Paul:

Great write-up, thank you!

ROHITH VENKATA REDDY:

Well written and to the point.

Cameron R. Wolfe, Ph.D.:

Thank you!

Michael:

As always, excellent.

Cameron R. Wolfe, Ph.D.:

Thank you!

Dev Rishi:

This is a thorough breakdown of VLMs and how they integrate visual data. Your explanation of cross-modality attention and the challenges of efficiently merging image and text embeddings is spot on.

Do you think reinforcement fine-tuning could reduce the need for massive multi-modal datasets when aligning visual and textual modalities, especially for specialized enterprise use cases? I think it could, but I'm curious to hear your opinion.

Also, with models like LLaMA-3.2 Vision becoming more sophisticated, there seems to be a growing need for optimization techniques that balance speed and accuracy. I'd love to hear your thoughts on where fine-tuning fits into the VLM roadmap.

Cameron R. Wolfe, Ph.D.:

Hi! There is nothing unique to VLMs in this question. Post-training is generally the same for these models; e.g., the LLaMA-3.2 vision models are post-trained essentially identically to the LLaMA-3.1 text models. So, we can apply all of the same findings from the textual modality (e.g., reinforcement fine-tuning) to vision models as well. The main difference is the underlying data (i.e., the data will have images as input instead of just text).
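The point above can be made concrete with a small sketch. The schema below is hypothetical (not from any particular library): a preference-tuning example for a vision model looks exactly like one for a text model, except the record may carry an image alongside the prompt.

```python
# Hypothetical data schema: post-training (e.g., preference/RL fine-tuning)
# uses the same record structure for text and vision models; the only
# difference is an optional image attached to the input.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PreferenceExample:
    prompt: str
    chosen: str                     # preferred model response
    rejected: str                   # dispreferred model response
    image: Optional[bytes] = None   # present only for multi-modal data

def is_multimodal(example: PreferenceExample) -> bool:
    """A vision example simply carries image bytes; the pipeline is unchanged."""
    return example.image is not None

# A text-only record and a vision record flow through the same pipeline.
text_ex = PreferenceExample("Summarize this passage.", "Good summary.", "Bad summary.")
vision_ex = PreferenceExample("Describe the image.", "A cat on a couch.", "A dog.",
                              image=b"\x89PNG placeholder")
```

The same preference-optimization loop can consume both records; only the model's input encoding differs when `image` is present.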

Omri:

Thanks! Really elaborate and interesting stuff. I think it is interesting to explore the ways these VLMs process information and fuse modalities. I recently worked on such a project; pasting it here in case you find it interesting: https://vision-of-vlm.github.io/

Cameron R. Wolfe, Ph.D.:

Thanks for sharing!
