loved the article
clarifies a lot of my questions, thanks
Glad it was helpful!
great write-up
thank you!
Well written and to the point.
Thank you!
As always, excellent.
Thank you!
This is a really thorough breakdown of vLLMs and their integration with visual data. Your explanation of cross-modality attention and the challenges of efficiently merging image and text embeddings is spot on.
Do you think reinforcement fine-tuning could reduce the need for massive multi-modal datasets when aligning visual and textual modalities, especially for specialized enterprise use cases? I think it could and does, but I'm curious to hear your opinion.
Also, with models like LLaMA-3.2 Vision becoming more sophisticated, there seems to be a growing need for optimization techniques that balance speed and accuracy. Would love to hear your thoughts on where fine-tuning fits into the vLLM roadmap.
Hi! There is nothing unique to vLLMs about this question. Post-training is generally the same for these models; e.g., the LLaMA-3.2 vision models are post-trained basically identically to the LLaMA-3.1 text models. So, we can apply all the same findings from the textual modality (e.g., reinforcement fine-tuning) to vision models as well. The main difference is the underlying data (i.e., the data will have images as input instead of just text).
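To make the "only the data differs" point concrete, here is a minimal, hypothetical sketch in Python. The field names, file path, and record layout below are made up for illustration and are not tied to any specific library's data format; the idea is just that a vision post-training example follows the same schema as a text one, with images added to the prompt.

# Hypothetical toy records sketching the point above: the post-training recipe
# (e.g., preference or reinforcement fine-tuning) stays the same across
# modalities; only the data changes by carrying image inputs in the prompt.
text_example = {
    "prompt": [{"type": "text", "text": "Summarize this contract clause: ..."}],
    "chosen": "The clause caps liability at ...",
    "rejected": "Sorry, I can't help with that.",
}
vision_example = {
    # identical schema, except the prompt now includes an image alongside the text
    "prompt": [
        {"type": "image", "path": "assembly_line_frame.png"},
        {"type": "text", "text": "Which component in this frame is misaligned?"},
    ],
    "chosen": "The second bracket from the left is rotated roughly 15 degrees.",
    "rejected": "Everything looks correctly aligned.",
}
for name, example in [("text", text_example), ("vision", vision_example)]:
    has_image = any(part["type"] == "image" for part in example["prompt"])
    print(f"{name}: image input = {has_image}, fields = {sorted(example)}")

Everything downstream of the data (reward modeling, preference optimization, etc.) can then treat both kinds of records the same way.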
Thanks! Really detailed and interesting stuff. I think it is interesting to explore the ways these VLMs process information and fuse modalities. I recently worked on such a project; pasting it here in case you find it interesting: https://vision-of-vlm.github.io/
Thanks for sharing!