Llava-1.5-7B

Task: Image Captioning
Precision: W4A16

LLaVA-1.5-7B is a multimodal large language model that integrates visual understanding and natural language generation. Built upon the Vicuna-7B language model and the CLIP visual encoder, it can process both image and text inputs to support tasks such as visual question answering, image captioning, and multimodal reasoning. It is widely used in applications like intelligent assistants, education, and image-based analysis. Compared to earlier versions, LLaVA-1.5 offers significantly improved vision-language alignment and response accuracy, making it a key advancement in the field of multimodal AI.

Source model repository: Llava-1.5-7B

Performance Reference

Device | Backend | Precision | TTFT | Prefill | Decode | Context Size | File Size
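In this table, TTFT is conventionally the time to first token (covering the prefill pass over the image and prompt tokens), while Prefill and Decode typically report throughput for the two generation phases. The sketch below illustrates one way such numbers could be approximated for a Hugging Face checkpoint; it is not the benchmarking harness behind this card, and rough_latency_benchmark, model, and inputs are hypothetical names standing in for a loaded checkpoint and a pre-processed image+text batch (see the inference sketch under Model Inference).

```python
import time
import torch


def rough_latency_benchmark(model, inputs, max_new_tokens=64):
    """Rough TTFT / decode-rate estimate for a Hugging Face generate() call.

    Illustrative sketch only; not the harness used to produce this card's
    numbers. `model` and `inputs` are assumed to be a loaded LLaVA checkpoint
    and pre-processed image+text inputs.
    """
    def sync():
        # Make GPU timings meaningful; no-op on CPU.
        if torch.cuda.is_available():
            torch.cuda.synchronize()

    # TTFT proxy: prefill the prompt and emit exactly one new token.
    sync()
    t0 = time.perf_counter()
    model.generate(**inputs, max_new_tokens=1, do_sample=False)
    sync()
    ttft_s = time.perf_counter() - t0

    # Full run: prefill plus (max_new_tokens - 1) decode steps.
    sync()
    t0 = time.perf_counter()
    model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    sync()
    total_s = time.perf_counter() - t0

    decode_tok_per_s = (max_new_tokens - 1) / max(total_s - ttft_s, 1e-9)
    return ttft_s, decode_tok_per_s
```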
Model Details

Model type: LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture.

Model date: LLaVA-v1.5-7B was trained in September 2023.

Paper or resources for more information: https://llava-vl.github.io/

Source Model Evaluation

No data available

Model Inference

To be released
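Until the official inference flow for the W4A16 deployable model is published, the source checkpoint can be tried through the Hugging Face transformers port. The following is a minimal sketch assuming the community llava-hf/llava-1.5-7b-hf weights and the standard LLaVA-1.5 prompt template; it runs the FP16 source model for captioning/VQA and is not the deployment path of this card.

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

# Assumption: the community llava-hf/llava-1.5-7b-hf conversion of the
# source checkpoint, not the W4A16 deployable model described above.
model_id = "llava-hf/llava-1.5-7b-hf"

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,
    device_map="auto",
)

# Any RGB image works; this URL is just an example input.
image = Image.open(requests.get(
    "https://llava-vl.github.io/static/images/view.jpg", stream=True).raw)

# LLaVA-1.5 conversation template: the <image> token marks where the
# vision features are spliced into the prompt.
prompt = "USER: <image>\nDescribe this image in one sentence. ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16)

output_ids = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```

The decoded output contains the full conversation (prompt plus the generated answer after "ASSISTANT:").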

License
Source Model: LLAMA2
Deployable Model: LLAMA2