
LLaVA-1.5-7B is a multimodal large language model that integrates visual understanding and natural language generation. Built upon the Vicuna-7B language model and the CLIP visual encoder, it can process both image and text inputs to support tasks such as visual question answering, image captioning, and multimodal reasoning. It is widely used in applications like intelligent assistants, education, and image-based analysis. Compared to earlier versions, LLaVA-1.5 offers significantly improved vision-language alignment and response accuracy, making it a key advancement in the field of multimodal AI.
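As a concrete illustration of the image-plus-text workflow described above, the sketch below shows one way to run visual question answering with the model. It assumes the community `llava-hf/llava-1.5-7b-hf` checkpoint and the `transformers` LlavaForConditionalGeneration / AutoProcessor APIs; the model ID, prompt template, and image path are illustrative and may need adjusting for your setup.

```python
# Minimal VQA sketch, assuming the Hugging Face `transformers` port of LLaVA-1.5-7B.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"  # assumed community checkpoint name
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

# One image plus a text question, using the LLaVA-1.5 chat prompt format.
image = Image.open("example.jpg")  # path to a local image
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(
    model.device, torch.float16
)
output_ids = model.generate(**inputs, max_new_tokens=100)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```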
Source model repository: LLaVA-1.5-7B
Model type:
LLaVA is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model, based on the transformer architecture.
Model date:
LLaVA-v1.5-7B was trained in September 2023.
Paper or resources for more information:
To be released