This project uses LLaVA (Large Language-and-Vision Assistant), an end-to-end trained large multimodal model that connects a vision encoder to an LLM for general-purpose visual and language understanding.
LLaVA generates a description of the image, and that description is then fed to Llama 3 to generate the image's caption.
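This two-stage pipeline can be sketched roughly as follows using Ollama's REST generate endpoint; the prompts and function names here are illustrative and not taken from this repo's code:

```python
# Sketch of the two-stage pipeline: LLaVA describes the image,
# then Llama 3 turns that description into a short caption.
# Endpoint and field names follow Ollama's REST API; the prompts and
# function names are illustrative, not taken from this repo.
import base64
import requests

OLLAMA_URL = "http://localhost:11434/api/generate"  # default Ollama port

def describe_image(image_path: str) -> str:
    """Ask LLaVA for a detailed description of the image."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    resp = requests.post(OLLAMA_URL, json={
        "model": "llava",
        "prompt": "Describe this image in detail.",
        "images": [image_b64],
        "stream": False,
    })
    resp.raise_for_status()
    return resp.json()["response"]

def caption_from_description(description: str) -> str:
    """Ask Llama 3 to compress the description into a caption."""
    resp = requests.post(OLLAMA_URL, json={
        "model": "llama3",
        "prompt": f"Write a short, catchy caption for an image described as:\n{description}",
        "stream": False,
    })
    resp.raise_for_status()
    return resp.json()["response"]

if __name__ == "__main__":
    desc = describe_image("example.jpg")  # example path, replace with your image
    print(caption_from_description(desc))
```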
- Clone the repo:
  git clone <URL>
- Create and activate a virtual environment:
  python3 -m venv cenv
  source cenv/bin/activate
- Install the requirements:
  pip install -r requirements.txt
- Download the LLMs with the following commands:
  ollama pull llama3
  ollama pull llava
- Start the local Ollama server:
  ollama serve
- Run the backend server (a sketch of a possible main.py is included after these steps):
  uvicorn main:app --reload
- Run the Streamlit app (a sketch of a possible app.py is included after these steps):
  streamlit run app.py
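Before starting the backend, you can optionally check that the Ollama server is reachable; this snippet assumes Ollama's default port 11434:

```python
# Quick health check for the local Ollama server (default port 11434).
import requests

try:
    r = requests.get("http://localhost:11434", timeout=5)
    print(r.text)  # should print "Ollama is running" when the server is up
except requests.ConnectionError:
    print("Ollama server is not reachable; start it with `ollama serve`.")
```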
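For reference, here is a minimal sketch of what the backend's main.py might look like; the actual routes, prompts, and structure in this repo may differ, and the /caption endpoint name and "image" field are assumptions:

```python
# Hypothetical sketch of a /caption endpoint: LLaVA describes the uploaded
# image, then Llama 3 writes the caption. Requires python-multipart for
# file uploads. Not the repo's actual main.py.
import base64

import requests
from fastapi import FastAPI, File, UploadFile

app = FastAPI()
OLLAMA_URL = "http://localhost:11434/api/generate"

def ollama_generate(payload: dict) -> str:
    """Call Ollama's generate endpoint and return the text response."""
    resp = requests.post(OLLAMA_URL, json={**payload, "stream": False})
    resp.raise_for_status()
    return resp.json()["response"]

@app.post("/caption")
async def caption(image: UploadFile = File(...)):
    # Stage 1: LLaVA describes the uploaded image.
    image_b64 = base64.b64encode(await image.read()).decode()
    description = ollama_generate({
        "model": "llava",
        "prompt": "Describe this image in detail.",
        "images": [image_b64],
    })
    # Stage 2: Llama 3 turns the description into a caption.
    caption_text = ollama_generate({
        "model": "llama3",
        "prompt": f"Write a short caption for an image described as:\n{description}",
    })
    return {"description": description, "caption": caption_text}
```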
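Similarly, a minimal sketch of what app.py might look like, assuming the hypothetical /caption endpoint above and uvicorn's default address of http://localhost:8000:

```python
# Hypothetical sketch of a Streamlit frontend that uploads an image to the
# backend's /caption endpoint. Not the repo's actual app.py.
import requests
import streamlit as st

BACKEND_URL = "http://localhost:8000/caption"  # uvicorn's default host/port

st.title("Image Caption Generator")

uploaded = st.file_uploader("Upload an image", type=["jpg", "jpeg", "png"])
if uploaded is not None:
    st.image(uploaded)
    if st.button("Generate caption"):
        with st.spinner("Asking LLaVA and Llama 3..."):
            # Field name "image" must match the backend's parameter name.
            files = {"image": (uploaded.name, uploaded.getvalue(), uploaded.type)}
            resp = requests.post(BACKEND_URL, files=files)
            resp.raise_for_status()
        st.write(resp.json()["caption"])
```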