AI / Python

Multimodal video understanding with Whisper + LLaVA

A Python tool that extracts audio with ffmpeg, transcribes it with OpenAI Whisper (CUDA-accelerated), samples video frames and analyzes them with LLaVA 13B via Ollama, then combines the transcript and the frame descriptions into a single video summary. All local, all free.

Tech Stack
Python, OpenAI Whisper, Ollama, LLaVA 13B, ffmpeg, CUDA, Pillow
Key Features

01

Audio extraction and transcription via Whisper with CUDA GPU acceleration

02

Frame sampling at a configurable FPS (default 0.5, i.e. one frame every two seconds) with 512px downscaling

03

Visual analysis of sampled frames via LLaVA 13B through Ollama

04

Combined audio + visual summary generation

05

Configurable Whisper model size (tiny through large)

06

Runs entirely locally, with no API keys or cloud services needed
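
The configurable options listed above (sampling rate, downscale size, Whisper model size) could be captured roughly as follows. This is a minimal sketch, not the tool's actual code: `PipelineConfig`, `audio_extract_cmd`, and `frame_sample_cmd` are hypothetical names, the `base` default for Whisper and the 16 kHz mono WAV output are assumptions, and the ffmpeg flags shown are just one common way to perform each step.

```python
from dataclasses import dataclass

# Model sizes named in the feature list ("tiny through large").
WHISPER_SIZES = ("tiny", "base", "small", "medium", "large")


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical settings object; field names are illustrative."""
    fps: float = 0.5              # frames sampled per second of video
    max_side: int = 512           # downscale target in pixels
    whisper_model: str = "base"   # assumed default, not stated in the source
    vision_model: str = "llava:13b"  # Ollama tag for LLaVA 13B

    def __post_init__(self):
        if self.fps <= 0:
            raise ValueError("fps must be positive")
        if self.whisper_model not in WHISPER_SIZES:
            raise ValueError(f"unknown Whisper model size: {self.whisper_model}")

    @property
    def seconds_between_frames(self) -> float:
        return 1.0 / self.fps


def audio_extract_cmd(video, wav):
    """ffmpeg command that drops the video stream and writes mono 16 kHz audio."""
    return ["ffmpeg", "-y", "-i", str(video),
            "-vn", "-ac", "1", "-ar", "16000", str(wav)]


def frame_sample_cmd(video, out_dir, fps):
    """ffmpeg command that writes numbered JPEG frames at the given rate."""
    return ["ffmpeg", "-y", "-i", str(video),
            "-vf", f"fps={fps}", f"{out_dir}/frame_%05d.jpg"]
```

Each command list can be passed to `subprocess.run(cmd, check=True)`. With the default of 0.5 FPS, `PipelineConfig().seconds_between_frames` is 2.0, so a ten-minute video yields roughly 300 frames for the vision model.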

Architecture

Pipeline: ffmpeg extracts the audio track -> Whisper transcribes it with GPU acceleration -> ffmpeg samples frames at the target FPS -> frames are resized via Pillow -> each frame is sent to LLaVA 13B through Ollama for a description -> the transcript and all frame descriptions are merged into the final analysis.
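
The later stages of this pipeline (transcription, resizing, per-frame description, merging) might look something like the sketch below. All function names are hypothetical, and the prompt text is a placeholder. Merging by interleaving both streams on a shared timeline is an assumption, since the source does not say how the final analysis is assembled. Frame timestamps are approximated as `index / fps`. The Ollama call uses the local `/api/generate` endpoint on its default port, and `whisper` and `PIL` are imported lazily so the pure helpers work without them installed.

```python
import base64
import io
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def transcribe(wav_path, model_size="base"):
    """Return Whisper segments as (start_seconds, text) pairs."""
    import whisper  # openai-whisper package

    model = whisper.load_model(model_size, device="cuda")
    result = model.transcribe(str(wav_path))
    return [(seg["start"], seg["text"].strip()) for seg in result["segments"]]


def encode_frame(frame_path, max_side=512):
    """Downscale a frame to fit within max_side and return it as base64 JPEG."""
    from PIL import Image

    with Image.open(frame_path) as img:
        img = img.convert("RGB")
        img.thumbnail((max_side, max_side))  # in place, preserves aspect ratio
        buf = io.BytesIO()
        img.save(buf, format="JPEG")
    return base64.b64encode(buf.getvalue()).decode("ascii")


def build_llava_request(image_b64, prompt="Describe this video frame.",
                        model="llava:13b"):
    """Build the JSON body for a non-streaming Ollama generate call."""
    return {"model": model, "prompt": prompt,
            "images": [image_b64], "stream": False}


def describe_frame(image_b64):
    """Send one frame to the local Ollama server and return its description."""
    body = json.dumps(build_llava_request(image_b64)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


def merge_streams(segments, frame_descriptions, fps=0.5):
    """Interleave transcript segments and frame descriptions by time."""
    events = [(start, "AUDIO", text) for start, text in segments]
    events += [(i / fps, "VISUAL", desc)
               for i, desc in enumerate(frame_descriptions)]
    events.sort(key=lambda e: e[0])  # stable: audio precedes visual on ties
    return "\n".join(f"[{t:.1f}s] {kind}: {text}" for t, kind, text in events)
```

The merged timeline could be returned as-is or passed to a language model as context for a written summary. Keeping `build_llava_request` and `merge_streams` free of I/O makes them easy to test without a GPU or a running Ollama server.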
