Multimodal video understanding with Whisper + LLaVA
A Python tool that extracts audio with ffmpeg, transcribes it with OpenAI Whisper (CUDA-accelerated), samples video frames and describes them with LLaVA 13B via Ollama, then combines both streams into a single video summary. Everything runs locally and free of charge.
Audio extraction and transcription via Whisper with CUDA GPU acceleration
Frame sampling at configurable FPS (default 0.5) with 512px downscaling
Visual analysis of sampled frames via LLaVA 13B through Ollama
Combined audio + visual summary generation
Configurable Whisper model size (tiny through large)
Runs entirely locally: no API keys or cloud services needed
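As a rough illustration of the audio extraction and frame sampling features, the sketch below builds the ffmpeg argument lists and computes a 512px downscale target. The function names, the output file pattern, the 16 kHz mono WAV format, and the reading of "512px" as the longest side are all assumptions made for this example, not details taken from the tool itself.

```python
def audio_extract_cmd(video: str, wav_out: str) -> list[str]:
    """ffmpeg arguments that drop the video stream and write mono 16 kHz PCM.

    The 16 kHz mono format is an assumption; Whisper resamples to 16 kHz anyway.
    """
    return [
        "ffmpeg", "-y",           # overwrite the output file without prompting
        "-i", video,
        "-vn",                    # ignore the video stream
        "-ac", "1",               # one audio channel (mono)
        "-ar", "16000",           # 16 kHz sample rate
        "-acodec", "pcm_s16le",   # 16-bit PCM WAV
        wav_out,
    ]


def frame_sample_cmd(video: str, out_dir: str, fps: float = 0.5) -> list[str]:
    """ffmpeg arguments that write one JPEG every 1/fps seconds.

    The frame_%05d.jpg naming pattern is hypothetical.
    """
    pattern = f"{out_dir}/frame_%05d.jpg"
    return ["ffmpeg", "-y", "-i", video, "-vf", f"fps={fps}", pattern]


def downscaled_size(width: int, height: int, max_side: int = 512) -> tuple[int, int]:
    """Target size whose longest side is at most max_side, keeping aspect ratio.

    Frames that are already small enough are left unchanged.
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

Each command list can be passed to subprocess.run(cmd, check=True), and the tuple from downscaled_size can be given to Pillow's Image.resize. At the default 0.5 FPS, a 60-second video yields about 30 frames.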
Pipeline: ffmpeg extracts audio track -> Whisper transcribes with GPU acceleration -> ffmpeg samples frames at target FPS -> frames resized via Pillow -> each frame sent to Ollama LLaVA 13B for description -> all transcripts and frame descriptions merged into final analysis.
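The last two pipeline stages could look roughly like the sketch below, which builds the JSON body for Ollama's /api/generate endpoint and interleaves transcript segments with frame descriptions by time. The function names, the prompt text, and the timeline layout are hypothetical. The sketch also assumes that segments follow the shape of Whisper's transcribe() output (dicts with "start" and "text") and that frame i falls at approximately i / fps seconds.

```python
import base64


def llava_request(image_bytes: bytes,
                  prompt: str = "Describe this video frame.",
                  model: str = "llava:13b") -> dict:
    """JSON body for a POST to Ollama's /api/generate endpoint.

    Ollama expects images as base64 strings; the prompt wording is an assumption.
    """
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for a single JSON response instead of a stream
    }


def merge_timeline(segments: list[dict], frame_descriptions: list[str],
                   fps: float = 0.5) -> str:
    """Interleave transcript segments and frame descriptions in time order."""
    events = []
    for seg in segments:
        events.append((float(seg["start"]), "AUDIO", seg["text"].strip()))
    for i, desc in enumerate(frame_descriptions):
        # Approximate timestamp of the i-th sampled frame (assumption).
        events.append((i / fps, "VISUAL", desc.strip()))
    # sorted() is stable, so audio precedes visual when timestamps tie.
    events = sorted(events, key=lambda e: e[0])
    return "\n".join(
        f"[{int(t) // 60:02d}:{int(t) % 60:02d}] {kind}: {text}"
        for t, kind, text in events
    )
```

The merged timeline can be used directly as the final analysis, or passed to a language model as context for a written summary. Which of the two the tool does is not stated here.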