Dev Portfolio | SproulTech

AI / Python

Multimodal video understanding with Whisper + LLaVA

A Python tool that extracts audio with ffmpeg, transcribes it with OpenAI Whisper (CUDA-accelerated), samples video frames and analyzes them with LLaVA 13B via Ollama, then combines both streams into a single video summary. Everything runs locally and at no cost.

Tech Stack

Python, OpenAI Whisper, Ollama, LLaVA 13B, ffmpeg, CUDA, Pillow

Key Features

Audio extraction and transcription via Whisper with CUDA GPU acceleration

Frame sampling at a configurable FPS (default 0.5, i.e. one frame every 2 seconds) with 512px downscaling

Visual analysis of sampled frames via LLaVA 13B through Ollama

Combined audio + visual summary generation

Configurable Whisper model size (tiny through large)

Runs entirely locally, with no API keys or cloud services needed
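
The extraction, sampling, and downscaling features above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the function names, the `frame_%05d.png` naming pattern, the 16 kHz mono WAV settings, and the reading of "512px" as a cap on the longest side are all assumptions.

```python
import subprocess
from pathlib import Path


def audio_extract_cmd(video: str, wav_out: str) -> list[str]:
    """ffmpeg command that drops the video stream and writes 16 kHz mono audio."""
    # Assumption: 16 kHz mono is used because that is what Whisper works on internally.
    return ["ffmpeg", "-y", "-i", video, "-vn", "-ac", "1", "-ar", "16000", wav_out]


def frame_sample_cmd(video: str, out_dir: str, fps: float = 0.5) -> list[str]:
    """ffmpeg command that writes `fps` frames per second of video as PNG files."""
    pattern = str(Path(out_dir) / "frame_%05d.png")  # hypothetical naming scheme
    return ["ffmpeg", "-y", "-i", video, "-vf", f"fps={fps}", pattern]


def fit_within(width: int, height: int, max_side: int = 512) -> tuple[int, int]:
    """Size whose longest side is at most max_side, keeping the aspect ratio."""
    scale = min(1.0, max_side / max(width, height))  # never scale up
    return max(1, round(width * scale)), max(1, round(height * scale))


def downscale(src: str, dst: str, max_side: int = 512) -> None:
    """Write a downscaled copy of an image using Pillow."""
    from PIL import Image  # imported lazily so the helpers above work without Pillow

    with Image.open(src) as img:
        img.thumbnail((max_side, max_side))  # shrinks in place, keeps aspect ratio
        img.save(dst)


def run(cmd: list[str]) -> None:
    """Run an ffmpeg command and raise CalledProcessError if it fails."""
    subprocess.run(cmd, check=True)
```

Building the commands as lists and running them with `subprocess.run(cmd, check=True)` avoids shell quoting problems with file names that contain spaces. At the default 0.5 FPS, a 10-minute video yields about 300 frames.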

Architecture

Pipeline: ffmpeg extracts the audio track -> Whisper transcribes it with GPU acceleration -> ffmpeg samples frames at the target FPS -> frames are resized via Pillow -> each frame is sent to LLaVA 13B through Ollama for a description -> the transcript and all frame descriptions are merged into the final analysis.
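
The last two pipeline stages might look something like the sketch below. It assumes Ollama is listening on its default local endpoint with the `llava:13b` model pulled, that transcript segments are dicts with `start` and `text` keys (the shape Whisper's `transcribe` returns under `"segments"`), and that frame `i` falls at roughly `i / fps` seconds. The function names, the prompt text, and the chronological-merge strategy are illustrative assumptions; the project may combine the two streams differently.

```python
import base64
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def llava_payload(image_bytes: bytes, prompt: str, model: str = "llava:13b") -> dict:
    """Request body for Ollama's generate endpoint with one base64-encoded image."""
    return {
        "model": model,
        "prompt": prompt,
        "images": [base64.b64encode(image_bytes).decode("ascii")],
        "stream": False,  # ask for a single JSON reply instead of a stream
    }


def describe_frame(image_path: str, prompt: str = "Describe this video frame.") -> str:
    """Send one frame to the local Ollama server and return the model's text."""
    with open(image_path, "rb") as f:
        payload = llava_payload(f.read(), prompt)
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


def frame_time(index: int, fps: float = 0.5) -> float:
    """Approximate timestamp in seconds of the index-th sampled frame (0-based)."""
    return index / fps


def merge_timeline(segments: list[dict], frame_descriptions: list[str],
                   fps: float = 0.5) -> str:
    """Interleave transcript segments and frame descriptions in time order."""
    events = [(seg["start"], "AUDIO", seg["text"].strip()) for seg in segments]
    events += [(frame_time(i, fps), "VIDEO", desc.strip())
               for i, desc in enumerate(frame_descriptions)]
    events.sort(key=lambda e: e[0])  # stable sort: audio stays first on ties
    return "\n".join(f"[{t:.1f}s] {kind}: {text}" for t, kind, text in events)
```

A merged timeline like this can be used directly as the final report, or passed to a text model as context for a shorter summary.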
