"Where the video goes: stays on your machine" - No, the frames (that this tool extracts) obviously get sent to Anthropic if you use Claude.

I was just thinking about this exact use case yesterday:

It's for measuring charging speeds at different starting battery levels and different temperatures. I was thinking: what if I just had a video camera pointed at the voltage going in and out, so I could see the battery percentage increase, and a temperature gun pointed at the phone as well, so I'd know the phone's temperature too? Then it could just figure it all out and create charts.

This would make reviewing different charging equipment really easy, since all you really have to do is plug it in, take a video of it, and feed it to the system, and you can tell other people to do the same thing.

I might very well give this a try!

Cool idea, but keyframes are not videos. Motion, object permanence, are not things Claude can infer from a set of images. Nice demo though!

Nice, @OP. I put together something similar as well. Incidentally, I found that for motion design specifically, an LLM can't infer specific animations from frames as well as it can when you simply describe, plainly and accurately, what is happening and the timing.

One thing that worked decently was to take the frames, put them into a grid, and have the agent look at a single image of all the frames together. It did surprisingly well but missed a lot of subtle details that it couldn’t see.

I also tried various kinds of vision embeddings, motion heat maps, and blur to show motion, but none of them worked as well, so I ended up just describing the animation until the model got it. I haven’t quite found the right solution yet.

I think this is much more useful than just LLM related applications. I'd suggest renaming it to not make it seem like it's LLM related.

How do you handle things like scrolling quickly in a video?

this is really clever, props

Hi HN! I built this because I was frustrated that no LLM actually "sees" a video — Claude won't accept video files, ChatGPT reads the transcript only, and Gemini samples at a fixed 1fps (missing fast cuts, over-sampling static slides).

claude-real-video takes a URL or local file and:

1. Extracts frames at every scene change (not fixed intervals) + a density floor
2. Deduplicates with a sliding-window pixel-diff algorithm (so A-B-A interview cutaways don't re-send the same shot)
3. Transcribes audio (prefers embedded subtitles, falls back to Whisper)
4. Optionally keeps the full soundtrack for audio-capable models
5. Writes a clean MANIFEST.txt you can drop into any LLM chat

A 10-min presentation goes from ~600 fixed-interval frames to 5-15 meaningful keyframes. 90%+ token savings with better comprehension.
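As a rough check on those numbers, here is the arithmetic for the frame count alone. It assumes a 1 fps fixed-interval baseline and ignores transcript tokens and per-image token cost, so it shows frame reduction rather than an exact token figure; `frame_reduction` is a hypothetical helper, not part of the tool.

```python
def frame_reduction(baseline_frames: int, kept_frames: int) -> float:
    """Fraction of frames saved relative to the fixed-interval baseline."""
    return 1 - kept_frames / baseline_frames


duration_s = 10 * 60                # a 10-minute presentation
baseline = int(duration_s * 1.0)    # 1 frame per second -> 600 frames

for kept in (5, 15):
    print(f"{kept} keyframes: {frame_reduction(baseline, kept):.1%} fewer frames")
```

Even at the upper end of 15 keyframes, the frame count drops by well over 90%.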

The dedup approach (v0.2.0) uses real pixel difference on 16x16 RGB thumbnails against a sliding window of the last N kept frames — inspired by videostil's pixelmatch, but simpler and self-contained.
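A minimal sketch of that idea in pure Python, not the tool's actual code. Thumbnails are assumed to be already decoded into 256 RGB tuples (16x16); the per-pixel `tolerance` is a made-up parameter, while the 8% threshold and window of 4 mirror the documented defaults.

```python
from collections import deque

Thumb = list[tuple[int, int, int]]  # 16x16 thumbnail as 256 RGB pixels


def diff_percent(a: Thumb, b: Thumb, tolerance: int = 24) -> float:
    """Percentage of pixels whose largest channel difference exceeds tolerance."""
    changed = sum(
        1
        for (r1, g1, b1), (r2, g2, b2) in zip(a, b)
        if max(abs(r1 - r2), abs(g1 - g2), abs(b1 - b2)) > tolerance
    )
    return 100.0 * changed / len(a)


def dedup(thumbs: list[Thumb], threshold: float = 8.0, window: int = 4) -> list[int]:
    """Return the indices of frames to keep."""
    recent: deque[Thumb] = deque(maxlen=window)  # last N kept frames
    kept = []
    for i, thumb in enumerate(thumbs):
        # Keep a frame only if it differs enough from every frame in the window.
        if all(diff_percent(thumb, old) >= threshold for old in recent):
            kept.append(i)
            recent.append(thumb)
    return kept


def flat(color: tuple[int, int, int]) -> Thumb:
    return [color] * 256


A, B = flat((200, 30, 30)), flat((30, 30, 200))
shots = [A, A, B, A, B]            # an A-B-A-B interview cutaway
print(dedup(shots))                # [0, 2]
print(dedup(shots, window=1))      # [0, 2, 3, 4]
```

With a window of 1 the comparison is consecutive-only, so every cut back to an earlier shot is kept again; the wider window is what suppresses the repeats.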

`--report` generates a self-contained HTML showing every keep/drop decision with diff percentages, so you can tune the threshold visually.

pip install claude-real-video && crv "https://youtube.com/watch?v=..." --report

MIT licensed, pure Python + ffmpeg. Happy to answer questions!

I gave Claude a video provided by a county attorney for a speeding ticket I got. It was spot on in its analysis, even though I don’t like what the video showed.

What does it mean that Claude can’t view video? It did it just fine. Or do you mean tool-less?

I think a more or less clunky name like 'llm video preprocessor' would be a better description? In any case, it seems like you came up with a good project idea. I wonder how long it will be until the SOTA models just have this kind of functionality built in.

Very cool I have something that does this as well along these lines. I’ll dig into yours over the next few days and contribute where and if I can too, awesome to see!

Yeah, I'm pretty sure Claude Code can handle videos. It's been doing frame-by-frame analysis for me on generated video to iterate on pipelines.

claude-real-video

Let Claude — or any LLM — actually watch a video.

Most AI tools don't really see a video. Paste a YouTube link into ChatGPT and it reads the transcript, not the picture. Claude won't take a video file at all. Even Gemini, which can read video natively, has to send it up to Google and samples frames at a fixed interval (1 fps by default), so fast cuts slip past.

claude-real-video does it differently, and locally: point it at a URL or a file, and it pulls the frames that actually matter (every scene change, not a fixed quota), throws away the near-duplicates, transcribes the audio, and hands you a clean folder any LLM can read — on your own machine, nothing uploaded.

crv "https://www.youtube.com/watch?v=..."
# → crv-out/frames/*.jpg  +  crv-out/transcript.txt  +  crv-out/MANIFEST.txt

Then drop the frames + MANIFEST.txt into Claude / ChatGPT / Gemini and ask away.

Why not just sample frames?

Most "let an LLM watch a video" scripts (and Gemini's own pipeline) grab frames at a fixed interval — e.g. one per second. That over-samples a static screencast and under-samples a fast-cut reel. claude-real-video is smarter:

	fixed-interval sampling	claude-real-video
Frame selection	every N seconds	scene-change detection + density floor
Repeated shots (A-B-A cuts)	sent again every time	sliding-window dedup sends each shot once
Static slide (10 min)	~600 near-identical frames	collapses to 1 (dedup)
Fast-cut reel	misses frames between samples	catches each visual change
Audio	often ignored	Whisper transcript w/ language detect
Where the video goes	often uploaded to a cloud	stays on your machine
Input	usually local file only	URL (yt-dlp) or local file

You feed the model fewer, more meaningful frames — cheaper context, better understanding.

Install

pip install claude-real-video              # core (frames + dedup)
pip install "claude-real-video[whisper]"   # + audio transcription

System requirement: ffmpeg

ffmpeg / ffprobe are used for frame extraction and audio, and aren't pip-installable. Install them once:

OS	command
macOS	`brew install ffmpeg`
Linux	`sudo apt install ffmpeg` (or your distro's package manager)
Windows	`winget install Gyan.FFmpeg` — or `choco install ffmpeg` — or download a build and add its `bin\` folder to your `PATH`

Verify it's on your PATH:

ffmpeg -version

Transcription uses the whisper CLI (installed by the [whisper] extra, or pip install openai-whisper). Whisper also relies on ffmpeg.

Works on macOS, Windows, and Linux — Python 3.10+.

Usage

# A YouTube / Instagram / TikTok / ... link
crv "https://www.instagram.com/reel/XXXX/"

# A local file, English transcript, output to ./out
crv lecture.mp4 -o out --lang en

# Frames only, no transcription
crv clip.mp4 --no-transcribe

# A login-gated video (your own / authorised use): pass a Netscape cookie file
crv "https://..." --cookies cookies.txt

python -m claude_real_video ... works as an alias for crv too.

Options

flag	default	meaning
`-o, --out`	`crv-out`	output directory
`--scene`	`0.30`	scene-change sensitivity (lower = more frames)
`--fps-floor`	`1.0`	at least one frame every N seconds
`--max-frames`	`150`	hard cap on total frames
`--lang`	`auto`	Whisper language (`en`, `zh`, `auto`, ...)
`--dedup-threshold`	`8`	% of pixels that must change for a frame to count as new; higher = fewer frames
`--dedup-window`	`4`	compare against the last N kept frames — a shot the model already saw doesn't come back after a cutaway (`1` = consecutive-only)
`--report`	off	keep dropped frames in `./dropped` + write `report.html` visualising every keep/drop decision
`--no-transcribe`	off	skip audio
`--keep-audio`	off	also save the full soundtrack (`audio.m4a`) so audio models can hear it
`--cookies`	–	Netscape cookie file for login-gated sources

Use it from Python

from claude_real_video import process

r = process("https://youtu.be/...", "out", lang="en")
print(r.frame_count, r.transcript_path)

How it works

Fetch — yt-dlp for URLs (optional cookies), or copy a local file.
Extract — one chronological ffmpeg select pass grabs every scene change plus a density floor (at least one frame every --fps-floor seconds), so fast cuts and slow screencasts are both covered.
Dedup — real pixel difference (downscaled RGB, not a perceptual hash — hashes go blind on flat colours and equal-luma hue changes) against a sliding window of the last --dedup-window kept frames, so an A-B-A cutaway doesn't re-send a shot the model has already seen. --report writes report.html showing every keep/drop decision with its diff %, for tuning.
Text — if the video already has subtitles (a sidecar .srt/.vtt next to a local file, or an embedded subtitle track), those are used as the transcript — faster and more accurate than re-transcribing. Only when there are no subtitles does it fall back to Whisper on the audio (skipped cleanly if there's no audio).
Audio (optional, --keep-audio) — save the full original soundtrack (audio.m4a: music + speech + effects, copied losslessly when possible). The transcript only has the words; the audio file lets a model that can listen (Gemini, GPT-4o, …) actually hear the music and tone.
Manifest — MANIFEST.txt summarises everything for the model.
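
The Extract step above can be approximated with ffmpeg's `select` filter. The sketch below only builds the command line and is an assumption about how such a pass might look, not the tool's actual invocation; `build_extract_cmd` and the output file pattern are hypothetical.

```python
import shlex


def build_extract_cmd(src: str, out_dir: str, scene: float = 0.30, floor_s: float = 1.0) -> list[str]:
    """Build (but do not run) an ffmpeg command: scene changes plus a density floor."""
    select = (
        f"select=gt(scene\\,{scene})"            # scene-change score above the threshold
        "+isnan(prev_selected_t)"                # always keep the very first frame
        f"+gte(t-prev_selected_t\\,{floor_s})"   # or floor_s seconds since the last kept frame
    )
    return [
        "ffmpeg", "-i", src,
        "-vf", select,
        "-fps_mode", "vfr",   # emit only selected frames; older ffmpeg builds use -vsync vfr
        f"{out_dir}/%05d.jpg",
    ]


print(shlex.join(build_extract_cmd("lecture.mp4", "crv-out/frames")))
```

The commas inside the expression are backslash-escaped because a bare comma separates filters in an ffmpeg filtergraph.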

So the model can see (key frames), read (transcript) and — with --keep-audio — hear (full soundtrack) the video. The transcript is plain text any model can read; the tool doesn't burn subtitles into the video — burning is a presentation choice, not something needed to make a video AI-readable.

Notes

Only download content you have the right to. The --cookies option is for your own, authorised access — don't ship credentials in a repo.
Re-running overwrites the output directory.

License

MIT