
How not to process large quantities of video

By Adrian Edwards

You might think of video as just a sequence of still images played back really fast, but there’s a lot that has to happen for it all to work as smoothly as it does today. So grab your nearest comfort object and buckle up, because I’m about to tell you more than you ever thought you wanted to know about video files.

Consider this post a kind of “Cursed Knowledge” compendium if you will. I will leave it up to you whether this is a helpful guide, or a list of reasons to run away from any task involving merging gigabytes of video files together.

How it started

Last summer, I had the opportunity to help run several events as part of my internship with the Fedora Project. These included virtual events such as both the Fedora 40 and 41 Release Parties as well as Fedora’s Week of Diversity event. I also attended Flock to Fedora 2024 and gave a short talk on my role and some of the other projects I’d been up to throughout my internship (after some A/V failures).

While helping run these events, I began creating many tools (such as Matrix bots) to make the process easier. My goal was to develop scripts that would accelerate the existing workflow and get videos out to the community faster after each event.

Doing it the wrong way, but faster

1. Imperfect cuts

Isn’t all recorded video simply a series of photos taken in rapid succession? Surely you can just split the video between any two frames, right?

Yes, that was a joke.

Modern video formats take advantage of what’s known as “interframe compression” to save space. As you capture video faster and faster (i.e. at a higher frame rate), less time has passed between each frame, meaning fewer things have likely moved or changed from one frame to the next.

Instead of storing every frame, it’s possible to save quite a lot of space by only storing full frames occasionally and encoding the in-between frames as “differences” or “transformations” of those full frames to create the next frame on the fly. While this is great for storage efficiency, this kind of frame compression means you can’t just chop a video wherever you want. These “difference” frames need to be processed relative to what is around them to make any sense.
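
If you are curious where those full frames actually fall in a real file, ffprobe can list them. Here is a minimal sketch, assuming ffprobe is installed and using a placeholder file named talk.mp4 (this is not taken from the scripts described in this post). In typical streaming footage there is often only one keyframe every few seconds, which is exactly why a naive cut point can land seconds away from where you wanted it.

    # List the timestamps of keyframes (full frames) in a video file.
    # Assumes ffprobe is on PATH and "talk.mp4" exists -- both are placeholders.
    import subprocess

    result = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-select_streams", "v:0",
            "-skip_frame", "nokey",            # only decode/report keyframes
            "-show_entries", "frame=pts_time",
            "-of", "csv=p=0",
            "talk.mp4",
        ],
        capture_output=True, text=True, check=True,
    )

    keyframe_times = []
    for line in result.stdout.splitlines():
        value = line.strip().strip(",")
        if value and value != "N/A":
            keyframe_times.append(float(value))

    print(keyframe_times[:10])  # often only one keyframe every few seconds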

Because early versions of the script were making cuts in a way that rounded to the nearest full frame (keyframe), cuts would frequently land too early or too late by up to a couple of seconds.

Of course, it is possible to decode every frame of the video in full quality, remove the ones you don’t want, and then re-encode the video from that new set of frames, but this can easily take longer than the video’s own duration. When you’re processing several days’ worth of multiple concurrent live streams, spending more time than it would take to “watch” every single livestream is pretty inefficient.

The solution? The LosslessCut video editing tool has a feature called “smart cut” that handles these kinds of cuts intelligently, re-encoding only the minimum possible amount. It is both a LOT faster than a full re-encode and a lot more accurate than existing “fast cut” operations.
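
To make the trade-off concrete, here is roughly what the two naive extremes look like with plain ffmpeg. This is a sketch with placeholder filenames and timestamps, and it is not how LosslessCut is implemented: a stream-copy cut that snaps to keyframes, and a frame-accurate cut that re-encodes everything. Smart cut sits between the two, re-encoding only the short stretch up to the next keyframe and stream-copying the rest.

    # Two ways to cut a clip out of "talk.mp4" with plain ffmpeg (illustration only).
    import subprocess

    # 1. Fast but imprecise: stream copy snaps the start of the cut to a keyframe,
    #    so the clip can begin seconds away from the requested 00:05:00.
    subprocess.run([
        "ffmpeg", "-ss", "00:05:00", "-to", "00:35:00", "-i", "talk.mp4",
        "-c", "copy", "fast_cut.mp4",
    ], check=True)

    # 2. Precise but slow: decoding and re-encoding every frame gives an exact cut,
    #    at the cost of running for a large fraction of the clip's duration.
    subprocess.run([
        "ffmpeg", "-ss", "00:05:00", "-to", "00:35:00", "-i", "talk.mp4",
        "-c:v", "libx264", "-c:a", "aac", "precise_cut.mp4",
    ], check=True)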

2. Video Format consistency

Surely just having all the videos in the same format will work better right?

Right?

Depends on your definition of “the same.”

Early on in the development of these scripts, I thought Google’s new webm video format would be the easiest to work with and keep things simple. The tools for downloading live streams from YouTube even supported selecting this format, how perfect!

It was not perfect.

As it turns out, Google probably knows a lot more about video formats than I do. I had assumed Google was offering up the same video in a variety of different formats for convenience; in reality they were trying to save as much space as possible by offering different quality levels of video in different formats.

By blindly assuming that the format I had chosen was available at the same quality as everything else, I was inadvertently including lower-resolution content in the final edits, causing the quality to change mid-video.

Thankfully, the YouTube comments let me know almost immediately.

Ultimately, the solution was to download the highest available quality (which usually meant a mix of mkv, mp4, and webm files) and make the formats match after the fact.
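
For anyone building something similar, here is a rough sketch of that approach: grab whatever the highest-quality streams happen to be, merge them into one container, and then probe the result so a lower-resolution source stands out before editing. The URL and filenames are placeholders, and this is not the exact tooling used for these events.

    # Download the best available quality and then check what actually arrived.
    import json
    import subprocess

    url = "https://www.youtube.com/watch?v=EXAMPLE"  # placeholder

    # yt-dlp picks the highest-quality video+audio streams and merges them into mkv.
    subprocess.run([
        "yt-dlp", "-f", "bestvideo+bestaudio/best",
        "--merge-output-format", "mkv",
        "-o", "talk.mkv", url,
    ], check=True)

    # Verify resolution and codec so lower-quality sources stand out before editing.
    probe = subprocess.run([
        "ffprobe", "-v", "error", "-select_streams", "v:0",
        "-show_entries", "stream=codec_name,width,height",
        "-of", "json", "talk.mkv",
    ], capture_output=True, text=True, check=True)
    print(json.loads(probe.stdout)["streams"][0])  # e.g. {'codec_name': 'vp9', 'width': 1920, ...}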

3. High Pitched Voices (but only some of them)

Then came the weirdest bug of all. As the talks started to go live and people began watching, they noticed that voices seemed pitch-shifted. Not all of them, only some of them.

The culprit here ended up being silence.

I wish this was a joke.

The part of my script responsible for converting the provided title images into short video clips (added before the main talk content) needed to create a few seconds of silent audio to play while the title image showed. This ensured the rest of the audio started playing as soon as the title slide was done, and not before.
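
Generating such a clip is straightforward with ffmpeg: loop the still image for a few seconds and pair it with generated silence. The sketch below uses placeholder filenames, durations, and encoder settings, and is not the project’s actual script; note the explicit sample rate on the silence generator, because that single number is what caused the bug described next.

    # Turn a still title image into a short video clip with a silent audio track.
    import subprocess

    SAMPLE_RATE = 48000  # the rate the silence was generated at

    subprocess.run([
        "ffmpeg",
        "-loop", "1", "-framerate", "30", "-t", "3", "-i", "title.png",          # 3 s of the image
        "-f", "lavfi", "-t", "3", "-i", f"anullsrc=r={SAMPLE_RATE}:cl=stereo",   # 3 s of silence
        "-c:v", "libx264", "-pix_fmt", "yuv420p",
        "-c:a", "aac", "-shortest",
        "title_clip.mp4",
    ], check=True)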

(Now is your chance to check “multiple types of silence” off your “why is the video broken” bingo card…)

Digital audio formats represent sound by taking tens of thousands of measurements (samples) per second so they can approximate that same sound later. While the end result may sound the same, some audio files use 48,000 samples per second, whereas others use 44,100. That may seem like a negligible difference, but it affects how precisely sounds get stored, especially higher-frequency sounds.

Because the silent audio track was configured at the faster 48,000-sample rate, the whole video file treated all of the sound as if it had been recorded at that rate, even though the audio from the event was recorded at the lower rate of 44,100 samples per second. This mismatch caused the audio to play back about 8% faster than it was recorded, leading to two separate audio issues in the recordings:

  1. A noticeable “pop” or moment of silence every few moments when the faster-playing audio ran out of data and had to wait until the next audio chunk.
  2. Audio incorporating higher-pitched tones would seem more noticeably pitch-shifted and warped compared to audio with only lower tones. This was because higher frequencies are more sensitive to this difference in playback speed than lower tones.
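
The mismatch itself is simple arithmetic: 48,000 / 44,100 ≈ 1.09, so 44.1 kHz audio interpreted as 48 kHz plays back roughly 8–9% too fast. It is also easy to guard against once you know to look for it. Here is a minimal sketch (same placeholder filenames as the earlier examples) that probes each clip’s audio sample rate and resamples anything that does not match before the clips are joined.

    # Check every clip's audio sample rate and resample outliers to a common target.
    import subprocess

    TARGET_RATE = 44100  # match the rate the event audio was recorded at

    def audio_sample_rate(path: str) -> int:
        """Return the sample rate of the first audio stream, via ffprobe."""
        out = subprocess.run([
            "ffprobe", "-v", "error", "-select_streams", "a:0",
            "-show_entries", "stream=sample_rate",
            "-of", "default=nw=1:nk=1", path,
        ], capture_output=True, text=True, check=True)
        return int(out.stdout.strip())

    for clip in ["title_clip.mp4", "talk.mkv"]:
        if audio_sample_rate(clip) != TARGET_RATE:
            # Copy the video untouched, re-encode only the audio at the target rate.
            subprocess.run([
                "ffmpeg", "-i", clip,
                "-c:v", "copy", "-c:a", "aac", "-ar", str(TARGET_RATE),
                "fixed_" + clip,
            ], check=True)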

4. Human Problems and Miscellany

Another technical problem: YouTube forces scheduled videos to be private. It is not possible to schedule a video to go from unlisted to public on a particular date. The videos MUST start as private. This created some friction since I was using YouTube Playlists to soft-publish the unlisted videos before they were fully published.
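
The constraint is baked into the YouTube Data API itself: status.publishAt is only accepted while status.privacyStatus is "private". Below is a rough sketch using the google-api-python-client, with OAuth setup omitted and a placeholder video ID; it is not the project’s actual upload tooling.

    # Schedule a video to flip from private to public at a set time (YouTube Data API v3).
    from googleapiclient.discovery import build

    def schedule_publish(credentials, video_id: str, publish_at: str) -> None:
        """credentials: an already-authorized OAuth2 credential (setup omitted here)."""
        youtube = build("youtube", "v3", credentials=credentials)
        youtube.videos().update(
            part="status",
            body={
                "id": video_id,
                "status": {
                    # publishAt is only honoured on private videos; an unlisted
                    # video cannot be scheduled to become public later.
                    "privacyStatus": "private",
                    "publishAt": publish_at,  # RFC 3339, e.g. "2025-07-01T17:00:00Z"
                },
            },
        ).execute()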

The rest of the issues are things I would lump together as “human problems” on my part. These included things like:

Shout Outs

I would like to shout out everyone who helped make these uploads happen, including:

So what could be better for next time?

Ultimately, I think the process can be improved next time by:

  1. Potentially building better tooling to systematically check/review large batches of videos, in a way that spreads the work across a whole review team (maybe with some kind of web interface)
  2. Clearer guidance for speakers on how they should input their name into PreTalx since the scripts currently just use that as-is for the thumbnails.
  3. Reducing the tedium with (potentially) some more automations for things like uploading to YouTube, or attaching the YouTube URLs to the PreTalx Schedule.
