How not to process large quantities of video
By Adrian Edwards

You might think of video as just a sequence of still images played back really fast, but there’s a lot that has to happen for it all to work as smoothly as it does today. So grab your nearest comfort object and buckle up, because I’m about to talk about more things than you ever thought you wanted to know about video files.
Consider this post a kind of “Cursed Knowledge” compendium, if you will. I’ll leave it up to you whether it’s a helpful guide or a list of reasons to run away from any task involving merging gigabytes of video files together.
How it started
Last summer, I had the opportunity to help run several events as part of my internship with the Fedora Project. These included virtual events such as both the Fedora 40 and 41 Release Parties as well as Fedora’s Week of Diversity event. I also attended Flock to Fedora 2024 and gave a short talk on my role and some of the other projects I’d been up to throughout my internship (after some A/V failures).
While helping run these events, I began creating tools (such as Matrix bots) to make the process easier. My goal was to develop scripts that would accelerate the existing workflow and get videos out to the community faster after each event.
Doing it the wrong way, but faster
1. Imperfect cuts
Isn’t all recorded video simply a series of photos taken in rapid succession? Surely you can just split the video between any two frames, right?
Yes, that was a joke.
Modern video formats take advantage of what’s known as “interframe compression” to save space. As you capture video faster and faster (i.e. at a higher frame rate), less time has passed between each frame, meaning fewer things have likely moved or changed from one frame to the next.
Instead of storing every frame, it’s possible to save quite a lot of space by only occasionally storing full frames (called keyframes) and encoding the in-between frames as “differences” or “transformations” that reconstruct the next frame on the fly. While this is great for storage efficiency, this kind of compression means you can’t just chop a video wherever you want. These “difference” frames only make sense relative to the frames around them.
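If you want to see this structure for yourself, ffprobe (part of the ffmpeg suite) can list where the keyframes actually are. Here’s a minimal sketch, assuming ffmpeg/ffprobe is installed and using a hypothetical talk.mp4:

```python
import subprocess

def keyframe_times(path: str) -> list[float]:
    """Return the presentation timestamps (in seconds) of a video's keyframes."""
    out = subprocess.run(
        [
            "ffprobe", "-v", "error",
            "-select_streams", "v:0",
            # Decode only keyframes so this stays fast even on long videos.
            "-skip_frame", "nokey",
            "-show_entries", "frame=pts_time",
            "-of", "csv=p=0",
            path,
        ],
        capture_output=True, text=True, check=True,
    ).stdout
    times = []
    for line in out.splitlines():
        try:
            times.append(float(line.strip().rstrip(",")))
        except ValueError:
            pass  # ignore the occasional non-numeric row ffprobe emits
    return times

print(keyframe_times("talk.mp4")[:10])  # often seconds apart, not frames
```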
Because early versions of the script were making cuts that rounded to the nearest keyframe, and keyframes can be several seconds apart, cuts would frequently land too early or too late by up to a couple of seconds.
Of course, it is possible to decode every frame of the video in full quality, remove the ones you don’t want, and re-encode the video from the remaining frames, but this often takes longer than the video’s own duration. When you’re processing multiple days’ worth of several concurrent live streams, having to spend more time than it would take to “watch” every single livestream is pretty inefficient.
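To make the trade-off concrete, here is roughly what the two naive approaches look like as ffmpeg invocations (a sketch with hypothetical file names, not my exact script):

```python
import subprocess

def fast_cut(src: str, dst: str, start: float, end: float) -> None:
    """Stream copy: near-instant, but the cut snaps to a keyframe,
    so it can land seconds away from where you asked."""
    subprocess.run(["ffmpeg", "-y", "-ss", str(start), "-i", src,
                    "-t", str(end - start), "-c", "copy", dst], check=True)

def accurate_cut(src: str, dst: str, start: float, end: float) -> None:
    """Full re-encode: frame-accurate, but decoding and re-encoding
    every frame can take longer than the clip itself lasts."""
    subprocess.run(["ffmpeg", "-y", "-i", src, "-ss", str(start),
                    "-to", str(end), "-c:v", "libx264", "-c:a", "aac", dst],
                   check=True)
```

Early versions of my script were effectively doing the first of these.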
The solution? The LosslessCut video editing tool has a feature called “smart cut” that handles these cuts intelligently, re-encoding only the minimum possible amount of video. It is both a LOT faster than a full re-encode and far more accurate than the usual keyframe-snapping “fast cut” operations.
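LosslessCut’s actual implementation is more involved, but the core idea can be sketched like this: re-encode only the short span from the requested cut point to the next keyframe, stream-copy everything after it, and splice the two pieces together. This is a simplified illustration (hypothetical file names, and it glosses over matching encoder parameters exactly, which is the genuinely hard part):

```python
import subprocess

def smart_cut(src: str, dst: str, start: float, end: float,
              keyframes: list[float]) -> None:
    """Cut [start, end) out of src, re-encoding as little as possible.
    `keyframes` could come from the ffprobe sketch above."""
    # First keyframe at or after the requested start point.
    kf = next(t for t in keyframes if t >= start)

    # Head: frame-accurate re-encode, but only a few seconds long.
    subprocess.run(["ffmpeg", "-y", "-i", src, "-ss", str(start),
                    "-to", str(kf), "-c:v", "libx264", "-c:a", "aac",
                    "head.mp4"], check=True)

    # Tail: starts exactly on a keyframe, so a lossless stream copy works.
    subprocess.run(["ffmpeg", "-y", "-ss", str(kf), "-i", src,
                    "-t", str(end - kf), "-c", "copy", "tail.mp4"],
                   check=True)

    # Splice the two pieces with the concat demuxer (no further re-encode).
    with open("parts.txt", "w") as f:
        f.write("file 'head.mp4'\nfile 'tail.mp4'\n")
    subprocess.run(["ffmpeg", "-y", "-f", "concat", "-safe", "0",
                    "-i", "parts.txt", "-c", "copy", dst], check=True)
```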
2. Video Format consistency
Surely just having all the videos in the same format will work better right?
Right?
Depends on your definition of “the same.”
Early on in the development of these scripts, I thought Google’s webm video format would be the easiest to work with and would keep things simple. The tools for downloading live streams from YouTube even supported selecting this format. How perfect!
It was not perfect.
As it turns out, Google probably knows a lot more about video formats than I do. I had assumed Google was offering up the same video in a variety of different formats for convenience; in reality they were trying to save as much space as possible by offering different quality levels of video in different formats.
By blindly assuming that the format I had chosen was available at the same quality as everything else, I was inadvertently including lower-resolution content in the final edits, causing the quality to change mid-video.
Thankfully, the YouTube comments let me know almost immediately.
Ultimately the solution was to download the highest available quality (which usually meant a mix of mkv, mp4, and webm files) and make the formats consistent after the fact.
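In practice this looked something like the sketch below: ask the downloader for the best quality it can find, merge everything into one container, and verify the result. (I’m assuming yt-dlp as the download tool here; the URL and file names are placeholders.)

```python
import subprocess

url = "https://www.youtube.com/watch?v=..."  # placeholder

# "bestvideo+bestaudio/best" grabs the highest-quality video and audio
# streams even when they live in different containers, instead of pinning
# one format and silently getting a lower-resolution variant of it.
subprocess.run(["yt-dlp", "-f", "bestvideo+bestaudio/best",
                "--merge-output-format", "mkv",
                "-o", "talk.mkv", url], check=True)

# Sanity-check the resolution afterwards so a low-res variant can't slip
# into the final edit unnoticed.
probe = subprocess.run(
    ["ffprobe", "-v", "error", "-select_streams", "v:0",
     "-show_entries", "stream=width,height", "-of", "csv=p=0", "talk.mkv"],
    capture_output=True, text=True, check=True).stdout.strip()
print(probe)  # e.g. "1920,1080"
```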
3. High Pitched Voices (but only some of them)
Then came the weirdest bug of all. As the talks started to go live and people began watching, they noticed that voices seemed pitch-shifted. Not all of them, only some of them.
The culprit here ended up being silence.
I wish this was a joke.
My script converted the provided title images into short video clips to play before the main talk content, and it needed to create a few seconds of silent audio to go with each title image. This was to ensure the rest of the audio started playing as soon as the title slide was done, and not before.
(Now is your chance to check “multiple types of silence” off your “why is the video broken” bingo card…)
Digital audio formats store sound as tens of thousands of measurements (samples) per second so that the same sound can be approximated later. While the end results may sound the same, some audio files use 48,000 samples per second, whereas others use 44,100. The difference seems tiny, but it changes how precisely sounds get stored, particularly higher-frequency sounds.
Because this silent audio track was configured at the faster 48,000-samples-per-second rate, the whole video file treated all of the audio as if it had been recorded at that rate, even though the audio from the event was recorded at 44,100 samples per second. Playing 44,100 Hz audio back at 48,000 Hz makes it run about 8.8% faster than it was recorded (48,000 ÷ 44,100 ≈ 1.088), leading to two separate audio issues in the recordings:
- A noticeable “pop” or moment of silence every few moments when the faster-playing audio ran out of data and had to wait until the next audio chunk.
- Audio containing higher-pitched tones seemed more noticeably pitch-shifted and warped than audio with only lower tones, because higher frequencies are more audibly affected by this difference in playback speed.
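The fix was to generate the silence at whatever rate the event audio actually used, rather than hard-coding one. Roughly, assuming ffmpeg/ffprobe are installed and using hypothetical file names:

```python
import subprocess

def sample_rate(path: str) -> int:
    """Read the sample rate of the first audio stream via ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "a:0",
         "-show_entries", "stream=sample_rate", "-of", "csv=p=0", path],
        capture_output=True, text=True, check=True).stdout.strip()
    return int(out)

rate = sample_rate("talk.mkv")  # e.g. 44100

# anullsrc generates digital silence; r= must match the talk's rate, or
# the merged file plays everything at the wrong speed (48000/44100 ≈ 1.088,
# i.e. ~8.8% too fast and pitch-shifted up).
subprocess.run(["ffmpeg", "-y", "-f", "lavfi",
                "-i", f"anullsrc=r={rate}:cl=stereo",
                "-t", "3", "title_silence.wav"], check=True)
```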
4. Human Problems and Miscellany
Another technical problem: YouTube forces scheduled videos to be private. It is not possible to schedule a video to go from unlisted to public on a particular date. The videos MUST start as private. This created some friction, since I had been using YouTube playlists to soft-publish videos as unlisted before they were fully published.
The rest of the issues are things I would lump together as “human problems” on my part. These included things like:
- Forgetting, or incorrectly performing, some of the steps in the process that were not yet handled by the automation (such as removing non-talk “break” sessions from the schedule prior to processing)
- Not manually checking every minute of every video. This would have been really time-consuming given how many hours of video there were, so I just… didn’t. That decision snowballed into additional issues, such as YouTube commenters identifying quality problems in some of the early videos. Since the URLs had already been manually added to the schedule, pulling those videos down and re-uploading them meant even more manual work that slowed down the release of the remaining videos.
Shout Outs
I would like to shout out everyone who helped make these uploads happen, including:
- Justin, Aoife, Natalie, Dorka, and everyone else who helped run Flock 2024
- The media team from On Site A/V, who provided and managed the capture and presentation hardware and were great to work with (especially Greg)
- YouTube commenter @gabrielrmattoso, for posting timestamps on many of the talk videos (these were an awesome sanity check that the automations were working and shortcut the process by a lot!)
- Everyone else who left YouTube comments to point out errors with the videos. Your feedback helped me catch the issues sooner!
So what could be better for next time?
Ultimately, I think the process can be improved next time by:
- Better tooling to systematically check/review large batches of videos, ideally in a way that spreads the work across a whole review team (maybe with some kind of web interface)
- Clearer guidance for speakers on how to enter their name in PreTalx, since the scripts currently use that text as-is for the thumbnails
- Reducing the tedium with (potentially) more automation for things like uploading to YouTube or attaching the YouTube URLs to the PreTalx schedule