Take a look at YouTube’s recommended upload settings. Do you know what all this means? The nuts and bolts of a digital video and audio file are important to understand, even if your end deliverable is just a consumer web player.
The size of an image is sometimes called its “raster”. The most fundamental element of a digital image is the “picture element”, or “pixel”. When you hear people talk about “HD” or “8K”, they’re talking about resolution: how many pixels make up a given frame.
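To make the numbers concrete, here’s a quick sketch (using the standard published dimensions) of how pixel counts scale with resolution:

```python
# Pixel counts for common video resolutions (standard width x height values).
resolutions = {
    "SD (NTSC)": (720, 480),
    "HD 1080p": (1920, 1080),
    "UHD 4K": (3840, 2160),
    "UHD 8K": (7680, 4320),
}

for name, (w, h) in resolutions.items():
    print(f"{name}: {w} x {h} = {w * h:,} pixels")
```

Note that doubling both dimensions quadruples the pixel count, which is why 8K carries sixteen times the pixels of 1080p.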
Resolution matters much less than most people think. The difference between standard definition (“SD”) and high definition (“HD”) was very noticeable, but beyond HD the returns diminish. Steve Yedlin (Star Wars DP) has one of the better explanations of where resolution fits in the image-quality spectrum.
The “aspect ratio” refers to the ratio of the width to the height of the overall image.
But the pixels themselves can also have an aspect ratio if they are “non-square pixels”. If the software you’re using doesn’t recognize the format as a non-square-pixel format, your image will appear stretched or squashed. Many modern formats don’t have this issue, but it’s a good one to be aware of.
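As a sketch of the math involved, here’s how a pixel aspect ratio (PAR) changes the displayed width; the 720-pixel-wide NTSC DV raster and the PAR values used here are the commonly cited ones:

```python
def display_width(stored_width: int, par: float) -> int:
    """Width of the image as displayed, after applying the pixel aspect ratio."""
    return round(stored_width * par)

# NTSC DV stores a 720-pixel-wide raster for both 4:3 and 16:9 material;
# only the pixel aspect ratio differs.
print(display_width(720, 0.9091))   # 4:3 (narrow pixels)
print(display_width(720, 1.2121))   # 16:9 anamorphic (wide pixels)
```

Same stored raster, two very different pictures, which is exactly why software that ignores the PAR shows a distorted image.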
A progressive image refers to nothing more than showing the entire video frame at once.
Early video signals were limited by broadcast bandwidth, so television introduced interlaced video. To keep motion smooth and reduce flicker, the image was essentially divided in half and displayed as alternating sets of lines. Every other line is displayed at any given time in an interlaced “field”. These alternating fields look funny when paused, but in playback they actually increase temporal resolution.
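The field split can be sketched in a few lines: an interlaced frame is just a frame whose even and odd scanlines come from two different moments in time.

```python
# Model a frame as a list of scanlines and split it into its two fields.
frame = [f"scanline {n}" for n in range(6)]

upper_field = frame[0::2]  # lines 0, 2, 4 (captured at time t)
lower_field = frame[1::2]  # lines 1, 3, 5 (captured 1/60 s later for 60i)
```

Each field has half the spatial resolution of the full frame, but the pair delivers twice as many motion samples per second.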
Frame rate refers to how quickly frames of video are captured in succession, and it’s usually measured in “frames per second” or FPS. Common frame rates include:
Remember to set the frame rate of your project at the beginning. Most modern NLE software allows you to set a frame rate on a per-timeline basis, but it’s something you need to consciously choose.
Any clips that do not conform to the timeline frame rate will be played back with skipped or duplicated frames to try to match the timeline frame rate.
24fps to 30fps is easy; you’re essentially adding a duplicate frame for every 4 source frames, which isn’t all that noticeable. It’s the same process as the ‘pulldown’ common to telecine operations of the past, where film material was converted for broadcast (more on that below). Adding frames works especially well when going to 60i.
Converting from 30 fps to 24 fps is quite tough, so try to avoid it. You have to drop one of every five frames, which leaves uneven gaps in the motion; the resulting judder is visually noticeable, and smoothing it requires some form of motion-vector analysis or “optical flow” to blend frames together. This can result in some weird artifacting. Most of the other common conversions are possible with less potential for harming the image.
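A minimal nearest-frame resampler (a sketch, not what a real optical-flow converter does) shows why the two directions feel so different:

```python
def resample_indices(src_fps: int, dst_fps: int, n_src: int) -> list:
    """Which source frame plays at each output frame (nearest frame, no blending)."""
    n_out = n_src * dst_fps // src_fps
    return [int(i * src_fps / dst_fps) for i in range(n_out)]

# 30 -> 24 over one second: every 5th source frame (4, 9, 14, ...) is skipped.
print(resample_indices(30, 24, 30))
# 24 -> 30 over one second: every 4th source frame (0, 4, 8, ...) plays twice.
print(resample_indices(24, 30, 24))
```

Duplicated frames mostly just hold the picture still for a moment; skipped frames remove motion information that was actually captured, which is where the judder comes from.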
This is a great reference on what happens when putting high frame rate footage into a lower frame rate timeline.
This is the process of converting 24 fps footage to 60i using a 2:3 pulldown. See how 5 interlaced frames are made from 4 film frames, and how the 3rd and 4th frames are hybrids containing fields from 2 different source frames? The opposite direction, 60i to 24p, is called a ‘reverse pulldown’.
You can see that the increased “temporal” resolution of 60i makes for a better conversion to 24p.
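The 2:3 cadence itself is easy to sketch: each film frame is held for alternately two and three fields, and pairing the fields back into frames produces the hybrid frames described above.

```python
def two_three_pulldown(film_frames):
    """Map four film frames (A B C D) onto five interlaced frames of paired fields."""
    cadence = [2, 3, 2, 3]          # fields each film frame is held for
    fields = []
    for frame, held in zip(film_frames, cadence):
        fields.extend([frame] * held)
    # Pair consecutive fields into (upper, lower) interlaced frames.
    return list(zip(fields[0::2], fields[1::2]))

print(two_three_pulldown(["A", "B", "C", "D"]))
# -> [('A', 'A'), ('B', 'B'), ('B', 'C'), ('C', 'D'), ('D', 'D')]
```

The mixed pairs ('B', 'C') and ('C', 'D') are the hybrid frames; a reverse pulldown has to detect this cadence and discard those mismatched fields to recover the original 4 frames.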
Before discussing how color can be discarded (chroma subsampling) it’s useful to understand how color is initially captured at the sensor level.
So there’s already a form of resolution reduction inherent to how the camera captures different colors of light, a kind of “chroma subsampling” at the sensor. However, after the electrical charges from photons striking the sensor are converted to digital values and demosaicing is performed, another form of color compression can be applied, and that’s what most people mean when they reference chroma subsampling.
The data rate of a video is simply a measure of how much data it consumes, usually measured per second. Uncompressed video is very inefficient and therefore rarely used. Most video formats employ some form of compression, even if it’s not immediately apparent. A common bitrate for highly compressed footage would be an HD YouTube video at around 8Mbps (8 megabits per second); at the other extreme, something like a Sony Venice shooting in 6K produces more than 2Gbps (2 gigabits per second).
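The arithmetic is worth internalizing. A rough size calculator (decimal gigabytes, ignoring audio and container overhead) using the two bitrates mentioned above:

```python
def file_size_gb(bitrate_mbps: float, seconds: float) -> float:
    """Approximate file size in decimal gigabytes for a given average bitrate."""
    bits = bitrate_mbps * 1_000_000 * seconds
    return bits / 8 / 1_000_000_000   # bits -> bytes -> gigabytes

# Ten minutes of 8 Mbps YouTube HD vs. ten minutes at a 2 Gbps camera data rate.
print(f"{file_size_gb(8, 600):.1f} GB")       # 0.6 GB
print(f"{file_size_gb(2000, 600):.0f} GB")    # 150 GB
```

Watch the units: storage is sold in bytes while bitrates are quoted in bits, hence the divide-by-eight.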
VBR means “variable bit rate” or that the bit rate changes over time. This is useful since some complex scenes have a lot of motion and could require a lot more information than other scenes. Think of a thousand tree leaves swaying in the wind. Every frame is different and the amount of detail is complex. Contrast that with a static shot of the sky. There is little detail in the frame and little change over time so the data rate required can be much lower. The alternative to VBR is CBR or “Constant Bit Rate” where the bitrate is steady through the entirety of the clip.
We know it’s common to decrease the data rate of a video file, and compression is how we do that. You’ve likely heard of the following popular codecs, but what do they mean?
To minimize file sizes, cameras compress (encode) an image upon capture, and playback devices must decompress (decode) it. This compressor/decompressor pair is abbreviated as “codec”. Choosing a codec is a very important part of any workflow. Some codecs are efficient for storing video files but very demanding to play back, and may not leave much room to manipulate the image in post. Codecs employ two main types of compression.
A mezzanine codec is one designed for post production. Camera originals are converted to this format for editorial. It’s generally intraframe for quick scrubbing with minimal system taxation, and it can survive multiple generations of encoding without significant quality loss. A workflow using a mezzanine codec does not expect to relink to camera originals (OCN).
Open/Closed GOP: “A closed GOP is a group of pictures in which the last pictures do not need data from the next GOP for bidirectional coding. Closed GOP is used to make a splice point in a bit stream.”
As always, Frame.io has a great explanation on codecs.
Compression within the frame. This is considered intra-frame compression since the compression doesn’t cross between frames.
Compression across various frames over time. This is called inter-frame compression since the compressor will look at “groups of pictures” and compress them together for increased efficiency.
The “I” frame contains the entire image, but successive frames (called “P” and “B” frames) contain only the parts of the image that change over time; “P” frames predict from previous frames, while “B” frames can reference frames in both directions.
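A toy version of the idea (not a real encoder, which predicts with motion vectors rather than raw pixel diffs): store the first frame whole, then store only the changed pixels for each later frame.

```python
# Frames as flat lists of pixel values; a GOP of one "I" frame plus diffs.
frames = [
    [1, 1, 1, 1],
    [1, 1, 9, 1],   # one pixel changed
    [1, 1, 9, 1],   # nothing changed
]

encoded = [("I", list(frames[0]))]
for prev, cur in zip(frames, frames[1:]):
    changed = [(i, v) for i, (p, v) in enumerate(zip(prev, cur)) if p != v]
    encoded.append(("P", changed))

print(encoded)
```

Notice the third frame costs almost nothing to store, which is exactly why static shots compress so much better than swaying leaves.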
It’s important to realize that the codec of a video file is not the same as its container. H.264 is a compression standard, but it could live within a .mov container, a .mp4 container, a .mxf container, etc. The container is basically a standardized way of describing how the content is stored, both the video and its metadata.
Bit depth refers to the number of bits used to store a color value. The higher the number, the more granular the values that can be stored.
Bit depth is not dynamic range. In other words, your blackest black and whitest white do not change with bit depth. The “range” of colors doesn’t even necessarily change. The effect of bit depth is in the amount of subtle differences between the darkest and brightest point of any given color.
Imagine it like a staircase. Increasing bit depth doesn’t move the first floor any lower or the second floor any higher, it simply adds more steps between the levels.
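The number of steps doubles with every extra bit, so the staircase gets finer very quickly. A quick sketch:

```python
# Code values ("steps") available per channel at common bit depths.
levels = {bits: 2 ** bits for bits in (8, 10, 12)}

for bits, n in levels.items():
    print(f"{bits}-bit: {n:,} levels per channel")
```

Going from 8-bit to 10-bit quadruples the steps between black and white, which is why banding in smooth gradients disappears long before the overall range changes at all.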
This B&H video has a good, basic, explanation:
If you find this content interesting, and since we haven’t covered everything referenced at the beginning, read on for more detail.
“Context-adaptive binary arithmetic coding is a form of entropy encoding used in the H.264/MPEG-4 AVC and High Efficiency Video Coding (HEVC) standards. It is a lossless compression technique, although the video coding standards in which it is used are typically for lossy compression applications.”
“Edit Lists” are an Apple-specific MP4 extension; they are atoms that basically allow you to pick portions of the video file for playback. You could, for example, loop just the middle two seconds of a 15-second video. Most people never use them, but YouTube’s warning to avoid them should make sense now.
“Atoms” (or “boxes” in the ISO spec) are data within the video file container that hold specific information about the video file’s parameters. These descriptive atoms differ from the actual media data (the individual frames of video or samples of audio) itself.
Though the atom’s location should be determined at the compression and muxing stage, software does exist to move the “moov” atom after compression has happened. This hierarchical structure, with atoms holding media data kept separate from the atoms describing that data, is part of what makes editing easy in the QuickTime format. In fact, the descriptive bits and the media bits don’t even have to reside in the same .mov file. Media can be ‘redescribed’ by changing the description atoms rather than having to rewrite the media file.
For example, the “moov” atom, sometimes called the “movie atom”, includes information on video length, track count, timescale, and compression. Perhaps most importantly, it’s also the index that records where the actual media to be played is stored. Within the “moov” atom sits a “trak” sub-atom for each of the movie’s tracks, and within each “trak” atom sits an “mdia” atom with even more specific details. The “moov” atom is crucial for playback of the entire clip; without it, the end user can’t scrub the playhead or jump to a location in the clip. For this reason, in some web streaming situations it’s crucial to load it first, which is why you’ll see options in encoding software for “progressive download”, sometimes called “fast start” or “use streaming mode”. “Muxing”, by the way, is the term for merging your video track, audio tracks, and subtitles into one container.
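You can see the atom/box structure yourself by reading the 8-byte headers: a big-endian 32-bit size followed by a 4-character type. A sketch against a tiny synthetic stream (real files can also use 64-bit sizes, which this ignores):

```python
import struct

def top_level_boxes(data: bytes):
    """List (type, size) for the top-level boxes in an ISO BMFF / QuickTime stream."""
    boxes, offset = [], 0
    while offset + 8 <= len(data):
        size, kind = struct.unpack(">I4s", data[offset:offset + 8])
        if size < 8:            # sizes 0 and 1 have special meanings; stop for this sketch
            break
        boxes.append((kind.decode("ascii"), size))
        offset += size
    return boxes

# A minimal synthetic stream: a 16-byte 'ftyp' box followed by an empty 'moov' box.
stream = struct.pack(">I4s4sI", 16, b"ftyp", b"isom", 0) + struct.pack(">I4s", 8, b"moov")
print(top_level_boxes(stream))  # [('ftyp', 16), ('moov', 8)]
```

Run something like this over a real .mp4 and you’ll see whether “moov” comes before or after the big “mdat” box, which is exactly what the “fast start” option changes.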
And another video on compression that dives deeper:
That video went fairly deep, but the practical takeaway is that compression compounds in severity with every generation. This is best demonstrated by a multi-generational compression experiment on YouTube where one creator uploaded the same video to YT 1,000 times.