In this assignment, my group and I were tasked with turning a raw video file into an encoded one with semantic layering (foreground and background layers) and quantized DCT coefficients.
To achieve this, we broke each video frame into 16x16-pixel macroblocks. For each macroblock, we estimated its movement between frames by searching the subsequent frame for the block with the smallest difference from it. From the direction and distance of that best match, we computed a motion vector for the macroblock. We then clustered these motion vectors to separate background macroblocks (those most similar to the frame's average motion vector) from foreground macroblocks (those most different from it). Each macroblock was written to the encoded file as its list of Discrete Cosine Transform (DCT) coefficients.
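The motion-estimation and layer-splitting steps above can be sketched roughly as follows. This is a simplified illustration, not our actual encoder: the search radius, the deviation threshold, and the use of a sum-of-absolute-differences metric are assumptions, and the real implementation clustered vectors rather than thresholding against the mean.

```python
import numpy as np

BLOCK = 16    # macroblock size used in the assignment
SEARCH = 8    # hypothetical search radius in pixels

def motion_vector(prev, curr, by, bx):
    """Find the motion vector for the macroblock at (by, bx) in `curr`
    by exhaustively searching `prev` for the best-matching block,
    scored here with the sum of absolute differences (SAD)."""
    block = curr[by:by + BLOCK, bx:bx + BLOCK].astype(np.int32)
    best, best_mv = None, (0, 0)
    for dy in range(-SEARCH, SEARCH + 1):
        for dx in range(-SEARCH, SEARCH + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + BLOCK > prev.shape[0] or x + BLOCK > prev.shape[1]:
                continue  # candidate block falls outside the frame
            cand = prev[y:y + BLOCK, x:x + BLOCK].astype(np.int32)
            sad = np.abs(block - cand).sum()
            if best is None or sad < best:
                best, best_mv = sad, (dy, dx)
    return best_mv

def split_layers(mvs, threshold=2.0):
    """Label a macroblock foreground (True) if its motion vector deviates
    from the frame's mean vector by more than `threshold` pixels."""
    mvs = np.asarray(mvs, dtype=float)
    dist = np.linalg.norm(mvs - mvs.mean(axis=0), axis=1)
    return dist > threshold
```

An exhaustive search like this is simple but slow; production encoders typically use hierarchical or diamond search patterns to cut the number of candidate blocks.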
We then had to create a video decoder that plays back the video and quantizes the foreground and background layers at specified quality levels. By loading 10 frames into a buffer before playback and continuing to load frames as the video played, we achieved real-time playback. The player also supported a simulated VR "Gaze Control": dragging the mouse over the video quantizes a 64x64-pixel square under the cursor at full quality. The idea is that in a virtual reality application, one would want higher quality wherever the viewer's eyes are focused.
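The buffered playback described above follows a producer-consumer pattern: a decoder thread fills a bounded buffer while the main thread renders at a fixed frame rate. The sketch below illustrates that idea under stated assumptions; `decode_one` and `render` are hypothetical stand-ins for the real DCT-dequantize step and the actual display code.

```python
import threading
import queue
import time

BUFFER_SIZE = 10   # frames pre-loaded before playback starts, as in the report

def play(encoded_frames, decode_one, render, fps=30):
    """Decode frames on a background thread into a bounded buffer while
    the main thread renders them at roughly `fps` frames per second."""
    buf = queue.Queue(maxsize=BUFFER_SIZE)

    def producer():
        for enc in encoded_frames:
            buf.put(decode_one(enc))   # blocks when the buffer is full
        buf.put(None)                  # sentinel marks end of stream

    t = threading.Thread(target=producer, daemon=True)
    t.start()

    # prime the buffer before playback begins
    while buf.qsize() < BUFFER_SIZE and t.is_alive():
        time.sleep(0.001)

    frame_time = 1.0 / fps
    while True:
        frame = buf.get()
        if frame is None:
            break
        render(frame)
        time.sleep(frame_time)   # crude pacing; a real player syncs to a clock
```

The bounded queue gives backpressure for free: if decoding outpaces rendering, `buf.put` blocks until the player catches up.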
The following video shows the decoder in action. The background layer was quantized at 1/1000th of the value to exaggerate the difference between the foreground and background layers. When the people move, their macroblocks are assigned to the foreground layer and quantized at full quality.
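The per-layer quality difference in the demo can be sketched as uniform quantization of each macroblock's DCT coefficients with a per-layer step size. The specific step values here are assumptions chosen to mirror the 1/1000th ratio mentioned above, not the constants from our decoder.

```python
import numpy as np

# per-layer quality: a small step keeps the foreground near-lossless,
# while a step ~1000x larger coarsens the background, exaggerating
# the split between the two layers (assumed values for illustration)
FG_STEP, BG_STEP = 1.0, 1000.0

def quantize_block(dct_coeffs, step):
    """Uniformly quantize one macroblock's DCT coefficients to integer levels."""
    return np.round(dct_coeffs / step).astype(np.int32)

def dequantize_block(levels, step):
    """Reconstruct approximate coefficients from the quantized levels."""
    return levels.astype(np.float64) * step
```

With a large step, most background coefficients round to zero, which is exactly why the background in the video looks so heavily degraded relative to the foreground.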