AT&T Video Optimizer
Streaming Separate Audio and Video
Introduction
When streaming, audio normally accompanies video, but it is often treated as an afterthought, and there is more than one way to match the two together.
Originally, most internet media streaming services multiplexed, or muxed, the audio and video together.
Since then there has been a steady transition to demuxed streaming, where the audio and video segments (also known as chunks) are sent independently and then matched together by the player.
Demuxed streaming can save money by requiring less server storage, and it can improve CDN usage. It also offers more audio options.
While there are advantages to demuxing audio, it requires planning and monitoring, and there are related best practices you should follow.
In the Background section below we offer some context on demuxed streaming. In the Issue section we identify some potential problems, and their underlying causes. In the Best Practices section we offer design principles and recommendations that can help you mitigate these issues, leading to better Quality of Experience (QoE).
Background
Storing the audio and video tracks separately in demuxed mode offers a number of advantages.
- One advantage of demuxing is in situations where a video requires multiple language tracks.
- Another is any scenario where multiple audio quality levels are desired.
- A third advantage is that demuxed audio requires less storage on the origin server: the appropriate audio is stored only once for all the video segments, whereas in muxed mode each video must be replicated to include every audio variant.
- Demuxed mode can also increase CDN caching efficiency. For example, suppose user A requests video track #1 and audio track #1. Then user B requests video track #1 and audio track #2 at a later time.
In muxed mode, user B must download a variant containing both video track #1 and audio track #2 from the origin server.
But in demuxed mode, user B can get video track #1 from the CDN cache (where it was stored when user A requested it), and only needs to retrieve audio track #2 from the origin server. The image below illustrates the video caching.
Due to the above advantages, more services are moving towards demuxed audio and video tracks.
The Player
The logic in the player needs to dynamically determine which audio and which video segments to select.
Adaptive bitrate (ABR) streaming allows adaptation to dynamic network conditions by providing multiple tracks/variants that all represent the same content, but which are encoded at different bitrates and quality levels. Each track is divided into multiple segments (chunks), each containing a few seconds worth of content.
During playback, the client player dynamically selects a segment from the multiple available variants based on existing network conditions.
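As a rough sketch of this selection logic, the hypothetical Java snippet below picks the highest-bitrate variant that fits within a safety margin of the measured throughput; the bitrates and the 0.8 safety factor are illustrative, not taken from any particular player.

```java
import java.util.List;

public class AbrSelector {
    // Pick the highest-bitrate variant that fits within a fraction of the
    // measured throughput; fall back to the lowest variant otherwise.
    // Assumes bitrates are sorted in ascending order.
    static int selectVariant(List<Integer> bitratesBps, double throughputBps) {
        double safety = 0.8;   // headroom for throughput variation (illustrative)
        int chosen = 0;        // lowest variant as the fallback
        for (int i = 0; i < bitratesBps.size(); i++) {
            if (bitratesBps.get(i) <= throughputBps * safety) {
                chosen = i;
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<Integer> videoBitrates = List.of(300_000, 750_000, 1_500_000, 3_000_000);
        // With ~2 Mbps measured throughput, the 1.5 Mbps variant is chosen.
        System.out.println("Selected variant index: "
                + selectVariant(videoBitrates, 2_000_000));
    }
}
```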
On the server, a video track and its corresponding audio can be combined together as a single multiplexed track (in muxed mode), where each segment in the track contains the associated video and audio content.
Alternatively, the video and audio content can be kept separately as demultiplexed tracks (in demuxed mode), as shown in the image below.
ABR Protocols
DASH and HLS are the two predominant ABR streaming protocols. Both of them now support demuxed audio and video tracks.
The DASH protocol defines an Adaptation Set as a set of interchangeable encoded versions of one or several media content components. For demuxed audio and video tracks, a manifest defines one Adaptation Set for the video tracks and another for the audio tracks. The bandwidth requirement of each track is specified in its bandwidth attribute.
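For illustration, an abridged MPD might declare the two Adaptation Sets as follows; the IDs, resolutions, and bitrates are placeholder values.

```xml
<!-- Abridged MPD sketch: one video Adaptation Set, one audio Adaptation Set. -->
<MPD xmlns="urn:mpeg:dash:schema:mpd:2011" type="static">
  <Period>
    <AdaptationSet contentType="video" mimeType="video/mp4">
      <Representation id="video-1" bandwidth="300000"  width="640"  height="360"/>
      <Representation id="video-2" bandwidth="1500000" width="1280" height="720"/>
    </AdaptationSet>
    <AdaptationSet contentType="audio" mimeType="audio/mp4" lang="en">
      <Representation id="audio-1" bandwidth="64000"/>
      <Representation id="audio-2" bandwidth="256000"/>
    </AdaptationSet>
  </Period>
</MPD>
```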
In HLS, a top-level master playlist uses the EXT-X-STREAM-INF tag to specify a set of audio and video track combinations; this can be all the desired combinations or only a subset. The bandwidth requirement is defined in the BANDWIDTH attribute for each combination (rather than for each audio/video track).
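For illustration, a master playlist might pair audio groups with video variants as follows; the URIs, group IDs, and bitrates are placeholders. Note that each BANDWIDTH value covers the combination (video plus audio), and that listing only these two pairings excludes, for example, low video with high audio.

```
#EXTM3U
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aac-lo",NAME="English",LANGUAGE="en",URI="audio_lo.m3u8"
#EXT-X-MEDIA:TYPE=AUDIO,GROUP-ID="aac-hi",NAME="English",LANGUAGE="en",URI="audio_hi.m3u8"
#EXT-X-STREAM-INF:BANDWIDTH=364000,AUDIO="aac-lo"
video_360p.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1756000,AUDIO="aac-hi"
video_720p.m3u8
```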
The Issue
While demuxing can increase your options and reduce CDN costs, limitations in handling demuxed audio can create new problems.
One issue we see is video streams that stall because of a late-arriving audio segment.
Because demuxed streams decouple the video and the audio, losing sync between the two can also become an issue.
Another issue is the common assumption that audio bitrates are always significantly lower than video bitrates, and hence that audio track decisions have little impact on video track selection. Increasingly, however, a high quality audio track can have a much higher bitrate than some of the lower video tracks.
Yet another challenge with demuxing is the increased complexity. This complexity means that audio cannot be an afterthought. Even in relatively simple setups, demuxing requires attention to all the variants. As an example, the table below lists the 18 audio and video combinations (e.g., V1 + A1 for Video Quality 1 plus Audio 1) for one sample stream.
| Video / Audio Combinations | Audio 1 | Audio 2 | Audio 3 |
|----------------------------|---------|---------|---------|
| Video Quality 1            | V1 + A1 | V1 + A2 | V1 + A3 |
| Video Quality 2            | V2 + A1 | V2 + A2 | V2 + A3 |
| Video Quality 3            | V3 + A1 | V3 + A2 | V3 + A3 |
| Video Quality 4            | V4 + A1 | V4 + A2 | V4 + A3 |
| Video Quality 5            | V5 + A1 | V5 + A2 | V5 + A3 |
| Video Quality 6            | V6 + A1 | V6 + A2 | V6 + A3 |
Beyond the issues above, limitations in both the protocols and the players can cause undesirable behavior, such as selecting a combination of very low quality video with very high quality audio.
Protocol Limitations
There are key differences between DASH and HLS specifications in the treatment of demuxed audio and video tracks. These differences have implications for player design and the resulting QoE.
One difference is that DASH does not provide a mechanism to specify the desired subset of combinations, while HLS provides an explicit mechanism to do so. This can make DASH more vulnerable to choosing undesirable combinations.
The player typically lacks detailed domain knowledge of the content being streamed. Therefore when streaming DASH content, the player is forced to either consider all possible combinations of video and audio track variants, or apply its own policy to determine the subset of allowed combinations.
Both approaches have drawbacks. The first option could lead to a player selecting an unsuitable combination for certain types of content. The second option could potentially exclude more desirable combinations.
Another key difference between DASH and HLS is that DASH provides bandwidth requirements of individual audio and video tracks, while HLS provides the total bandwidth requirements of audio and video combinations. This difference should be recognized by the player to avoid undesirable behaviors.
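As a sketch of what this difference means in practice, the hypothetical snippet below derives aggregate combination bandwidths from DASH-style per-track values; with HLS, the same aggregate figure would instead be read directly from the BANDWIDTH attribute of each EXT-X-STREAM-INF entry. The track bitrates are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

public class CombinationBandwidth {
    // DASH declares per-track bandwidths, so the player must sum them to
    // estimate the aggregate requirement of each audio/video pairing.
    static List<int[]> combinationBandwidths(int[] videoBps, int[] audioBps) {
        List<int[]> combos = new ArrayList<>();
        for (int v = 0; v < videoBps.length; v++) {
            for (int a = 0; a < audioBps.length; a++) {
                combos.add(new int[] { v, a, videoBps[v] + audioBps[a] });
            }
        }
        return combos;
    }

    public static void main(String[] args) {
        int[] video = { 300_000, 1_500_000 };   // hypothetical track bitrates
        int[] audio = { 64_000, 256_000 };
        for (int[] c : combinationBandwidths(video, audio)) {
            System.out.printf("V%d + A%d -> %d bps%n", c[0] + 1, c[1] + 1, c[2]);
        }
    }
}
```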
Player Logic Limitations
- Some players do not perform rate adaptation for audio at all, which can lead to poor performance, such as a large amount of rebuffering. It is better to perform rate adaptation for both audio and video.
- Some players select undesirable audio and video combinations, which can lead to poor viewing quality. An example would be combining the lowest quality video and highest quality audio tracks. The problem is more severe for DASH, which, as mentioned above, does not provide a mechanism to specify a subset of desirable audio and video combinations, making the player more vulnerable to selecting bad combinations.
- Some players make rate adaptation decisions for audio and video independently. Considering them jointly is better, but doing so in an overly simplified way is also undesirable: in certain players it leads to frequent audio/video track changes.
- Some players sync the audio and video buffers at a very coarse granularity. This can lead to unbalanced content in the buffers, and since playback requires both video and audio, having far more of one than the other does not help. It is better to sync them at a finer granularity.
- An HLS manifest can specify a subset of audio and video combinations, but some players do not conform to the manifest file. For example, ExoPlayer may select some undesirable combinations that are not in the manifest file.
Best Practice Recommendations
Below are some best practice recommendations for demuxed streaming.
Server-Side Recommendations
Audio and video combinations. In the manifest file, we recommend specifying the allowed combinations of video and audio track variants, i.e., the desired subset of all possible combinations. The player then selects segments only from the allowed combinations.
This practice allows the content owner to specify combinations that are suitable for the specific content, such as a music video versus an action movie, or for specific device configurations, such as screen size or sound system. It also simplifies the rate adaptation task for the player, which helps the player make better choices.
HLS already supports this feature, so it is a good idea to leverage it. Do not specify all possible combinations unless they are all desirable.
DASH at present does not offer a way to list only specific allowed combinations in the manifest. A practical short-term workaround is for the client to get this information from the server out-of-band, e.g., via an auxiliary file downloaded over HTTP/2. In the long term, the DASH specification could be expanded to support this feature.
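To make the workaround concrete, here is a hypothetical sketch in which the client downloads an auxiliary text file listing one allowed "videoTrackId,audioTrackId" pair per line and filters candidate pairings against it. The URL and the file format are invented for illustration; DASH defines no such mechanism today.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.HashSet;
import java.util.Set;

public class AllowedCombinations {
    // Fetch the hypothetical out-of-band file of allowed pairings.
    static Set<String> fetchAllowed(String url) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(URI.create(url)).build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        Set<String> allowed = new HashSet<>();
        for (String line : response.body().split("\n")) {
            if (!line.isBlank()) allowed.add(line.trim());
        }
        return allowed;
    }

    // During adaptation, keep a candidate pairing only if it is listed.
    static boolean isAllowed(Set<String> allowed, String videoId, String audioId) {
        return allowed.contains(videoId + "," + audioId);
    }
}
```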
Audio and video bandwidth declaration. In the manifest file, it is good to specify sufficient information about the aggregate bandwidth requirement of each audio and video combination, as well as the bandwidth requirements of the individual audio and video tracks.
This is particularly important when audio and video are fetched over different network paths that have different network characteristics (e.g., when they are stored at different servers).
Currently, DASH specifies the bandwidth of each audio and video track, so the aggregate bandwidth requirement of an audio and video combination (if combinations are provided following the earlier suggestion) can be calculated from the individual tracks.
HLS uses two levels of manifest files: a top-level master playlist that specifies the aggregate bitrate for each combination of audio and video, and second-level media playlists, each providing detailed information about the individual audio/video segments inside a track.
When the packaging is in Fragmented MP4 format, the media playlists specify the EXT-X-BYTERANGE information, which can be used to obtain the audio/video bitrate (the byte range specifies the start and end byte positions used to fetch the content, but it also yields the segment size, and hence the bitrate when combined with the segment duration specified in the media playlist).
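As a concrete sketch of this calculation (the playlist values are hypothetical): a 6-second segment occupying a 192,000-byte range works out to 256 kbps.

```java
public class SegmentBitrate {
    // EXT-X-BYTERANGE gives the segment's length in bytes; together with the
    // #EXTINF duration, this yields the segment's effective bitrate.
    static double bitrateBps(long byteRangeLength, double segmentDurationSec) {
        return (byteRangeLength * 8) / segmentDurationSec;
    }

    public static void main(String[] args) {
        // Hypothetical media-playlist entry:
        //   #EXTINF:6.0,
        //   #EXT-X-BYTERANGE:192000@0
        System.out.printf("%.0f bps%n", bitrateBps(192_000, 6.0)); // 256000 bps
    }
}
```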
When the packaging is in MPEG-2 TS format, no byte range information is provided.
On the other hand, HLS provides an optional EXT-X-BITRATE tag that can be used to obtain per-segment bitrate information; this option should be made mandatory. In addition, for both packaging formats, since the information lives in per-track manifest files, we suggest that the player download the per-track manifests and read the information at the very beginning, before any adaptation decision is made.
Client-Side Recommendations
Adopt audio adaptation. For dynamic network bandwidth scenarios, it is important to perform audio adaptation.
High quality audio tracks can have a higher bitrate than low quality video tracks, so audio adaptation is just as important as video adaptation in avoiding adverse impact on QoE.
Joint adaptation of audio and video. We recommend that the selection of the audio variants, the selection of the video variants (tracks), and their positioning in playback order be considered jointly.
In addition, only consider the candidates from the audio and video combinations specified in the manifest file, if provided.
The combinations listed in the manifest file indicate the desired audio and video pairings for the content, so it is important for the client to select only from those combinations. For both video and audio adaptation, the player must balance the conflicting goals of maximizing quality, minimizing stalls, and minimizing quality variation.
Since the selection of audio and video is inherently coupled, we recommend making rate adaptation decisions over the specified combinations of audio and video (i.e., those in the manifest file, if provided) rather than over audio and video individually, so that good combinations of tracks are chosen. Rate adaptation should also be done carefully to avoid frequent changes in either the audio or the video track.
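A minimal sketch of such joint selection, assuming the allowed combinations are known (e.g., from the manifest) and sorted ascending by total bitrate, might look like the following; the 0.8 safety margin and 15% switching threshold are illustrative choices for damping track changes.

```java
import java.util.List;

public class JointAdaptation {
    // An allowed audio/video pairing, e.g., taken from the manifest.
    record Combo(int videoBps, int audioBps) {
        int totalBps() { return videoBps + audioBps; }
    }

    static Combo select(List<Combo> allowed, double throughputBps, Combo current) {
        Combo best = allowed.get(0);   // lowest combination as the fallback
        for (Combo c : allowed) {
            if (c.totalBps() <= throughputBps * 0.8 && c.totalBps() > best.totalBps()) {
                best = c;              // highest combination that still fits
            }
        }
        // Hysteresis: if the change is small, keep the current tracks to
        // avoid frequent audio/video switches.
        if (current != null
                && Math.abs(best.totalBps() - current.totalBps()) < 0.15 * current.totalBps()) {
            return current;
        }
        return best;
    }
}
```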
Maintain a balance between audio and video prefetching. It is recommended to keep the audio and video buffer levels (in seconds) balanced.
This is because an empty audio buffer or an empty video buffer leads to a stall. It makes no sense to have many audio segments ready if the next video segment is not available, and vice versa.
The balance can be achieved by syncing the duration of prefetched audio and video content at a fine granularity, for example, at the segment level or within a small number of segments.
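As a simple sketch, a download scheduler might decide which segment type to fetch next from the two buffer durations; the one-segment tolerance here is an illustrative choice, not a prescribed value.

```java
public class BufferBalancer {
    enum Next { AUDIO, VIDEO }

    // Fetch whichever buffer is behind: a stall occurs as soon as either
    // buffer empties, so surplus on one side does not help.
    static Next nextFetch(double videoBufferSec, double audioBufferSec,
                          double segmentDurationSec) {
        if (audioBufferSec + segmentDurationSec < videoBufferSec) return Next.AUDIO;
        if (videoBufferSec + segmentDurationSec < audioBufferSec) return Next.VIDEO;
        // Balanced within one segment: top up the lower of the two.
        return videoBufferSec <= audioBufferSec ? Next.VIDEO : Next.AUDIO;
    }
}
```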
Video Optimizer
If your stream is demuxed, with separate audio and video segments, Video Optimizer breaks out the segments on the Video Tab.
Below is a screenshot from the Video tab, showing separate video and audio in their own tables. You can control the view through checkboxes, displaying just video, just audio, or both video and audio side by side, as seen below.
This makes it quicker to detect problem behavior on demuxed streams, such as late-arriving audio segments that cause stalls.
Learn More
You can get a lot more information about video streaming and other mobile development practices on our Mobile Development Best Practices web site.