AT&T Video Optimizer

Duplicate Content

Introduction

The World Wide Web is an object-oriented environment in which content is requested by and delivered to web clients one piece at a time, as objects called resources. These pieces of content, or resources, are produced by web servers.

Each resource is uniquely identified by its fully qualified name, its Uniform Resource Identifier (URI).

Some examples of resources are:

A web page
An image
A script file

Downloading an identical piece of content each time it is requested creates duplicate content, which can slow an application's response and place needless load on the network.

This Best Practice Deep Dive looks at how content becomes duplicated, examines how duplication affects an application, and offers recommendations for developing a caching strategy to reduce duplicate content in an application.

Background

On the web, clients and servers perform every interaction by passing HTTP Messages between them.

To download a resource from a server, a client initiates a client/server transaction by sending the server an HTTP Request Message.

The header of an HTTP Request Message contains the details of the request: the method name (GET, when downloading a resource), the resource's URI, and any optional processing directives. In response, the server packages the requested resource inside an Entity-body and sends it to the client as the payload of an HTTP Response Message. This round-trip exchange constitutes one Request/Response transaction. A series of these transactions forms a dialog between a client and a server.
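To make the shape of one Request/Response transaction concrete, here is a minimal sketch that serializes a GET Request Message and splits a raw Response Message into its header and Entity-body. The function names, host, and path are hypothetical and chosen only for illustration; a real client would normally rely on an HTTP library rather than handling the bytes directly.

```python
from __future__ import annotations


def build_get_request(host: str, path: str,
                      directives: dict[str, str] | None = None) -> bytes:
    """Serialize a minimal HTTP/1.1 GET Request Message (hypothetical helper)."""
    lines = [f"GET {path} HTTP/1.1", f"Host: {host}"]
    # Optional processing directives travel as additional header fields.
    for name, value in (directives or {}).items():
        lines.append(f"{name}: {value}")
    # A blank line ends the header; this GET request carries no body.
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")


def split_response(raw: bytes) -> tuple[str, dict[str, str], bytes]:
    """Split a raw Response Message into status line, header fields, and Entity-body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *field_lines = head.decode("iso-8859-1").split("\r\n")
    fields: dict[str, str] = {}
    for line in field_lines:
        name, _, value = line.partition(":")
        # Field names are case-insensitive, so normalize them to lowercase.
        fields[name.strip().lower()] = value.strip()
    return status_line, fields, body
```

The Entity-body returned by `split_response` is the payload that the rest of this discussion treats as the unit of duplication.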

The entire conversation usually lasts several minutes and is referred to as a Session.

The shipping industry uses a similar model. Consider:

Merchandise is packaged into cardboard boxes.
The boxes are warehoused, and order requests are filled.
The boxes are packed into containers and shipped to waiting customers.

How does content become duplicated?

Web clients don't actually receive raw content from web servers. The content they receive is processed content. Processed content is refined to a form that is optimized for packet transmission by being converted to a collection of manageable chunks called resources. Each resource is represented as a file, and can be referred to by the following terms:

Resource Terms

Once the client has established a connection with a server, it can request specific resources by URI. This form also lends itself to efficient local processing by the client, which HTTP/1.1 was designed to support in order to significantly improve web application performance.

If you think of content in terms of the packages (Entities) it is transported in, duplicate content results whenever the client requests a resource that it has already requested and received.

The Issue

Each time a client requests content from a server, the server returns it as an Entity (the payload of an HTTP Response Message).

Imagine saving Response Entities as they arrive from the server. By the end of the session, you would have a collection of Response Entities. If you checked this collection for duplicates (that is, for Entities that contain exactly the same content), you would probably find some, perhaps many.
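The thought experiment above can be sketched in a few lines. This is only an illustration, assuming the session has been recorded as a list of `(uri, body)` pairs in arrival order; the function names are hypothetical, and comparing SHA-256 digests of the bodies is one reasonable way to detect identical content, not necessarily how any particular tool does it.

```python
import hashlib
from collections import defaultdict


def find_duplicate_entities(entities):
    """Group recorded Response Entities whose bodies are byte-for-byte identical.

    `entities` is a list of (uri, body_bytes) pairs in arrival order.
    Returns a list of groups; each group lists (index, uri) for one piece
    of content that arrived more than once.
    """
    by_digest = defaultdict(list)
    for index, (uri, body) in enumerate(entities):
        digest = hashlib.sha256(body).hexdigest()
        by_digest[digest].append((index, uri))
    return [group for group in by_digest.values() if len(group) > 1]


def wasted_bytes(entities):
    """Total size of every Entity body that repeats content already received."""
    seen = set()
    wasted = 0
    for _uri, body in entities:
        digest = hashlib.sha256(body).digest()
        if digest in seen:
            wasted += len(body)  # this download added nothing new
        else:
            seen.add(digest)
    return wasted
```

Note that grouping by content rather than by URI also catches the case where the same bytes are served under two different URIs.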

Since every content request results in a Full Response from the server, eliminating the Full Responses that carry duplicate content also eliminates their associated overhead, which includes:

The time it takes for the Request Message to reach the server, for the server to fill the request, and for the Response Message to return to the client.
The wireless bandwidth consumed.
The battery drain from using radio resources.

Eliminating duplicate content provides benefits to an individual user, such as:

Improved application responsiveness.
Longer lasting battery charge.
Reduced impact on the user's data plan.

It also provides benefits to all users by improving throughput due to the cumulative effects of the reduction in bandwidth usage, and the reduction in processing overhead on the web server.

Best Practice Recommendation

Duplicate content is a problem produced by the client application. It results from the way the client processes server response messages, and it can be solved by building a caching strategy into your client application design. The HTTP/1.1 protocol supports Response Entity caching mechanisms, and you can put them to use in your caching strategy.

The cache is a process that runs locally, as a service. It behaves transparently and operates as an intermediary between the client and server processes. Because the cache serves locally stored copies of Response Entities, you can refer to it as a Response Entity Cache. There are several ways to implement caching: libraries are available, and some operating systems provide functions for it, but the most direct approach is to build Response Entity Cache functionality into your client software code.

This is done by implementing a class that wraps a collection of Response Entities. This class should encapsulate a searchable container of Response Entity objects and include methods for:

Transparently intercepting Inbound Response Messages.

Determining if a Response Entity is Cacheable.
Adding Cacheable Response Entities to the collection.

Transparently intercepting Outbound Request Messages.

Checking to see if the associated Response Entity is a cached item.
If not (a cache miss), relaying the Request Message to the Origin Server.
If yes (a cache hit), determining whether it is appropriate to serve the cached response.
If it is, serving the cached response.
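The class outlined above might look something like the following minimal sketch. All names here (`ResponseEntityCache`, `CachedEntity`, `parse_max_age`, and the `fetch` callable) are hypothetical. The sketch assumes that requests are routed through `get()`, that `fetch(uri)` returns a `(headers, body)` pair with lowercase header names, and that an Entity is cacheable only when the response carries a `Cache-Control` header with a `max-age` directive and no `no-store` directive. A real HTTP cache applies considerably more rules than this.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple

Fetch = Callable[[str], Tuple[Dict[str, str], bytes]]


@dataclass
class CachedEntity:
    body: bytes
    stored_at: float  # clock reading when the Entity was stored
    max_age: float    # freshness lifetime in seconds


def parse_max_age(cache_control: str) -> Optional[float]:
    """Return the max-age in seconds, or None if the Entity is not cacheable."""
    directives = [d.strip().lower() for d in cache_control.split(",")]
    if "no-store" in directives:
        return None
    for directive in directives:
        if directive.startswith("max-age="):
            try:
                return float(directive[len("max-age="):])
            except ValueError:
                return None
    return None  # assumption: no explicit lifetime means "do not cache"


class ResponseEntityCache:
    def __init__(self, fetch: Fetch, clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch
        self._clock = clock
        self._entities: Dict[str, CachedEntity] = {}
        self.hits = 0
        self.misses = 0

    def get(self, uri: str) -> bytes:
        entry = self._entities.get(uri)
        # Cache hit: serve the stored copy only while it is still fresh.
        if entry is not None and self._clock() - entry.stored_at < entry.max_age:
            self.hits += 1
            return entry.body
        # Cache miss (or stale entry): relay the request to the Origin Server.
        self.misses += 1
        headers, body = self._fetch(uri)
        max_age = parse_max_age(headers.get("cache-control", ""))
        if max_age is not None and max_age > 0:
            self._entities[uri] = CachedEntity(body, self._clock(), max_age)
        else:
            self._entities.pop(uri, None)  # drop any stale copy
        return body
```

Passing the clock and the fetch function in as parameters keeps the cache independent of any particular network library and makes its hit and miss behavior easy to exercise without a live server.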

Why is Caching Important?

Caching is important to your application for the following reasons:

Responsiveness. Cached files are available immediately, with no download latency. This makes your application appear faster.
Battery life. Every data connection drains the device's battery. If the battery drain appears excessive, the user may uninstall the application.
Data caps. Users have monthly data caps, and resending files over and over can result in additional costs to your customers if they exceed their monthly data allowance and incur overage charges.
Citizenship. Wireless networks have limited capacity. If you clog the network with extra data, it hurts the responsiveness of all applications (as well as phone calls).

Can one extra 4 KB image be that bad?

Consider snow. One snowflake isn't much, but an avalanche is. As your application gains exposure, the number of downloads grows exponentially, like an avalanche.

If your application requests a 4 KB image twice per session with just 5,000 users, you're sending about 19 MB of "extra" data to your users. The radio power usage on these 5,000 additional downloads could be the equivalent of draining 35% of a smartphone battery.
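The 19 MB figure can be reproduced with a short calculation. This sketch assumes one session per user, treats the second download in each session as the redundant one, and counts 1 MB as 1,024 KB; the function name and parameters are hypothetical.

```python
def redundant_megabytes(object_kb, extra_downloads_per_session, users,
                        sessions_per_user=1):
    """Extra data transferred because an object was downloaded more than once."""
    total_kb = object_kb * extra_downloads_per_session * users * sessions_per_user
    return total_kb / 1024  # assumption: 1 MB = 1,024 KB


# A 4 KB image fetched twice per session means one redundant download per session.
print(redundant_megabytes(object_kb=4, extra_downloads_per_session=1, users=5000))
# prints 19.53125
```

Because the total scales linearly with each factor, the same image fetched redundantly across ten sessions per user would cost roughly 195 MB.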