Deferred Download Design

This section discusses the details of the the deferred downloading services and how they integrate with Pulp’s core services.

Component Overview

Three additional services were necessary in order to support deferred downloading. In addition to these new services, several changes were made to the platform.

Pulp Content WSGI Application

The Pulp Content WSGI application was created because Apache does not provide a good way to redirect the client appropriately when the content is missing. When a broken link is encountered by Apache, it returns a HTTP 403 Forbidden, and while it is possible to add a custom ErrorDocument directive for HTTP 403, Pulp uses HTTP 403 in other cases. The WSGI application determines whether or not the client has access, and it determines whether the requested file is part of a repository or not. This change is a part of the platform.

Apache Reverse Proxy

Although it is possible to use the same instance of Apache for both Pulp and the deferred download services, it is not required. Therefore, it is best to think of this as a new service. The Apache reverse proxy serves as a way to terminate TLS connections with clients. This is required because Squid cannot cache encrypted content. In addition to terminating the TLS connection, the Apache reverse proxy also makes use of a WSGI access script, which validates that the URL has been cryptographically signed by Pulp and has not expired.

Squid Cache

Technically, Squid is not required for deferred downloading to work, but it provides two very important performance enhancements that are required for any production deployment.

The first is that Squid de-duplicates client requests. This is helpful since it is not unlikely that many clients request the same content at the same time. Rather than downloading the content many times concurrently, Squid makes a single request on behalf of the clients.

The second is that Squid caches the files it downloads, so that the next time a client requests them, Squid services the request from its cache. This is important because Pulp downloads all cached content from Squid at regular intervals and it would be wasteful to download the content from the Internet twice. It is also important because until Pulp downloads the content from the cache, Squid will receive all requests for the content from clients.

Pulp Streamer

The Pulp Streamer is the component that interfaces with Pulp to determine where the requested content can be found. It does this by consulting a collection in the database, which maps files to URLs, download settings, and, if necessary, entitlement certificates. Rather than downloading the entire file and returning it to Squid, it returns the downloaded bits to Squid as it receives them. This ensures the client does not time out the connection, since Squid will in turn stream the content back to Apache, which will stream back to the client.

Sequence Diagram

The following sequence diagram provides a more detailed view of how the services interact than the component diagram above:

  1. The client requests a content unit from Pulp, just as it would without lazy content loading. This includes potentially offering a client certificate for content entitlement.
  2. The requested URL is re-written using Apache’s mod_rewrite to a location in /pulp/content/. Each Pulp plugin provides one or more re-write rules as part of its configuration.
  3. A WSGI application is configured to handle all requests to /pulp/content/. When a request is made, the application confirms the client is entitled to the content. If the client is not entitled, the WSGI application returns a HTTP 403 Forbidden to the client. If the client is entitled, the WSGI application checks to see if the file exists in the repository by looking for the symbolic link in the published repository. If it is not present, a HTTP 404 is returned. If it is present, the link is followed. If the location the link points to exists, the WSGI application uses mod_xsendfile so that Apache serves the file. If the location the link points to does not exist, the WSGI application creates a signed URL and returns that.
  4. Apache serves the appropriate response to the client. The following steps only occur if the HTTP 302 Redirect is returned.
  5. The client follows the redirect URL. This request is handled by the Apache reverse proxy. It is first passed through a WSGIAccessScript in order to validate the URL signature. Once validated, Apache uses mod_rewrite to strip the signature from the URL. This occurs so that the URL does not differ for the same file so that Squid can retrieve the cached content for subsequent requests.
  6. Apache acts as a reverse proxy and makes a plaintext request to Squid on behalf of the client.
  7. Squid performs a cache lookup on the requested content. If it is available, it is served to Apache, which in turn serves it to the client. If the cache lookup results in a miss, Squid makes a request to the Pulp Streamer.
  8. The Pulp Streamer determines the correct upstream URL and entitlement certificates by looking the requested content up in the content catalog in MongoDB.
  9. The Pulp Streamer makes a request to the upstream repository, using any entitlement certificates necessary.
  10. The Pulp Streamer forwards the results of the request to Squid as it receives them.
  11. Squid streams content received from the Pulp Streamer to Apache and caches the content for future requests.
  12. The Apache reverse proxy streams the content back to the client.
  13. The Pulp Streamer adds an entry to the DeferredDownload database collection to indicate to the Pulp server that a content unit has been cached and is ready for retrieval for permanent storage.
  14. At regular intervals a task is dispatched by Pulp to download all content specified in the DeferredDownload collection. This task retrieves the entries made by the Pulp Streamer.
  15. For each deferred download entry, the task determines all the files in the content unit associated with the file that triggered the deferred download entry and requests them from the Apache reverse proxy using a URL it signs for itself.
  16. Apache forwards the request to Squid.
  17. Squid returns the files it has cached, or retrieves them from the streamer if they are not found in the cache.
  18. Apache returns the requested content to the task, which is saved so that Pulp itself can return the content using mod_xsendfile to clients that request it in the future. The task marks the content unit as downloaded when all its files are saved locally.

Known Flaws

This design, like all the other proposed designs, has a few known efficiency problems. There are several cases, outlined below, where content is downloaded multiple times by Pulp from the Squid proxy. Although this does not access an external network, it is still considered undesirable since it consumes disk I/O unnecessarily.

Multiple Downloads

A content unit could be downloaded multiple times if a client requests a file in that unit and then a download_repo task for a repository that contains that unit and the celerybeat deferred_downloads task run at the same time, and they happen to process the that content unit at the same time.

A content unit could be downloaded multiple times if the deferred_downloads task is set to run often enough that a new task is dispatched before the old one is finished. If those tasks select the same units at the same time, they could download the same content twice. This is a fairly narrow window as each task should be reading and then removing the document from MongoDB, but it is by no means impossible.

A content unit could be downloaded multiple times if a client is actively requesting content from a multi-file ContentUnit. This occurs if the deferred_downloads task removes an entry to process, and then the client asks for a new file (that isn’t cached in Squid). The Streamer will be able to add another entry for that ContentUnit there is no longer an entry for that (unit_id, unit_type_id).

Mitigation: Have both download_repo and deferred_downloads regularly check the ContentUnit.downloaded flag on the units it is processing. This way it can detect if another task has already downloaded the unit and quit.

Lost Downloads

Since the deferred_downloads task removes entries from the collection, it is possible for a lazy=passive download to be lost by Pulp if the worker is killed before it finishes the download, but after it has removed the database record(s).

Mitigation: Have the deferred_downloads task remove relatively few entries at a time. This is a matter of balancing the performance of parallelizing downloads versus losing entries and having to wait for the Squid cache to expire and cause the Streamer to add the entry back to the deferred_downloads collection. A user can also dispatch a download_repo task if they want these lost units to be downloaded by Pulp.