pulpcore.plugin.download

The module implements downloaders that solve many of the common problems plugin writers have while downloading remote data. A high-level list of features provided by these downloaders includes:

  • auto-configuration from remote settings (auth, ssl, proxy)
  • synchronous or parallel downloading
  • digest and size validation computed during download
  • grouping downloads together to return to the user when all files are downloaded
  • customizable download behaviors via subclassing

All classes documented here should be imported directly from the pulpcore.plugin.download namespace.

Basic Downloading

The most basic downloading from a url can be done like this:

>>> downloader = HttpDownloader('http://example.com/')
>>> result = downloader.fetch()

The example above downloads the data synchronously. The fetch() call blocks until the data is downloaded and the DownloadResult is returned or a fatal exception is raised.

Parallel Downloading

Any downloader in the pulpcore.plugin.download package can be run in parallel with the asyncio event loop. Each downloader has a run() method which returns a coroutine object that asyncio can schedule in parallel. Consider this example:

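>>> import asyncio
>>> from pulpcore.plugin.download import HttpDownloader
>>>
>>> download_coroutines = [
>>>     HttpDownloader('http://example.com/').run(),
>>>     HttpDownloader('http://pulpproject.org/').run(),
>>> ]
>>>
>>> loop = asyncio.get_event_loop()
>>> # asyncio.wait() takes the flat list of awaitables, not a nested list
>>> done, not_done = loop.run_until_complete(asyncio.wait(download_coroutines))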
>>>
>>> for task in done:
>>>     try:
>>>         task.result()  # This is a DownloadResult
>>>     except Exception as error:
>>>         pass  # fatal exceptions are raised by result()

Download Results

The download result contains all the information about a completed download and is returned from the downloader’s run() method when the download is complete.

class pulpcore.plugin.download.DownloadResult(url, artifact_attributes, path, headers)
Parameters:
  • url (str) – The url corresponding with the download.
  • artifact_attributes (dict) – Contains keys corresponding with Artifact fields. This includes the computed digest values along with size information.
  • path (str) – The absolute path to the saved file.
  • headers (aiohttp.multidict.MultiDict) – HTTP response headers. The keys are header names. The values are header content. None when not using the HttpDownloader or a subclass.
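
As a rough illustration of how these four fields might be read, the sketch below uses a stand-in namedtuple with the same field names. It is not the pulpcore class, and the path, size, and digest values are invented for the example.

```python
from collections import namedtuple

# Stand-in with the same field names as the documented DownloadResult.
# This is NOT the pulpcore class, only an illustration of its shape.
DownloadResult = namedtuple(
    "DownloadResult", ["url", "artifact_attributes", "path", "headers"]
)

# Hypothetical values, invented for this example.
result = DownloadResult(
    url="http://example.com/",
    artifact_attributes={"size": 5, "md5": "5d41402abc4b2a76b9719d911017c592"},
    path="/tmp/example-download",
    headers=None,  # None when the downloader is not HTTP based
)

# Separate the size from the digest entries.
size = result.artifact_attributes["size"]
digests = {k: v for k, v in result.artifact_attributes.items() if k != "size"}
print(result.url, result.path, size, sorted(digests))
```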

Configuring from a Remote

When fetching content during a sync, the remote has settings like SSL certs, SSL validation, basic auth credentials, and proxy settings. Downloaders commonly want to use these settings while downloading. The Remote’s settings can automatically configure a downloader either to download a url or a pulpcore.plugin.models.RemoteArtifact using the get_downloader() call. Here is an example download from a URL:

>>> downloader = my_remote.get_downloader(url='http://example.com')
>>> downloader.fetch()  # This downloader is configured with the remote's settings

Here is an example of a download configured from a RemoteArtifact, which also configures the downloader with digest and size validation:

>>> remote_artifact = RemoteArtifact.objects.get(...)
>>> downloader = my_remote.get_downloader(remote_artifact=remote_artifact)
>>> downloader.fetch()  # This downloader has the remote's settings plus digest and size validation

The get_downloader() method internally calls the DownloaderFactory, so it expects a url that the DownloaderFactory can build a downloader for. See the DownloaderFactory for more information on supported urls.

Tip

The get_downloader() method accepts kwargs that enable size or digest based validation, or that specify a file-like object for the data to be written into. See get_downloader() for more information.

Note

All HttpDownloader downloaders produced by the same remote instance share an aiohttp session, which provides a connection pool, connection reuse, and keep-alives across those downloaders.

Automatic Retry

The HttpDownloader will automatically retry 10 times if the server responds with one of the following error codes:

  • 429 - Too Many Requests
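
The retry behavior can be pictured with the sketch below. It is a simplified illustration, not the HttpDownloader implementation: TooManyRequests, retry_on_429(), flaky_download(), the reading of "10 times" as ten total attempts, and the delay values are all assumptions made for the example.

```python
import asyncio


class TooManyRequests(Exception):
    """Hypothetical stand-in for an HTTP 429 error."""


async def retry_on_429(make_attempt, max_tries=10, base_delay=0.001):
    """Await make_attempt() up to max_tries times, backing off exponentially."""
    for attempt in range(max_tries):
        try:
            return await make_attempt()
        except TooManyRequests:
            if attempt == max_tries - 1:
                raise  # out of tries: let the final exception propagate
            # Delay doubles after every failed attempt.
            await asyncio.sleep(base_delay * 2 ** attempt)


calls = []


async def flaky_download():
    # Fails twice with a simulated 429, then succeeds.
    calls.append(1)
    if len(calls) < 3:
        raise TooManyRequests()
    return "payload"


outcome = asyncio.run(retry_on_429(flaky_download))
print(outcome, len(calls))
```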

Exception Handling

Unrecoverable errors of several types can be raised during downloading. One example is a validation exception that is raised if the downloaded content fails size or digest validation. There can also be protocol-specific errors, such as an aiohttp.ClientResponseError being raised when a server responds with a 400+ response such as an HTTP 403.

Any exception raised is a fatal exception for that download and should likely be recorded with the append_non_fatal_error() interface. A fatal exception on a single download likely should not cause an entire sync to fail, so a downloader’s fatal exception is recorded as a non-fatal exception on the task. Plugin writers can also choose to halt the entire task by leaving the exception uncaught, which marks the entire task as failed.
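
The pattern of treating one download's fatal exception as non-fatal for the overall task could look roughly like the sketch below. The download() coroutine and the non_fatal_errors list are hypothetical stand-ins; the sketch does not call append_non_fatal_error(), whose signature is not shown here.

```python
import asyncio


async def download(url):
    # Hypothetical stand-in for a downloader's run() coroutine.
    if "bad" in url:
        raise ValueError(f"digest mismatch for {url}")
    return f"result for {url}"


async def sync_all(urls):
    non_fatal_errors = []  # stand-in for the task's non-fatal error record
    results = []
    # return_exceptions=True hands back exceptions instead of raising them,
    # so one failed download does not abort the others.
    outcomes = await asyncio.gather(
        *(download(url) for url in urls), return_exceptions=True
    )
    for url, outcome in zip(urls, outcomes):
        if isinstance(outcome, Exception):
            non_fatal_errors.append((url, str(outcome)))
        else:
            results.append(outcome)
    return results, non_fatal_errors


urls = ["http://example.com/a", "http://example.com/bad", "http://example.com/c"]
results, non_fatal_errors = asyncio.run(sync_all(urls))
print(results, non_fatal_errors)
```

Re-raising the exception instead of appending it would correspond to the second option described above, where the whole task is marked as failed.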

Note

The HttpDownloader automatically retries in some cases, but if unsuccessful it will raise an exception for any HTTP response code that is 400 or greater.

Custom Download Behavior

Custom download behavior is provided by subclassing a downloader and providing a new run() method. For example, you could catch a specific error code like a 404 and try another mirror if your downloader knew of several mirrors.
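
A mirror fallback of this kind might be sketched as follows. This is a minimal illustration under stated assumptions: FakeDownloader, NotFound, MirrorDownloader, and the mirror urls are hypothetical stand-ins so the example runs without pulpcore. A real subclass would extend HttpDownloader and catch the error aiohttp raises for a 404.

```python
import asyncio


class NotFound(Exception):
    """Hypothetical stand-in for an HTTP 404 error."""


class FakeDownloader:
    """Hypothetical stand-in for a downloader with a run() coroutine."""

    # Only this url "exists" in the simulated world.
    available = {"http://mirror-b.example.com/file"}

    def __init__(self, url):
        self.url = url

    async def run(self):
        if self.url not in self.available:
            raise NotFound(self.url)
        return f"downloaded {self.url}"


class MirrorDownloader(FakeDownloader):
    """Overrides run() to fall through a list of mirrors on a 404."""

    def __init__(self, urls):
        super().__init__(urls[0])
        self.urls = list(urls)

    async def run(self):
        last_error = None
        for url in self.urls:
            self.url = url
            try:
                return await super().run()
            except NotFound as error:
                last_error = error  # remember the failure, try the next mirror
        raise last_error  # every mirror failed


mirrors = ["http://mirror-a.example.com/file", "http://mirror-b.example.com/file"]
outcome = asyncio.run(MirrorDownloader(mirrors).run())
print(outcome)
```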

A custom downloader can be given as the downloader to use for a given protocol using the downloader_overrides on the DownloaderFactory. Additionally, you can implement the get_downloader() method to specify the downloader_overrides to the DownloaderFactory.
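
The idea behind downloader_overrides, a per-scheme mapping that replaces the default downloader class, can be sketched in isolation. SketchFactory, HttpStandIn, FileStandIn, and MyCustomDownloader below are hypothetical names for illustration; the real DownloaderFactory also takes a remote and configures the downloader from it, which this sketch leaves out.

```python
from urllib.parse import urlparse


class HttpStandIn:
    """Hypothetical stand-in for the default http/https downloader."""

    def __init__(self, url, **kwargs):
        self.url = url


class FileStandIn:
    """Hypothetical stand-in for the default file downloader."""

    def __init__(self, url, **kwargs):
        self.url = url


class MyCustomDownloader(HttpStandIn):
    """A custom subclass to be used for one scheme."""


class SketchFactory:
    """Simplified scheme-to-class dispatch; not the pulpcore DownloaderFactory."""

    defaults = {"http": HttpStandIn, "https": HttpStandIn, "file": FileStandIn}

    def __init__(self, downloader_overrides=None):
        # Overrides replace the default class for the schemes they name.
        self._classes = dict(self.defaults)
        self._classes.update(downloader_overrides or {})

    def build(self, url, **kwargs):
        scheme = urlparse(url).scheme
        if scheme not in self._classes:
            raise ValueError(f"unsupported scheme: {scheme!r}")
        return self._classes[scheme](url, **kwargs)


factory = SketchFactory(downloader_overrides={"https": MyCustomDownloader})
https_downloader = factory.build("https://example.com/")
http_downloader = factory.build("http://example.com/")
print(type(https_downloader).__name__, type(http_downloader).__name__)
```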

Adding New Protocol Support

To create a new protocol downloader, implement a subclass of BaseDownloader. See the docs on BaseDownloader for more information on the requirements.

Download Factory

The DownloaderFactory constructs and configures a downloader for any given url. Specifically:

  1. Select the appropriate downloader from these supported schemes: http, https or file.
  2. Auto-configure the selected downloader with settings from a remote (auth, ssl, proxy).

The build() method constructs one downloader for any given url.

Note

Any HttpDownloader objects produced by an instantiated DownloaderFactory share an aiohttp session, which provides a connection pool, connection reuse, and keep-alives across all downloaders produced by that factory.

Tip

The build() method accepts kwargs that enable size or digest based validation or the specification of a file-like object for the data to be written into. See build() for more information.

class pulpcore.plugin.download.DownloaderFactory(remote, downloader_overrides=None)

A factory for creating downloader objects that are configured with remote settings.

The DownloaderFactory correctly handles SSL settings, basic auth settings, proxy settings, and connection limit settings.

It supports handling urls with the http, https, and file protocols. The downloader_overrides option allows the caller to specify the download class to be used for any given protocol. This allows the user to specify custom, subclassed downloaders to be built by the factory.

Usage:
>>> the_factory = DownloaderFactory(remote)
>>> downloader = the_factory.build(url_a)
>>> result = downloader.fetch()  # 'result' is a DownloadResult

For http and https urls, in addition to the remote settings, non-default timing values are used. Specifically, the “total” timeout is set to None and the “sock_connect” and “sock_read” timeouts are both 5 minutes. For more info on these settings, see the aiohttp docs: http://aiohttp.readthedocs.io/en/stable/client_quickstart.html#timeouts

Behaviorally, this should allow an active download to be arbitrarily long while still detecting dead or closed sessions, even when TCPKeepAlive is disabled.

Also for http and https urls, even though HTTP 1.1 is used, the TCP connection is set up and closed with each request. This is done for compatibility reasons due to various issues related to session continuation implementation in various servers.

Parameters:
  • remote (Remote) – The remote used to populate downloader settings.
  • downloader_overrides (dict) – Keyed on a scheme name, e.g. 'https' or 'ftp', and the value is the downloader class to be used for that scheme, e.g. {'https': MyCustomDownloader}. These override the default values.
build(url, **kwargs)

Build a downloader which can optionally verify integrity using either digest or size.

The built downloader also provides concurrency restriction if specified by the remote.

Parameters:
  • url (str) – The download URL.
  • kwargs (dict) – All kwargs are passed along to the downloader. At a minimum, these include the BaseDownloader parameters.
Returns:

A downloader that is configured with the remote settings.

Return type:

subclass of BaseDownloader

HttpDownloader

This downloader is an asyncio-aware parallel downloader which is the default downloader produced by the Download Factory for urls starting with http:// or https://. It also supports synchronous downloading using fetch().

class pulpcore.plugin.download.HttpDownloader(url, session=None, auth=None, proxy=None, proxy_auth=None, headers_ready_callback=None, **kwargs)

An HTTP/HTTPS Downloader built on aiohttp.

This downloader downloads data from one url and is not reused.

The downloader optionally takes a session argument, which is an aiohttp.ClientSession. This allows many downloaders to share one aiohttp.ClientSession which provides a connection pool, connection reuse, and keep-alives across multiple downloaders. When creating many downloaders, have one session shared by all of your HttpDownloader objects.

A session is optional; if omitted, one session will be created, used for this downloader, and then closed when the download is complete. A session that is passed in will not be closed when the download is complete.

If a session is not provided, the one created by HttpDownloader uses non-default timing values. Specifically, the “total” timeout is set to None and the “sock_connect” and “sock_read” timeouts are both 5 minutes. For more info on these settings, see the aiohttp docs: http://aiohttp.readthedocs.io/en/stable/client_quickstart.html#timeouts

Behaviorally, this should allow an active download to be arbitrarily long while still detecting dead or closed sessions, even when TCPKeepAlive is disabled.

If a session is not provided, the one created will force TCP connection closure after each request. This is done for compatibility reasons due to various issues related to session continuation implementation in various servers.

aiohttp.ClientSession objects allow you to configure options that will apply to all downloaders using that session, such as auth, timeouts, headers, etc. For more information on these options, see the aiohttp.ClientSession docs: http://aiohttp.readthedocs.io/en/stable/client_reference.html#aiohttp.ClientSession

The aiohttp.ClientSession can additionally be configured for SSL by passing in an aiohttp.TCPConnector. For information on configuring either server or client certificate based identity verification, see the aiohttp documentation: http://aiohttp.readthedocs.io/en/stable/client.html#ssl-control-for-tcp-sockets

For more information on aiohttp.BasicAuth objects, see their docs: http://aiohttp.readthedocs.io/en/stable/client_reference.html#aiohttp.BasicAuth

Synchronous Download:
>>> downloader = HttpDownloader('http://example.com/')
>>> result = downloader.fetch()
Parallel Download:
>>> download_coroutines = [
>>>     HttpDownloader('http://example.com/').run(),
>>>     HttpDownloader('http://pulpproject.org/').run(),
>>> ]
>>>
>>> loop = asyncio.get_event_loop()
>>> done, not_done = loop.run_until_complete(asyncio.wait(download_coroutines))
>>>
>>> for task in done:
>>>     try:
>>>         task.result()  # This is a DownloadResult
>>>     except Exception as error:
>>>         pass  # fatal exceptions are raised by result()

The HttpDownloader contains automatic retry logic for when the server responds with an HTTP 429 response. The coroutine will automatically retry 10 times with exponential backoff before allowing a final exception to be raised.

Variables:
  • session (aiohttp.ClientSession) – The session to be used by the downloader.
  • auth (aiohttp.BasicAuth) – An object that represents HTTP Basic Authorization or None
  • proxy (str) – An optional proxy URL or None
  • proxy_auth (aiohttp.BasicAuth) – An optional object that represents proxy HTTP Basic Authorization or None
  • headers_ready_callback (callable) – An optional callback that accepts a single dictionary as its argument. The callback will be called when the response headers are available. The dictionary passed has the header names as its keys and the header values as its values, e.g. {'Transfer-Encoding': 'chunked'}. This can also be None.

This downloader also has all of the attributes of BaseDownloader

Parameters:
  • url (str) – The url to download.
  • session (aiohttp.ClientSession) – The session to be used by the downloader (optional). If not specified, the downloader opens its own session and closes it when the download is complete.
  • auth (aiohttp.BasicAuth) – An object that represents HTTP Basic Authorization (optional)
  • proxy (str) – An optional proxy URL.
  • proxy_auth (aiohttp.BasicAuth) – An optional object that represents proxy HTTP Basic Authorization.
  • headers_ready_callback (callable) – An optional callback that accepts a single dictionary as its argument. The callback will be called when the response headers are available. The dictionary passed has the header names as its keys and the header values as its values, e.g. {'Transfer-Encoding': 'chunked'}.
  • kwargs (dict) – This accepts the parameters of BaseDownloader.
fetch()

Run the download synchronously and return the DownloadResult.

Returns:DownloadResult
Raises:Exception – Any fatal exception emitted during downloading
finalize()

A coroutine to flush downloaded data, close the file writer, and validate the data.

All subclasses are required to call this method after all data has been passed to handle_data().

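Raises:
  • DigestValidationError – When any of the expected_digest values don’t match the digest of the data passed to handle_data().
  • SizeValidationError – When the expected_size value doesn’t match the size of the data passed to handle_data().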
handle_data(data)

A coroutine that writes data to the file object and computes its digests.

All subclassed downloaders are expected to pass all data downloaded to this method. Similar to the hashlib docstring, repeated calls are equivalent to a single call with the concatenation of all the arguments: m.handle_data(a); m.handle_data(b) is equivalent to m.handle_data(a+b).

Parameters:data (bytes) – The data to be handled by the downloader.
run(extra_data=None)

Run the downloader with concurrency restriction.

This method acquires self.semaphore before calling the actual download implementation contained in _run(). This ensures that the semaphore stays acquired even as the backoff decorator on _run() handles backoff-and-retry logic.

Parameters:extra_data (dict) – Extra data passed to the downloader.
Returns:DownloadResult from _run().
validate_digests()

Validate all digests if expected_digests is set.

Raises:DigestValidationError – When any of the expected_digest values don’t match the digest of the data passed to handle_data().
validate_size()

Validate the size if expected_size is set

Raises:SizeValidationError – When the expected_size value doesn’t match the size of the data passed to handle_data().
artifact_attributes

A property that returns a dictionary with size and digest information. The keys of this dictionary correspond with Artifact fields.

FileDownloader

This downloader is an asyncio-aware parallel file reader which is the default downloader produced by the Download Factory for urls starting with file://.

class pulpcore.plugin.download.FileDownloader(url, **kwargs)

A downloader for downloading files from the filesystem.

It provides digest and size validation along with computation of the digests needed to save the file as an Artifact. It writes a new file to the disk and the return path is included in the DownloadResult.

This downloader has all of the attributes of BaseDownloader

Download files from a url that starts with file://

Parameters:
  • url (str) – The url to the file. This is expected to begin with file://
  • kwargs (dict) – This accepts the parameters of BaseDownloader.
fetch()

Run the download synchronously and return the DownloadResult.

Returns:DownloadResult
Raises:Exception – Any fatal exception emitted during downloading
finalize()

A coroutine to flush downloaded data, close the file writer, and validate the data.

All subclasses are required to call this method after all data has been passed to handle_data().

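Raises:
  • DigestValidationError – When any of the expected_digest values don’t match the digest of the data passed to handle_data().
  • SizeValidationError – When the expected_size value doesn’t match the size of the data passed to handle_data().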
handle_data(data)

A coroutine that writes data to the file object and computes its digests.

All subclassed downloaders are expected to pass all data downloaded to this method. Similar to the hashlib docstring, repeated calls are equivalent to a single call with the concatenation of all the arguments: m.handle_data(a); m.handle_data(b) is equivalent to m.handle_data(a+b).

Parameters:data (bytes) – The data to be handled by the downloader.
run(extra_data=None)

Run the downloader with concurrency restriction.

This method acquires self.semaphore before calling the actual download implementation contained in _run(). This ensures that the semaphore stays acquired even as the backoff decorator on _run() handles backoff-and-retry logic.

Parameters:extra_data (dict) – Extra data passed to the downloader.
Returns:DownloadResult from _run().
validate_digests()

Validate all digests if expected_digests is set.

Raises:DigestValidationError – When any of the expected_digest values don’t match the digest of the data passed to handle_data().
validate_size()

Validate the size if expected_size is set

Raises:SizeValidationError – When the expected_size value doesn’t match the size of the data passed to handle_data().
artifact_attributes

A property that returns a dictionary with size and digest information. The keys of this dictionary correspond with Artifact fields.

BaseDownloader

This is an abstract downloader that is meant for subclassing. All downloaders are expected to be descendants of BaseDownloader.

class pulpcore.plugin.download.BaseDownloader(url, custom_file_object=None, expected_digests=None, expected_size=None, semaphore=None)

The base class of all downloaders, providing digest calculation, validation, and file handling.

This is an abstract class and is meant to be subclassed. Subclasses are required to implement the run() method and do two things:

  1. Pass all downloaded data to handle_data() and schedule it.
  2. Schedule finalize() after all data has been delivered to handle_data().

Passing all downloaded data into handle_data() allows the file digests to be computed while data is written to disk. The computed digests are required if the download is to be saved as an Artifact, and computing them during the download avoids having to re-read the data later.

The handle_data() method by default writes to a random file in the current working directory or you can pass in your own file object. See the custom_file_object keyword argument for more details. Allowing the download instantiator to define the file to receive data allows the streamer to receive the data instead of having it written to disk.

The call to finalize() ensures that all data written to the file-like object is quiesced to disk before the file-like object has close() called on it.
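
The handle_data() and finalize() contract described above can be illustrated with a small stand-in. ChunkSink is a hypothetical class, not BaseDownloader: its methods are plain functions rather than coroutines, it raises ValueError instead of the pulpcore validation exceptions, and the choice of sha256 and sha512 is an assumption for the example.

```python
import hashlib
import io


class ChunkSink:
    """Hypothetical stand-in showing the handle_data()/finalize() contract."""

    def __init__(self, file_object, expected_digests=None, expected_size=None):
        self.file_object = file_object
        self.expected_digests = expected_digests or {}
        self.expected_size = expected_size
        self.size = 0
        self.hashers = {name: hashlib.new(name) for name in ("sha256", "sha512")}

    def handle_data(self, data):
        # Write the chunk and update size and digests in the same pass.
        self.file_object.write(data)
        self.size += len(data)
        for hasher in self.hashers.values():
            hasher.update(data)

    def finalize(self):
        # Flush, then validate size and digests against the expectations.
        self.file_object.flush()
        if self.expected_size is not None and self.size != self.expected_size:
            raise ValueError("size mismatch")
        for name, expected in self.expected_digests.items():
            if self.hashers[name].hexdigest() != expected:
                raise ValueError(f"{name} digest mismatch")


expected = {"sha256": hashlib.sha256(b"hello world").hexdigest()}
chunked = ChunkSink(io.BytesIO(), expected_digests=expected, expected_size=11)
chunked.handle_data(b"hello ")
chunked.handle_data(b"world")
chunked.finalize()  # passes: size and sha256 both match

# A single call with the concatenated data yields the same digests.
whole = ChunkSink(io.BytesIO())
whole.handle_data(b"hello world")
print(chunked.hashers["sha256"].hexdigest() == whole.hashers["sha256"].hexdigest())
```

Passing an io.BytesIO here plays the role of a custom file object: the data lands in memory instead of in a file on disk.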

Variables:
  • url (str) – The url to download.
  • expected_digests (dict) – Keyed on the algorithm name provided by hashlib and stores the value of the expected digest, e.g. {'md5': '912ec803b2ce49e4a541068d495ab570'}
  • expected_size (int) – The number of bytes the download is expected to have.
  • path (str) – The full path to the file containing the downloaded data if no custom_file_object option was specified, otherwise None.

Create a BaseDownloader object. This is expected to be called by all subclasses.

Parameters:
  • url (str) – The url to download.
  • custom_file_object (file object) – An open, writable file object that downloaded data can be written to by handle_data().
  • expected_digests (dict) – Keyed on the algorithm name provided by hashlib and stores the value of the expected digest, e.g. {'md5': '912ec803b2ce49e4a541068d495ab570'}
  • expected_size (int) – The number of bytes the download is expected to have.
  • semaphore (asyncio.Semaphore) – A semaphore the downloader must acquire before running. Useful for limiting the number of outstanding downloaders in various ways.
fetch()

Run the download synchronously and return the DownloadResult.

Returns:DownloadResult
Raises:Exception – Any fatal exception emitted during downloading
finalize()

A coroutine to flush downloaded data, close the file writer, and validate the data.

All subclasses are required to call this method after all data has been passed to handle_data().

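Raises:
  • DigestValidationError – When any of the expected_digest values don’t match the digest of the data passed to handle_data().
  • SizeValidationError – When the expected_size value doesn’t match the size of the data passed to handle_data().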
handle_data(data)

A coroutine that writes data to the file object and computes its digests.

All subclassed downloaders are expected to pass all data downloaded to this method. Similar to the hashlib docstring, repeated calls are equivalent to a single call with the concatenation of all the arguments: m.handle_data(a); m.handle_data(b) is equivalent to m.handle_data(a+b).

Parameters:data (bytes) – The data to be handled by the downloader.
run(extra_data=None)

Run the downloader with concurrency restriction.

This method acquires self.semaphore before calling the actual download implementation contained in _run(). This ensures that the semaphore stays acquired even as the backoff decorator on _run() handles backoff-and-retry logic.

Parameters:extra_data (dict) – Extra data passed to the downloader.
Returns:DownloadResult from _run().
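
The concurrency restriction can be pictured with a shared asyncio.Semaphore, as in the sketch below. limited_download() and the state dictionary are hypothetical, the sleep stands in for the transfer (including any retries), and the limit of two is an arbitrary choice for the example.

```python
import asyncio


async def limited_download(name, semaphore, state):
    # Hold the shared semaphore for the whole download, retries included.
    async with semaphore:
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)  # stands in for the network transfer
        state["active"] -= 1
        return name


async def main():
    semaphore = asyncio.Semaphore(2)  # at most two downloads at once
    state = {"active": 0, "peak": 0}
    names = await asyncio.gather(
        *(limited_download(f"file-{i}", semaphore, state) for i in range(5))
    )
    return names, state["peak"]


names, peak = asyncio.run(main())
print(names, peak)
```
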
validate_digests()

Validate all digests if expected_digests is set.

Raises:DigestValidationError – When any of the expected_digest values don’t match the digest of the data passed to handle_data().
validate_size()

Validate the size if expected_size is set

Raises:SizeValidationError – When the expected_size value doesn’t match the size of the data passed to handle_data().
artifact_attributes

A property that returns a dictionary with size and digest information. The keys of this dictionary correspond with Artifact fields.

Validation Exceptions

class pulpcore.exceptions.DigestValidationError

Raised when a file fails to validate a digest checksum.

class pulpcore.exceptions.SizeValidationError

Raised when a file fails size validation.

class pulpcore.exceptions.ValidationError(error_code)

A base class for all Validation Errors.

Parameters:error_code (str) – unique error code