pulpcore.plugin.download
The module implements downloaders that solve many of the common problems plugin writers have while downloading remote data. A high-level list of features provided by these downloaders includes:
auto-configuration from remote settings (auth, ssl, proxy)
synchronous or parallel downloading
digest and size validation computed during download
grouping downloads together to return to the user when all files are downloaded
customizable download behaviors via subclassing
All classes documented here should be imported directly from the
pulpcore.plugin.download
namespace.
Basic Downloading
The most basic downloading from a url can be done like this:
from pulpcore.plugin.download import HttpDownloader

downloader = HttpDownloader('http://example.com/')
result = downloader.fetch()
The example above downloads the data synchronously. The
fetch()
call blocks until the data is
downloaded and the DownloadResult
is returned or a fatal
exception is raised.
Parallel Downloading
Any downloader in the pulpcore.plugin.download
package can be run in parallel with the
asyncio
event loop. Each downloader has a
run()
method which returns a coroutine object
that asyncio
can schedule in parallel. Consider this example:
import asyncio

from pulpcore.plugin.download import HttpDownloader

download_coroutines = [
    HttpDownloader('http://example.com/').run(),
    HttpDownloader('http://pulpproject.org/').run(),
]
loop = asyncio.get_event_loop()
done, not_done = loop.run_until_complete(asyncio.wait(download_coroutines))
for task in done:
    try:
        task.result()  # This is a DownloadResult
    except Exception as error:
        pass  # fatal exceptions are raised by result()
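asyncio.wait() returns unordered sets of tasks, so it can be awkward to tell which url a given result belongs to. One possible alternative, sketched below, is asyncio.gather() with return_exceptions=True, which keeps results in input order. StubDownloader is a hypothetical stand-in used only so the sketch runs without a network; the assumption is that any downloader from this package could take its place, since each exposes an async run().

```python
import asyncio


class StubDownloader:
    """Hypothetical stand-in exposing the same async run() entry point."""

    def __init__(self, url, fail=False):
        self.url = url
        self.fail = fail

    async def run(self):
        await asyncio.sleep(0)  # yield to the event loop like real I/O would
        if self.fail:
            raise RuntimeError(f"could not download {self.url}")
        return f"result for {self.url}"


async def download_all(downloaders):
    """Run every downloader concurrently and pair each url with its outcome."""
    outcomes = await asyncio.gather(
        *(d.run() for d in downloaders), return_exceptions=True
    )
    return {d.url: outcome for d, outcome in zip(downloaders, outcomes)}


downloaders = [
    StubDownloader("http://example.com/"),
    StubDownloader("http://pulpproject.org/", fail=True),
]
outcomes = asyncio.run(download_all(downloaders))
for url, outcome in outcomes.items():
    status = "failed" if isinstance(outcome, Exception) else "ok"
    print(url, status)
```

With return_exceptions=True a failed download shows up as an exception object in the results instead of cancelling the whole batch, which mirrors the try/except around task.result() above.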
Download Results
The download result contains all the information about a completed download and is returned from the downloader’s run() method when the download is complete.
- class pulpcore.plugin.download.DownloadResult(url, artifact_attributes, path, headers)
- Parameters:
url (str) – The url corresponding with the download.
path (str) – The absolute path to the saved file
artifact_attributes (dict) – Contains keys corresponding with Artifact fields. This includes the computed digest values along with size information.
headers (aiohttp.multidict.MultiDict) – HTTP response headers. The keys are header names. The values are header content. None when not using the HttpDownloader or a subclass.
Configuring from a Remote
When fetching content during a sync, the remote has settings like SSL certs, SSL validation, basic
auth credentials, and proxy settings. Downloaders commonly want to use these settings while
downloading. The Remote’s settings can automatically configure a downloader either to download a
url or a pulpcore.plugin.models.RemoteArtifact
using the
get_downloader()
call. Here is an example download from a URL:
downloader = my_remote.get_downloader(url='http://example.com')
downloader.fetch() # This downloader is configured with the remote's settings
Here is an example of a download configured from a RemoteArtifact, which also configures the downloader with digest and size validation:
remote_artifact = RemoteArtifact.objects.get(...)
downloader = my_remote.get_downloader(remote_artifact=remote_artifact)
downloader.fetch()  # This downloader has the remote's settings plus digest and size validation
The get_downloader() call internally uses the DownloaderFactory, so it expects a url that the DownloaderFactory can build a downloader for. See the DownloaderFactory for more information on supported urls.
Tip
The get_downloader() call accepts kwargs that can enable size- or digest-based validation and specify a file-like object for the data to be written into. See get_downloader() for more information.
Note
All HttpDownloader downloaders produced by the same remote instance share an aiohttp session, which provides a connection pool, connection reuse, and keep-alives shared across all downloaders produced by a single remote.
Automatic Retry
The HttpDownloader
will automatically retry 10 times if the
server responds with one of the following error codes:
429 - Too Many Requests
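The retry behavior can be pictured with a small standalone sketch. It is not the HttpDownloader implementation: TooManyRequests, retry_with_backoff, and the base_delay value are hypothetical names chosen for illustration, and only the idea of retrying a 429 a bounded number of times with a growing delay is taken from the text.

```python
import asyncio


class TooManyRequests(Exception):
    """Hypothetical stand-in for an HTTP 429 response error."""


async def retry_with_backoff(make_attempt, max_retries=10, base_delay=0.001):
    """Call make_attempt() until it succeeds or the retries are used up."""
    for attempt in range(max_retries + 1):
        try:
            return await make_attempt()
        except TooManyRequests:
            if attempt == max_retries:
                raise  # retries exhausted: let the final error surface
            # Exponential backoff: the wait doubles after every failure
            await asyncio.sleep(base_delay * 2 ** attempt)


calls = []


async def flaky_download():
    """Fail with a 429 twice, then succeed."""
    calls.append(1)
    if len(calls) < 3:
        raise TooManyRequests()
    return "downloaded"


result = asyncio.run(retry_with_backoff(flaky_download))
print(result, "after", len(calls), "attempts")
```

Only the 429 stand-in is retried here; any other exception propagates immediately, matching the idea that other 400+ responses are treated as fatal.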
Exception Handling
Unrecoverable errors of several types can be raised during downloading. One example is a validation exception that is raised if the content downloaded fails size or digest validation. There can also be protocol-specific errors, such as an aiohttp.ClientResponseError being raised when a server responds with a 400+ response such as an HTTP 403.
Plugin writers can choose to halt the entire task by allowing the exception to go uncaught, which marks the entire task as failed.
Note
The HttpDownloader automatically retries in some cases, but if unsuccessful will raise an exception for any HTTP response code that is 400 or greater.
Custom Download Behavior
Custom download behavior is provided by subclassing a downloader and providing a new run() method. For example you could catch a specific error code like a 404 and try another mirror if your downloader knew of several mirrors. Here is an example of that in code.
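A minimal sketch of the mirror idea, under stated assumptions: StubDownloader and NotFound are hypothetical stand-ins (for a real downloader class and for a 404 response error) so that the sketch runs without pulpcore or a network. In a plugin, the subclass would instead derive from a real downloader and catch the error that downloader actually raises.

```python
import asyncio


class NotFound(Exception):
    """Hypothetical stand-in for an HTTP 404 response error."""


class StubDownloader:
    """Hypothetical downloader; only urls on 'mirror2' exist."""

    def __init__(self, url):
        self.url = url

    async def run(self):
        if "mirror2" not in self.url:
            raise NotFound(self.url)
        return f"content from {self.url}"


class MirrorDownloader(StubDownloader):
    """Override run() to fall back to the next mirror on a 404."""

    def __init__(self, mirrors, path):
        if not mirrors:
            raise ValueError("at least one mirror is required")
        self.mirror_urls = [m.rstrip("/") + "/" + path for m in mirrors]
        super().__init__(self.mirror_urls[0])

    async def run(self):
        last_error = None
        for url in self.mirror_urls:
            try:
                # A fresh downloader per attempt: one downloader handles one url
                return await StubDownloader(url).run()
            except NotFound as error:
                last_error = error  # remember it and try the next mirror
        raise last_error


mirrors = ["http://mirror1.example.com", "http://mirror2.example.com"]
result = asyncio.run(MirrorDownloader(mirrors, "repo/file.txt").run())
print(result)
```

If every mirror reports a 404, the last error is re-raised so the caller still sees a fatal exception.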
A custom downloader can be given as the downloader to use for a given protocol using the downloader_overrides on the DownloaderFactory. Additionally, you can implement the get_downloader() method to specify the downloader_overrides to the DownloaderFactory.
Adding New Protocol Support
To create a new protocol downloader, implement a subclass of the BaseDownloader. See the docs on BaseDownloader for more information on the requirements.
Download Factory
The DownloaderFactory constructs and configures a downloader for any given url. Specifically:
Select the appropriate downloader based on the url scheme. The supported schemes are http, https, and file.
Auto-configure the selected downloader with settings from a remote (auth, ssl, proxy).
The build()
method constructs one
downloader for any given url.
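The scheme-based selection described above can be sketched in a few lines. This is an illustration only, not the DownloaderFactory source: StubFactory and the stub downloader classes are hypothetical, and the remote-driven configuration (auth, ssl, proxy) is left out entirely.

```python
from urllib.parse import urlparse


class StubHttpDownloader:
    """Hypothetical stand-in for an http/https downloader."""

    def __init__(self, url, **kwargs):
        self.url = url
        self.kwargs = kwargs


class StubFileDownloader(StubHttpDownloader):
    """Hypothetical stand-in for a file:// downloader."""


class StubFactory:
    """Pick a downloader class from the url scheme, honoring overrides."""

    defaults = {
        "http": StubHttpDownloader,
        "https": StubHttpDownloader,
        "file": StubFileDownloader,
    }

    def __init__(self, downloader_overrides=None):
        # Overrides replace the default class for the schemes they name
        self.classes = {**self.defaults, **(downloader_overrides or {})}

    def build(self, url, **kwargs):
        scheme = urlparse(url).scheme
        try:
            downloader_class = self.classes[scheme]
        except KeyError:
            raise ValueError(f"unsupported url scheme: {scheme!r}")
        return downloader_class(url, **kwargs)


class MyCustomDownloader(StubHttpDownloader):
    """A custom subclass to be used for https urls."""


factory = StubFactory(downloader_overrides={"https": MyCustomDownloader})
print(type(factory.build("https://example.com/")).__name__)
print(type(factory.build("file:///tmp/data")).__name__)
```

Schemes without an override keep their default class, so overriding https leaves http and file untouched.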
Note
Any HttpDownloader objects produced by an instantiated DownloaderFactory share an aiohttp session, which provides a connection pool, connection reuse, and keep-alives shared across all downloaders produced by a single factory.
Tip
The build()
method accepts kwargs that
enable size or digest based validation or the specification of a file-like object for the data
to be written into. See build()
for
more information.
- class pulpcore.plugin.download.DownloaderFactory(remote, downloader_overrides=None)
A factory for creating downloader objects that are configured with remote settings.
The DownloaderFactory correctly handles SSL settings, basic auth settings, proxy settings, and connection limit settings.
It supports handling urls with the http, https, and file protocols. The downloader_overrides option allows the caller to specify the download class to be used for any given protocol. This allows the user to specify custom, subclassed downloaders to be built by the factory.
Usage:
the_factory = DownloaderFactory(remote)
downloader = the_factory.build(url_a)
result = downloader.fetch()  # 'result' is a DownloadResult
For http and https urls, in addition to the remote settings, non-default timing values are used. Specifically, the “total” timeout is set to None and the “sock_connect” and “sock_read” are both 5 minutes. For more info on these settings, see the aiohttp docs: http://aiohttp.readthedocs.io/en/stable/client_quickstart.html#timeouts
Behaviorally, this should allow an active download to be arbitrarily long, while still detecting dead or closed sessions even when TCPKeepAlive is disabled.
Also for http and https urls, even though HTTP 1.1 is used, the TCP connection is set up and closed with each request. This is done for compatibility reasons due to various issues related to session continuation implementation in various servers.
- Parameters:
remote (Remote) – The remote used to populate downloader settings.
downloader_overrides (dict) – Keyed on a scheme name, e.g. ‘https’ or ‘ftp’ and the value is the downloader class to be used for that scheme, e.g. {‘https’: MyCustomDownloader}. These override the default values.
- build(url, **kwargs)
Build a downloader which can optionally verify integrity using either digest or size.
The built downloader also provides concurrency restriction if specified by the remote.
- Parameters:
url (str) – The download URL.
kwargs (dict) – All kwargs are passed along to the downloader. At a minimum, these include the
BaseDownloader
parameters.
- Returns:
A downloader that is configured with the remote settings.
- Return type:
subclass of
BaseDownloader
- static user_agent()
Produce a User-Agent string to identify Pulp and relevant system info.
HttpDownloader
This downloader is an asyncio-aware parallel downloader which is the default downloader produced by
the Download Factory for urls starting with http:// or https://. It also supports
synchronous downloading using fetch()
.
- class pulpcore.plugin.download.HttpDownloader(url, session=None, auth=None, proxy=None, proxy_auth=None, headers_ready_callback=None, headers=None, throttler=None, max_retries=0, **kwargs)
An HTTP/HTTPS Downloader built on aiohttp.
This downloader downloads data from one url and is not reused.
The downloader optionally takes a session argument, which is an aiohttp.ClientSession. This allows many downloaders to share one aiohttp.ClientSession which provides a connection pool, connection reuse, and keep-alives across multiple downloaders. When creating many downloaders, have one session shared by all of your HttpDownloader objects.
A session is optional; if omitted, one session will be created, used for this downloader, and then closed when the download is complete. A session that is passed in will not be closed when the download is complete.
If a session is not provided, the one created by HttpDownloader uses non-default timing values. Specifically, the “total” timeout is set to None and the “sock_connect” and “sock_read” are both 5 minutes. For more info on these settings, see the aiohttp docs: http://aiohttp.readthedocs.io/en/stable/client_quickstart.html#timeouts
Behaviorally, this should allow an active download to be arbitrarily long, while still detecting dead or closed sessions even when TCPKeepAlive is disabled.
If a session is not provided, the one created will force TCP connection closure after each request. This is done for compatibility reasons due to various issues related to session continuation implementation in various servers.
aiohttp.ClientSession objects allow you to configure options that will apply to all downloaders using that session, such as auth, timeouts, headers, etc. For more information on these options, see the aiohttp.ClientSession docs: http://aiohttp.readthedocs.io/en/stable/client_reference.html#aiohttp.ClientSession
The aiohttp.ClientSession can additionally be configured for SSL by passing in an aiohttp.TCPConnector. For information on configuring either server or client certificate based identity verification, see the aiohttp documentation: http://aiohttp.readthedocs.io/en/stable/client.html#ssl-control-for-tcp-sockets
For more information on aiohttp.BasicAuth objects, see their docs: http://aiohttp.readthedocs.io/en/stable/client_reference.html#aiohttp.BasicAuth
Synchronous Download:
downloader = HttpDownloader('http://example.com/')
result = downloader.fetch()
Parallel Download:
download_coroutines = [
    HttpDownloader('http://example.com/').run(),
    HttpDownloader('http://pulpproject.org/').run(),
]
loop = asyncio.get_event_loop()
done, not_done = loop.run_until_complete(asyncio.wait(download_coroutines))
for task in done:
    try:
        task.result()  # This is a DownloadResult
    except Exception as error:
        pass  # fatal exceptions are raised by result()
The HttpDownloader contains automatic retry logic for when the server responds with an HTTP 429 response. The coroutine will automatically retry 10 times with exponential backoff before allowing a final exception to be raised.
- Variables:
session (aiohttp.ClientSession) – The session to be used by the downloader.
auth (aiohttp.BasicAuth) – An object that represents HTTP Basic Authorization or None
proxy (str) – An optional proxy URL or None
proxy_auth (aiohttp.BasicAuth) – An optional object that represents proxy HTTP Basic Authorization or None
headers_ready_callback (callable) – An optional callback that accepts a single dictionary as its argument. The callback will be called when the response headers are available. The dictionary passed has the header names as the keys and header values as its values. e.g. {‘Transfer-Encoding’: ‘chunked’}. This can also be None.
This downloader also has all of the attributes of
BaseDownloader
- Parameters:
url (str) – The url to download.
session (aiohttp.ClientSession) – The session to be used by the downloader (optional). If not specified, the downloader will open its own session and close it when the download is complete.
auth (aiohttp.BasicAuth) – An object that represents HTTP Basic Authorization (optional)
proxy (str) – An optional proxy URL.
proxy_auth (aiohttp.BasicAuth) – An optional object that represents proxy HTTP Basic Authorization.
headers_ready_callback (callable) – An optional callback that accepts a single dictionary as its argument. The callback will be called when the response headers are available. The dictionary passed has the header names as the keys and header values as its values. e.g. {‘Transfer-Encoding’: ‘chunked’}
headers (dict) – Headers to be submitted with the request.
throttler (asyncio_throttle.Throttler) – Throttler for asyncio.
max_retries (int) – The maximum number of times to retry a download upon failure.
kwargs (dict) – This accepts the parameters of
BaseDownloader
.
- fetch()
Run the download synchronously and return the DownloadResult.
- Returns:
DownloadResult
- Raises:
Exception – Any fatal exception emitted during downloading
- async finalize()
A coroutine to flush downloaded data, close the file writer, and validate the data.
All subclasses are required to call this method after all data has been passed to handle_data().
- Raises:
DigestValidationError – When any of the expected_digest values don’t match the digest of the data passed to handle_data().
SizeValidationError – When the expected_size value doesn’t match the size of the data passed to handle_data().
- async handle_data(data)
A coroutine that writes data to the file object and computes its digests.
All subclassed downloaders are expected to pass all data downloaded to this method. Similar to the hashlib docstring, repeated calls are equivalent to a single call with the concatenation of all the arguments: m.handle_data(a); m.handle_data(b) is equivalent to m.handle_data(a+b).
- Parameters:
data (bytes) – The data to be handled by the downloader.
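The hashlib behavior that handle_data() is compared to can be checked directly with the standard library. The byte strings below are arbitrary example data.

```python
import hashlib

a = b"first chunk, "
b = b"second chunk"

# Feed the data in two pieces, as a downloader would while streaming
chunked = hashlib.sha256()
chunked.update(a)
chunked.update(b)

# Feed the same bytes in a single call
whole = hashlib.sha256()
whole.update(a + b)

print(chunked.hexdigest() == whole.hexdigest())  # → True
```

This is why the chunk size a downloader happens to receive has no effect on the digests it computes.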
- raise_for_status(response)
Raise error if aiohttp response status is >= 400 and not silenced.
- Parameters:
response (aiohttp.ClientResponse) – The response to handle.
- Raises:
aiohttp.ClientResponseError – When the response status is >= 400.
- async run(extra_data=None)
Run the downloader with concurrency restriction and retry logic.
This method acquires self.semaphore before calling the actual download implementation contained in _run(). This ensures that the semaphore stays acquired even as the backoff wrapper around _run() handles backoff-and-retry logic.
- Parameters:
extra_data (dict) – Extra data passed to the downloader.
- Returns:
DownloadResult
from _run().
- validate_digests()
Validate all digests if expected_digests is set.
- Raises:
DigestValidationError – When any of the expected_digest values don’t match the digest of the data passed to handle_data().
- validate_size()
Validate the size if expected_size is set.
- Raises:
SizeValidationError – When the expected_size value doesn’t match the size of the data passed to handle_data().
- property artifact_attributes
A property that returns a dictionary with size and digest information. The keys of this dictionary correspond with
Artifact
fields.
FileDownloader
This downloader is an asyncio-aware parallel file reader which is the default downloader produced by the Download Factory for urls starting with file://.
- class pulpcore.plugin.download.FileDownloader(url, *args, **kwargs)
A downloader for downloading files from the filesystem.
It provides digest and size validation along with computation of the digests needed to save the file as an Artifact. It writes a new file to the disk and the return path is included in the DownloadResult.
This downloader has all of the attributes of BaseDownloader.
Download files from a url that starts with file://
- Parameters:
url (str) – The url to the file. This is expected to begin with file://
kwargs (dict) – This accepts the parameters of
BaseDownloader
.
- Raises:
ValidationError – When the url starts with file://, but is not a subfolder of a path in the ALLOWED_IMPORT_PATH setting.
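The kind of check behind this ValidationError can be sketched as follows. This is an assumption about the general shape of such a check, not pulpcore's actual validation code: is_allowed and the ALLOWED_IMPORT_PATHS list are hypothetical, POSIX-style paths are assumed, and symlinks are not resolved.

```python
import os
from urllib.parse import urlparse

# Hypothetical value; in Pulp this would come from the ALLOWED_IMPORT_PATH setting
ALLOWED_IMPORT_PATHS = ["/mnt/imports", "/srv/mirror"]


def is_allowed(url, allowed_paths=ALLOWED_IMPORT_PATHS):
    """Return True if a file:// url points inside one of the allowed paths."""
    parsed = urlparse(url)
    if parsed.scheme != "file":
        return False
    # Normalize so that '..' segments cannot escape an allowed directory
    path = os.path.normpath(parsed.path)
    if not os.path.isabs(path):
        return False
    for allowed in allowed_paths:
        allowed = os.path.normpath(allowed)
        if os.path.commonpath([path, allowed]) == allowed:
            return True
    return False


print(is_allowed("file:///mnt/imports/repo/PULP_MANIFEST"))
print(is_allowed("file:///mnt/imports/../secrets/key.pem"))
print(is_allowed("file:///etc/passwd"))
```

Comparing with os.path.commonpath rather than a plain string prefix avoids accepting a sibling directory such as /mnt/imports-old.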
- fetch()
Run the download synchronously and return the DownloadResult.
- Returns:
DownloadResult
- Raises:
Exception – Any fatal exception emitted during downloading
- async finalize()
A coroutine to flush downloaded data, close the file writer, and validate the data.
All subclasses are required to call this method after all data has been passed to handle_data().
- Raises:
DigestValidationError – When any of the expected_digest values don’t match the digest of the data passed to handle_data().
SizeValidationError – When the expected_size value doesn’t match the size of the data passed to handle_data().
- async handle_data(data)
A coroutine that writes data to the file object and computes its digests.
All subclassed downloaders are expected to pass all data downloaded to this method. Similar to the hashlib docstring, repeated calls are equivalent to a single call with the concatenation of all the arguments: m.handle_data(a); m.handle_data(b) is equivalent to m.handle_data(a+b).
- Parameters:
data (bytes) – The data to be handled by the downloader.
- async run(extra_data=None)
Run the downloader with concurrency restriction.
This method acquires self.semaphore before calling the actual download implementation contained in _run(). This ensures that the semaphore stays acquired even as the backoff decorator on _run() handles backoff-and-retry logic.
- Parameters:
extra_data (dict) – Extra data passed to the downloader.
- Returns:
DownloadResult
from _run().
- validate_digests()
Validate all digests if expected_digests is set.
- Raises:
DigestValidationError – When any of the expected_digest values don’t match the digest of the data passed to handle_data().
- validate_size()
Validate the size if expected_size is set.
- Raises:
SizeValidationError – When the expected_size value doesn’t match the size of the data passed to handle_data().
- property artifact_attributes
A property that returns a dictionary with size and digest information. The keys of this dictionary correspond with
Artifact
fields.
BaseDownloader
This is an abstract downloader that is meant for subclassing. All downloaders are expected to be descendants of BaseDownloader.
- class pulpcore.plugin.download.BaseDownloader(url, expected_digests=None, expected_size=None, semaphore=None, *args, **kwargs)
The base class of all downloaders, providing digest calculation, validation, and file handling.
This is an abstract class and is meant to be subclassed. Subclasses are required to implement the run() method and do two things:
Pass all downloaded data to handle_data() and schedule it.
Schedule finalize() after all data has been delivered to handle_data().
Passing all downloaded data into handle_data() allows the file digests to be computed while data is written to disk. The digests computed are required if the download is to be saved as an Artifact, which avoids having to re-read the data later.
The handle_data() method by default writes to a random file in the current working directory.
The call to finalize() ensures that all data written to the file-like object is quiesced to disk before the file-like object has close() called on it.
- Variables:
url (str) – The url to download.
expected_digests (dict) – Keyed on the algorithm name provided by hashlib and stores the value of the expected digest. e.g. {‘md5’: ‘912ec803b2ce49e4a541068d495ab570’}
expected_size (int) – The number of bytes the download is expected to have.
path (str) – The full path to the file containing the downloaded data.
Create a BaseDownloader object. This is expected to be called by all subclasses.
- Parameters:
url (str) – The url to download.
expected_digests (dict) – Keyed on the algorithm name provided by hashlib and stores the value of the expected digest. e.g. {‘md5’: ‘912ec803b2ce49e4a541068d495ab570’}
expected_size (int) – The number of bytes the download is expected to have.
semaphore (asyncio.Semaphore) – A semaphore the downloader must acquire before running. Useful for limiting the number of outstanding downloaders in various ways.
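The contract described for subclasses (pass every chunk to handle_data(), then call finalize()) can be illustrated with a standalone sketch. SketchDownloader is a hypothetical class that only mimics the described behavior; it does not inherit from BaseDownloader, it raises ValueError where the real class raises its validation errors, and the list of chunks stands in for data arriving from a remote.

```python
import asyncio
import hashlib
import os
import tempfile


class SketchDownloader:
    """Hypothetical stand-in mimicking the handle_data()/finalize() contract."""

    def __init__(self, chunks, expected_digests=None, expected_size=None):
        self.chunks = chunks  # stands in for data arriving from a remote
        self.expected_digests = expected_digests or {}
        self.expected_size = expected_size
        names = set(self.expected_digests) | {"sha256"}
        self._hashers = {name: hashlib.new(name) for name in names}
        self._size = 0
        self._file = tempfile.NamedTemporaryFile(delete=False)
        self.path = self._file.name

    async def handle_data(self, data):
        # Write and hash in one pass so the file never has to be re-read
        self._file.write(data)
        self._size += len(data)
        for hasher in self._hashers.values():
            hasher.update(data)

    async def finalize(self):
        self._file.flush()
        os.fsync(self._file.fileno())  # push the data to disk before closing
        self._file.close()
        for name, expected in self.expected_digests.items():
            if self._hashers[name].hexdigest() != expected:
                raise ValueError(f"{name} digest mismatch")
        if self.expected_size is not None and self._size != self.expected_size:
            raise ValueError("size mismatch")

    async def run(self):
        for chunk in self.chunks:
            await self.handle_data(chunk)
        await self.finalize()
        return self.path


chunks = [b"hello ", b"world"]
expected = {"sha256": hashlib.sha256(b"hello world").hexdigest()}
downloader = SketchDownloader(chunks, expected_digests=expected, expected_size=11)
path = asyncio.run(downloader.run())
with open(path, "rb") as saved:
    content = saved.read()
os.remove(path)
print(content)
```

Because validation happens in finalize(), a size or digest mismatch is only reported once all data has been delivered.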
- fetch()
Run the download synchronously and return the DownloadResult.
- Returns:
DownloadResult
- Raises:
Exception – Any fatal exception emitted during downloading
- async finalize()
A coroutine to flush downloaded data, close the file writer, and validate the data.
All subclasses are required to call this method after all data has been passed to handle_data().
- Raises:
DigestValidationError – When any of the expected_digest values don’t match the digest of the data passed to handle_data().
SizeValidationError – When the expected_size value doesn’t match the size of the data passed to handle_data().
- async handle_data(data)
A coroutine that writes data to the file object and computes its digests.
All subclassed downloaders are expected to pass all data downloaded to this method. Similar to the hashlib docstring, repeated calls are equivalent to a single call with the concatenation of all the arguments: m.handle_data(a); m.handle_data(b) is equivalent to m.handle_data(a+b).
- Parameters:
data (bytes) – The data to be handled by the downloader.
- async run(extra_data=None)
Run the downloader with concurrency restriction.
This method acquires self.semaphore before calling the actual download implementation contained in _run(). This ensures that the semaphore stays acquired even as the backoff decorator on _run() handles backoff-and-retry logic.
- Parameters:
extra_data (dict) – Extra data passed to the downloader.
- Returns:
DownloadResult
from _run().
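The concurrency restriction can be pictured with a shared asyncio.Semaphore. The sketch below uses a hypothetical StubDownloader whose run() holds the semaphore around _run(), following the description above; the limit of two and the peak-tracking counters exist only to make the effect visible.

```python
import asyncio


class StubDownloader:
    """Hypothetical stand-in whose run() holds a shared semaphore while it works."""

    active = 0  # downloads currently inside the semaphore
    peak = 0    # highest number of simultaneous downloads seen

    def __init__(self, url, semaphore):
        self.url = url
        self.semaphore = semaphore

    async def run(self):
        # The semaphore is held for the whole of _run()
        async with self.semaphore:
            return await self._run()

    async def _run(self):
        StubDownloader.active += 1
        StubDownloader.peak = max(StubDownloader.peak, StubDownloader.active)
        await asyncio.sleep(0.01)  # pretend to transfer data
        StubDownloader.active -= 1
        return self.url


async def main():
    semaphore = asyncio.Semaphore(2)  # at most two downloads at once
    downloaders = [
        StubDownloader(f"http://example.com/{n}", semaphore) for n in range(6)
    ]
    return await asyncio.gather(*(d.run() for d in downloaders))


results = asyncio.run(main())
print(len(results), "downloads, peak concurrency", StubDownloader.peak)
```

All six coroutines are scheduled at once, yet no more than two are ever inside _run() at the same time.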
- validate_digests()
Validate all digests if expected_digests is set.
- Raises:
DigestValidationError – When any of the expected_digest values don’t match the digest of the data passed to handle_data().
- validate_size()
Validate the size if expected_size is set.
- Raises:
SizeValidationError – When the expected_size value doesn’t match the size of the data passed to handle_data().
- property artifact_attributes
A property that returns a dictionary with size and digest information. The keys of this dictionary correspond with
Artifact
fields.
Validation Exceptions
- class pulpcore.exceptions.DigestValidationError(actual, expected, *args, url=None, **kwargs)
Raised when a file fails to validate a digest checksum.
- Parameters:
error_code (str) – unique error code
- class pulpcore.exceptions.SizeValidationError(actual, expected, *args, url=None, **kwargs)
Raised when a file fails to validate a size checksum.
- Parameters:
error_code (str) – unique error code
- class pulpcore.exceptions.ValidationError(error_code)
A base class for all Validation Errors.
- Parameters:
error_code (str) – unique error code