The Origin of RPM Content
Overview
Pulp supports importing and publishing RPM content, but where does all this content come from? Who is generating it, and with what tools? This document answers these questions. Note that it only covers Fedora and Red Hat Enterprise Linux. It does not cover SLES, openSUSE, Scientific Linux, CentOS, Oracle Linux, or any other RPM-based distributions (yet).
The RPM
RPM is a packaging format. Many Linux distributions use RPM to distribute their software packages. An RPM is a collection of metadata about the package as well as a payload. What the payload contains depends on the type of RPM. There are three different types of RPM:
- Source code RPMs (referred to as “SRPMs”, “.src.rpm” file extension)
- Binary RPMs (referred to as “RPMs”, “.rpm” file extension)
- Binary delta RPMs (referred to as “DRPMs”, “.drpm” file extension)
What each RPM contains is completely up to the creator of the RPM package. It could, for example, contain an entire operating system. However, this is very unwieldy. Many distributions (Fedora, Red Hat, etc.) have strict packaging guidelines about what should be included. The long version can be found in the Fedora Wiki Packaging Guidelines, but the short version is that each RPM should contain at most a single software project (for example, the Linux kernel, GCC, or binutils). Generally, if documentation is large, it is separated into its own RPM. If the software project is compiled, debugging symbols are packaged separately as well. This means there can be several RPMs for a single project.
File Format
The RPM file format consists of a “lead” which identifies the file as an RPM, a “signature” section which can be used to ensure the integrity and authenticity of the RPM (via GPG), a “header” section which contains metadata about the packaged software (name, version, architecture, file list, etc.), and a “payload” which is a compressed file archive (usually in the cpio format). The payload was historically compressed with gzip; newer packages commonly use xz or zstd instead.
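As a rough illustration of how a tool can recognize the lead, the sketch below unpacks the first 96 bytes of a file. It assumes the classic fixed-size lead layout (magic, format version, package type, architecture number, name, OS number, signature type, reserved bytes); the function name `parse_lead` is hypothetical and not part of any RPM library.

```python
import struct

# Magic bytes that open every RPM lead.
RPM_LEAD_MAGIC = b"\xed\xab\xee\xdb"

# Assumed big-endian layout of the classic lead:
# magic, major, minor, type, archnum, name, osnum, signature_type, reserved
LEAD_FORMAT = ">4sBBhh66shh16s"
LEAD_SIZE = struct.calcsize(LEAD_FORMAT)  # 96 bytes


def parse_lead(data: bytes) -> dict:
    """Parse the fixed-size lead found at the start of an RPM file."""
    if len(data) < LEAD_SIZE:
        raise ValueError("not enough data for an RPM lead")
    fields = struct.unpack(LEAD_FORMAT, data[:LEAD_SIZE])
    magic, major, minor, rpm_type, _arch, name, _os, _sig, _reserved = fields
    if magic != RPM_LEAD_MAGIC:
        raise ValueError("missing RPM magic; not an RPM file")
    return {
        "format_version": (major, minor),
        # In the lead, type 0 marks a binary package and type 1 a source package.
        "is_source": rpm_type == 1,
        # The name field is NUL-padded; keep everything before the first NUL.
        "name": name.split(b"\x00", 1)[0].decode("ascii", errors="replace"),
    }
```

To try it against a real package, read the first `LEAD_SIZE` bytes of the file in binary mode and pass them to `parse_lead`. Modern tools take package metadata from the header section rather than the lead, so this is only useful as a quick file-type check.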
SRPM
The payload of a source RPM is simply a compressed tarball of source code and a
file, called a spec file, that describes how to turn the source RPM into a
binary RPM. The spec file includes the installation location and permissions
for all files in the package. This allows the RPM tool to install
the binary RPM to a system with the correct permissions and track which package
“owns” which files. Spec files also allow the author to run shell scripts before
or after an installation, removal, or upgrade of a package. The source RPM can
be built into one or more binary RPMs using rpmbuild.
RPM
A binary RPM’s payload is the collection of files installed from the build
process of the source code. It is architecture- and distribution-specific.
If a package happens to be architecture-independent, it can declare its
architecture as noarch.
DRPM
A binary delta RPM has a payload which contains the binary diff between two
releases of the same package. For example, there could be a binary delta RPM
that can be used to upgrade an existing Firefox 45 installation to a Firefox
46 installation without downloading the entire binary RPM for Firefox 46. This
format exists to save bandwidth for content providers, at the cost of a
significant amount of computation on the client attempting to install the DRPM.
Unlike RPMs and SRPMs, DRPMs are not created by rpmbuild. Other tools exist to
build them, like deltarpm, or createrepo_c, which leverages the drpm library.
When using createrepo_c, it is possible to generate DRPMs and the prestodelta.xml
metadata required for an RPM repository (covered below).
The Yum/RPM Repository
RPM provides a packaging format, but the RPM (often referred to as yum) repository provides a way to distribute packages. An RPM repository consists of one or more RPM packages and some metadata describing what RPM packages the repository contains. The metadata, usually located in a directory called “repodata” in the root of the repository, is contained in several XML files and, optionally, several SQLite files. DNF does not make use of the SQLite databases (which contain the same metadata as the XML), although some clients might. The filenames of these metadata files can be arbitrary, so clients locate them by using a metadata file that describes the metadata: the repomd.xml file.
To create an RPM repository, all that is required is that the repomd.xml, primary.xml, filelists.xml, and other.xml metadata files be present. There are two libraries that can do this: createrepo, and createrepo_c. createrepo is a Python library that is no longer maintained. createrepo_c is a C library with Python bindings that is actively maintained.
repomd.xml
repomd.xml is the metadata file that clients use to discover what repository
metadata files exist in the repository. It should always be located at
repodata/repomd.xml relative to the root of the repository. It references the
location of all other metadata files for the repository. This means that the
other metadata files might not be located in the repodata/ directory, but it
is convention to store all RPM repository metadata in repodata/, and all
current Fedora, Red Hat, and CentOS repositories do this.
The repomd.xml file (XML namespace: http://linux.duke.edu/metadata/repo which is
sadly a dead link) contains data elements with one attribute, type. The type
attribute is a string which references the type of metadata file the data
element refers to. Common values are group, filelists, group_gz, primary, other,
filelists_db, primary_db, and other_db. Values of the form <thing>_db refer to
SQLite versions of the metadata, while those sans _db refer to XML versions.
Each data element contains several other elements describing the metadata and
where it is located: checksum, location, timestamp, and size seem to be always
present, with open-size, open-checksum, and database_version potentially
appearing as well.
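The lookup a client performs against repomd.xml can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular client is implemented: the sample document is hypothetical and heavily trimmed, its checksums are placeholders, and it assumes that each location element carries the path in an href attribute.

```python
import xml.etree.ElementTree as ET

REPO_NS = {"repo": "http://linux.duke.edu/metadata/repo"}

# Hypothetical, trimmed repomd.xml; checksum values are placeholders.
SAMPLE_REPOMD = """<repomd xmlns="http://linux.duke.edu/metadata/repo">
  <data type="primary">
    <checksum type="sha256">aaaa</checksum>
    <location href="repodata/aaaa-primary.xml.gz"/>
    <timestamp>1460000000</timestamp>
    <size>1234</size>
  </data>
  <data type="filelists">
    <checksum type="sha256">bbbb</checksum>
    <location href="repodata/bbbb-filelists.xml.gz"/>
    <timestamp>1460000000</timestamp>
    <size>5678</size>
  </data>
</repomd>
"""


def metadata_locations(repomd_xml: str) -> dict:
    """Map each metadata type to its location relative to the repository root."""
    root = ET.fromstring(repomd_xml)
    locations = {}
    for data in root.findall("repo:data", REPO_NS):
        location = data.find("repo:location", REPO_NS)
        if location is not None:
            locations[data.get("type")] = location.get("href")
    return locations


print(metadata_locations(SAMPLE_REPOMD))
```

Because the filenames are arbitrary, a client should always resolve metadata files through this mapping rather than guessing names inside repodata/.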
primary.xml
The primary.xml file (often stored in repodata/<file-checksum>-primary.xml.gz)
contains a list of every RPM and SRPM package (DRPMs are covered by
prestodelta.xml below) in the repository (and the network location to download
them). This includes information like the name, epoch, version, release, and
architecture. It also lists what libraries and binaries the package provides, as
well as what libraries and binaries the package depends upon to work. This
metadata can be used by the client to determine the dependency tree of a
package, how much data it will need to download, and how much space the packages
will take up when installed. Try doing yum install <some uninstalled package>
some time and note how it describes what it’s going to install for dependencies
and how much space it’s going to take up. All that comes from this metadata
file.
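To make the shape of this metadata more concrete, here is a small Python sketch that extracts the identity, sizes, and requirements of each package. The sample document describes a hypothetical package, and the element and attribute names (ver, rel, the size attributes, and the rpm:entry elements) reflect the commonly seen primary.xml layout rather than anything this document specifies.

```python
import xml.etree.ElementTree as ET

NS = {
    "common": "http://linux.duke.edu/metadata/common",
    "rpm": "http://linux.duke.edu/metadata/rpm",
}

# Hypothetical, trimmed primary.xml with a single made-up package.
SAMPLE_PRIMARY = """<metadata xmlns="http://linux.duke.edu/metadata/common"
          xmlns:rpm="http://linux.duke.edu/metadata/rpm" packages="1">
  <package type="rpm">
    <name>example</name>
    <arch>noarch</arch>
    <version epoch="0" ver="1.0" rel="1.fc23"/>
    <size package="2048" installed="8192" archive="8500"/>
    <location href="Packages/e/example-1.0-1.fc23.noarch.rpm"/>
    <format>
      <rpm:requires>
        <rpm:entry name="/bin/sh"/>
        <rpm:entry name="python3"/>
      </rpm:requires>
    </format>
  </package>
</metadata>
"""


def summarize_packages(primary_xml: str) -> list:
    """Return one summary dict per package listed in primary.xml."""
    root = ET.fromstring(primary_xml)
    summaries = []
    for package in root.findall("common:package", NS):
        version = package.find("common:version", NS)
        size = package.find("common:size", NS)
        requires = package.findall("common:format/rpm:requires/rpm:entry", NS)
        summaries.append({
            "nevra": "{}-{}:{}-{}.{}".format(
                package.findtext("common:name", namespaces=NS),
                version.get("epoch"), version.get("ver"), version.get("rel"),
                package.findtext("common:arch", namespaces=NS),
            ),
            "download_size": int(size.get("package")),
            "installed_size": int(size.get("installed")),
            "requires": [entry.get("name") for entry in requires],
        })
    return summaries


print(summarize_packages(SAMPLE_PRIMARY))
```

Summing download_size and installed_size over a resolved dependency set gives the kind of totals a client reports before it asks for confirmation.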
filelists.xml
The filelists.xml metadata does exactly what the name implies. It is a list of
every single file contained in each RPM package. Like the primary.xml file, it
contains a list of package elements (which reference packages from the
primary.xml file), within which there are a number of file elements, as well
as a version element that identifies the package version. Files that are
directories have a type=dir attribute.
other.xml
The other.xml contains... well, other information about each package. It
references each package in much the same way as filelists.xml. At the very
least, it contains changelog elements, where an element exists for each
changelog entry in the spec file used to build the RPM. Typically this is
truncated, often to the 10 most recent releases.
comps.xml
comps.xml contains, among other things, a list of groups. Each group contains
a description and a list of packages in that group. Packages can be marked as
mandatory, default, or optional, based on the value of the type attribute
on the packagereq element.
Additional metadata in comps.xml includes package environments and categories, each of which is simply a list of package groups, and langpacks.
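The grouping described above can be read with a short Python sketch. The sample group is hypothetical, and the surrounding element names (comps, group, id, packagelist) are assumptions based on the usual comps.xml layout; only the packagereq element and its type attribute come from the description above.

```python
import xml.etree.ElementTree as ET

# Hypothetical comps.xml fragment with one made-up group.
SAMPLE_COMPS = """<comps>
  <group>
    <id>example-tools</id>
    <name>Example Tools</name>
    <description>A hypothetical group used for illustration.</description>
    <packagelist>
      <packagereq type="mandatory">bash</packagereq>
      <packagereq type="default">vim-enhanced</packagereq>
      <packagereq type="optional">emacs</packagereq>
    </packagelist>
  </group>
</comps>
"""


def packages_by_type(comps_xml: str) -> dict:
    """Map each group id to its packages, bucketed by packagereq type."""
    root = ET.fromstring(comps_xml)
    groups = {}
    for group in root.findall("group"):
        by_type = {}
        for req in group.findall("packagelist/packagereq"):
            # Entries without a type attribute are kept under "unspecified".
            by_type.setdefault(req.get("type", "unspecified"), []).append(req.text)
        groups[group.findtext("id")] = by_type
    return groups


print(packages_by_type(SAMPLE_COMPS))
```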
prestodelta.xml
prestodelta.xml is used to describe the DRPMs a repository contains. A DRPM is built from two different binary RPMs (a new version and an old version). A repository can, and often does, contain several DRPMs for various upgrade paths. For example, there might be a DRPM containing the difference between firefox-45.0 and firefox-46.0, as well as a DRPM containing the difference between firefox-45.1 and firefox-46.0. To apply a DRPM, a client must retrieve the one that matches the version of the package it currently has installed.
The prestodelta root element contains zero or more newpackage elements. Each
newpackage element has name, epoch, version, release, and arch attributes to
identify what the new version of the package is.
Each newpackage element contains one or more delta elements. The delta
element has the oldepoch, oldversion, and oldrelease attributes to
identify which old version of the package the DRPM applies to.
Each delta element contains four elements: filename, sequence, size, and
checksum.
For example:
<?xml version="1.0" encoding="UTF-8"?>
<prestodelta>
  <newpackage name="cmake-fedora" epoch="0" version="2.6.0" release="1.fc23" arch="noarch">
    <delta oldepoch="0" oldversion="2.3.4" oldrelease="2.fc23">
      <filename>drpms/cmake-fedora-2.3.4-2.fc23_2.6.0-1.fc23.noarch.drpm</filename>
      <sequence>cmake-fedora-2.3.4-2.fc23-84bdd3315d4caddf8245e82cb83de4e301d5</sequence>
      <size>51194</size>
      <checksum type="sha256">6926544188f70d0e9dbedfd07fcf361d6fdc813d2888f5635fd647069bcc14ed</checksum>
    </delta>
    <delta oldepoch="0" oldversion="2.5.1" oldrelease="1.fc23">
      <filename>drpms/cmake-fedora-2.5.1-1.fc23_2.6.0-1.fc23.noarch.drpm</filename>
      <sequence>cmake-fedora-2.5.1-1.fc23-9930049f7b6f6c78a7732f5230c38f6e0196</sequence>
      <size>34154</size>
      <checksum type="sha256">45012a502babf1bdda402c05b50c1c68f8c5dbe62d85ce61a0a41c71c0ec6f8c</checksum>
    </delta>
  </newpackage>
</prestodelta>
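Selecting the right DRPM for an installed package amounts to matching the old and new versions against this metadata. The sketch below shows one way to do that in Python; the function name `find_delta` is hypothetical, the sample is trimmed from the example above, and real clients perform additional checks (such as verifying the sequence and checksum) that are omitted here.

```python
import xml.etree.ElementTree as ET

# Trimmed from the cmake-fedora example above (sequence and checksum omitted).
SAMPLE_PRESTODELTA = """<prestodelta>
  <newpackage name="cmake-fedora" epoch="0" version="2.6.0" release="1.fc23" arch="noarch">
    <delta oldepoch="0" oldversion="2.3.4" oldrelease="2.fc23">
      <filename>drpms/cmake-fedora-2.3.4-2.fc23_2.6.0-1.fc23.noarch.drpm</filename>
      <size>51194</size>
    </delta>
    <delta oldepoch="0" oldversion="2.5.1" oldrelease="1.fc23">
      <filename>drpms/cmake-fedora-2.5.1-1.fc23_2.6.0-1.fc23.noarch.drpm</filename>
      <size>34154</size>
    </delta>
  </newpackage>
</prestodelta>
"""


def find_delta(prestodelta_xml, name, arch, installed, target):
    """Return the DRPM filename that upgrades `installed` to `target`, or None.

    `installed` and `target` are (epoch, version, release) tuples of strings.
    """
    root = ET.fromstring(prestodelta_xml)
    for new in root.findall("newpackage"):
        new_evr = (new.get("epoch"), new.get("version"), new.get("release"))
        if (new.get("name"), new.get("arch"), new_evr) != (name, arch, target):
            continue
        for delta in new.findall("delta"):
            old_evr = (
                delta.get("oldepoch"),
                delta.get("oldversion"),
                delta.get("oldrelease"),
            )
            if old_evr == installed:
                return delta.findtext("filename")
    return None


print(find_delta(
    SAMPLE_PRESTODELTA, "cmake-fedora", "noarch",
    installed=("0", "2.5.1", "1.fc23"), target=("0", "2.6.0", "1.fc23"),
))
```

When no delta matches the installed version, the function returns None and a client would fall back to downloading the full binary RPM.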
updateinfo.xml
updateinfo.xml describes errata. An erratum describes a change in an RPM repository. Errata are typically divided into three categories: security, bugfix, and enhancement. If a package is being updated to fix a security problem, the erratum for that update is a security erratum. If it is simply a bug with no (known) security implications, it is a bugfix erratum. Finally, the update could be to provide additional features, in which case it is an enhancement erratum.
In Fedora, the updateinfo.xml metadata is generated by Bodhi. It is created when an update is pushed by Bodhi and injected into the RPM repository metadata using the modifyrepo_c tool, part of the createrepo_c package.
What an erratum references varies from project to project and product to product. For example, Red Hat Enterprise Linux and CentOS issue an erratum per component (SRPM package). However, other projects and products might issue a single erratum for many components at once. In either case, an erratum references a list of one or more RPM packages, since one SRPM can produce many RPM packages.
Each erratum has a pkglist element, which contains a collection element,
which contains a name element and one or more package elements. Each package
element has name, version, release, epoch, and arch attributes to
identify the affected package. In addition to those attributes, there is a src
attribute. In RHEL errata, this appears to be the name of the SRPM:
<package name="java-1.7.0-openjdk" version="1.7.0.55" release="2.4.7.2.el7_0" epoch="1" arch="x86_64" src="java-1.7.0-openjdk-1.7.0.55-2.4.7.2.el7_0.src.rpm">
However, in Fedora this src field references where the package is located by URL:
<package name="opendnssec" version="1.4.9" release="1.fc23" epoch="0" arch="i686" src="https://download.fedoraproject.org/pub/fedora/linux/updates/23/i386/o/opendnssec-1.4.9-1.fc23.i686.rpm">
Each package element contains a filename element, and in RHEL errata, a
sum element.
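Because the src attribute means different things in different distributions, code that consumes errata has to handle both forms. The following Python sketch walks every package element and records whether src looks like a URL. The wrapping updates, update, and id elements and the erratum id in the sample are assumptions made for illustration, and the filename values are inferred from the two package examples above.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Hypothetical wrapper around the two package examples shown above.
SAMPLE_UPDATEINFO = """<updates>
  <update type="security">
    <id>EXAMPLE-2016-0001</id>
    <pkglist>
      <collection>
        <name>example-collection</name>
        <package name="java-1.7.0-openjdk" version="1.7.0.55" release="2.4.7.2.el7_0" epoch="1" arch="x86_64" src="java-1.7.0-openjdk-1.7.0.55-2.4.7.2.el7_0.src.rpm">
          <filename>java-1.7.0-openjdk-1.7.0.55-2.4.7.2.el7_0.x86_64.rpm</filename>
        </package>
        <package name="opendnssec" version="1.4.9" release="1.fc23" epoch="0" arch="i686" src="https://download.fedoraproject.org/pub/fedora/linux/updates/23/i386/o/opendnssec-1.4.9-1.fc23.i686.rpm">
          <filename>opendnssec-1.4.9-1.fc23.i686.rpm</filename>
        </package>
      </collection>
    </pkglist>
  </update>
</updates>
"""


def affected_packages(updateinfo_xml: str) -> list:
    """List every package referenced by any erratum in updateinfo.xml."""
    root = ET.fromstring(updateinfo_xml)
    packages = []
    for package in root.iter("package"):
        src = package.get("src", "")
        packages.append({
            "name": package.get("name"),
            "evr": (package.get("epoch"), package.get("version"), package.get("release")),
            "arch": package.get("arch"),
            "filename": package.findtext("filename"),
            # Fedora-style src values are URLs; RHEL-style values are SRPM names.
            "src_is_url": urlparse(src).scheme in ("http", "https", "ftp"),
        })
    return packages


for entry in affected_packages(SAMPLE_UPDATEINFO):
    print(entry["name"], entry["src_is_url"])
```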
Organizing RPM Builds
As you now know from the RPM section, each package requires a source tarball and a spec file. In addition to these two required files, a packager may create patch files that alter the source code in some way. This is done for many reasons, but generally it is done to work around a bug in the upstream project, back-port a bugfix from upstream, or unbundle libraries. All this can become unwieldy to manage and track, especially when dealing with thousands of packages (Fedora contains ~18,000 packages). Fedora uses dist-git to solve this problem.
dist-git is designed specifically to manage RPM package sources. It stores the spec file, patches, and a reference to the source tarball in a Git repository. The source tarball itself is not checked into Git and instead lives in a lookaside cache. The validity of the source tarball is determined by the reference checked into the Git repository. Each package is contained in its own dist-git repository. This allows package maintainers to collaborate and view the history of a package.
Of course, having the sources, patches, and spec files organized doesn’t help
much if the RPMs have to be built manually.
Koji (and to some extent Copr) is a tool to build and track SRPMs
and RPMs from those dist-git repositories. It performs the builds in clean, secure
environments for many different architectures by using Mock. Each build can be
tagged to help track where each build ends up. This is helpful when we want to
turn a collection of packages into an operating system distribution. Examples of
tags are f24, f24-updates, and f24-updates-candidate.
Composes
Having all the packages built and tracked in a tool like Koji is only helpful if there are tools to turn those packages into useful, consumable content. What is useful content?
- RPM repositories from which packages can be installed
- Installation media (ISOs for CD/DVD, PXE boot images, USB boot images, etc)
- Arbitrary additional files such as release notes, licenses, EULA, GPG keys, and branding images.
Fedora and RHEL have the concept of a compose. A set of packages makes up a product release (Fedora 24, for example). The set of packages used in a compose can be controlled by the tag a package has in Koji. As a release is developed, new packages are added and current packages are updated or removed. A compose is an immutable snapshot at a certain point in time of a product release’s development. At some point, the compose is deemed to be “gold” and becomes the GA release of a product. For example, Fedora 23 is a release of the Fedora product.
A compose contains one or more variants. A variant is a particular subset of the set of packages used in the compose. One subset might target servers, another workstations, and another Atomic hosts. Each variant is built for one or more architectures (i686, x86_64, sparc, ppc64, etc).
Each of these variant builds for a specific architecture is referred to as a tree. A tree is made up of:
- One or more RPM repositories
- Bootable ISO images
- PXE boot images including EFI boot files, ISOLINUX boot files, and one or more kernel images with initial RAM disks.
Almost all the content in a tree is described in a metadata file called the
treeinfo file (sometimes .treeinfo), which is located in the root of the tree
directory. This metadata file can be parsed using the Red Hat Release
Engineering tool, productmd.
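productmd is the tool to reach for in practice. Purely to illustrate the idea, the sketch below reads a treeinfo-like file with Python's standard configparser module. It assumes the file is INI-style, and the section and key names in the sample are modeled on older trees and should be treated as hypothetical; newer trees may be laid out differently.

```python
import configparser

# Hypothetical treeinfo content; section and key names are assumptions.
SAMPLE_TREEINFO = """\
[general]
family = Red Hat Enterprise Linux
version = 6.7
variant = Server
arch = x86_64

[images-x86_64]
kernel = images/pxeboot/vmlinuz
initrd = images/pxeboot/initrd.img
boot.iso = images/boot.iso
"""


def pxe_files(treeinfo_text: str) -> tuple:
    """Return the (kernel, initrd) paths for the tree's architecture."""
    parser = configparser.ConfigParser()
    parser.read_string(treeinfo_text)
    arch = parser["general"]["arch"]
    images = parser["images-{}".format(arch)]
    return images["kernel"], images["initrd"]


print(pxe_files(SAMPLE_TREEINFO))
```

The returned paths are relative to the root of the tree, which is how a provisioning tool would locate the kernel and initial RAM disk for a network install.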
To summarize, a compose is made up of variants, which are made of architecture-specific trees.
The tool used by Fedora to create composes is called Pungi. Pungi makes use of the Lorax project to build each tree. Prior to the Lorax project, trees were generated by scripts in the Anaconda installer. These scripts have been removed since Lorax replaces them.
As a concrete example, the Red Hat Enterprise Linux 6.7 (release) Server (variant) x86_64 tree contains the following:
- The RPM repository (metadata in repodata/)
- Several addon RPM repositories (metadata in HighAvailability/repodata/, LoadBalancer/repodata/, ResilientStorage/repodata/, and ScalableFileSystem/repodata/)
- EFI/BOOT/BOOTX64.conf: EFI configuration containing references to the kernel and initrd in images/pxeboot/
- EFI/BOOT/BOOTX64.efi: EFI boot file for x86_64 architecture
- EFI/BOOT/splash.xpm.gz: boot splash screen graphic
- images/efiboot.img: CD/DVD boot image for EFI systems
- images/efidisk.img: USB boot image for EFI systems (can be dd’ed to a USB flash drive)
- images/boot.iso: Bootable ISO image built from the various images in images/, EFI/, and isolinux/
- images/install.img: Stage 2 installation image, loaded when you start the installation from a supported boot method
- images/product.img: RHEL product description information used in the installer
- images/pxeboot/initrd.img: Initial ramdisk file for PXE-capable systems
- images/pxeboot/vmlinuz: Kernel image for PXE-capable systems
- isolinux/: bootloader with configuration, as well as a kernel image, initial RAM disk, and memtest
In the above example, the EFI/ and isolinux/ directories are not referenced
by metadata as they are not required by any client.
Updates
Composes are immutable, and when a product is released, it does not change. Updates are provided in the form of errata. When a package is updated, an erratum must be associated with it. An erratum is metadata about the update of one or more packages, very much like the erratum for a book. These are described in the updateinfo.xml file in an RPM repository. In the case of Fedora and RHEL, the RPM repositories in the compose are kept pristine and unchanged, but this is not enforced by the tooling; it is merely convention. There may be distributions out there that add their errata and updated RPM packages to the GA compose.
Fedora provides an excellent example of this method. When a release is made,
it is located under the releases/ directory on the mirrors. For example, a
Fedora release lives in releases/<release-version>/<variant>/<arch>/. This
repository remains unchanged, even after updates are released for Fedora. You’ll
notice that its repodata directory contains no updateinfo.xml. Updates are
provided under updates/<release-version>/<arch>/. This RPM repository does
contain updateinfo.xml, which holds the errata for all the packages in this
repository.
Red Hat Enterprise Linux is similar, except that releases are usually stored in
the rhel/<variant>/<major-release>/<minor-release>/<arch>/kickstart/
repository. Updates are provided in the
rhel/<variant>/<major-release>/<minor-release>/<arch>/os/ repository.
Overview of the Fedora Build Process
To get an idea of how this works in practice, the Fedora build process is outlined below. The Fedora Release Engineering team has written documentation on their release process which may be helpful to reference.
The basic workflow is as follows:
1. Packages are created from upstream repositories (Git repositories, PyPI, RubyGems, etc.) by creating a spec file and any necessary patches. These go through a review process. Once approved, a dist-git repository is created for the package and the spec file and patches are checked in. The source tarball is uploaded to a lookaside cache (it is not checked into source control, but a method of verifying the tarball is).
2. Packages are built in Koji by the package maintainer. Each build is made for a Koji build target. A build target specifies where a package should be built and how it should be tagged afterwards. This allows target names to remain fixed as tags change through releases.
3. Products are composed using Pungi. This creates ISOs and other installation media, boot images for PXE, etc.
4. At a certain point in the release cycle, Fedora’s Bodhi is turned on. After a package is built (step 2), the package maintainer submits the build to Bodhi. It is available for testing in the updates-testing repository, and community members can +1 or -1 updates. After a certain period of time or enough +1 votes, the package is approved. It is pushed into the updates repository with an entry in the updateinfo.xml metadata file.