From: Thomas Petazzoni
Date: Tue, 3 Apr 2018 14:17:03 +0200
Subject: [Buildroot] Per-package download folders and Git caching
Message-ID: <20180403141703.21cd24fe@windsurf>
To: buildroot@busybox.net

Hello,

As noted in the Buildroot Hackathon day 3 highlights e-mail, the patch
series changing the download infrastructure to support Git caching was
merged. This brings a number of changes visible to the Buildroot user
that are worth explaining.

Per-package download folders
============================

The first visible change is that files are no longer stored flat in
your DL_DIR, but organized into sub-folders, one per package:

 + dl/
   + linux/
     + linux-v4.16.tar.bz2
   + busybox/
     + busybox-1.28.1.tar.bz2
     + busybox-1.28.2.tar.bz2

The benefit of this new organization is that if two packages need to
download a file with the same name, they no longer conflict.

So today, if you use the latest Buildroot master and start with an
empty DL_DIR, everything will be stored in sub-folders, one per
package.

However, it is likely that many users already have a DL_DIR with a
number of files. The new Buildroot first checks whether the requested
file is in the per-package sub-folder. If it is not found there, it
then looks in the main DL_DIR, and if the file is there, it creates a
hard link to it in the per-package sub-folder. Therefore, you will see
something like:

 + dl/
   + linux-v4.16.tar.bz2
   + linux/
     + linux-v4.16.tar.bz2

The Linux tarball is not duplicated; it is only hard-linked.

This organization with per-package sub-folders is also used when
getting files from the BR2_PRIMARY_SITE or the BR2_BACKUP_SITE
(sources.buildroot.net).

We could therefore summarize the search logic of Buildroot as follows:

 - Try to find the file in DL_DIR//. If found, we're good.
 - If not found, try to find the file in DL_DIR/, and if found, create
   a hard link into DL_DIR//, and then we're good.
 - If not found, go to BR2_PRIMARY_SITE//. If found, store the file in
   DL_DIR//, and we're good.
 - If not found, go to BR2_PRIMARY_SITE/. If found, store the file in
   DL_DIR//, and we're good.
 - If not found, go to the upstream location. If found, store the file
   in DL_DIR//, and we're good.
 - If not found, go to BR2_BACKUP_SITE//. If found, store the file in
   DL_DIR//, and we're good.
 - If not found, go to BR2_BACKUP_SITE/. If found, store the file in
   DL_DIR//, and we're good.
 - If not found, bail out with an error.

It is worth mentioning that a few packages share the same source code
(linux and linux-headers, gcc-initial and gcc-final, mesa3d and
mesa3d-headers). Even with per-package download folders, we wanted to
avoid downloading the tarballs for such packages twice, especially
considering that gcc and linux are quite large. To solve this, those
packages have a special _DL_SUBDIR variable, which they use to override
the name of the download sub-folder. Hence, the tarballs for the
gcc-initial and gcc-final packages are stored in DL_DIR/gcc/, because
both of those packages define _DL_SUBDIR = gcc.

Git caching
===========

Git caching is the primary and original motivation for this patch
series. Before getting into this, let's summarize what Buildroot was
doing to download source code from Git:

 1. Clone the Git repository. Buildroot tried hard to download only
    what was needed using a "shallow clone", but depending on the type
    of reference (tag or full SHA1) and other parameters, a shallow
    clone was not always possible, in which case Buildroot fell back
    to a regular full clone.
 2. Create a tarball out of the source code that had been checked out,
    without the .git/ metadata. This tarball was stored in DL_DIR/.
 3. The Git clone was completely removed.
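
As an illustration of steps 2 and 3 of the old flow, here is a minimal
Python sketch. It is not Buildroot's actual download helper: the
function names (archive_and_discard, _drop_git_metadata) and the gzip
compression are assumptions made for the example. It packs a
checked-out tree into a tarball while leaving out the .git/ metadata,
then removes the checkout.

```python
import shutil
import tarfile
from pathlib import Path


def _drop_git_metadata(member: tarfile.TarInfo):
    """Filter for TarFile.add(): skip any .git entry and everything below it."""
    return None if ".git" in Path(member.name).parts else member


def archive_and_discard(checkout: Path, tarball: Path) -> Path:
    """Pack a checked-out tree into a tarball, then delete the checkout.

    Hypothetical helper mirroring steps 2 and 3: the tarball survives,
    the clone does not.
    """
    tarball.parent.mkdir(parents=True, exist_ok=True)
    with tarfile.open(tarball, "w:gz") as tar:
        # arcname gives every archive entry a single top-level directory.
        tar.add(checkout, arcname=checkout.name, filter=_drop_git_metadata)
    shutil.rmtree(checkout)  # step 3: the clone is thrown away
    return tarball
```

For example, archive_and_discard(Path("foo-1.0"),
Path("dl/foo-1.0.tar.gz")) (both paths hypothetical) leaves only the
tarball behind.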

The obvious drawback is that each time you bump the version of a
package fetched from Git, you have to do a full clone of the upstream
Git repository to re-create the new tarball. This is slow and
inefficient, especially for large projects like the Linux kernel.

Therefore, Buildroot now keeps a Git clone in DL_DIR//git/. This Git
clone is re-used whenever a new download from Git of the same package
is done.

Let's say you start with an empty DL_DIR/linux/git/, and you start a
build with a Linux kernel fetched from the official Linux Git repo from
Linus Torvalds. Buildroot will clone it into DL_DIR/linux/git/, and use
that to create a tarball in DL_DIR/linux/ containing the Linux source
code at the version you specified.

The next day, you build a different Buildroot configuration that uses a
Linux kernel fetched from the RaspberryPi GitHub repository. Buildroot
will see it already has a Git clone in DL_DIR/linux/git/, and it will
simply fetch the missing Git objects from the RaspberryPi GitHub
repository, and then generate the tarball.

This obviously increases the disk space being used, since we keep a
clone instead of removing it after the download, but it greatly reduces
the download time. If you think that your DL_DIR has grown too large,
you can remove whatever you want from there, including DL_DIR//git/.
Buildroot will re-clone as needed.

It is worth mentioning that we continue to generate a tarball stored in
DL_DIR//. One might think that we could just copy/rsync the code from
DL_DIR//git/ into the package build directory, instead of creating a
tarball and re-extracting it later in the package build directory.
However, there are a number of reasons why a tarball is still needed:

 - The tarball is the mechanism we use to interact with the
   BR2_PRIMARY_SITE and BR2_BACKUP_SITE. BR2_PRIMARY_SITE and
   BR2_BACKUP_SITE don't back up a Git clone of the upstream project,
   but a tarball.
   Therefore, continuing to use tarballs makes sense to keep this
   logic unchanged.
 - The tarball is the unit on which we do the hash file verification.
   We certainly don't want to store a hash of all the source files of
   a project in the package .hash file. Even if the SHA1 of a Git
   commit normally guarantees that we're really fetching what we
   think, and therefore would remove the need for a hash file, we
   still have the case of the BR2_PRIMARY_SITE and BR2_BACKUP_SITE
   that store tarballs, and we want to be sure those tarballs haven't
   been modified (maliciously or not).
 - Tarballs are needed for legal-info. We could generate them only for
   legal-info, but because of the two other reasons above, we continue
   to generate tarballs from Git repositories.

Concurrency handling
====================

Until now, all downloads were made into a temporary folder, and only
when the final tarball was ready was it moved to DL_DIR/. Thanks to the
atomic property of "mv", this allowed multiple parallel Buildroot
builds to access and populate a common DL_DIR/ without any problem.

However, with Git caching in place, the folder DL_DIR//git/ becomes
shared, and can be used by multiple parallel Buildroot builds. This
required some locking, which is achieved by using "flock". All download
operations are now protected by a lock taken on the per-package
download folder, i.e. DL_DIR//. This means that parallel Buildroot
builds can continue to download files in parallel as long as they
download files for different packages. If they download files for the
same package at the same time, the lock ensures that those operations
are serialized.

Typically this means that if you start two Buildroot builds, and one
starts cloning the Linux kernel, the other Buildroot build will be
blocked if it tries to download the Linux kernel (even from HTTP),
until the first Buildroot instance has finished cloning the Linux
kernel.
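
The locking scheme described above can be sketched in Python as
follows. Buildroot itself relies on the "flock" utility; this sketch
uses the flock(2) system call through Python's fcntl module instead,
and assumes Linux. The helper name package_download_lock and its
blocking parameter are hypothetical.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def package_download_lock(pkg_dl_dir, blocking=True):
    """Hold an exclusive flock(2) on a per-package download folder."""
    os.makedirs(pkg_dl_dir, exist_ok=True)
    # On Linux, a descriptor opened on a directory can be flock'ed.
    fd = os.open(pkg_dl_dir, os.O_RDONLY)
    try:
        flags = fcntl.LOCK_EX if blocking else fcntl.LOCK_EX | fcntl.LOCK_NB
        # Blocking mode waits here while another holder exists;
        # non-blocking mode raises BlockingIOError instead.
        fcntl.flock(fd, flags)
        yield pkg_dl_dir
    finally:
        os.close(fd)  # closing the descriptor releases the lock
```

Two builds that wrap their downloads in package_download_lock() for
different folders proceed in parallel, while two builds that use the
same folder run one after the other.
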
While it waits, the blocked build simply shows the message
">>> linux 1.2.3 downloading" with nothing else happening, until the
lock gets released.

For now, the lock is taken for all download methods, even if some
methods, such as HTTP, do not require a lock. This is something that
might be improved in the future (patches welcome!).

Conclusion
==========

Hopefully this e-mail has been useful to people who wanted to better
understand the changes that we have recently made to the download
infrastructure. All credit for this work goes to Yann E. Morin, Maxime
Hadjinlian, and Peter Seiderer.

Best regards,

Thomas
--
Thomas Petazzoni, CTO, Bootlin (formerly Free Electrons)
Embedded Linux and Kernel engineering
https://bootlin.com