From: "Yann E. MORIN" <yann.morin.1998@free.fr>
To: Brandon Maier <brandon.maier@collins.com>
Cc: Herve Codina <herve.codina@bootlin.com>,
Thomas Petazzoni <thomas.petazzoni@bootlin.com>,
buildroot@buildroot.org
Subject: Re: [Buildroot] [PATCH 1/1] ppd-merge: speed up per-package-rsync
Date: Tue, 28 Nov 2023 22:07:17 +0100 [thread overview]
Message-ID: <20231128210717.GA107928@scaer> (raw)
In-Reply-To: <20231127224139.35969-1-brandon.maier@collins.com>
Brandon, All,
+Thomas for the intial work on this topic
On 2023-11-27 22:41 +0000, Brandon Maier spake thusly:
> The per-package-rsync stage can add a significant amount of time to
> builds. They can also be annoying as the slowest rsyncs, the final
> target-finalize and host-finalize stages, run on `make all` which we use
> frequently during development.
>
> The per-package-rsync is slow because it launches a new rsync for each
> source tree, and each rsync must rescan the destination directory and
> potentially overwrite files multiple times. So instead we manually walk
> the source trees to find only the files that should be written to the
> destination, then feed that to a single instance of rsync.
Sorry, I haven't looked into the script in details, but why can't we
just do something like:
rsync -a \
--hard-links \
$(if $(filter hardlink,$(4)), \
--link-dest=$(PER_PACKAGE_DIR)/$(pkg)/$(2)/, \
$(if $(filter copy,$(4)), \
$(empty), \
$(error per-package-rsync can only "copy" or "hardlink", not "$(4)") \
) \
) \
$(patsubst %, $(PER_PACKAGE_DIR)/%/$(2)/, $(1)) \
$(3)
rsync can take multiple sources, and is usually pretty fast, so I would
very much prefer that we do not invent our own tooling if a single rsync
can do the job.
Also, the copy situation is now only ever needed when we aggregate the
final target and host directories, while the hard-linking case is used
when assembling each per-package; so there is no point in benchmarking
the copy mode for the per-package preparation, or the hard-linking for
the final target and host assembling.
Hervé, Thomas, do you remember if we ever addressed using a single rsync
rather than multiple ones in your previous work on the topic, and if so,
why we did not use that?
Regards,
Yann E. MORIN.
> create mode 100755 support/scripts/ppd-merge.sh
>
> diff --git a/package/pkg-utils.mk b/package/pkg-utils.mk
> index 059e86ae0a..3968cabebd 100644
> --- a/package/pkg-utils.mk
> +++ b/package/pkg-utils.mk
> @@ -216,19 +216,9 @@ ifeq ($(BR2_PER_PACKAGE_DIRECTORIES),y)
> # $3: destination directory
> # $4: literal "copy" or "hardlink" to copy or hardlink files from src to dest
> define per-package-rsync
> - mkdir -p $(3)
> - $(foreach pkg,$(1),\
> - rsync -a \
> - --hard-links \
> - $(if $(filter hardlink,$(4)), \
> - --link-dest=$(PER_PACKAGE_DIR)/$(pkg)/$(2)/, \
> - $(if $(filter copy,$(4)), \
> - $(empty), \
> - $(error per-package-rsync can only "copy" or "hardlink", not "$(4)") \
> - ) \
> - ) \
> - $(PER_PACKAGE_DIR)/$(pkg)/$(2)/ \
> - $(3)$(sep))
> + PER_PACKAGE_DIR=$(PER_PACKAGE_DIR) \
> + $(TOPDIR)/support/scripts/ppd-merge.sh \
> + $(2) $(3) $(4) $(1)
> endef
>
> # prepares the per-package HOST_DIR and TARGET_DIR of the current
> diff --git a/support/scripts/ppd-merge.sh b/support/scripts/ppd-merge.sh
> new file mode 100755
> index 0000000000..eb1caf9e52
> --- /dev/null
> +++ b/support/scripts/ppd-merge.sh
> @@ -0,0 +1,154 @@
> +#!/bin/bash
> +# Merge a set of Buildroot per-package-directories (PPD) into a single
> +# destination directory.
> +#
> +# ppd-merge scans through all the PPD and builds a table of which files from
> +# each source directory will be written to the destination directory. It feeds
> +# that table into rsync to tell it exactly which files to copy.
> +#
> +# Example:
> +# ppd-merge replaces the original method of merging PPD that ran many rsync
> +# commands like below.
> +#
> +# > rsync -a --link-dest=$PER_PACKAGE_DIR/pkg-1/host/ \
> +# > $PER_PACKAGE_DIR/pkg-1/host/ dest/
> +# > rsync -a --link-dest=$PER_PACKAGE_DIR/pkg-2/host/ \
> +# > $PER_PACKAGE_DIR/pkg-2/host/ dest/
> +# ...
> +# > rsync -a --link-dest=$PER_PACKAGE_DIR/pkg-n/host/ \
> +# > $PER_PACKAGE_DIR/pkg-n/host/ dest/
> +#
> +# The equivalent command with ppd-merge is
> +#
> +# > PER_PACKAGE_DIR=$PER_PACKAGE_DIR \
> +# > ppd-merge.sh host dest/ hardlink \
> +# > pkg-1 pkg-2 ... pkg-n
> +
> +set -euo pipefail
> +
> +usage() {
> + cat <<EOF >&2
> +Usage:
> + ppd-merge.sh <host|target> <destination> <hardlink|copy> <pkg-name>...
> +
> + PER_PACKAGE_DIR must be defined in the environment.
> +
> + Merges each directory for <pkg-name> at
> + "\$PER_PACKAGE_DIR/<pkg-name>/<host|target>/" into <destination>.
> +EOF
> + echo "Error: $*" >&2
> + exit 1
> +}
> +
> +search_subtrees() {
> + # Use `find` to walk all subtrees and print their contents in `rsync`
> + # '--relative' path syntax. Filter paths through awk so we discard any
> + # duplicate files found while walking subtrees.
> + (cd "$PER_PACKAGE_DIR" && find "${subtrees[@]}" -mindepth 1 -printf "%H/./%P\0") \
> + | awk '
> + BEGIN {
> + RS="\0";
> + ORS="\0";
> + FS="/\\./";
> + }
> + {
> + if (!($2 in found)) {
> + found[$2]=1;
> + print;
> + }
> + }'
> +}
> +
> +rsync_hardlink() {
> + # rsync hardlinking with --link-dest hard caps at 20 subtrees. Batch up the
> + # input files in $tmpdir until we have 20 subtrees, then flush them with an
> + # rsync call.
> + tmpdir=$(mktemp -d)
> + awk -v tmpdir="$tmpdir" '
> + function mkfiles() {
> + out_file=tmpdir "/" i ".files";
> + subtrees_file=tmpdir "/" i ".subtrees";
> + }
> + function batch_out() {
> + close(out_file);
> + close(subtrees_file);
> + delete subtrees;
> + print i;
> + i+=1;
> + mkfiles();
> + }
> + BEGIN {
> + RS="\0";
> + ORS="\0";
> + FS="/\\./";
> + i=0;
> + mkfiles();
> + }
> + !($1 in subtrees) && (length(subtrees) >= 20) {
> + batch_out();
> + }
> + {
> + print > out_file;
> + if (!($1 in subtrees)) {
> + subtrees[$1]=1;
> + print $1 > subtrees_file;
> + }
> + }
> + END {
> + if (length(subtrees) > 0) {
> + batch_out();
> + }
> + }' | while IFS= read -r -d $'\0' i; do
> + rsync_cmd_batch=( "${rsync_cmd[@]}" )
> + while IFS= read -r -d $'\0' subtree; do
> + rsync_cmd_batch+=( --link-dest="$PER_PACKAGE_DIR/$subtree/" )
> + done <"$tmpdir/$i.subtrees"
> + "${rsync_cmd_batch[@]}" "$PER_PACKAGE_DIR/" "$dest/" <"$tmpdir/$i.files"
> + done
> + rm -rf "$tmpdir"
> +}
> +
> +if [ "$#" -lt 3 ]; then
> + usage "Invalid number of arguments"
> +fi
> +type=$1
> +dest=$2
> +link=$3
> +shift 3
> +
> +case "$type" in
> +host|target)
> + ;;
> +*)
> + usage "Unknown type '$type'"
> + ;;
> +esac
> +
> +subtrees=()
> +for ((i = $#; i > 0; i--)); do
> + subtrees+=( "${!i}/$type" )
> +done
> +
> +if [ -z "${PER_PACKAGE_DIR:-}" ]; then
> + usage "PER_PACKAGE_DIR must be defined"
> +fi
> +
> +rsync_cmd=( rsync -a --hard-links --files-from=- --from0 )
> +
> +mkdir -p "$dest"
> +
> +if [ "${#subtrees[@]}" -eq 0 ]; then
> + exit 0
> +fi
> +
> +case "$link" in
> +hardlink)
> + search_subtrees | rsync_hardlink
> + ;;
> +copy)
> + search_subtrees | "${rsync_cmd[@]}" "$PER_PACKAGE_DIR/" "$dest/"
> + ;;
> +*)
> + usage "Unknown link command '$link'"
> + ;;
> +esac
> --
> 2.43.0
>
--
.-----------------.--------------------.------------------.--------------------.
| Yann E. MORIN | Real-Time Embedded | /"\ ASCII RIBBON | Erics' conspiracy: |
| +33 662 376 056 | Software Designer | \ / CAMPAIGN | ___ |
| +33 561 099 427 `------------.-------: X AGAINST | \e/ There is no |
| http://ymorin.is-a-geek.org/ | _/*\_ | / \ HTML MAIL | v conspiracy. |
'------------------------------^-------^------------------^--------------------'
_______________________________________________
buildroot mailing list
buildroot@buildroot.org
https://lists.buildroot.org/mailman/listinfo/buildroot
next prev parent reply other threads:[~2023-11-28 21:07 UTC|newest]
Thread overview: 10+ messages / expand[flat|nested] mbox.gz Atom feed top
2023-11-27 22:41 [Buildroot] [PATCH 1/1] ppd-merge: speed up per-package-rsync Brandon Maier via buildroot
2023-11-28 15:49 ` Herve Codina via buildroot
2023-11-28 17:21 ` [Buildroot] [External] " Maier, Brandon L Collins via buildroot
2023-11-28 17:54 ` Herve Codina via buildroot
2023-11-28 18:04 ` Maier, Brandon L Collins via buildroot
2023-11-28 21:07 ` Yann E. MORIN [this message]
2023-11-29 0:00 ` Maier, Brandon L Collins via buildroot
2023-11-29 8:17 ` [Buildroot] " Herve Codina via buildroot
2023-12-01 17:05 ` [Buildroot] [PATCH v2 " Brandon Maier via buildroot
2023-12-04 12:46 ` Herve Codina via buildroot
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20231128210717.GA107928@scaer \
--to=yann.morin.1998@free.fr \
--cc=brandon.maier@collins.com \
--cc=buildroot@buildroot.org \
--cc=herve.codina@bootlin.com \
--cc=thomas.petazzoni@bootlin.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox