Buildroot Archive on lore.kernel.org
 help / color / mirror / Atom feed
From: "Yann E. MORIN" <yann.morin.1998@free.fr>
To: Brandon Maier <brandon.maier@collins.com>
Cc: Herve Codina <herve.codina@bootlin.com>,
	Thomas Petazzoni <thomas.petazzoni@bootlin.com>,
	buildroot@buildroot.org
Subject: Re: [Buildroot] [PATCH 1/1] ppd-merge: speed up per-package-rsync
Date: Tue, 28 Nov 2023 22:07:17 +0100	[thread overview]
Message-ID: <20231128210717.GA107928@scaer> (raw)
In-Reply-To: <20231127224139.35969-1-brandon.maier@collins.com>

Brandon, All,

+Thomas for the intial work on this topic

On 2023-11-27 22:41 +0000, Brandon Maier spake thusly:
> The per-package-rsync stage can add a significant amount of time to
> builds. They can also be annoying as the slowest rsyncs, the final
> target-finalize and host-finalize stages, run on `make all` which we use
> frequently during development.
> 
> The per-package-rsync is slow because it launches a new rsync for each
> source tree, and each rsync must rescan the destination directory and
> potentially overwrite files multiple times. So instead we manually walk
> the source trees to find only the files that should be written to the
> destination, then feed that to a single instance of rsync.

Sorry, I haven't looked into the script in details, but why can't we
just do something like:

    rsync -a \
        --hard-links \
        $(if $(filter hardlink,$(4)), \
            --link-dest=$(PER_PACKAGE_DIR)/$(pkg)/$(2)/, \
            $(if $(filter copy,$(4)), \
                $(empty), \
                $(error per-package-rsync can only "copy" or "hardlink", not "$(4)") \
            ) \
        ) \
        $(patsubst %, $(PER_PACKAGE_DIR)/%/$(2)/, $(1)) \
        $(3)

rsync can take multiple sources, and is usually pretty fast, so I would
very much prefer that we do not invent our own tooling if a single rsync
can do the job.

Also, the copy situation is now only ever needed when we aggregate the
final target and host directories, while the hard-linking case is used
when assembling each per-package; so there is no point in benchmarking
the copy mode for the per-package preparation, or the hard-linking for
the final target and host assembling.

Hervé, Thomas, do you remember if we ever addressed using a single rsync
rather than multiple ones in your previous work on the topic, and if so,
why we did not use that?

Regards,
Yann E. MORIN.

>  create mode 100755 support/scripts/ppd-merge.sh
> 
> diff --git a/package/pkg-utils.mk b/package/pkg-utils.mk
> index 059e86ae0a..3968cabebd 100644
> --- a/package/pkg-utils.mk
> +++ b/package/pkg-utils.mk
> @@ -216,19 +216,9 @@ ifeq ($(BR2_PER_PACKAGE_DIRECTORIES),y)
>  # $3: destination directory
>  # $4: literal "copy" or "hardlink" to copy or hardlink files from src to dest
>  define per-package-rsync
> -	mkdir -p $(3)
> -	$(foreach pkg,$(1),\
> -		rsync -a \
> -			--hard-links \
> -			$(if $(filter hardlink,$(4)), \
> -				--link-dest=$(PER_PACKAGE_DIR)/$(pkg)/$(2)/, \
> -				$(if $(filter copy,$(4)), \
> -					$(empty), \
> -					$(error per-package-rsync can only "copy" or "hardlink", not "$(4)") \
> -				) \
> -			) \
> -			$(PER_PACKAGE_DIR)/$(pkg)/$(2)/ \
> -			$(3)$(sep))
> +	PER_PACKAGE_DIR=$(PER_PACKAGE_DIR) \
> +		$(TOPDIR)/support/scripts/ppd-merge.sh \
> +		$(2) $(3) $(4) $(1)
>  endef
>  
>  # prepares the per-package HOST_DIR and TARGET_DIR of the current
> diff --git a/support/scripts/ppd-merge.sh b/support/scripts/ppd-merge.sh
> new file mode 100755
> index 0000000000..eb1caf9e52
> --- /dev/null
> +++ b/support/scripts/ppd-merge.sh
> @@ -0,0 +1,154 @@
> +#!/bin/bash
> +# Merge a set of Buildroot per-package-directories (PPD) into a single
> +# destination directory.
> +#
> +# ppd-merge scans through all the PPD and builds a table of which files from
> +# each source directory will be written to the destination directory. It feeds
> +# that table into rsync to tell it exactly which files to copy.
> +#
> +# Example:
> +#   ppd-merge replaces the original method of merging PPD that ran many rsync
> +#   commands like below.
> +#
> +#   > rsync -a --link-dest=$PER_PACKAGE_DIR/pkg-1/host/ \
> +#   >     $PER_PACKAGE_DIR/pkg-1/host/ dest/
> +#   > rsync -a --link-dest=$PER_PACKAGE_DIR/pkg-2/host/ \
> +#   >     $PER_PACKAGE_DIR/pkg-2/host/ dest/
> +#   ...
> +#   > rsync -a --link-dest=$PER_PACKAGE_DIR/pkg-n/host/ \
> +#   >     $PER_PACKAGE_DIR/pkg-n/host/ dest/
> +#
> +#   The equivalent command with ppd-merge is
> +#
> +#   > PER_PACKAGE_DIR=$PER_PACKAGE_DIR \
> +#   >     ppd-merge.sh host dest/ hardlink \
> +#   >     pkg-1 pkg-2 ... pkg-n
> +
> +set -euo pipefail
> +
> +usage() {
> +    cat <<EOF >&2
> +Usage:
> +  ppd-merge.sh <host|target> <destination> <hardlink|copy> <pkg-name>...
> +
> +  PER_PACKAGE_DIR must be defined in the environment.
> +
> +  Merges each directory for <pkg-name> at
> +  "\$PER_PACKAGE_DIR/<pkg-name>/<host|target>/" into <destination>.
> +EOF
> +    echo "Error: $*" >&2
> +    exit 1
> +}
> +
> +search_subtrees() {
> +    # Use `find` to walk all subtrees and print their contents in `rsync`
> +    # '--relative' path syntax. Filter paths through awk so we discard any
> +    # duplicate files found while walking subtrees.
> +    (cd "$PER_PACKAGE_DIR" && find "${subtrees[@]}" -mindepth 1 -printf "%H/./%P\0") \
> +        | awk '
> +        BEGIN {
> +            RS="\0";
> +            ORS="\0";
> +            FS="/\\./";
> +        }
> +        {
> +            if (!($2 in found)) {
> +                found[$2]=1;
> +                print;
> +            }
> +        }'
> +}
> +
> +rsync_hardlink() {
> +    # rsync hardlinking with --link-dest hard caps at 20 subtrees. Batch up the
> +    # input files in $tmpdir until we have 20 subtrees, then flush them with an
> +    # rsync call.
> +    tmpdir=$(mktemp -d)
> +    awk -v tmpdir="$tmpdir" '
> +        function mkfiles() {
> +            out_file=tmpdir "/" i ".files";
> +            subtrees_file=tmpdir "/" i ".subtrees";
> +        }
> +        function batch_out() {
> +            close(out_file);
> +            close(subtrees_file);
> +            delete subtrees;
> +            print i;
> +            i+=1;
> +            mkfiles();
> +        }
> +        BEGIN {
> +            RS="\0";
> +            ORS="\0";
> +            FS="/\\./";
> +            i=0;
> +            mkfiles();
> +        }
> +        !($1 in subtrees) && (length(subtrees) >= 20) {
> +            batch_out();
> +        }
> +        {
> +            print > out_file;
> +            if (!($1 in subtrees)) {
> +                subtrees[$1]=1;
> +                print $1 > subtrees_file;
> +            }
> +        }
> +        END {
> +            if (length(subtrees) > 0) {
> +                batch_out();
> +            }
> +        }' | while IFS= read -r -d $'\0' i; do
> +            rsync_cmd_batch=( "${rsync_cmd[@]}" )
> +            while IFS= read -r -d $'\0' subtree; do
> +                rsync_cmd_batch+=( --link-dest="$PER_PACKAGE_DIR/$subtree/" )
> +            done <"$tmpdir/$i.subtrees"
> +            "${rsync_cmd_batch[@]}" "$PER_PACKAGE_DIR/" "$dest/" <"$tmpdir/$i.files"
> +        done
> +    rm -rf "$tmpdir"
> +}
> +
> +if [ "$#" -lt 3 ]; then
> +    usage "Invalid number of arguments"
> +fi
> +type=$1
> +dest=$2
> +link=$3
> +shift 3
> +
> +case "$type" in
> +host|target)
> +    ;;
> +*)
> +    usage "Unknown type '$type'"
> +    ;;
> +esac
> +
> +subtrees=()
> +for ((i = $#; i > 0; i--)); do
> +    subtrees+=( "${!i}/$type" )
> +done
> +
> +if [ -z "${PER_PACKAGE_DIR:-}" ]; then
> +    usage "PER_PACKAGE_DIR must be defined"
> +fi
> +
> +rsync_cmd=( rsync -a --hard-links --files-from=- --from0 )
> +
> +mkdir -p "$dest"
> +
> +if [ "${#subtrees[@]}" -eq 0 ]; then
> +    exit 0
> +fi
> +
> +case "$link" in
> +hardlink)
> +    search_subtrees | rsync_hardlink
> +    ;;
> +copy)
> +    search_subtrees | "${rsync_cmd[@]}" "$PER_PACKAGE_DIR/" "$dest/"
> +    ;;
> +*)
> +    usage "Unknown link command '$link'"
> +    ;;
> +esac
> -- 
> 2.43.0
> 

-- 
.-----------------.--------------------.------------------.--------------------.
|  Yann E. MORIN  | Real-Time Embedded | /"\ ASCII RIBBON | Erics' conspiracy: |
| +33 662 376 056 | Software  Designer | \ / CAMPAIGN     |  ___               |
| +33 561 099 427 `------------.-------:  X  AGAINST      |  \e/  There is no  |
| http://ymorin.is-a-geek.org/ | _/*\_ | / \ HTML MAIL    |   v   conspiracy.  |
'------------------------------^-------^------------------^--------------------'
_______________________________________________
buildroot mailing list
buildroot@buildroot.org
https://lists.buildroot.org/mailman/listinfo/buildroot

  parent reply	other threads:[~2023-11-28 21:07 UTC|newest]

Thread overview: 10+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-11-27 22:41 [Buildroot] [PATCH 1/1] ppd-merge: speed up per-package-rsync Brandon Maier via buildroot
2023-11-28 15:49 ` Herve Codina via buildroot
2023-11-28 17:21   ` [Buildroot] [External] " Maier, Brandon L Collins via buildroot
2023-11-28 17:54     ` Herve Codina via buildroot
2023-11-28 18:04       ` Maier, Brandon L Collins via buildroot
2023-11-28 21:07 ` Yann E. MORIN [this message]
2023-11-29  0:00   ` Maier, Brandon L Collins via buildroot
2023-11-29  8:17   ` [Buildroot] " Herve Codina via buildroot
2023-12-01 17:05 ` [Buildroot] [PATCH v2 " Brandon Maier via buildroot
2023-12-04 12:46   ` Herve Codina via buildroot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20231128210717.GA107928@scaer \
    --to=yann.morin.1998@free.fr \
    --cc=brandon.maier@collins.com \
    --cc=buildroot@buildroot.org \
    --cc=herve.codina@bootlin.com \
    --cc=thomas.petazzoni@bootlin.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox