All of lore.kernel.org
 help / color / mirror / Atom feed
From: Taylor Blau <me@ttaylorr.com>
To: Derrick Stolee via GitGitGadget <gitgitgadget@gmail.com>
Cc: git@vger.kernel.org, christian.couder@gmail.com,
	gitster@pobox.com, johannes.schindelin@gmx.de,
	johncai86@gmail.com, jonathantanmy@google.com,
	karthik.188@gmail.com, kristofferhaugsbakk@fastmail.com,
	newren@gmail.com, peff@peff.net, ps@pks.im,
	Derrick Stolee <stolee@gmail.com>
Subject: Re: [PATCH v2 11/13] pack-objects: thread the path-based compression
Date: Tue, 6 May 2025 21:33:35 -0400	[thread overview]
Message-ID: <aBq4b2R0pAS5PiZm@nand.local> (raw)
In-Reply-To: <ae73d26319ac5db338507988eb23f1ecaac67671.1742829770.git.gitgitgadget@gmail.com>

On Mon, Mar 24, 2025 at 03:22:47PM +0000, Derrick Stolee via GitGitGadget wrote:
> Using the Git repository as a test repo, the p5313 performance test
> shows that the resulting size of the repo is the same, but the threaded
> implementation gives gains of varying degrees depending on the number of
> objects being packed. (This was tested on a 16-core machine.)
>
> Test                        HEAD~1      HEAD
> ---------------------------------------------------
> 5313.20: big pack             2.38      1.99 -16.4%
> 5313.21: big pack size       16.1M     16.0M  -0.2%
> 5313.24: repack             107.32     45.41 -57.7%
> 5313.25: repack size        213.3M    213.2M  -0.0%

Very nice indeed!

> This ~60% reduction in 'git repack --path-walk' time is typical across
> all repos I used for testing. What is interesting is to compare when the
> overall time improves enough to outperform the --name-hash-version=1
> case. These time improvements correlate with repositories with data
> shapes that significantly improve their data size as well. The

I had to read this sentence a couple of times to wrap my head around it,
but perhaps you can double-check my understanding. Are you saying that
reducing the size of the resulting packs suggests that our runtime with
the path-walk algorithm is improved enough to be competitive with (or
better than) the non-path-walk implementation with name-hash v1?

I think that's what you mean, and I admit that I don't find my own
restatement to be all that much clearer. So I think it's fine to leave
this as-is, though it is interesting to think about why the statement
is true.

I suspect (without having tested it thoroughly) that a significant
proportion of the time saved on the path-walk side is because we found
better deltas that are easier to compress, as a result of a more
well-tuned search technique. I don't think you necessarily have to get
too far into the details here, but I was curious (not for commit message
polishing, but purely for my own curiosity) if you had any thoughts or
measurements of why this might be the case.

But...

> --path-walk feature frequently takes longer than --name-hash-verison=2,
> trading some extrac computation for some additional compression. The

s/extrac/extra/

> natural place where this additional computation comes from is the two
> compression passes that --path-walk takes, though the first pass is
> naturally faster due to the path boundaries avoiding a number of delta
> compression attempts.

... I should have kept reading before commenting the above! ;-)

> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
>  builtin/pack-objects.c | 163 ++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 161 insertions(+), 2 deletions(-)
>
> diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
> index d4e05ca4434..2a6246c1e78 100644
> --- a/builtin/pack-objects.c
> +++ b/builtin/pack-objects.c
> @@ -2964,6 +2964,7 @@ static void find_deltas(struct object_entry **list, unsigned *list_size,
>  struct thread_params {
>  	pthread_t thread;
>  	struct object_entry **list;
> +	struct packing_region *regions;
>  	unsigned list_size;
>  	unsigned remaining;
>  	int window;
> @@ -3281,6 +3282,164 @@ static void find_deltas_by_region(struct object_entry *list,
>  	stop_progress(&progress_state);
>  }
>
> +static void *threaded_find_deltas_by_path(void *arg)
> +{
> +	struct thread_params *me = arg;
> +
> +	progress_lock();
> +	while (me->remaining) {
> +		while (me->remaining) {

I am not quite following the double-while loop here. Could you help
clarify what's going on here? Is this due to the work-stealing below?

> +static void ll_find_deltas_by_region(struct object_entry *list,
> +				     struct packing_region *regions,
> +				     uint32_t start, uint32_t nr)
> +{
> +	struct thread_params *p;
> +	int i, ret, active_threads = 0;
> +	unsigned int processed = 0;
> +	uint32_t progress_nr;
> +	init_threaded_search();
> +
> +	if (!nr)
> +		return;
> +
> +	progress_nr =  regions[nr - 1].start + regions[nr - 1].nr;
> +	if (delta_search_threads <= 1) {
> +		find_deltas_by_region(list, regions, start, nr);
> +		cleanup_threaded_search();
> +		return;
> +	}
> +
> +	if (progress > pack_to_stdout)
> +		fprintf_ln(stderr, _("Path-based delta compression using up to %d threads"),
> +			   delta_search_threads);

This looks a copy-and-paste from the non-path-walk code, but I wonder if
it might be worth using Q_() here instead of _() to provide better
output in the case that delta_search_threads is 1.

The rest of the implementation of ll_find_deltas_by_region() looks good
to me.

Thanks,
Taylor

  reply	other threads:[~2025-05-07  1:33 UTC|newest]

Thread overview: 75+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-03-10  1:50 [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 01/13] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
2025-03-12 21:01   ` Taylor Blau
2025-03-20 19:48     ` Derrick Stolee
2025-03-10  1:50 ` [PATCH 02/13] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
2025-03-12 21:14   ` Taylor Blau
2025-03-20 19:46     ` Derrick Stolee
2025-03-10  1:50 ` [PATCH 03/13] pack-objects: update usage to match docs Derrick Stolee via GitGitGadget
2025-03-12 21:14   ` Taylor Blau
2025-03-10  1:50 ` [PATCH 04/13] p5313: add performance tests for --path-walk Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 05/13] pack-objects: introduce GIT_TEST_PACK_PATH_WALK Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 06/13] t5538: add tests to confirm deltas in shallow pushes Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 07/13] repack: add --path-walk option Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 08/13] pack-objects: enable --path-walk via config Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 09/13] scalar: enable path-walk during push " Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 10/13] pack-objects: refactor path-walk delta phase Derrick Stolee via GitGitGadget
2025-03-12 21:21   ` Taylor Blau
2025-03-20 19:57     ` Derrick Stolee
2025-03-10  1:50 ` [PATCH 11/13] pack-objects: thread the path-based compression Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 12/13] path-walk: add new 'edge_aggressive' option Derrick Stolee via GitGitGadget
2025-03-10  1:50 ` [PATCH 13/13] pack-objects: allow --shallow and --path-walk Derrick Stolee via GitGitGadget
2025-03-10 17:28 ` [PATCH 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Junio C Hamano
2025-03-12 20:47   ` Taylor Blau
2025-03-20 20:18     ` Derrick Stolee
2025-03-24 15:22 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2025-03-24 15:22   ` [PATCH v2 01/13] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
2025-05-02 22:48     ` Taylor Blau
2025-03-24 15:22   ` [PATCH v2 02/13] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
2025-05-02 23:21     ` Taylor Blau
2025-05-06 19:39       ` Derrick Stolee
2025-05-16 15:27         ` Derrick Stolee
2025-03-24 15:22   ` [PATCH v2 03/13] pack-objects: update usage to match docs Derrick Stolee via GitGitGadget
2025-03-24 15:22   ` [PATCH v2 04/13] p5313: add performance tests for --path-walk Derrick Stolee via GitGitGadget
2025-05-02 23:25     ` Taylor Blau
2025-03-24 15:22   ` [PATCH v2 05/13] pack-objects: introduce GIT_TEST_PACK_PATH_WALK Derrick Stolee via GitGitGadget
2025-05-02 23:31     ` Taylor Blau
2025-05-06 19:43       ` Derrick Stolee
2025-03-24 15:22   ` [PATCH v2 06/13] t5538: add tests to confirm deltas in shallow pushes Derrick Stolee via GitGitGadget
2025-05-02 23:34     ` Taylor Blau
2025-05-16 15:32       ` Derrick Stolee
2025-03-24 15:22   ` [PATCH v2 07/13] repack: add --path-walk option Derrick Stolee via GitGitGadget
2025-05-02 23:38     ` Taylor Blau
2025-03-24 15:22   ` [PATCH v2 08/13] pack-objects: enable --path-walk via config Derrick Stolee via GitGitGadget
2025-05-02 23:42     ` Taylor Blau
2025-05-06 19:46       ` Derrick Stolee
2025-05-16 15:41         ` Derrick Stolee
2025-03-24 15:22   ` [PATCH v2 09/13] scalar: enable path-walk during push " Derrick Stolee via GitGitGadget
2025-05-07  0:58     ` Taylor Blau
2025-03-24 15:22   ` [PATCH v2 10/13] pack-objects: refactor path-walk delta phase Derrick Stolee via GitGitGadget
2025-05-07  1:14     ` Taylor Blau
2025-05-16 16:27       ` Derrick Stolee
2025-05-29  0:17         ` Taylor Blau
2025-03-24 15:22   ` [PATCH v2 11/13] pack-objects: thread the path-based compression Derrick Stolee via GitGitGadget
2025-05-07  1:33     ` Taylor Blau [this message]
2025-03-24 15:22   ` [PATCH v2 12/13] path-walk: add new 'edge_aggressive' option Derrick Stolee via GitGitGadget
2025-03-24 15:22   ` [PATCH v2 13/13] pack-objects: allow --shallow and --path-walk Derrick Stolee via GitGitGadget
2025-05-02 21:24   ` [PATCH v2 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Junio C Hamano
2025-05-02 22:45     ` Taylor Blau
2025-05-02 23:44       ` Taylor Blau
2025-05-07  1:35         ` Taylor Blau
2025-05-16 18:11   ` [PATCH v3 " Derrick Stolee via GitGitGadget
2025-05-16 18:11     ` [PATCH v3 01/13] pack-objects: extract should_attempt_deltas() Derrick Stolee via GitGitGadget
2025-05-16 18:11     ` [PATCH v3 02/13] pack-objects: add --path-walk option Derrick Stolee via GitGitGadget
2025-05-16 18:11     ` [PATCH v3 03/13] pack-objects: update usage to match docs Derrick Stolee via GitGitGadget
2025-05-16 18:11     ` [PATCH v3 04/13] p5313: add performance tests for --path-walk Derrick Stolee via GitGitGadget
2025-05-16 18:11     ` [PATCH v3 05/13] pack-objects: introduce GIT_TEST_PACK_PATH_WALK Derrick Stolee via GitGitGadget
2025-05-16 18:11     ` [PATCH v3 06/13] t5538: add tests to confirm deltas in shallow pushes Derrick Stolee via GitGitGadget
2025-05-16 18:11     ` [PATCH v3 07/13] repack: add --path-walk option Derrick Stolee via GitGitGadget
2025-05-16 18:11     ` [PATCH v3 08/13] pack-objects: enable --path-walk via config Derrick Stolee via GitGitGadget
2025-05-16 18:11     ` [PATCH v3 09/13] scalar: enable path-walk during push " Derrick Stolee via GitGitGadget
2025-05-16 18:12     ` [PATCH v3 10/13] pack-objects: refactor path-walk delta phase Derrick Stolee via GitGitGadget
2025-05-16 18:12     ` [PATCH v3 11/13] pack-objects: thread the path-based compression Derrick Stolee via GitGitGadget
2025-05-16 18:12     ` [PATCH v3 12/13] path-walk: add new 'edge_aggressive' option Derrick Stolee via GitGitGadget
2025-05-16 18:12     ` [PATCH v3 13/13] pack-objects: allow --shallow and --path-walk Derrick Stolee via GitGitGadget
2025-05-29  0:20     ` [PATCH v3 00/13] PATH WALK II: Add --path-walk option to 'git pack-objects' Taylor Blau

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aBq4b2R0pAS5PiZm@nand.local \
    --to=me@ttaylorr.com \
    --cc=christian.couder@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=gitgitgadget@gmail.com \
    --cc=gitster@pobox.com \
    --cc=johannes.schindelin@gmx.de \
    --cc=johncai86@gmail.com \
    --cc=jonathantanmy@google.com \
    --cc=karthik.188@gmail.com \
    --cc=kristofferhaugsbakk@fastmail.com \
    --cc=newren@gmail.com \
    --cc=peff@peff.net \
    --cc=ps@pks.im \
    --cc=stolee@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.