[PATCH 0/2] [RFC] for-each-ref: add --count-matches mode

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2] [RFC] for-each-ref: add --count-matches mode
@ 2023-06-26 15:09 Derrick Stolee via GitGitGadget
  2023-06-26 15:09 ` [PATCH 1/2] for-each-ref: extract ref output loop Derrick Stolee via GitGitGadget
  2023-06-26 15:09 ` [PATCH 2/2] for-each-ref: add --count-matches option Derrick Stolee via GitGitGadget
  0 siblings, 2 replies; 12+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2023-06-26 15:09 UTC (permalink / raw)
  To: git; +Cc: vdye, gitster, me, mjcheetham, Derrick Stolee

I'm leaving this as an RFC for now because I can't decide if this new option
in git for-each-ref is good or if this needs an entirely new builtin. I'm
open to whatever people think is best, I'd just like a way to count matches
based on refspecs.

Thanks, -Stolee

Derrick Stolee (2):
  for-each-ref: extract ref output loop
  for-each-ref: add --count-matches option

 Documentation/git-for-each-ref.txt |  5 ++
 builtin/for-each-ref.c             | 80 +++++++++++++++++++++---------
 ref-filter.c                       | 47 ++++++++++++++++++
 ref-filter.h                       |  7 +++
 t/perf/p1501-ref-iteration.sh      | 35 +++++++++++++
 t/t6300-for-each-ref.sh            | 28 +++++++++++
 6 files changed, 179 insertions(+), 23 deletions(-)
 create mode 100755 t/perf/p1501-ref-iteration.sh


base-commit: d7d8841f67f29e6ecbad85a11805c907d0f00d5d
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1548%2Fderrickstolee%2Ffor-each-ref-count-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1548/derrickstolee/for-each-ref-count-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1548
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/2] for-each-ref: extract ref output loop
  2023-06-26 15:09 [PATCH 0/2] [RFC] for-each-ref: add --count-matches mode Derrick Stolee via GitGitGadget
@ 2023-06-26 15:09 ` Derrick Stolee via GitGitGadget
  2023-06-26 15:09 ` [PATCH 2/2] for-each-ref: add --count-matches option Derrick Stolee via GitGitGadget
  1 sibling, 0 replies; 12+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2023-06-26 15:09 UTC (permalink / raw)
  To: git; +Cc: vdye, gitster, me, mjcheetham, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

In preparation for a new output mode of 'git for-each-ref', extract the
loop that outputs references into a static method. No functional change
here.

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 builtin/for-each-ref.c | 58 ++++++++++++++++++++++++++----------------
 1 file changed, 36 insertions(+), 22 deletions(-)

diff --git a/builtin/for-each-ref.c b/builtin/for-each-ref.c
index 695fc8f4a5e..ce34940e3e6 100644
--- a/builtin/for-each-ref.c
+++ b/builtin/for-each-ref.c
@@ -17,17 +17,47 @@ static char const * const for_each_ref_usage[] = {
 	NULL
 };
 
-int cmd_for_each_ref(int argc, const char **argv, const char *prefix)
+static void filter_and_output_refs(struct repository *r,
+				   struct ref_array *array,
+				   struct ref_filter *filter,
+				   struct ref_format *format,
+				   struct ref_sorting *sorting,
+				   int maxcount,
+				   int omit_empty)
 {
 	int i;
+	struct strbuf err = STRBUF_INIT;
+	struct strbuf output = STRBUF_INIT;
+
+	filter_refs(array, filter, FILTER_REFS_ALL);
+	filter_ahead_behind(r, format, array);
+
+	ref_array_sort(sorting, array);
+
+	if (!maxcount || array->nr < maxcount)
+		maxcount = array->nr;
+	for (i = 0; i < maxcount; i++) {
+		strbuf_reset(&err);
+		strbuf_reset(&output);
+		if (format_ref_array_item(array->items[i], format, &output, &err))
+			die("%s", err.buf);
+		fwrite(output.buf, 1, output.len, stdout);
+		if (output.len || !omit_empty)
+			putchar('\n');
+	}
+
+	strbuf_release(&err);
+	strbuf_release(&output);
+}
+
+int cmd_for_each_ref(int argc, const char **argv, const char *prefix)
+{
 	struct ref_sorting *sorting;
 	struct string_list sorting_options = STRING_LIST_INIT_DUP;
-	int maxcount = 0, icase = 0, omit_empty = 0;
+	int maxcount = 0, icase = 0, omit_empty = 0, count_matches = 0;
 	struct ref_array array;
 	struct ref_filter filter;
 	struct ref_format format = REF_FORMAT_INIT;
-	struct strbuf output = STRBUF_INIT;
-	struct strbuf err = STRBUF_INIT;
 	int from_stdin = 0;
 	struct strvec vec = STRVEC_INIT;
 
@@ -101,25 +131,9 @@ int cmd_for_each_ref(int argc, const char **argv, const char *prefix)
 	}
 
 	filter.match_as_path = 1;
-	filter_refs(&array, &filter, FILTER_REFS_ALL);
-	filter_ahead_behind(the_repository, &format, &array);
+	filter_and_output_refs(the_repository, &array, &filter, &format,
+			       sorting, maxcount, omit_empty);
 
-	ref_array_sort(sorting, &array);
-
-	if (!maxcount || array.nr < maxcount)
-		maxcount = array.nr;
-	for (i = 0; i < maxcount; i++) {
-		strbuf_reset(&err);
-		strbuf_reset(&output);
-		if (format_ref_array_item(array.items[i], &format, &output, &err))
-			die("%s", err.buf);
-		fwrite(output.buf, 1, output.len, stdout);
-		if (output.len || !omit_empty)
-			putchar('\n');
-	}
-
-	strbuf_release(&err);
-	strbuf_release(&output);
 	ref_array_clear(&array);
 	free_commit_list(filter.with_commit);
 	free_commit_list(filter.no_commit);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* [PATCH 2/2] for-each-ref: add --count-matches option
  2023-06-26 15:09 [PATCH 0/2] [RFC] for-each-ref: add --count-matches mode Derrick Stolee via GitGitGadget
  2023-06-26 15:09 ` [PATCH 1/2] for-each-ref: extract ref output loop Derrick Stolee via GitGitGadget
@ 2023-06-26 15:09 ` Derrick Stolee via GitGitGadget
  2023-06-26 16:14   ` Junio C Hamano
  2023-06-27  7:30   ` Jeff King
  1 sibling, 2 replies; 12+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2023-06-26 15:09 UTC (permalink / raw)
  To: git; +Cc: vdye, gitster, me, mjcheetham, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <derrickstolee@github.com>

In order to count references of different types based on their initial
prefixes, there are two current approaches:

 1. Run 'git for-each-ref' on all refs and parse the output to count
    those that match each prefix.

 2. Run 'git for-each-ref "prefix/"' and pipe that output to 'wc -l' to
    count the number of output lines.

Each of these approaches is wasteful as it requires sending the list of
matching reference names across a pipe plus the cost of parsing that
output.

Instead, it would be helpful to have a Git command that counts the
number of refs matching a list of patterns.

This change adds a new mode, '--count-matches' to 'git for-each-ref' so
we can make use of the existing infrastructure around parsing refspecs
in the correct places. '--count' is already taken as a "maximum number"
of refs to output.

An alternative approach could be to make a brand-new builtin that is
focused on counting ref matches. This would involve duplicating a bit of
code around parsing refpecs, but would not be terribly difficult. The
actual overlap of implementation here with 'git for-each-ref' is small
enough that we could instead extract this elsewhere. My gut feeling is
that this behavior doesn't merit a new builtin.

The implementation is extremely simple: iterate through all references
and compare each ref to each refspec. On a match, increment a counter
value for that refspec.

In the end, output each refspec followed by the count.

If all given refspecs were prefixes, then it is tempting to instead use
a counting behavior that we can use in things like "how many OIDs start
with this short hex string?" Navigating to the start and end of the
range of refs starting with that prefix is possible in the packed-refs
file. However, the records do not have a constant size so we cannot
infer the number of references in that range using the current format.
(Perhaps, in the future we will have a ref storage system that allows
this kind of counting to be easy to do in O(log N) time.)

A new performance test is included to check the performance of iterating
through these references and counting them appropriately. This presents
a 3x improvement over the trivial piping through to 'wc -l', and that
assumes there is a single pattern to match instead of multiple. We can
see that testing three patterns sequentially adds to the total time, but
doing a single process with --count-match continues to be as fast. (It's
difficult to tell since it _also_ matches the sum of the three for this
example repo.)

Test                                                         this tree
----------------------------------------------------------------------------
1501.2: count refs/heads/: git for-each-ref | wc -l          0.01(0.00+0.01)
1501.3: count refs/heads/: git for-each-ref --count-match    0.00(0.00+0.00)
1501.4: count refs/tags/: git for-each-ref | wc -l           0.02(0.00+0.02)
1501.5: count refs/tags/: git for-each-ref --count-match     0.00(0.00+0.00)
1501.6: count refs/remotes: git for-each-ref | wc -l         0.15(0.08+0.07)
1501.7: count refs/remotes: git for-each-ref --count-match   0.04(0.01+0.02)
1501.8: count all patterns: git for-each-ref | wc -l         0.18(0.08+0.10)
1501.9: count all patterns: git for-each-ref --count-match   0.04(0.02+0.02)

Signed-off-by: Derrick Stolee <derrickstolee@github.com>
---
 Documentation/git-for-each-ref.txt |  5 ++++
 builtin/for-each-ref.c             | 26 +++++++++++++++--
 ref-filter.c                       | 47 ++++++++++++++++++++++++++++++
 ref-filter.h                       |  7 +++++
 t/perf/p1501-ref-iteration.sh      | 35 ++++++++++++++++++++++
 t/t6300-for-each-ref.sh            | 28 ++++++++++++++++++
 6 files changed, 145 insertions(+), 3 deletions(-)
 create mode 100755 t/perf/p1501-ref-iteration.sh

diff --git a/Documentation/git-for-each-ref.txt b/Documentation/git-for-each-ref.txt
index 1e215d4e734..74018755ed5 100644
--- a/Documentation/git-for-each-ref.txt
+++ b/Documentation/git-for-each-ref.txt
@@ -42,6 +42,11 @@ OPTIONS
 	`<pattern>`.  This option makes it stop after showing
 	that many refs.
 
+--count-matches::
+	Instead of listing references according to the given format, output
+	the number of references that match each pattern. Incompatible with
+	`--format`, `--sort`, and `--count`.
+
 --sort=<key>::
 	A field name to sort on.  Prefix `-` to sort in
 	descending order of the value.  When unspecified,
diff --git a/builtin/for-each-ref.c b/builtin/for-each-ref.c
index ce34940e3e6..e3db719bb87 100644
--- a/builtin/for-each-ref.c
+++ b/builtin/for-each-ref.c
@@ -50,6 +50,17 @@ static void filter_and_output_refs(struct repository *r,
 	strbuf_release(&output);
 }
 
+static void count_and_output_patterns(struct ref_filter *filter)
+{
+	uint32_t *counts = count_ref_patterns(filter);
+
+	for (int i = 0; filter->name_patterns && filter->name_patterns[i]; i++)
+		fprintf(stdout, "%s %"PRIu32"\n",
+			filter->name_patterns[i], counts[i]);
+
+	free(counts);
+}
+
 int cmd_for_each_ref(int argc, const char **argv, const char *prefix)
 {
 	struct ref_sorting *sorting;
@@ -60,6 +71,7 @@ int cmd_for_each_ref(int argc, const char **argv, const char *prefix)
 	struct ref_format format = REF_FORMAT_INIT;
 	int from_stdin = 0;
 	struct strvec vec = STRVEC_INIT;
+	const char *initial_format = "%(objectname) %(objecttype)\t%(refname)";
 
 	struct option opts[] = {
 		OPT_BIT('s', "shell", &format.quote_style,
@@ -72,6 +84,8 @@ int cmd_for_each_ref(int argc, const char **argv, const char *prefix)
 			N_("quote placeholders suitably for Tcl"), QUOTE_TCL),
 		OPT_BOOL(0, "omit-empty",  &omit_empty,
 			N_("do not output a newline after empty formatted refs")),
+		OPT_BOOL(0, "count-matches", &count_matches,
+			N_("output number of references matching each pattern instead of any other output")),
 
 		OPT_GROUP(""),
 		OPT_INTEGER( 0 , "count", &maxcount, N_("show only <n> matched refs")),
@@ -93,7 +107,7 @@ int cmd_for_each_ref(int argc, const char **argv, const char *prefix)
 	memset(&array, 0, sizeof(array));
 	memset(&filter, 0, sizeof(filter));
 
-	format.format = "%(objectname) %(objecttype)\t%(refname)";
+	format.format = initial_format;
 
 	git_config(git_default_config, NULL);
 
@@ -102,6 +116,9 @@ int cmd_for_each_ref(int argc, const char **argv, const char *prefix)
 		error("invalid --count argument: `%d'", maxcount);
 		usage_with_options(for_each_ref_usage, opts);
 	}
+	if (count_matches &&
+	    (maxcount || format.format != initial_format || sorting_options.nr))
+		die("--count-matches incompatible with --count, --format, or --sort");
 	if (HAS_MULTI_BITS(format.quote_style)) {
 		error("more than one quoting style?");
 		usage_with_options(for_each_ref_usage, opts);
@@ -131,8 +148,11 @@ int cmd_for_each_ref(int argc, const char **argv, const char *prefix)
 	}
 
 	filter.match_as_path = 1;
-	filter_and_output_refs(the_repository, &array, &filter, &format,
-			       sorting, maxcount, omit_empty);
+	if (count_matches)
+		count_and_output_patterns(&filter);
+	else
+		filter_and_output_refs(the_repository, &array, &filter, &format,
+				       sorting, maxcount, omit_empty);
 
 	ref_array_clear(&array);
 	free_commit_list(filter.with_commit);
diff --git a/ref-filter.c b/ref-filter.c
index 4991cd4f7a8..4e02a4ae98d 100644
--- a/ref-filter.c
+++ b/ref-filter.c
@@ -2560,6 +2560,53 @@ int filter_refs(struct ref_array *array, struct ref_filter *filter, unsigned int
 	return ret;
 }
 
+struct filter_and_counts {
+	struct ref_filter *filter;
+	uint32_t *counts;
+};
+
+static int ref_filter_counter(const char *refname,
+			      const struct object_id *oid,
+			      int flag, void *cb_data)
+{
+	struct filter_and_counts *fc = cb_data;
+	const char **pattern = fc->filter->name_patterns;
+	size_t namelen = strlen(refname);
+	unsigned flags = fc->filter->ignore_case ? WM_CASEFOLD : 0;
+
+	for (int i = 0; *pattern; i++, pattern++) {
+		const char *p = *pattern;
+		int plen = strlen(p);
+
+		if ((plen <= namelen) &&
+		    !strncmp(refname, p, plen) &&
+		    (refname[plen] == '\0' ||
+		     refname[plen] == '/' ||
+		     p[plen-1] == '/'))
+			fc->counts[i]++;
+		else if (!wildmatch(p, refname, flags))
+			fc->counts[i]++;
+	}
+	return 0;
+}
+
+uint32_t *count_ref_patterns(struct ref_filter *filter)
+{
+	int size = 0;
+	struct filter_and_counts fc = {
+		.filter = filter,
+	};
+
+	while (filter->name_patterns[size])
+		size++;
+
+	CALLOC_ARRAY(fc.counts, size);
+
+	for_each_fullref_in_pattern(filter, ref_filter_counter, &fc);
+
+	return fc.counts;
+}
+
 static int compare_detached_head(struct ref_array_item *a, struct ref_array_item *b)
 {
 	if (!(a->kind ^ b->kind))
diff --git a/ref-filter.h b/ref-filter.h
index 430701cfb76..b7e5a4f6a80 100644
--- a/ref-filter.h
+++ b/ref-filter.h
@@ -141,6 +141,13 @@ char *get_head_description(void);
 /*  Set up translated strings in the output. */
 void setup_ref_filter_porcelain_msg(void);
 
+/*
+ * Iterate over all references, counting how many match each filter
+ * pattern. Returns an array of the counts with the ith count matching the
+ * ith filter->name_pattern entry.
+ */
+uint32_t *count_ref_patterns(struct ref_filter *filter);
+
 /*
  * Print a single ref, outside of any ref-filter. Note that the
  * name must be a fully qualified refname.
diff --git a/t/perf/p1501-ref-iteration.sh b/t/perf/p1501-ref-iteration.sh
new file mode 100755
index 00000000000..609487d20b0
--- /dev/null
+++ b/t/perf/p1501-ref-iteration.sh
@@ -0,0 +1,35 @@
+#!/bin/sh
+
+test_description='Ref iteration performance tests'
+. ./perf-lib.sh
+
+test_perf_large_repo
+
+# Optimize ref backend store
+test_expect_success 'setup' '
+	git pack-refs
+'
+
+for pattern in "refs/heads/" "refs/tags/" "refs/remotes"
+do
+	test_perf "count $pattern: git for-each-ref | wc -l" "
+		git for-each-ref $pattern | wc -l
+	"
+
+	test_perf "count $pattern: git for-each-ref --count-match" "
+		git for-each-ref --count-matches $pattern
+	"
+done
+
+test_perf "count all patterns: git for-each-ref | wc -l" "
+	git for-each-ref refs/heads/ | wc -l &&
+	git for-each-ref refs/tags/ | wc -l &&
+	git for-each-ref refs/remotes/ | wc -l
+"
+
+test_perf "count all patterns: git for-each-ref --count-match" "
+	git for-each-ref --count-matches \
+		refs/heads/ refs/tags/ refs/remotes/
+"
+
+test_done
diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh
index 5c00607608a..001382956e4 100755
--- a/t/t6300-for-each-ref.sh
+++ b/t/t6300-for-each-ref.sh
@@ -486,6 +486,34 @@ for i in "--perl --shell" "-s --python" "--python --tcl" "--tcl --perl"; do
 	"
 done
 
+test_expect_success '--count-matches incompatible with some options' '
+	for opt in "--format=x" "--sort=refname" "--count=10"
+	do
+		test_must_fail git for-each-ref --count-matches $opt refs/heads/ 2>err &&
+		grep "count-matches incompatible" err || return 1
+	done
+'
+
+test_expect_success '--count-matches tallies the number matching each refspec' '
+	git init multi-refs &&
+	test_commit -C multi-refs A &&
+	git -C multi-refs branch A &&
+	git -C multi-refs branch pre/A &&
+	test_commit -C multi-refs --no-tag B &&
+	git -C multi-refs branch B &&
+	git -C multi-refs for-each-ref --count-matches \
+		refs/heads/ refs/heads/pre/ refs/tags/ "*A*" >actual &&
+
+	cat >expect <<-EOF &&
+	refs/heads/ 4
+	refs/heads/pre/ 1
+	refs/tags/ 1
+	*A* 3
+	EOF
+
+	test_cmp expect actual
+'
+
 test_expect_success 'setup for upstream:track[short]' '
 	test_commit two
 '
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] for-each-ref: add --count-matches option
  2023-06-26 15:09 ` [PATCH 2/2] for-each-ref: add --count-matches option Derrick Stolee via GitGitGadget
@ 2023-06-26 16:14   ` Junio C Hamano
  2023-06-27  7:30   ` Jeff King
  1 sibling, 0 replies; 12+ messages in thread
From: Junio C Hamano @ 2023-06-26 16:14 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, vdye, me, mjcheetham, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Each of these approaches is wasteful as it requires sending the list of
> matching reference names across a pipe plus the cost of parsing that
> output.
>
> Instead, it would be helpful to have a Git command that counts the
> number of refs matching a list of patterns.

For for-each-ref (which can be reused by branch and tag for no
cost), I am on the fence but slightly in favor (partly because the
code has already been written).  But I have to wonder where changes
that come from the above reasoning need to end, though.

Do we count branches and tags because "git branch --list | wc -l" is
too costly?  Should we teach "git remote" the same trick?  How about
"git stash list"?  "git show-index $pack_index"?  "git ls-files"?

How do we decide where to draw the line?  When a command invocation
in a large repository can produce records exceeding a million?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] for-each-ref: add --count-matches option
  2023-06-26 15:09 ` [PATCH 2/2] for-each-ref: add --count-matches option Derrick Stolee via GitGitGadget
  2023-06-26 16:14   ` Junio C Hamano
@ 2023-06-27  7:30   ` Jeff King
  2023-06-27 10:05     ` Phillip Wood
  2023-07-10 16:51     ` Derrick Stolee
  1 sibling, 2 replies; 12+ messages in thread
From: Jeff King @ 2023-06-27  7:30 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget
  Cc: Junio C Hamano, git, vdye, me, mjcheetham, Derrick Stolee

On Mon, Jun 26, 2023 at 03:09:57PM +0000, Derrick Stolee via GitGitGadget wrote:

> +for pattern in "refs/heads/" "refs/tags/" "refs/remotes"
> +do
> +	test_perf "count $pattern: git for-each-ref | wc -l" "
> +		git for-each-ref $pattern | wc -l
> +	"
> +
> +	test_perf "count $pattern: git for-each-ref --count-match" "
> +		git for-each-ref --count-matches $pattern
> +	"
> +done

I don't think this is a very realistic perf test, because for-each-ref
is doing a bunch of work to generate its default format, only to have
"wc" throw most of it away. Doing:

  git for-each-ref --format='%(refname)' | wc -l

is much better (obviously you have to remember to do that if you care
about optimizing your command, but that's true of --count-matches, too).

Running hyperfine with three variants shows that the command above is
competitive with --count-matches, though slightly slower (hyperfine
complains about short commands and outliers because these runtimes are
so tiny in the first place; I omitted those warnings from the output
below for readability):

  Benchmark 1: ./git-for-each-ref refs/remotes/ | wc -l
    Time (mean ± σ):       6.1 ms ±   0.2 ms    [User: 3.0 ms, System: 3.6 ms]
    Range (min … max):     5.6 ms …   7.1 ms    397 runs
  
  Benchmark 2: ./git-for-each-ref --format="%(refname)" refs/remotes/ | wc -l
    Time (mean ± σ):       3.3 ms ±   0.2 ms    [User: 2.2 ms, System: 1.5 ms]
    Range (min … max):     3.0 ms …   4.0 ms    774 runs
  
  Benchmark 3: ./git-for-each-ref --count-matches refs/remotes/
    Time (mean ± σ):       2.4 ms ±   0.1 ms    [User: 1.5 ms, System: 0.9 ms]
    Range (min … max):     2.2 ms …   3.4 ms    1018 runs
  
  Summary
    './git-for-each-ref --count-matches refs/remotes/' ran
      1.33 ± 0.10 times faster than './git-for-each-ref --format="%(refname)" refs/remotes/ | wc -l'
      2.48 ± 0.17 times faster than './git-for-each-ref refs/remotes/ | wc -l'

I will note this is an unloaded multi-core system, which gives the piped
version a slight edge. Total CPU is probably more interesting than
wall-clock time, but all of these are so short that I think the results
should be taken with a pretty big grain of salt (I had to switch from
the "powersave" to "performance" CPU governor just to get more
consistent results).

-Peff

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] for-each-ref: add --count-matches option
  2023-06-27  7:30   ` Jeff King
@ 2023-06-27 10:05     ` Phillip Wood
  2023-06-27 18:22       ` Junio C Hamano
                         ` (2 more replies)
  2023-07-10 16:51     ` Derrick Stolee
  1 sibling, 3 replies; 12+ messages in thread
From: Phillip Wood @ 2023-06-27 10:05 UTC (permalink / raw)
  To: Jeff King, Derrick Stolee via GitGitGadget
  Cc: Junio C Hamano, git, vdye, me, mjcheetham, Derrick Stolee

On 27/06/2023 08:30, Jeff King wrote:
> On Mon, Jun 26, 2023 at 03:09:57PM +0000, Derrick Stolee via GitGitGadget wrote:
> 
>> +for pattern in "refs/heads/" "refs/tags/" "refs/remotes"
>> +do
>> +	test_perf "count $pattern: git for-each-ref | wc -l" "
>> +		git for-each-ref $pattern | wc -l
>> +	"
>> +
>> +	test_perf "count $pattern: git for-each-ref --count-match" "
>> +		git for-each-ref --count-matches $pattern
>> +	"
>> +done
> 
> I don't think this is a very realistic perf test, because for-each-ref
> is doing a bunch of work to generate its default format, only to have
> "wc" throw most of it away. Doing:
> 
>    git for-each-ref --format='%(refname)' | wc -l

That's a good point. I wondered if using a short fixed format string was 
even better so I tried

git init test
cd test
git commit --allow-empty -m initial
seq 0 100000 | sed "s:\(.*\):create refs/heads/some-prefix/\1 $(git 
rev-parse HEAD):" | git update-ref --stdin
git pack-refs --all
hyperfine -L fmt "","--format=%\(refname\)","--format=x" 'git 
for-each-ref {fmt} refs/heads/ | wc -l'

Which gives

Benchmark 1: git for-each-ref  refs/heads/ | wc -l
   Time (mean ± σ):      1.150 s ±  0.010 s    [User: 0.494 s, System: 
0.637 s]
   Range (min … max):    1.140 s …  1.170 s    10 runs

Benchmark 2: git for-each-ref --format=%\(refname\) refs/heads/ | wc -l
   Time (mean ± σ):      66.0 ms ±   0.3 ms    [User: 58.9 ms, System: 
9.5 ms]
   Range (min … max):    65.2 ms …  67.1 ms    43 runs

Benchmark 3: git for-each-ref --format=x refs/heads/ | wc -l
   Time (mean ± σ):      63.0 ms ±   0.5 ms    [User: 54.3 ms, System: 
9.6 ms]
   Range (min … max):    62.3 ms …  65.4 ms    44 runs

Summary
   git for-each-ref --format=x refs/heads/ | wc -l ran
     1.05 ± 0.01 times faster than git for-each-ref 
--format=%\(refname\) refs/heads/ | wc -l
    18.25 ± 0.20 times faster than git for-each-ref  refs/heads/ | wc -l

So on my somewhat slower machine the default format is over an order of 
magnitude slower than using either --format=%(refname) or --format=x and 
the short fixed format is marginally faster. I haven't applied stolee's 
patch but the 3 or 4 times improvement mentioned in the commit message 
seems likely to be from not processing the default format. One thing to 
note is that we're not comparing like-with-like when more than one 
pattern is given as --count-matches gives a separate count for each pattern.

I'm a bit suspicious of the massive speed up I'm seeing by avoiding the 
default format but it appears to be repeatable.

Best Wishes

Phillip

> is much better (obviously you have to remember to do that if you care
> about optimizing your command, but that's true of --count-matches, too).
> 
> Running hyperfine with three variants shows that the command above is
> competitive with --count-matches, though slightly slower (hyperfine
> complains about short commands and outliers because these runtimes are
> so tiny in the first place; I omitted those warnings from the output
> below for readability):
> 
>    Benchmark 1: ./git-for-each-ref refs/remotes/ | wc -l
>      Time (mean ± σ):       6.1 ms ±   0.2 ms    [User: 3.0 ms, System: 3.6 ms]
>      Range (min … max):     5.6 ms …   7.1 ms    397 runs
>    
>    Benchmark 2: ./git-for-each-ref --format="%(refname)" refs/remotes/ | wc -l
>      Time (mean ± σ):       3.3 ms ±   0.2 ms    [User: 2.2 ms, System: 1.5 ms]
>      Range (min … max):     3.0 ms …   4.0 ms    774 runs
>    
>    Benchmark 3: ./git-for-each-ref --count-matches refs/remotes/
>      Time (mean ± σ):       2.4 ms ±   0.1 ms    [User: 1.5 ms, System: 0.9 ms]
>      Range (min … max):     2.2 ms …   3.4 ms    1018 runs
>    
>    Summary
>      './git-for-each-ref --count-matches refs/remotes/' ran
>        1.33 ± 0.10 times faster than './git-for-each-ref --format="%(refname)" refs/remotes/ | wc -l'
>        2.48 ± 0.17 times faster than './git-for-each-ref refs/remotes/ | wc -l'
> 
> I will note this is an unloaded multi-core system, which gives the piped
> version a slight edge. Total CPU is probably more interesting than
> wall-clock time, but all of these are so short that I think the results
> should be taken with a pretty big grain of salt (I had to switch from
> the "powersave" to "performance" CPU governor just to get more
> consistent results).
> 
> -Peff

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] for-each-ref: add --count-matches option
  2023-06-27 10:05     ` Phillip Wood
@ 2023-06-27 18:22       ` Junio C Hamano
  2023-06-27 19:59         ` Jeff King
  2023-06-28 13:12       ` Phillip Wood
  2023-07-11 14:48       ` René Scharfe
  2 siblings, 1 reply; 12+ messages in thread
From: Junio C Hamano @ 2023-06-27 18:22 UTC (permalink / raw)
  To: Phillip Wood
  Cc: Jeff King, Derrick Stolee via GitGitGadget, git, vdye, me,
	mjcheetham, Derrick Stolee

Phillip Wood <phillip.wood123@gmail.com> writes:

> I'm a bit suspicious of the massive speed up I'm seeing by avoiding
> the default format but it appears to be repeatable.

This is interesting.  The default format over just 'x' is to
concatenate the refname with hexadecimal form of the object name,
while (at least as the code originally was intended):

 * the cost of parsing the "--format=<format>" string into series of
   atoms is one-time O(1), and there is nothing specially tricky in
   there.

 * the cost of computing these atoms should be miniscule, as
   in the default format, the per-ref cost come from these:

   - populate_value() -> get_refname() -> show_ref() for the refname
     would merely be a xstrdup() of the refname.

   - populate_value() -> grab_oid() -> do_grab_oid() should be using
     O_FULL so there shouldn't be any find_unique_abbrev() cost.
     Just binary-to-hex.

   - populate_value() -> get_object() -> oid_object_info_extended() ->
     grab_common_values() asks for ATOM_OBJECTTYPE that should be a
     single xstrdup(), but oid_object_info_extended() needs to parse
     the object, but .info.contentp ought to be NULL and we should
     not be calling parse_object_buffer().

So it might be worth looking into where the time is going (didn't
Peff or somebody do that a few years ago, though?) when using the
default format and optimize or special case the codepath, but the
responses I have seen from two of you makes me suspect that the new
option is not the best general approach.

Thanks.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] for-each-ref: add --count-matches option
  2023-06-27 18:22       ` Junio C Hamano
@ 2023-06-27 19:59         ` Jeff King
  0 siblings, 0 replies; 12+ messages in thread
From: Jeff King @ 2023-06-27 19:59 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Phillip Wood, Derrick Stolee via GitGitGadget, git, vdye, me,
	mjcheetham, Derrick Stolee

On Tue, Jun 27, 2023 at 11:22:02AM -0700, Junio C Hamano wrote:

>    - populate_value() -> get_object() -> oid_object_info_extended() ->
>      grab_common_values() asks for ATOM_OBJECTTYPE that should be a
>      single xstrdup(), but oid_object_info_extended() needs to parse
>      the object, but .info.contentp ought to be NULL and we should
>      not be calling parse_object_buffer().

This is the atom that I expect to give most of the cost. We should
definitely be computing that just with a pack lookup (and chasing deltas
through the pack, though commits usually aren't deltas). But without
that atom, I think we would not even mmap the idx or pack file at all.

The ref-filter code is also pretty keen to malloc() little strings. When
we're measuring 5ms total runtimes and there's just not that much actual
work to do, those little things start to matter more.

> So it might be worth looking into where the time is going (didn't
> Peff or somebody do that a few years ago, though?) when using the
> default format and optimize or special case the codepath,[...]

Running:

 perf record -g git for-each-ref --format='%(refname) %(objecttype)' >/dev/null
 perf script report flamegraph

in linux.git shows that we spend a lot of time faulting in .idx pages.

A lot of the time is attributed to ref_array_sort(), but I think that's
a red herring. It lazy-loads the atom values, so that's where most of
the work is happening (though I would have thought it would just do the
refname atom, not all of them; but I guess it doesn't really matter
since we'll eventually need them to print anyway). Commenting out the
sort call just shows the same time spent in format_ref_array_item().

The patches you're thinking of are probably:

  https://lore.kernel.org/git/YTNpQ7Od1U%2F5i0R7@coredump.intra.peff.net/

which are mostly about micro-optimizing out those mallocs. They do still
help a little in the '%(refname)' case, but not nearly as much as
dropping '%(objecttype)' does.

(Using --format=x in theory drops out some mallocs, but I wasn't able to
measure any actual speedup).

> [...]but the
> responses I have seen from two of you makes me suspect that the new
> option is not the best general approach.

Yeah, my feeling is that with the reduced format, eliminating the
pipe/wc overhead is a pretty small improvement.

-Peff

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] for-each-ref: add --count-matches option
  2023-06-27 10:05     ` Phillip Wood
  2023-06-27 18:22       ` Junio C Hamano
@ 2023-06-28 13:12       ` Phillip Wood
  2023-06-28 17:08         ` Junio C Hamano
  2023-07-11 14:48       ` René Scharfe
  2 siblings, 1 reply; 12+ messages in thread
From: Phillip Wood @ 2023-06-28 13:12 UTC (permalink / raw)
  To: Jeff King, Derrick Stolee via GitGitGadget
  Cc: Junio C Hamano, git, vdye, me, mjcheetham, Derrick Stolee

On 27/06/2023 11:05, Phillip Wood wrote:
> On 27/06/2023 08:30, Jeff King wrote:
>> I don't think this is a very realistic perf test, because for-each-ref
>> is doing a bunch of work to generate its default format, only to have
>> "wc" throw most of it away. Doing:
>>
>>    git for-each-ref --format='%(refname)' | wc -l
> 
> That's a good point. I wondered if using a short fixed format string was 
> even better so I tried
> 
> git init test
> cd test
> git commit --allow-empty -m initial
> seq 0 100000 | sed "s:\(.*\):create refs/heads/some-prefix/\1 $(git 
> rev-parse HEAD):" | git update-ref --stdin
> git pack-refs --all
> hyperfine -L fmt "","--format=%\(refname\)","--format=x" 'git 
> for-each-ref {fmt} refs/heads/ | wc -l'
> 
> Which gives
> [...] 
> Summary
>    git for-each-ref --format=x refs/heads/ | wc -l ran
>      1.05 ± 0.01 times faster than git for-each-ref 
> --format=%\(refname\) refs/heads/ | wc -l
>     18.25 ± 0.20 times faster than git for-each-ref  refs/heads/ | wc -l
> [...] 
> I'm a bit suspicious of the massive speed up I'm seeing by avoiding the 
> default format but it appears to be repeatable.

Having seen Peff's mail [1] I realized that my test repo above is 
looking up the commit from a loose object. If I repack the repository 
then the default format is still slower than using "--format=%(refname)" 
but is much more competitive.

$ git repack -a
Enumerating objects: 2, done.
Counting objects: 100% (2/2), done.
Writing objects: 100% (2/2), done.
Total 2 (delta 0), reused 0 (delta 0), pack-reused 0

$ hyperfine -L fmt "","--format=%\(refname\)","--format=x" 'git 
for-each-ref {fmt} refs/heads/ | wc'
Benchmark 1: git for-each-ref  refs/heads/ | wc -l
   Time (mean ± σ):     111.4 ms ±   1.4 ms    [User: 96.9 ms, System: 
19.6 ms]
   Range (min … max):   109.6 ms … 115.1 ms    25 runs

Benchmark 2: git for-each-ref --format=%\(refname\) refs/heads/ | wc -l
   Time (mean ± σ):      66.7 ms ±   0.7 ms    [User: 59.5 ms, System: 
9.5 ms]
   Range (min … max):    65.6 ms …  68.2 ms    42 runs

Benchmark 3: git for-each-ref --format=x refs/heads/ | wc -l
   Time (mean ± σ):      63.4 ms ±   0.7 ms    [User: 56.3 ms, System: 
8.0 ms]
   Range (min … max):    61.9 ms …  65.1 ms    44 runs

Summary
   git for-each-ref --format=x refs/heads/ | wc -l ran
     1.05 ± 0.02 times faster than git for-each-ref 
--format=%\(refname\) refs/heads/ | wc -l
     1.76 ± 0.03 times faster than git for-each-ref  refs/heads/ | wc -l

So it seems most of the slowdown I was seeing yesterday was due it 
looking up a loose object. I'm surprised repacking makes such a 
difference in a repository that only contains two objects.

Best Wishes

Phillip

[1] 
https://lore.kernel.org/git/20230627195900.GC1280909@coredump.intra.peff.net

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] for-each-ref: add --count-matches option
  2023-06-28 13:12       ` Phillip Wood
@ 2023-06-28 17:08         ` Junio C Hamano
  0 siblings, 0 replies; 12+ messages in thread
From: Junio C Hamano @ 2023-06-28 17:08 UTC (permalink / raw)
  To: Phillip Wood
  Cc: Jeff King, Derrick Stolee via GitGitGadget, git, vdye, me,
	mjcheetham, Derrick Stolee

Phillip Wood <phillip.wood123@gmail.com> writes:

> So it seems most of the slowdown I was seeing yesterday was due it
> looking up a loose object. I'm surprised repacking makes such a
> difference in a repository that only contains two objects.

If we compare what is done in packfile.c:packed_object_info() and
object-file.c:loose_object_info() when we are only interested in
finding out the object type, there aren't that many differences
in the set of system calls each codepath needs to make.

 * The packfile codepath needs to open and mmap *.pack and *.idx,
   binary search in the .idx for the object location, then read a
   few bytes from .pack, before being able to decode the header to
   find out the type.

 * The loose object codepath needs to open and mmap the loose object
   file, read a few bytes from there, before being abole to decode
   the header to find out the type. After that, it needs to munmap.

The cost of open/mmap for packfile codepath amortises over number of
objects (hence number of refs) very well.  If there are many refs
that point at the same object, cache object layer will kick in to
avoid disk access for second and subsequent accesses to the same
object, but it helps both codepaths equally, so there should not be
much difference either way.

Thanks for a interesting piece of food for thought.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] for-each-ref: add --count-matches option
  2023-06-27  7:30   ` Jeff King
  2023-06-27 10:05     ` Phillip Wood
@ 2023-07-10 16:51     ` Derrick Stolee
  1 sibling, 0 replies; 12+ messages in thread
From: Derrick Stolee @ 2023-07-10 16:51 UTC (permalink / raw)
  To: Jeff King, Derrick Stolee via GitGitGadget
  Cc: Junio C Hamano, git, vdye, me, mjcheetham

On 6/27/2023 1:30 AM, Jeff King wrote:
>On Mon, Jun 26, 2023 at 03:09:57PM +0000, Derrick Stolee via GitGitGadget wrote:
>
>>+for pattern in "refs/heads/" "refs/tags/" "refs/remotes"
>>+do
>>+   test_perf "count $pattern: git for-each-ref | wc -l" "
>>+       git for-each-ref $pattern | wc -l
>>+   "
>>+
>>+   test_perf "count $pattern: git for-each-ref --count-match" "
>>+       git for-each-ref --count-matches $pattern
>>+   "
>>+done
>
>I don't think this is a very realistic perf test, because for-each-ref
>is doing a bunch of work to generate its default format, only to have
>"wc" throw most of it away. Doing:
>
>  git for-each-ref --format='%(refname)' | wc -l
>
>is much better (obviously you have to remember to do that if you care
>about optimizing your command, but that's true of --count-matches, too).
Thank you for pointing this out! I'll be sure to modify the test and the
analysis about it.
The other check I need to compare is when multiple refspecs are provided
at the same time, which I do believe still is a valuable thing to combine
into a single process instead of multiple pipes to 'wc'.
Did you have any thoughts on whether or not this works as an option in
'git for-each-ref' or would be better broken into a new builtin?
Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 2/2] for-each-ref: add --count-matches option
  2023-06-27 10:05     ` Phillip Wood
  2023-06-27 18:22       ` Junio C Hamano
  2023-06-28 13:12       ` Phillip Wood
@ 2023-07-11 14:48       ` René Scharfe
  2 siblings, 0 replies; 12+ messages in thread
From: René Scharfe @ 2023-07-11 14:48 UTC (permalink / raw)
  To: phillip.wood, Jeff King, Derrick Stolee via GitGitGadget
  Cc: Junio C Hamano, git, vdye, me, mjcheetham, Derrick Stolee

Am 27.06.23 um 12:05 schrieb Phillip Wood:
> On 27/06/2023 08:30, Jeff King wrote:
>> On Mon, Jun 26, 2023 at 03:09:57PM +0000, Derrick Stolee via GitGitGadget wrote:
>>
>>> +for pattern in "refs/heads/" "refs/tags/" "refs/remotes"
>>> +do
>>> +    test_perf "count $pattern: git for-each-ref | wc -l" "
>>> +        git for-each-ref $pattern | wc -l
>>> +    "
>>> +
>>> +    test_perf "count $pattern: git for-each-ref --count-match" "
>>> +        git for-each-ref --count-matches $pattern
>>> +    "
>>> +done
>>
>> I don't think this is a very realistic perf test, because for-each-ref
>> is doing a bunch of work to generate its default format, only to have
>> "wc" throw most of it away. Doing:
>>
>>    git for-each-ref --format='%(refname)' | wc -l
>
> That's a good point. I wondered if using a short fixed format string was even better so I tried
>
> git init test
> cd test
> git commit --allow-empty -m initial
> seq 0 100000 | sed "s:\(.*\):create refs/heads/some-prefix/\1 $(git rev-parse HEAD):" | git update-ref --stdin
> git pack-refs --all
> hyperfine -L fmt "","--format=%\(refname\)","--format=x" 'git for-each-ref {fmt} refs/heads/ | wc -l'
>
> Which gives
>
> Benchmark 1: git for-each-ref  refs/heads/ | wc -l
>   Time (mean ± σ):      1.150 s ±  0.010 s    [User: 0.494 s, System: 0.637 s]
>   Range (min … max):    1.140 s …  1.170 s    10 runs
>
> Benchmark 2: git for-each-ref --format=%\(refname\) refs/heads/ | wc -l
>   Time (mean ± σ):      66.0 ms ±   0.3 ms    [User: 58.9 ms, System: 9.5 ms]
>   Range (min … max):    65.2 ms …  67.1 ms    43 runs
>
> Benchmark 3: git for-each-ref --format=x refs/heads/ | wc -l
>   Time (mean ± σ):      63.0 ms ±   0.5 ms    [User: 54.3 ms, System: 9.6 ms]
>   Range (min … max):    62.3 ms …  65.4 ms    44 runs
>
> Summary
>   git for-each-ref --format=x refs/heads/ | wc -l ran
>     1.05 ± 0.01 times faster than git for-each-ref --format=%\(refname\) refs/heads/ | wc -l
>    18.25 ± 0.20 times faster than git for-each-ref  refs/heads/ | wc -l

You don't need no "x", by the way; using the empty string saves some
cycles for me.  In my Git clone, no special setup, 9931 refs:

Benchmark 1: git for-each-ref --format=x | wc -l
  Time (mean ± σ):      25.1 ms ±   0.1 ms    [User: 8.7 ms, System: 16.8 ms]
  Range (min … max):    24.9 ms …  25.6 ms    109 runs

Benchmark 2: git for-each-ref --format= | wc -l
  Time (mean ± σ):      24.6 ms ±   0.1 ms    [User: 8.3 ms, System: 16.8 ms]
  Range (min … max):    24.4 ms …  25.3 ms    110 runs

Summary
  git for-each-ref --format= | wc -l ran
    1.02 ± 0.01 times faster than git for-each-ref --format=x | wc -l

René


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2023-07-11 14:49 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-06-26 15:09 [PATCH 0/2] [RFC] for-each-ref: add --count-matches mode Derrick Stolee via GitGitGadget
2023-06-26 15:09 ` [PATCH 1/2] for-each-ref: extract ref output loop Derrick Stolee via GitGitGadget
2023-06-26 15:09 ` [PATCH 2/2] for-each-ref: add --count-matches option Derrick Stolee via GitGitGadget
2023-06-26 16:14   ` Junio C Hamano
2023-06-27  7:30   ` Jeff King
2023-06-27 10:05     ` Phillip Wood
2023-06-27 18:22       ` Junio C Hamano
2023-06-27 19:59         ` Jeff King
2023-06-28 13:12       ` Phillip Wood
2023-06-28 17:08         ` Junio C Hamano
2023-07-11 14:48       ` René Scharfe
2023-07-10 16:51     ` Derrick Stolee

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).