* [PATCH 0/2] [RFC] for-each-ref: add --count-matches mode @ 2023-06-26 15:09 Derrick Stolee via GitGitGadget 2023-06-26 15:09 ` [PATCH 1/2] for-each-ref: extract ref output loop Derrick Stolee via GitGitGadget 2023-06-26 15:09 ` [PATCH 2/2] for-each-ref: add --count-matches option Derrick Stolee via GitGitGadget 0 siblings, 2 replies; 12+ messages in thread From: Derrick Stolee via GitGitGadget @ 2023-06-26 15:09 UTC (permalink / raw) To: git; +Cc: vdye, gitster, me, mjcheetham, Derrick Stolee I'm leaving this as an RFC for now because I can't decide if this new option in git for-each-ref is good or if this needs an entirely new builtin. I'm open to whatever people think is best, I'd just like a way to count matches based on refspecs. Thanks, -Stolee Derrick Stolee (2): for-each-ref: extract ref output loop for-each-ref: add --count-matches option Documentation/git-for-each-ref.txt | 5 ++ builtin/for-each-ref.c | 80 +++++++++++++++++++++--------- ref-filter.c | 47 ++++++++++++++++++ ref-filter.h | 7 +++ t/perf/p1501-ref-iteration.sh | 35 +++++++++++++ t/t6300-for-each-ref.sh | 28 +++++++++++ 6 files changed, 179 insertions(+), 23 deletions(-) create mode 100755 t/perf/p1501-ref-iteration.sh base-commit: d7d8841f67f29e6ecbad85a11805c907d0f00d5d Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1548%2Fderrickstolee%2Ffor-each-ref-count-v1 Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1548/derrickstolee/for-each-ref-count-v1 Pull-Request: https://github.com/gitgitgadget/git/pull/1548 -- gitgitgadget ^ permalink raw reply [flat|nested] 12+ messages in thread
* [PATCH 1/2] for-each-ref: extract ref output loop 2023-06-26 15:09 [PATCH 0/2] [RFC] for-each-ref: add --count-matches mode Derrick Stolee via GitGitGadget @ 2023-06-26 15:09 ` Derrick Stolee via GitGitGadget 2023-06-26 15:09 ` [PATCH 2/2] for-each-ref: add --count-matches option Derrick Stolee via GitGitGadget 1 sibling, 0 replies; 12+ messages in thread From: Derrick Stolee via GitGitGadget @ 2023-06-26 15:09 UTC (permalink / raw) To: git; +Cc: vdye, gitster, me, mjcheetham, Derrick Stolee, Derrick Stolee From: Derrick Stolee <derrickstolee@github.com> In preparation for a new output mode of 'git for-each-ref', extract the loop that outputs references into a static method. No functional change here. Signed-off-by: Derrick Stolee <derrickstolee@github.com> --- builtin/for-each-ref.c | 58 ++++++++++++++++++++++++++---------------- 1 file changed, 36 insertions(+), 22 deletions(-) diff --git a/builtin/for-each-ref.c b/builtin/for-each-ref.c index 695fc8f4a5e..ce34940e3e6 100644 --- a/builtin/for-each-ref.c +++ b/builtin/for-each-ref.c @@ -17,17 +17,47 @@ static char const * const for_each_ref_usage[] = { NULL }; -int cmd_for_each_ref(int argc, const char **argv, const char *prefix) +static void filter_and_output_refs(struct repository *r, + struct ref_array *array, + struct ref_filter *filter, + struct ref_format *format, + struct ref_sorting *sorting, + int maxcount, + int omit_empty) { int i; + struct strbuf err = STRBUF_INIT; + struct strbuf output = STRBUF_INIT; + + filter_refs(array, filter, FILTER_REFS_ALL); + filter_ahead_behind(r, format, array); + + ref_array_sort(sorting, array); + + if (!maxcount || array->nr < maxcount) + maxcount = array->nr; + for (i = 0; i < maxcount; i++) { + strbuf_reset(&err); + strbuf_reset(&output); + if (format_ref_array_item(array->items[i], format, &output, &err)) + die("%s", err.buf); + fwrite(output.buf, 1, output.len, stdout); + if (output.len || !omit_empty) + putchar('\n'); + } + + strbuf_release(&err); + strbuf_release(&output); +} + +int cmd_for_each_ref(int argc, const char **argv, const char *prefix) +{ struct ref_sorting *sorting; struct string_list sorting_options = STRING_LIST_INIT_DUP; - int maxcount = 0, icase = 0, omit_empty = 0; + int maxcount = 0, icase = 0, omit_empty = 0, count_matches = 0; struct ref_array array; struct ref_filter filter; struct ref_format format = REF_FORMAT_INIT; - struct strbuf output = STRBUF_INIT; - struct strbuf err = STRBUF_INIT; int from_stdin = 0; struct strvec vec = STRVEC_INIT; @@ -101,25 +131,9 @@ int cmd_for_each_ref(int argc, const char **argv, const char *prefix) } filter.match_as_path = 1; - filter_refs(&array, &filter, FILTER_REFS_ALL); - filter_ahead_behind(the_repository, &format, &array); + filter_and_output_refs(the_repository, &array, &filter, &format, + sorting, maxcount, omit_empty); - ref_array_sort(sorting, &array); - - if (!maxcount || array.nr < maxcount) - maxcount = array.nr; - for (i = 0; i < maxcount; i++) { - strbuf_reset(&err); - strbuf_reset(&output); - if (format_ref_array_item(array.items[i], &format, &output, &err)) - die("%s", err.buf); - fwrite(output.buf, 1, output.len, stdout); - if (output.len || !omit_empty) - putchar('\n'); - } - - strbuf_release(&err); - strbuf_release(&output); ref_array_clear(&array); free_commit_list(filter.with_commit); free_commit_list(filter.no_commit); -- gitgitgadget ^ permalink raw reply related [flat|nested] 12+ messages in thread
* [PATCH 2/2] for-each-ref: add --count-matches option 2023-06-26 15:09 [PATCH 0/2] [RFC] for-each-ref: add --count-matches mode Derrick Stolee via GitGitGadget 2023-06-26 15:09 ` [PATCH 1/2] for-each-ref: extract ref output loop Derrick Stolee via GitGitGadget @ 2023-06-26 15:09 ` Derrick Stolee via GitGitGadget 2023-06-26 16:14 ` Junio C Hamano 2023-06-27 7:30 ` Jeff King 1 sibling, 2 replies; 12+ messages in thread From: Derrick Stolee via GitGitGadget @ 2023-06-26 15:09 UTC (permalink / raw) To: git; +Cc: vdye, gitster, me, mjcheetham, Derrick Stolee, Derrick Stolee From: Derrick Stolee <derrickstolee@github.com> In order to count references of different types based on their initial prefixes, there are two current approaches: 1. Run 'git for-each-ref' on all refs and parse the output to count those that match each prefix. 2. Run 'git for-each-ref "prefix/"' and pipe that output to 'wc -l' to count the number of output lines. Each of these approaches is wasteful as it requires sending the list of matching reference names across a pipe plus the cost of parsing that output. Instead, it would be helpful to have a Git command that counts the number of refs matching a list of patterns. This change adds a new mode, '--count-matches' to 'git for-each-ref' so we can make use of the existing infrastructure around parsing refspecs in the correct places. '--count' is already taken as a "maximum number" of refs to output. An alternative approach could be to make a brand-new builtin that is focused on counting ref matches. This would involve duplicating a bit of code around parsing refpecs, but would not be terribly difficult. The actual overlap of implementation here with 'git for-each-ref' is small enough that we could instead extract this elsewhere. My gut feeling is that this behavior doesn't merit a new builtin. The implementation is extremely simple: iterate through all references and compare each ref to each refspec. On a match, increment a counter value for that refspec. In the end, output each refspec followed by the count. If all given refspecs were prefixes, then it is tempting to instead use a counting behavior that we can use in things like "how many OIDs start with this short hex string?" Navigating to the start and end of the range of refs starting with that prefix is possible in the packed-refs file. However, the records do not have a constant size so we cannot infer the number of references in that range using the current format. (Perhaps, in the future we will have a ref storage system that allows this kind of counting to be easy to do in O(log N) time.) A new performance test is included to check the performance of iterating through these references and counting them appropriately. This presents a 3x improvement over the trivial piping through to 'wc -l', and that assumes there is a single pattern to match instead of multiple. We can see that testing three patterns sequentially adds to the total time, but doing a single process with --count-match continues to be as fast. (It's difficult to tell since it _also_ matches the sum of the three for this example repo.) Test this tree ---------------------------------------------------------------------------- 1501.2: count refs/heads/: git for-each-ref | wc -l 0.01(0.00+0.01) 1501.3: count refs/heads/: git for-each-ref --count-match 0.00(0.00+0.00) 1501.4: count refs/tags/: git for-each-ref | wc -l 0.02(0.00+0.02) 1501.5: count refs/tags/: git for-each-ref --count-match 0.00(0.00+0.00) 1501.6: count refs/remotes: git for-each-ref | wc -l 0.15(0.08+0.07) 1501.7: count refs/remotes: git for-each-ref --count-match 0.04(0.01+0.02) 1501.8: count all patterns: git for-each-ref | wc -l 0.18(0.08+0.10) 1501.9: count all patterns: git for-each-ref --count-match 0.04(0.02+0.02) Signed-off-by: Derrick Stolee <derrickstolee@github.com> --- Documentation/git-for-each-ref.txt | 5 ++++ builtin/for-each-ref.c | 26 +++++++++++++++-- ref-filter.c | 47 ++++++++++++++++++++++++++++++ ref-filter.h | 7 +++++ t/perf/p1501-ref-iteration.sh | 35 ++++++++++++++++++++++ t/t6300-for-each-ref.sh | 28 ++++++++++++++++++ 6 files changed, 145 insertions(+), 3 deletions(-) create mode 100755 t/perf/p1501-ref-iteration.sh diff --git a/Documentation/git-for-each-ref.txt b/Documentation/git-for-each-ref.txt index 1e215d4e734..74018755ed5 100644 --- a/Documentation/git-for-each-ref.txt +++ b/Documentation/git-for-each-ref.txt @@ -42,6 +42,11 @@ OPTIONS `<pattern>`. This option makes it stop after showing that many refs. +--count-matches:: + Instead of listing references according to the given format, output + the number of references that match each pattern. Incompatible with + `--format`, `--sort`, and `--count`. + --sort=<key>:: A field name to sort on. Prefix `-` to sort in descending order of the value. When unspecified, diff --git a/builtin/for-each-ref.c b/builtin/for-each-ref.c index ce34940e3e6..e3db719bb87 100644 --- a/builtin/for-each-ref.c +++ b/builtin/for-each-ref.c @@ -50,6 +50,17 @@ static void filter_and_output_refs(struct repository *r, strbuf_release(&output); } +static void count_and_output_patterns(struct ref_filter *filter) +{ + uint32_t *counts = count_ref_patterns(filter); + + for (int i = 0; filter->name_patterns && filter->name_patterns[i]; i++) + fprintf(stdout, "%s %"PRIu32"\n", + filter->name_patterns[i], counts[i]); + + free(counts); +} + int cmd_for_each_ref(int argc, const char **argv, const char *prefix) { struct ref_sorting *sorting; @@ -60,6 +71,7 @@ int cmd_for_each_ref(int argc, const char **argv, const char *prefix) struct ref_format format = REF_FORMAT_INIT; int from_stdin = 0; struct strvec vec = STRVEC_INIT; + const char *initial_format = "%(objectname) %(objecttype)\t%(refname)"; struct option opts[] = { OPT_BIT('s', "shell", &format.quote_style, @@ -72,6 +84,8 @@ int cmd_for_each_ref(int argc, const char **argv, const char *prefix) N_("quote placeholders suitably for Tcl"), QUOTE_TCL), OPT_BOOL(0, "omit-empty", &omit_empty, N_("do not output a newline after empty formatted refs")), + OPT_BOOL(0, "count-matches", &count_matches, + N_("output number of references matching each pattern instead of any other output")), OPT_GROUP(""), OPT_INTEGER( 0 , "count", &maxcount, N_("show only <n> matched refs")), @@ -93,7 +107,7 @@ int cmd_for_each_ref(int argc, const char **argv, const char *prefix) memset(&array, 0, sizeof(array)); memset(&filter, 0, sizeof(filter)); - format.format = "%(objectname) %(objecttype)\t%(refname)"; + format.format = initial_format; git_config(git_default_config, NULL); @@ -102,6 +116,9 @@ int cmd_for_each_ref(int argc, const char **argv, const char *prefix) error("invalid --count argument: `%d'", maxcount); usage_with_options(for_each_ref_usage, opts); } + if (count_matches && + (maxcount || format.format != initial_format || sorting_options.nr)) + die("--count-matches incompatible with --count, --format, or --sort"); if (HAS_MULTI_BITS(format.quote_style)) { error("more than one quoting style?"); usage_with_options(for_each_ref_usage, opts); @@ -131,8 +148,11 @@ int cmd_for_each_ref(int argc, const char **argv, const char *prefix) } filter.match_as_path = 1; - filter_and_output_refs(the_repository, &array, &filter, &format, - sorting, maxcount, omit_empty); + if (count_matches) + count_and_output_patterns(&filter); + else + filter_and_output_refs(the_repository, &array, &filter, &format, + sorting, maxcount, omit_empty); ref_array_clear(&array); free_commit_list(filter.with_commit); diff --git a/ref-filter.c b/ref-filter.c index 4991cd4f7a8..4e02a4ae98d 100644 --- a/ref-filter.c +++ b/ref-filter.c @@ -2560,6 +2560,53 @@ int filter_refs(struct ref_array *array, struct ref_filter *filter, unsigned int return ret; } +struct filter_and_counts { + struct ref_filter *filter; + uint32_t *counts; +}; + +static int ref_filter_counter(const char *refname, + const struct object_id *oid, + int flag, void *cb_data) +{ + struct filter_and_counts *fc = cb_data; + const char **pattern = fc->filter->name_patterns; + size_t namelen = strlen(refname); + unsigned flags = fc->filter->ignore_case ? WM_CASEFOLD : 0; + + for (int i = 0; *pattern; i++, pattern++) { + const char *p = *pattern; + int plen = strlen(p); + + if ((plen <= namelen) && + !strncmp(refname, p, plen) && + (refname[plen] == '\0' || + refname[plen] == '/' || + p[plen-1] == '/')) + fc->counts[i]++; + else if (!wildmatch(p, refname, flags)) + fc->counts[i]++; + } + return 0; +} + +uint32_t *count_ref_patterns(struct ref_filter *filter) +{ + int size = 0; + struct filter_and_counts fc = { + .filter = filter, + }; + + while (filter->name_patterns[size]) + size++; + + CALLOC_ARRAY(fc.counts, size); + + for_each_fullref_in_pattern(filter, ref_filter_counter, &fc); + + return fc.counts; +} + static int compare_detached_head(struct ref_array_item *a, struct ref_array_item *b) { if (!(a->kind ^ b->kind)) diff --git a/ref-filter.h b/ref-filter.h index 430701cfb76..b7e5a4f6a80 100644 --- a/ref-filter.h +++ b/ref-filter.h @@ -141,6 +141,13 @@ char *get_head_description(void); /* Set up translated strings in the output. */ void setup_ref_filter_porcelain_msg(void); +/* + * Iterate over all references, counting how many match each filter + * pattern. Returns an array of the counts with the ith count matching the + * ith filter->name_pattern entry. + */ +uint32_t *count_ref_patterns(struct ref_filter *filter); + /* * Print a single ref, outside of any ref-filter. Note that the * name must be a fully qualified refname. diff --git a/t/perf/p1501-ref-iteration.sh b/t/perf/p1501-ref-iteration.sh new file mode 100755 index 00000000000..609487d20b0 --- /dev/null +++ b/t/perf/p1501-ref-iteration.sh @@ -0,0 +1,35 @@ +#!/bin/sh + +test_description='Ref iteration performance tests' +. ./perf-lib.sh + +test_perf_large_repo + +# Optimize ref backend store +test_expect_success 'setup' ' + git pack-refs +' + +for pattern in "refs/heads/" "refs/tags/" "refs/remotes" +do + test_perf "count $pattern: git for-each-ref | wc -l" " + git for-each-ref $pattern | wc -l + " + + test_perf "count $pattern: git for-each-ref --count-match" " + git for-each-ref --count-matches $pattern + " +done + +test_perf "count all patterns: git for-each-ref | wc -l" " + git for-each-ref refs/heads/ | wc -l && + git for-each-ref refs/tags/ | wc -l && + git for-each-ref refs/remotes/ | wc -l +" + +test_perf "count all patterns: git for-each-ref --count-match" " + git for-each-ref --count-matches \ + refs/heads/ refs/tags/ refs/remotes/ +" + +test_done diff --git a/t/t6300-for-each-ref.sh b/t/t6300-for-each-ref.sh index 5c00607608a..001382956e4 100755 --- a/t/t6300-for-each-ref.sh +++ b/t/t6300-for-each-ref.sh @@ -486,6 +486,34 @@ for i in "--perl --shell" "-s --python" "--python --tcl" "--tcl --perl"; do " done +test_expect_success '--count-matches incompatible with some options' ' + for opt in "--format=x" "--sort=refname" "--count=10" + do + test_must_fail git for-each-ref --count-matches $opt refs/heads/ 2>err && + grep "count-matches incompatible" err || return 1 + done +' + +test_expect_success '--count-matches tallies the number matching each refspec' ' + git init multi-refs && + test_commit -C multi-refs A && + git -C multi-refs branch A && + git -C multi-refs branch pre/A && + test_commit -C multi-refs --no-tag B && + git -C multi-refs branch B && + git -C multi-refs for-each-ref --count-matches \ + refs/heads/ refs/heads/pre/ refs/tags/ "*A*" >actual && + + cat >expect <<-EOF && + refs/heads/ 4 + refs/heads/pre/ 1 + refs/tags/ 1 + *A* 3 + EOF + + test_cmp expect actual +' + test_expect_success 'setup for upstream:track[short]' ' test_commit two ' -- gitgitgadget ^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] for-each-ref: add --count-matches option 2023-06-26 15:09 ` [PATCH 2/2] for-each-ref: add --count-matches option Derrick Stolee via GitGitGadget @ 2023-06-26 16:14 ` Junio C Hamano 2023-06-27 7:30 ` Jeff King 1 sibling, 0 replies; 12+ messages in thread From: Junio C Hamano @ 2023-06-26 16:14 UTC (permalink / raw) To: Derrick Stolee via GitGitGadget; +Cc: git, vdye, me, mjcheetham, Derrick Stolee "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes: > Each of these approaches is wasteful as it requires sending the list of > matching reference names across a pipe plus the cost of parsing that > output. > > Instead, it would be helpful to have a Git command that counts the > number of refs matching a list of patterns. For for-each-ref (which can be reused by branch and tag for no cost), I am on the fence but slightly in favor (partly because the code has already been written). But I have to wonder where changes that come from the above reasoning need to end, though. Do we count branches and tags because "git branch --list | wc -l" is too costly? Should we teach "git remote" the same trick? How about "git stash list"? "git show-index $pack_index"? "git ls-files"? How do we decide where to draw the line? When a command invocation in a large repository can produce records exceeding a million? ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] for-each-ref: add --count-matches option 2023-06-26 15:09 ` [PATCH 2/2] for-each-ref: add --count-matches option Derrick Stolee via GitGitGadget 2023-06-26 16:14 ` Junio C Hamano @ 2023-06-27 7:30 ` Jeff King 2023-06-27 10:05 ` Phillip Wood 2023-07-10 16:51 ` Derrick Stolee 1 sibling, 2 replies; 12+ messages in thread From: Jeff King @ 2023-06-27 7:30 UTC (permalink / raw) To: Derrick Stolee via GitGitGadget Cc: Junio C Hamano, git, vdye, me, mjcheetham, Derrick Stolee On Mon, Jun 26, 2023 at 03:09:57PM +0000, Derrick Stolee via GitGitGadget wrote: > +for pattern in "refs/heads/" "refs/tags/" "refs/remotes" > +do > + test_perf "count $pattern: git for-each-ref | wc -l" " > + git for-each-ref $pattern | wc -l > + " > + > + test_perf "count $pattern: git for-each-ref --count-match" " > + git for-each-ref --count-matches $pattern > + " > +done I don't think this is a very realistic perf test, because for-each-ref is doing a bunch of work to generate its default format, only to have "wc" throw most of it away. Doing: git for-each-ref --format='%(refname)' | wc -l is much better (obviously you have to remember to do that if you care about optimizing your command, but that's true of --count-matches, too). Running hyperfine with three variants shows that the command above is competitive with --count-matches, though slightly slower (hyperfine complains about short commands and outliers because these runtimes are so tiny in the first place; I omitted those warnings from the output below for readability): Benchmark 1: ./git-for-each-ref refs/remotes/ | wc -l Time (mean ± σ): 6.1 ms ± 0.2 ms [User: 3.0 ms, System: 3.6 ms] Range (min … max): 5.6 ms … 7.1 ms 397 runs Benchmark 2: ./git-for-each-ref --format="%(refname)" refs/remotes/ | wc -l Time (mean ± σ): 3.3 ms ± 0.2 ms [User: 2.2 ms, System: 1.5 ms] Range (min … max): 3.0 ms … 4.0 ms 774 runs Benchmark 3: ./git-for-each-ref --count-matches refs/remotes/ Time (mean ± σ): 2.4 ms ± 0.1 ms [User: 1.5 ms, System: 0.9 ms] Range (min … max): 2.2 ms … 3.4 ms 1018 runs Summary './git-for-each-ref --count-matches refs/remotes/' ran 1.33 ± 0.10 times faster than './git-for-each-ref --format="%(refname)" refs/remotes/ | wc -l' 2.48 ± 0.17 times faster than './git-for-each-ref refs/remotes/ | wc -l' I will note this is an unloaded multi-core system, which gives the piped version a slight edge. Total CPU is probably more interesting than wall-clock time, but all of these are so short that I think the results should be taken with a pretty big grain of salt (I had to switch from the "powersave" to "performance" CPU governor just to get more consistent results). -Peff ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] for-each-ref: add --count-matches option 2023-06-27 7:30 ` Jeff King @ 2023-06-27 10:05 ` Phillip Wood 2023-06-27 18:22 ` Junio C Hamano ` (2 more replies) 2023-07-10 16:51 ` Derrick Stolee 1 sibling, 3 replies; 12+ messages in thread From: Phillip Wood @ 2023-06-27 10:05 UTC (permalink / raw) To: Jeff King, Derrick Stolee via GitGitGadget Cc: Junio C Hamano, git, vdye, me, mjcheetham, Derrick Stolee On 27/06/2023 08:30, Jeff King wrote: > On Mon, Jun 26, 2023 at 03:09:57PM +0000, Derrick Stolee via GitGitGadget wrote: > >> +for pattern in "refs/heads/" "refs/tags/" "refs/remotes" >> +do >> + test_perf "count $pattern: git for-each-ref | wc -l" " >> + git for-each-ref $pattern | wc -l >> + " >> + >> + test_perf "count $pattern: git for-each-ref --count-match" " >> + git for-each-ref --count-matches $pattern >> + " >> +done > > I don't think this is a very realistic perf test, because for-each-ref > is doing a bunch of work to generate its default format, only to have > "wc" throw most of it away. Doing: > > git for-each-ref --format='%(refname)' | wc -l That's a good point. I wondered if using a short fixed format string was even better so I tried git init test cd test git commit --allow-empty -m initial seq 0 100000 | sed "s:\(.*\):create refs/heads/some-prefix/\1 $(git rev-parse HEAD):" | git update-ref --stdin git pack-refs --all hyperfine -L fmt "","--format=%\(refname\)","--format=x" 'git for-each-ref {fmt} refs/heads/ | wc -l' Which gives Benchmark 1: git for-each-ref refs/heads/ | wc -l Time (mean ± σ): 1.150 s ± 0.010 s [User: 0.494 s, System: 0.637 s] Range (min … max): 1.140 s … 1.170 s 10 runs Benchmark 2: git for-each-ref --format=%\(refname\) refs/heads/ | wc -l Time (mean ± σ): 66.0 ms ± 0.3 ms [User: 58.9 ms, System: 9.5 ms] Range (min … max): 65.2 ms … 67.1 ms 43 runs Benchmark 3: git for-each-ref --format=x refs/heads/ | wc -l Time (mean ± σ): 63.0 ms ± 0.5 ms [User: 54.3 ms, System: 9.6 ms] Range (min … max): 62.3 ms … 65.4 ms 44 runs Summary git for-each-ref --format=x refs/heads/ | wc -l ran 1.05 ± 0.01 times faster than git for-each-ref --format=%\(refname\) refs/heads/ | wc -l 18.25 ± 0.20 times faster than git for-each-ref refs/heads/ | wc -l So on my somewhat slower machine the default format is over an order of magnitude slower than using either --format=%(refname) or --format=x and the short fixed format is marginally faster. I haven't applied stolee's patch but the 3 or 4 times improvement mentioned in the commit message seems likely to be from not processing the default format. One thing to note is that we're not comparing like-with-like when more than one pattern is given as --count-matches gives a separate count for each pattern. I'm a bit suspicious of the massive speed up I'm seeing by avoiding the default format but it appears to be repeatable. Best Wishes Phillip > is much better (obviously you have to remember to do that if you care > about optimizing your command, but that's true of --count-matches, too). > > Running hyperfine with three variants shows that the command above is > competitive with --count-matches, though slightly slower (hyperfine > complains about short commands and outliers because these runtimes are > so tiny in the first place; I omitted those warnings from the output > below for readability): > > Benchmark 1: ./git-for-each-ref refs/remotes/ | wc -l > Time (mean ± σ): 6.1 ms ± 0.2 ms [User: 3.0 ms, System: 3.6 ms] > Range (min … max): 5.6 ms … 7.1 ms 397 runs > > Benchmark 2: ./git-for-each-ref --format="%(refname)" refs/remotes/ | wc -l > Time (mean ± σ): 3.3 ms ± 0.2 ms [User: 2.2 ms, System: 1.5 ms] > Range (min … max): 3.0 ms … 4.0 ms 774 runs > > Benchmark 3: ./git-for-each-ref --count-matches refs/remotes/ > Time (mean ± σ): 2.4 ms ± 0.1 ms [User: 1.5 ms, System: 0.9 ms] > Range (min … max): 2.2 ms … 3.4 ms 1018 runs > > Summary > './git-for-each-ref --count-matches refs/remotes/' ran > 1.33 ± 0.10 times faster than './git-for-each-ref --format="%(refname)" refs/remotes/ | wc -l' > 2.48 ± 0.17 times faster than './git-for-each-ref refs/remotes/ | wc -l' > > I will note this is an unloaded multi-core system, which gives the piped > version a slight edge. Total CPU is probably more interesting than > wall-clock time, but all of these are so short that I think the results > should be taken with a pretty big grain of salt (I had to switch from > the "powersave" to "performance" CPU governor just to get more > consistent results). > > -Peff ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] for-each-ref: add --count-matches option 2023-06-27 10:05 ` Phillip Wood @ 2023-06-27 18:22 ` Junio C Hamano 2023-06-27 19:59 ` Jeff King 2023-06-28 13:12 ` Phillip Wood 2023-07-11 14:48 ` René Scharfe 2 siblings, 1 reply; 12+ messages in thread From: Junio C Hamano @ 2023-06-27 18:22 UTC (permalink / raw) To: Phillip Wood Cc: Jeff King, Derrick Stolee via GitGitGadget, git, vdye, me, mjcheetham, Derrick Stolee Phillip Wood <phillip.wood123@gmail.com> writes: > I'm a bit suspicious of the massive speed up I'm seeing by avoiding > the default format but it appears to be repeatable. This is interesting. The default format over just 'x' is to concatenate the refname with hexadecimal form of the object name, while (at least as the code originally was intended): * the cost of parsing the "--format=<format>" string into series of atoms is one-time O(1), and there is nothing specially tricky in there. * the cost of computing these atoms should be miniscule, as in the default format, the per-ref cost come from these: - populate_value() -> get_refname() -> show_ref() for the refname would merely be a xstrdup() of the refname. - populate_value() -> grab_oid() -> do_grab_oid() should be using O_FULL so there shouldn't be any find_unique_abbrev() cost. Just binary-to-hex. - populate_value() -> get_object() -> oid_object_info_extended() -> grab_common_values() asks for ATOM_OBJECTTYPE that should be a single xstrdup(), but oid_object_info_extended() needs to parse the object, but .info.contentp ought to be NULL and we should not be calling parse_object_buffer(). So it might be worth looking into where the time is going (didn't Peff or somebody do that a few years ago, though?) when using the default format and optimize or special case the codepath, but the responses I have seen from two of you makes me suspect that the new option is not the best general approach. Thanks. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] for-each-ref: add --count-matches option 2023-06-27 18:22 ` Junio C Hamano @ 2023-06-27 19:59 ` Jeff King 0 siblings, 0 replies; 12+ messages in thread From: Jeff King @ 2023-06-27 19:59 UTC (permalink / raw) To: Junio C Hamano Cc: Phillip Wood, Derrick Stolee via GitGitGadget, git, vdye, me, mjcheetham, Derrick Stolee On Tue, Jun 27, 2023 at 11:22:02AM -0700, Junio C Hamano wrote: > - populate_value() -> get_object() -> oid_object_info_extended() -> > grab_common_values() asks for ATOM_OBJECTTYPE that should be a > single xstrdup(), but oid_object_info_extended() needs to parse > the object, but .info.contentp ought to be NULL and we should > not be calling parse_object_buffer(). This is the atom that I expect to give most of the cost. We should definitely be computing that just with a pack lookup (and chasing deltas through the pack, though commits usually aren't deltas). But without that atom, I think we would not even mmap the idx or pack file at all. The ref-filter code is also pretty keen to malloc() little strings. When we're measuring 5ms total runtimes and there's just not that much actual work to do, those little things start to matter more. > So it might be worth looking into where the time is going (didn't > Peff or somebody do that a few years ago, though?) when using the > default format and optimize or special case the codepath,[...] Running: perf record -g git for-each-ref --format='%(refname) %(objecttype)' >/dev/null perf script report flamegraph in linux.git shows that we spend a lot of time faulting in .idx pages. A lot of the time is attributed to ref_array_sort(), but I think that's a red herring. It lazy-loads the atom values, so that's where most of the work is happening (though I would have thought it would just do the refname atom, not all of them; but I guess it doesn't really matter since we'll eventually need them to print anyway). Commenting out the sort call just shows the same time spent in format_ref_array_item(). The patches you're thinking of are probably: https://lore.kernel.org/git/YTNpQ7Od1U%2F5i0R7@coredump.intra.peff.net/ which are mostly about micro-optimizing out those mallocs. They do still help a little in the '%(refname)' case, but not nearly as much as dropping '%(objecttype)' does. (Using --format=x in theory drops out some mallocs, but I wasn't able to measure any actual speedup). > [...]but the > responses I have seen from two of you makes me suspect that the new > option is not the best general approach. Yeah, my feeling is that with the reduced format, eliminating the pipe/wc overhead is a pretty small improvement. -Peff ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] for-each-ref: add --count-matches option 2023-06-27 10:05 ` Phillip Wood 2023-06-27 18:22 ` Junio C Hamano @ 2023-06-28 13:12 ` Phillip Wood 2023-06-28 17:08 ` Junio C Hamano 2023-07-11 14:48 ` René Scharfe 2 siblings, 1 reply; 12+ messages in thread From: Phillip Wood @ 2023-06-28 13:12 UTC (permalink / raw) To: Jeff King, Derrick Stolee via GitGitGadget Cc: Junio C Hamano, git, vdye, me, mjcheetham, Derrick Stolee On 27/06/2023 11:05, Phillip Wood wrote: > On 27/06/2023 08:30, Jeff King wrote: >> I don't think this is a very realistic perf test, because for-each-ref >> is doing a bunch of work to generate its default format, only to have >> "wc" throw most of it away. Doing: >> >> git for-each-ref --format='%(refname)' | wc -l > > That's a good point. I wondered if using a short fixed format string was > even better so I tried > > git init test > cd test > git commit --allow-empty -m initial > seq 0 100000 | sed "s:\(.*\):create refs/heads/some-prefix/\1 $(git > rev-parse HEAD):" | git update-ref --stdin > git pack-refs --all > hyperfine -L fmt "","--format=%\(refname\)","--format=x" 'git > for-each-ref {fmt} refs/heads/ | wc -l' > > Which gives > [...] > Summary > git for-each-ref --format=x refs/heads/ | wc -l ran > 1.05 ± 0.01 times faster than git for-each-ref > --format=%\(refname\) refs/heads/ | wc -l > 18.25 ± 0.20 times faster than git for-each-ref refs/heads/ | wc -l > [...] > I'm a bit suspicious of the massive speed up I'm seeing by avoiding the > default format but it appears to be repeatable. Having seen Peff's mail [1] I realized that my test repo above is looking up the commit from a loose object. If I repack the repository then the default format is still slower than using "--format=%(refname)" but is much more competitive. $ git repack -a Enumerating objects: 2, done. Counting objects: 100% (2/2), done. Writing objects: 100% (2/2), done. Total 2 (delta 0), reused 0 (delta 0), pack-reused 0 $ hyperfine -L fmt "","--format=%\(refname\)","--format=x" 'git for-each-ref {fmt} refs/heads/ | wc' Benchmark 1: git for-each-ref refs/heads/ | wc -l Time (mean ± σ): 111.4 ms ± 1.4 ms [User: 96.9 ms, System: 19.6 ms] Range (min … max): 109.6 ms … 115.1 ms 25 runs Benchmark 2: git for-each-ref --format=%\(refname\) refs/heads/ | wc -l Time (mean ± σ): 66.7 ms ± 0.7 ms [User: 59.5 ms, System: 9.5 ms] Range (min … max): 65.6 ms … 68.2 ms 42 runs Benchmark 3: git for-each-ref --format=x refs/heads/ | wc -l Time (mean ± σ): 63.4 ms ± 0.7 ms [User: 56.3 ms, System: 8.0 ms] Range (min … max): 61.9 ms … 65.1 ms 44 runs Summary git for-each-ref --format=x refs/heads/ | wc -l ran 1.05 ± 0.02 times faster than git for-each-ref --format=%\(refname\) refs/heads/ | wc -l 1.76 ± 0.03 times faster than git for-each-ref refs/heads/ | wc -l So it seems most of the slowdown I was seeing yesterday was due it looking up a loose object. I'm surprised repacking makes such a difference in a repository that only contains two objects. Best Wishes Phillip [1] https://lore.kernel.org/git/20230627195900.GC1280909@coredump.intra.peff.net ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] for-each-ref: add --count-matches option 2023-06-28 13:12 ` Phillip Wood @ 2023-06-28 17:08 ` Junio C Hamano 0 siblings, 0 replies; 12+ messages in thread From: Junio C Hamano @ 2023-06-28 17:08 UTC (permalink / raw) To: Phillip Wood Cc: Jeff King, Derrick Stolee via GitGitGadget, git, vdye, me, mjcheetham, Derrick Stolee Phillip Wood <phillip.wood123@gmail.com> writes: > So it seems most of the slowdown I was seeing yesterday was due it > looking up a loose object. I'm surprised repacking makes such a > difference in a repository that only contains two objects. If we compare what is done in packfile.c:packed_object_info() and object-file.c:loose_object_info() when we are only interested in finding out the object type, there aren't that many differences in the set of system calls each codepath needs to make. * The packfile codepath needs to open and mmap *.pack and *.idx, binary search in the .idx for the object location, then read a few bytes from .pack, before being able to decode the header to find out the type. * The loose object codepath needs to open and mmap the loose object file, read a few bytes from there, before being abole to decode the header to find out the type. After that, it needs to munmap. The cost of open/mmap for packfile codepath amortises over number of objects (hence number of refs) very well. If there are many refs that point at the same object, cache object layer will kick in to avoid disk access for second and subsequent accesses to the same object, but it helps both codepaths equally, so there should not be much difference either way. Thanks for a interesting piece of food for thought. ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] for-each-ref: add --count-matches option 2023-06-27 10:05 ` Phillip Wood 2023-06-27 18:22 ` Junio C Hamano 2023-06-28 13:12 ` Phillip Wood @ 2023-07-11 14:48 ` René Scharfe 2 siblings, 0 replies; 12+ messages in thread From: René Scharfe @ 2023-07-11 14:48 UTC (permalink / raw) To: phillip.wood, Jeff King, Derrick Stolee via GitGitGadget Cc: Junio C Hamano, git, vdye, me, mjcheetham, Derrick Stolee Am 27.06.23 um 12:05 schrieb Phillip Wood: > On 27/06/2023 08:30, Jeff King wrote: >> On Mon, Jun 26, 2023 at 03:09:57PM +0000, Derrick Stolee via GitGitGadget wrote: >> >>> +for pattern in "refs/heads/" "refs/tags/" "refs/remotes" >>> +do >>> + test_perf "count $pattern: git for-each-ref | wc -l" " >>> + git for-each-ref $pattern | wc -l >>> + " >>> + >>> + test_perf "count $pattern: git for-each-ref --count-match" " >>> + git for-each-ref --count-matches $pattern >>> + " >>> +done >> >> I don't think this is a very realistic perf test, because for-each-ref >> is doing a bunch of work to generate its default format, only to have >> "wc" throw most of it away. Doing: >> >> git for-each-ref --format='%(refname)' | wc -l > > That's a good point. I wondered if using a short fixed format string was even better so I tried > > git init test > cd test > git commit --allow-empty -m initial > seq 0 100000 | sed "s:\(.*\):create refs/heads/some-prefix/\1 $(git rev-parse HEAD):" | git update-ref --stdin > git pack-refs --all > hyperfine -L fmt "","--format=%\(refname\)","--format=x" 'git for-each-ref {fmt} refs/heads/ | wc -l' > > Which gives > > Benchmark 1: git for-each-ref refs/heads/ | wc -l > Time (mean ± σ): 1.150 s ± 0.010 s [User: 0.494 s, System: 0.637 s] > Range (min … max): 1.140 s … 1.170 s 10 runs > > Benchmark 2: git for-each-ref --format=%\(refname\) refs/heads/ | wc -l > Time (mean ± σ): 66.0 ms ± 0.3 ms [User: 58.9 ms, System: 9.5 ms] > Range (min … max): 65.2 ms … 67.1 ms 43 runs > > Benchmark 3: git for-each-ref --format=x refs/heads/ | wc -l > Time (mean ± σ): 63.0 ms ± 0.5 ms [User: 54.3 ms, System: 9.6 ms] > Range (min … max): 62.3 ms … 65.4 ms 44 runs > > Summary > git for-each-ref --format=x refs/heads/ | wc -l ran > 1.05 ± 0.01 times faster than git for-each-ref --format=%\(refname\) refs/heads/ | wc -l > 18.25 ± 0.20 times faster than git for-each-ref refs/heads/ | wc -l You don't need no "x", by the way; using the empty string saves some cycles for me. In my Git clone, no special setup, 9931 refs: Benchmark 1: git for-each-ref --format=x | wc -l Time (mean ± σ): 25.1 ms ± 0.1 ms [User: 8.7 ms, System: 16.8 ms] Range (min … max): 24.9 ms … 25.6 ms 109 runs Benchmark 2: git for-each-ref --format= | wc -l Time (mean ± σ): 24.6 ms ± 0.1 ms [User: 8.3 ms, System: 16.8 ms] Range (min … max): 24.4 ms … 25.3 ms 110 runs Summary git for-each-ref --format= | wc -l ran 1.02 ± 0.01 times faster than git for-each-ref --format=x | wc -l René ^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH 2/2] for-each-ref: add --count-matches option 2023-06-27 7:30 ` Jeff King 2023-06-27 10:05 ` Phillip Wood @ 2023-07-10 16:51 ` Derrick Stolee 1 sibling, 0 replies; 12+ messages in thread From: Derrick Stolee @ 2023-07-10 16:51 UTC (permalink / raw) To: Jeff King, Derrick Stolee via GitGitGadget Cc: Junio C Hamano, git, vdye, me, mjcheetham On 6/27/2023 1:30 AM, Jeff King wrote: >On Mon, Jun 26, 2023 at 03:09:57PM +0000, Derrick Stolee via GitGitGadget wrote: > >>+for pattern in "refs/heads/" "refs/tags/" "refs/remotes" >>+do >>+ test_perf "count $pattern: git for-each-ref | wc -l" " >>+ git for-each-ref $pattern | wc -l >>+ " >>+ >>+ test_perf "count $pattern: git for-each-ref --count-match" " >>+ git for-each-ref --count-matches $pattern >>+ " >>+done > >I don't think this is a very realistic perf test, because for-each-ref >is doing a bunch of work to generate its default format, only to have >"wc" throw most of it away. Doing: > > git for-each-ref --format='%(refname)' | wc -l > >is much better (obviously you have to remember to do that if you care >about optimizing your command, but that's true of --count-matches, too). Thank you for pointing this out! I'll be sure to modify the test and the analysis about it. The other check I need to compare is when multiple refspecs are provided at the same time, which I do believe still is a valuable thing to combine into a single process instead of multiple pipes to 'wc'. Did you have any thoughts on whether or not this works as an option in 'git for-each-ref' or would be better broken into a new builtin? Thanks, -Stolee ^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread, other threads:[~2023-07-11 14:49 UTC | newest] Thread overview: 12+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2023-06-26 15:09 [PATCH 0/2] [RFC] for-each-ref: add --count-matches mode Derrick Stolee via GitGitGadget 2023-06-26 15:09 ` [PATCH 1/2] for-each-ref: extract ref output loop Derrick Stolee via GitGitGadget 2023-06-26 15:09 ` [PATCH 2/2] for-each-ref: add --count-matches option Derrick Stolee via GitGitGadget 2023-06-26 16:14 ` Junio C Hamano 2023-06-27 7:30 ` Jeff King 2023-06-27 10:05 ` Phillip Wood 2023-06-27 18:22 ` Junio C Hamano 2023-06-27 19:59 ` Jeff King 2023-06-28 13:12 ` Phillip Wood 2023-06-28 17:08 ` Junio C Hamano 2023-07-11 14:48 ` René Scharfe 2023-07-10 16:51 ` Derrick Stolee
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).