All of lore.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2] ls-files: filter pathspec before lstat
@ 2026-06-09  2:37 Tamir Duberstein
  2026-06-09  3:26 ` Junio C Hamano
  0 siblings, 1 reply; 5+ messages in thread
From: Tamir Duberstein @ 2026-06-09  2:37 UTC (permalink / raw)
  To: git
  Cc: René Scharfe, Patrick Steinhardt, Junio C Hamano, Jeff King,
	Tamir Duberstein

show_files() checks whether each index entry is deleted or modified
before show_ce() applies the pathspec. prune_index() avoids most of this
work for pathspecs with a common directory prefix, but a top-level name
or leading wildcard leaves every entry to be checked.

For a single pathspec, match it before lstat() in the deleted and
modified modes. Keep the later match in show_ce() so --error-unmatch is
satisfied only by entries that are actually shown.

match_pathspec() is linear in the number of pathspec items. Applying it
early for every item can therefore multiply the work for commands with
many pathspecs, especially when lstat() shows that no entries are
modified. Restrict the early check to one pathspec. Callers with
multiple pathspecs retain the existing lstat()-first order.

On a repository with 859,211 index entries, a 19,931,862-byte index,
and 25,303,439 packed objects occupying 21.13 GiB, I exported $parent
and $this to binaries built from the parent and this commit, then ran:

    hyperfine --warmup 0 --runs 3 \
        --command-name parent \
        '$parent -c core.fsmonitor=false ls-files --deleted -- README.md' \
        --command-name 'this commit' \
        '$this -c core.fsmonitor=false ls-files --deleted -- README.md'

The results were:

             parent       this commit
  elapsed    60.742 s     1.061 s
  user        1.117 s     0.963 s
  system     10.740 s     0.042 s

For an all-matching pathspec, I used a checkout with 859,940 index
entries and ran:

    hyperfine --warmup 0 --runs 3 \
        --command-name parent \
        '$parent -c core.fsmonitor=false ls-files --deleted -- "*"' \
        --command-name 'this commit' \
        '$this -c core.fsmonitor=false ls-files --deleted -- "*"'

I repeated the benchmark with the commands reversed. The results were:

                         parent          this commit
  parent first elapsed    56.807 s        64.618 s
               user        1.256 s         1.270 s
               system     10.633 s        11.068 s
  patched first elapsed   63.361 s        64.316 s
                user       1.238 s         1.280 s
                system    10.296 s        11.864 s

The patched user-time means were 14 ms and 42 ms higher in the two
orderings. Elapsed time changed by several seconds when the order was
reversed, so those results do not show a stable wall-time ordering.

Jeff King pointed out that a preliminary match for each of many literal
pathspecs can be much more expensive. On a generated repository with
10,000 clean files, I recorded the paths with "git ls-files >paths".
With $v1 exported to a binary built from the implementation sent in v1,
I ran:

    hyperfine --warmup 2 --runs 10 \
        --command-name parent \
        '$parent ls-files -m -- $(cat paths) >/dev/null' \
        --command-name 'this commit' \
        '$this ls-files -m -- $(cat paths) >/dev/null'

I replaced $this with $v1 in a second invocation. The wall-clock means
and standard deviations were:

                         mean          standard deviation
  parent, final run     110.1 ms              4.1 ms
  this commit           104.9 ms              2.2 ms
  parent, v1 run        112.5 ms              6.6 ms
  unguarded v1          494.1 ms             17.2 ms

The guarded result matches the parent within the observed variation,
while avoiding the regression in v1.

All three revisions were built with -O3, -mcpu=native, and ThinLTO
using Apple clang 21.0.0 on macOS 26.5. The machine was a MacBook Pro
(Mac16,6) with a 16-core Apple M4 Max (12 performance and four
efficiency cores) and 128 GB RAM.

Link: https://lore.kernel.org/r/20260607-ls-files-pathspec-lstat-v1-1-8cf40b730146@gmail.com
Helped-by: Jeff King <peff@peff.net>
Signed-off-by: Tamir Duberstein <tamird@gmail.com>
---
A selective pathspec should let ls-files --deleted and --modified avoid
statting entries that cannot be shown. Match a single pathspec before
accessing the worktree, while preserving the existing lstat-first order
for multiple pathspecs whose matching cost grows linearly.
---
Changes in v2:
- Restrict early matching to one pathspec, avoiding the regression Jeff
  demonstrated with many pathspecs.
- Add all-matching and many-pathspec performance results.
- Drop the Assisted-by trailer.
- Link to v1: https://patch.msgid.link/20260607-ls-files-pathspec-lstat-v1-1-8cf40b730146@gmail.com
---
 builtin/ls-files.c                  | 11 +++++++++++
 t/meson.build                       |  1 +
 t/perf/p3010-ls-files.sh            | 31 +++++++++++++++++++++++++++++++
 t/t3010-ls-files-killed-modified.sh | 18 ++++++++++++++++++
 4 files changed, 61 insertions(+)

diff --git a/builtin/ls-files.c b/builtin/ls-files.c
index e1a22b41b9..8d7158652b 100644
--- a/builtin/ls-files.c
+++ b/builtin/ls-files.c
@@ -450,6 +450,17 @@ static void show_files(struct repository *repo, struct dir_struct *dir)
 			continue;
 		if (ce_skip_worktree(ce))
 			continue;
+		/*
+		 * match_pathspec() is linear in pathspec.nr, so prefilter only
+		 * the single-pathspec case. Only entries shown by show_ce()
+		 * satisfy --error-unmatch.
+		 */
+		if (pathspec.nr == 1 &&
+		    !match_pathspec(repo->index, &pathspec, fullname.buf,
+				    fullname.len, max_prefix_len, NULL,
+				    S_ISDIR(ce->ce_mode) ||
+				    S_ISGITLINK(ce->ce_mode)))
+			continue;
 		stat_err = lstat(fullname.buf, &st);
 		if (stat_err && (errno != ENOENT && errno != ENOTDIR))
 			error_errno("cannot lstat '%s'", fullname.buf);
diff --git a/t/meson.build b/t/meson.build
index 2af8d01279..ee8086e6ef 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -1140,6 +1140,7 @@ benchmarks = [
   'perf/p1500-graph-walks.sh',
   'perf/p1501-rev-parse-oneline.sh',
   'perf/p2000-sparse-operations.sh',
+  'perf/p3010-ls-files.sh',
   'perf/p3400-rebase.sh',
   'perf/p3404-rebase-interactive.sh',
   'perf/p4000-diff-algorithms.sh',
diff --git a/t/perf/p3010-ls-files.sh b/t/perf/p3010-ls-files.sh
new file mode 100755
index 0000000000..ae14449432
--- /dev/null
+++ b/t/perf/p3010-ls-files.sh
@@ -0,0 +1,31 @@
+#!/bin/sh
+
+test_description='Tests ls-files worktree performance'
+
+. ./perf-lib.sh
+
+test_perf_large_repo
+test_checkout_worktree
+
+test_expect_success 'select a zero-prefix pathspec' '
+	tracked_file=$(git ls-files | sed -n 1p) &&
+	test -n "$tracked_file" &&
+	pathspec="?${tracked_file#?}" &&
+	test_export pathspec
+'
+
+test_perf 'ls-files --deleted with pathspec' '
+	git -c core.fsmonitor=false ls-files --deleted \
+		-- "$pathspec" >/dev/null
+'
+
+test_perf 'ls-files --deleted with all-matching pathspec' '
+	git -c core.fsmonitor=false ls-files --deleted -- "*" >/dev/null
+'
+
+test_perf 'ls-files --modified with pathspec' '
+	git -c core.fsmonitor=false ls-files --modified \
+		-- "$pathspec" >/dev/null
+'
+
+test_done
diff --git a/t/t3010-ls-files-killed-modified.sh b/t/t3010-ls-files-killed-modified.sh
index 7af4532cd1..6e38e10219 100755
--- a/t/t3010-ls-files-killed-modified.sh
+++ b/t/t3010-ls-files-killed-modified.sh
@@ -124,4 +124,22 @@ test_expect_success 'validate git ls-files -m output.' '
 	test_cmp .expected .output
 '
 
+test_expect_success 'worktree modes honor wildcard pathspecs' '
+	cat >.expected <<-\EOF &&
+	path2/file2
+	path3/file3
+	EOF
+	git ls-files --deleted -- "path?/file?" >.output &&
+	test_cmp .expected .output &&
+
+	cat >.expected <<-\EOF &&
+	path7
+	path8
+	EOF
+	git ls-files --modified --error-unmatch -- "path[78]" >.output &&
+	test_cmp .expected .output &&
+
+	test_must_fail git ls-files --modified --error-unmatch -- path10
+'
+
 test_done

---
base-commit: 9ac3f193c05c2237e2b14ebaa1149e9fc8a1abe0
change-id: 20260607-ls-files-pathspec-lstat-885125a5d644

Best regards,
--  
Tamir Duberstein <tamird@gmail.com>


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] ls-files: filter pathspec before lstat
  2026-06-09  2:37 [PATCH v2] ls-files: filter pathspec before lstat Tamir Duberstein
@ 2026-06-09  3:26 ` Junio C Hamano
  2026-06-09  3:38   ` Tamir Duberstein
  2026-06-09  3:42   ` Junio C Hamano
  0 siblings, 2 replies; 5+ messages in thread
From: Junio C Hamano @ 2026-06-09  3:26 UTC (permalink / raw)
  To: Tamir Duberstein; +Cc: git, René Scharfe, Patrick Steinhardt, Jeff King

Tamir Duberstein <tamird@gmail.com> writes:

> show_files() checks whether each index entry is deleted or modified
> before show_ce() applies the pathspec. prune_index() avoids most of this
> work for pathspecs with a common directory prefix, but a top-level name
> or leading wildcard leaves every entry to be checked.
> ...

Please make sure that your v2 is a response to v1; otherwise loses
sight of the previous iteration.

> Changes in v2:
> - Restrict early matching to one pathspec, avoiding the regression Jeff
>   demonstrated with many pathspecs.
> - Add all-matching and many-pathspec performance results.
> - Drop the Assisted-by trailer.
> - Link to v1: https://patch.msgid.link/20260607-ls-files-pathspec-lstat-v1-1-8cf40b730146@gmail.com

And it is *not* a replacement to force human to follow such a link.

Instead, please make sure each piece of your e-mail identifies where
it fits in the discussion thread by pointing the message of the
previous round with its In-Reply-To: header.

Thanks.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] ls-files: filter pathspec before lstat
  2026-06-09  3:26 ` Junio C Hamano
@ 2026-06-09  3:38   ` Tamir Duberstein
  2026-06-09  3:42   ` Junio C Hamano
  1 sibling, 0 replies; 5+ messages in thread
From: Tamir Duberstein @ 2026-06-09  3:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, René Scharfe, Patrick Steinhardt, Jeff King

On Mon, Jun 8, 2026 at 8:26 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Tamir Duberstein <tamird@gmail.com> writes:
>
> > show_files() checks whether each index entry is deleted or modified
> > before show_ce() applies the pathspec. prune_index() avoids most of this
> > work for pathspecs with a common directory prefix, but a top-level name
> > or leading wildcard leaves every entry to be checked.
> > ...
>
> Please make sure that your v2 is a response to v1; otherwise loses
> sight of the previous iteration.
>
> > Changes in v2:
> > - Restrict early matching to one pathspec, avoiding the regression Jeff
> >   demonstrated with many pathspecs.
> > - Add all-matching and many-pathspec performance results.
> > - Drop the Assisted-by trailer.
> > - Link to v1: https://patch.msgid.link/20260607-ls-files-pathspec-lstat-v1-1-8cf40b730146@gmail.com
>
> And it is *not* a replacement to force human to follow such a link.
>
> Instead, please make sure each piece of your e-mail identifies where
> it fits in the discussion thread by pointing the message of the
> previous round with its In-Reply-To: header.
>
> Thanks.

Apologies, I used b4 which follows kernel rules. I'll follow this
guidance in the future.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] ls-files: filter pathspec before lstat
  2026-06-09  3:26 ` Junio C Hamano
  2026-06-09  3:38   ` Tamir Duberstein
@ 2026-06-09  3:42   ` Junio C Hamano
  2026-06-09  3:48     ` Tamir Duberstein
  1 sibling, 1 reply; 5+ messages in thread
From: Junio C Hamano @ 2026-06-09  3:42 UTC (permalink / raw)
  To: Tamir Duberstein; +Cc: git, René Scharfe, Patrick Steinhardt, Jeff King

Junio C Hamano <gitster@pobox.com> writes:

> Please make sure that your v2 is a response to v1; otherwise loses
> sight of the previous iteration.
>
>> Changes in v2:
>> - Restrict early matching to one pathspec, avoiding the regression Jeff
>>   demonstrated with many pathspecs.
>> - Add all-matching and many-pathspec performance results.
>> - Drop the Assisted-by trailer.
>> - Link to v1: https://patch.msgid.link/20260607-ls-files-pathspec-lstat-v1-1-8cf40b730146@gmail.com
>
> And it is *not* a replacement to force human to follow such a link.
>
> Instead, please make sure each piece of your e-mail identifies where
> it fits in the discussion thread by pointing the message of the
> previous round with its In-Reply-To: header.

I won't complain about them individually, but it seems that all the
other v2 in different topics from you share the same problem.

Documentation/SubmittingPatches expect that the messages on the same
topic are threaded with In-Reply-To: headers; e-mail based workflow
tools like "b4" offer a useful feature that lets the user to feed
the message ID of an earlier round and fetch the messages in the
latest round.  As the message IDs of an earlier round that have
become commits for v1 are known in the refs/notes/amlog notes
(published at the usual places), replacing a topic with its newer
iteration becomes:

 (0) check out the previous round.

 (1) learn the message ID of the previous round we have checked out
     using notes/amlog (e.g., "git show -s --notes=amlog HEAD"),

 (2) detach the HEAD at the base of the previous round (roughly "git
     checkout master...HEAD", but not always),

 (3) ask "b4 am" to fetch the latest round of the thread the message
     we found in step (1) belongs to, and apply these new patches,

 (4) run "git range-diff @{-1}...HEAD".

which is very much automatable.

Unless an author breaks the thread, that is.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH v2] ls-files: filter pathspec before lstat
  2026-06-09  3:42   ` Junio C Hamano
@ 2026-06-09  3:48     ` Tamir Duberstein
  0 siblings, 0 replies; 5+ messages in thread
From: Tamir Duberstein @ 2026-06-09  3:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git, René Scharfe, Patrick Steinhardt, Jeff King

On Mon, Jun 8, 2026 at 8:42 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> Junio C Hamano <gitster@pobox.com> writes:
>
> > Please make sure that your v2 is a response to v1; otherwise loses
> > sight of the previous iteration.
> >
> >> Changes in v2:
> >> - Restrict early matching to one pathspec, avoiding the regression Jeff
> >>   demonstrated with many pathspecs.
> >> - Add all-matching and many-pathspec performance results.
> >> - Drop the Assisted-by trailer.
> >> - Link to v1: https://patch.msgid.link/20260607-ls-files-pathspec-lstat-v1-1-8cf40b730146@gmail.com
> >
> > And it is *not* a replacement to force human to follow such a link.
> >
> > Instead, please make sure each piece of your e-mail identifies where
> > it fits in the discussion thread by pointing the message of the
> > previous round with its In-Reply-To: header.
>
> I won't complain about them individually, but it seems that all the
> other v2 in different topics from you share the same problem.
>
> Documentation/SubmittingPatches expect that the messages on the same
> topic are threaded with In-Reply-To: headers; e-mail based workflow
> tools like "b4" offer a useful feature that lets the user to feed
> the message ID of an earlier round and fetch the messages in the
> latest round.  As the message IDs of an earlier round that have
> become commits for v1 are known in the refs/notes/amlog notes
> (published at the usual places), replacing a topic with its newer
> iteration becomes:
>
>  (0) check out the previous round.
>
>  (1) learn the message ID of the previous round we have checked out
>      using notes/amlog (e.g., "git show -s --notes=amlog HEAD"),
>
>  (2) detach the HEAD at the base of the previous round (roughly "git
>      checkout master...HEAD", but not always),
>
>  (3) ask "b4 am" to fetch the latest round of the thread the message
>      we found in step (1) belongs to, and apply these new patches,
>
>  (4) run "git range-diff @{-1}...HEAD".
>
> which is very much automatable.
>
> Unless an author breaks the thread, that is.

Yes, heard loud and clear. As I mentioned on the other thread, I
followed kernel conventions here by using b4. That's my fault. Sorry
about that. I'll do the proper thing in future mailings.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2026-06-09  3:48 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-09  2:37 [PATCH v2] ls-files: filter pathspec before lstat Tamir Duberstein
2026-06-09  3:26 ` Junio C Hamano
2026-06-09  3:38   ` Tamir Duberstein
2026-06-09  3:42   ` Junio C Hamano
2026-06-09  3:48     ` Tamir Duberstein

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.