* [PATCH] ls-files: filter pathspec before lstat
@ 2026-06-07 15:40 Tamir Duberstein
2026-06-07 16:02 ` Kristoffer Haugsbakk
2026-06-08 13:06 ` Junio C Hamano
0 siblings, 2 replies; 10+ messages in thread
From: Tamir Duberstein @ 2026-06-07 15:40 UTC (permalink / raw)
To: git; +Cc: René Scharfe, Patrick Steinhardt, Junio C Hamano,
Tamir Duberstein
show_files() checks whether each index entry is deleted or modified
before show_ce() applies the pathspec. prune_index() avoids most of this
work for pathspecs with a common directory prefix, but a top-level name
or leading wildcard leaves every entry to be checked.
Match the pathspec before lstat() for the deleted and modified modes.
Keep the later match in show_ce() so --error-unmatch is satisfied only
by entries that are actually shown.
On a repository with 859,211 index entries, a 19,931,862-byte index, and
25,303,439 packed objects occupying 21.13 GiB, I ran the following
command with the parent and patched binaries:
hyperfine --warmup 0 --runs 3 \
'git -c core.fsmonitor=false ls-files --deleted -- README.md'
The results were:
parent this commit
elapsed 60.742 s 1.061 s
user 1.117 s 0.963 s
system 10.740 s 0.042 s
Both revisions were built with -O3, -mcpu=native, and ThinLTO using
Apple clang 21.0.0 on macOS 26.5. The machine was a MacBook Pro
(Mac16,6) with a 16-core Apple M4 Max (12 performance and four
efficiency cores) and 128 GB RAM.
Assisted-by: Codex gpt-5.5
Signed-off-by: Tamir Duberstein <tamird@gmail.com>
---
builtin/ls-files.c | 7 +++++++
t/meson.build | 1 +
t/perf/p3010-ls-files.sh | 27 +++++++++++++++++++++++++++
t/t3010-ls-files-killed-modified.sh | 18 ++++++++++++++++++
4 files changed, 53 insertions(+)
diff --git a/builtin/ls-files.c b/builtin/ls-files.c
index e1a22b41b9..702c607183 100644
--- a/builtin/ls-files.c
+++ b/builtin/ls-files.c
@@ -450,6 +450,13 @@ static void show_files(struct repository *repo, struct dir_struct *dir)
continue;
if (ce_skip_worktree(ce))
continue;
+ /* Only entries shown by show_ce() satisfy --error-unmatch. */
+ if (pathspec.nr &&
+ !match_pathspec(repo->index, &pathspec, fullname.buf,
+ fullname.len, max_prefix_len, NULL,
+ S_ISDIR(ce->ce_mode) ||
+ S_ISGITLINK(ce->ce_mode)))
+ continue;
stat_err = lstat(fullname.buf, &st);
if (stat_err && (errno != ENOENT && errno != ENOTDIR))
error_errno("cannot lstat '%s'", fullname.buf);
diff --git a/t/meson.build b/t/meson.build
index 2af8d01279..ee8086e6ef 100644
--- a/t/meson.build
+++ b/t/meson.build
@@ -1140,6 +1140,7 @@ benchmarks = [
'perf/p1500-graph-walks.sh',
'perf/p1501-rev-parse-oneline.sh',
'perf/p2000-sparse-operations.sh',
+ 'perf/p3010-ls-files.sh',
'perf/p3400-rebase.sh',
'perf/p3404-rebase-interactive.sh',
'perf/p4000-diff-algorithms.sh',
diff --git a/t/perf/p3010-ls-files.sh b/t/perf/p3010-ls-files.sh
new file mode 100755
index 0000000000..bb80768063
--- /dev/null
+++ b/t/perf/p3010-ls-files.sh
@@ -0,0 +1,27 @@
+#!/bin/sh
+
+test_description='Tests ls-files worktree performance'
+
+. ./perf-lib.sh
+
+test_perf_large_repo
+test_checkout_worktree
+
+test_expect_success 'select a zero-prefix pathspec' '
+ tracked_file=$(git ls-files | sed -n 1p) &&
+ test -n "$tracked_file" &&
+ pathspec="?${tracked_file#?}" &&
+ test_export pathspec
+'
+
+test_perf 'ls-files --deleted with pathspec' '
+ git -c core.fsmonitor=false ls-files --deleted \
+ -- "$pathspec" >/dev/null
+'
+
+test_perf 'ls-files --modified with pathspec' '
+ git -c core.fsmonitor=false ls-files --modified \
+ -- "$pathspec" >/dev/null
+'
+
+test_done
diff --git a/t/t3010-ls-files-killed-modified.sh b/t/t3010-ls-files-killed-modified.sh
index 7af4532cd1..6e38e10219 100755
--- a/t/t3010-ls-files-killed-modified.sh
+++ b/t/t3010-ls-files-killed-modified.sh
@@ -124,4 +124,22 @@ test_expect_success 'validate git ls-files -m output.' '
test_cmp .expected .output
'
+test_expect_success 'worktree modes honor wildcard pathspecs' '
+ cat >.expected <<-\EOF &&
+ path2/file2
+ path3/file3
+ EOF
+ git ls-files --deleted -- "path?/file?" >.output &&
+ test_cmp .expected .output &&
+
+ cat >.expected <<-\EOF &&
+ path7
+ path8
+ EOF
+ git ls-files --modified --error-unmatch -- "path[78]" >.output &&
+ test_cmp .expected .output &&
+
+ test_must_fail git ls-files --modified --error-unmatch -- path10
+'
+
test_done
---
base-commit: 9ac3f193c05c2237e2b14ebaa1149e9fc8a1abe0
change-id: 20260607-ls-files-pathspec-lstat-885125a5d644
Best regards,
--
Tamir Duberstein <tamird@gmail.com>
^ permalink raw reply related [flat|nested] 10+ messages in thread* Re: [PATCH] ls-files: filter pathspec before lstat
2026-06-07 15:40 [PATCH] ls-files: filter pathspec before lstat Tamir Duberstein
@ 2026-06-07 16:02 ` Kristoffer Haugsbakk
2026-06-07 16:07 ` Tamir Duberstein
2026-06-08 13:06 ` Junio C Hamano
1 sibling, 1 reply; 10+ messages in thread
From: Kristoffer Haugsbakk @ 2026-06-07 16:02 UTC (permalink / raw)
To: Tamir Duberstein, git
Cc: René Scharfe, Patrick Steinhardt, Junio C Hamano
On Sun, Jun 7, 2026, at 17:40, Tamir Duberstein wrote:
>[snip]
> Assisted-by: Codex gpt-5.5
This is more of a Git for Windows trailer. The Git project doesn’t
document its use.
An aside here but these trailers attributing specific LLMs feels like
etching “Peter was here” under some table. What benefit for the project
does knowing that it was this version of Codex or Claude or something?
A link to the prompt/conversation would provide provenance and show how
the LLM was used. But three years from now, what information beyond the
fact that an LLM was involved (any of them) does this offer?
I can understand the benefit for the companies behind these LLMs to have
these attributions in OSS projects.
I have done the same thing in our company repo, crediting <LLM> for
authoring or co-authoring or helping with a specific thing. Using a
“people” trailer. But the intent was just to show how some LLM was
involved. So I think I am going to switch to the following trailer for
our company repo.
LLM: Yes
> Signed-off-by: Tamir Duberstein <tamird@gmail.com>
> ---
>[snip]
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [PATCH] ls-files: filter pathspec before lstat
2026-06-07 16:02 ` Kristoffer Haugsbakk
@ 2026-06-07 16:07 ` Tamir Duberstein
2026-06-07 16:17 ` Kristoffer Haugsbakk
0 siblings, 1 reply; 10+ messages in thread
From: Tamir Duberstein @ 2026-06-07 16:07 UTC (permalink / raw)
To: Kristoffer Haugsbakk
Cc: git, René Scharfe, Patrick Steinhardt, Junio C Hamano
On Sun, Jun 7, 2026 at 12:02 PM Kristoffer Haugsbakk
<kristofferhaugsbakk@fastmail.com> wrote:
>
> On Sun, Jun 7, 2026, at 17:40, Tamir Duberstein wrote:
> >[snip]
> > Assisted-by: Codex gpt-5.5
>
> This is more of a Git for Windows trailer. The Git project doesn’t
> document its use.
>
> An aside here but these trailers attributing specific LLMs feels like
> etching “Peter was here” under some table. What benefit for the project
> does knowing that it was this version of Codex or Claude or something?
> A link to the prompt/conversation would provide provenance and show how
> the LLM was used. But three years from now, what information beyond the
> fact that an LLM was involved (any of them) does this offer?
>
> I can understand the benefit for the companies behind these LLMs to have
> these attributions in OSS projects.
>
> I have done the same thing in our company repo, crediting <LLM> for
> authoring or co-authoring or helping with a specific thing. Using a
> “people” trailer. But the intent was just to show how some LLM was
> involved. So I think I am going to switch to the following trailer for
> our company repo.
>
> LLM: Yes
This all sounds reasonable to me. The kernel has started asking for
this trailer (https://github.com/torvalds/linux/commit/78d979db6cef557c171d6059cbce06c3db89c7ee)
and I saw precedent in Git as recently as last month
(https://github.com/git/git/commit/7a094d68a27e321a99c8ab6b700909e503904bd9)
so I erred on the side of caution.
I am also OK with this trailer being dropped or replaced on apply.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] ls-files: filter pathspec before lstat
2026-06-07 16:07 ` Tamir Duberstein
@ 2026-06-07 16:17 ` Kristoffer Haugsbakk
2026-06-07 18:24 ` Tamir Duberstein
0 siblings, 1 reply; 10+ messages in thread
From: Kristoffer Haugsbakk @ 2026-06-07 16:17 UTC (permalink / raw)
To: Tamir Duberstein
Cc: git, René Scharfe, Patrick Steinhardt, Junio C Hamano
On Sun, Jun 7, 2026, at 18:07, Tamir Duberstein wrote:
> On Sun, Jun 7, 2026 at 12:02 PM Kristoffer Haugsbakk
> <kristofferhaugsbakk@fastmail.com> wrote:
>>[snip]
>>
>> I have done the same thing in our company repo, crediting <LLM> for
>> authoring or co-authoring or helping with a specific thing. Using a
>> “people” trailer. But the intent was just to show how some LLM was
>> involved. So I think I am going to switch to the following trailer for
>> our company repo.
>>
>> LLM: Yes
>
> This all sounds reasonable to me. The kernel has started asking for
> this trailer
> (https://github.com/torvalds/linux/commit/78d979db6cef557c171d6059cbce06c3db89c7ee)
> and I saw precedent in Git as recently as last month
> (https://github.com/git/git/commit/7a094d68a27e321a99c8ab6b700909e503904bd9)
> so I erred on the side of caution.
>
> I am also OK with this trailer being dropped or replaced on apply.
The most important thing to be aware of is “Use of Artificial
Intelligence (AI)” in `Documentation/SubmittingPatches`. :)
Thanks
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] ls-files: filter pathspec before lstat
2026-06-07 16:17 ` Kristoffer Haugsbakk
@ 2026-06-07 18:24 ` Tamir Duberstein
0 siblings, 0 replies; 10+ messages in thread
From: Tamir Duberstein @ 2026-06-07 18:24 UTC (permalink / raw)
To: Kristoffer Haugsbakk
Cc: git, René Scharfe, Patrick Steinhardt, Junio C Hamano
On Sun, Jun 7, 2026 at 12:17 PM Kristoffer Haugsbakk
<kristofferhaugsbakk@fastmail.com> wrote:
>
> On Sun, Jun 7, 2026, at 18:07, Tamir Duberstein wrote:
> > On Sun, Jun 7, 2026 at 12:02 PM Kristoffer Haugsbakk
> > <kristofferhaugsbakk@fastmail.com> wrote:
> >>[snip]
> >>
> >> I have done the same thing in our company repo, crediting <LLM> for
> >> authoring or co-authoring or helping with a specific thing. Using a
> >> “people” trailer. But the intent was just to show how some LLM was
> >> involved. So I think I am going to switch to the following trailer for
> >> our company repo.
> >>
> >> LLM: Yes
> >
> > This all sounds reasonable to me. The kernel has started asking for
> > this trailer
> > (https://github.com/torvalds/linux/commit/78d979db6cef557c171d6059cbce06c3db89c7ee)
> > and I saw precedent in Git as recently as last month
> > (https://github.com/git/git/commit/7a094d68a27e321a99c8ab6b700909e503904bd9)
> > so I erred on the side of caution.
> >
> > I am also OK with this trailer being dropped or replaced on apply.
>
> The most important thing to be aware of is “Use of Artificial
> Intelligence (AI)” in `Documentation/SubmittingPatches`. :)
Acknowledge, thanks! Will omit the trailer on future submissions.
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] ls-files: filter pathspec before lstat
2026-06-07 15:40 [PATCH] ls-files: filter pathspec before lstat Tamir Duberstein
2026-06-07 16:02 ` Kristoffer Haugsbakk
@ 2026-06-08 13:06 ` Junio C Hamano
2026-06-08 19:15 ` Tamir Duberstein
2026-06-08 23:03 ` Jeff King
1 sibling, 2 replies; 10+ messages in thread
From: Junio C Hamano @ 2026-06-08 13:06 UTC (permalink / raw)
To: Tamir Duberstein; +Cc: git, René Scharfe, Patrick Steinhardt
On Sun, Jun 7, 2026 at 11:40, Tamir Duberstein wrote:
> show_files() checks whether each index entry is deleted or modified
> before show_ce() applies the pathspec. prune_index() avoids most of this
> work for pathspecs with a common directory prefix, but a top-level name
> or leading wildcard leaves every entry to be checked.
>
> Match the pathspec before lstat() for the deleted and modified modes.
> Keep the later match in show_ce() so --error-unmatch is satisfied only
> by entries that are actually shown.
Adding an extra early `match_pathspec()` check before making slow
system calls like `lstat()` makes sense, especially when most of the
index entries need to be skipped. But if most of them would match,
then we would end up doing the same match_pathspec() calls twice for
each path, and run lstat() anyway, so you may also be able to
construct a perf test that demonstrates a case where this approach
is not a clear win (or even degradation), perhaps?
> diff --git a/builtin/ls-files.c b/builtin/ls-files.c
> index e1a22b41b9..702c607183 100644
> --- a/builtin/ls-files.c
> +++ b/builtin/ls-files.c
> @@ -450,6 +450,13 @@ static void show_files(struct repository *repo, struct dir_struct *dir)
> continue;
> if (ce_skip_worktree(ce))
> continue;
> + /* Only entries shown by show_ce() satisfy --error-unmatch. */
> + if (pathspec.nr &&
> + !match_pathspec(repo->index, &pathspec, fullname.buf,
> + fullname.len, max_prefix_len, NULL,
> + S_ISDIR(ce->ce_mode) ||
> + S_ISGITLINK(ce->ce_mode)))
> + continue;
> stat_err = lstat(fullname.buf, &st);
> if (stat_err && (errno != ENOENT && errno != ENOTDIR))
> error_errno("cannot lstat '%s'", fullname.buf);
Hmph. In the current code, because there is no such pre-filtering,
show_ce() would unconditionally recurse into active submodules when
told to with the "--recurse-submodules" flag, even if your pathspec
coes not match the submodule. With this change, such a submodule
whose path does not match the pathspec would not even be seen by
show_ce(). Would it cause a change in behaviour?
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [PATCH] ls-files: filter pathspec before lstat
2026-06-08 13:06 ` Junio C Hamano
@ 2026-06-08 19:15 ` Tamir Duberstein
2026-06-08 23:03 ` Jeff King
1 sibling, 0 replies; 10+ messages in thread
From: Tamir Duberstein @ 2026-06-08 19:15 UTC (permalink / raw)
To: Junio C Hamano; +Cc: git, René Scharfe, Patrick Steinhardt
On Mon, Jun 8, 2026 at 6:06 AM Junio C Hamano <gitster@pobox.com> wrote:
>
> On Sun, Jun 7, 2026 at 11:40, Tamir Duberstein wrote:
> > show_files() checks whether each index entry is deleted or modified
> > before show_ce() applies the pathspec. prune_index() avoids most of this
> > work for pathspecs with a common directory prefix, but a top-level name
> > or leading wildcard leaves every entry to be checked.
> >
> > Match the pathspec before lstat() for the deleted and modified modes.
> > Keep the later match in show_ce() so --error-unmatch is satisfied only
> > by entries that are actually shown.
>
> Adding an extra early `match_pathspec()` check before making slow
> system calls like `lstat()` makes sense, especially when most of the
> index entries need to be skipped. But if most of them would match,
> then we would end up doing the same match_pathspec() calls twice for
> each path, and run lstat() anyway, so you may also be able to
> construct a perf test that demonstrates a case where this approach
> is not a clear win (or even degradation), perhaps?
Yes. I added an all-matching pathspec case to p3010 and ran:
hyperfine --warmup 0 --runs 3 \
'git -c core.fsmonitor=false ls-files --deleted -- "*"'
On a checkout with 859,940 index entries, I ran the parent and patched
binaries in both orders:
parent this commit
parent first elapsed 56.807 s 64.618 s
user 1.256 s 1.270 s
system 10.633 s 11.068 s
patched first elapsed 63.361 s 64.316 s
user 1.238 s 1.280 s
system 10.296 s 11.864 s
The added match costs 14-42 ms of user time in this case. Elapsed time
varies by several seconds with command order, obscuring that CPU cost.
The later match in show_ce() is reached only for entries actually found
deleted or modified. This case therefore exercises the extra match for
every index entry while still performing every lstat().
>
> > diff --git a/builtin/ls-files.c b/builtin/ls-files.c
> > index e1a22b41b9..702c607183 100644
> > --- a/builtin/ls-files.c
> > +++ b/builtin/ls-files.c
> > @@ -450,6 +450,13 @@ static void show_files(struct repository *repo, struct dir_struct *dir)
> > continue;
> > if (ce_skip_worktree(ce))
> > continue;
> > + /* Only entries shown by show_ce() satisfy --error-unmatch. */
> > + if (pathspec.nr &&
> > + !match_pathspec(repo->index, &pathspec, fullname.buf,
> > + fullname.len, max_prefix_len, NULL,
> > + S_ISDIR(ce->ce_mode) ||
> > + S_ISGITLINK(ce->ce_mode)))
> > + continue;
> > stat_err = lstat(fullname.buf, &st);
> > if (stat_err && (errno != ENOENT && errno != ENOTDIR))
> > error_errno("cannot lstat '%s'", fullname.buf);
>
> Hmph. In the current code, because there is no such pre-filtering,
> show_ce() would unconditionally recurse into active submodules when
> told to with the "--recurse-submodules" flag, even if your pathspec
> coes not match the submodule. With this change, such a submodule
> whose path does not match the pathspec would not even be seen by
> show_ce(). Would it cause a change in behaviour?
This path cannot affect --recurse-submodules. cmd_ls_files() rejects
--recurse-submodules together with either --deleted or --modified before
calling show_files(), and the new check is reached only for those two
modes. Cached and stage output continue to call show_ce() before the new
check. t3007 already verifies that both combinations are rejected.
Given the 60.742 s to 1.061 s improvement for a selective pathspec, I
think this small CPU cost for an all-matching pathspec is a worthwhile
tradeoff. What do you think?
Thanks for the review!
Tamir
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [PATCH] ls-files: filter pathspec before lstat
2026-06-08 13:06 ` Junio C Hamano
2026-06-08 19:15 ` Tamir Duberstein
@ 2026-06-08 23:03 ` Jeff King
2026-06-08 23:25 ` Jeff King
1 sibling, 1 reply; 10+ messages in thread
From: Jeff King @ 2026-06-08 23:03 UTC (permalink / raw)
To: Junio C Hamano
Cc: Tamir Duberstein, git, René Scharfe, Patrick Steinhardt
On Mon, Jun 08, 2026 at 06:06:33AM -0700, Junio C Hamano wrote:
> On Sun, Jun 7, 2026 at 11:40, Tamir Duberstein wrote:
> > show_files() checks whether each index entry is deleted or modified
> > before show_ce() applies the pathspec. prune_index() avoids most of this
> > work for pathspecs with a common directory prefix, but a top-level name
> > or leading wildcard leaves every entry to be checked.
> >
> > Match the pathspec before lstat() for the deleted and modified modes.
> > Keep the later match in show_ce() so --error-unmatch is satisfied only
> > by entries that are actually shown.
>
> Adding an extra early `match_pathspec()` check before making slow
> system calls like `lstat()` makes sense, especially when most of the
> index entries need to be skipped. But if most of them would match,
> then we would end up doing the same match_pathspec() calls twice for
> each path, and run lstat() anyway, so you may also be able to
> construct a perf test that demonstrates a case where this approach
> is not a clear win (or even degradation), perhaps?
The patchspec matching is linear in the number of pathspecs, so it's
easy to get quadratic-ish results by just asking about:
git ls-files -- $(git ls-files)
So that probably provides an easy regression demonstration for this
patch.
I don't know how much it matters in the real world. That command is
_already_ slow, and mostly people don't care that much. Long ago I had
patches to build a trie of literal pathspecs, with the intent that
blame-tree/last-modified could use the pathspec mechanism, but
ultimately we took the code in a different direction. And nobody really
complained about it since. ;)
-Peff
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [PATCH] ls-files: filter pathspec before lstat
2026-06-08 23:03 ` Jeff King
@ 2026-06-08 23:25 ` Jeff King
2026-06-09 0:13 ` Tamir Duberstein
0 siblings, 1 reply; 10+ messages in thread
From: Jeff King @ 2026-06-08 23:25 UTC (permalink / raw)
To: Junio C Hamano
Cc: Tamir Duberstein, git, René Scharfe, Patrick Steinhardt
On Mon, Jun 08, 2026 at 07:03:15PM -0400, Jeff King wrote:
> > Adding an extra early `match_pathspec()` check before making slow
> > system calls like `lstat()` makes sense, especially when most of the
> > index entries need to be skipped. But if most of them would match,
> > then we would end up doing the same match_pathspec() calls twice for
> > each path, and run lstat() anyway, so you may also be able to
> > construct a perf test that demonstrates a case where this approach
> > is not a clear win (or even degradation), perhaps?
>
> The patchspec matching is linear in the number of pathspecs, so it's
> easy to get quadratic-ish results by just asking about:
>
> git ls-files -- $(git ls-files)
>
> So that probably provides an easy regression demonstration for this
> patch.
Ah, yeah, it is easy to demonstrate. Making a repo of size $n like this:
n=10000
git init
for i in $(seq $n); do
echo $i >file$i
done
git add .
git commit -m foo
If we then run:
time git ls-files -- $(git ls-files) >/dev/null
then n=1000 takes ~15ms for me, but n=10000 takes ~800ms. So that shows
the slowdown of the existing pathspec code as the number of pathspecs
grows.
With this patch, starting with n=10000 and adding in "-m" (which
triggers the code in this patch), like:
time git ls-files -m -- $(git ls-files) >/dev/null
the time goes from ~15ms (without the patch) to ~800ms with it. Which
makes sense. Nothing is modified, so the current code which puts the
lstat() check first eliminates each entry before we even consider
pathspecs. So it doesn't hit the slow case at all.
But after the patch, we do a preliminary pathspec match and
pay the cost.
So it really is a question of how many items are actually modified, the
cost of lstat(), and the cost of pathspec matching (which varies with
the size of the pathspec).
But like I said, this is kind of a silly case. If it actually starts to
matter in the real world, I think it may be more productive to make the
pathspec code scale better.
-Peff
^ permalink raw reply [flat|nested] 10+ messages in thread* Re: [PATCH] ls-files: filter pathspec before lstat
2026-06-08 23:25 ` Jeff King
@ 2026-06-09 0:13 ` Tamir Duberstein
0 siblings, 0 replies; 10+ messages in thread
From: Tamir Duberstein @ 2026-06-09 0:13 UTC (permalink / raw)
To: Jeff King; +Cc: Junio C Hamano, git, René Scharfe, Patrick Steinhardt
On Mon, Jun 8, 2026 at 4:25 PM Jeff King <peff@peff.net> wrote:
>
> On Mon, Jun 08, 2026 at 07:03:15PM -0400, Jeff King wrote:
>
> > > Adding an extra early `match_pathspec()` check before making slow
> > > system calls like `lstat()` makes sense, especially when most of the
> > > index entries need to be skipped. But if most of them would match,
> > > then we would end up doing the same match_pathspec() calls twice for
> > > each path, and run lstat() anyway, so you may also be able to
> > > construct a perf test that demonstrates a case where this approach
> > > is not a clear win (or even degradation), perhaps?
> >
> > The patchspec matching is linear in the number of pathspecs, so it's
> > easy to get quadratic-ish results by just asking about:
> >
> > git ls-files -- $(git ls-files)
> >
> > So that probably provides an easy regression demonstration for this
> > patch.
>
> Ah, yeah, it is easy to demonstrate. Making a repo of size $n like this:
>
> n=10000
> git init
> for i in $(seq $n); do
> echo $i >file$i
> done
> git add .
> git commit -m foo
>
> If we then run:
>
> time git ls-files -- $(git ls-files) >/dev/null
>
> then n=1000 takes ~15ms for me, but n=10000 takes ~800ms. So that shows
> the slowdown of the existing pathspec code as the number of pathspecs
> grows.
>
> With this patch, starting with n=10000 and adding in "-m" (which
> triggers the code in this patch), like:
>
> time git ls-files -m -- $(git ls-files) >/dev/null
>
> the time goes from ~15ms (without the patch) to ~800ms with it. Which
> makes sense. Nothing is modified, so the current code which puts the
> lstat() check first eliminates each entry before we even consider
> pathspecs. So it doesn't hit the slow case at all.
>
> But after the patch, we do a preliminary pathspec match and
> pay the cost.
>
> So it really is a question of how many items are actually modified, the
> cost of lstat(), and the cost of pathspec matching (which varies with
> the size of the pathspec).
>
> But like I said, this is kind of a silly case. If it actually starts to
> matter in the real world, I think it may be more productive to make the
> pathspec code scale better.
Yeah, agreed. Still, it exposed an easy-to-avoid downside in this patch,
so I limited the early match to a single pathspec in v2.
With 10,000 clean files, hyperfine measured 112.5 ms ± 6.6 ms for the
parent and 494.1 ms ± 17.2 ms for v1. With the restriction, the patched
version took 104.9 ms ± 2.2 ms against 110.1 ms ± 4.1 ms for the parent.
Thanks for pointing it out!
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread, other threads:[~2026-06-09 0:14 UTC | newest]
Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-06-07 15:40 [PATCH] ls-files: filter pathspec before lstat Tamir Duberstein
2026-06-07 16:02 ` Kristoffer Haugsbakk
2026-06-07 16:07 ` Tamir Duberstein
2026-06-07 16:17 ` Kristoffer Haugsbakk
2026-06-07 18:24 ` Tamir Duberstein
2026-06-08 13:06 ` Junio C Hamano
2026-06-08 19:15 ` Tamir Duberstein
2026-06-08 23:03 ` Jeff King
2026-06-08 23:25 ` Jeff King
2026-06-09 0:13 ` Tamir Duberstein
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.