[PATCH 0/5] sparse-index: improve clear_skip_worktree_from_present

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/5] sparse-index: improve clear_skip_worktree_from_present_files()
@ 2024-06-20 16:11 Derrick Stolee via GitGitGadget
  2024-06-20 16:11 ` [PATCH 1/5] sparse-index: refactor skip worktree retry logic Derrick Stolee via GitGitGadget
                   ` (6 more replies)
  0 siblings, 7 replies; 44+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-06-20 16:11 UTC (permalink / raw)
  To: git; +Cc: gitster, newren, anh, Derrick Stolee

While doing some investigation in a private monorepo with sparse-checkout
and a sparse index, I accidentally left a modified file outside of my
sparse-checkout cone. This caused my Git commands to slow to a crawl, so I
reran with GIT_TRACE2_PERF=1.

While I was able to identify clear_skip_worktree_from_present_files() as the
culprit, it took longer than desired to figure out what was going on. This
series intends to both fix the performance issue (as much as possible) and
do some refactoring to make it easier to understand what is happening.

In the end, I was able to reduce the number of lstat() calls in my case from
over 170,000 to about 6,500, improving the time from 2.5s to 71ms on a warm
disk cache.   Thanks, Stolee

Derrick Stolee (5):
  sparse-index: refactor skip worktree retry logic
  sparse-index: refactor path_found()
  sparse-index: use strbuf in path_found()
  sparse-index: count lstat() calls
  sparse-index: improve lstat caching of sparse paths

 sparse-index.c | 220 +++++++++++++++++++++++++++++++++++++------------
 1 file changed, 167 insertions(+), 53 deletions(-)

base-commit: 66ac6e4bcd111be3fa9c2a6b3fafea718d00678d
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1754%2Fderrickstolee%2Fclear-skip-speed-v1
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1754/derrickstolee/clear-skip-speed-v1
Pull-Request: https://github.com/gitgitgadget/git/pull/1754
-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 1/5] sparse-index: refactor skip worktree retry logic
  2024-06-20 16:11 [PATCH 0/5] sparse-index: improve clear_skip_worktree_from_present_files() Derrick Stolee via GitGitGadget
@ 2024-06-20 16:11 ` Derrick Stolee via GitGitGadget
  2024-06-24 22:12   ` Elijah Newren
  2024-06-20 16:11 ` [PATCH 2/5] sparse-index: refactor path_found() Derrick Stolee via GitGitGadget
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 44+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-06-20 16:11 UTC (permalink / raw)
  To: git; +Cc: gitster, newren, anh, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

The clear_skip_worktree_from_present_files() method was introduced in
af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files present
in worktree, 2022-01-14) to help cases where the sparse index is enabled
but some paths outside of the sparse-checkout cone also exist on disk.
This operation can be slow as it needs to check path existence in a way
that is not stored in the collapsed index, so caching was introduced in
d79d299352 (Accelerate clear_skip_worktree_from_present_files() by
caching, 2022-01-14).

If users are having trouble with the performance of this operation and
don't care about paths outside of the sparse-checkout cone, they can
disable them using the sparse.expectFilesOutsideOfPatterns config option
introduced in ecc7c8841d (repo_read_index: add config to expect files
outside sparse patterns, 2022-02-25).

Even with that caching, it was noticed that this could take a long time
to execute. 89aaab11a3 (index: add trace2 region for clear skip
worktree, 2022-11-03) introduced trace2 regions to measure this time.
Further, the way the loop repeats itself was slightly confusing and
prone to breakage, so a BUG() statement was added in 8c7abdc596 (index:
raise a bug if the index is materialised more than once, 2022-11-03) to
be sure that the second run of the loop does not hit any sparse trees.

One thing that can be confusing about the current setup is that the
trace2 regions nest and it is not clear that a second loop is running
after the index is expanded. Here is an example of what the regions look
like in a typical case:

| region_enter | ... | label:clear_skip_worktree_from_present_files
| region_enter | ... | ..label:update
| region_leave | ... | ..label:update
| region_enter | ... | ..label:ensure_full_index
| region_enter | ... | ....label:update
| region_leave | ... | ....label:update
| region_leave | ... | ..label:ensure_full_index
| data         | ... | ..sparse_path_count:1
| data         | ... | ..sparse_path_count_full:269538
| region_leave | ... | label:clear_skip_worktree_from_present_files

One thing that is particularly difficult to understand about these
regions is that most of the time is spent between the close of the
ensure_full_index region and the reporting of the end data. This is
because of the restart of the loop being within the same region as the
first iteration of the loop.

This change refactors the method into two separate methods that are
traced separately. This will be more important later when we change
other features of the methods, but for now the only functional change is
the difference in the structure of the trace regions.

After this change, the same telemetry section is split into three
distinct chunks:

| region_enter | ... | label:clear_skip_worktree_from_present_files_sparse
| data         | ... | ..sparse_path_count:1
| region_leave | ... | label:clear_skip_worktree_from_present_files_sparse
| region_enter | ... | label:update
| region_leave | ... | label:update
| region_enter | ... | label:ensure_full_index
| region_enter | ... | ..label:update
| region_leave | ... | ..label:update
| region_leave | ... | label:ensure_full_index
| region_enter | ... | label:clear_skip_worktree_from_present_files_full
| data         | ... | ..full_path_count:269538
| region_leave | ... | label:clear_skip_worktree_from_present_files_full

Here, we see the sparse loop terminating early with its first sparse
path being a sparse directory containing a file. Then, that loop's
region terminates before ensure_full_index begins (in this case, the
cache-tree must also be computed). Then, _after_ the index is expanded,
the full loop begins with its own region.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 sparse-index.c | 77 ++++++++++++++++++++++++++++++++++----------------
 1 file changed, 53 insertions(+), 24 deletions(-)

diff --git a/sparse-index.c b/sparse-index.c
index e48e40cae71..e0457c87fff 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -486,49 +486,78 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len,
 	return 0;
 }

-void clear_skip_worktree_from_present_files(struct index_state *istate)
+static int clear_skip_worktree_from_present_files_sparse(struct index_state *istate)
 {
 	const char *last_dirname = NULL;
 	size_t dir_len = 0;
 	int dir_found = 1;

-	int i;
-	int path_count[2] = {0, 0};
-	int restarted = 0;
+	int path_count = 0;
+	int to_restart = 0;

-	if (!core_apply_sparse_checkout ||
-	    sparse_expect_files_outside_of_patterns)
-		return;
-
-	trace2_region_enter("index", "clear_skip_worktree_from_present_files",
+	trace2_region_enter("index", "clear_skip_worktree_from_present_files_sparse",
 			    istate->repo);
-restart:
-	for (i = 0; i < istate->cache_nr; i++) {
+	for (int i = 0; i < istate->cache_nr; i++) {
 		struct cache_entry *ce = istate->cache[i];

 		if (ce_skip_worktree(ce)) {
-			path_count[restarted]++;
+			path_count++;
 			if (path_found(ce->name, &last_dirname, &dir_len, &dir_found)) {
 				if (S_ISSPARSEDIR(ce->ce_mode)) {
-					if (restarted)
-						BUG("ensure-full-index did not fully flatten?");
-					ensure_full_index(istate);
-					restarted = 1;
-					goto restart;
+					to_restart = 1;
+					break;
 				}
 				ce->ce_flags &= ~CE_SKIP_WORKTREE;
 			}
 		}
 	}

-	if (path_count[0])
-		trace2_data_intmax("index", istate->repo,
-				   "sparse_path_count", path_count[0]);
-	if (restarted)
-		trace2_data_intmax("index", istate->repo,
-				   "sparse_path_count_full", path_count[1]);
-	trace2_region_leave("index", "clear_skip_worktree_from_present_files",
+	trace2_data_intmax("index", istate->repo,
+			   "sparse_path_count", path_count);
+	trace2_region_leave("index", "clear_skip_worktree_from_present_files_sparse",
+			    istate->repo);
+	return to_restart;
+}
+
+static void clear_skip_worktree_from_present_files_full(struct index_state *istate)
+{
+	const char *last_dirname = NULL;
+	size_t dir_len = 0;
+	int dir_found = 1;
+
+	int path_count = 0;
+
+	trace2_region_enter("index", "clear_skip_worktree_from_present_files_full",
 			    istate->repo);
+	for (int i = 0; i < istate->cache_nr; i++) {
+		struct cache_entry *ce = istate->cache[i];
+
+		if (S_ISSPARSEDIR(ce->ce_mode))
+			BUG("ensure-full-index did not fully flatten?");
+
+		if (ce_skip_worktree(ce)) {
+			path_count++;
+			if (path_found(ce->name, &last_dirname, &dir_len, &dir_found))
+				ce->ce_flags &= ~CE_SKIP_WORKTREE;
+		}
+	}
+
+	trace2_data_intmax("index", istate->repo,
+			   "full_path_count", path_count);
+	trace2_region_leave("index", "clear_skip_worktree_from_present_files_full",
+			    istate->repo);
+}
+
+void clear_skip_worktree_from_present_files(struct index_state *istate)
+{
+	if (!core_apply_sparse_checkout ||
+	    sparse_expect_files_outside_of_patterns)
+		return;
+
+	if (clear_skip_worktree_from_present_files_sparse(istate)) {
+		ensure_full_index(istate);
+		clear_skip_worktree_from_present_files_full(istate);
+	}
 }

 /*
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/5] sparse-index: refactor skip worktree retry logic
  2024-06-20 16:11 ` [PATCH 1/5] sparse-index: refactor skip worktree retry logic Derrick Stolee via GitGitGadget
@ 2024-06-24 22:12   ` Elijah Newren
  2024-06-26 12:42     ` Derrick Stolee
  0 siblings, 1 reply; 44+ messages in thread
From: Elijah Newren @ 2024-06-24 22:12 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, gitster, anh, Derrick Stolee

On Thu, Jun 20, 2024 at 10:11 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <stolee@gmail.com>
>
> The clear_skip_worktree_from_present_files() method was introduced in
> af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files present
> in worktree, 2022-01-14) to help cases where the sparse index is enabled

s/index/checkout/; the code in af6a51875a made no assumptions about
sparse index being enabled.

> but some paths outside of the sparse-checkout cone also exist on disk.

s/cone//; it wasn't specific to cone mode.

> This operation can be slow as it needs to check path existence in a way
> that is not stored in the collapsed index, so caching was introduced in
> d79d299352 (Accelerate clear_skip_worktree_from_present_files() by
> caching, 2022-01-14).

This is worded in a way that seems to imply the code is only relevant
with a sparse index.  Part of the code was written to be faster in
general with a sparse-index, but having a sparse-index was not
assumed.  Perhaps,

This operation could be slow since it needs to check path existence
for the paths outside the sparse checkout, which may vastly outnumber
the number of paths inside the sparse checkout.  To address this,
caching was introduced in d79d299352 (Accelerate
clear_skip_worktree_from_present_files() by caching, 2022-01-14).

> If users are having trouble with the performance of this operation and
> don't care about paths outside of the sparse-checkout cone, they can
> disable them using the sparse.expectFilesOutsideOfPatterns config option
> introduced in ecc7c8841d (repo_read_index: add config to expect files
> outside sparse patterns, 2022-02-25).
>
> Even with that caching, it was noticed that this could take a long time
> to execute. 89aaab11a3 (index: add trace2 region for clear skip
> worktree, 2022-11-03) introduced trace2 regions to measure this time.
> Further, the way the loop repeats itself was slightly confusing and
> prone to breakage, so a BUG() statement was added in 8c7abdc596 (index:
> raise a bug if the index is materialised more than once, 2022-11-03) to
> be sure that the second run of the loop does not hit any sparse trees.
>
> One thing that can be confusing about the current setup is that the
> trace2 regions nest and it is not clear that a second loop is running
> after the index is expanded. Here is an example of what the regions look
> like in a typical case:
>
> | region_enter | ... | label:clear_skip_worktree_from_present_files
> | region_enter | ... | ..label:update
> | region_leave | ... | ..label:update
> | region_enter | ... | ..label:ensure_full_index
> | region_enter | ... | ....label:update
> | region_leave | ... | ....label:update
> | region_leave | ... | ..label:ensure_full_index
> | data         | ... | ..sparse_path_count:1
> | data         | ... | ..sparse_path_count_full:269538
> | region_leave | ... | label:clear_skip_worktree_from_present_files
>
> One thing that is particularly difficult to understand about these
> regions is that most of the time is spent between the close of the
> ensure_full_index region and the reporting of the end data. This is
> because of the restart of the loop being within the same region as the
> first iteration of the loop.
>
> This change refactors the method into two separate methods that are
> traced separately. This will be more important later when we change
> other features of the methods, but for now the only functional change is
> the difference in the structure of the trace regions.
>
> After this change, the same telemetry section is split into three
> distinct chunks:
>
> | region_enter | ... | label:clear_skip_worktree_from_present_files_sparse
> | data         | ... | ..sparse_path_count:1
> | region_leave | ... | label:clear_skip_worktree_from_present_files_sparse
> | region_enter | ... | label:update
> | region_leave | ... | label:update
> | region_enter | ... | label:ensure_full_index
> | region_enter | ... | ..label:update
> | region_leave | ... | ..label:update
> | region_leave | ... | label:ensure_full_index
> | region_enter | ... | label:clear_skip_worktree_from_present_files_full
> | data         | ... | ..full_path_count:269538
> | region_leave | ... | label:clear_skip_worktree_from_present_files_full
>
> Here, we see the sparse loop terminating early with its first sparse
> path being a sparse directory containing a file. Then, that loop's
> region terminates before ensure_full_index begins (in this case, the
> cache-tree must also be computed). Then, _after_ the index is expanded,
> the full loop begins with its own region.
>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
>  sparse-index.c | 77 ++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 53 insertions(+), 24 deletions(-)
>
> diff --git a/sparse-index.c b/sparse-index.c
> index e48e40cae71..e0457c87fff 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -486,49 +486,78 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len,
>         return 0;
>  }
>
> -void clear_skip_worktree_from_present_files(struct index_state *istate)
> +static int clear_skip_worktree_from_present_files_sparse(struct index_state *istate)
>  {
>         const char *last_dirname = NULL;
>         size_t dir_len = 0;
>         int dir_found = 1;
>
> -       int i;
> -       int path_count[2] = {0, 0};
> -       int restarted = 0;
> +       int path_count = 0;
> +       int to_restart = 0;
>
> -       if (!core_apply_sparse_checkout ||
> -           sparse_expect_files_outside_of_patterns)
> -               return;
> -
> -       trace2_region_enter("index", "clear_skip_worktree_from_present_files",
> +       trace2_region_enter("index", "clear_skip_worktree_from_present_files_sparse",
>                             istate->repo);
> -restart:
> -       for (i = 0; i < istate->cache_nr; i++) {
> +       for (int i = 0; i < istate->cache_nr; i++) {
>                 struct cache_entry *ce = istate->cache[i];
>
>                 if (ce_skip_worktree(ce)) {
> -                       path_count[restarted]++;
> +                       path_count++;
>                         if (path_found(ce->name, &last_dirname, &dir_len, &dir_found)) {
>                                 if (S_ISSPARSEDIR(ce->ce_mode)) {
> -                                       if (restarted)
> -                                               BUG("ensure-full-index did not fully flatten?");
> -                                       ensure_full_index(istate);
> -                                       restarted = 1;
> -                                       goto restart;
> +                                       to_restart = 1;
> +                                       break;
>                                 }
>                                 ce->ce_flags &= ~CE_SKIP_WORKTREE;
>                         }
>                 }
>         }
>
> -       if (path_count[0])
> -               trace2_data_intmax("index", istate->repo,
> -                                  "sparse_path_count", path_count[0]);
> -       if (restarted)
> -               trace2_data_intmax("index", istate->repo,
> -                                  "sparse_path_count_full", path_count[1]);
> -       trace2_region_leave("index", "clear_skip_worktree_from_present_files",
> +       trace2_data_intmax("index", istate->repo,
> +                          "sparse_path_count", path_count);
> +       trace2_region_leave("index", "clear_skip_worktree_from_present_files_sparse",
> +                           istate->repo);
> +       return to_restart;
> +}
> +
> +static void clear_skip_worktree_from_present_files_full(struct index_state *istate)
> +{
> +       const char *last_dirname = NULL;
> +       size_t dir_len = 0;
> +       int dir_found = 1;
> +
> +       int path_count = 0;
> +
> +       trace2_region_enter("index", "clear_skip_worktree_from_present_files_full",
>                             istate->repo);
> +       for (int i = 0; i < istate->cache_nr; i++) {
> +               struct cache_entry *ce = istate->cache[i];
> +
> +               if (S_ISSPARSEDIR(ce->ce_mode))
> +                       BUG("ensure-full-index did not fully flatten?");
> +
> +               if (ce_skip_worktree(ce)) {
> +                       path_count++;
> +                       if (path_found(ce->name, &last_dirname, &dir_len, &dir_found))
> +                               ce->ce_flags &= ~CE_SKIP_WORKTREE;
> +               }
> +       }
> +
> +       trace2_data_intmax("index", istate->repo,
> +                          "full_path_count", path_count);
> +       trace2_region_leave("index", "clear_skip_worktree_from_present_files_full",
> +                           istate->repo);
> +}
> +
> +void clear_skip_worktree_from_present_files(struct index_state *istate)
> +{
> +       if (!core_apply_sparse_checkout ||
> +           sparse_expect_files_outside_of_patterns)
> +               return;
> +
> +       if (clear_skip_worktree_from_present_files_sparse(istate)) {
> +               ensure_full_index(istate);
> +               clear_skip_worktree_from_present_files_full(istate);
> +       }
>  }
>
>  /*
> --
> gitgitgadget

Although I had a few wording quibbles with the commit message, overall
the commit message was clear about what you were trying to achieve.
Also, the change in this patch is a straightforward splitting of the
old function into three new functions (one of which is the overall
driver calling the other two).

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 1/5] sparse-index: refactor skip worktree retry logic
  2024-06-24 22:12   ` Elijah Newren
@ 2024-06-26 12:42     ` Derrick Stolee
  0 siblings, 0 replies; 44+ messages in thread
From: Derrick Stolee @ 2024-06-26 12:42 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget; +Cc: git, gitster, anh

On 6/24/24 6:12 PM, Elijah Newren wrote:
> On Thu, Jun 20, 2024 at 10:11 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> The clear_skip_worktree_from_present_files() method was introduced in
>> af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files present
>> in worktree, 2022-01-14) to help cases where the sparse index is enabled
> 
> s/index/checkout/; the code in af6a51875a made no assumptions about
> sparse index being enabled.
> 
>> but some paths outside of the sparse-checkout cone also exist on disk.
> 
> s/cone//; it wasn't specific to cone mode.

Thanks for the context that this is a generic sparse-checkout feature,
and is not limited to the sparse index (which is limited to cone mode).
I made assumptions based on the code's location in sparse-index.c.

> Although I had a few wording quibbles with the commit message, overall
> the commit message was clear about what you were trying to achieve.
> Also, the change in this patch is a straightforward splitting of the
> old function into three new functions (one of which is the overall
> driver calling the other two).

I have reworded locally and will have a v2 later today with a better
message. Thanks for your attention to detail.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 2/5] sparse-index: refactor path_found()
  2024-06-20 16:11 [PATCH 0/5] sparse-index: improve clear_skip_worktree_from_present_files() Derrick Stolee via GitGitGadget
  2024-06-20 16:11 ` [PATCH 1/5] sparse-index: refactor skip worktree retry logic Derrick Stolee via GitGitGadget
@ 2024-06-20 16:11 ` Derrick Stolee via GitGitGadget
  2024-06-24 22:13   ` Elijah Newren
  2024-06-20 16:11 ` [PATCH 3/5] sparse-index: use strbuf in path_found() Derrick Stolee via GitGitGadget
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 44+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-06-20 16:11 UTC (permalink / raw)
  To: git; +Cc: gitster, newren, anh, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

In advance of changing the behavior of path_found(), take all of the
intermediate data values and group them into a single struct. This
simplifies the method prototype as well as the initialization. Future
changes can be made directly to the struct and method without changing
the callers with this approach.

Note that the clear_path_found_data() method is currently empty, as
there is nothing to free. However, this will change in the future, so
place the method and its callers for now.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 sparse-index.c | 45 +++++++++++++++++++++++++++++----------------
 1 file changed, 29 insertions(+), 16 deletions(-)

diff --git a/sparse-index.c b/sparse-index.c
index e0457c87fff..de6e727f5c1 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -439,8 +439,22 @@ void ensure_correct_sparsity(struct index_state *istate)
 		ensure_full_index(istate);
 }
 
-static int path_found(const char *path, const char **dirname, size_t *dir_len,
-		      int *dir_found)
+struct path_found_data {
+	const char *dirname;
+	size_t dir_len;
+	int dir_found;
+};
+
+#define PATH_FOUND_DATA_INIT { \
+	.dir_found = 1 \
+}
+
+static void clear_path_found_data(struct path_found_data *data)
+{
+	return;
+}
+
+static int path_found(const char *path, struct path_found_data *data)
 {
 	struct stat st;
 	char *newdir;
@@ -450,7 +464,7 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len,
 	 * If dirname corresponds to a directory that doesn't exist, and this
 	 * path starts with dirname, then path can't exist.
 	 */
-	if (!*dir_found && !memcmp(path, *dirname, *dir_len))
+	if (!data->dir_found && !memcmp(path, data->dirname, data->dir_len))
 		return 0;
 
 	/*
@@ -472,15 +486,16 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len,
 	 * If path starts with directory (which we already lstat'ed and found),
 	 * then no need to lstat parent directory again.
 	 */
-	if (*dir_found && *dirname && memcmp(path, *dirname, *dir_len))
+	if (data->dir_found && data->dirname &&
+	    memcmp(path, data->dirname, data->dir_len))
 		return 0;
 
 	/* Free previous dirname, and cache path's dirname */
-	*dirname = path;
-	*dir_len = newdir - path + 1;
+	data->dirname = path;
+	data->dir_len = newdir - path + 1;
 
-	tmp = xstrndup(path, *dir_len);
-	*dir_found = !lstat(tmp, &st);
+	tmp = xstrndup(path, data->dir_len);
+	data->dir_found = !lstat(tmp, &st);
 	free(tmp);
 
 	return 0;
@@ -488,9 +503,7 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len,
 
 static int clear_skip_worktree_from_present_files_sparse(struct index_state *istate)
 {
-	const char *last_dirname = NULL;
-	size_t dir_len = 0;
-	int dir_found = 1;
+	struct path_found_data data = PATH_FOUND_DATA_INIT;
 
 	int path_count = 0;
 	int to_restart = 0;
@@ -502,7 +515,7 @@ static int clear_skip_worktree_from_present_files_sparse(struct index_state *ist
 
 		if (ce_skip_worktree(ce)) {
 			path_count++;
-			if (path_found(ce->name, &last_dirname, &dir_len, &dir_found)) {
+			if (path_found(ce->name, &data)) {
 				if (S_ISSPARSEDIR(ce->ce_mode)) {
 					to_restart = 1;
 					break;
@@ -516,14 +529,13 @@ static int clear_skip_worktree_from_present_files_sparse(struct index_state *ist
 			   "sparse_path_count", path_count);
 	trace2_region_leave("index", "clear_skip_worktree_from_present_files_sparse",
 			    istate->repo);
+	clear_path_found_data(&data);
 	return to_restart;
 }
 
 static void clear_skip_worktree_from_present_files_full(struct index_state *istate)
 {
-	const char *last_dirname = NULL;
-	size_t dir_len = 0;
-	int dir_found = 1;
+	struct path_found_data data = PATH_FOUND_DATA_INIT;
 
 	int path_count = 0;
 
@@ -537,7 +549,7 @@ static void clear_skip_worktree_from_present_files_full(struct index_state *ista
 
 		if (ce_skip_worktree(ce)) {
 			path_count++;
-			if (path_found(ce->name, &last_dirname, &dir_len, &dir_found))
+			if (path_found(ce->name, &data))
 				ce->ce_flags &= ~CE_SKIP_WORKTREE;
 		}
 	}
@@ -546,6 +558,7 @@ static void clear_skip_worktree_from_present_files_full(struct index_state *ista
 			   "full_path_count", path_count);
 	trace2_region_leave("index", "clear_skip_worktree_from_present_files_full",
 			    istate->repo);
+	clear_path_found_data(&data);
 }
 
 void clear_skip_worktree_from_present_files(struct index_state *istate)
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/5] sparse-index: refactor path_found()
  2024-06-20 16:11 ` [PATCH 2/5] sparse-index: refactor path_found() Derrick Stolee via GitGitGadget
@ 2024-06-24 22:13   ` Elijah Newren
  2024-06-26 12:43     ` Derrick Stolee
  0 siblings, 1 reply; 44+ messages in thread
From: Elijah Newren @ 2024-06-24 22:13 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, gitster, anh, Derrick Stolee

On Thu, Jun 20, 2024 at 10:11 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <stolee@gmail.com>
>
> In advance of changing the behavior of path_found(), take all of the
> intermediate data values and group them into a single struct. This
> simplifies the method prototype as well as the initialization. Future
> changes can be made directly to the struct and method without changing
> the callers with this approach.
>
> Note that the clear_path_found_data() method is currently empty, as
> there is nothing to free. However, this will change in the future, so
> place the method and its callers for now.

I can't parse the second half of the final sentence.  What was meant here?

> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
>  sparse-index.c | 45 +++++++++++++++++++++++++++++----------------
>  1 file changed, 29 insertions(+), 16 deletions(-)
>
> diff --git a/sparse-index.c b/sparse-index.c
> index e0457c87fff..de6e727f5c1 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -439,8 +439,22 @@ void ensure_correct_sparsity(struct index_state *istate)
>                 ensure_full_index(istate);
>  }
>
> -static int path_found(const char *path, const char **dirname, size_t *dir_len,
> -                     int *dir_found)
> +struct path_found_data {
> +       const char *dirname;
> +       size_t dir_len;
> +       int dir_found;
> +};
> +
> +#define PATH_FOUND_DATA_INIT { \
> +       .dir_found = 1 \
> +}
> +
> +static void clear_path_found_data(struct path_found_data *data)
> +{
> +       return;
> +}
> +
> +static int path_found(const char *path, struct path_found_data *data)
>  {
>         struct stat st;
>         char *newdir;
> @@ -450,7 +464,7 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len,
>          * If dirname corresponds to a directory that doesn't exist, and this
>          * path starts with dirname, then path can't exist.
>          */
> -       if (!*dir_found && !memcmp(path, *dirname, *dir_len))
> +       if (!data->dir_found && !memcmp(path, data->dirname, data->dir_len))
>                 return 0;
>
>         /*
> @@ -472,15 +486,16 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len,
>          * If path starts with directory (which we already lstat'ed and found),
>          * then no need to lstat parent directory again.
>          */
> -       if (*dir_found && *dirname && memcmp(path, *dirname, *dir_len))
> +       if (data->dir_found && data->dirname &&
> +           memcmp(path, data->dirname, data->dir_len))
>                 return 0;
>
>         /* Free previous dirname, and cache path's dirname */
> -       *dirname = path;
> -       *dir_len = newdir - path + 1;
> +       data->dirname = path;
> +       data->dir_len = newdir - path + 1;
>
> -       tmp = xstrndup(path, *dir_len);
> -       *dir_found = !lstat(tmp, &st);
> +       tmp = xstrndup(path, data->dir_len);
> +       data->dir_found = !lstat(tmp, &st);
>         free(tmp);
>
>         return 0;
> @@ -488,9 +503,7 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len,
>
>  static int clear_skip_worktree_from_present_files_sparse(struct index_state *istate)
>  {
> -       const char *last_dirname = NULL;
> -       size_t dir_len = 0;
> -       int dir_found = 1;
> +       struct path_found_data data = PATH_FOUND_DATA_INIT;
>
>         int path_count = 0;
>         int to_restart = 0;
> @@ -502,7 +515,7 @@ static int clear_skip_worktree_from_present_files_sparse(struct index_state *ist
>
>                 if (ce_skip_worktree(ce)) {
>                         path_count++;
> -                       if (path_found(ce->name, &last_dirname, &dir_len, &dir_found)) {
> +                       if (path_found(ce->name, &data)) {
>                                 if (S_ISSPARSEDIR(ce->ce_mode)) {
>                                         to_restart = 1;
>                                         break;
> @@ -516,14 +529,13 @@ static int clear_skip_worktree_from_present_files_sparse(struct index_state *ist
>                            "sparse_path_count", path_count);
>         trace2_region_leave("index", "clear_skip_worktree_from_present_files_sparse",
>                             istate->repo);
> +       clear_path_found_data(&data);
>         return to_restart;
>  }
>
>  static void clear_skip_worktree_from_present_files_full(struct index_state *istate)
>  {
> -       const char *last_dirname = NULL;
> -       size_t dir_len = 0;
> -       int dir_found = 1;
> +       struct path_found_data data = PATH_FOUND_DATA_INIT;
>
>         int path_count = 0;
>
> @@ -537,7 +549,7 @@ static void clear_skip_worktree_from_present_files_full(struct index_state *ista
>
>                 if (ce_skip_worktree(ce)) {
>                         path_count++;
> -                       if (path_found(ce->name, &last_dirname, &dir_len, &dir_found))
> +                       if (path_found(ce->name, &data))
>                                 ce->ce_flags &= ~CE_SKIP_WORKTREE;
>                 }
>         }
> @@ -546,6 +558,7 @@ static void clear_skip_worktree_from_present_files_full(struct index_state *ista
>                            "full_path_count", path_count);
>         trace2_region_leave("index", "clear_skip_worktree_from_present_files_full",
>                             istate->repo);
> +       clear_path_found_data(&data);
>  }
>
>  void clear_skip_worktree_from_present_files(struct index_state *istate)
> --
> gitgitgadget

It's surprising how much a simple, straightforward translation of the
code can improve its readability; I should have done it this way all
along.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 2/5] sparse-index: refactor path_found()
  2024-06-24 22:13   ` Elijah Newren
@ 2024-06-26 12:43     ` Derrick Stolee
  0 siblings, 0 replies; 44+ messages in thread
From: Derrick Stolee @ 2024-06-26 12:43 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget; +Cc: git, gitster, anh

On 6/24/24 6:13 PM, Elijah Newren wrote:
> On Thu, Jun 20, 2024 at 10:11 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> In advance of changing the behavior of path_found(), take all of the
>> intermediate data values and group them into a single struct. This
>> simplifies the method prototype as well as the initialization. Future
>> changes can be made directly to the struct and method without changing
>> the callers with this approach.
>>
>> Note that the clear_path_found_data() method is currently empty, as
>> there is nothing to free. However, this will change in the future, so
>> place the method and its callers for now.
> 
> I can't parse the second half of the final sentence.  What was meant here?

I was trying to point out how this method looks a bit silly:

>> +static void clear_path_found_data(struct path_found_data *data)
>> +{
>> +       return;
>> +}

It will be filled in with a meaningful implementation later on, but
I wanted to establish its callers here. I'll try to find a better
way to communicate this.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 3/5] sparse-index: use strbuf in path_found()
  2024-06-20 16:11 [PATCH 0/5] sparse-index: improve clear_skip_worktree_from_present_files() Derrick Stolee via GitGitGadget
  2024-06-20 16:11 ` [PATCH 1/5] sparse-index: refactor skip worktree retry logic Derrick Stolee via GitGitGadget
  2024-06-20 16:11 ` [PATCH 2/5] sparse-index: refactor path_found() Derrick Stolee via GitGitGadget
@ 2024-06-20 16:11 ` Derrick Stolee via GitGitGadget
  2024-06-24 22:13   ` Elijah Newren
  2024-06-20 16:11 ` [PATCH 4/5] sparse-index: count lstat() calls Derrick Stolee via GitGitGadget
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 44+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-06-20 16:11 UTC (permalink / raw)
  To: git; +Cc: gitster, newren, anh, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

The path_found() method previously reused strings from the cache entries
the calling methods were using. This prevents string manipulation in
place and causes some odd reallocation before the final lstat() call in
the method.

Refactor the method to use strbufs and copy the path into the strbuf,
but also only the parent directory and not the whole path. This looks
like extra copying when assigning the path to the strbuf, but we save an
allocation by dropping the 'tmp' string, and we are "reusing" the copy
from 'tmp' to put the data in the strbuf.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 sparse-index.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/sparse-index.c b/sparse-index.c
index de6e727f5c1..fec4f393360 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -440,31 +440,30 @@ void ensure_correct_sparsity(struct index_state *istate)
 }
 
 struct path_found_data {
-	const char *dirname;
-	size_t dir_len;
+	struct strbuf dir;
 	int dir_found;
 };
 
 #define PATH_FOUND_DATA_INIT { \
+	.dir = STRBUF_INIT, \
 	.dir_found = 1 \
 }
 
 static void clear_path_found_data(struct path_found_data *data)
 {
-	return;
+	strbuf_release(&data->dir);
 }
 
 static int path_found(const char *path, struct path_found_data *data)
 {
 	struct stat st;
 	char *newdir;
-	char *tmp;
 
 	/*
 	 * If dirname corresponds to a directory that doesn't exist, and this
 	 * path starts with dirname, then path can't exist.
 	 */
-	if (!data->dir_found && !memcmp(path, data->dirname, data->dir_len))
+	if (!data->dir_found && !memcmp(path, data->dir.buf, data->dir.len))
 		return 0;
 
 	/*
@@ -486,17 +485,15 @@ static int path_found(const char *path, struct path_found_data *data)
 	 * If path starts with directory (which we already lstat'ed and found),
 	 * then no need to lstat parent directory again.
 	 */
-	if (data->dir_found && data->dirname &&
-	    memcmp(path, data->dirname, data->dir_len))
+	if (data->dir_found && data->dir.buf &&
+	    memcmp(path, data->dir.buf, data->dir.len))
 		return 0;
 
 	/* Free previous dirname, and cache path's dirname */
-	data->dirname = path;
-	data->dir_len = newdir - path + 1;
+	strbuf_reset(&data->dir);
+	strbuf_add(&data->dir, path, newdir - path + 1);
 
-	tmp = xstrndup(path, data->dir_len);
-	data->dir_found = !lstat(tmp, &st);
-	free(tmp);
+	data->dir_found = !lstat(data->dir.buf, &st);
 
 	return 0;
 }
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 3/5] sparse-index: use strbuf in path_found()
  2024-06-20 16:11 ` [PATCH 3/5] sparse-index: use strbuf in path_found() Derrick Stolee via GitGitGadget
@ 2024-06-24 22:13   ` Elijah Newren
  0 siblings, 0 replies; 44+ messages in thread
From: Elijah Newren @ 2024-06-24 22:13 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, gitster, anh, Derrick Stolee

On Thu, Jun 20, 2024 at 10:11 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <stolee@gmail.com>
>
> The path_found() method previously reused strings from the cache entries
> the calling methods were using. This prevents string manipulation in
> place and causes some odd reallocation before the final lstat() call in
> the method.
>
> Refactor the method to use strbufs and copy the path into the strbuf,
> but also only the parent directory and not the whole path. This looks
> like extra copying when assigning the path to the strbuf, but we save an
> allocation by dropping the 'tmp' string, and we are "reusing" the copy
> from 'tmp' to put the data in the strbuf.
>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
>  sparse-index.c | 21 +++++++++------------
>  1 file changed, 9 insertions(+), 12 deletions(-)
>
> diff --git a/sparse-index.c b/sparse-index.c
> index de6e727f5c1..fec4f393360 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -440,31 +440,30 @@ void ensure_correct_sparsity(struct index_state *istate)
>  }
>
>  struct path_found_data {
> -       const char *dirname;
> -       size_t dir_len;
> +       struct strbuf dir;
>         int dir_found;
>  };
>
>  #define PATH_FOUND_DATA_INIT { \
> +       .dir = STRBUF_INIT, \
>         .dir_found = 1 \
>  }
>
>  static void clear_path_found_data(struct path_found_data *data)
>  {
> -       return;
> +       strbuf_release(&data->dir);
>  }
>
>  static int path_found(const char *path, struct path_found_data *data)
>  {
>         struct stat st;
>         char *newdir;
> -       char *tmp;
>
>         /*
>          * If dirname corresponds to a directory that doesn't exist, and this
>          * path starts with dirname, then path can't exist.
>          */
> -       if (!data->dir_found && !memcmp(path, data->dirname, data->dir_len))
> +       if (!data->dir_found && !memcmp(path, data->dir.buf, data->dir.len))
>                 return 0;
>
>         /*
> @@ -486,17 +485,15 @@ static int path_found(const char *path, struct path_found_data *data)
>          * If path starts with directory (which we already lstat'ed and found),
>          * then no need to lstat parent directory again.
>          */
> -       if (data->dir_found && data->dirname &&
> -           memcmp(path, data->dirname, data->dir_len))
> +       if (data->dir_found && data->dir.buf &&
> +           memcmp(path, data->dir.buf, data->dir.len))
>                 return 0;
>
>         /* Free previous dirname, and cache path's dirname */
> -       data->dirname = path;
> -       data->dir_len = newdir - path + 1;
> +       strbuf_reset(&data->dir);
> +       strbuf_add(&data->dir, path, newdir - path + 1);
>
> -       tmp = xstrndup(path, data->dir_len);
> -       data->dir_found = !lstat(tmp, &st);
> -       free(tmp);
> +       data->dir_found = !lstat(data->dir.buf, &st);
>
>         return 0;
>  }
> --
> gitgitgadget

Another simple translation; the series looks good so far...

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 4/5] sparse-index: count lstat() calls
  2024-06-20 16:11 [PATCH 0/5] sparse-index: improve clear_skip_worktree_from_present_files() Derrick Stolee via GitGitGadget
                   ` (2 preceding siblings ...)
  2024-06-20 16:11 ` [PATCH 3/5] sparse-index: use strbuf in path_found() Derrick Stolee via GitGitGadget
@ 2024-06-20 16:11 ` Derrick Stolee via GitGitGadget
  2024-06-24 22:13   ` Elijah Newren
  2024-06-20 16:11 ` [PATCH 5/5] sparse-index: improve lstat caching of sparse paths Derrick Stolee via GitGitGadget
                   ` (2 subsequent siblings)
  6 siblings, 1 reply; 44+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-06-20 16:11 UTC (permalink / raw)
  To: git; +Cc: gitster, newren, anh, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

The clear_skip_worktree.. methods already report some statistics about
how many cache entries are checked against path_found() due to having
the skip-worktree bit set. However, due to path_found() performing some
caching, this isn't the only information that would be helpful to
report.

Add a new lstat_count member to the path_found_data struct to count the
number of times path_found() calls lstat(). This will be helpful to help
explain performance problems in this method as well as to demonstrate
future changes to the caching algorithm in a more concrete way than
end-to-end timings.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 sparse-index.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/sparse-index.c b/sparse-index.c
index fec4f393360..8577fa726b8 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -442,6 +442,7 @@ void ensure_correct_sparsity(struct index_state *istate)
 struct path_found_data {
 	struct strbuf dir;
 	int dir_found;
+	size_t lstat_count;
 };
 
 #define PATH_FOUND_DATA_INIT { \
@@ -469,6 +470,7 @@ static int path_found(const char *path, struct path_found_data *data)
 	/*
 	 * If path itself exists, return 1.
 	 */
+	data->lstat_count++;
 	if (!lstat(path, &st))
 		return 1;
 
@@ -493,6 +495,7 @@ static int path_found(const char *path, struct path_found_data *data)
 	strbuf_reset(&data->dir);
 	strbuf_add(&data->dir, path, newdir - path + 1);
 
+	data->lstat_count++;
 	data->dir_found = !lstat(data->dir.buf, &st);
 
 	return 0;
@@ -524,6 +527,8 @@ static int clear_skip_worktree_from_present_files_sparse(struct index_state *ist
 
 	trace2_data_intmax("index", istate->repo,
 			   "sparse_path_count", path_count);
+	trace2_data_intmax("index", istate->repo,
+			   "sparse_lstat_count", data.lstat_count);
 	trace2_region_leave("index", "clear_skip_worktree_from_present_files_sparse",
 			    istate->repo);
 	clear_path_found_data(&data);
@@ -553,6 +558,8 @@ static void clear_skip_worktree_from_present_files_full(struct index_state *ista
 
 	trace2_data_intmax("index", istate->repo,
 			   "full_path_count", path_count);
+	trace2_data_intmax("index", istate->repo,
+			   "full_lstat_count", data.lstat_count);
 	trace2_region_leave("index", "clear_skip_worktree_from_present_files_full",
 			    istate->repo);
 	clear_path_found_data(&data);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 4/5] sparse-index: count lstat() calls
  2024-06-20 16:11 ` [PATCH 4/5] sparse-index: count lstat() calls Derrick Stolee via GitGitGadget
@ 2024-06-24 22:13   ` Elijah Newren
  0 siblings, 0 replies; 44+ messages in thread
From: Elijah Newren @ 2024-06-24 22:13 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, gitster, anh, Derrick Stolee

On Thu, Jun 20, 2024 at 10:12 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <stolee@gmail.com>
>
> The clear_skip_worktree.. methods already report some statistics about
> how many cache entries are checked against path_found() due to having
> the skip-worktree bit set. However, due to path_found() performing some
> caching, this isn't the only information that would be helpful to
> report.
>
> Add a new lstat_count member to the path_found_data struct to count the
> number of times path_found() calls lstat(). This will be helpful to help
> explain performance problems in this method as well as to demonstrate
> future changes to the caching algorithm in a more concrete way than
> end-to-end timings.
>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
>  sparse-index.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/sparse-index.c b/sparse-index.c
> index fec4f393360..8577fa726b8 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -442,6 +442,7 @@ void ensure_correct_sparsity(struct index_state *istate)
>  struct path_found_data {
>         struct strbuf dir;
>         int dir_found;
> +       size_t lstat_count;
>  };
>
>  #define PATH_FOUND_DATA_INIT { \
> @@ -469,6 +470,7 @@ static int path_found(const char *path, struct path_found_data *data)
>         /*
>          * If path itself exists, return 1.
>          */
> +       data->lstat_count++;
>         if (!lstat(path, &st))
>                 return 1;
>
> @@ -493,6 +495,7 @@ static int path_found(const char *path, struct path_found_data *data)
>         strbuf_reset(&data->dir);
>         strbuf_add(&data->dir, path, newdir - path + 1);
>
> +       data->lstat_count++;
>         data->dir_found = !lstat(data->dir.buf, &st);
>
>         return 0;
> @@ -524,6 +527,8 @@ static int clear_skip_worktree_from_present_files_sparse(struct index_state *ist
>
>         trace2_data_intmax("index", istate->repo,
>                            "sparse_path_count", path_count);
> +       trace2_data_intmax("index", istate->repo,
> +                          "sparse_lstat_count", data.lstat_count);
>         trace2_region_leave("index", "clear_skip_worktree_from_present_files_sparse",
>                             istate->repo);
>         clear_path_found_data(&data);
> @@ -553,6 +558,8 @@ static void clear_skip_worktree_from_present_files_full(struct index_state *ista
>
>         trace2_data_intmax("index", istate->repo,
>                            "full_path_count", path_count);
> +       trace2_data_intmax("index", istate->repo,
> +                          "full_lstat_count", data.lstat_count);
>         trace2_region_leave("index", "clear_skip_worktree_from_present_files_full",
>                             istate->repo);
>         clear_path_found_data(&data);
> --
> gitgitgadget

Makes sense.  I'm getting really curious to see how this all fits into
the big improvements you made; I guess I'll see that in the next
patch...

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH 5/5] sparse-index: improve lstat caching of sparse paths
  2024-06-20 16:11 [PATCH 0/5] sparse-index: improve clear_skip_worktree_from_present_files() Derrick Stolee via GitGitGadget
                   ` (3 preceding siblings ...)
  2024-06-20 16:11 ` [PATCH 4/5] sparse-index: count lstat() calls Derrick Stolee via GitGitGadget
@ 2024-06-20 16:11 ` Derrick Stolee via GitGitGadget
  2024-06-24 22:14   ` Elijah Newren
  2024-06-20 19:16 ` [PATCH 0/5] sparse-index: improve clear_skip_worktree_from_present_files() Junio C Hamano
  2024-06-26 14:29 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  6 siblings, 1 reply; 44+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-06-20 16:11 UTC (permalink / raw)
  To: git; +Cc: gitster, newren, anh, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

The clear_skip_worktree_from_present_files() method was first introduced
in af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files
present in worktree, 2022-01-14) to allow better interaction with the
working directory in the presence of paths outside of the
sparse-checkout cone. The initial implementation would lstat() every
single sparse tree to see if it existed, and if one did, then the sparse
index would expand and every sparse file would be checked.

Since these lstat() calls were very expensive, this was improved in
d79d299352 (Accelerate clear_skip_worktree_from_present_files() by
caching, 2022-01-14) by caching directories that do not exist. However,
there are some inefficiencies in that caching mechanism.

The caching mechanism stored only the parent directory as not existing,
even if a higher parent directory also does not exist. This means that
wasted lstat() calls would occur when the sparse files change immediate
parent directories but within the same root directory that does not
exist.

To set up a scenario that triggers this code in an interesting way, we
need a sparse-checkout in cone mode and a sparse index. To trigger the
full index expansion and a call to the
clear_skip_worktree_from_present_files_full() method, we need one of the
sparse trees to actually exist on disk. The performance test script
p2000-sparse-operations.sh takes the sample repository and copies its
HEAD to several copies nested in directories of the form f<i>/f<j>/f<k>
where i, j, and k are numbers from 1 to 4. The sparse-checkout cone is
then selected as "f2/f4/". Creating "f1/f1/" will trigger the behavior
and also lead to some interesting cases for the caching algorithm since
"f1/f1/" exists but "f1/f2/" and "f3/" do not.

This is difficult to notice when running performance tests using the Git
repository (or a blow-up of the Git repository, as in
p2000-sparse-operations.sh) because Git has a very shallow directory
structure.

This change reorganizes the caching algorithm to focus on storing both
the deepest _existing_ directory and the next-level non-existing
directory. By doing a little extra work on the first sparse file, we can
short-circuit all of the sparse files that exist in that non-existing
directory. When in a repository where the first sparse file is likely to
have a much deeper path than the first non-existing directory, this can
realize significant gains.

The details of this algorithm require careful attention, so the new
implementation of path_found() has detailed comments, including the use
of a new max_common_dir_prefix() method that may be of independent
interest.

It's worth noting that this is not universally positive, since we are
doing extra lstat() calls to establish the exact path to cache. In the
blow-up of the Git repository, we can see that the lstat count
_increases_ from 28 to 31. However, these numbers were already
artificially low.

Using an internal monorepo with over two million paths at HEAD and a
typical sparse-checkout cone such that the index contains ~190,000
entries (including over two thousand sparse trees), I was able to
measure these lstat counts when one sparse directory actually exists on
disk:

  Sparse files in expanded index: 1,841,997
       full_lstat_count (before):   173,259
       full_lstat_count  (after):     6,521

This resulted in this absolute time change, on a warm disk:

      Time in full loop (before): 2.527 s
      Time in full loop  (after): 0.071 s

(These times were calculated on a Windows machine, where lstat() is
slower than a similar Linux machine.)

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 sparse-index.c | 118 ++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 93 insertions(+), 25 deletions(-)

diff --git a/sparse-index.c b/sparse-index.c
index 8577fa726b8..cccd8550dfe 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -440,14 +440,21 @@ void ensure_correct_sparsity(struct index_state *istate)
 }
 
 struct path_found_data {
+	/**
+	 * The path stored in 'dir', if non-empty, corresponds to the most-
+	 * recent path that we checked where:
+	 *
+	 *   1. The path should be a directory, according to the index.
+	 *   2. The path does not exist.
+	 *   3. The parent path _does_ exist. (This may be the root of the
+	 *      working directory.)
+	 */
 	struct strbuf dir;
-	int dir_found;
 	size_t lstat_count;
 };
 
 #define PATH_FOUND_DATA_INIT { \
-	.dir = STRBUF_INIT, \
-	.dir_found = 1 \
+	.dir = STRBUF_INIT \
 }
 
 static void clear_path_found_data(struct path_found_data *data)
@@ -455,50 +462,111 @@ static void clear_path_found_data(struct path_found_data *data)
 	strbuf_release(&data->dir);
 }
 
+/**
+ * Return the length of the largest common substring that ends in a
+ * slash ('/') to indicate the largest common parent directory. Returns
+ * zero if no common directory exists.
+ */
+static size_t max_common_dir_prefix(const char *path1, const char *path2)
+{
+	size_t common_prefix = 0;
+	for (size_t i = 0; path1[i] && path2[i]; i++) {
+		if (path1[i] != path2[i])
+			break;
+
+		/*
+		 * If they agree at a directory separator, then add one
+		 * to make sure it is included in the common prefix string.
+		 */
+		if (path1[i] == '/')
+			common_prefix = i + 1;
+	}
+
+	return common_prefix;
+}
+
 static int path_found(const char *path, struct path_found_data *data)
 {
 	struct stat st;
-	char *newdir;
+	size_t common_prefix;
 
 	/*
-	 * If dirname corresponds to a directory that doesn't exist, and this
-	 * path starts with dirname, then path can't exist.
+	 * If data->dir is non-empty, then it contains a path that doesn't
+	 * exist, including an ending slash ('/'). If it is a prefix of 'path',
+	 * then we can return 0.
 	 */
-	if (!data->dir_found && !memcmp(path, data->dir.buf, data->dir.len))
+	if (data->dir.len && !memcmp(path, data->dir.buf, data->dir.len))
 		return 0;
 
 	/*
-	 * If path itself exists, return 1.
+	 * Otherwise, we must check if the current path exists. If it does, then
+	 * return 1. The cached directory will be skipped until we come across
+	 * a missing path again.
 	 */
 	data->lstat_count++;
 	if (!lstat(path, &st))
 		return 1;
 
 	/*
-	 * Otherwise, path does not exist so we'll return 0...but we'll first
-	 * determine some info about its parent directory so we can avoid
-	 * lstat calls for future cache entries.
+	 * At this point, we know that 'path' doesn't exist, and we know that
+	 * the parent directory of 'data->dir' does exist. Let's set 'data->dir'
+	 * to be the top-most non-existing directory of 'path'. If the first
+	 * parent of 'path' exists, then we will act ast though 'path'
+	 * corresponds to a directory (by adding a slash).
 	 */
-	newdir = strrchr(path, '/');
-	if (!newdir)
-		return 0; /* Didn't find a parent dir; just return 0 now. */
+	common_prefix = max_common_dir_prefix(path, data->dir.buf);
 
 	/*
-	 * If path starts with directory (which we already lstat'ed and found),
-	 * then no need to lstat parent directory again.
+	 * At this point, 'path' and 'data->dir' have a common existing parent
+	 * directory given by path[0..common_prefix] (which could have length 0).
+	 * We "grow" the data->dir buffer by checking for existing directories
+	 * along 'path'.
 	 */
-	if (data->dir_found && data->dir.buf &&
-	    memcmp(path, data->dir.buf, data->dir.len))
-		return 0;
 
-	/* Free previous dirname, and cache path's dirname */
-	strbuf_reset(&data->dir);
-	strbuf_add(&data->dir, path, newdir - path + 1);
+	strbuf_setlen(&data->dir, common_prefix);
+	while (1) {
+		/* Find the next directory in 'path'. */
+		const char *next_slash = strchr(path + data->dir.len, '/');
 
-	data->lstat_count++;
-	data->dir_found = !lstat(data->dir.buf, &st);
+		/*
+		 * If there are no more slashes, then 'path' doesn't contain a
+		 * non-existent _parent_ directory. Set 'data->dir' to be equal
+		 * to 'path' plus an additional slash, so it can be used for
+		 * caching in the future. The filename of 'path' is considered
+		 * a non-existent directory.
+		 *
+		 * Note: if "{path}/" exists as a directory, then it will never
+		 * appear as a prefix of other callers to this method, assuming
+		 * the context from the clear_skip_worktree... methods. If this
+		 * method is reused, then this must be reconsidered.
+		 */
+		if (!next_slash) {
+			strbuf_addstr(&data->dir, path + data->dir.len);
+			strbuf_addch(&data->dir, '/');
+			break;
+		}
 
-	return 0;
+		/*
+		 * Now that we have a slash, let's grow 'data->dir' to include
+		 * this slash, then test if we should stop.
+		 */
+		strbuf_add(&data->dir, path + data->dir.len,
+			   (next_slash - path) - data->dir.len + 1);
+
+		/* If the path doesn't exist, then stop here. */
+		data->lstat_count++;
+		if (lstat(data->dir.buf, &st))
+			return 0;
+	}
+
+	/*
+	 * At this point, 'data->dir' is equal to 'path' plus a slash character,
+	 * and the parent directory of 'path' definitely exists. Let's return
+	 * the case of whether 'path' exists.
+	 */
+
+	data->lstat_count++;
+	return !lstat(path, &st);
 }
 
 static int clear_skip_worktree_from_present_files_sparse(struct index_state *istate)
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/5] sparse-index: improve lstat caching of sparse paths
  2024-06-20 16:11 ` [PATCH 5/5] sparse-index: improve lstat caching of sparse paths Derrick Stolee via GitGitGadget
@ 2024-06-24 22:14   ` Elijah Newren
  2024-06-25  0:08     ` Junio C Hamano
  2024-06-26 13:06     ` Derrick Stolee
  0 siblings, 2 replies; 44+ messages in thread
From: Elijah Newren @ 2024-06-24 22:14 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, gitster, anh, Derrick Stolee

On Thu, Jun 20, 2024 at 10:11 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <stolee@gmail.com>
>
> The clear_skip_worktree_from_present_files() method was first introduced
> in af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files
> present in worktree, 2022-01-14) to allow better interaction with the
> working directory in the presence of paths outside of the
> sparse-checkout cone.

s/cone//

> The initial implementation would lstat() every
> single sparse tree to see if it existed, and if one did, then the sparse
> index would expand and every sparse file would be checked.

This sounds like the algorithm only lstat()ed the sparse directories,
which feels misleading.  Perhaps

The initial implementation would lstat() every SKIP_WORKTREE path to
see if it existed; if it ran across a sparse directory that existed
(when a sparse-index was in use), then it would expand the index and
then check every SKIP_WORKTREE path.

> Since these lstat() calls were very expensive, this was improved in
> d79d299352 (Accelerate clear_skip_worktree_from_present_files() by
> caching, 2022-01-14) by caching directories that do not exist. However,
> there are some inefficiencies in that caching mechanism.

Maybe this is obvious, but I thought a few extra words to end the
second-to-last sentence would be helpful, such as "so it could avoid
lstat()ing any files under such directories".

> The caching mechanism stored only the parent directory as not existing,
> even if a higher parent directory also does not exist. This means that
> wasted lstat() calls would occur when the sparse files change immediate
> parent directories but within the same root directory that does not
> exist.

This is the crucial insight that makes this patch improve things so much.

> To set up a scenario that triggers this code in an interesting way, we
> need a sparse-checkout in cone mode and a sparse index. To trigger the

I think you state this too strongly.  While trying to duplicate, I
first went with a cone mode & sparse index at first, but out of
curiosity tried it without either of these modes set and still saw
dramatic improvement from your patch.  What is needed is that the
sparsity is such that entire directories are missing, and not just one
level above the files of interest.  That is more likely to occur when
cone mode and perhaps sparse index are in use, but perhaps consider
changing "we need" to "it is easiest to consider"

> full index expansion and a call to the
> clear_skip_worktree_from_present_files_full() method, we need one of the
> sparse trees to actually exist on disk. The performance test script
> p2000-sparse-operations.sh takes the sample repository and copies its
> HEAD to several copies nested in directories of the form f<i>/f<j>/f<k>
> where i, j, and k are numbers from 1 to 4. The sparse-checkout cone is
> then selected as "f2/f4/". Creating "f1/f1/" will trigger the behavior
> and also lead to some interesting cases for the caching algorithm since
> "f1/f1/" exists but "f1/f2/" and "f3/" do not.

For some reason I had difficulty triggering a case using this guide.
I might have made an error, but I decided I wanted a deeper directory
tree to test with anyway.  After some playing around to come up with
an interesting testcase, I eventually came up with the following steps
to reproduce in case anyone else wants to try:

    git clone https://github.com/newren/gvfs-like-git-bomb
    cd gvfs-like-git-bomb
    ./runme.sh
    git sparse-checkout set bomb/b/c            # incidentally cone mode
    mkdir -p bomb/d/e/f/a/a
    git ls-files -t | colrm 2 | sort | uniq -c  # optional, but interesting
    GIT_TRACE2_PERF=$(pwd)/trace git ls-files -t >/dev/null
    grep lstat_count trace

Further, you can recompile the git version in use in another window,
then come back to this one and run 'rm trace' followed by the last two
commands to retest.

The commands above create a 'gvfs-like-git-bomb' git directory that
has 1,000,001 files in HEAD.

With this test directory, before applying this patch, I see:
    ..sparse_lstat_count:722011
After applying this patch I see
    ..sparse_lstat_count:135
and with a slight tweak to your patch I see
    ..sparse_lstat_count:125
I'll comment on the slight tweak at the end of the patch.

> This is difficult to notice when running performance tests using the Git
> repository (or a blow-up of the Git repository, as in
> p2000-sparse-operations.sh) because Git has a very shallow directory
> structure.
>
> This change reorganizes the caching algorithm to focus on storing both
> the deepest _existing_ directory and the next-level non-existing
> directory.

This was slightly hard to parse for me, and misled me into thinking
you were tracking two directories.  Maybe:

This change reorganizes the caching algorithm to focus on storing the
highest level leading directory that does not exist (i.e. we are
restricting to the leading directory whose parent directory does
exist).

> By doing a little extra work on the first sparse file, we can
> short-circuit all of the sparse files that exist in that non-existing
> directory.

Here you use "exist" as "tracked by git" in one case, and "appears in
the working tree" in another.  That's a problem, because the files in
question are tracked by git but do not appear in the working tree,
making it impossible for people to understand unless they guess the
correct definition for each use.  I think we want "exist" to just mean
"appears in the working tree" here, so we'd need to s/sparse files
that exist in/sparse files underneath/ (or something similar) to fix
this sentence.

Also, you've used the phrase "sparse file(s)" a number of times in
this commit message; I think I know what you mean, but it is not
defined in the vocabulary section of
Documentation/technical/sparse-checkout.txt.  Together with the above
problem, it made me question what was meant, re-read all the
definitions, etc.  Perhaps "sparse file(s)" should be added to that
vocabulary section, though...especially if we are going to use it and
since we never fixed "sparse directory" despite mentioning that we
wanted to?

> When in a repository where the first sparse file is likely to
> have a much deeper path than the first non-existing directory, this can
> realize significant gains.
>
> The details of this algorithm require careful attention, so the new
> implementation of path_found() has detailed comments, including the use
> of a new max_common_dir_prefix() method that may be of independent
> interest.

These comments are well written and very helpful; thanks for including them.

> It's worth noting that this is not universally positive, since we are
> doing extra lstat() calls to establish the exact path to cache. In the
> blow-up of the Git repository, we can see that the lstat count
> _increases_ from 28 to 31. However, these numbers were already
> artificially low.

Yeah, I spent a while thinking about whether there were funny cases
where your algorithm might significantly increase the number of
lstat() calls for non-sparse-index, non-cone mode cases and came up
kind of empty handed.  It seems you only ever add a modest number of
lstat() calls, and in common scenarios (entire non-leaf directories
missing) you remove a dramatic number of lstat() calls.

> Using an internal monorepo with over two million paths at HEAD and a
> typical sparse-checkout cone such that the index contains ~190,000
> entries (including over two thousand sparse trees), I was able to
> measure these lstat counts when one sparse directory actually exists on
> disk:
>
>   Sparse files in expanded index: 1,841,997
>        full_lstat_count (before):   173,259
>        full_lstat_count  (after):     6,521
>
> This resulted in this absolute time change, on a warm disk:
>
>       Time in full loop (before): 2.527 s
>       Time in full loop  (after): 0.071 s

Very nice.  :-)

> (These times were calculated on a Windows machine, where lstat() is
> slower than a similar Linux machine.)
>
> Signed-off-by: Derrick Stolee <stolee@gmail.com>
> ---
>  sparse-index.c | 118 ++++++++++++++++++++++++++++++++++++++-----------
>  1 file changed, 93 insertions(+), 25 deletions(-)
>
> diff --git a/sparse-index.c b/sparse-index.c
> index 8577fa726b8..cccd8550dfe 100644
> --- a/sparse-index.c
> +++ b/sparse-index.c
> @@ -440,14 +440,21 @@ void ensure_correct_sparsity(struct index_state *istate)
>  }
>
>  struct path_found_data {
> +       /**
> +        * The path stored in 'dir', if non-empty, corresponds to the most-
> +        * recent path that we checked where:
> +        *
> +        *   1. The path should be a directory, according to the index.
> +        *   2. The path does not exist.
> +        *   3. The parent path _does_ exist. (This may be the root of the
> +        *      working directory.)
> +        */
>         struct strbuf dir;
> -       int dir_found;
>         size_t lstat_count;
>  };
>
>  #define PATH_FOUND_DATA_INIT { \
> -       .dir = STRBUF_INIT, \
> -       .dir_found = 1 \
> +       .dir = STRBUF_INIT \
>  }
>
>  static void clear_path_found_data(struct path_found_data *data)
> @@ -455,50 +462,111 @@ static void clear_path_found_data(struct path_found_data *data)
>         strbuf_release(&data->dir);
>  }
>
> +/**
> + * Return the length of the largest common substring that ends in a
> + * slash ('/') to indicate the largest common parent directory. Returns
> + * zero if no common directory exists.
> + */
> +static size_t max_common_dir_prefix(const char *path1, const char *path2)
> +{
> +       size_t common_prefix = 0;
> +       for (size_t i = 0; path1[i] && path2[i]; i++) {
> +               if (path1[i] != path2[i])
> +                       break;
> +
> +               /*
> +                * If they agree at a directory separator, then add one
> +                * to make sure it is included in the common prefix string.
> +                */
> +               if (path1[i] == '/')
> +                       common_prefix = i + 1;
> +       }
> +
> +       return common_prefix;
> +}
> +
>  static int path_found(const char *path, struct path_found_data *data)
>  {
>         struct stat st;
> -       char *newdir;
> +       size_t common_prefix;
>
>         /*
> -        * If dirname corresponds to a directory that doesn't exist, and this
> -        * path starts with dirname, then path can't exist.
> +        * If data->dir is non-empty, then it contains a path that doesn't
> +        * exist, including an ending slash ('/'). If it is a prefix of 'path',
> +        * then we can return 0.
>          */
> -       if (!data->dir_found && !memcmp(path, data->dir.buf, data->dir.len))
> +       if (data->dir.len && !memcmp(path, data->dir.buf, data->dir.len))
>                 return 0;
>
>         /*
> -        * If path itself exists, return 1.
> +        * Otherwise, we must check if the current path exists. If it does, then
> +        * return 1. The cached directory will be skipped until we come across
> +        * a missing path again.
>          */
>         data->lstat_count++;
>         if (!lstat(path, &st))
>                 return 1;
>
>         /*
> -        * Otherwise, path does not exist so we'll return 0...but we'll first
> -        * determine some info about its parent directory so we can avoid
> -        * lstat calls for future cache entries.
> +        * At this point, we know that 'path' doesn't exist, and we know that
> +        * the parent directory of 'data->dir' does exist. Let's set 'data->dir'
> +        * to be the top-most non-existing directory of 'path'. If the first
> +        * parent of 'path' exists, then we will act ast though 'path'

s/ast/as/

> +        * corresponds to a directory (by adding a slash).
>          */
> -       newdir = strrchr(path, '/');
> -       if (!newdir)
> -               return 0; /* Didn't find a parent dir; just return 0 now. */
> +       common_prefix = max_common_dir_prefix(path, data->dir.buf);
>
>         /*
> -        * If path starts with directory (which we already lstat'ed and found),
> -        * then no need to lstat parent directory again.
> +        * At this point, 'path' and 'data->dir' have a common existing parent
> +        * directory given by path[0..common_prefix] (which could have length 0).
> +        * We "grow" the data->dir buffer by checking for existing directories
> +        * along 'path'.
>          */
> -       if (data->dir_found && data->dir.buf &&
> -           memcmp(path, data->dir.buf, data->dir.len))
> -               return 0;
>
> -       /* Free previous dirname, and cache path's dirname */
> -       strbuf_reset(&data->dir);
> -       strbuf_add(&data->dir, path, newdir - path + 1);
> +       strbuf_setlen(&data->dir, common_prefix);
> +       while (1) {
> +               /* Find the next directory in 'path'. */
> +               const char *next_slash = strchr(path + data->dir.len, '/');
>
> -       data->lstat_count++;
> -       data->dir_found = !lstat(data->dir.buf, &st);
> +               /*
> +                * If there are no more slashes, then 'path' doesn't contain a
> +                * non-existent _parent_ directory. Set 'data->dir' to be equal
> +                * to 'path' plus an additional slash, so it can be used for
> +                * caching in the future. The filename of 'path' is considered
> +                * a non-existent directory.
> +                *
> +                * Note: if "{path}/" exists as a directory, then it will never
> +                * appear as a prefix of other callers to this method, assuming
> +                * the context from the clear_skip_worktree... methods. If this
> +                * method is reused, then this must be reconsidered.
> +                */
> +               if (!next_slash) {
> +                       strbuf_addstr(&data->dir, path + data->dir.len);
> +                       strbuf_addch(&data->dir, '/');
> +                       break;
> +               }
>
> -       return 0;
> +               /*
> +                * Now that we have a slash, let's grow 'data->dir' to include
> +                * this slash, then test if we should stop.
> +                */
> +               strbuf_add(&data->dir, path + data->dir.len,
> +                          (next_slash - path) - data->dir.len + 1);

I had to re-read this multiple times and was confused by it.  I
eventually realized it was simple -- you use "path + data->dir.len"
3-4 times in this loop.  Could we reduce the cognitive overhead by
setting some variable to this value at the beginning within the loop
and then just use it?  It'd simplify this particular call to

    strbuf_add(&data->dir, rest, next_slash - rest + 1);

or substitute any other variable name for "rest" there.  Maybe it
shouldn't be a big deal, but the rest of the method was complex enough
that I just blew my local stack space at this point.  I think this
simple substitution would have made it easier for me.

> +
> +               /* If the path doesn't exist, then stop here. */
> +               data->lstat_count++;
> +               if (lstat(data->dir.buf, &st))
> +                       return 0;
> +       }
> +
> +       /*
> +        * At this point, 'data->dir' is equal to 'path' plus a slash character,
> +        * and the parent directory of 'path' definitely exists. Let's return
> +        * the case of whether 'path' exists.
> +        */

Can I suggest adding the following to this comment?
  " path does not exist at this point or we would have already
returned 1 above when we lstat()ed it before the above loop. "

> +
> +       data->lstat_count++;
> +       return !lstat(path, &st);

...and, as long as I didn't missing something with my above comment
suggestion, these two lines can be replaced by

    return 0;

Or did I miss something here?


Anyway, despite the many small comments I made, well done!  I think
the method is not only much more performant, but more readable than my
original.  With a few small tweaks, it should be good to merge.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/5] sparse-index: improve lstat caching of sparse paths
  2024-06-24 22:14   ` Elijah Newren
@ 2024-06-25  0:08     ` Junio C Hamano
  2024-06-26 13:06     ` Derrick Stolee
  1 sibling, 0 replies; 44+ messages in thread
From: Junio C Hamano @ 2024-06-25  0:08 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Derrick Stolee via GitGitGadget, git, anh, Derrick Stolee

Elijah Newren <newren@gmail.com> writes:

> ...  It'd simplify this particular call to
>
>     strbuf_add(&data->dir, rest, next_slash - rest + 1);
>
> or substitute any other variable name for "rest" there.  Maybe it
> shouldn't be a big deal, but ...

Good observation.  Naming is hard, but it often happens that a
temporary variable that is given an appropriate name improves the
readability of the code a lot.

> Anyway, despite the many small comments I made, well done!  I think
> the method is not only much more performant, but more readable than my
> original.  With a few small tweaks, it should be good to merge.

Yup, and your review was a delight to read.

Thanks, both.




^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/5] sparse-index: improve lstat caching of sparse paths
  2024-06-24 22:14   ` Elijah Newren
  2024-06-25  0:08     ` Junio C Hamano
@ 2024-06-26 13:06     ` Derrick Stolee
  2024-06-28  0:10       ` Elijah Newren
  1 sibling, 1 reply; 44+ messages in thread
From: Derrick Stolee @ 2024-06-26 13:06 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget; +Cc: git, gitster, anh

On 6/24/24 6:14 PM, Elijah Newren wrote:
> On Thu, Jun 20, 2024 at 10:11 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Derrick Stolee <stolee@gmail.com>
>>
>> The clear_skip_worktree_from_present_files() method was first introduced
>> in af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files
>> present in worktree, 2022-01-14) to allow better interaction with the
>> working directory in the presence of paths outside of the
>> sparse-checkout cone.
> 
> s/cone//
> 
>> The initial implementation would lstat() every
>> single sparse tree to see if it existed, and if one did, then the sparse
>> index would expand and every sparse file would be checked.
> 
> This sounds like the algorithm only lstat()ed the sparse directories,
> which feels misleading.  Perhaps
> 
> The initial implementation would lstat() every SKIP_WORKTREE path to
> see if it existed; if it ran across a sparse directory that existed
> (when a sparse-index was in use), then it would expand the index and
> then check every SKIP_WORKTREE path.
> 
>> Since these lstat() calls were very expensive, this was improved in
>> d79d299352 (Accelerate clear_skip_worktree_from_present_files() by
>> caching, 2022-01-14) by caching directories that do not exist. However,
>> there are some inefficiencies in that caching mechanism.
> 
> Maybe this is obvious, but I thought a few extra words to end the
> second-to-last sentence would be helpful, such as "so it could avoid
> lstat()ing any files under such directories".

Thanks. I will use both of these suggestions verbatim.

>> To set up a scenario that triggers this code in an interesting way, we
>> need a sparse-checkout in cone mode and a sparse index. To trigger the
> 
> I think you state this too strongly.  While trying to duplicate, I
> first went with a cone mode & sparse index at first, but out of
> curiosity tried it without either of these modes set and still saw
> dramatic improvement from your patch.  What is needed is that the
> sparsity is such that entire directories are missing, and not just one
> level above the files of interest.  That is more likely to occur when
> cone mode and perhaps sparse index are in use, but perhaps consider
> changing "we need" to "it is easiest to consider"

Absolutely. This stems from my earlier assumption that the sparse index
was required, but it is not. I have reworded locally.

>> full index expansion and a call to the
>> clear_skip_worktree_from_present_files_full() method, we need one of the
>> sparse trees to actually exist on disk. The performance test script
>> p2000-sparse-operations.sh takes the sample repository and copies its
>> HEAD to several copies nested in directories of the form f<i>/f<j>/f<k>
>> where i, j, and k are numbers from 1 to 4. The sparse-checkout cone is
>> then selected as "f2/f4/". Creating "f1/f1/" will trigger the behavior
>> and also lead to some interesting cases for the caching algorithm since
>> "f1/f1/" exists but "f1/f2/" and "f3/" do not.
> 
> For some reason I had difficulty triggering a case using this guide.
> I might have made an error, but I decided I wanted a deeper directory
> tree to test with anyway.  After some playing around to come up with
> an interesting testcase, I eventually came up with the following steps
> to reproduce in case anyone else wants to try:
> 
>      git clone https://github.com/newren/gvfs-like-git-bomb
>      cd gvfs-like-git-bomb
>      ./runme.sh
>      git sparse-checkout set bomb/b/c            # incidentally cone mode
>      mkdir -p bomb/d/e/f/a/a
>      git ls-files -t | colrm 2 | sort | uniq -c  # optional, but interesting
>      GIT_TRACE2_PERF=$(pwd)/trace git ls-files -t >/dev/null
>      grep lstat_count trace
> 
> Further, you can recompile the git version in use in another window,
> then come back to this one and run 'rm trace' followed by the last two
> commands to retest.
> 
> The commands above create a 'gvfs-like-git-bomb' git directory that
> has 1,000,001 files in HEAD.
> 
> With this test directory, before applying this patch, I see:
>      ..sparse_lstat_count:722011
> After applying this patch I see
>      ..sparse_lstat_count:135
> and with a slight tweak to your patch I see
>      ..sparse_lstat_count:125
> I'll comment on the slight tweak at the end of the patch.

Thanks for these numbers! Are you willing to keep that example repo
on GitHub so I can refer to it in the message?

>> This is difficult to notice when running performance tests using the Git
>> repository (or a blow-up of the Git repository, as in
>> p2000-sparse-operations.sh) because Git has a very shallow directory
>> structure.
>>
>> This change reorganizes the caching algorithm to focus on storing both
>> the deepest _existing_ directory and the next-level non-existing
>> directory.
> 
> This was slightly hard to parse for me, and misled me into thinking
> you were tracking two directories.  Maybe:
> 
> This change reorganizes the caching algorithm to focus on storing the
> highest level leading directory that does not exist (i.e. we are
> restricting to the leading directory whose parent directory does
> exist).

I have reworded in a slightly different way, but with the same meaning
as you provide here.

>> By doing a little extra work on the first sparse file, we can
>> short-circuit all of the sparse files that exist in that non-existing
>> directory.
> 
> Here you use "exist" as "tracked by git" in one case, and "appears in
> the working tree" in another.  That's a problem, because the files in
> question are tracked by git but do not appear in the working tree,
> making it impossible for people to understand unless they guess the
> correct definition for each use.  I think we want "exist" to just mean
> "appears in the working tree" here, so we'd need to s/sparse files
> that exist in/sparse files underneath/ (or something similar) to fix
> this sentence.
> 
> Also, you've used the phrase "sparse file(s)" a number of times in
> this commit message; I think I know what you mean, but it is not
> defined in the vocabulary section of
> Documentation/technical/sparse-checkout.txt.  Together with the above
> problem, it made me question what was meant, re-read all the
> definitions, etc.  Perhaps "sparse file(s)" should be added to that
> vocabulary section, though...especially if we are going to use it and
> since we never fixed "sparse directory" despite mentioning that we
> wanted to?

You're right. My use of "sparse file" or "sparse directory" was intended
to mean "a path from a cache entry with SKIP_WORKTREE that is either a
blob or tree" but that isn't necessary here.

Instead, I'll focus my attention on paths being passed to path_found(),
which removes the context of an index. The danger there is that the
caching assumes that we are moving through the paths in an order such
as in the index, but that's a natural type of ordering for paths.

>>          /*
>> -        * Otherwise, path does not exist so we'll return 0...but we'll first
>> -        * determine some info about its parent directory so we can avoid
>> -        * lstat calls for future cache entries.
>> +        * At this point, we know that 'path' doesn't exist, and we know that
>> +        * the parent directory of 'data->dir' does exist. Let's set 'data->dir'
>> +        * to be the top-most non-existing directory of 'path'. If the first
>> +        * parent of 'path' exists, then we will act ast though 'path'
> 
> s/ast/as/

Oops! Thanks.

>> +               /*
>> +                * Now that we have a slash, let's grow 'data->dir' to include
>> +                * this slash, then test if we should stop.
>> +                */
>> +               strbuf_add(&data->dir, path + data->dir.len,
>> +                          (next_slash - path) - data->dir.len + 1);
> 
> I had to re-read this multiple times and was confused by it.  I
> eventually realized it was simple -- you use "path + data->dir.len"
> 3-4 times in this loop.  Could we reduce the cognitive overhead by
> setting some variable to this value at the beginning within the loop
> and then just use it?  It'd simplify this particular call to
> 
>      strbuf_add(&data->dir, rest, next_slash - rest + 1);
> 
> or substitute any other variable name for "rest" there.  Maybe it
> shouldn't be a big deal, but the rest of the method was complex enough
> that I just blew my local stack space at this point.  I think this
> simple substitution would have made it easier for me.

Excellent. Thanks for recognizing this pattern and recommending a
simplification.

>> +       /*
>> +        * At this point, 'data->dir' is equal to 'path' plus a slash character,
>> +        * and the parent directory of 'path' definitely exists. Let's return
>> +        * the case of whether 'path' exists.
>> +        */
> 
> Can I suggest adding the following to this comment?
>    " path does not exist at this point or we would have already
> returned 1 above when we lstat()ed it before the above loop. "
> 
>> +
>> +       data->lstat_count++;
>> +       return !lstat(path, &st);
> 
> ...and, as long as I didn't missing something with my above comment
> suggestion, these two lines can be replaced by
> 
>      return 0;
> 
> Or did I miss something here?

No, you understood correctly. Thanks for the further simplification and
speed improvement.

> Anyway, despite the many small comments I made, well done!  I think
> the method is not only much more performant, but more readable than my
> original.  With a few small tweaks, it should be good to merge.

Thanks for your detailed review. I'll put these updates into a v2 and
send it later today. I need to retest in my monorepo example and add
your publicly-available example before v2 will be ready.

Thanks,
-Stolee


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 5/5] sparse-index: improve lstat caching of sparse paths
  2024-06-26 13:06     ` Derrick Stolee
@ 2024-06-28  0:10       ` Elijah Newren
  0 siblings, 0 replies; 44+ messages in thread
From: Elijah Newren @ 2024-06-28  0:10 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Derrick Stolee via GitGitGadget, git, gitster, anh

On Wed, Jun 26, 2024 at 6:06 AM Derrick Stolee <stolee@gmail.com> wrote:
>
> On 6/24/24 6:14 PM, Elijah Newren wrote:
[...]
> > Further, you can recompile the git version in use in another window,
> > then come back to this one and run 'rm trace' followed by the last two
> > commands to retest.
> >
> > The commands above create a 'gvfs-like-git-bomb' git directory that
> > has 1,000,001 files in HEAD.
> >
> > With this test directory, before applying this patch, I see:
> >      ..sparse_lstat_count:722011
> > After applying this patch I see
> >      ..sparse_lstat_count:135
> > and with a slight tweak to your patch I see
> >      ..sparse_lstat_count:125
> > I'll comment on the slight tweak at the end of the patch.
>
> Thanks for these numbers! Are you willing to keep that example repo
> on GitHub so I can refer to it in the message?

Sure, I've had it up for several years already without change (though
the anachronistic 'gvfs' in the repository name feels like it should
be updated...)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/5] sparse-index: improve clear_skip_worktree_from_present_files()
  2024-06-20 16:11 [PATCH 0/5] sparse-index: improve clear_skip_worktree_from_present_files() Derrick Stolee via GitGitGadget
                   ` (4 preceding siblings ...)
  2024-06-20 16:11 ` [PATCH 5/5] sparse-index: improve lstat caching of sparse paths Derrick Stolee via GitGitGadget
@ 2024-06-20 19:16 ` Junio C Hamano
  2024-06-20 20:21   ` Derrick Stolee
  2024-06-26 14:29 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  6 siblings, 1 reply; 44+ messages in thread
From: Junio C Hamano @ 2024-06-20 19:16 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, newren, anh, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> While doing some investigation in a private monorepo with sparse-checkout
> and a sparse index, I accidentally left a modified file outside of my
> sparse-checkout cone. This caused my Git commands to slow to a crawl, so I
> reran with GIT_TRACE2_PERF=1.
>
> While I was able to identify clear_skip_worktree_from_present_files() as the
> culprit, it took longer than desired to figure out what was going on. This
> series intends to both fix the performance issue (as much as possible) and
> do some refactoring to make it easier to understand what is happening.
>
> In the end, I was able to reduce the number of lstat() calls in my case from
> over 170,000 to about 6,500, improving the time from 2.5s to 71ms on a warm
> disk cache.   Thanks, Stolee

That's impressive but I cannot offhand tell how big 170k (or 6.5k
for that matter) is relative to the size of the tree.  How many
paths are there in the entire tree (i.e. "git ls-tree -r HEAD | wc
-l") vs the number of the in-cone paths in the working tree?  

If 6.5k is in the same ballpark as the latter, it would be really
good.

Thanks.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/5] sparse-index: improve clear_skip_worktree_from_present_files()
  2024-06-20 19:16 ` [PATCH 0/5] sparse-index: improve clear_skip_worktree_from_present_files() Junio C Hamano
@ 2024-06-20 20:21   ` Derrick Stolee
  2024-06-20 21:02     ` Junio C Hamano
  0 siblings, 1 reply; 44+ messages in thread
From: Derrick Stolee @ 2024-06-20 20:21 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget; +Cc: git, newren, anh

On 6/20/24 3:16 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
> 
>> While doing some investigation in a private monorepo with sparse-checkout
>> and a sparse index, I accidentally left a modified file outside of my
>> sparse-checkout cone. This caused my Git commands to slow to a crawl, so I
>> reran with GIT_TRACE2_PERF=1.
>>
>> While I was able to identify clear_skip_worktree_from_present_files() as the
>> culprit, it took longer than desired to figure out what was going on. This
>> series intends to both fix the performance issue (as much as possible) and
>> do some refactoring to make it easier to understand what is happening.
>>
>> In the end, I was able to reduce the number of lstat() calls in my case from
>> over 170,000 to about 6,500, improving the time from 2.5s to 71ms on a warm
>> disk cache.   Thanks, Stolee
> 
> That's impressive but I cannot offhand tell how big 170k (or 6.5k
> for that matter) is relative to the size of the tree.  How many
> paths are there in the entire tree (i.e. "git ls-tree -r HEAD | wc
> -l") vs the number of the in-cone paths in the working tree?
> 
> If 6.5k is in the same ballpark as the latter, it would be really
> good.

You're right, I didn't include the full context here. The repo has
about 2.1 million paths at HEAD, but most of them are sparse.

In Patch 5, I detail that there are 1,841,997 total sparse files in
the expanded index. Thus, the previous caching algorithm was already
doing decent work and calling lstat() 11x fewer times than the naive
implementation.

The new caching algorithm improves this to 6,521, which is a 282x
improvement over naive and and 26x improvement over the previous
caching algorithm.

But what you are really asking is how close this is to the optimal.
I didn't include that in Patch 5 details, but I was able to look at
my notes and see that the sparse_path_count data point was 1,962,
meaning there are that many sparse trees in the sparse index before
expanding. Thus, the 6,521 lstat() calls are 3.3x more than the
absolute minimum required.

Does that help answer the questions you had? I'm happy to provide
more information.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH 0/5] sparse-index: improve clear_skip_worktree_from_present_files()
  2024-06-20 20:21   ` Derrick Stolee
@ 2024-06-20 21:02     ` Junio C Hamano
  0 siblings, 0 replies; 44+ messages in thread
From: Junio C Hamano @ 2024-06-20 21:02 UTC (permalink / raw)
  To: Derrick Stolee; +Cc: Derrick Stolee via GitGitGadget, git, newren, anh

Derrick Stolee <stolee@gmail.com> writes:

> But what you are really asking is how close this is to the optimal.
> I didn't include that in Patch 5 details, but I was able to look at
> my notes and see that the sparse_path_count data point was 1,962,
> meaning there are that many sparse trees in the sparse index before
> expanding. Thus, the 6,521 lstat() calls are 3.3x more than the
> absolute minimum required.

Nice, and still impressive.  Thanks.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v2 0/5] sparse-index: improve clear_skip_worktree_from_present_files()
  2024-06-20 16:11 [PATCH 0/5] sparse-index: improve clear_skip_worktree_from_present_files() Derrick Stolee via GitGitGadget
                   ` (5 preceding siblings ...)
  2024-06-20 19:16 ` [PATCH 0/5] sparse-index: improve clear_skip_worktree_from_present_files() Junio C Hamano
@ 2024-06-26 14:29 ` Derrick Stolee via GitGitGadget
  2024-06-26 14:29   ` [PATCH v2 1/5] sparse-checkout: refactor skip worktree retry logic Derrick Stolee via GitGitGadget
                     ` (6 more replies)
  6 siblings, 7 replies; 44+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-06-26 14:29 UTC (permalink / raw)
  To: git; +Cc: gitster, newren, anh, Derrick Stolee

While doing some investigation in a private monorepo with sparse-checkout
and a sparse index, I accidentally left a modified file outside of my
sparse-checkout cone. This caused my Git commands to slow to a crawl, so I
reran with GIT_TRACE2_PERF=1.

While I was able to identify clear_skip_worktree_from_present_files() as the
culprit, it took longer than desired to figure out what was going on. This
series intends to both fix the performance issue (as much as possible) and
do some refactoring to make it easier to understand what is happening.

In the end, I was able to reduce the number of lstat() calls in my case from
over 1.1 million to about 4,400, improving the time from 13.4s to 81ms on a
warm disk cache. (These numbers are from a test after v2, which somehow hit
the old caching algorithm even worse than my test in v1.)


Updates in v2
=============

Thanks to Elijah for a thorough review, leading to valuable improvements.

 * I was mistaken that the sparse index was required for this logic to
   happen. This has changed several descriptions across the commit messages.
 * The final lstat() in path_found() was not needed, so is removed in v2.
   This saves even more time and lstat() calls, updating the stats.
 * Elijah created a particularly nasty example for testing, which I include
   in my final patch. He gets a "Helped-by" credit for this.
 * Several comments, variables, and other improvements based on Elijah's
   recommendations.

Thanks, Stolee

Derrick Stolee (5):
  sparse-checkout: refactor skip worktree retry logic
  sparse-index: refactor path_found()
  sparse-index: use strbuf in path_found()
  sparse-index: count lstat() calls
  sparse-index: improve lstat caching of sparse paths

 sparse-index.c | 216 +++++++++++++++++++++++++++++++++++++------------
 1 file changed, 164 insertions(+), 52 deletions(-)


base-commit: 66ac6e4bcd111be3fa9c2a6b3fafea718d00678d
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1754%2Fderrickstolee%2Fclear-skip-speed-v2
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1754/derrickstolee/clear-skip-speed-v2
Pull-Request: https://github.com/gitgitgadget/git/pull/1754

Range-diff vs v1:

 1:  ddd8a9a90ce ! 1:  93d0baed0b0 sparse-index: refactor skip worktree retry logic
     @@ Metadata
      Author: Derrick Stolee <dstolee@microsoft.com>
      
       ## Commit message ##
     -    sparse-index: refactor skip worktree retry logic
     +    sparse-checkout: refactor skip worktree retry logic
      
          The clear_skip_worktree_from_present_files() method was introduced in
          af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files present
     -    in worktree, 2022-01-14) to help cases where the sparse index is enabled
     -    but some paths outside of the sparse-checkout cone also exist on disk.
     -    This operation can be slow as it needs to check path existence in a way
     -    that is not stored in the collapsed index, so caching was introduced in
     -    d79d299352 (Accelerate clear_skip_worktree_from_present_files() by
     -    caching, 2022-01-14).
     +    in worktree, 2022-01-14) to help cases where sparse-checkout is enabled
     +    but some paths outside of the sparse-checkout also exist on disk.  This
     +    operation can be slow as it needs to check path existence in a way not
     +    stored in the index, so caching was introduced in d79d299352 (Accelerate
     +    clear_skip_worktree_from_present_files() by caching, 2022-01-14).
      
          If users are having trouble with the performance of this operation and
     -    don't care about paths outside of the sparse-checkout cone, they can
     -    disable them using the sparse.expectFilesOutsideOfPatterns config option
     +    don't care about paths outside of the sparse-checkout, they can disable
     +    them using the sparse.expectFilesOutsideOfPatterns config option
          introduced in ecc7c8841d (repo_read_index: add config to expect files
          outside sparse patterns, 2022-02-25).
      
     +    This check is particularly confusing in the presence of a sparse index,
     +    as a sparse tree entry corresponding to an existing directory must first
     +    be expanded to a full index before examining the paths within. This is
     +    currently implemented using a 'goto' and a boolean variable to ensure we
     +    restart only once.
     +
          Even with that caching, it was noticed that this could take a long time
          to execute. 89aaab11a3 (index: add trace2 region for clear skip
          worktree, 2022-11-03) introduced trace2 regions to measure this time.
     @@ Commit message
      
          One thing that can be confusing about the current setup is that the
          trace2 regions nest and it is not clear that a second loop is running
     -    after the index is expanded. Here is an example of what the regions look
     -    like in a typical case:
     +    after a sparse index is expanded. Here is an example of what the regions
     +    look like in a typical case:
      
          | region_enter | ... | label:clear_skip_worktree_from_present_files
          | region_enter | ... | ..label:update
 2:  7c3b545ee5e ! 2:  69c3beaabf7 sparse-index: refactor path_found()
     @@ Commit message
          the callers with this approach.
      
          Note that the clear_path_found_data() method is currently empty, as
     -    there is nothing to free. However, this will change in the future, so
     -    place the method and its callers for now.
     +    there is nothing to free. This method is a placeholder for future
     +    changes that require a non-trivial implementation. Its stub is created
     +    now so consumers could call it now and not change in future changes.
      
          Signed-off-by: Derrick Stolee <stolee@gmail.com>
      
 3:  217594ffb10 = 3:  0a82e6b4183 sparse-index: use strbuf in path_found()
 4:  88a3145e585 = 4:  9549f5b8062 sparse-index: count lstat() calls
 5:  2654fcb7142 ! 5:  0cb344ac14f sparse-index: improve lstat caching of sparse paths
     @@ Commit message
          in af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files
          present in worktree, 2022-01-14) to allow better interaction with the
          working directory in the presence of paths outside of the
     -    sparse-checkout cone. The initial implementation would lstat() every
     -    single sparse tree to see if it existed, and if one did, then the sparse
     -    index would expand and every sparse file would be checked.
     +    sparse-checkout. The initial implementation would lstat() every single
     +    SKIP_WORKTREE path to see if it existed; if it ran across a sparse
     +    directory that existed (when a sparse index was in use), then it would
     +    expand the index and then check every SKIP_WORKTREE path.
      
          Since these lstat() calls were very expensive, this was improved in
          d79d299352 (Accelerate clear_skip_worktree_from_present_files() by
     -    caching, 2022-01-14) by caching directories that do not exist. However,
     -    there are some inefficiencies in that caching mechanism.
     +    caching, 2022-01-14) by caching directories that do not exist so it
     +    could avoid lstat()ing any files under such directories. However, there
     +    are some inefficiencies in that caching mechanism.
      
          The caching mechanism stored only the parent directory as not existing,
          even if a higher parent directory also does not exist. This means that
     -    wasted lstat() calls would occur when the sparse files change immediate
     -    parent directories but within the same root directory that does not
     -    exist.
     +    wasted lstat() calls would occur when the paths passed to path_found()
     +    change immediate parent directories but within the same parent directory
     +    that does not exist.
     +
     +    To create an example repository that demonstrates this problem, it helps
     +    to have a directory outside of the sparse-checkout that contains many
     +    deep paths. In particular, the first paths (in lexicographic order)
     +    underneath the sparse directory should have deep directory structures,
     +    maximizing the difference between the old caching algorithm that looks
     +    to a single parent and the new caching algorithm that looks to the
     +    top-most missing directory.
      
     -    To set up a scenario that triggers this code in an interesting way, we
     -    need a sparse-checkout in cone mode and a sparse index. To trigger the
     -    full index expansion and a call to the
     -    clear_skip_worktree_from_present_files_full() method, we need one of the
     -    sparse trees to actually exist on disk. The performance test script
     -    p2000-sparse-operations.sh takes the sample repository and copies its
     -    HEAD to several copies nested in directories of the form f<i>/f<j>/f<k>
     -    where i, j, and k are numbers from 1 to 4. The sparse-checkout cone is
     -    then selected as "f2/f4/". Creating "f1/f1/" will trigger the behavior
     -    and also lead to some interesting cases for the caching algorithm since
     -    "f1/f1/" exists but "f1/f2/" and "f3/" do not.
     +    The performance test script p2000-sparse-operations.sh takes the sample
     +    repository and copies its HEAD to several copies nested in directories
     +    of the form f<i>/f<j>/f<k> where i, j, and k are numbers from 1 to 4.
     +    The sparse-checkout cone is then selected as "f2/f4/". Creating "f1/f1/"
     +    will trigger the behavior and also lead to some interesting cases for
     +    the caching algorithm since "f1/f1/" exists but "f1/f2/" and "f3/" do
     +    not.
      
          This is difficult to notice when running performance tests using the Git
          repository (or a blow-up of the Git repository, as in
          p2000-sparse-operations.sh) because Git has a very shallow directory
          structure.
      
     -    This change reorganizes the caching algorithm to focus on storing both
     -    the deepest _existing_ directory and the next-level non-existing
     -    directory. By doing a little extra work on the first sparse file, we can
     -    short-circuit all of the sparse files that exist in that non-existing
     -    directory. When in a repository where the first sparse file is likely to
     -    have a much deeper path than the first non-existing directory, this can
     -    realize significant gains.
     +    This change reorganizes the caching algorithm to focus on storing the
     +    highest level leading directory that does not exist; specifically this
     +    means that that directory's parent _does_ exist. By doing a little extra
     +    work on a path passed to path_found(), we can short-circuit all of the
     +    paths passed to path_found() afterwards that match a prefix with that
     +    non-existing directory. When in a repository where the first sparse file
     +    is likely to have a much deeper path than the first non-existing
     +    directory, this can realize significant gains.
      
          The details of this algorithm require careful attention, so the new
          implementation of path_found() has detailed comments, including the use
     @@ Commit message
          _increases_ from 28 to 31. However, these numbers were already
          artificially low.
      
     +    Contributor Elijah Newren created a publicly-available test repository
     +    that demonstrates the difference in these caching algorithms in the most
     +    extreme way. To test, follow these steps:
     +
     +      git clone --sparse https://github.com/newren/gvfs-like-git-bomb
     +      cd gvfs-like-git-bomb
     +      ./runme.sh                   # NOTE: check scripts before running!
     +
     +    At this point, assuming you do not have index.sparse=true set globally,
     +    the index has one million paths with the SKIP_WORKTREE bit and they will
     +    all be sent to path_found() in the sparse loop. You can measure this by
     +    running 'git status' with GIT_TRACE2_PERF=1:
     +
     +        Sparse files in the index: 1,000,000
     +      sparse_lstat_count (before):   200,000
     +       sparse_lstat_count (after):         2
     +
     +    And here are the performance numbers:
     +
     +      Benchmark 1: old
     +        Time (mean ± σ):     397.5 ms ±   4.1 ms
     +        Range (min … max):   391.2 ms … 404.8 ms    10 runs
     +
     +      Benchmark 2: new
     +        Time (mean ± σ):     252.7 ms ±   3.1 ms
     +        Range (min … max):   249.4 ms … 259.5 ms    11 runs
     +
     +      Summary
     +        'new' ran
     +          1.57 ± 0.02 times faster than 'old'
     +
     +    By modifying this example further, we can demonstrate a more realistic
     +    example and include the sparse index expansion. Continue by creating
     +    this directory, confusing both caching algorithms somewhat:
     +
     +      mkdir -p bomb/d/e/f/a/a
     +
     +    Then re-run the 'git status' tests to see these statistics:
     +
     +        Sparse files in the index: 1,000,000
     +      sparse_lstat_count (before):   724,010
     +       sparse_lstat_count (after):       106
     +
     +      Benchmark 1: old
     +        Time (mean ± σ):     753.0 ms ±   3.5 ms
     +        Range (min … max):   749.7 ms … 760.9 ms    10 runs
     +
     +      Benchmark 2: new
     +        Time (mean ± σ):     201.4 ms ±   3.2 ms
     +        Range (min … max):   196.0 ms … 207.9 ms    14 runs
     +
     +      Summary
     +        'new' ran
     +          3.74 ± 0.06 times faster than 'old'
     +
     +    Note that if this repository had a sparse index enabled, the additional
     +    cost of expanding the sparse index affects the total time of these
     +    commands by over four seconds, significantly diminishing the benefit of
     +    the caching algorithm. Having existing paths outside of the
     +    sparse-checkout is a known performance issue for the sparse index and is
     +    a known trade-off for the performance benefits given when no such paths
     +    exist.
     +
          Using an internal monorepo with over two million paths at HEAD and a
     -    typical sparse-checkout cone such that the index contains ~190,000
     -    entries (including over two thousand sparse trees), I was able to
     -    measure these lstat counts when one sparse directory actually exists on
     -    disk:
     +    typical sparse-checkout cone such that the sparse index contains
     +    ~190,000 entries (including over two thousand sparse trees), I was able
     +    to measure these lstat counts when one sparse directory actually exists
     +    on disk:
      
            Sparse files in expanded index: 1,841,997
     -           full_lstat_count (before):   173,259
     -           full_lstat_count  (after):     6,521
     +           full_lstat_count (before): 1,188,161
     +           full_lstat_count  (after):     4,404
      
          This resulted in this absolute time change, on a warm disk:
      
     -          Time in full loop (before): 2.527 s
     -          Time in full loop  (after): 0.071 s
     +          Time in full loop (before): 13.481 s
     +          Time in full loop  (after):  0.081 s
      
          (These times were calculated on a Windows machine, where lstat() is
          slower than a similar Linux machine.)
      
     +    Helped-by: Elijah Newren <newren@gmail.com>
          Signed-off-by: Derrick Stolee <stolee@gmail.com>
      
       ## sparse-index.c ##
     @@ sparse-index.c: static void clear_path_found_data(struct path_found_data *data)
      +	 * At this point, we know that 'path' doesn't exist, and we know that
      +	 * the parent directory of 'data->dir' does exist. Let's set 'data->dir'
      +	 * to be the top-most non-existing directory of 'path'. If the first
     -+	 * parent of 'path' exists, then we will act ast though 'path'
     ++	 * parent of 'path' exists, then we will act as though 'path'
      +	 * corresponds to a directory (by adding a slash).
       	 */
      -	newdir = strrchr(path, '/');
     @@ sparse-index.c: static void clear_path_found_data(struct path_found_data *data)
      +	strbuf_setlen(&data->dir, common_prefix);
      +	while (1) {
      +		/* Find the next directory in 'path'. */
     -+		const char *next_slash = strchr(path + data->dir.len, '/');
     ++		const char *rest = path + data->dir.len;
     ++		const char *next_slash = strchr(rest, '/');
       
      -	data->lstat_count++;
      -	data->dir_found = !lstat(data->dir.buf, &st);
     @@ sparse-index.c: static void clear_path_found_data(struct path_found_data *data)
      +		 * method is reused, then this must be reconsidered.
      +		 */
      +		if (!next_slash) {
     -+			strbuf_addstr(&data->dir, path + data->dir.len);
     ++			strbuf_addstr(&data->dir, rest);
      +			strbuf_addch(&data->dir, '/');
      +			break;
      +		}
       
     --	return 0;
      +		/*
      +		 * Now that we have a slash, let's grow 'data->dir' to include
      +		 * this slash, then test if we should stop.
      +		 */
     -+		strbuf_add(&data->dir, path + data->dir.len,
     -+			   (next_slash - path) - data->dir.len + 1);
     ++		strbuf_add(&data->dir, rest, next_slash - rest + 1);
      +
     -+		/* If the path doesn't exist, then stop here. */
     ++		/* If the parent dir doesn't exist, then stop here. */
      +		data->lstat_count++;
      +		if (lstat(data->dir.buf, &st))
      +			return 0;
     @@ sparse-index.c: static void clear_path_found_data(struct path_found_data *data)
      +
      +	/*
      +	 * At this point, 'data->dir' is equal to 'path' plus a slash character,
     -+	 * and the parent directory of 'path' definitely exists. Let's return
     -+	 * the case of whether 'path' exists.
     ++	 * and the parent directory of 'path' definitely exists. Moreover, we
     ++	 * know that 'path' doesn't exist, or we would have returned 1 earlier.
      +	 */
     -+
     -+	data->lstat_count++;
     -+	return !lstat(path, &st);
     + 	return 0;
       }
       
     - static int clear_skip_worktree_from_present_files_sparse(struct index_state *istate)

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v2 1/5] sparse-checkout: refactor skip worktree retry logic
  2024-06-26 14:29 ` [PATCH v2 " Derrick Stolee via GitGitGadget
@ 2024-06-26 14:29   ` Derrick Stolee via GitGitGadget
  2024-06-27 20:59     ` Junio C Hamano
  2024-06-28  0:31     ` Elijah Newren
  2024-06-26 14:29   ` [PATCH v2 2/5] sparse-index: refactor path_found() Derrick Stolee via GitGitGadget
                     ` (5 subsequent siblings)
  6 siblings, 2 replies; 44+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-06-26 14:29 UTC (permalink / raw)
  To: git; +Cc: gitster, newren, anh, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

The clear_skip_worktree_from_present_files() method was introduced in
af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files present
in worktree, 2022-01-14) to help cases where sparse-checkout is enabled
but some paths outside of the sparse-checkout also exist on disk.  This
operation can be slow as it needs to check path existence in a way not
stored in the index, so caching was introduced in d79d299352 (Accelerate
clear_skip_worktree_from_present_files() by caching, 2022-01-14).

If users are having trouble with the performance of this operation and
don't care about paths outside of the sparse-checkout, they can disable
them using the sparse.expectFilesOutsideOfPatterns config option
introduced in ecc7c8841d (repo_read_index: add config to expect files
outside sparse patterns, 2022-02-25).

This check is particularly confusing in the presence of a sparse index,
as a sparse tree entry corresponding to an existing directory must first
be expanded to a full index before examining the paths within. This is
currently implemented using a 'goto' and a boolean variable to ensure we
restart only once.

Even with that caching, it was noticed that this could take a long time
to execute. 89aaab11a3 (index: add trace2 region for clear skip
worktree, 2022-11-03) introduced trace2 regions to measure this time.
Further, the way the loop repeats itself was slightly confusing and
prone to breakage, so a BUG() statement was added in 8c7abdc596 (index:
raise a bug if the index is materialised more than once, 2022-11-03) to
be sure that the second run of the loop does not hit any sparse trees.

One thing that can be confusing about the current setup is that the
trace2 regions nest and it is not clear that a second loop is running
after a sparse index is expanded. Here is an example of what the regions
look like in a typical case:

| region_enter | ... | label:clear_skip_worktree_from_present_files
| region_enter | ... | ..label:update
| region_leave | ... | ..label:update
| region_enter | ... | ..label:ensure_full_index
| region_enter | ... | ....label:update
| region_leave | ... | ....label:update
| region_leave | ... | ..label:ensure_full_index
| data         | ... | ..sparse_path_count:1
| data         | ... | ..sparse_path_count_full:269538
| region_leave | ... | label:clear_skip_worktree_from_present_files

One thing that is particularly difficult to understand about these
regions is that most of the time is spent between the close of the
ensure_full_index region and the reporting of the end data. This is
because of the restart of the loop being within the same region as the
first iteration of the loop.

This change refactors the method into two separate methods that are
traced separately. This will be more important later when we change
other features of the methods, but for now the only functional change is
the difference in the structure of the trace regions.

After this change, the same telemetry section is split into three
distinct chunks:

| region_enter | ... | label:clear_skip_worktree_from_present_files_sparse
| data         | ... | ..sparse_path_count:1
| region_leave | ... | label:clear_skip_worktree_from_present_files_sparse
| region_enter | ... | label:update
| region_leave | ... | label:update
| region_enter | ... | label:ensure_full_index
| region_enter | ... | ..label:update
| region_leave | ... | ..label:update
| region_leave | ... | label:ensure_full_index
| region_enter | ... | label:clear_skip_worktree_from_present_files_full
| data         | ... | ..full_path_count:269538
| region_leave | ... | label:clear_skip_worktree_from_present_files_full

Here, we see the sparse loop terminating early with its first sparse
path being a sparse directory containing a file. Then, that loop's
region terminates before ensure_full_index begins (in this case, the
cache-tree must also be computed). Then, _after_ the index is expanded,
the full loop begins with its own region.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 sparse-index.c | 77 ++++++++++++++++++++++++++++++++++----------------
 1 file changed, 53 insertions(+), 24 deletions(-)

diff --git a/sparse-index.c b/sparse-index.c
index e48e40cae71..e0457c87fff 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -486,49 +486,78 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len,
 	return 0;
 }

-void clear_skip_worktree_from_present_files(struct index_state *istate)
+static int clear_skip_worktree_from_present_files_sparse(struct index_state *istate)
 {
 	const char *last_dirname = NULL;
 	size_t dir_len = 0;
 	int dir_found = 1;

-	int i;
-	int path_count[2] = {0, 0};
-	int restarted = 0;
+	int path_count = 0;
+	int to_restart = 0;

-	if (!core_apply_sparse_checkout ||
-	    sparse_expect_files_outside_of_patterns)
-		return;
-
-	trace2_region_enter("index", "clear_skip_worktree_from_present_files",
+	trace2_region_enter("index", "clear_skip_worktree_from_present_files_sparse",
 			    istate->repo);
-restart:
-	for (i = 0; i < istate->cache_nr; i++) {
+	for (int i = 0; i < istate->cache_nr; i++) {
 		struct cache_entry *ce = istate->cache[i];

 		if (ce_skip_worktree(ce)) {
-			path_count[restarted]++;
+			path_count++;
 			if (path_found(ce->name, &last_dirname, &dir_len, &dir_found)) {
 				if (S_ISSPARSEDIR(ce->ce_mode)) {
-					if (restarted)
-						BUG("ensure-full-index did not fully flatten?");
-					ensure_full_index(istate);
-					restarted = 1;
-					goto restart;
+					to_restart = 1;
+					break;
 				}
 				ce->ce_flags &= ~CE_SKIP_WORKTREE;
 			}
 		}
 	}

-	if (path_count[0])
-		trace2_data_intmax("index", istate->repo,
-				   "sparse_path_count", path_count[0]);
-	if (restarted)
-		trace2_data_intmax("index", istate->repo,
-				   "sparse_path_count_full", path_count[1]);
-	trace2_region_leave("index", "clear_skip_worktree_from_present_files",
+	trace2_data_intmax("index", istate->repo,
+			   "sparse_path_count", path_count);
+	trace2_region_leave("index", "clear_skip_worktree_from_present_files_sparse",
+			    istate->repo);
+	return to_restart;
+}
+
+static void clear_skip_worktree_from_present_files_full(struct index_state *istate)
+{
+	const char *last_dirname = NULL;
+	size_t dir_len = 0;
+	int dir_found = 1;
+
+	int path_count = 0;
+
+	trace2_region_enter("index", "clear_skip_worktree_from_present_files_full",
 			    istate->repo);
+	for (int i = 0; i < istate->cache_nr; i++) {
+		struct cache_entry *ce = istate->cache[i];
+
+		if (S_ISSPARSEDIR(ce->ce_mode))
+			BUG("ensure-full-index did not fully flatten?");
+
+		if (ce_skip_worktree(ce)) {
+			path_count++;
+			if (path_found(ce->name, &last_dirname, &dir_len, &dir_found))
+				ce->ce_flags &= ~CE_SKIP_WORKTREE;
+		}
+	}
+
+	trace2_data_intmax("index", istate->repo,
+			   "full_path_count", path_count);
+	trace2_region_leave("index", "clear_skip_worktree_from_present_files_full",
+			    istate->repo);
+}
+
+void clear_skip_worktree_from_present_files(struct index_state *istate)
+{
+	if (!core_apply_sparse_checkout ||
+	    sparse_expect_files_outside_of_patterns)
+		return;
+
+	if (clear_skip_worktree_from_present_files_sparse(istate)) {
+		ensure_full_index(istate);
+		clear_skip_worktree_from_present_files_full(istate);
+	}
 }

 /*
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 1/5] sparse-checkout: refactor skip worktree retry logic
  2024-06-26 14:29   ` [PATCH v2 1/5] sparse-checkout: refactor skip worktree retry logic Derrick Stolee via GitGitGadget
@ 2024-06-27 20:59     ` Junio C Hamano
  2024-06-28  0:51       ` Elijah Newren
  2024-06-28  0:31     ` Elijah Newren
  1 sibling, 1 reply; 44+ messages in thread
From: Junio C Hamano @ 2024-06-27 20:59 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, newren, anh, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> -void clear_skip_worktree_from_present_files(struct index_state *istate)
> +static int clear_skip_worktree_from_present_files_sparse(struct index_state *istate)
>  {
>  	const char *last_dirname = NULL;
>  	size_t dir_len = 0;
>  	int dir_found = 1;
>  
> -	int i;
> -	int path_count[2] = {0, 0};
> -	int restarted = 0;
> +	int path_count = 0;
> +	int to_restart = 0;
>  
> -	if (!core_apply_sparse_checkout ||
> -	    sparse_expect_files_outside_of_patterns)
> -		return;
> -
> -	trace2_region_enter("index", "clear_skip_worktree_from_present_files",
> +	trace2_region_enter("index", "clear_skip_worktree_from_present_files_sparse",
>  			    istate->repo);
> -restart:
> -	for (i = 0; i < istate->cache_nr; i++) {
> +	for (int i = 0; i < istate->cache_nr; i++) {
>  		struct cache_entry *ce = istate->cache[i];
>  
>  		if (ce_skip_worktree(ce)) {
> -			path_count[restarted]++;
> +			path_count++;
>  			if (path_found(ce->name, &last_dirname, &dir_len, &dir_found)) {
>  				if (S_ISSPARSEDIR(ce->ce_mode)) {
> -					if (restarted)
> -						BUG("ensure-full-index did not fully flatten?");
> -					ensure_full_index(istate);
> -					restarted = 1;
> -					goto restart;
> +					to_restart = 1;
> +					break;
>  				}
>  				ce->ce_flags &= ~CE_SKIP_WORKTREE;
>  			}
>  		}
>  	}

Both original and the rewritten code shares one trait, which is that
it goes from the beginning to check all paths that are marked with
SKIP_WORKTREE bit to clear CE_SKIP_WORKTREE bits from them, until
they find a S_ISSPARSEDIR entry and lazily call ensure_full_index(),
but then when retrying after calling ensure_full_index(), it
restarts from the beginning.  I wonder if it would help, especially
if the S_ISSPARSEDIR entry comes very late in the index (e.g., by
returning "here is where we have already checked during the first
run, until we realized that we first need to do ensure_full_index()"
to the caller from here, and then the caller tells the second phase
to restart from there), to reduce the number of calls to path_found()?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 1/5] sparse-checkout: refactor skip worktree retry logic
  2024-06-27 20:59     ` Junio C Hamano
@ 2024-06-28  0:51       ` Elijah Newren
  2024-06-28  1:49         ` Derrick Stolee
  2024-06-28  5:50         ` Junio C Hamano
  0 siblings, 2 replies; 44+ messages in thread
From: Elijah Newren @ 2024-06-28  0:51 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Derrick Stolee via GitGitGadget, git, anh, Derrick Stolee

On Thu, Jun 27, 2024 at 1:59 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > -void clear_skip_worktree_from_present_files(struct index_state *istate)
> > +static int clear_skip_worktree_from_present_files_sparse(struct index_state *istate)
> >  {
> >       const char *last_dirname = NULL;
> >       size_t dir_len = 0;
> >       int dir_found = 1;
> >
> > -     int i;
> > -     int path_count[2] = {0, 0};
> > -     int restarted = 0;
> > +     int path_count = 0;
> > +     int to_restart = 0;
> >
> > -     if (!core_apply_sparse_checkout ||
> > -         sparse_expect_files_outside_of_patterns)
> > -             return;
> > -
> > -     trace2_region_enter("index", "clear_skip_worktree_from_present_files",
> > +     trace2_region_enter("index", "clear_skip_worktree_from_present_files_sparse",
> >                           istate->repo);
> > -restart:
> > -     for (i = 0; i < istate->cache_nr; i++) {
> > +     for (int i = 0; i < istate->cache_nr; i++) {
> >               struct cache_entry *ce = istate->cache[i];
> >
> >               if (ce_skip_worktree(ce)) {
> > -                     path_count[restarted]++;
> > +                     path_count++;
> >                       if (path_found(ce->name, &last_dirname, &dir_len, &dir_found)) {
> >                               if (S_ISSPARSEDIR(ce->ce_mode)) {
> > -                                     if (restarted)
> > -                                             BUG("ensure-full-index did not fully flatten?");
> > -                                     ensure_full_index(istate);
> > -                                     restarted = 1;
> > -                                     goto restart;
> > +                                     to_restart = 1;
> > +                                     break;
> >                               }
> >                               ce->ce_flags &= ~CE_SKIP_WORKTREE;
> >                       }
> >               }
> >       }
>
> Both original and the rewritten code shares one trait, which is that
> it goes from the beginning to check all paths that are marked with
> SKIP_WORKTREE bit to clear CE_SKIP_WORKTREE bits from them, until
> they find a S_ISSPARSEDIR entry and lazily call ensure_full_index(),
> but then when retrying after calling ensure_full_index(), it
> restarts from the beginning.  I wonder if it would help, especially
> if the S_ISSPARSEDIR entry comes very late in the index (e.g., by

"the S_ISSPARSEDIR entry"?  Are you assuming there is only one?

> returning "here is where we have already checked during the first
> run, until we realized that we first need to do ensure_full_index()"
> to the caller from here, and then the caller tells the second phase
> to restart from there), to reduce the number of calls to path_found()?

My first assumption about how to "restart from there", was to pass the
current index, i, to the more involved check.  That might fail to be
the correct index when there are multiple S_ISSPARSEDIR entries and
the first one (or more) are actually missing from the working copy as
expected.  However, if "restart from there" is just passing the
name of the relevant directory and we do some kind of binary search to
find where the earliest entry under that directory would exist in the
expanded index, then that seems like a reasonable additional
optimization.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 1/5] sparse-checkout: refactor skip worktree retry logic
  2024-06-28  0:51       ` Elijah Newren
@ 2024-06-28  1:49         ` Derrick Stolee
  2024-06-28  5:50         ` Junio C Hamano
  1 sibling, 0 replies; 44+ messages in thread
From: Derrick Stolee @ 2024-06-28  1:49 UTC (permalink / raw)
  To: Elijah Newren, Junio C Hamano; +Cc: Derrick Stolee via GitGitGadget, git, anh

On 6/27/24 8:51 PM, Elijah Newren wrote:
> On Thu, Jun 27, 2024 at 1:59 PM Junio C Hamano <gitster@pobox.com> wrote:
>>
>> Both original and the rewritten code shares one trait, which is that
>> it goes from the beginning to check all paths that are marked with
>> SKIP_WORKTREE bit to clear CE_SKIP_WORKTREE bits from them, until
>> they find a S_ISSPARSEDIR entry and lazily call ensure_full_index(),
>> but then when retrying after calling ensure_full_index(), it
>> restarts from the beginning.  I wonder if it would help, especially
>> if the S_ISSPARSEDIR entry comes very late in the index (e.g., by
> 
> "the S_ISSPARSEDIR entry"?  Are you assuming there is only one?

In fact, the situation is more drastic when there are many sparse
directories and this is one late in the list. The ensure_full_index()
call could cause many new entries that should be no-ops earlier in
the lex order.

>> returning "here is where we have already checked during the first
>> run, until we realized that we first need to do ensure_full_index()"
>> to the caller from here, and then the caller tells the second phase
>> to restart from there), to reduce the number of calls to path_found()?
> 
> My first assumption about how to "restart from there", was to pass the
> current index, i, to the more involved check.  That might fail to be
> the correct index when there are multiple S_ISSPARSEDIR entries and
> the first one (or more) are actually missing from the working copy as
> expected.  However, if "restart from there" is just passing the
> name of the relevant directory and we do some kind of binary search to
> find where the earliest entry under that directory would exist in the
> expanded index, then that seems like a reasonable additional
> optimization.

Yes, you're on the right track. Requires replacing the response from
the sparse loop to be the existing cache entry position or path
instead of only a boolean "should restart" indicator.

Another option would be to explore the whole index for the sparse
directories that exist, then expand the index only to those paths.
That should be fastest, but would require using the partially-
expanded state that has so far only happened inside of the
sparse-checkout builtin.

All this to say: there are some interesting directions to go from
here. I'm willing to take some time to explore the options. Should
we consider merging this improvement now and then consider further
improvements in a follow-up series (if they are fruitful)?

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 1/5] sparse-checkout: refactor skip worktree retry logic
  2024-06-28  0:51       ` Elijah Newren
  2024-06-28  1:49         ` Derrick Stolee
@ 2024-06-28  5:50         ` Junio C Hamano
  1 sibling, 0 replies; 44+ messages in thread
From: Junio C Hamano @ 2024-06-28  5:50 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Derrick Stolee via GitGitGadget, git, anh, Derrick Stolee

Elijah Newren <newren@gmail.com> writes:

>> restarts from the beginning.  I wonder if it would help, especially
>> if the S_ISSPARSEDIR entry comes very late in the index (e.g., by
>
> "the S_ISSPARSEDIR entry"?  Are you assuming there is only one?

No, but I was referring to the first one we encounter in the first
attempt loop that causes us to run the ensure_full() thing.  If the
first SPARSEDIR comes very late, there are many index entries before
that one that are already dealt with in the first attempt loop, and
for these early entries that are already handled, the second attempt
loop will do nothing but just skip.

> expected.  However, if "restart from there" is just passing the
> name of the relevant directory and we do some kind of binary search to
> find where the earliest entry under that directory would exist in the
> expanded index, then that seems like a reasonable additional
> optimization.

Perhaps.  I do not think it is known if it is worth doing, and I do
think it is better to finish this current round and then explore
further optimization opportunities on top after the dust settles.

Thanks.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 1/5] sparse-checkout: refactor skip worktree retry logic
  2024-06-26 14:29   ` [PATCH v2 1/5] sparse-checkout: refactor skip worktree retry logic Derrick Stolee via GitGitGadget
  2024-06-27 20:59     ` Junio C Hamano
@ 2024-06-28  0:31     ` Elijah Newren
  2024-06-28  1:56       ` Derrick Stolee
  1 sibling, 1 reply; 44+ messages in thread
From: Elijah Newren @ 2024-06-28  0:31 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, gitster, anh, Derrick Stolee

On Wed, Jun 26, 2024 at 7:29 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> From: Derrick Stolee <stolee@gmail.com>
>
[...]
> If users are having trouble with the performance of this operation and
> don't care about paths outside of the sparse-checkout, they can disable
> them using the sparse.expectFilesOutsideOfPatterns config option
> introduced in ecc7c8841d (repo_read_index: add config to expect files
> outside sparse patterns, 2022-02-25).

So, I had some heartburn with this paragraph when reading v1, but
decided to focus on other stuff.  But it still really bugs me while
reading v2.  So...

The purpose for the sparse.expectFilesOutsideOfPatterns option is very
specifically for the virtual-filesystem usecase (Google, specifically)
where sparse files aren't meant to stay sparse but be brought to life
the instant they are accessed (via a specialized filesystem and kernel
modules and whatnot).  Using it for any other reason, such as a
workaround for performance issues here, seems risky to me and I don't
like seeing it suggested.  The "if users...don't care about paths
outside of the sparse-checkout" really doesn't do it justice.  See the
"User-facing issues" section of
https://lore.kernel.org/git/b263cc75b7d4f426fb051d2cae5ae0fabeebb9c9.1642092230.git.gitgitgadget@gmail.com/;
we're foisting all those bugs back on users if they don't have such a
specialized filesystem and turn this knob on.  And then we'll get many
bug reports that are close to impossible to track down, and virtually
impossible to fix even if we do track it down.  I had for a while
suggested numerous fixes we needed in order to make
SKIP_WORKTREE-but-actually-present files work better, which required
fixes all over the codebase and which was serving as an impediment
e.g. to Victoria's desired sparse-index changes.  We gave up on those
other efforts long ago in favor of this much cleaner and simpler
solution of just clearing the SKIP_WORKTREE bit if the file wasn't
missing.

Further, I've seen people read git.git commit messages on Stack
Overflow and make suggestions based off them, so I'm a bit worried
this paragraph as worded will end up causing active harm, when I think
it's original intent was merely to provide full context around the
history of the clear_skip_worktree_from_present_files_sparse()
function.  While this bit of information is part of the history of
this function, I'm not sure it's really all that relevant to your
series.

Could we omit this paragraph, or if you want to keep it, reword it in
some way that doesn't make it look like some tradeoff that isn't that
big of a deal for non-specialized-vfs users?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 1/5] sparse-checkout: refactor skip worktree retry logic
  2024-06-28  0:31     ` Elijah Newren
@ 2024-06-28  1:56       ` Derrick Stolee
  0 siblings, 0 replies; 44+ messages in thread
From: Derrick Stolee @ 2024-06-28  1:56 UTC (permalink / raw)
  To: Elijah Newren, Derrick Stolee via GitGitGadget; +Cc: git, gitster, anh

On 6/27/24 8:31 PM, Elijah Newren wrote:
> On Wed, Jun 26, 2024 at 7:29 AM Derrick Stolee via GitGitGadget
> <gitgitgadget@gmail.com> wrote:
>>
>> From: Derrick Stolee <stolee@gmail.com>
>>
> [...]
>> If users are having trouble with the performance of this operation and
>> don't care about paths outside of the sparse-checkout, they can disable
>> them using the sparse.expectFilesOutsideOfPatterns config option
>> introduced in ecc7c8841d (repo_read_index: add config to expect files
>> outside sparse patterns, 2022-02-25).
> 
> So, I had some heartburn with this paragraph when reading v1, but
> decided to focus on other stuff.  But it still really bugs me while
> reading v2.  So...

Thanks for bringing this up with the details I've omitted from the
context of my response.

> Could we omit this paragraph, or if you want to keep it, reword it in
> some way that doesn't make it look like some tradeoff that isn't that
> big of a deal for non-specialized-vfs users?

Makes sense to delete the paragraph. I could ship a v3, or is this
a commit-message surgery you'd be willing to do on my behalf, Junio?

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v2 2/5] sparse-index: refactor path_found()
  2024-06-26 14:29 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  2024-06-26 14:29   ` [PATCH v2 1/5] sparse-checkout: refactor skip worktree retry logic Derrick Stolee via GitGitGadget
@ 2024-06-26 14:29   ` Derrick Stolee via GitGitGadget
  2024-06-26 14:29   ` [PATCH v2 3/5] sparse-index: use strbuf in path_found() Derrick Stolee via GitGitGadget
                     ` (4 subsequent siblings)
  6 siblings, 0 replies; 44+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-06-26 14:29 UTC (permalink / raw)
  To: git; +Cc: gitster, newren, anh, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

In advance of changing the behavior of path_found(), take all of the
intermediate data values and group them into a single struct. This
simplifies the method prototype as well as the initialization. Future
changes can be made directly to the struct and method without changing
the callers with this approach.

Note that the clear_path_found_data() method is currently empty, as
there is nothing to free. This method is a placeholder for future
changes that require a non-trivial implementation. Its stub is created
now so consumers could call it now and not change in future changes.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 sparse-index.c | 45 +++++++++++++++++++++++++++++----------------
 1 file changed, 29 insertions(+), 16 deletions(-)

diff --git a/sparse-index.c b/sparse-index.c
index e0457c87fff..de6e727f5c1 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -439,8 +439,22 @@ void ensure_correct_sparsity(struct index_state *istate)
 		ensure_full_index(istate);
 }
 
-static int path_found(const char *path, const char **dirname, size_t *dir_len,
-		      int *dir_found)
+struct path_found_data {
+	const char *dirname;
+	size_t dir_len;
+	int dir_found;
+};
+
+#define PATH_FOUND_DATA_INIT { \
+	.dir_found = 1 \
+}
+
+static void clear_path_found_data(struct path_found_data *data)
+{
+	return;
+}
+
+static int path_found(const char *path, struct path_found_data *data)
 {
 	struct stat st;
 	char *newdir;
@@ -450,7 +464,7 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len,
 	 * If dirname corresponds to a directory that doesn't exist, and this
 	 * path starts with dirname, then path can't exist.
 	 */
-	if (!*dir_found && !memcmp(path, *dirname, *dir_len))
+	if (!data->dir_found && !memcmp(path, data->dirname, data->dir_len))
 		return 0;
 
 	/*
@@ -472,15 +486,16 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len,
 	 * If path starts with directory (which we already lstat'ed and found),
 	 * then no need to lstat parent directory again.
 	 */
-	if (*dir_found && *dirname && memcmp(path, *dirname, *dir_len))
+	if (data->dir_found && data->dirname &&
+	    memcmp(path, data->dirname, data->dir_len))
 		return 0;
 
 	/* Free previous dirname, and cache path's dirname */
-	*dirname = path;
-	*dir_len = newdir - path + 1;
+	data->dirname = path;
+	data->dir_len = newdir - path + 1;
 
-	tmp = xstrndup(path, *dir_len);
-	*dir_found = !lstat(tmp, &st);
+	tmp = xstrndup(path, data->dir_len);
+	data->dir_found = !lstat(tmp, &st);
 	free(tmp);
 
 	return 0;
@@ -488,9 +503,7 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len,
 
 static int clear_skip_worktree_from_present_files_sparse(struct index_state *istate)
 {
-	const char *last_dirname = NULL;
-	size_t dir_len = 0;
-	int dir_found = 1;
+	struct path_found_data data = PATH_FOUND_DATA_INIT;
 
 	int path_count = 0;
 	int to_restart = 0;
@@ -502,7 +515,7 @@ static int clear_skip_worktree_from_present_files_sparse(struct index_state *ist
 
 		if (ce_skip_worktree(ce)) {
 			path_count++;
-			if (path_found(ce->name, &last_dirname, &dir_len, &dir_found)) {
+			if (path_found(ce->name, &data)) {
 				if (S_ISSPARSEDIR(ce->ce_mode)) {
 					to_restart = 1;
 					break;
@@ -516,14 +529,13 @@ static int clear_skip_worktree_from_present_files_sparse(struct index_state *ist
 			   "sparse_path_count", path_count);
 	trace2_region_leave("index", "clear_skip_worktree_from_present_files_sparse",
 			    istate->repo);
+	clear_path_found_data(&data);
 	return to_restart;
 }
 
 static void clear_skip_worktree_from_present_files_full(struct index_state *istate)
 {
-	const char *last_dirname = NULL;
-	size_t dir_len = 0;
-	int dir_found = 1;
+	struct path_found_data data = PATH_FOUND_DATA_INIT;
 
 	int path_count = 0;
 
@@ -537,7 +549,7 @@ static void clear_skip_worktree_from_present_files_full(struct index_state *ista
 
 		if (ce_skip_worktree(ce)) {
 			path_count++;
-			if (path_found(ce->name, &last_dirname, &dir_len, &dir_found))
+			if (path_found(ce->name, &data))
 				ce->ce_flags &= ~CE_SKIP_WORKTREE;
 		}
 	}
@@ -546,6 +558,7 @@ static void clear_skip_worktree_from_present_files_full(struct index_state *ista
 			   "full_path_count", path_count);
 	trace2_region_leave("index", "clear_skip_worktree_from_present_files_full",
 			    istate->repo);
+	clear_path_found_data(&data);
 }
 
 void clear_skip_worktree_from_present_files(struct index_state *istate)
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v2 3/5] sparse-index: use strbuf in path_found()
  2024-06-26 14:29 ` [PATCH v2 " Derrick Stolee via GitGitGadget
  2024-06-26 14:29   ` [PATCH v2 1/5] sparse-checkout: refactor skip worktree retry logic Derrick Stolee via GitGitGadget
  2024-06-26 14:29   ` [PATCH v2 2/5] sparse-index: refactor path_found() Derrick Stolee via GitGitGadget
@ 2024-06-26 14:29   ` Derrick Stolee via GitGitGadget
  2024-06-26 14:29   ` [PATCH v2 4/5] sparse-index: count lstat() calls Derrick Stolee via GitGitGadget
                     ` (3 subsequent siblings)
  6 siblings, 0 replies; 44+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-06-26 14:29 UTC (permalink / raw)
  To: git; +Cc: gitster, newren, anh, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

The path_found() method previously reused strings from the cache entries
the calling methods were using. This prevents string manipulation in
place and causes some odd reallocation before the final lstat() call in
the method.

Refactor the method to use strbufs and copy the path into the strbuf,
but also only the parent directory and not the whole path. This looks
like extra copying when assigning the path to the strbuf, but we save an
allocation by dropping the 'tmp' string, and we are "reusing" the copy
from 'tmp' to put the data in the strbuf.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 sparse-index.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/sparse-index.c b/sparse-index.c
index de6e727f5c1..fec4f393360 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -440,31 +440,30 @@ void ensure_correct_sparsity(struct index_state *istate)
 }
 
 struct path_found_data {
-	const char *dirname;
-	size_t dir_len;
+	struct strbuf dir;
 	int dir_found;
 };
 
 #define PATH_FOUND_DATA_INIT { \
+	.dir = STRBUF_INIT, \
 	.dir_found = 1 \
 }
 
 static void clear_path_found_data(struct path_found_data *data)
 {
-	return;
+	strbuf_release(&data->dir);
 }
 
 static int path_found(const char *path, struct path_found_data *data)
 {
 	struct stat st;
 	char *newdir;
-	char *tmp;
 
 	/*
 	 * If dirname corresponds to a directory that doesn't exist, and this
 	 * path starts with dirname, then path can't exist.
 	 */
-	if (!data->dir_found && !memcmp(path, data->dirname, data->dir_len))
+	if (!data->dir_found && !memcmp(path, data->dir.buf, data->dir.len))
 		return 0;
 
 	/*
@@ -486,17 +485,15 @@ static int path_found(const char *path, struct path_found_data *data)
 	 * If path starts with directory (which we already lstat'ed and found),
 	 * then no need to lstat parent directory again.
 	 */
-	if (data->dir_found && data->dirname &&
-	    memcmp(path, data->dirname, data->dir_len))
+	if (data->dir_found && data->dir.buf &&
+	    memcmp(path, data->dir.buf, data->dir.len))
 		return 0;
 
 	/* Free previous dirname, and cache path's dirname */
-	data->dirname = path;
-	data->dir_len = newdir - path + 1;
+	strbuf_reset(&data->dir);
+	strbuf_add(&data->dir, path, newdir - path + 1);
 
-	tmp = xstrndup(path, data->dir_len);
-	data->dir_found = !lstat(tmp, &st);
-	free(tmp);
+	data->dir_found = !lstat(data->dir.buf, &st);
 
 	return 0;
 }
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v2 4/5] sparse-index: count lstat() calls
  2024-06-26 14:29 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (2 preceding siblings ...)
  2024-06-26 14:29   ` [PATCH v2 3/5] sparse-index: use strbuf in path_found() Derrick Stolee via GitGitGadget
@ 2024-06-26 14:29   ` Derrick Stolee via GitGitGadget
  2024-06-26 14:29   ` [PATCH v2 5/5] sparse-index: improve lstat caching of sparse paths Derrick Stolee via GitGitGadget
                     ` (2 subsequent siblings)
  6 siblings, 0 replies; 44+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-06-26 14:29 UTC (permalink / raw)
  To: git; +Cc: gitster, newren, anh, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

The clear_skip_worktree.. methods already report some statistics about
how many cache entries are checked against path_found() due to having
the skip-worktree bit set. However, due to path_found() performing some
caching, this isn't the only information that would be helpful to
report.

Add a new lstat_count member to the path_found_data struct to count the
number of times path_found() calls lstat(). This will be helpful to help
explain performance problems in this method as well as to demonstrate
future changes to the caching algorithm in a more concrete way than
end-to-end timings.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 sparse-index.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/sparse-index.c b/sparse-index.c
index fec4f393360..8577fa726b8 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -442,6 +442,7 @@ void ensure_correct_sparsity(struct index_state *istate)
 struct path_found_data {
 	struct strbuf dir;
 	int dir_found;
+	size_t lstat_count;
 };
 
 #define PATH_FOUND_DATA_INIT { \
@@ -469,6 +470,7 @@ static int path_found(const char *path, struct path_found_data *data)
 	/*
 	 * If path itself exists, return 1.
 	 */
+	data->lstat_count++;
 	if (!lstat(path, &st))
 		return 1;
 
@@ -493,6 +495,7 @@ static int path_found(const char *path, struct path_found_data *data)
 	strbuf_reset(&data->dir);
 	strbuf_add(&data->dir, path, newdir - path + 1);
 
+	data->lstat_count++;
 	data->dir_found = !lstat(data->dir.buf, &st);
 
 	return 0;
@@ -524,6 +527,8 @@ static int clear_skip_worktree_from_present_files_sparse(struct index_state *ist
 
 	trace2_data_intmax("index", istate->repo,
 			   "sparse_path_count", path_count);
+	trace2_data_intmax("index", istate->repo,
+			   "sparse_lstat_count", data.lstat_count);
 	trace2_region_leave("index", "clear_skip_worktree_from_present_files_sparse",
 			    istate->repo);
 	clear_path_found_data(&data);
@@ -553,6 +558,8 @@ static void clear_skip_worktree_from_present_files_full(struct index_state *ista
 
 	trace2_data_intmax("index", istate->repo,
 			   "full_path_count", path_count);
+	trace2_data_intmax("index", istate->repo,
+			   "full_lstat_count", data.lstat_count);
 	trace2_region_leave("index", "clear_skip_worktree_from_present_files_full",
 			    istate->repo);
 	clear_path_found_data(&data);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v2 5/5] sparse-index: improve lstat caching of sparse paths
  2024-06-26 14:29 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (3 preceding siblings ...)
  2024-06-26 14:29   ` [PATCH v2 4/5] sparse-index: count lstat() calls Derrick Stolee via GitGitGadget
@ 2024-06-26 14:29   ` Derrick Stolee via GitGitGadget
  2024-06-27 21:14     ` Junio C Hamano
  2024-06-27 21:46   ` [PATCH v2 0/5] sparse-index: improve clear_skip_worktree_from_present_files() Junio C Hamano
  2024-06-28 12:43   ` [PATCH v3 " Derrick Stolee via GitGitGadget
  6 siblings, 1 reply; 44+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-06-26 14:29 UTC (permalink / raw)
  To: git; +Cc: gitster, newren, anh, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

The clear_skip_worktree_from_present_files() method was first introduced
in af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files
present in worktree, 2022-01-14) to allow better interaction with the
working directory in the presence of paths outside of the
sparse-checkout. The initial implementation would lstat() every single
SKIP_WORKTREE path to see if it existed; if it ran across a sparse
directory that existed (when a sparse index was in use), then it would
expand the index and then check every SKIP_WORKTREE path.

Since these lstat() calls were very expensive, this was improved in
d79d299352 (Accelerate clear_skip_worktree_from_present_files() by
caching, 2022-01-14) by caching directories that do not exist so it
could avoid lstat()ing any files under such directories. However, there
are some inefficiencies in that caching mechanism.

The caching mechanism stored only the parent directory as not existing,
even if a higher parent directory also does not exist. This means that
wasted lstat() calls would occur when the paths passed to path_found()
change immediate parent directories but within the same parent directory
that does not exist.

To create an example repository that demonstrates this problem, it helps
to have a directory outside of the sparse-checkout that contains many
deep paths. In particular, the first paths (in lexicographic order)
underneath the sparse directory should have deep directory structures,
maximizing the difference between the old caching algorithm that looks
to a single parent and the new caching algorithm that looks to the
top-most missing directory.

The performance test script p2000-sparse-operations.sh takes the sample
repository and copies its HEAD to several copies nested in directories
of the form f<i>/f<j>/f<k> where i, j, and k are numbers from 1 to 4.
The sparse-checkout cone is then selected as "f2/f4/". Creating "f1/f1/"
will trigger the behavior and also lead to some interesting cases for
the caching algorithm since "f1/f1/" exists but "f1/f2/" and "f3/" do
not.

This is difficult to notice when running performance tests using the Git
repository (or a blow-up of the Git repository, as in
p2000-sparse-operations.sh) because Git has a very shallow directory
structure.

This change reorganizes the caching algorithm to focus on storing the
highest level leading directory that does not exist; specifically this
means that that directory's parent _does_ exist. By doing a little extra
work on a path passed to path_found(), we can short-circuit all of the
paths passed to path_found() afterwards that match a prefix with that
non-existing directory. When in a repository where the first sparse file
is likely to have a much deeper path than the first non-existing
directory, this can realize significant gains.

The details of this algorithm require careful attention, so the new
implementation of path_found() has detailed comments, including the use
of a new max_common_dir_prefix() method that may be of independent
interest.

It's worth noting that this is not universally positive, since we are
doing extra lstat() calls to establish the exact path to cache. In the
blow-up of the Git repository, we can see that the lstat count
_increases_ from 28 to 31. However, these numbers were already
artificially low.

Contributor Elijah Newren created a publicly-available test repository
that demonstrates the difference in these caching algorithms in the most
extreme way. To test, follow these steps:

  git clone --sparse https://github.com/newren/gvfs-like-git-bomb
  cd gvfs-like-git-bomb
  ./runme.sh                   # NOTE: check scripts before running!

At this point, assuming you do not have index.sparse=true set globally,
the index has one million paths with the SKIP_WORKTREE bit and they will
all be sent to path_found() in the sparse loop. You can measure this by
running 'git status' with GIT_TRACE2_PERF=1:

    Sparse files in the index: 1,000,000
  sparse_lstat_count (before):   200,000
   sparse_lstat_count (after):         2

And here are the performance numbers:

  Benchmark 1: old
    Time (mean ± σ):     397.5 ms ±   4.1 ms
    Range (min … max):   391.2 ms … 404.8 ms    10 runs

  Benchmark 2: new
    Time (mean ± σ):     252.7 ms ±   3.1 ms
    Range (min … max):   249.4 ms … 259.5 ms    11 runs

  Summary
    'new' ran
      1.57 ± 0.02 times faster than 'old'

By modifying this example further, we can demonstrate a more realistic
example and include the sparse index expansion. Continue by creating
this directory, confusing both caching algorithms somewhat:

  mkdir -p bomb/d/e/f/a/a

Then re-run the 'git status' tests to see these statistics:

    Sparse files in the index: 1,000,000
  sparse_lstat_count (before):   724,010
   sparse_lstat_count (after):       106

  Benchmark 1: old
    Time (mean ± σ):     753.0 ms ±   3.5 ms
    Range (min … max):   749.7 ms … 760.9 ms    10 runs

  Benchmark 2: new
    Time (mean ± σ):     201.4 ms ±   3.2 ms
    Range (min … max):   196.0 ms … 207.9 ms    14 runs

  Summary
    'new' ran
      3.74 ± 0.06 times faster than 'old'

Note that if this repository had a sparse index enabled, the additional
cost of expanding the sparse index affects the total time of these
commands by over four seconds, significantly diminishing the benefit of
the caching algorithm. Having existing paths outside of the
sparse-checkout is a known performance issue for the sparse index and is
a known trade-off for the performance benefits given when no such paths
exist.

Using an internal monorepo with over two million paths at HEAD and a
typical sparse-checkout cone such that the sparse index contains
~190,000 entries (including over two thousand sparse trees), I was able
to measure these lstat counts when one sparse directory actually exists
on disk:

  Sparse files in expanded index: 1,841,997
       full_lstat_count (before): 1,188,161
       full_lstat_count  (after):     4,404

This resulted in this absolute time change, on a warm disk:

      Time in full loop (before): 13.481 s
      Time in full loop  (after):  0.081 s

(These times were calculated on a Windows machine, where lstat() is
slower than a similar Linux machine.)

Helped-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 sparse-index.c | 114 ++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 90 insertions(+), 24 deletions(-)

diff --git a/sparse-index.c b/sparse-index.c
index 8577fa726b8..7e433250248 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -440,14 +440,21 @@ void ensure_correct_sparsity(struct index_state *istate)
 }
 
 struct path_found_data {
+	/**
+	 * The path stored in 'dir', if non-empty, corresponds to the most-
+	 * recent path that we checked where:
+	 *
+	 *   1. The path should be a directory, according to the index.
+	 *   2. The path does not exist.
+	 *   3. The parent path _does_ exist. (This may be the root of the
+	 *      working directory.)
+	 */
 	struct strbuf dir;
-	int dir_found;
 	size_t lstat_count;
 };
 
 #define PATH_FOUND_DATA_INIT { \
-	.dir = STRBUF_INIT, \
-	.dir_found = 1 \
+	.dir = STRBUF_INIT \
 }
 
 static void clear_path_found_data(struct path_found_data *data)
@@ -455,49 +462,108 @@ static void clear_path_found_data(struct path_found_data *data)
 	strbuf_release(&data->dir);
 }
 
+/**
+ * Return the length of the largest common substring that ends in a
+ * slash ('/') to indicate the largest common parent directory. Returns
+ * zero if no common directory exists.
+ */
+static size_t max_common_dir_prefix(const char *path1, const char *path2)
+{
+	size_t common_prefix = 0;
+	for (size_t i = 0; path1[i] && path2[i]; i++) {
+		if (path1[i] != path2[i])
+			break;
+
+		/*
+		 * If they agree at a directory separator, then add one
+		 * to make sure it is included in the common prefix string.
+		 */
+		if (path1[i] == '/')
+			common_prefix = i + 1;
+	}
+
+	return common_prefix;
+}
+
 static int path_found(const char *path, struct path_found_data *data)
 {
 	struct stat st;
-	char *newdir;
+	size_t common_prefix;
 
 	/*
-	 * If dirname corresponds to a directory that doesn't exist, and this
-	 * path starts with dirname, then path can't exist.
+	 * If data->dir is non-empty, then it contains a path that doesn't
+	 * exist, including an ending slash ('/'). If it is a prefix of 'path',
+	 * then we can return 0.
 	 */
-	if (!data->dir_found && !memcmp(path, data->dir.buf, data->dir.len))
+	if (data->dir.len && !memcmp(path, data->dir.buf, data->dir.len))
 		return 0;
 
 	/*
-	 * If path itself exists, return 1.
+	 * Otherwise, we must check if the current path exists. If it does, then
+	 * return 1. The cached directory will be skipped until we come across
+	 * a missing path again.
 	 */
 	data->lstat_count++;
 	if (!lstat(path, &st))
 		return 1;
 
 	/*
-	 * Otherwise, path does not exist so we'll return 0...but we'll first
-	 * determine some info about its parent directory so we can avoid
-	 * lstat calls for future cache entries.
+	 * At this point, we know that 'path' doesn't exist, and we know that
+	 * the parent directory of 'data->dir' does exist. Let's set 'data->dir'
+	 * to be the top-most non-existing directory of 'path'. If the first
+	 * parent of 'path' exists, then we will act as though 'path'
+	 * corresponds to a directory (by adding a slash).
 	 */
-	newdir = strrchr(path, '/');
-	if (!newdir)
-		return 0; /* Didn't find a parent dir; just return 0 now. */
+	common_prefix = max_common_dir_prefix(path, data->dir.buf);
 
 	/*
-	 * If path starts with directory (which we already lstat'ed and found),
-	 * then no need to lstat parent directory again.
+	 * At this point, 'path' and 'data->dir' have a common existing parent
+	 * directory given by path[0..common_prefix] (which could have length 0).
+	 * We "grow" the data->dir buffer by checking for existing directories
+	 * along 'path'.
 	 */
-	if (data->dir_found && data->dir.buf &&
-	    memcmp(path, data->dir.buf, data->dir.len))
-		return 0;
 
-	/* Free previous dirname, and cache path's dirname */
-	strbuf_reset(&data->dir);
-	strbuf_add(&data->dir, path, newdir - path + 1);
+	strbuf_setlen(&data->dir, common_prefix);
+	while (1) {
+		/* Find the next directory in 'path'. */
+		const char *rest = path + data->dir.len;
+		const char *next_slash = strchr(rest, '/');
 
-	data->lstat_count++;
-	data->dir_found = !lstat(data->dir.buf, &st);
+		/*
+		 * If there are no more slashes, then 'path' doesn't contain a
+		 * non-existent _parent_ directory. Set 'data->dir' to be equal
+		 * to 'path' plus an additional slash, so it can be used for
+		 * caching in the future. The filename of 'path' is considered
+		 * a non-existent directory.
+		 *
+		 * Note: if "{path}/" exists as a directory, then it will never
+		 * appear as a prefix of other callers to this method, assuming
+		 * the context from the clear_skip_worktree... methods. If this
+		 * method is reused, then this must be reconsidered.
+		 */
+		if (!next_slash) {
+			strbuf_addstr(&data->dir, rest);
+			strbuf_addch(&data->dir, '/');
+			break;
+		}
 
+		/*
+		 * Now that we have a slash, let's grow 'data->dir' to include
+		 * this slash, then test if we should stop.
+		 */
+		strbuf_add(&data->dir, rest, next_slash - rest + 1);
+
+		/* If the parent dir doesn't exist, then stop here. */
+		data->lstat_count++;
+		if (lstat(data->dir.buf, &st))
+			return 0;
+	}
+
+	/*
+	 * At this point, 'data->dir' is equal to 'path' plus a slash character,
+	 * and the parent directory of 'path' definitely exists. Moreover, we
+	 * know that 'path' doesn't exist, or we would have returned 1 earlier.
+	 */
 	return 0;
 }
 
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 5/5] sparse-index: improve lstat caching of sparse paths
  2024-06-26 14:29   ` [PATCH v2 5/5] sparse-index: improve lstat caching of sparse paths Derrick Stolee via GitGitGadget
@ 2024-06-27 21:14     ` Junio C Hamano
  2024-06-28  1:56       ` Derrick Stolee
  0 siblings, 1 reply; 44+ messages in thread
From: Junio C Hamano @ 2024-06-27 21:14 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, newren, anh, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

>  struct path_found_data {
> +	/**
> +	 * The path stored in 'dir', if non-empty, corresponds to the most-
> +	 * recent path that we checked where:
> +	 *
> +	 *   1. The path should be a directory, according to the index.
> +	 *   2. The path does not exist.
> +	 *   3. The parent path _does_ exist. (This may be the root of the
> +	 *      working directory.)
> +	 */
>  	struct strbuf dir;
> -	int dir_found;
>  	size_t lstat_count;
>  };
>  
>  #define PATH_FOUND_DATA_INIT { \
> -	.dir = STRBUF_INIT, \
> -	.dir_found = 1 \
> +	.dir = STRBUF_INIT \
>  }
>  
>  static void clear_path_found_data(struct path_found_data *data)
> @@ -455,49 +462,108 @@ static void clear_path_found_data(struct path_found_data *data)
>  	strbuf_release(&data->dir);
>  }
>  
> +/**
> + * Return the length of the largest common substring that ends in a

"largest" here is understandable (it means longest).

> + * slash ('/') to indicate the largest common parent directory. Returns

here I find it a bit confusing.  It is the deepest common parent
directory between the two (or "the common parent directory with the
longest path"), but my initial reaction was "largest common parent
directory?  wouldn't the root level by definition the 'largest'
(having the largest number of paths underneath) directory that is
common?)".

> + * zero if no common directory exists.
> + */
> +static size_t max_common_dir_prefix(const char *path1, const char *path2)
> +{
> +	size_t common_prefix = 0;
> +	for (size_t i = 0; path1[i] && path2[i]; i++) {
> +		if (path1[i] != path2[i])
> +			break;
> +
> +		/*
> +		 * If they agree at a directory separator, then add one
> +		 * to make sure it is included in the common prefix string.
> +		 */
> +		if (path1[i] == '/')
> +			common_prefix = i + 1;
> +	}
> +
> +	return common_prefix;
> +}

Looking good.  I assume that these two paths are relative to the
top-level of the working tree (in other words, they do not begin
with a slash)?

>  static int path_found(const char *path, struct path_found_data *data)
>  {
> ...
> +	 * At this point, we know that 'path' doesn't exist, and we know that
> +	 * the parent directory of 'data->dir' does exist. Let's set 'data->dir'
> +	 * to be the top-most non-existing directory of 'path'. If the first
> +	 * parent of 'path' exists, then we will act as though 'path'
> +	 * corresponds to a directory (by adding a slash).
>  	 */
> -	newdir = strrchr(path, '/');
> -	if (!newdir)
> -		return 0; /* Didn't find a parent dir; just return 0 now. */
> +	common_prefix = max_common_dir_prefix(path, data->dir.buf);
> ...
> +	strbuf_setlen(&data->dir, common_prefix);
> +	while (1) {

Oooh, nice.  So you learned /a/b/c/d did not exist, so check /a/b/c,
and then /a/b/ and stop, because you know /a does exist already.
With luck, our next query is for /a/b/c/e or for /a/b/f, and knowing
that /a/b/ does not exist would allow us to say "no, they do not
exist" without having to lstat().  OK.


^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 5/5] sparse-index: improve lstat caching of sparse paths
  2024-06-27 21:14     ` Junio C Hamano
@ 2024-06-28  1:56       ` Derrick Stolee
  0 siblings, 0 replies; 44+ messages in thread
From: Derrick Stolee @ 2024-06-28  1:56 UTC (permalink / raw)
  To: Junio C Hamano, Derrick Stolee via GitGitGadget; +Cc: git, newren, anh

On 6/27/24 5:14 PM, Junio C Hamano wrote:
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>> +/**
>> + * Return the length of the largest common substring that ends in a
> 
> "largest" here is understandable (it means longest).
> 
>> + * slash ('/') to indicate the largest common parent directory. Returns
> 
> here I find it a bit confusing.  It is the deepest common parent
> directory between the two (or "the common parent directory with the
> longest path"), but my initial reaction was "largest common parent
> directory?  wouldn't the root level by definition the 'largest'
> (having the largest number of paths underneath) directory that is
> common?)".

I'm not sure why my brain got stuck on "largest" here when "longest"
would be a better choice.

>> + * zero if no common directory exists.
>> + */
>> +static size_t max_common_dir_prefix(const char *path1, const char *path2)
>> +{
>> +	size_t common_prefix = 0;
>> +	for (size_t i = 0; path1[i] && path2[i]; i++) {
>> +		if (path1[i] != path2[i])
>> +			break;
>> +
>> +		/*
>> +		 * If they agree at a directory separator, then add one
>> +		 * to make sure it is included in the common prefix string.
>> +		 */
>> +		if (path1[i] == '/')
>> +			common_prefix = i + 1;
>> +	}
>> +
>> +	return common_prefix;
>> +}
> 
> Looking good.  I assume that these two paths are relative to the
> top-level of the working tree (in other words, they do not begin
> with a slash)?

Yes, they do not begin with a slash. In this specific application, they
are paths from cache entries in the index. If this were generalized for
use elsewhere then paths beginning with a slash should be considered.

>>   static int path_found(const char *path, struct path_found_data *data)
>>   {
>> ...
>> +	 * At this point, we know that 'path' doesn't exist, and we know that
>> +	 * the parent directory of 'data->dir' does exist. Let's set 'data->dir'
>> +	 * to be the top-most non-existing directory of 'path'. If the first
>> +	 * parent of 'path' exists, then we will act as though 'path'
>> +	 * corresponds to a directory (by adding a slash).
>>   	 */
>> -	newdir = strrchr(path, '/');
>> -	if (!newdir)
>> -		return 0; /* Didn't find a parent dir; just return 0 now. */
>> +	common_prefix = max_common_dir_prefix(path, data->dir.buf);
>> ...
>> +	strbuf_setlen(&data->dir, common_prefix);
>> +	while (1) {
> 
> Oooh, nice.  So you learned /a/b/c/d did not exist, so check /a/b/c,
> and then /a/b/ and stop, because you know /a does exist already.
> With luck, our next query is for /a/b/c/e or for /a/b/f, and knowing
> that /a/b/ does not exist would allow us to say "no, they do not
> exist" without having to lstat().  OK.

I probably should have led with an example like this, as it makes the
point more clearly than my abstract description.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 0/5] sparse-index: improve clear_skip_worktree_from_present_files()
  2024-06-26 14:29 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (4 preceding siblings ...)
  2024-06-26 14:29   ` [PATCH v2 5/5] sparse-index: improve lstat caching of sparse paths Derrick Stolee via GitGitGadget
@ 2024-06-27 21:46   ` Junio C Hamano
  2024-06-28  0:59     ` Elijah Newren
  2024-06-28 12:43   ` [PATCH v3 " Derrick Stolee via GitGitGadget
  6 siblings, 1 reply; 44+ messages in thread
From: Junio C Hamano @ 2024-06-27 21:46 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, newren, anh, Derrick Stolee

"Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:

> Updates in v2
> =============
>
> Thanks to Elijah for a thorough review, leading to valuable improvements.
>
>  * I was mistaken that the sparse index was required for this logic to
>    happen. This has changed several descriptions across the commit messages.
>  * The final lstat() in path_found() was not needed, so is removed in v2.
>    This saves even more time and lstat() calls, updating the stats.
>  * Elijah created a particularly nasty example for testing, which I include
>    in my final patch. He gets a "Helped-by" credit for this.
>  * Several comments, variables, and other improvements based on Elijah's
>    recommendations.
>
> Thanks, Stolee

Thanks, both.  This round was a pleasant read.

Let's mark it for 'next' soonish.


        

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 0/5] sparse-index: improve clear_skip_worktree_from_present_files()
  2024-06-27 21:46   ` [PATCH v2 0/5] sparse-index: improve clear_skip_worktree_from_present_files() Junio C Hamano
@ 2024-06-28  0:59     ` Elijah Newren
  2024-06-28  1:57       ` Derrick Stolee
  0 siblings, 1 reply; 44+ messages in thread
From: Elijah Newren @ 2024-06-28  0:59 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Derrick Stolee via GitGitGadget, git, anh, Derrick Stolee

On Thu, Jun 27, 2024 at 2:46 PM Junio C Hamano <gitster@pobox.com> wrote:
>
> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>
> > Updates in v2
> > =============
> >
> > Thanks to Elijah for a thorough review, leading to valuable improvements.
> >
> >  * I was mistaken that the sparse index was required for this logic to
> >    happen. This has changed several descriptions across the commit messages.
> >  * The final lstat() in path_found() was not needed, so is removed in v2.
> >    This saves even more time and lstat() calls, updating the stats.
> >  * Elijah created a particularly nasty example for testing, which I include
> >    in my final patch. He gets a "Helped-by" credit for this.
> >  * Several comments, variables, and other improvements based on Elijah's
> >    recommendations.
> >
> > Thanks, Stolee
>
> Thanks, both.  This round was a pleasant read.
>
> Let's mark it for 'next' soonish.

Yeah, this version looks pretty good.  There was a possible further
improvement you suggested, but I think that wouldn't need to hold up
this series.

However, there is one paragraph from the commit message of 1/5 that
I'd like to see stricken first (or rewritten).  Once that's done, I
think this series is ready for next.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v2 0/5] sparse-index: improve clear_skip_worktree_from_present_files()
  2024-06-28  0:59     ` Elijah Newren
@ 2024-06-28  1:57       ` Derrick Stolee
  0 siblings, 0 replies; 44+ messages in thread
From: Derrick Stolee @ 2024-06-28  1:57 UTC (permalink / raw)
  To: Elijah Newren, Junio C Hamano; +Cc: Derrick Stolee via GitGitGadget, git, anh

On 6/27/24 8:59 PM, Elijah Newren wrote:
> On Thu, Jun 27, 2024 at 2:46 PM Junio C Hamano <gitster@pobox.com> wrote:
>>
>> "Derrick Stolee via GitGitGadget" <gitgitgadget@gmail.com> writes:
>>
>>> Updates in v2
>>> =============
>>>
>>> Thanks to Elijah for a thorough review, leading to valuable improvements.
>>>
>>>   * I was mistaken that the sparse index was required for this logic to
>>>     happen. This has changed several descriptions across the commit messages.
>>>   * The final lstat() in path_found() was not needed, so is removed in v2.
>>>     This saves even more time and lstat() calls, updating the stats.
>>>   * Elijah created a particularly nasty example for testing, which I include
>>>     in my final patch. He gets a "Helped-by" credit for this.
>>>   * Several comments, variables, and other improvements based on Elijah's
>>>     recommendations.
>>>
>>> Thanks, Stolee
>>
>> Thanks, both.  This round was a pleasant read.
>>
>> Let's mark it for 'next' soonish.
> 
> Yeah, this version looks pretty good.  There was a possible further
> improvement you suggested, but I think that wouldn't need to hold up
> this series.
> 
> However, there is one paragraph from the commit message of 1/5 that
> I'd like to see stricken first (or rewritten).  Once that's done, I
> think this series is ready for next.

I agree about the removal of that paragraph, but also there is a
"largest" to "longest" replacement in a few comments that makes sense
to change.

I can send a v3 with these changes.

Thanks,
-Stolee

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v3 0/5] sparse-index: improve clear_skip_worktree_from_present_files()
  2024-06-26 14:29 ` [PATCH v2 " Derrick Stolee via GitGitGadget
                     ` (5 preceding siblings ...)
  2024-06-27 21:46   ` [PATCH v2 0/5] sparse-index: improve clear_skip_worktree_from_present_files() Junio C Hamano
@ 2024-06-28 12:43   ` Derrick Stolee via GitGitGadget
  2024-06-28 12:43     ` [PATCH v3 1/5] sparse-checkout: refactor skip worktree retry logic Derrick Stolee via GitGitGadget
                       ` (5 more replies)
  6 siblings, 6 replies; 44+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-06-28 12:43 UTC (permalink / raw)
  To: git; +Cc: gitster, newren, anh, Derrick Stolee

While doing some investigation in a private monorepo with sparse-checkout
and a sparse index, I accidentally left a modified file outside of my
sparse-checkout cone. This caused my Git commands to slow to a crawl, so I
reran with GIT_TRACE2_PERF=1.

While I was able to identify clear_skip_worktree_from_present_files() as the
culprit, it took longer than desired to figure out what was going on. This
series intends to both fix the performance issue (as much as possible) and
do some refactoring to make it easier to understand what is happening.

In the end, I was able to reduce the number of lstat() calls in my case from
over 1.1 million to about 4,400, improving the time from 13.4s to 81ms on a
warm disk cache. (These numbers are from a test after v2, which somehow hit
the old caching algorithm even worse than my test in v1.)


Updates in v3
=============

 * Removed the incorrect paragraph in the commit message of patch 1.
 * Replaced "largest" with "longest" in the final patch.

Thanks, Stolee

Derrick Stolee (5):
  sparse-checkout: refactor skip worktree retry logic
  sparse-index: refactor path_found()
  sparse-index: use strbuf in path_found()
  sparse-index: count lstat() calls
  sparse-index: improve lstat caching of sparse paths

 sparse-index.c | 216 +++++++++++++++++++++++++++++++++++++------------
 1 file changed, 164 insertions(+), 52 deletions(-)


base-commit: 66ac6e4bcd111be3fa9c2a6b3fafea718d00678d
Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1754%2Fderrickstolee%2Fclear-skip-speed-v3
Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1754/derrickstolee/clear-skip-speed-v3
Pull-Request: https://github.com/gitgitgadget/git/pull/1754

Range-diff vs v2:

 1:  93d0baed0b0 ! 1:  0844cda94cf sparse-checkout: refactor skip worktree retry logic
     @@ Commit message
          stored in the index, so caching was introduced in d79d299352 (Accelerate
          clear_skip_worktree_from_present_files() by caching, 2022-01-14).
      
     -    If users are having trouble with the performance of this operation and
     -    don't care about paths outside of the sparse-checkout, they can disable
     -    them using the sparse.expectFilesOutsideOfPatterns config option
     -    introduced in ecc7c8841d (repo_read_index: add config to expect files
     -    outside sparse patterns, 2022-02-25).
     -
          This check is particularly confusing in the presence of a sparse index,
          as a sparse tree entry corresponding to an existing directory must first
          be expanded to a full index before examining the paths within. This is
 2:  69c3beaabf7 = 2:  c242e2c9168 sparse-index: refactor path_found()
 3:  0a82e6b4183 = 3:  ad63bf746ca sparse-index: use strbuf in path_found()
 4:  9549f5b8062 = 4:  db6ded0df0d sparse-index: count lstat() calls
 5:  0cb344ac14f ! 5:  1f58e19691f sparse-index: improve lstat caching of sparse paths
     @@ sparse-index.c: static void clear_path_found_data(struct path_found_data *data)
       }
       
      +/**
     -+ * Return the length of the largest common substring that ends in a
     -+ * slash ('/') to indicate the largest common parent directory. Returns
     ++ * Return the length of the longest common substring that ends in a
     ++ * slash ('/') to indicate the longest common parent directory. Returns
      + * zero if no common directory exists.
      + */
      +static size_t max_common_dir_prefix(const char *path1, const char *path2)

-- 
gitgitgadget

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [PATCH v3 1/5] sparse-checkout: refactor skip worktree retry logic
  2024-06-28 12:43   ` [PATCH v3 " Derrick Stolee via GitGitGadget
@ 2024-06-28 12:43     ` Derrick Stolee via GitGitGadget
  2024-06-28 12:43     ` [PATCH v3 2/5] sparse-index: refactor path_found() Derrick Stolee via GitGitGadget
                       ` (4 subsequent siblings)
  5 siblings, 0 replies; 44+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-06-28 12:43 UTC (permalink / raw)
  To: git; +Cc: gitster, newren, anh, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

The clear_skip_worktree_from_present_files() method was introduced in
af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files present
in worktree, 2022-01-14) to help cases where sparse-checkout is enabled
but some paths outside of the sparse-checkout also exist on disk.  This
operation can be slow as it needs to check path existence in a way not
stored in the index, so caching was introduced in d79d299352 (Accelerate
clear_skip_worktree_from_present_files() by caching, 2022-01-14).

This check is particularly confusing in the presence of a sparse index,
as a sparse tree entry corresponding to an existing directory must first
be expanded to a full index before examining the paths within. This is
currently implemented using a 'goto' and a boolean variable to ensure we
restart only once.

Even with that caching, it was noticed that this could take a long time
to execute. 89aaab11a3 (index: add trace2 region for clear skip
worktree, 2022-11-03) introduced trace2 regions to measure this time.
Further, the way the loop repeats itself was slightly confusing and
prone to breakage, so a BUG() statement was added in 8c7abdc596 (index:
raise a bug if the index is materialised more than once, 2022-11-03) to
be sure that the second run of the loop does not hit any sparse trees.

One thing that can be confusing about the current setup is that the
trace2 regions nest and it is not clear that a second loop is running
after a sparse index is expanded. Here is an example of what the regions
look like in a typical case:

| region_enter | ... | label:clear_skip_worktree_from_present_files
| region_enter | ... | ..label:update
| region_leave | ... | ..label:update
| region_enter | ... | ..label:ensure_full_index
| region_enter | ... | ....label:update
| region_leave | ... | ....label:update
| region_leave | ... | ..label:ensure_full_index
| data         | ... | ..sparse_path_count:1
| data         | ... | ..sparse_path_count_full:269538
| region_leave | ... | label:clear_skip_worktree_from_present_files

One thing that is particularly difficult to understand about these
regions is that most of the time is spent between the close of the
ensure_full_index region and the reporting of the end data. This is
because of the restart of the loop being within the same region as the
first iteration of the loop.

This change refactors the method into two separate methods that are
traced separately. This will be more important later when we change
other features of the methods, but for now the only functional change is
the difference in the structure of the trace regions.

After this change, the same telemetry section is split into three
distinct chunks:

| region_enter | ... | label:clear_skip_worktree_from_present_files_sparse
| data         | ... | ..sparse_path_count:1
| region_leave | ... | label:clear_skip_worktree_from_present_files_sparse
| region_enter | ... | label:update
| region_leave | ... | label:update
| region_enter | ... | label:ensure_full_index
| region_enter | ... | ..label:update
| region_leave | ... | ..label:update
| region_leave | ... | label:ensure_full_index
| region_enter | ... | label:clear_skip_worktree_from_present_files_full
| data         | ... | ..full_path_count:269538
| region_leave | ... | label:clear_skip_worktree_from_present_files_full

Here, we see the sparse loop terminating early with its first sparse
path being a sparse directory containing a file. Then, that loop's
region terminates before ensure_full_index begins (in this case, the
cache-tree must also be computed). Then, _after_ the index is expanded,
the full loop begins with its own region.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 sparse-index.c | 77 ++++++++++++++++++++++++++++++++++----------------
 1 file changed, 53 insertions(+), 24 deletions(-)

diff --git a/sparse-index.c b/sparse-index.c
index e48e40cae71..e0457c87fff 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -486,49 +486,78 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len,
 	return 0;
 }

-void clear_skip_worktree_from_present_files(struct index_state *istate)
+static int clear_skip_worktree_from_present_files_sparse(struct index_state *istate)
 {
 	const char *last_dirname = NULL;
 	size_t dir_len = 0;
 	int dir_found = 1;

-	int i;
-	int path_count[2] = {0, 0};
-	int restarted = 0;
+	int path_count = 0;
+	int to_restart = 0;

-	if (!core_apply_sparse_checkout ||
-	    sparse_expect_files_outside_of_patterns)
-		return;
-
-	trace2_region_enter("index", "clear_skip_worktree_from_present_files",
+	trace2_region_enter("index", "clear_skip_worktree_from_present_files_sparse",
 			    istate->repo);
-restart:
-	for (i = 0; i < istate->cache_nr; i++) {
+	for (int i = 0; i < istate->cache_nr; i++) {
 		struct cache_entry *ce = istate->cache[i];

 		if (ce_skip_worktree(ce)) {
-			path_count[restarted]++;
+			path_count++;
 			if (path_found(ce->name, &last_dirname, &dir_len, &dir_found)) {
 				if (S_ISSPARSEDIR(ce->ce_mode)) {
-					if (restarted)
-						BUG("ensure-full-index did not fully flatten?");
-					ensure_full_index(istate);
-					restarted = 1;
-					goto restart;
+					to_restart = 1;
+					break;
 				}
 				ce->ce_flags &= ~CE_SKIP_WORKTREE;
 			}
 		}
 	}

-	if (path_count[0])
-		trace2_data_intmax("index", istate->repo,
-				   "sparse_path_count", path_count[0]);
-	if (restarted)
-		trace2_data_intmax("index", istate->repo,
-				   "sparse_path_count_full", path_count[1]);
-	trace2_region_leave("index", "clear_skip_worktree_from_present_files",
+	trace2_data_intmax("index", istate->repo,
+			   "sparse_path_count", path_count);
+	trace2_region_leave("index", "clear_skip_worktree_from_present_files_sparse",
+			    istate->repo);
+	return to_restart;
+}
+
+static void clear_skip_worktree_from_present_files_full(struct index_state *istate)
+{
+	const char *last_dirname = NULL;
+	size_t dir_len = 0;
+	int dir_found = 1;
+
+	int path_count = 0;
+
+	trace2_region_enter("index", "clear_skip_worktree_from_present_files_full",
 			    istate->repo);
+	for (int i = 0; i < istate->cache_nr; i++) {
+		struct cache_entry *ce = istate->cache[i];
+
+		if (S_ISSPARSEDIR(ce->ce_mode))
+			BUG("ensure-full-index did not fully flatten?");
+
+		if (ce_skip_worktree(ce)) {
+			path_count++;
+			if (path_found(ce->name, &last_dirname, &dir_len, &dir_found))
+				ce->ce_flags &= ~CE_SKIP_WORKTREE;
+		}
+	}
+
+	trace2_data_intmax("index", istate->repo,
+			   "full_path_count", path_count);
+	trace2_region_leave("index", "clear_skip_worktree_from_present_files_full",
+			    istate->repo);
+}
+
+void clear_skip_worktree_from_present_files(struct index_state *istate)
+{
+	if (!core_apply_sparse_checkout ||
+	    sparse_expect_files_outside_of_patterns)
+		return;
+
+	if (clear_skip_worktree_from_present_files_sparse(istate)) {
+		ensure_full_index(istate);
+		clear_skip_worktree_from_present_files_full(istate);
+	}
 }

 /*
-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 2/5] sparse-index: refactor path_found()
  2024-06-28 12:43   ` [PATCH v3 " Derrick Stolee via GitGitGadget
  2024-06-28 12:43     ` [PATCH v3 1/5] sparse-checkout: refactor skip worktree retry logic Derrick Stolee via GitGitGadget
@ 2024-06-28 12:43     ` Derrick Stolee via GitGitGadget
  2024-06-28 12:43     ` [PATCH v3 3/5] sparse-index: use strbuf in path_found() Derrick Stolee via GitGitGadget
                       ` (3 subsequent siblings)
  5 siblings, 0 replies; 44+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-06-28 12:43 UTC (permalink / raw)
  To: git; +Cc: gitster, newren, anh, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

In advance of changing the behavior of path_found(), take all of the
intermediate data values and group them into a single struct. This
simplifies the method prototype as well as the initialization. Future
changes can be made directly to the struct and method without changing
the callers with this approach.

Note that the clear_path_found_data() method is currently empty, as
there is nothing to free. This method is a placeholder for future
changes that require a non-trivial implementation. Its stub is created
now so consumers could call it now and not change in future changes.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 sparse-index.c | 45 +++++++++++++++++++++++++++++----------------
 1 file changed, 29 insertions(+), 16 deletions(-)

diff --git a/sparse-index.c b/sparse-index.c
index e0457c87fff..de6e727f5c1 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -439,8 +439,22 @@ void ensure_correct_sparsity(struct index_state *istate)
 		ensure_full_index(istate);
 }
 
-static int path_found(const char *path, const char **dirname, size_t *dir_len,
-		      int *dir_found)
+struct path_found_data {
+	const char *dirname;
+	size_t dir_len;
+	int dir_found;
+};
+
+#define PATH_FOUND_DATA_INIT { \
+	.dir_found = 1 \
+}
+
+static void clear_path_found_data(struct path_found_data *data)
+{
+	return;
+}
+
+static int path_found(const char *path, struct path_found_data *data)
 {
 	struct stat st;
 	char *newdir;
@@ -450,7 +464,7 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len,
 	 * If dirname corresponds to a directory that doesn't exist, and this
 	 * path starts with dirname, then path can't exist.
 	 */
-	if (!*dir_found && !memcmp(path, *dirname, *dir_len))
+	if (!data->dir_found && !memcmp(path, data->dirname, data->dir_len))
 		return 0;
 
 	/*
@@ -472,15 +486,16 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len,
 	 * If path starts with directory (which we already lstat'ed and found),
 	 * then no need to lstat parent directory again.
 	 */
-	if (*dir_found && *dirname && memcmp(path, *dirname, *dir_len))
+	if (data->dir_found && data->dirname &&
+	    memcmp(path, data->dirname, data->dir_len))
 		return 0;
 
 	/* Free previous dirname, and cache path's dirname */
-	*dirname = path;
-	*dir_len = newdir - path + 1;
+	data->dirname = path;
+	data->dir_len = newdir - path + 1;
 
-	tmp = xstrndup(path, *dir_len);
-	*dir_found = !lstat(tmp, &st);
+	tmp = xstrndup(path, data->dir_len);
+	data->dir_found = !lstat(tmp, &st);
 	free(tmp);
 
 	return 0;
@@ -488,9 +503,7 @@ static int path_found(const char *path, const char **dirname, size_t *dir_len,
 
 static int clear_skip_worktree_from_present_files_sparse(struct index_state *istate)
 {
-	const char *last_dirname = NULL;
-	size_t dir_len = 0;
-	int dir_found = 1;
+	struct path_found_data data = PATH_FOUND_DATA_INIT;
 
 	int path_count = 0;
 	int to_restart = 0;
@@ -502,7 +515,7 @@ static int clear_skip_worktree_from_present_files_sparse(struct index_state *ist
 
 		if (ce_skip_worktree(ce)) {
 			path_count++;
-			if (path_found(ce->name, &last_dirname, &dir_len, &dir_found)) {
+			if (path_found(ce->name, &data)) {
 				if (S_ISSPARSEDIR(ce->ce_mode)) {
 					to_restart = 1;
 					break;
@@ -516,14 +529,13 @@ static int clear_skip_worktree_from_present_files_sparse(struct index_state *ist
 			   "sparse_path_count", path_count);
 	trace2_region_leave("index", "clear_skip_worktree_from_present_files_sparse",
 			    istate->repo);
+	clear_path_found_data(&data);
 	return to_restart;
 }
 
 static void clear_skip_worktree_from_present_files_full(struct index_state *istate)
 {
-	const char *last_dirname = NULL;
-	size_t dir_len = 0;
-	int dir_found = 1;
+	struct path_found_data data = PATH_FOUND_DATA_INIT;
 
 	int path_count = 0;
 
@@ -537,7 +549,7 @@ static void clear_skip_worktree_from_present_files_full(struct index_state *ista
 
 		if (ce_skip_worktree(ce)) {
 			path_count++;
-			if (path_found(ce->name, &last_dirname, &dir_len, &dir_found))
+			if (path_found(ce->name, &data))
 				ce->ce_flags &= ~CE_SKIP_WORKTREE;
 		}
 	}
@@ -546,6 +558,7 @@ static void clear_skip_worktree_from_present_files_full(struct index_state *ista
 			   "full_path_count", path_count);
 	trace2_region_leave("index", "clear_skip_worktree_from_present_files_full",
 			    istate->repo);
+	clear_path_found_data(&data);
 }
 
 void clear_skip_worktree_from_present_files(struct index_state *istate)
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 3/5] sparse-index: use strbuf in path_found()
  2024-06-28 12:43   ` [PATCH v3 " Derrick Stolee via GitGitGadget
  2024-06-28 12:43     ` [PATCH v3 1/5] sparse-checkout: refactor skip worktree retry logic Derrick Stolee via GitGitGadget
  2024-06-28 12:43     ` [PATCH v3 2/5] sparse-index: refactor path_found() Derrick Stolee via GitGitGadget
@ 2024-06-28 12:43     ` Derrick Stolee via GitGitGadget
  2024-06-28 12:43     ` [PATCH v3 4/5] sparse-index: count lstat() calls Derrick Stolee via GitGitGadget
                       ` (2 subsequent siblings)
  5 siblings, 0 replies; 44+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-06-28 12:43 UTC (permalink / raw)
  To: git; +Cc: gitster, newren, anh, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

The path_found() method previously reused strings from the cache entries
the calling methods were using. This prevents string manipulation in
place and causes some odd reallocation before the final lstat() call in
the method.

Refactor the method to use strbufs and copy the path into the strbuf,
but also only the parent directory and not the whole path. This looks
like extra copying when assigning the path to the strbuf, but we save an
allocation by dropping the 'tmp' string, and we are "reusing" the copy
from 'tmp' to put the data in the strbuf.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 sparse-index.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/sparse-index.c b/sparse-index.c
index de6e727f5c1..fec4f393360 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -440,31 +440,30 @@ void ensure_correct_sparsity(struct index_state *istate)
 }
 
 struct path_found_data {
-	const char *dirname;
-	size_t dir_len;
+	struct strbuf dir;
 	int dir_found;
 };
 
 #define PATH_FOUND_DATA_INIT { \
+	.dir = STRBUF_INIT, \
 	.dir_found = 1 \
 }
 
 static void clear_path_found_data(struct path_found_data *data)
 {
-	return;
+	strbuf_release(&data->dir);
 }
 
 static int path_found(const char *path, struct path_found_data *data)
 {
 	struct stat st;
 	char *newdir;
-	char *tmp;
 
 	/*
 	 * If dirname corresponds to a directory that doesn't exist, and this
 	 * path starts with dirname, then path can't exist.
 	 */
-	if (!data->dir_found && !memcmp(path, data->dirname, data->dir_len))
+	if (!data->dir_found && !memcmp(path, data->dir.buf, data->dir.len))
 		return 0;
 
 	/*
@@ -486,17 +485,15 @@ static int path_found(const char *path, struct path_found_data *data)
 	 * If path starts with directory (which we already lstat'ed and found),
 	 * then no need to lstat parent directory again.
 	 */
-	if (data->dir_found && data->dirname &&
-	    memcmp(path, data->dirname, data->dir_len))
+	if (data->dir_found && data->dir.buf &&
+	    memcmp(path, data->dir.buf, data->dir.len))
 		return 0;
 
 	/* Free previous dirname, and cache path's dirname */
-	data->dirname = path;
-	data->dir_len = newdir - path + 1;
+	strbuf_reset(&data->dir);
+	strbuf_add(&data->dir, path, newdir - path + 1);
 
-	tmp = xstrndup(path, data->dir_len);
-	data->dir_found = !lstat(tmp, &st);
-	free(tmp);
+	data->dir_found = !lstat(data->dir.buf, &st);
 
 	return 0;
 }
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 4/5] sparse-index: count lstat() calls
  2024-06-28 12:43   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (2 preceding siblings ...)
  2024-06-28 12:43     ` [PATCH v3 3/5] sparse-index: use strbuf in path_found() Derrick Stolee via GitGitGadget
@ 2024-06-28 12:43     ` Derrick Stolee via GitGitGadget
  2024-06-28 12:43     ` [PATCH v3 5/5] sparse-index: improve lstat caching of sparse paths Derrick Stolee via GitGitGadget
  2024-06-28 15:07     ` [PATCH v3 0/5] sparse-index: improve clear_skip_worktree_from_present_files() Elijah Newren
  5 siblings, 0 replies; 44+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-06-28 12:43 UTC (permalink / raw)
  To: git; +Cc: gitster, newren, anh, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

The clear_skip_worktree.. methods already report some statistics about
how many cache entries are checked against path_found() due to having
the skip-worktree bit set. However, due to path_found() performing some
caching, this isn't the only information that would be helpful to
report.

Add a new lstat_count member to the path_found_data struct to count the
number of times path_found() calls lstat(). This will be helpful to help
explain performance problems in this method as well as to demonstrate
future changes to the caching algorithm in a more concrete way than
end-to-end timings.

Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 sparse-index.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/sparse-index.c b/sparse-index.c
index fec4f393360..8577fa726b8 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -442,6 +442,7 @@ void ensure_correct_sparsity(struct index_state *istate)
 struct path_found_data {
 	struct strbuf dir;
 	int dir_found;
+	size_t lstat_count;
 };
 
 #define PATH_FOUND_DATA_INIT { \
@@ -469,6 +470,7 @@ static int path_found(const char *path, struct path_found_data *data)
 	/*
 	 * If path itself exists, return 1.
 	 */
+	data->lstat_count++;
 	if (!lstat(path, &st))
 		return 1;
 
@@ -493,6 +495,7 @@ static int path_found(const char *path, struct path_found_data *data)
 	strbuf_reset(&data->dir);
 	strbuf_add(&data->dir, path, newdir - path + 1);
 
+	data->lstat_count++;
 	data->dir_found = !lstat(data->dir.buf, &st);
 
 	return 0;
@@ -524,6 +527,8 @@ static int clear_skip_worktree_from_present_files_sparse(struct index_state *ist
 
 	trace2_data_intmax("index", istate->repo,
 			   "sparse_path_count", path_count);
+	trace2_data_intmax("index", istate->repo,
+			   "sparse_lstat_count", data.lstat_count);
 	trace2_region_leave("index", "clear_skip_worktree_from_present_files_sparse",
 			    istate->repo);
 	clear_path_found_data(&data);
@@ -553,6 +558,8 @@ static void clear_skip_worktree_from_present_files_full(struct index_state *ista
 
 	trace2_data_intmax("index", istate->repo,
 			   "full_path_count", path_count);
+	trace2_data_intmax("index", istate->repo,
+			   "full_lstat_count", data.lstat_count);
 	trace2_region_leave("index", "clear_skip_worktree_from_present_files_full",
 			    istate->repo);
 	clear_path_found_data(&data);
-- 
gitgitgadget


^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [PATCH v3 5/5] sparse-index: improve lstat caching of sparse paths
  2024-06-28 12:43   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (3 preceding siblings ...)
  2024-06-28 12:43     ` [PATCH v3 4/5] sparse-index: count lstat() calls Derrick Stolee via GitGitGadget
@ 2024-06-28 12:43     ` Derrick Stolee via GitGitGadget
  2024-06-28 15:07     ` [PATCH v3 0/5] sparse-index: improve clear_skip_worktree_from_present_files() Elijah Newren
  5 siblings, 0 replies; 44+ messages in thread
From: Derrick Stolee via GitGitGadget @ 2024-06-28 12:43 UTC (permalink / raw)
  To: git; +Cc: gitster, newren, anh, Derrick Stolee, Derrick Stolee

From: Derrick Stolee <stolee@gmail.com>

The clear_skip_worktree_from_present_files() method was first introduced
in af6a51875a (repo_read_index: clear SKIP_WORKTREE bit from files
present in worktree, 2022-01-14) to allow better interaction with the
working directory in the presence of paths outside of the
sparse-checkout. The initial implementation would lstat() every single
SKIP_WORKTREE path to see if it existed; if it ran across a sparse
directory that existed (when a sparse index was in use), then it would
expand the index and then check every SKIP_WORKTREE path.

Since these lstat() calls were very expensive, this was improved in
d79d299352 (Accelerate clear_skip_worktree_from_present_files() by
caching, 2022-01-14) by caching directories that do not exist so it
could avoid lstat()ing any files under such directories. However, there
are some inefficiencies in that caching mechanism.

The caching mechanism stored only the parent directory as not existing,
even if a higher parent directory also does not exist. This means that
wasted lstat() calls would occur when the paths passed to path_found()
change immediate parent directories but within the same parent directory
that does not exist.

To create an example repository that demonstrates this problem, it helps
to have a directory outside of the sparse-checkout that contains many
deep paths. In particular, the first paths (in lexicographic order)
underneath the sparse directory should have deep directory structures,
maximizing the difference between the old caching algorithm that looks
to a single parent and the new caching algorithm that looks to the
top-most missing directory.

The performance test script p2000-sparse-operations.sh takes the sample
repository and copies its HEAD to several copies nested in directories
of the form f<i>/f<j>/f<k> where i, j, and k are numbers from 1 to 4.
The sparse-checkout cone is then selected as "f2/f4/". Creating "f1/f1/"
will trigger the behavior and also lead to some interesting cases for
the caching algorithm since "f1/f1/" exists but "f1/f2/" and "f3/" do
not.

This is difficult to notice when running performance tests using the Git
repository (or a blow-up of the Git repository, as in
p2000-sparse-operations.sh) because Git has a very shallow directory
structure.

This change reorganizes the caching algorithm to focus on storing the
highest level leading directory that does not exist; specifically this
means that that directory's parent _does_ exist. By doing a little extra
work on a path passed to path_found(), we can short-circuit all of the
paths passed to path_found() afterwards that match a prefix with that
non-existing directory. When in a repository where the first sparse file
is likely to have a much deeper path than the first non-existing
directory, this can realize significant gains.

The details of this algorithm require careful attention, so the new
implementation of path_found() has detailed comments, including the use
of a new max_common_dir_prefix() method that may be of independent
interest.

It's worth noting that this is not universally positive, since we are
doing extra lstat() calls to establish the exact path to cache. In the
blow-up of the Git repository, we can see that the lstat count
_increases_ from 28 to 31. However, these numbers were already
artificially low.

Contributor Elijah Newren created a publicly-available test repository
that demonstrates the difference in these caching algorithms in the most
extreme way. To test, follow these steps:

  git clone --sparse https://github.com/newren/gvfs-like-git-bomb
  cd gvfs-like-git-bomb
  ./runme.sh                   # NOTE: check scripts before running!

At this point, assuming you do not have index.sparse=true set globally,
the index has one million paths with the SKIP_WORKTREE bit and they will
all be sent to path_found() in the sparse loop. You can measure this by
running 'git status' with GIT_TRACE2_PERF=1:

    Sparse files in the index: 1,000,000
  sparse_lstat_count (before):   200,000
   sparse_lstat_count (after):         2

And here are the performance numbers:

  Benchmark 1: old
    Time (mean ± σ):     397.5 ms ±   4.1 ms
    Range (min … max):   391.2 ms … 404.8 ms    10 runs

  Benchmark 2: new
    Time (mean ± σ):     252.7 ms ±   3.1 ms
    Range (min … max):   249.4 ms … 259.5 ms    11 runs

  Summary
    'new' ran
      1.57 ± 0.02 times faster than 'old'

By modifying this example further, we can demonstrate a more realistic
example and include the sparse index expansion. Continue by creating
this directory, confusing both caching algorithms somewhat:

  mkdir -p bomb/d/e/f/a/a

Then re-run the 'git status' tests to see these statistics:

    Sparse files in the index: 1,000,000
  sparse_lstat_count (before):   724,010
   sparse_lstat_count (after):       106

  Benchmark 1: old
    Time (mean ± σ):     753.0 ms ±   3.5 ms
    Range (min … max):   749.7 ms … 760.9 ms    10 runs

  Benchmark 2: new
    Time (mean ± σ):     201.4 ms ±   3.2 ms
    Range (min … max):   196.0 ms … 207.9 ms    14 runs

  Summary
    'new' ran
      3.74 ± 0.06 times faster than 'old'

Note that if this repository had a sparse index enabled, the additional
cost of expanding the sparse index affects the total time of these
commands by over four seconds, significantly diminishing the benefit of
the caching algorithm. Having existing paths outside of the
sparse-checkout is a known performance issue for the sparse index and is
a known trade-off for the performance benefits given when no such paths
exist.

Using an internal monorepo with over two million paths at HEAD and a
typical sparse-checkout cone such that the sparse index contains
~190,000 entries (including over two thousand sparse trees), I was able
to measure these lstat counts when one sparse directory actually exists
on disk:

  Sparse files in expanded index: 1,841,997
       full_lstat_count (before): 1,188,161
       full_lstat_count  (after):     4,404

This resulted in this absolute time change, on a warm disk:

      Time in full loop (before): 13.481 s
      Time in full loop  (after):  0.081 s

(These times were calculated on a Windows machine, where lstat() is
slower than a similar Linux machine.)

Helped-by: Elijah Newren <newren@gmail.com>
Signed-off-by: Derrick Stolee <stolee@gmail.com>
---
 sparse-index.c | 114 ++++++++++++++++++++++++++++++++++++++-----------
 1 file changed, 90 insertions(+), 24 deletions(-)

diff --git a/sparse-index.c b/sparse-index.c
index 8577fa726b8..9913a6078cd 100644
--- a/sparse-index.c
+++ b/sparse-index.c
@@ -440,14 +440,21 @@ void ensure_correct_sparsity(struct index_state *istate)
 }

 struct path_found_data {
+	/**
+	 * The path stored in 'dir', if non-empty, corresponds to the most-
+	 * recent path that we checked where:
+	 *
+	 *   1. The path should be a directory, according to the index.
+	 *   2. The path does not exist.
+	 *   3. The parent path _does_ exist. (This may be the root of the
+	 *      working directory.)
+	 */
 	struct strbuf dir;
-	int dir_found;
 	size_t lstat_count;
 };

 #define PATH_FOUND_DATA_INIT { \
-	.dir = STRBUF_INIT, \
-	.dir_found = 1 \
+	.dir = STRBUF_INIT \
 }

 static void clear_path_found_data(struct path_found_data *data)
@@ -455,49 +462,108 @@ static void clear_path_found_data(struct path_found_data *data)
 	strbuf_release(&data->dir);
 }

+/**
+ * Return the length of the longest common substring that ends in a
+ * slash ('/') to indicate the longest common parent directory. Returns
+ * zero if no common directory exists.
+ */
+static size_t max_common_dir_prefix(const char *path1, const char *path2)
+{
+	size_t common_prefix = 0;
+	for (size_t i = 0; path1[i] && path2[i]; i++) {
+		if (path1[i] != path2[i])
+			break;
+
+		/*
+		 * If they agree at a directory separator, then add one
+		 * to make sure it is included in the common prefix string.
+		 */
+		if (path1[i] == '/')
+			common_prefix = i + 1;
+	}
+
+	return common_prefix;
+}
+
 static int path_found(const char *path, struct path_found_data *data)
 {
 	struct stat st;
-	char *newdir;
+	size_t common_prefix;

 	/*
-	 * If dirname corresponds to a directory that doesn't exist, and this
-	 * path starts with dirname, then path can't exist.
+	 * If data->dir is non-empty, then it contains a path that doesn't
+	 * exist, including an ending slash ('/'). If it is a prefix of 'path',
+	 * then we can return 0.
 	 */
-	if (!data->dir_found && !memcmp(path, data->dir.buf, data->dir.len))
+	if (data->dir.len && !memcmp(path, data->dir.buf, data->dir.len))
 		return 0;

 	/*
-	 * If path itself exists, return 1.
+	 * Otherwise, we must check if the current path exists. If it does, then
+	 * return 1. The cached directory will be skipped until we come across
+	 * a missing path again.
 	 */
 	data->lstat_count++;
 	if (!lstat(path, &st))
 		return 1;

 	/*
-	 * Otherwise, path does not exist so we'll return 0...but we'll first
-	 * determine some info about its parent directory so we can avoid
-	 * lstat calls for future cache entries.
+	 * At this point, we know that 'path' doesn't exist, and we know that
+	 * the parent directory of 'data->dir' does exist. Let's set 'data->dir'
+	 * to be the top-most non-existing directory of 'path'. If the first
+	 * parent of 'path' exists, then we will act as though 'path'
+	 * corresponds to a directory (by adding a slash).
 	 */
-	newdir = strrchr(path, '/');
-	if (!newdir)
-		return 0; /* Didn't find a parent dir; just return 0 now. */
+	common_prefix = max_common_dir_prefix(path, data->dir.buf);

 	/*
-	 * If path starts with directory (which we already lstat'ed and found),
-	 * then no need to lstat parent directory again.
+	 * At this point, 'path' and 'data->dir' have a common existing parent
+	 * directory given by path[0..common_prefix] (which could have length 0).
+	 * We "grow" the data->dir buffer by checking for existing directories
+	 * along 'path'.
 	 */
-	if (data->dir_found && data->dir.buf &&
-	    memcmp(path, data->dir.buf, data->dir.len))
-		return 0;

-	/* Free previous dirname, and cache path's dirname */
-	strbuf_reset(&data->dir);
-	strbuf_add(&data->dir, path, newdir - path + 1);
+	strbuf_setlen(&data->dir, common_prefix);
+	while (1) {
+		/* Find the next directory in 'path'. */
+		const char *rest = path + data->dir.len;
+		const char *next_slash = strchr(rest, '/');

-	data->lstat_count++;
-	data->dir_found = !lstat(data->dir.buf, &st);
+		/*
+		 * If there are no more slashes, then 'path' doesn't contain a
+		 * non-existent _parent_ directory. Set 'data->dir' to be equal
+		 * to 'path' plus an additional slash, so it can be used for
+		 * caching in the future. The filename of 'path' is considered
+		 * a non-existent directory.
+		 *
+		 * Note: if "{path}/" exists as a directory, then it will never
+		 * appear as a prefix of other callers to this method, assuming
+		 * the context from the clear_skip_worktree... methods. If this
+		 * method is reused, then this must be reconsidered.
+		 */
+		if (!next_slash) {
+			strbuf_addstr(&data->dir, rest);
+			strbuf_addch(&data->dir, '/');
+			break;
+		}

+		/*
+		 * Now that we have a slash, let's grow 'data->dir' to include
+		 * this slash, then test if we should stop.
+		 */
+		strbuf_add(&data->dir, rest, next_slash - rest + 1);
+
+		/* If the parent dir doesn't exist, then stop here. */
+		data->lstat_count++;
+		if (lstat(data->dir.buf, &st))
+			return 0;
+	}
+
+	/*
+	 * At this point, 'data->dir' is equal to 'path' plus a slash character,
+	 * and the parent directory of 'path' definitely exists. Moreover, we
+	 * know that 'path' doesn't exist, or we would have returned 1 earlier.
+	 */
 	return 0;
 }

-- 
gitgitgadget

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 0/5] sparse-index: improve clear_skip_worktree_from_present_files()
  2024-06-28 12:43   ` [PATCH v3 " Derrick Stolee via GitGitGadget
                       ` (4 preceding siblings ...)
  2024-06-28 12:43     ` [PATCH v3 5/5] sparse-index: improve lstat caching of sparse paths Derrick Stolee via GitGitGadget
@ 2024-06-28 15:07     ` Elijah Newren
  2024-06-28 19:34       ` Junio C Hamano
  5 siblings, 1 reply; 44+ messages in thread
From: Elijah Newren @ 2024-06-28 15:07 UTC (permalink / raw)
  To: Derrick Stolee via GitGitGadget; +Cc: git, gitster, anh, Derrick Stolee

On Fri, Jun 28, 2024 at 5:43 AM Derrick Stolee via GitGitGadget
<gitgitgadget@gmail.com> wrote:
>
> While doing some investigation in a private monorepo with sparse-checkout
> and a sparse index, I accidentally left a modified file outside of my
> sparse-checkout cone. This caused my Git commands to slow to a crawl, so I
> reran with GIT_TRACE2_PERF=1.
>
> While I was able to identify clear_skip_worktree_from_present_files() as the
> culprit, it took longer than desired to figure out what was going on. This
> series intends to both fix the performance issue (as much as possible) and
> do some refactoring to make it easier to understand what is happening.
>
> In the end, I was able to reduce the number of lstat() calls in my case from
> over 1.1 million to about 4,400, improving the time from 13.4s to 81ms on a
> warm disk cache. (These numbers are from a test after v2, which somehow hit
> the old caching algorithm even worse than my test in v1.)
>
>
> Updates in v3
> =============
>
>  * Removed the incorrect paragraph in the commit message of patch 1.
>  * Replaced "largest" with "longest" in the final patch.
>
> Thanks, Stolee
>
> Derrick Stolee (5):
>   sparse-checkout: refactor skip worktree retry logic
>   sparse-index: refactor path_found()
>   sparse-index: use strbuf in path_found()
>   sparse-index: count lstat() calls
>   sparse-index: improve lstat caching of sparse paths
>
>  sparse-index.c | 216 +++++++++++++++++++++++++++++++++++++------------
>  1 file changed, 164 insertions(+), 52 deletions(-)
>
>
> base-commit: 66ac6e4bcd111be3fa9c2a6b3fafea718d00678d
> Published-As: https://github.com/gitgitgadget/git/releases/tag/pr-1754%2Fderrickstolee%2Fclear-skip-speed-v3
> Fetch-It-Via: git fetch https://github.com/gitgitgadget/git pr-1754/derrickstolee/clear-skip-speed-v3
> Pull-Request: https://github.com/gitgitgadget/git/pull/1754
>
> Range-diff vs v2:
>
>  1:  93d0baed0b0 ! 1:  0844cda94cf sparse-checkout: refactor skip worktree retry logic
>      @@ Commit message
>           stored in the index, so caching was introduced in d79d299352 (Accelerate
>           clear_skip_worktree_from_present_files() by caching, 2022-01-14).
>
>      -    If users are having trouble with the performance of this operation and
>      -    don't care about paths outside of the sparse-checkout, they can disable
>      -    them using the sparse.expectFilesOutsideOfPatterns config option
>      -    introduced in ecc7c8841d (repo_read_index: add config to expect files
>      -    outside sparse patterns, 2022-02-25).
>      -
>           This check is particularly confusing in the presence of a sparse index,
>           as a sparse tree entry corresponding to an existing directory must first
>           be expanded to a full index before examining the paths within. This is
>  2:  69c3beaabf7 = 2:  c242e2c9168 sparse-index: refactor path_found()
>  3:  0a82e6b4183 = 3:  ad63bf746ca sparse-index: use strbuf in path_found()
>  4:  9549f5b8062 = 4:  db6ded0df0d sparse-index: count lstat() calls
>  5:  0cb344ac14f ! 5:  1f58e19691f sparse-index: improve lstat caching of sparse paths
>      @@ sparse-index.c: static void clear_path_found_data(struct path_found_data *data)
>        }
>
>       +/**
>      -+ * Return the length of the largest common substring that ends in a
>      -+ * slash ('/') to indicate the largest common parent directory. Returns
>      ++ * Return the length of the longest common substring that ends in a
>      ++ * slash ('/') to indicate the longest common parent directory. Returns
>       + * zero if no common directory exists.
>       + */
>       +static size_t max_common_dir_prefix(const char *path1, const char *path2)
>
> --
> gitgitgadget

This version covers the last two outstanding items.

Reviewed-by: Elijah Newren <newren@gmail.com>

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [PATCH v3 0/5] sparse-index: improve clear_skip_worktree_from_present_files()
  2024-06-28 15:07     ` [PATCH v3 0/5] sparse-index: improve clear_skip_worktree_from_present_files() Elijah Newren
@ 2024-06-28 19:34       ` Junio C Hamano
  0 siblings, 0 replies; 44+ messages in thread
From: Junio C Hamano @ 2024-06-28 19:34 UTC (permalink / raw)
  To: Elijah Newren; +Cc: Derrick Stolee via GitGitGadget, git, anh, Derrick Stolee

Elijah Newren <newren@gmail.com> writes:

> This version covers the last two outstanding items.
>
> Reviewed-by: Elijah Newren <newren@gmail.com>

Thanks both.  Will mark it for 'next'.


^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2024-06-28 19:34 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-20 16:11 [PATCH 0/5] sparse-index: improve clear_skip_worktree_from_present_files() Derrick Stolee via GitGitGadget
2024-06-20 16:11 ` [PATCH 1/5] sparse-index: refactor skip worktree retry logic Derrick Stolee via GitGitGadget
2024-06-24 22:12   ` Elijah Newren
2024-06-26 12:42     ` Derrick Stolee
2024-06-20 16:11 ` [PATCH 2/5] sparse-index: refactor path_found() Derrick Stolee via GitGitGadget
2024-06-24 22:13   ` Elijah Newren
2024-06-26 12:43     ` Derrick Stolee
2024-06-20 16:11 ` [PATCH 3/5] sparse-index: use strbuf in path_found() Derrick Stolee via GitGitGadget
2024-06-24 22:13   ` Elijah Newren
2024-06-20 16:11 ` [PATCH 4/5] sparse-index: count lstat() calls Derrick Stolee via GitGitGadget
2024-06-24 22:13   ` Elijah Newren
2024-06-20 16:11 ` [PATCH 5/5] sparse-index: improve lstat caching of sparse paths Derrick Stolee via GitGitGadget
2024-06-24 22:14   ` Elijah Newren
2024-06-25  0:08     ` Junio C Hamano
2024-06-26 13:06     ` Derrick Stolee
2024-06-28  0:10       ` Elijah Newren
2024-06-20 19:16 ` [PATCH 0/5] sparse-index: improve clear_skip_worktree_from_present_files() Junio C Hamano
2024-06-20 20:21   ` Derrick Stolee
2024-06-20 21:02     ` Junio C Hamano
2024-06-26 14:29 ` [PATCH v2 " Derrick Stolee via GitGitGadget
2024-06-26 14:29   ` [PATCH v2 1/5] sparse-checkout: refactor skip worktree retry logic Derrick Stolee via GitGitGadget
2024-06-27 20:59     ` Junio C Hamano
2024-06-28  0:51       ` Elijah Newren
2024-06-28  1:49         ` Derrick Stolee
2024-06-28  5:50         ` Junio C Hamano
2024-06-28  0:31     ` Elijah Newren
2024-06-28  1:56       ` Derrick Stolee
2024-06-26 14:29   ` [PATCH v2 2/5] sparse-index: refactor path_found() Derrick Stolee via GitGitGadget
2024-06-26 14:29   ` [PATCH v2 3/5] sparse-index: use strbuf in path_found() Derrick Stolee via GitGitGadget
2024-06-26 14:29   ` [PATCH v2 4/5] sparse-index: count lstat() calls Derrick Stolee via GitGitGadget
2024-06-26 14:29   ` [PATCH v2 5/5] sparse-index: improve lstat caching of sparse paths Derrick Stolee via GitGitGadget
2024-06-27 21:14     ` Junio C Hamano
2024-06-28  1:56       ` Derrick Stolee
2024-06-27 21:46   ` [PATCH v2 0/5] sparse-index: improve clear_skip_worktree_from_present_files() Junio C Hamano
2024-06-28  0:59     ` Elijah Newren
2024-06-28  1:57       ` Derrick Stolee
2024-06-28 12:43   ` [PATCH v3 " Derrick Stolee via GitGitGadget
2024-06-28 12:43     ` [PATCH v3 1/5] sparse-checkout: refactor skip worktree retry logic Derrick Stolee via GitGitGadget
2024-06-28 12:43     ` [PATCH v3 2/5] sparse-index: refactor path_found() Derrick Stolee via GitGitGadget
2024-06-28 12:43     ` [PATCH v3 3/5] sparse-index: use strbuf in path_found() Derrick Stolee via GitGitGadget
2024-06-28 12:43     ` [PATCH v3 4/5] sparse-index: count lstat() calls Derrick Stolee via GitGitGadget
2024-06-28 12:43     ` [PATCH v3 5/5] sparse-index: improve lstat caching of sparse paths Derrick Stolee via GitGitGadget
2024-06-28 15:07     ` [PATCH v3 0/5] sparse-index: improve clear_skip_worktree_from_present_files() Elijah Newren
2024-06-28 19:34       ` Junio C Hamano

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).