From: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
To: Victoria Dye <vdye@github.com>, git@vger.kernel.org
Cc: derrickstolee@github.com
Subject: Re: [PATCH v3 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse
Date: Fri, 2 Sep 2022 11:47:08 -0700 [thread overview]
Message-ID: <754c64ff-c81f-cd67-d303-6c4876cbd5a0@gmail.com> (raw)
In-Reply-To: <4b65d7dc-e711-43a6-8763-62be79a3e4a9@github.com>
On 9/1/2022 8:28 PM, Victoria Dye wrote:
> Shaoxuan Yuan wrote:
>> Before this patch, whenever --sparse is used, `git-grep` utilizes the
>> ensure_full_index() method to expand the index and search all the
>> entries. Because this method requires walking all the trees and
>> constructing the index, it is the slow part within the whole command.
>>
>> To achieve better performance, this patch uses grep_tree() to search the
>> sparse directory entries and get rid of the ensure_full_index() method.
>>
>> Why grep_tree() is a better choice over ensure_full_index()?
>>
>> 1) grep_tree() is as correct as ensure_full_index(). grep_tree() looks
>> into every sparse-directory entry (represented by a tree) recursively
>> when looping over the index, and the result of doing so matches the
>> result of expanding the index.
>>
>> 2) grep_tree() utilizes pathspecs to limit the scope of searching.
>> ensure_full_index() always expands the index when --sparse is used,
>> that means it will always walk all the trees and blobs in the repo
>> without caring if the user only wants a subset of the content, i.e.
>> using a pathspec. On the other hand, grep_tree() will only search
>> the contents that match the pathspec, and thus possibly walking fewer
>> trees.
>>
>> 3) grep_tree() does not construct and copy back a new index, while
>> ensure_full_index() does. This also saves some time.
>
> Would you mind adding some 'ensure_not_expanded' cases to 't1092' to codify
> this (probably in the 'grep is not expanded' test created in patch 2)? If
> I'm understanding this patch correctly, you've updated 'git grep' so that it
> *never* needs to expand the index. In that case, it would be good to
> exercise a bunch of 'git grep' options (pathspecs inside and outside the
> sparse cone, wildcard pathspecs, etc.) to confirm that.
Sure!
>>
>> ----------------
>> Performance test
>>
>> - Summary:
>>
>> p2000 tests demonstrate a ~91% execution time reduction for
>> `git grep --cached --sparse <pattern> -- <pathspec>` using tree-walking
>> logic.
>>
>> Test HEAD~ HEAD
>> ---------------------------------------------------------------------------------------------------
>> 2000.78: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (full-v3) 0.11 0.09 (≈)
>> 2000.79: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (full-v4) 0.08 0.09 (≈)
>> 2000.80: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (sparse-v3) 0.44 0.04 (-90.9%)
>> 2000.81: git grep --cached --sparse bogus -- f2/f1/f1/builtin/* (sparse-v4) 0.46 0.04 (-91.3%)
>
> These are fantastic results!
>
>>
>> - Command used for testing:
>>
>> git grep --cached --sparse bogus -- f2/f1/f1/builtin/*
>>
>> The reason for specifying a pathspec is that, if we don't specify a
>> pathspec, then grep_tree() will walk all the trees and blobs to find the
>> pattern, and the time consumed doing so is not too different from using
>> the original ensure_full_index() method, which also spends most of the
>> time walking trees. However, when a pathspec is specified, this latest
>> logic will only walk the area of trees enclosed by the pathspec, and the
>> time consumed is reasonably a lot less.
>>
>> That is, if we don't specify a pathspec, the performance difference [1]
>> is quite small: both methods walk all the trees and take generally same
>> amount of time (even with the index construction time included for
>> ensure_full_index()).
>
> This makes sense, thanks for the thorough explanation of the results.
>
>>
>> [1] Performance test result without pathspec:
>>
>> Test HEAD~ HEAD
>> -----------------------------------------------------------------------------
>> 2000.78: git grep --cached --sparse bogus (full-v3) 6.17 5.19 (≈)
>> 2000.79: git grep --cached --sparse bogus (full-v4) 6.19 5.46 (≈)
>> 2000.80: git grep --cached --sparse bogus (sparse-v3) 6.57 6.44 (≈)
>> 2000.81: git grep --cached --sparse bogus (sparse-v4) 6.65 6.28 (≈)
>>
>> Suggested-by: Derrick Stolee <derrickstolee@github.com>
>> Helped-by: Derrick Stolee <derrickstolee@github.com>
>> Helped-by: Victoria Dye <vdye@github.com>
>> Signed-off-by: Shaoxuan Yuan <shaoxuan.yuan02@gmail.com>
>> ---
>> builtin/grep.c | 32 ++++++++++++++++++++++++++-----
>> t/perf/p2000-sparse-operations.sh | 1 +
>> 2 files changed, 28 insertions(+), 5 deletions(-)
>>
>> diff --git a/builtin/grep.c b/builtin/grep.c
>> index a0b4dbc1dc..8c0edccd8e 100644
>> --- a/builtin/grep.c
>> +++ b/builtin/grep.c
>> @@ -522,9 +522,6 @@ static int grep_cache(struct grep_opt *opt,
>> if (repo_read_index(repo) < 0)
>> die(_("index file corrupt"));
>>
>> - if (grep_sparse)
>> - ensure_full_index(repo->index);
>> -
>> for (nr = 0; nr < repo->index->cache_nr; nr++) {
>> const struct cache_entry *ce = repo->index->cache[nr];
>>
>> @@ -537,8 +534,26 @@ static int grep_cache(struct grep_opt *opt,
>>
>> strbuf_setlen(&name, name_base_len);
>> strbuf_addstr(&name, ce->name);
>> + if (S_ISSPARSEDIR(ce->ce_mode)) {
>> + enum object_type type;
>> + struct tree_desc tree;
>> + void *data;
>> + unsigned long size;
>> + struct strbuf base = STRBUF_INIT;
>> +
>> + strbuf_addstr(&base, ce->name);
>> +
>> + data = read_object_file(&ce->oid, &type, &size);
>> + init_tree_desc(&tree, data, size);
>>
>> - if (S_ISREG(ce->ce_mode) &&
>> + /*
>> + * sneak in the ce_mode using check_attr parameter
>> + */
>> + hit |= grep_tree(opt, pathspec, &tree, &base,
>> + base.len, ce->ce_mode);
>> + strbuf_release(&base);
>> + free(data);
>> + } else if (S_ISREG(ce->ce_mode) &&
>> match_pathspec(repo->index, pathspec, name.buf, name.len, 0, NULL,
>> S_ISDIR(ce->ce_mode) ||
>> S_ISGITLINK(ce->ce_mode))) {
>> @@ -598,7 +613,14 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
>> int te_len = tree_entry_len(&entry);
>>
>> if (match != all_entries_interesting) {
>> - strbuf_addstr(&name, base->buf + tn_len);
>> + if (S_ISSPARSEDIR(check_attr)) {
>> + // object is a sparse directory entry
>> + strbuf_addbuf(&name, base);
>> + } else {
>> + // object is a commit or a root tree
>> + strbuf_addstr(&name, base->buf + tn_len);
>> + }
>
> Hmm, I'm not entirely sure I follow what's going on with 'name'. I'll try to
> talk myself through it.
>
> Stepping back a bit in the context of 'grep_tree()': the goal of the
> function is, given a tree descriptor 'tree', to recursively scan the tree to
> find any 'grep' matches within items matching 'pathspec'. It is also called
> with a strbuf 'base', a length 'tn_len', and a boolean 'check_attr'; it's
> not immediately clear to me what those args are or what they do. What I can
I was confused for quite a while about the meaning of these args, too.
I think 'base' is the object's ref or SHA, e.g. HEAD, HEAD~, or a <SHA>.
Before this patch, the object was expected to be a root tree or a
commit. I _think_ 'base' can also be "<submodule>/", e.g. "sub/" when
grepping a submodule.
'tn_len' stands for "tree_name_len"?
'check_attr', as you wrote below, is for "commit or not", at lease that
was all its use case before this patch.
> see is that:
>
> - 'check_attr' is true iff the "tree" being grepped is actually a commit.
I think this is correct. Though as Derrick Stolee said here [1], this
patch is abusing the 'check_attr' (passing 'ce_mode' through it), and if
that caused any confusions, my apologies.
[1]
https://lore.kernel.org/git/e74b326d-ce4a-31c3-5424-e35858cdb569@github.com
> - both non-recursive callers ('grep_object()' and 'grep_submodule()') call
> 'grep_tree()' with 'tn_len == base.len'.
>
> Stepping into 'grep_tree()', we iterate over the entries *inside of* 'tree'.
> We assign the length of the tree entry's path to 'te_len'. Notably, a tree
> entry's path *not* the path from the root of the repo to the entry - it's
> just the filename of the entry (e.g., for entry 'folder1/a', the path is
> 'a').
Yes.
> Next, we skip the first 'tn_len' characters of 'base->buf' and assign that
> value to 'name'. Because 'tn_len == base.len', for this first iteration,
> it's an empty string. Then, we check if the tree entry is interesting with
> path 'name'. But 'name' is an empty string, so 'tree_entry_interesting()'
> thinks the tree entry is at the root of the repository, even if it isn't!
Yes, that is the reason why it kept ignoring sub-root-level trees: the
pathspec can never match a tree that is not at root level if this
root-level assumption exists.
> At this point, I think I've figured out what the deal with 'base' is. Before
> this patch, only 'grep_object()' and 'grep_submodule()'. In the former case,
> it's either "<objectname>:", or empty; in the latter, it's the path to the
> submodule. Both of those are things you'd want to skip to get the correct
Yep, this resonates with my reply above!
> path to the tree entry for 'tree_entry_interesting()', but it isn't true in
> your case; you need the path from the repository root to your tree for
> 'tree_entry_interesting()' to work properly.
Well said! I think this phrasing is very accurate.
> Based on all of that, I *think* you can drop the 'check_attr' changes to
> 'grep_tree()' and update how you provide 'base' and 'tn_len' so
> 1) 'base' is the path to the tree root, and 2) 'tn_len' is 0 so that full
> path is provided to 'tree_entry_interesting()':
>
> ----->8----->8----->8----->8----->8----->8----->8----->8----->8----->8-----
> diff --git a/builtin/grep.c b/builtin/grep.c
> index 8c0edccd8e..85c83190f1 100644
> --- a/builtin/grep.c
> +++ b/builtin/grep.c
> @@ -546,11 +546,7 @@ static int grep_cache(struct grep_opt *opt,
> data = read_object_file(&ce->oid, &type, &size);
> init_tree_desc(&tree, data, size);
>
> - /*
> - * sneak in the ce_mode using check_attr parameter
> - */
> - hit |= grep_tree(opt, pathspec, &tree, &base,
> - base.len, ce->ce_mode);
> + hit |= grep_tree(opt, pathspec, &tree, &base, 0, 0);
> strbuf_release(&base);
> free(data);
> } else if (S_ISREG(ce->ce_mode) &&
> @@ -613,14 +609,6 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
> int te_len = tree_entry_len(&entry);
>
> if (match != all_entries_interesting) {
> - if (S_ISSPARSEDIR(check_attr)) {
> - // object is a sparse directory entry
> - strbuf_addbuf(&name, base);
> - } else {
> - // object is a commit or a root tree
> - strbuf_addstr(&name, base->buf + tn_len);
> - }
> -
> match = tree_entry_interesting(repo->index,
> &entry, &name,
> 0, pathspec);
> -----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----8<-----
Thank you for this diff! I think this is also what Derrick suggested in
his review. In fact, this approach is more to the root of the problem:
the expected format of the path/base.
> I still find all of this confusing, and it's possible I'm still not properly
> understanding how 'name' and 'tn_len' are supposed to be used. Regardless, I
> *am* fairly certain that finding the right values for those args is the
> going to be the cleanest (and least fragile) way to handle sparse
> directories, rather than using the 'check_attr' arg for something it isn't.
Right.
> It might take some time + lots of debugging/experimenting, but it's really
> important that the implementation you settle on is something you (and,
> ideally, the readers of your patches) confidently and completely understand,
> rather than something that seems to work but doesn't have a clear
> explanation. As always, I'm happy to help if you'd like another set of eyes
> on the problem!
Right. I admit that the approach I was taking is pretty shady. The way
suggested by you and Derrick is more explainable and to-the-point.
Lesson learned!
Thanks,
Shaoxuan
next prev parent reply other threads:[~2022-09-02 18:47 UTC|newest]
Thread overview: 69+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-08-17 7:56 [PATCH v1 0/2] grep: integrate with sparse index Shaoxuan Yuan
2022-08-17 7:56 ` [PATCH v1 1/2] builtin/grep.c: add --sparse option Shaoxuan Yuan
2022-08-17 14:12 ` Derrick Stolee
2022-08-17 17:13 ` Junio C Hamano
2022-08-17 17:34 ` Victoria Dye
2022-08-17 17:43 ` Derrick Stolee
2022-08-17 18:47 ` Junio C Hamano
2022-08-17 17:37 ` Elijah Newren
2022-08-24 18:20 ` Shaoxuan Yuan
2022-08-24 19:08 ` Derrick Stolee
2022-08-17 7:56 ` [PATCH v1 2/2] builtin/grep.c: integrate with sparse index Shaoxuan Yuan
2022-08-17 14:23 ` Derrick Stolee
2022-08-24 21:06 ` Shaoxuan Yuan
2022-08-25 0:39 ` Derrick Stolee
2022-08-17 13:46 ` [PATCH v1 0/2] grep: " Derrick Stolee
2022-08-29 23:28 ` [PATCH v2 " Shaoxuan Yuan
2022-08-29 23:28 ` [PATCH v2 1/2] builtin/grep.c: add --sparse option Shaoxuan Yuan
2022-08-29 23:28 ` [PATCH v2 2/2] builtin/grep.c: integrate with sparse index Shaoxuan Yuan
2022-08-30 13:45 ` Derrick Stolee
2022-09-01 4:57 ` [PATCH v3 0/3] grep: " Shaoxuan Yuan
2022-09-01 4:57 ` [PATCH v3 1/3] builtin/grep.c: add --sparse option Shaoxuan Yuan
2022-09-01 4:57 ` [PATCH v3 2/3] builtin/grep.c: integrate with sparse index Shaoxuan Yuan
2022-09-01 4:57 ` [PATCH v3 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse Shaoxuan Yuan
2022-09-01 17:03 ` Derrick Stolee
2022-09-01 18:31 ` Shaoxuan Yuan
2022-09-01 17:17 ` Junio C Hamano
2022-09-01 17:27 ` Junio C Hamano
2022-09-01 22:49 ` Shaoxuan Yuan
2022-09-01 22:36 ` Shaoxuan Yuan
2022-09-02 3:28 ` Victoria Dye
2022-09-02 18:47 ` Shaoxuan Yuan [this message]
2022-09-03 0:36 ` [PATCH v4 0/3] grep: integrate with sparse index Shaoxuan Yuan
2022-09-03 0:36 ` [PATCH v4 1/3] builtin/grep.c: add --sparse option Shaoxuan Yuan
2022-09-03 0:36 ` [PATCH v4 2/3] builtin/grep.c: integrate with sparse index Shaoxuan Yuan
2022-09-03 0:36 ` [PATCH v4 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse Shaoxuan Yuan
2022-09-03 4:39 ` Junio C Hamano
2022-09-08 0:24 ` Shaoxuan Yuan
2022-09-08 0:18 ` [PATCH v5 0/3] grep: integrate with sparse index Shaoxuan Yuan
2022-09-08 0:18 ` [PATCH v5 1/3] builtin/grep.c: add --sparse option Shaoxuan Yuan
2022-09-10 1:07 ` Victoria Dye
2022-09-14 6:08 ` Elijah Newren
2022-09-15 2:57 ` Junio C Hamano
2022-09-18 2:14 ` Elijah Newren
2022-09-18 19:52 ` Victoria Dye
2022-09-19 1:23 ` Junio C Hamano
2022-09-19 4:27 ` Shaoxuan Yuan
2022-09-19 11:03 ` Ævar Arnfjörð Bjarmason
2022-09-20 7:13 ` Elijah Newren
2022-09-17 3:34 ` Shaoxuan Yuan
2022-09-18 4:24 ` Elijah Newren
2022-09-19 4:13 ` Shaoxuan Yuan
2022-09-17 3:45 ` Shaoxuan Yuan
2022-09-08 0:18 ` [PATCH v5 2/3] builtin/grep.c: integrate with sparse index Shaoxuan Yuan
2022-09-08 0:18 ` [PATCH v5 3/3] builtin/grep.c: walking tree instead of expanding index with --sparse Shaoxuan Yuan
2022-09-08 17:59 ` Junio C Hamano
2022-09-08 20:46 ` Derrick Stolee
2022-09-08 20:56 ` Junio C Hamano
2022-09-08 21:06 ` Shaoxuan Yuan
2022-09-09 12:49 ` Derrick Stolee
2022-09-13 17:23 ` Junio C Hamano
2022-09-10 2:04 ` Victoria Dye
2022-09-23 4:18 ` [PATCH v6 0/1] grep: integrate with sparse index Shaoxuan Yuan
2022-09-23 4:18 ` [PATCH v6 1/1] builtin/grep.c: " Shaoxuan Yuan
2022-09-23 16:40 ` Junio C Hamano
2022-09-23 16:58 ` Junio C Hamano
2022-09-26 17:28 ` Junio C Hamano
2022-09-23 14:13 ` [PATCH v6 0/1] grep: " Derrick Stolee
2022-09-23 16:01 ` Victoria Dye
2022-09-23 17:08 ` Junio C Hamano
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=754c64ff-c81f-cd67-d303-6c4876cbd5a0@gmail.com \
--to=shaoxuan.yuan02@gmail.com \
--cc=derrickstolee@github.com \
--cc=git@vger.kernel.org \
--cc=vdye@github.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).