From: Derrick Stolee <stolee@gmail.com>
To: Stefan Beller <sbeller@google.com>,
	Derrick Stolee <dstolee@microsoft.com>
Cc: git <git@vger.kernel.org>,
	"Ævar Arnfjörð Bjarmason" <avarab@gmail.com>,
	"Lars Schneider" <larsxschneider@gmail.com>
Subject: Re: [PATCH 3/3] commit-graph: lazy-load trees
Date: Tue, 3 Apr 2018 14:22:23 -0400
Message-ID: <aa3340e8-dbd9-7cf7-1711-a6f675bf6b8c@gmail.com>
In-Reply-To: <CAGZ79kZ0XZRiKcJG-5Ckd=XjE-3GfGHkNuyu4590OyfGPve4Rg@mail.gmail.com>

On 4/3/2018 2:00 PM, Stefan Beller wrote:
> On Tue, Apr 3, 2018 at 5:00 AM, Derrick Stolee <dstolee@microsoft.com> wrote:
>> The commit-graph file provides quick access to commit data, including
>> the OID of the root tree for each commit in the graph. When performing
>> a deep commit-graph walk, we may not need to load most of the trees
>> for these commits.
>>
>> Delay loading the tree object for a commit loaded from the graph
>> until requested via get_commit_tree(). Do not lazy-load trees for
>> commits not in the graph, since that requires duplicate parsing
>> and the relative performance improvement when trees are not needed
>> is small.
>>
>> On the Linux repository, performance tests were run for the following
>> command:
>>
>>          git log --graph --oneline -1000
>>
>> Before: 0.83s
>> After:  0.65s
>> Rel %: -21.6%
> This is an awesome speedup.
>
>> Adding '-- kernel/' to the command requires loading the root tree
>> for every commit that is walked.
> and as the walk prunes those commits that do not touch kernel/
> which from my quick glance is the real core thing. Linus' announcements
> claim that > 50% is drivers, networking and documentation[1].
> So the "-- kernel/" walk needs to walk twice as many commits to find
> a thousand commits that actually touch kernel/ ?
>
> [1] http://lkml.iu.edu/hypermail/linux/kernel/1801.3/02794.html
> http://lkml.iu.edu/hypermail/linux/kernel/1803.3/00580.html
>
>> There was no measurable performance
>> change as a result of this patch.
> ... which means that the walking itself is really fast now and the
> dominating effects are setup and checking the tree?

Yeah. I was concerned that, because we now make two accesses into the
commit-graph file, we could measurably slow down cases where we need
to load the trees. That is not an issue: we will likely parse the
tree after loading it, and parsing is much slower than these
commit-graph accesses.


> Is git smart enough to not load the root tree for "log -- ./" or
> would we get the desired performance numbers from that?

I wonder, since it only really needs the OID of the root tree to 
determine TREESAME. If it cares about following TREESAME relationships 
on ./, then it should do that.
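
To make the "only needs the OID" point concrete, here is a minimal
sketch, not the actual revision-walk code. It assumes a 20-byte hash,
and the names object_id_sketch and root_trees_same() are hypothetical.
Two root trees with equal OIDs have identical contents, so the
comparison never has to load or parse either tree:

```c
#include <assert.h>
#include <string.h>

#define HASH_LEN 20 /* assumption: SHA-1 sized object IDs */

/* Hypothetical stand-in for git's struct object_id. */
struct object_id_sketch {
	unsigned char hash[HASH_LEN];
};

/*
 * Return 1 if both root trees have the same OID. Equal OIDs imply
 * identical tree contents, so no tree object is read here.
 */
static int root_trees_same(const struct object_id_sketch *a,
			   const struct object_id_sketch *b)
{
	return memcmp(a->hash, b->hash, HASH_LEN) == 0;
}
```

Whether the pathspec machinery would actually take this shortcut for
"./" is exactly the open question above; the sketch only shows that
the check itself is cheap.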

>
>> @@ -317,6 +315,27 @@ int parse_commit_in_graph(struct commit *item)
>>          return 0;
>>   }
>>
>> +static struct tree *load_tree_for_commit(struct commit_graph *g, struct commit *c)
>> +{
>> +       struct object_id oid;
>> +       const unsigned char *commit_data = g->chunk_commit_data + (g->hash_len + 16) * (c->graph_pos);
> What is 16? (I imagine it is the "length of the row" - g->hash_len ?)
> Would it make sense to have a constant/define for an entire row instead?
> (By any chance what is the meaning of GRAPH_DATA_WIDTH, which is 36?
> That is defined but never used.)

Yeah, I should use GRAPH_DATA_WIDTH here instead.
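
Roughly what I have in mind, as a standalone sketch rather than the
final patch. It assumes a 20-byte hash so that the row width matches
the 36 mentioned above; commit_data_row() and read_root_tree_oid() are
hypothetical helper names, and the plain memcpy() stands in for
whatever hash-copy helper the real code uses:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define HASH_LEN 20         /* assumption: SHA-1 sized object IDs */
#define GRAPH_DATA_WIDTH 36 /* one commit-data row: HASH_LEN + 16 */

/* Return a pointer to the commit-data row for the commit at 'pos'. */
static const unsigned char *commit_data_row(const unsigned char *chunk,
					    uint32_t pos)
{
	return chunk + (size_t)GRAPH_DATA_WIDTH * pos;
}

/* The root tree OID occupies the first HASH_LEN bytes of the row. */
static void read_root_tree_oid(const unsigned char *chunk, uint32_t pos,
			       unsigned char out[HASH_LEN])
{
	memcpy(out, commit_data_row(chunk, pos), HASH_LEN);
}
```

Using the named width keeps the magic 16 out of the pointer arithmetic
and puts the row layout in one place.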

>
>> +struct tree *get_commit_tree_in_graph(const struct commit *c)
>> +{
>> +       if (c->tree)
>> +               return c->tree;
> This double checking is defensive programming, in case someone
> doesn't check themselves (as get_commit_tree does below).
>
> ok.
>
>> @@ -17,6 +17,13 @@ char *get_commit_graph_filename(const char *obj_dir);
>>    */
>>   int parse_commit_in_graph(struct commit *item);
>>
>> +/*
>> + * For performance reasons, a commit loaded from the graph does not
>> + * have a tree loaded until trying to consume it for the first time.
> That is the theme of this series/patch, but do we need to write it down
> into the codebase? I'd be inclined to omit this part and only go with:
>
>    Load the root tree of a commit and return the tree.

OK.

>
>>   struct tree *get_commit_tree(const struct commit *commit)
>>   {
>> -       return commit->tree;
>> +       if (commit->tree || !commit->object.parsed)
> I understand to return the tree from the commit
> when we have the tree in the commit object (the first
> part).
>
> But 'when we have not (yet) parsed the commit object',
> we also just return its tree? Could you explain the
> second part of the condition?
> Is that for commits that are not part of the commit graph?
> (But then why does it need to be negated?)

Some callers check the value of 'commit->tree' without a guarantee that 
the commit was parsed. In this case, the way to preserve the existing 
behavior is to continue returning NULL. If I remove the "|| 
!commit->object.parsed" then the BUG("commit has NULL tree, but was not 
loaded from commit-graph") is hit in these two tests:

t6012-rev-list-simplify.sh
t6110-rev-list-sparse.sh

I prefer to keep the BUG() statement and instead use this if statement. 
If someone has more clarity on why this is a good existing behavior, 
then please chime in.
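
For anyone following along, the control flow I am describing looks
roughly like the sketch below. It is a toy model, not the patch
itself: the structs are trimmed down, the accessor takes a non-const
pointer for simplicity, and graph_trees, graph_loads,
load_tree_from_graph() and the COMMIT_NOT_FROM_GRAPH sentinel are
hypothetical stand-ins. Where the real code calls BUG(), the sketch
returns NULL so that it stays testable:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define COMMIT_NOT_FROM_GRAPH UINT32_MAX /* hypothetical sentinel */

struct tree { int id; };

struct commit {
	int parsed;         /* has the commit been parsed at all? */
	uint32_t graph_pos; /* row in the graph, or the sentinel */
	struct tree *tree;  /* NULL until loaded */
};

/* Toy stand-in for tree lookups backed by the commit-graph file. */
static struct tree graph_trees[4] = { {0}, {1}, {2}, {3} };
static int graph_loads; /* counts how often we hit the "graph" */

static struct tree *load_tree_from_graph(struct commit *c)
{
	graph_loads++;
	c->tree = &graph_trees[c->graph_pos];
	return c->tree;
}

static struct tree *get_commit_tree_sketch(struct commit *c)
{
	/*
	 * Already loaded, or never parsed: return what is there.
	 * Unparsed commits keep yielding NULL, as before the patch.
	 */
	if (c->tree || !c->parsed)
		return c->tree;

	/* Parsed, no tree, not from the graph: the real code BUG()s. */
	if (c->graph_pos == COMMIT_NOT_FROM_GRAPH)
		return NULL;

	/* Parsed from the graph: load the tree on first use. */
	return load_tree_from_graph(c);
}
```

The first check is the one under discussion: dropping the "!parsed"
half sends unparsed commits down to the second check, which is where
the two tests above trip the BUG().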

Thanks,
-Stolee
