From: Derrick Stolee <stolee@gmail.com>
To: Jeff King <peff@peff.net>, Derrick Stolee <dstolee@microsoft.com>
Cc: git@vger.kernel.org, avarab@gmail.com, sbeller@google.com,
larsxschneider@gmail.com
Subject: Re: [PATCH 0/3] Lazy-load trees when reading commit-graph
Date: Tue, 3 Apr 2018 09:14:50 -0400 [thread overview]
Message-ID: <007c630a-0df8-803c-b637-fb5ccefa7b7c@gmail.com> (raw)
In-Reply-To: <20180403130615.GA18824@sigill.intra.peff.net>
On 4/3/2018 9:06 AM, Jeff King wrote:
> On Tue, Apr 03, 2018 at 08:00:54AM -0400, Derrick Stolee wrote:
>
>> There are several commit-graph walks that require loading many commits
>> but never walk the trees reachable from those commits. However, the
>> current logic in parse_commit() requires the root tree to be loaded.
>> This only uses lookup_tree(), but when reading commits from the commit-
>> graph file, the hashcpy() to load the root tree hash and the time spent
>> checking the object cache take more time than parsing the rest of the
>> commit.
>>
>> In this patch series, all direct references to accessing the 'tree'
>> member of struct commit are replaced instead by one of the following
>> methods:
>>
>> struct tree *get_commit_tree(struct commit *)
>> struct object_id *get_commit_tree_oid(struct commit *)
> This seems like a pretty sane thing to do. There may still be a few
> parts of the code, notably fsck, that are reliant on a "struct object"
> having been instantiated to determine whether an object is "used". I
> tried to clean that up around the time of c2d17b3b6e (fsck: check
> HAS_OBJ more consistently, 2017-01-16), but I won't be surprised if
> there's still some hidden assumptions.
>
> I peeked at the fsck.c parts of patch 2, and it looks like we may be OK,
> since you use get_commit_tree() during the walk.
>
>> This replacement was assisted by a Coccinelle script, but the 'tree'
>> member is overloaded in other types, so the script gave false-positives
>> that were removed from the diff.
> That catches any existing in-tree callers, but not any topics in flight.
> We could add the Coccinelle script and re-run it to catch any future
> ones. But perhaps simpler still: can we rename the "tree" member to
> "maybe_tree" or something, since nobody should be accessing it directly?
> That will give us a compile error if an older topic is merged.
That's a good idea. I can reorg in v2 to rename 'tree' to 'maybe_tree'
(and add an explicit comment that 'maybe_tree' could be NULL, so don't
reference it directly). The check that the rename patch is correct is
simply "does it compile?" Then Coccinelle could do the transfer of
"c->maybe_tree" to "get_commit_tree(c)" safely, and can be added to the
scripts.
I guess one caveat is to not include the mutators (c->maybe_tree = ...),
but that's probably something Coccinelle can do.
>
>> On the Linux repository, performance tests were run for the following
>> command:
>>
>> git log --graph --oneline -1000
>>
>> Before: 0.83s
>> After: 0.65s
>> Rel %: -21.6%
> Neat.
>
> Not strictly related, but I happened across another thing today that
> might be worth investigating here. We allocate "struct commit" in these
> nice big allocation blocks. But then for every commit we allocate at
> least one "struct commit_list" to store the parent pointer.
>
> I was looking at this from a memory consumption angle (I found a
> pathological repository full of single-parent commits, and this wastes
> an extra 16 bytes plus malloc overhead for every 64-byte "struct
> commit").
>
> But I wonder if it could also have non-negligible overhead in calling
> malloc() for your case, since you've optimized out so much of the rest
> of the work.
That is a pattern I'm seeing: we strip out one bit and something else
shows up as a hot spot. This game of whack-a-mole is quite productive.
> I'm not sure what the exact solution would be, but I imagine something
> like variable-sized "struct commit"s with the parent pointers embedded,
> with some kind of flag to indicate the number of parents (and probably
> some fallback to break out to a linked list for extreme cases of more
> than 2 parents). It may end up pretty invasive, though, as there's a
> lot of open-coded traversals of that parent list.
>
> Anyway, not anything to do with this patch, but food for thought as you
> micro-optimize these traversals.
One other thing that I've been thinking about is that 'struct commit' is
so much bigger than the other structs in 'union any_object'. This means
that the object cache, which I think creates blocks of 'union
any_object' for memory-alignment reasons, is overly bloated. This would
be especially true when walking many more trees than commits.
Perhaps there are other memory concerns in that case (such as cached
buffers) that the 'union any_object' is not a concern, but it is worth
thinking about as we brainstorm how to reduce the parent-list memory.
Thanks,
-Stolee
next prev parent reply other threads:[~2018-04-03 13:14 UTC|newest]
Thread overview: 26+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-04-03 12:00 [PATCH 0/3] Lazy-load trees when reading commit-graph Derrick Stolee
2018-04-03 12:00 ` [PATCH 1/3] commit: create get_commit_tree() method Derrick Stolee
2018-04-03 12:00 ` [PATCH 2/3] treewide: use get_commit_tree() for tree access Derrick Stolee
2018-04-03 12:00 ` [PATCH 3/3] commit-graph: lazy-load trees Derrick Stolee
2018-04-03 18:00 ` Stefan Beller
2018-04-03 18:22 ` Derrick Stolee
2018-04-03 18:37 ` Stefan Beller
2018-04-03 12:15 ` [PATCH 0/3] Lazy-load trees when reading commit-graph Derrick Stolee
2018-04-03 13:06 ` Jeff King
2018-04-03 13:14 ` Derrick Stolee [this message]
2018-04-03 20:20 ` Jeff King
2018-04-04 12:08 ` Derrick Stolee
2018-04-06 19:09 ` [PATCH v2 0/4] " Derrick Stolee
2018-04-06 19:09 ` [PATCH v2 1/4] treewide: rename tree to maybe_tree Derrick Stolee
2018-04-06 19:09 ` [PATCH v2 2/4] commit: create get_commit_tree() method Derrick Stolee
2018-04-06 19:09 ` [PATCH v2 3/4] treewide: replace maybe_tree with accessor methods Derrick Stolee
2018-04-06 19:09 ` [PATCH v2 4/4] commit-graph: lazy-load trees for commits Derrick Stolee
2018-04-06 19:21 ` [PATCH v2 0/4] Lazy-load trees when reading commit-graph Jeff King
2018-04-06 19:41 ` Derrick Stolee
2018-04-06 19:45 ` Stefan Beller
2018-04-08 23:18 ` Junio C Hamano
2018-04-09 13:15 ` Derrick Stolee
2018-04-09 17:25 ` Stefan Beller
2018-04-07 18:40 ` Jakub Narebski
2018-04-08 1:17 ` Derrick Stolee
2018-04-11 20:41 ` Jakub Narebski
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=007c630a-0df8-803c-b637-fb5ccefa7b7c@gmail.com \
--to=stolee@gmail.com \
--cc=avarab@gmail.com \
--cc=dstolee@microsoft.com \
--cc=git@vger.kernel.org \
--cc=larsxschneider@gmail.com \
--cc=peff@peff.net \
--cc=sbeller@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).