From: Jeff King <peff@peff.net>
To: Thomas Rast <trast@inf.ethz.ch>
Cc: git@vger.kernel.org, "Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>,
"Junio C Hamano" <gitster@pobox.com>
Subject: Re: [PATCH] Hold an 'unsigned long' chunk of the sha1 in obj_hash
Date: Wed, 1 May 2013 16:49:47 -0400 [thread overview]
Message-ID: <20130501204947.GA12789@sigill.intra.peff.net> (raw)
In-Reply-To: <6c2b67a2f0b67ee796c7676e3febe4c61ab85d4a.1366912627.git.trast@inf.ethz.ch>
On Thu, Apr 25, 2013 at 08:04:01PM +0200, Thomas Rast wrote:
> And probing lookups happen a lot: some simple instrumentation shows
> that 'git rev-list --all --objects' on my git.git,
>
> * 19.4M objects are accessed in lookup_object and grow_object_hash
> combined, while
>
> * the linear probing loops in lookup_object and insert_obj_hash run a
> total of 9.4M times.
>
> So we take a slightly different approach, and trade some memory for
> better cache locality. Namely, we change the hash table slots to
> contain
>
> struct object *obj;
> unsigned long sha1prefix;
I think this is a clever idea, though I do worry about the extra memory
use (it's not all _that_ much in the grand scheme, but it works against
the cache locality benefit). I just posted (but forgot to cc you) a
patch that takes a different approach: to actually move the likely
candidate to the front of the collision change. The patch is here:
http://article.gmane.org/gmane.comp.version-control.git/223139
It does a bit better than the numbers you have here:
> I get a decent speedup, for example using git.git as a test
> repository:
>
> Test before after
> ---------------------------------------------------------------------------------
> 0001.1: rev-list --all 0.42(0.40+0.01) 0.41(0.39+0.01) -1.5%**
> 0001.2: rev-list --all --objects 2.40(2.37+0.03) 2.28(2.25+0.03) -5.0%***
> ---------------------------------------------------------------------------------
>
> And even more in linux.git:
>
> ---------------------------------------------------------------------------------
> 0001.1: rev-list --all 3.31(3.19+0.12) 3.21(3.09+0.11) -2.9%**
> 0001.2: rev-list --all --objects 27.99(27.70+0.26) 25.99(25.71+0.25) -7.1%***
> ---------------------------------------------------------------------------------
It _might_ still be advantageous to do your patch on top, but I suspect
it will diminish the returns from your patch (since the point of it is
to probe less far down the chain on average).
> I expected the big win to be in grow_object_hash(), but perf says that
> 'git rev-list --all --objects' spends a whopping 33.6% of its time in
> lookup_object(), and this change gets that down to 30.0%.
I'm not surprised. I spent some time trying to optimize grow_object_hash
and realized that it doesn't make much difference. The killer in
"rev-list --objects --all" is that we hit the same tree entry objects
over and over.
Another avenue I'd like to explore is actually doing a tree-diff from
the last processed commit, since we should need to examine only the
changed entries. I suspect it won't be a big benefit, though, because
even though the tree diff can happen in O(# of entries), we are trying
to beat doing an O(1) hash for each entry. So it's the same algorithmic
complexity, and it is hard to beat a few hashcmp() calls. We'll see.
-Peff
next prev parent reply other threads:[~2013-05-01 20:49 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-04-25 18:04 [PATCH] Hold an 'unsigned long' chunk of the sha1 in obj_hash Thomas Rast
2013-04-25 20:13 ` Junio C Hamano
2013-04-25 21:09 ` Thomas Rast
2013-04-25 21:12 ` Duy Nguyen
2013-05-01 20:49 ` Jeff King [this message]
2013-05-02 0:53 ` [PATCH] process tree diffs during "rev-list --objects" Jeff King
2013-05-02 8:52 ` [PATCH] Hold an 'unsigned long' chunk of the sha1 in obj_hash Thomas Rast
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20130501204947.GA12789@sigill.intra.peff.net \
--to=peff@peff.net \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=pclouds@gmail.com \
--cc=trast@inf.ethz.ch \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).