From: Karsten Blees <karsten.blees@gmail.com>
To: "Vicent Martí" <tanoku@gmail.com>
Cc: Junio C Hamano <gitster@pobox.com>,
Duy Nguyen <pclouds@gmail.com>, Karsten Blees <blees@dcon.de>,
Git Mailing List <git@vger.kernel.org>, Jeff King <peff@peff.net>
Subject: Re: What's cooking in git.git (Oct 2013, #06; Fri, 25)
Date: Tue, 29 Oct 2013 10:09:11 +0100 [thread overview]
Message-ID: <CAH7EuMHgH6Oe_SvjyutBaakRfyZGHpp_iimaAzpV09AnHTYntw@mail.gmail.com> (raw)
In-Reply-To: <CAFFjANRaphYdg6VM8cqJY3NmPz+gNE7S9S1jAgPPctUZio7+Tw@mail.gmail.com>
On Mon, Oct 28, 2013 at 10:04 PM, Vicent Martí <tanoku@gmail.com> wrote:
>
> On Mon, Oct 28, 2013 at 8:45 PM, Karsten Blees <karsten.blees@gmail.com> wrote:
>
> > Regarding performance, khash uses open addressing, which requires more key comparisons (O(1/(1-load_factor))) than chaining (O(1+load_factor)). However, any measurable differences will most likely be dwarfed by IO costs in this particular use case.
>
> I don't think this is true. If you actually run a couple insertion and
> lookup benchmarks comparing the two implementations, you'll find khash
> to be around ~30% faster for most workloads (venturing a guess from
> past experience). I am obviously not the author of khash, but I've
> found that the theoretical increase in key comparisons is completely
> dwarfed by the benefit of increased cache locality during the probing
> fase. I still haven't found a faster C hash table implementation than
> khash for the general case, that's why I normally use it despite the
> worrisome preprocessor crash-party going on in that header file.
Yes, cache locality is where open addressing shines, however, you
loose that benefit when the keys are pointers (e.g. sha1's).
>
>
> > Btw., pack-objects.c::rehash_objects() in patch 03 unnecessarily checks for duplicates. That's probably the reason for the high hashcmp times you found in the first round of the patch series.
>
> Patch 03 is a refactoring -- the duplicate checking code has been in
> pack-objects.c for years. I am not sure duplicate checking is
> superfluous at all, there are many cases where you could be
> double-inserting objects in a pack-objects run, and you really don't
> want to generate packfiles with dupe objects.
The point is that its in _rehash_. Duplicate checking should be in
add/put. Rehash only rearranges entries that are alread _in_ the map,
and it usually only needs the hash code for that. So pack-objects
implements rehash in terms of a full clear + add-all instead, which is
of course slower than what khash, hashmap etc. would do.
Ciao,
Karsten
next prev parent reply other threads:[~2013-10-29 9:09 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-10-25 23:23 What's cooking in git.git (Oct 2013, #06; Fri, 25) Junio C Hamano
2013-10-26 10:30 ` Duy Nguyen
2013-10-28 15:48 ` Junio C Hamano
2013-10-28 16:16 ` Vicent Martí
2013-10-28 16:41 ` Junio C Hamano
2013-10-28 19:45 ` Karsten Blees
2013-10-28 21:04 ` Vicent Martí
2013-10-29 9:09 ` Karsten Blees [this message]
2013-11-14 19:38 ` Karsten Blees
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=CAH7EuMHgH6Oe_SvjyutBaakRfyZGHpp_iimaAzpV09AnHTYntw@mail.gmail.com \
--to=karsten.blees@gmail.com \
--cc=blees@dcon.de \
--cc=git@vger.kernel.org \
--cc=gitster@pobox.com \
--cc=pclouds@gmail.com \
--cc=peff@peff.net \
--cc=tanoku@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).