git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Karsten Blees <karsten.blees@gmail.com>
To: "Vicent Martí" <tanoku@gmail.com>
Cc: Junio C Hamano <gitster@pobox.com>,
	Duy Nguyen <pclouds@gmail.com>, Karsten Blees <blees@dcon.de>,
	Git Mailing List <git@vger.kernel.org>, Jeff King <peff@peff.net>
Subject: Re: What's cooking in git.git (Oct 2013, #06; Fri, 25)
Date: Tue, 29 Oct 2013 10:09:11 +0100	[thread overview]
Message-ID: <CAH7EuMHgH6Oe_SvjyutBaakRfyZGHpp_iimaAzpV09AnHTYntw@mail.gmail.com> (raw)
In-Reply-To: <CAFFjANRaphYdg6VM8cqJY3NmPz+gNE7S9S1jAgPPctUZio7+Tw@mail.gmail.com>

On Mon, Oct 28, 2013 at 10:04 PM, Vicent Martí <tanoku@gmail.com> wrote:
>
> On Mon, Oct 28, 2013 at 8:45 PM, Karsten Blees <karsten.blees@gmail.com> wrote:
>
> > Regarding performance, khash uses open addressing, which requires more key comparisons (O(1/(1-load_factor))) than chaining (O(1+load_factor)). However, any measurable differences will most likely be dwarfed by IO costs in this particular use case.
>
> I don't think this is true. If you actually run a couple insertion and
> lookup benchmarks comparing the two implementations, you'll find khash
> to be around ~30% faster for most workloads (venturing a guess from
> past experience). I am obviously not the author of khash, but I've
> found that the theoretical increase in key comparisons is completely
> dwarfed by the benefit of increased cache locality during the probing
> fase. I still haven't found a faster C hash table implementation than
> khash for the general case, that's why I normally use it despite the
> worrisome preprocessor crash-party going on in that header file.

Yes, cache locality is where open addressing shines, however, you
loose that benefit when the keys are pointers (e.g. sha1's).

>
>
> > Btw., pack-objects.c::rehash_objects() in patch 03 unnecessarily checks for duplicates. That's probably the reason for the high hashcmp times you found in the first round of the patch series.
>
> Patch 03 is a refactoring -- the duplicate checking code has been in
> pack-objects.c for years. I am not sure duplicate checking is
> superfluous at all, there are many cases where you could be
> double-inserting objects in a pack-objects run, and you really don't
> want to generate packfiles with dupe objects.

The point is that its in _rehash_. Duplicate checking should be in
add/put. Rehash only rearranges entries that are alread _in_ the map,
and it usually only needs the hash code for that. So pack-objects
implements rehash in terms of a full clear + add-all instead, which is
of course slower than what khash, hashmap etc. would do.

Ciao,
Karsten

  reply	other threads:[~2013-10-29  9:09 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-10-25 23:23 What's cooking in git.git (Oct 2013, #06; Fri, 25) Junio C Hamano
2013-10-26 10:30 ` Duy Nguyen
2013-10-28 15:48   ` Junio C Hamano
2013-10-28 16:16     ` Vicent Martí
2013-10-28 16:41       ` Junio C Hamano
2013-10-28 19:45       ` Karsten Blees
2013-10-28 21:04         ` Vicent Martí
2013-10-29  9:09           ` Karsten Blees [this message]
2013-11-14 19:38             ` Karsten Blees

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAH7EuMHgH6Oe_SvjyutBaakRfyZGHpp_iimaAzpV09AnHTYntw@mail.gmail.com \
    --to=karsten.blees@gmail.com \
    --cc=blees@dcon.de \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=pclouds@gmail.com \
    --cc=peff@peff.net \
    --cc=tanoku@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).