git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Karsten Blees <karsten.blees@gmail.com>
To: "Vicent Martí" <tanoku@gmail.com>, "Junio C Hamano" <gitster@pobox.com>
Cc: Duy Nguyen <pclouds@gmail.com>, Karsten Blees <blees@dcon.de>,
	Git Mailing List <git@vger.kernel.org>, Jeff King <peff@peff.net>
Subject: Re: What's cooking in git.git (Oct 2013, #06; Fri, 25)
Date: Mon, 28 Oct 2013 20:45:44 +0100	[thread overview]
Message-ID: <526EBEE8.7070807@gmail.com> (raw)
In-Reply-To: <CAFFjANSnuS6_+uAd43AayojJyK-wj2wMxQ6DBD6JyN=A7xh2_A@mail.gmail.com>

Am 28.10.2013 17:16, schrieb Vicent Martí:
> On Mon, Oct 28, 2013 at 4:48 PM, Junio C Hamano <gitster@pobox.com> wrote:
>>> jk/pack-bitmap adds khash.h, which from a first glance looks like yet
>>> another hash table implementation. I was just wondering if kb's new
>>> hash tables can cover the need of pack-bitmap.c too so we can remove
>>> khash.h later..
>>
>> Good thinking ;-).
> 
> We use the khash tables to map:
> 
>     - sha1 (const char *) to (void *)
>     - sha1 (const char *) to int
> 
> The new `hashmap.c` covers the first case quite well (albeit slightly
> more verbosely than I'd like), but in the second case it doesn't quite
> work. Since the new hash needs to embed the "struct hashmap_entry" on
> all its values (to allow for separate chaining), having it map to
> `int` keys requires a struct like this:
> 
>     struct sha1_position {
>         struct hashmap_entry {
>             struct hashmap_entry *next;
>             unsigned int hash;
>         };
>         int position;
>     }
> 

Hmm...isn't that position rather an index into two separately maintained arrays? So you'd rather have:

    struct sha1_position {
        struct hashmap_entry {
            struct hashmap_entry *next;
            unsigned int hash;
        };
        uint32_t pack_name_hash;
        struct object *object;
     }

> khash on the other hand is capable of storing the position values as
> part of the hash table itself (i.e. `int **buckets`), and saves us
> from thousands of bytes of allocations + indirection.
> 

However, khash keeps separate arrays for flags, keys and values, all of them overallocated by 1 / load factor (so there's lots of unused space). ext_index.objects and ext_index.hashes are also overallocated by the usual alloc_nr() factor 1.5.

Regarding memory consumption, I think both implementations will be pretty similar. Hashmap allocates many small regions while khash re-allocates a few big ones...I really don't know which is better, ideally entries would be allocated in chunks to minimize both heap overhead and memcpy disadvantes.

Regarding performance, khash uses open addressing, which requires more key comparisons (O(1/(1-load_factor))) than chaining (O(1+load_factor)). However, any measurable differences will most likely be dwarfed by IO costs in this particular use case.


Btw., pack-objects.c::rehash_objects() in patch 03 unnecessarily checks for duplicates. That's probably the reason for the high hashcmp times you found in the first round of the patch series.

Cheers,
Karsten

  parent reply	other threads:[~2013-10-28 19:45 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-10-25 23:23 What's cooking in git.git (Oct 2013, #06; Fri, 25) Junio C Hamano
2013-10-26 10:30 ` Duy Nguyen
2013-10-28 15:48   ` Junio C Hamano
2013-10-28 16:16     ` Vicent Martí
2013-10-28 16:41       ` Junio C Hamano
2013-10-28 19:45       ` Karsten Blees [this message]
2013-10-28 21:04         ` Vicent Martí
2013-10-29  9:09           ` Karsten Blees
2013-11-14 19:38             ` Karsten Blees

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=526EBEE8.7070807@gmail.com \
    --to=karsten.blees@gmail.com \
    --cc=blees@dcon.de \
    --cc=git@vger.kernel.org \
    --cc=gitster@pobox.com \
    --cc=pclouds@gmail.com \
    --cc=peff@peff.net \
    --cc=tanoku@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).