Re: [PATCH 4/4] pack-revindex: radix-sort the revindex

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Jeff King <peff@peff.net>
To: Shawn Pearce <spearce@spearce.org>
Cc: git <git@vger.kernel.org>
Subject: Re: [PATCH 4/4] pack-revindex: radix-sort the revindex
Date: Mon, 8 Jul 2013 03:57:12 -0400	[thread overview]
Message-ID: <20130708075712.GC25072@sigill.intra.peff.net> (raw)
In-Reply-To: <CAJo=hJugvqBEQwPYcttNH+R8xUKxy1uDm5EjiWaye-wEuTxk-w@mail.gmail.com>

On Sun, Jul 07, 2013 at 04:52:23PM -0700, Shawn O. Pearce wrote:

> On Sun, Jul 7, 2013 at 3:14 AM, Jeff King <peff@peff.net> wrote:
> > The pack revindex stores the offsets of the objects in the
> > pack in sorted order, allowing us to easily find the on-disk
> > size of each object. To compute it, we populate an array
> > with the offsets from the sha1-sorted idx file, and then use
> > qsort to order it by offsets.
> >
> > That does O(n log n) offset comparisons, and profiling shows
> > that we spend most of our time in cmp_offset. However, since
> > we are sorting on a simple off_t, we can use numeric sorts
> > that perform better. A radix sort can run in O(k*n), where k
> > is the number of "digits" in our number. For a 64-bit off_t,
> > using 16-bit "digits" gives us k=4.
> 
> Did you try the simple bucket sort Colby now uses in JGit?
> 
> The sort is pretty simple:
> 
>   bucket_size = pack_length / object_count;
>   buckets[] = malloc(object_count * sizeof(int));
> 
>   foreach obj in idx:
>     push_chain(buckets[obj.offset / bucket_size], obj.idx_nth);
> 
>   foreach bucket:
>     insertion sort by offset

I did do something similar (though I flattened my buckets into a single
list afterwards), but I ended up closer to 700ms (down from 830ms, but
with the radix sort around 200ms). It's entirely possible I screwed up
something in the implementation (the bucket insertion can be done in a
lot of different ways, many of which are terrible), but I didn't keep a
copy of that attempt. If you try it and have better numbers, I'd be
happy to see them.

> We observed on linux.git that most buckets have an average number of
> objects. IIRC the bucket_size was ~201 bytes and most buckets had very
> few objects each. For lookups we keep the bucket_size parameter and a
> bucket index table. This arrangement uses 8 bytes per object in the
> reverse index, making it very memory efficient. Searches are typically
> below O(log N) time because each bucket has <log N entries.

I didn't measure lookups at all; I was focused on time to build the
index. So if there were benefits there that make up for a longer setup
time, I wouldn't have measured them (of course, we also care about the
case with few lookups, so it would be a tradeoff). You could also leave
each bucket unsorted and only lazily sort it when a lookup hits the
bucket, which might help that case (I didn't look to see if you do that
in JGit).

-Peff

next prev parent reply	other threads:[~2013-07-08  7:57 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-07-07 10:01 [RFC/PATCH 0/4] cat-file --batch-disk-sizes Jeff King
2013-07-07 10:03 ` [PATCH 1/4] zero-initialize object_info structs Jeff King
2013-07-07 17:34   ` Junio C Hamano
2013-07-07 10:04 ` [PATCH 2/4] teach sha1_object_info_extended a "disk_size" query Jeff King
2013-07-07 10:09 ` [PATCH 3/4] cat-file: add --batch-disk-sizes option Jeff King
2013-07-07 17:49   ` Junio C Hamano
2013-07-07 18:19     ` Jeff King
2013-07-08 11:04     ` Duy Nguyen
2013-07-08 12:00       ` Ramkumar Ramachandra
2013-07-08 13:13         ` Duy Nguyen
2013-07-08 13:37           ` Ramkumar Ramachandra
2013-07-09  2:55             ` Duy Nguyen
2013-07-09 10:32               ` Ramkumar Ramachandra
2013-07-10 11:16             ` Jeff King
2013-07-08 16:40           ` Junio C Hamano
2013-07-10 11:04     ` Jeff King
2013-07-11 16:35       ` Junio C Hamano
2013-07-07 21:15   ` brian m. carlson
2013-07-10 10:57     ` Jeff King
2013-07-07 10:14 ` [PATCH 4/4] pack-revindex: radix-sort the revindex Jeff King
2013-07-07 23:52   ` Shawn Pearce
2013-07-08  7:57     ` Jeff King [this message]
2013-07-08 15:38       ` Shawn Pearce
2013-07-08 20:50   ` Brandon Casey
2013-07-08 21:35     ` Brandon Casey
2013-07-10 10:57       ` Jeff King
2013-07-10 10:52     ` Jeff King
2013-07-10 11:34 ` [PATCHv2 00/10] cat-file formats/on-disk sizes Jeff King
2013-07-10 11:35   ` [PATCH 01/10] zero-initialize object_info structs Jeff King
2013-07-10 11:35   ` [PATCH 02/10] teach sha1_object_info_extended a "disk_size" query Jeff King
2013-07-10 11:36   ` [PATCH 03/10] t1006: modernize output comparisons Jeff King
2013-07-10 11:38   ` [PATCH 04/10] cat-file: teach --batch to stream blob objects Jeff King
2013-07-10 11:38   ` [PATCH 05/10] cat-file: refactor --batch option parsing Jeff King
2013-07-10 11:45   ` [PATCH 06/10] cat-file: add --batch-check=<format> Jeff King
2013-07-10 11:57     ` Eric Sunshine
2013-07-10 14:51     ` Ramkumar Ramachandra
2013-07-11 11:24       ` Jeff King
2013-07-10 11:46   ` [PATCH 07/10] cat-file: add %(objectsize:disk) format atom Jeff King
2013-07-10 11:48   ` [PATCH 08/10] cat-file: split --batch input lines on whitespace Jeff King
2013-07-10 15:29     ` Ramkumar Ramachandra
2013-07-11 11:36       ` Jeff King
2013-07-11 17:42         ` Junio C Hamano
2013-07-11 20:45         ` [PATCHv3 " Jeff King
2013-07-10 11:50   ` [PATCH 09/10] pack-revindex: use unsigned to store number of objects Jeff King
2013-07-10 11:55   ` [PATCH 10/10] pack-revindex: radix-sort the revindex Jeff King
2013-07-10 12:00     ` Jeff King
2013-07-10 13:17     ` Ramkumar Ramachandra
2013-07-11 11:03       ` Jeff King
2013-07-10 17:10     ` Brandon Casey
2013-07-11 11:17       ` Jeff King
2013-07-11 12:16     ` [PATCHv3 " Jeff King
2013-07-11 21:12       ` Brandon Casey

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130708075712.GC25072@sigill.intra.peff.net \
    --to=peff@peff.net \
    --cc=git@vger.kernel.org \
    --cc=spearce@spearce.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).