Re: [PATCH 1/3] Lazily open pack index files on demand

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: "Shawn O. Pearce" <spearce@spearce.org>
To: Nicolas Pitre <nico@cam.org>
Cc: Dana How <danahow@gmail.com>, Junio C Hamano <junkio@cox.net>,
	git@vger.kernel.org
Subject: Re: [PATCH 1/3] Lazily open pack index files on demand
Date: Sun, 27 May 2007 17:52:45 -0400	[thread overview]
Message-ID: <20070527215245.GD28023@spearce.org> (raw)
In-Reply-To: <alpine.LFD.0.99.0705271110550.3366@xanadu.home>

Nicolas Pitre <nico@cam.org> wrote:
> On Sat, 26 May 2007, Shawn O. Pearce wrote:
> 
> > Dana How <danahow@gmail.com> wrote:
> > > Shawn:  When I first saw the index-loading code,  my first
> > > thought was that all the index tables should be
> > > merged (easy since sorted) so callers only need to do one search.
> 
> There is also the question of memory footprint.  If you have a global 
> index, then for each object you need to have a tupple containing SHA1 + 
> pack offset + reference to corresponding pack.  Right now we only need 
> SHA1 + pack offset.

I'm about half-way through a super-index implementation.  Right now
the super-index is defined to be just one index file per repository
(objects/pack/super.index) with a format that looks like the
following:

  header:
    uint32_t sdx_signature ('PSDX')
    uint32_t sdx_version (1)
    uint16_t sdx_packs
    uint8_t  sdx_prefix_len;
    uint8_t  __reserved;

  pack table:
    /* SHA-1 of each pack's sorted SHA-1 object list */
    unsigned char pack_sha1[20][sdx_packs];

  fan-out table:
    /* This is the standard fan-out also used in a tOc/.idx */
    uint32_t fan_out[256];

  records:
    unsigned char prefix[hdr.sdx_prefix_len - 1];
    int8_t        splits[hdr.sdx_packs];

  trailer:
    unsigned char sha1_of_the_above[20];

I build the super-index by merging the .idx of all available
packfiles; since they are already sorted the merge is obviously
quite trivial.

The sdx_prefix_len field is initialized to something that gives a
reasonably unique object name; e.g. in git.git an sdx_prefix_len
of 3 or 4 gets pretty good at narrowing the set of objects down
very small.  The idea here is that the sdx_prefix_len should be
almost the unique abbreviation length for this repository.

We store 1 less byte of the prefix in the record because of the
fan-out table already accounting for the first byte of the prefix.

The splits array contains a single signed integer for each packfile;
if the integer is 0 then that packfile does not contain any object
that starts with that corresponding prefix.  In such a case we
can completely avoid looking at that corresponding packfile.
With my lazy index loading change, I may not even need to open
that index.  ;-)

If the splits array entry is non-zero and is negative, its the number
of times we need to halve down (towards 'lo') to get to entries
that start with this prefix and that are in that packfile's fan-out
table entry range.  If its positive its the number of times we have
to halve up (towards 'hi').  This way we can jump more directly to
the relevant index records and avoid redoing binary search work we
have already accomplished in the super index.

So we can generally build super index records at a cost of 3 or
4 bytes + sdx_packs.  We can also determine which packfile(s) we
need to scan quite quickly, and we can jump further into the part
of the index avoiding a number of expensive hashcmp() calls.  It
may actually be a good savings at runtime, well worth the slightly
higher memory footprint.

My testing is not yet complete, so I cannot offer any hard numbers
against any interesting/common data sets...

> BTW I think the Newton-Raphson based index lookup approach should be 
> revived at some point.

That doesn't help with 10 packfiles though, does it?

-- 
Shawn.

next prev parent reply	other threads:[~2007-05-27 21:53 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-05-26  5:24 [PATCH 1/3] Lazily open pack index files on demand Shawn O. Pearce
2007-05-26  8:29 ` Junio C Hamano
2007-05-26 17:30   ` Shawn O. Pearce
2007-05-26 17:31   ` Dana How
2007-05-27  2:43     ` Nicolas Pitre
2007-05-27  4:31       ` Dana How
2007-05-27 14:41         ` Nicolas Pitre
2007-05-27  3:34     ` Shawn O. Pearce
2007-05-27  4:40       ` Dana How
2007-05-27 15:29         ` Nicolas Pitre
2007-05-27 21:35           ` Shawn O. Pearce
2007-05-28  1:35             ` Dana How
2007-05-28  2:30               ` A Large Angry SCM
2007-05-28 18:31               ` Nicolas Pitre
2007-05-28  2:18             ` Nicolas Pitre
2007-05-27 15:26       ` Nicolas Pitre
2007-05-27 16:06         ` Dana How
2007-05-27 21:52         ` Shawn O. Pearce [this message]
2007-05-27 23:35           ` Nicolas Pitre
2007-05-28 16:22             ` Linus Torvalds
2007-05-28 17:13               ` Nicolas Pitre
2007-05-28 17:40               ` Karl Hasselström
  -- strict thread matches above, loose matches on Subject: below --
2007-05-27 10:46 Martin Koegler
2007-05-27 15:36 ` Nicolas Pitre
2007-05-29  0:09 linux
2007-05-29  3:26 ` Linus Torvalds

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070527215245.GD28023@spearce.org \
    --to=spearce@spearce.org \
    --cc=danahow@gmail.com \
    --cc=git@vger.kernel.org \
    --cc=junkio@cox.net \
    --cc=nico@cam.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).