git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Linus Torvalds <torvalds@linux-foundation.org>
To: linux@horizon.com
Cc: git@vger.kernel.org
Subject: Re: [PATCH 1/3] Lazily open pack index files on demand
Date: Mon, 28 May 2007 20:26:42 -0700 (PDT)	[thread overview]
Message-ID: <alpine.LFD.0.98.0705282014480.26602@woody.linux-foundation.org> (raw)
In-Reply-To: <20070529000949.28007.qmail@science.horizon.com>



On Mon, 28 May 2007, linux@horizon.com wrote:
> 
> Even losing a constant factor of 2, it still seems like it might offer a 
> factor-of-2 speedup for large repositories.

Well, not so much according to the numbers I had.

Yes, SHA-1's are very nicely distributed on a larger scale, but on a 
_small_ scale they aren't. 

So you end up being able to get very good initial guesses for the first 
few iterations, but once you get "close enough", you can't do anything at 
all, and then you're actually worse off.

Also, please do realize that the binary search is actually a *smart* 
binary search, which does a radix-256 fan-out at the beginning, getting 
rid of the first 8 levels of searching for free.

The Newton-Raphson thing can do that too (and in my trial patch, did), but 
it means that you actually get rid of just the initial guess (the one that 
worked the best!) and you still have the problem that once you get close 
enough, the distribution at a local level is not at all nice and linear.

So what I did with my patch (you can find it in the archives - search for 
Newton-Raphson, I'd say about 3 months ago) was to do three cycles of 
approximating, and then a linear search. The linear search has good cache 
behaviour, so it's not as horrid as it might sound, but for some numbers I 
did, my approximate thing would hit on the exact first entry about half 
the time, but would have to walk up to ~20 entries occasionally.

Which meant that the binary search (with the initial radix-256 fan-out) 
actually mostly outperformed the Newton-Raphson thing.

Also, object lookup is certainly noticeable, but the biggest cost of it is 
the cache misses, and even then it's not really _that_ noticeable. And 
it's really neither slow nor stupid as it is.

So I'd love to see somebody try to do a _proper_ apprixmation of Newton- 
Raphson, not the quick hack I did. But I think factor-of-two is actually 
optimistic, even for pretty large repos. And it would have to be no worse 
than what we have now for average-sized ones.

(And object lookup time is generally not the dominant cost, so while it's 
noticeable, cutting it down by even a whole 50% isn't going to speed up 
any normal git operations by more than a couple of percent.. Generating 
diffs in particular is a much more costly op for things like "git blame")

		Linus

  reply	other threads:[~2007-05-29  3:26 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-05-29  0:09 [PATCH 1/3] Lazily open pack index files on demand linux
2007-05-29  3:26 ` Linus Torvalds [this message]
  -- strict thread matches above, loose matches on Subject: below --
2007-05-27 10:46 Martin Koegler
2007-05-27 15:36 ` Nicolas Pitre
2007-05-26  5:24 Shawn O. Pearce
2007-05-26  8:29 ` Junio C Hamano
2007-05-26 17:30   ` Shawn O. Pearce
2007-05-26 17:31   ` Dana How
2007-05-27  2:43     ` Nicolas Pitre
2007-05-27  4:31       ` Dana How
2007-05-27 14:41         ` Nicolas Pitre
2007-05-27  3:34     ` Shawn O. Pearce
2007-05-27  4:40       ` Dana How
2007-05-27 15:29         ` Nicolas Pitre
2007-05-27 21:35           ` Shawn O. Pearce
2007-05-28  1:35             ` Dana How
2007-05-28  2:30               ` A Large Angry SCM
2007-05-28 18:31               ` Nicolas Pitre
2007-05-28  2:18             ` Nicolas Pitre
2007-05-27 15:26       ` Nicolas Pitre
2007-05-27 16:06         ` Dana How
2007-05-27 21:52         ` Shawn O. Pearce
2007-05-27 23:35           ` Nicolas Pitre
2007-05-28 16:22             ` Linus Torvalds
2007-05-28 17:13               ` Nicolas Pitre
2007-05-28 17:40               ` Karl Hasselström

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=alpine.LFD.0.98.0705282014480.26602@woody.linux-foundation.org \
    --to=torvalds@linux-foundation.org \
    --cc=git@vger.kernel.org \
    --cc=linux@horizon.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).