From: Linus Torvalds <torvalds@linux-foundation.org>
To: "Randal L. Schwartz" <merlyn@stonehenge.com>
Cc: Peter Eriksen <s022018@student.dtu.dk>, git@vger.kernel.org
Subject: Re: Distribution of longest common hash prefixes
Date: Mon, 2 Apr 2007 10:00:22 -0700 (PDT) [thread overview]
Message-ID: <Pine.LNX.4.64.0704020938470.6730@woody.linux-foundation.org> (raw)
In-Reply-To: <86bqi6kae7.fsf@blue.stonehenge.com>
On Mon, 2 Apr 2007, Randal L. Schwartz wrote:
>
> I don't have access to the linux-2.6 kernel, but on git.git at
> d8b6a1a10b93666246984a50d64a163e71163aeb I get this:
>
> $ git-rev-list --objects HEAD | sort | perl -lne '
> substr($_, 40) = "";
> ($p ^ $_) =~ /^(\0*)/;
> $count[length $1]++;
> $p = $_;
> END { print "$_: $count[$_]" for 0..$#count }
> '
> 0: 16
> 1: 240
> 2: 3839
> 3: 24458
> 4: 8275
> 5: 619
> 6: 45
> 7:
> 8: 1
>
> Yeay Perl. :)
No yay yet.. That counts hex digits, not bits.
However, both this and Peter's original thing show an interesting pattern
in common: for the case where the data is dense (ie a few bits in common),
you actually don't end up counting "bits in common", but "edges when the
bits change in the sorted output".
For example, in the above, the 16/240/3839 comes simply from the fact that
there are sixteen times that the first digit changes (and that makes the
program think that it has zero bits in common). There are 256 times that
the two first digit changes, but 16 of those the first one changed too, so
only in 240 cases did just the second digit change).
And there are 4096 places where the three first digit change, but 256 of
those were already counted, so you get 3840 for the third case (but the
git repo didn't have enough objects, so you missed one, and then the next
ones will hit a peak and then start an exponential decrease.
So with a nice random linear distribution (which we'd expect from a good
hash), you should see an exponential increase to a maximum (which you'd
expect to be at "floor(lnx(nr-objects))", and then an exponential decrease
right back.
With the kernel, with 439342 objects reachable from HEAD, the peak should
be around 4 (for a base-16 thing) and around 18 for the binary thing.
Which is exactly what you get..
Linus
--- for the kernel, using your nybble-counter ---
0: 16
1: 240
2: 3840
3: 61357
4: 293375
5: 74775
6: 5372
7: 350
8: 16
9: 1
next prev parent reply other threads:[~2007-04-02 17:00 UTC|newest]
Thread overview: 20+ messages / expand[flat|nested] mbox.gz Atom feed top
2007-04-02 14:58 Distribution of longest common hash prefixes Peter Eriksen
2007-04-02 15:20 ` Linus Torvalds
2007-04-02 16:29 ` Randal L. Schwartz
2007-04-02 17:00 ` Linus Torvalds [this message]
2007-04-02 17:17 ` Randal L. Schwartz
2007-04-02 17:33 ` Randal L. Schwartz
2007-04-03 17:04 ` James Cloos
2007-04-03 17:11 ` Randal L. Schwartz
2007-04-03 17:21 ` Shawn O. Pearce
2007-04-03 17:50 ` Linus Torvalds
2007-04-03 18:22 ` Junio C Hamano
2007-04-03 19:27 ` Linus Torvalds
2007-04-03 19:34 ` Nicolas Pitre
2007-04-03 20:25 ` Junio C Hamano
2007-04-03 20:39 ` Nicolas Pitre
2007-04-03 23:08 ` Olivier Galibert
2007-04-03 23:22 ` Linus Torvalds
2007-04-04 21:03 ` James Cloos
2007-04-02 17:18 ` James Cloos
2007-04-02 15:28 ` Peter Eriksen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Pine.LNX.4.64.0704020938470.6730@woody.linux-foundation.org \
--to=torvalds@linux-foundation.org \
--cc=git@vger.kernel.org \
--cc=merlyn@stonehenge.com \
--cc=s022018@student.dtu.dk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).