git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* Distribution of longest common hash prefixes
@ 2007-04-02 14:58 Peter Eriksen
  2007-04-02 15:20 ` Linus Torvalds
  2007-04-02 15:28 ` Peter Eriksen
  0 siblings, 2 replies; 20+ messages in thread
From: Peter Eriksen @ 2007-04-02 14:58 UTC (permalink / raw)
  To: git

Hello,

With the recent and ongoing discussion about a more efficient .idx
format, I got curious about how long the longest common prefix between
two hashes is in git various projects. So I did

$ git rev-list --objects HEAD | cut -b1-40 | sort >proj.list
$ lcprefix proj.list

and decided to compare all neighbour hashes with each other, finding the
longest common prefix in bits between them. These numbers were then
tabulated into bins on the number of bits.

Why does it look so sparse and strange?

The following is the output of the program bellow on the linux-2.6.git
archive:


0000d162d198a60d558ab4be3f54f608ff8b7473
0000de01ec150d1a291564818571f719a6a6190f
lcprefix = 20
00038c17317a3633f176f17876e69d8e31a5c708
00038ef0cad0a60ee7cace23ade5dfb325b7700d
lcprefix = 22
00108a8dd8d4d772ac6a91efde40191392ba4624
00108ba9a78dcef5629ead0e8bea35d0c08c9ea7
lcprefix = 23
001c2d57248586464da570017e368e188eb4d270
001c2d594f26b529fdabb6cf05deaa384f400e12
lcprefix = 28
002c9920c7bc3cbf74b4cdc03491f80f68917528
002c9922e55255556f1bebea8772aa58c9724825
lcprefix = 30
15d954e50cae6e816b534bf959c49a2920bef808
15d954e51e5b40cf4d5930708fe98076adb1063a
lcprefix = 35
d37bdb4d4930ddb390902c19cffe6552d93d3fcf
d37bdb4d4c9c821e4765f10c206fa613f7781b65
lcprefix = 37
 0:  0
 1:  0
 2:  0
 3:  0
 4:  0
 5:  0
 6:  0
 7:  0
 8:  0
 9:  0
10:  0
11:  0
12:  0
13:  0
14:  0
15:  0
16:  0
17:  0
18:  0
19:  8
20: 19
21:  0
22: 87
23: 70
24:  0
25:  0
26:  0
27:  0
28: 116
29:  0
30: 36975
31:  0
32:  0
33:  0
34:  0
35: 325071
36:  0
37: 76996
38:  0
39:  0
40:  0
41:  0
42:  0
43:  0
44:  0
45:  0
46:  0
47:  0
48:  0
49:  0
50:  0
51:  0
52:  0
53:  0
54:  0
55:  0
56:  0
57:  0
58:  0
59:  0
60:  0
61:  0
62:  0
63:  0


======================= >8 ============================
#include <stdio.h>
#include <string.h>


int h2d(char c)
{
	if ('a' <= c && c <= 'f')
		return c-'a'+10;
	else
		return c-'0';
}

int lcprefix(char *a, char *b)
{
	int i = 0;
	int j = 4;
	while (a[i] == b[i])
		i++;

	while ((h2d(a[i])<<j & 128) == (h2d(b[i])<<j & 128))
		j++;
	
	return i*4 + j - 4;
}

int main(int argc, char **argv)
{
	FILE *fp;
	char old[41];
	char cur[41];
	int lcp = 0;
	int table[64];
	int i;

	memset(table, 0, 64*sizeof(int));
	memset(old, '0', 40);
	old[40] = '\0';

	fp = fopen(argv[1], "r");
	fscanf(fp, "%s\n", cur);

	if (lcp < lcprefix(old, cur)) {
		lcp = lcprefix(old, cur);
	}

	table[lcp]++;
	while (fscanf(fp, "%s\n", cur) != EOF) {
		table[lcp]++;
		if (lcp < lcprefix(old, cur)) {
			printf("%s\n%s\n", old, cur);
			lcp = lcprefix(old, cur);
			printf("lcprefix = %d\n", lcp);
		}
		memcpy(old, cur, 40);
	}

	for(i = 0; i < 64; i++) {
		printf("%2d: %2d\n", i, table[i]);
	}
	
	return 0;
}

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2007-04-04 21:12 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-04-02 14:58 Distribution of longest common hash prefixes Peter Eriksen
2007-04-02 15:20 ` Linus Torvalds
2007-04-02 16:29   ` Randal L. Schwartz
2007-04-02 17:00     ` Linus Torvalds
2007-04-02 17:17       ` Randal L. Schwartz
2007-04-02 17:33         ` Randal L. Schwartz
2007-04-03 17:04           ` James Cloos
2007-04-03 17:11             ` Randal L. Schwartz
2007-04-03 17:21               ` Shawn O. Pearce
2007-04-03 17:50                 ` Linus Torvalds
2007-04-03 18:22                   ` Junio C Hamano
2007-04-03 19:27                     ` Linus Torvalds
2007-04-03 19:34                     ` Nicolas Pitre
2007-04-03 20:25                       ` Junio C Hamano
2007-04-03 20:39                         ` Nicolas Pitre
2007-04-03 23:08                           ` Olivier Galibert
2007-04-03 23:22                             ` Linus Torvalds
2007-04-04 21:03                   ` James Cloos
2007-04-02 17:18     ` James Cloos
2007-04-02 15:28 ` Peter Eriksen

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).