From: Thomas Rast <trast@student.ethz.ch>
To: Junio C Hamano <gitster@pobox.com>
Cc: Thomas Rast <trast@student.ethz.ch>,
	Thomas Gummerer <t.gummerer@gmail.com>, <git@vger.kernel.org>,
	<mhagger@alum.mit.edu>, <pclouds@gmail.com>
Subject: Re: Review of current github code [Re: [GSoC] Designing a faster index format - Progress report week 6]
Date: Fri, 1 Jun 2012 14:11:37 +0200
Message-ID: <87vcjb1bba.fsf@thomas.inf.ethz.ch>
In-Reply-To: <7vhauw6x0p.fsf@alter.siamese.dyndns.org> (Junio C. Hamano's message of "Thu, 31 May 2012 11:11:34 -0700")

Junio C Hamano <gitster@pobox.com> writes:

> Thomas Rast <trast@student.ethz.ch> writes:
>
>>   Test                      this tree      
>>   -----------------------------------------
>>   0002.1: v[23]: ls-files   0.13(0.11+0.02)
>>   0002.4: v4: ls-files      0.11(0.08+0.02)
>>   0002.5: v5: ls-files      0.09(0.06+0.02)
>>
>> I made up a hacky perf script on the spot, it's pasted at the far end of
>> this email.  It would most likely still be slower than v4 if we didn't
>> switch away from SHA1, though -- we haven't really spent much time
>> looking into the speed, except for one particular avoidance of name
>> copies that translated into a roughly 30% speedup.
>
> Do you mean by "switch away from SHA-1" that your suspicion is a
> large part of the speed-up may be coming from the fact that the
> index file as a whole is no longer hashed?

Yes.  Since the v5 index is only slightly smaller than the v4 one, the
reduction in data read cannot explain the difference alone.

I tried to quantify this a little.  For SHA1 and the v2/v4 index
(25MB/14MB, resp.), I get about 70ms/44ms for

  time git hash-object --stdin <.git/index

On the other hand I get about 35ms/22ms for

  time ~/g/test-crc32 .git/index

I do have a system crc32 utility, but it uses read() in 8k blocks
instead of mmap() and takes about 87ms.
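
The block size only affects the speed, not the result, because the
checksum can be carried from one piece to the next.  A minimal sketch of
that property, assuming a slow bitwise CRC-32 as a stand-in for zlib's
crc32(); the names crc32_update(), crc32_whole() and crc32_chunked() are
made up for this example and are not part of the patch below:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Bitwise CRC-32 (reflected polynomial 0xEDB88320), a slow stand-in for
 * zlib's crc32().  It takes the running CRC and returns the updated one,
 * so a buffer can be fed in pieces.
 */
static uint32_t crc32_update(uint32_t crc, const unsigned char *buf, size_t len)
{
	crc = ~crc;
	while (len--) {
		crc ^= *buf++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
	}
	return ~crc;
}

/* One pass over the whole buffer, as with an mmap()ed file. */
static uint32_t crc32_whole(const unsigned char *buf, size_t len)
{
	return crc32_update(0, buf, len);
}

/* Same data fed in fixed-size pieces, as a read() loop would do. */
static uint32_t crc32_chunked(const unsigned char *buf, size_t len, size_t chunk)
{
	uint32_t crc = 0;

	assert(chunk > 0);
	while (len > 0) {
		size_t n = len < chunk ? len : chunk;

		crc = crc32_update(crc, buf, n);
		buf += n;
		len -= n;
	}
	return crc;
}
```

Calling crc32_chunked(buf, len, 8192) should give the same value as
crc32_whole(buf, len); any gap in wall-clock time then comes from how the
bytes reach the function, not from the checksum itself.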

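So we can see that the switch from 25MB to 14MB fully explains the
speedup for v2->v4 (about 26ms less SHA1 time against the 20ms gained
in ls-files), and the switch from SHA1 to CRC32 explains the speedup
for v4->v5 (about 22ms less checksum time against another 20ms gained).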

However, aside from gaining 20ms here, CRC32 is also suitable for
checking very short chunks of data, as is planned for the partial
loading support in v5.
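
To make the "short chunks" point concrete, here is a rough sketch of
checking one chunk in isolation.  The layout is purely hypothetical
(payload followed by a 4-byte big-endian CRC-32) and is not taken from
the v5 format; chunk_is_valid() and crc32_update() are names invented
for the example, and the bitwise CRC again stands in for zlib's crc32():

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320), stand-in for zlib. */
static uint32_t crc32_update(uint32_t crc, const unsigned char *buf, size_t len)
{
	crc = ~crc;
	while (len--) {
		crc ^= *buf++;
		for (int bit = 0; bit < 8; bit++)
			crc = (crc >> 1) ^ (0xEDB88320u & -(crc & 1));
	}
	return ~crc;
}

/*
 * Hypothetical chunk layout: payload bytes, then a 4-byte big-endian
 * CRC-32 of the payload.  Returns 1 if the stored checksum matches,
 * 0 otherwise (including chunks too short to hold a checksum).
 */
static int chunk_is_valid(const unsigned char *chunk, size_t len)
{
	uint32_t stored;

	assert(chunk != NULL);
	if (len < 4)
		return 0;
	stored = (uint32_t)chunk[len - 4] << 24 |
		 (uint32_t)chunk[len - 3] << 16 |
		 (uint32_t)chunk[len - 2] << 8 |
		 (uint32_t)chunk[len - 1];
	return crc32_update(0, chunk, len - 4) == stored;
}
```

A partial reader built along these lines would only need to touch the
bytes of the chunk it wants, whereas a single whole-file hash forces a
read of everything before anything can be trusted.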

> As long as the new format allows us to notice corruption in the file
> to a similar degree of confidence by some other means, I personally
> do not see it as a regression in safety.
> 
> We however eventually would need to hook the logic to check for
> index corruption into fsck.  Actually adding such a code to fsck can
> and probably should remain outside the GSoC project, but please make
> sure you have necessary checksums in the format to allow us to do so
> in the future.

I actually expect that a full loading of the index will verify all
checksums that are present in the file.  Since file additions and such
will still need a full rewrite, and thus a full read, I expect this to
happen every so often as a matter of normal operations.  fsck could of
course still learn to load the index at some point, for good measure.


diff --git i/Makefile w/Makefile
index 63eacda..76856bc 100644
--- i/Makefile
+++ w/Makefile
@@ -481,6 +481,7 @@ X =
 PROGRAMS += $(patsubst %.o,git-%$X,$(PROGRAM_OBJS))
 
 TEST_PROGRAMS_NEED_X += test-chmtime
+TEST_PROGRAMS_NEED_X += test-crc32
 TEST_PROGRAMS_NEED_X += test-credential
 TEST_PROGRAMS_NEED_X += test-ctype
 TEST_PROGRAMS_NEED_X += test-date
diff --git i/test-crc32.c w/test-crc32.c
index e69de29..092de48 100644
--- i/test-crc32.c
+++ w/test-crc32.c
@@ -0,0 +1,32 @@
+#include "git-compat-util.h"
+#include <zlib.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <fcntl.h>
+#include <unistd.h>
+#include <sys/mman.h>
+
+int main (int argc, char *argv[])
+
+{
+	unsigned int crc;
+	struct stat st;
+	int fd;
+	void *map;
+
+	if (argc != 2)
+		die("usage: %s <file>", argv[0]);
+	fd = open(argv[1], O_RDONLY);
+	if (fd < 0)
+		die_errno("open");
+	if (fstat(fd, &st) < 0)
+		die_errno("fstat");
+	map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
+	if (map == MAP_FAILED)
+		die_errno("mmap");
+
+	crc = crc32(0, map, st.st_size);
+	printf("%08x\n", crc);
+
+	return 0;
+}


-- 
Thomas Rast
trast@{inf,student}.ethz.ch

