git.vger.kernel.org archive mirror
From: "Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>
To: git@vger.kernel.org
Cc: "Thomas Rast" <trast@inf.ethz.ch>,
	"Joshua Redstone" <joshua.redstone@fb.com>,
	"Nguyễn Thái Ngọc Duy" <pclouds@gmail.com>
Subject: [PATCH 6/6] Automatically switch to crc32 checksum for index when it's too large
Date: Mon,  6 Feb 2012 12:48:39 +0700
Message-ID: <1328507319-24687-6-git-send-email-pclouds@gmail.com>
In-Reply-To: <1328507319-24687-1-git-send-email-pclouds@gmail.com>

In an experiment with an -O3 build on an Intel D510 @ 1.66GHz, index
reading time exceeds 0.5s at around 250k entries. Switching to crc32
brings it back below 0.2s.

On an index with 4M files, reading takes ~8.4s with SHA-1 and 2.8s with crc32.

Signed-off-by: Nguyễn Thái Ngọc Duy <pclouds@gmail.com>
---
 I know of no real repositories of this size, though. gentoo-x86 is "only"
 120k. I haven't checked the libreoffice repo yet.

 On an index with 2M files, allocating one big block (i.e. reverting debed2a
 (read-cache.c: allocate index entries individually - 2011-10-24))
 saves about 0.3s. Maybe we can allocate one big block, then malloc
 separately when the block is fully used.

 Writing time is still high. "git update-index --crc32" on a 250k-entry
 crc32 index takes 0.9s (so writing time is about 0.5s).

 A better solution may be narrow clone (or just the narrow checkout
 part), where the index only contains entries from checked-out
 subdirectories.
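
 The "one big block, then malloc separately" idea above could look
 roughly like the sketch below. This is not code from this series or
 from read-cache.c; struct entry_pool and the pool_* helpers are
 hypothetical names made up for illustration. It only shows the
 fallback: entries come from one upfront block until it is full, and
 later entries are malloc'ed individually.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical pool, for illustration only (not git's data structure). */
struct entry_pool {
	char *block;	/* one big upfront allocation */
	size_t size;	/* total bytes in block */
	size_t used;	/* bytes handed out so far */
};

int pool_init(struct entry_pool *p, size_t size)
{
	p->block = malloc(size);
	p->size = p->block ? size : 0;
	p->used = 0;
	return p->block ? 0 : -1;
}

/* Round up so that every pointer handed out stays suitably aligned. */
static size_t align_up(size_t n)
{
	const size_t a = _Alignof(max_align_t);
	return (n + a - 1) & ~(a - 1);
}

/* Does ptr point into the big block? */
static int pool_owns(const struct entry_pool *p, const void *ptr)
{
	uintptr_t lo = (uintptr_t)p->block;
	uintptr_t x = (uintptr_t)ptr;
	return p->block && x >= lo && x < lo + p->size;
}

/*
 * Hand out len bytes from the block while it has room; once it is
 * full, fall back to an individual malloc(). *from_pool reports
 * which path was taken.
 */
void *pool_alloc(struct entry_pool *p, size_t len, int *from_pool)
{
	size_t need = align_up(len);

	if (need >= len && need <= p->size - p->used) {
		void *ptr = p->block + p->used;
		p->used += need;
		*from_pool = 1;
		return ptr;
	}
	*from_pool = 0;
	return malloc(len);
}

/* Only entries that came from malloc() may be freed one by one. */
void pool_free_entry(struct entry_pool *p, void *ptr)
{
	if (!pool_owns(p, ptr))
		free(ptr);
}

/* Drop the big block, and with it every entry carved out of it. */
void pool_release(struct entry_pool *p)
{
	free(p->block);
	p->block = NULL;
	p->size = p->used = 0;
}
```

 With a 64-byte pool, two 24-byte requests are served from the block
 and a third falls back to malloc(). The price of this scheme is that
 freeing an entry needs the ownership check, which is presumably part
 of why entries were switched to individual allocation in debed2a.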

 Documentation/config.txt |    7 +++++++
 builtin/update-index.c   |    1 +
 cache.h                  |    1 +
 config.c                 |    5 +++++
 environment.c            |    1 +
 read-cache.c             |    8 ++++++++
 6 files changed, 23 insertions(+), 0 deletions(-)

diff --git a/Documentation/config.txt b/Documentation/config.txt
index abeb82b..55b7596 100644
--- a/Documentation/config.txt
+++ b/Documentation/config.txt
@@ -540,6 +540,13 @@ relatively high IO latencies.  With this set to 'true', git will do the
 index comparison to the filesystem data in parallel, allowing
 overlapping IO's.
 
+core.crc32IndexThreshold::
+	Usually SHA-1 is used to check index integrity. When the
+	number of entries in the index exceeds this threshold, crc32 is
+	used instead. Zero means SHA-1 is always used. A negative
+	value disables this threshold (i.e. crc32 or SHA-1 is decided
+	by other means).
+
 core.createObject::
 	You can set this to 'link', in which case a hardlink followed by
 	a delete of the source are used to make sure that object creation
diff --git a/builtin/update-index.c b/builtin/update-index.c
index 6913226..5cb51c7 100644
--- a/builtin/update-index.c
+++ b/builtin/update-index.c
@@ -856,6 +856,7 @@ int cmd_update_index(int argc, const char **argv, const char *prefix)
 	argc = parse_options_end(&ctx);
 
 	if (do_crc != -1) {
+		core_crc32_index_threshold = -1;
 		if (do_crc)
 			the_index.hdr_flags |= CACHE_F_CRC;
 		else
diff --git a/cache.h b/cache.h
index 7352402..d05856b 100644
--- a/cache.h
+++ b/cache.h
@@ -610,6 +610,7 @@ extern unsigned long pack_size_limit_cfg;
 extern int read_replace_refs;
 extern int fsync_object_files;
 extern int core_preload_index;
+extern int core_crc32_index_threshold;
 extern int core_apply_sparse_checkout;
 
 enum branch_track {
diff --git a/config.c b/config.c
index 40f9c6d..905e071 100644
--- a/config.c
+++ b/config.c
@@ -671,6 +671,11 @@ static int git_default_core_config(const char *var, const char *value)
 		return 0;
 	}
 
+	if (!strcmp(var, "core.crc32indexthreshold")) {
+		core_crc32_index_threshold = git_config_int(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "core.createobject")) {
 		if (!strcmp(value, "rename"))
 			object_creation_mode = OBJECT_CREATION_USES_RENAMES;
diff --git a/environment.c b/environment.c
index c93b8f4..9d9dfc2 100644
--- a/environment.c
+++ b/environment.c
@@ -66,6 +66,7 @@ unsigned long pack_size_limit_cfg;
 
 /* Parallel index stat data preload? */
 int core_preload_index = 0;
+int core_crc32_index_threshold = 250000;
 
 /* This is set by setup_git_dir_gently() and/or git_default_config() */
 char *git_work_tree_cfg;
diff --git a/read-cache.c b/read-cache.c
index a34878e..fd032d8 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -1582,6 +1582,14 @@ int write_index(struct index_state *istate, int newfd)
 		}
 	}
 
+	if (core_crc32_index_threshold >= 0) {
+		if (core_crc32_index_threshold > 0 &&
+		    istate->cache_nr >= core_crc32_index_threshold)
+			istate->hdr_flags |= CACHE_F_CRC;
+		else
+			istate->hdr_flags &= ~CACHE_F_CRC;
+	}
+
 	hdr.h.hdr_signature = htonl(CACHE_SIGNATURE);
 	if (istate->hdr_flags) {
 		hdr.h.hdr_version = htonl(4);
-- 
1.7.8.36.g69ee2
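
The three cases handled by the write_index() hunk above (negative, zero, and positive threshold) can be checked in isolation with a small standalone function. This is only a sketch: apply_crc32_threshold() is a hypothetical helper, not part of the patch, and the value used for CACHE_F_CRC is an assumption, since the real flag is defined earlier in the series. Note that the comparison is ">=", so an index with exactly as many entries as the threshold already switches to crc32, whereas the documentation text says "exceeds".

```c
/* Assumed value for illustration; the real flag comes from patch 4/6 and 5/6. */
#define CACHE_F_CRC 1u

/*
 * Same decision as the hunk in write_index(), as a pure function:
 *   threshold <  0 : leave the flags alone (decided by other means)
 *   threshold == 0 : always SHA-1, so the crc32 flag is cleared
 *   threshold >  0 : crc32 once cache_nr >= threshold, SHA-1 below it
 */
unsigned int apply_crc32_threshold(unsigned int hdr_flags,
				   unsigned int cache_nr,
				   int threshold)
{
	if (threshold >= 0) {
		if (threshold > 0 && cache_nr >= (unsigned int)threshold)
			hdr_flags |= CACHE_F_CRC;
		else
			hdr_flags &= ~CACHE_F_CRC;
	}
	return hdr_flags;
}
```

With the default of 250000, an index with 249999 entries keeps SHA-1 and one with 250000 entries gets the crc32 flag. A threshold of -1, which is what "git update-index --crc32" sets in this patch, leaves whatever flag is already present untouched.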
