Git development

Git development
 help / color / mirror / Atom feed

* [PATCH v3 0/6] Bulk check-in
From: Junio C Hamano @ 2011-12-02  0:40 UTC (permalink / raw)
  To: git
In-Reply-To: <1322699263-14475-6-git-send-email-gitster@pobox.com>

I would declare that the earlier parts of the v2 that are about factoring
out various API pieces from existing code are basically completed, so they
are not part of this iteration.

The bulk-checkin patch from v2 has been tweaked a bit (deflate_to_pack()
initializes "already_hashed_to" pointer to 0, instead of the current file
position "seekback"), and then the rest of the series builds on top of it
to add a new in-pack encoding that I am tentatively calling "chunked".

The basic idea is to represent a large/huge blob as a concatenation of
smaller blobs. An entry in a pack in "chunked" representation records a
list of object names of the component blob objects.  The object name given
to such a blob is computed exactly the same way as before. In other words,
the name of a object does not depend on its representation; we hash "blob
<size> NUL" and the whole large blob contents to come up with its name. It
is *not* the hash of the component blob object names.

As can be seen in the log message of the "support chunked-object encoding"
patch, many pieces are still missing from this series and filling them
will be a long and tortuous journey. But we need to start somewhere.

I specifically excluded any heuristics to split large objects into chunks
in a self-synchronising way so that a small edit near the beginning of a
large blob results in a handful of new component blobs followed by the
same component blobs as used to represent the same blob before such an
edit, and I do not plan to work on that part myself. My impression from
listening Avery's plug for "bup" is that it is a solved problem; it should
be reasonably straightforward to lift the logic and plug it into the
framework presented here (once the codebase gets solid enough, that is).

After this series, the next step for me is likely to teach the streaming
interface about "chunked" objects, and then pack-objects to take notice
and reuse "chunked" representation when sending things out (which means
that sending a "chunked" encoded blob would involve sending the component
blobs it uses, among other things), but I expect that it will extend well
into next year.

Junio C Hamano (6):
  bulk-checkin: replace fast-import based implementation
  varint-in-pack: refactor varint encoding/decoding
  new representation types in the packstream
  bulk-checkin: allow the same data to be multiply hashed
  bulk-checkin: support chunked-object encoding
  chunked-object: fallback checkout codepaths

 Makefile               |    3 +
 builtin/add.c          |    5 +
 builtin/pack-objects.c |   34 ++---
 bulk-checkin.c         |  415 ++++++++++++++++++++++++++++++++++++++++++++++++
 bulk-checkin.h         |   17 ++
 cache.h                |   13 ++-
 config.c               |    9 +
 environment.c          |    2 +
 pack-write.c           |   50 +++++-
 pack.h                 |    2 +
 sha1_file.c            |  150 +++++++++---------
 split-chunk.c          |   28 ++++
 t/t1050-large.sh       |  135 +++++++++++++++-
 zlib.c                 |    9 +-
 14 files changed, 760 insertions(+), 112 deletions(-)
 create mode 100644 bulk-checkin.c
 create mode 100644 bulk-checkin.h
 create mode 100644 split-chunk.c

-- 
1.7.8.rc4.177.g4d64

^ permalink raw reply

* [PATCH v3 1/6] bulk-checkin: replace fast-import based implementation
From: Junio C Hamano @ 2011-12-02  0:40 UTC (permalink / raw)
  To: git
In-Reply-To: <1322786449-25753-1-git-send-email-gitster@pobox.com>

This extends the earlier approach to stream a large file directly from the
filesystem to its own packfile, and allows "git add" to send large files
directly into a single pack. Older code used to spawn fast-import, but the
new bulk-checkin API replaces it.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Makefile               |    2 +
 builtin/add.c          |    5 +
 builtin/pack-objects.c |    6 +-
 bulk-checkin.c         |  275 ++++++++++++++++++++++++++++++++++++++++++++++++
 bulk-checkin.h         |   16 +++
 cache.h                |    2 +
 config.c               |    4 +
 environment.c          |    1 +
 sha1_file.c            |   67 +-----------
 t/t1050-large.sh       |   94 +++++++++++++++--
 zlib.c                 |    9 ++-
 11 files changed, 403 insertions(+), 78 deletions(-)
 create mode 100644 bulk-checkin.c
 create mode 100644 bulk-checkin.h

diff --git a/Makefile b/Makefile
index 3139c19..418dd2e 100644
--- a/Makefile
+++ b/Makefile
@@ -505,6 +505,7 @@ LIB_H += argv-array.h
 LIB_H += attr.h
 LIB_H += blob.h
 LIB_H += builtin.h
+LIB_H += bulk-checkin.h
 LIB_H += cache.h
 LIB_H += cache-tree.h
 LIB_H += color.h
@@ -591,6 +592,7 @@ LIB_OBJS += base85.o
 LIB_OBJS += bisect.o
 LIB_OBJS += blob.o
 LIB_OBJS += branch.o
+LIB_OBJS += bulk-checkin.o
 LIB_OBJS += bundle.o
 LIB_OBJS += cache-tree.o
 LIB_OBJS += color.o
diff --git a/builtin/add.c b/builtin/add.c
index c59b0c9..1c42900 100644
--- a/builtin/add.c
+++ b/builtin/add.c
@@ -13,6 +13,7 @@
 #include "diff.h"
 #include "diffcore.h"
 #include "revision.h"
+#include "bulk-checkin.h"
 
 static const char * const builtin_add_usage[] = {
 	"git add [options] [--] <filepattern>...",
@@ -458,11 +459,15 @@ int cmd_add(int argc, const char **argv, const char *prefix)
 		free(seen);
 	}
 
+	plug_bulk_checkin();
+
 	exit_status |= add_files_to_cache(prefix, pathspec, flags);
 
 	if (add_new_files)
 		exit_status |= add_files(&dir, flags);
 
+	unplug_bulk_checkin();
+
  finish:
 	if (active_cache_changed) {
 		if (write_cache(newfd, active_cache, active_nr) ||
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index b458b6d..dde913e 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -76,7 +76,7 @@ static struct pack_idx_option pack_idx_opts;
 static const char *base_name;
 static int progress = 1;
 static int window = 10;
-static unsigned long pack_size_limit, pack_size_limit_cfg;
+static unsigned long pack_size_limit;
 static int depth = 50;
 static int delta_search_threads;
 static int pack_to_stdout;
@@ -2009,10 +2009,6 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 			    pack_idx_opts.version);
 		return 0;
 	}
-	if (!strcmp(k, "pack.packsizelimit")) {
-		pack_size_limit_cfg = git_config_ulong(k, v);
-		return 0;
-	}
 	return git_default_config(k, v, cb);
 }
 
diff --git a/bulk-checkin.c b/bulk-checkin.c
new file mode 100644
index 0000000..6b0b6d4
--- /dev/null
+++ b/bulk-checkin.c
@@ -0,0 +1,275 @@
+/*
+ * Copyright (c) 2011, Google Inc.
+ */
+#include "bulk-checkin.h"
+#include "csum-file.h"
+#include "pack.h"
+
+static int pack_compression_level = Z_DEFAULT_COMPRESSION;
+
+static struct bulk_checkin_state {
+	unsigned plugged:1;
+
+	char *pack_tmp_name;
+	struct sha1file *f;
+	off_t offset;
+	struct pack_idx_option pack_idx_opts;
+
+	struct pack_idx_entry **written;
+	uint32_t alloc_written;
+	uint32_t nr_written;
+} state;
+
+static void finish_bulk_checkin(struct bulk_checkin_state *state)
+{
+	unsigned char sha1[20];
+	char packname[PATH_MAX];
+	int i;
+
+	if (!state->f)
+		return;
+
+	if (state->nr_written == 0) {
+		close(state->f->fd);
+		unlink(state->pack_tmp_name);
+		goto clear_exit;
+	} else if (state->nr_written == 1) {
+		sha1close(state->f, sha1, CSUM_FSYNC);
+	} else {
+		int fd = sha1close(state->f, sha1, 0);
+		fixup_pack_header_footer(fd, sha1, state->pack_tmp_name,
+					 state->nr_written, sha1,
+					 state->offset);
+		close(fd);
+	}
+
+	sprintf(packname, "%s/pack/pack-", get_object_directory());
+	finish_tmp_packfile(packname, state->pack_tmp_name,
+			    state->written, state->nr_written,
+			    &state->pack_idx_opts, sha1);
+	for (i = 0; i < state->nr_written; i++)
+		free(state->written[i]);
+
+clear_exit:
+	free(state->written);
+	memset(state, 0, sizeof(*state));
+
+	/* Make objects we just wrote available to ourselves */
+	reprepare_packed_git();
+}
+
+static int already_written(struct bulk_checkin_state *state, unsigned char sha1[])
+{
+	int i;
+
+	/* The object may already exist in the repository */
+	if (has_sha1_file(sha1))
+		return 1;
+
+	/* Might want to keep the list sorted */
+	for (i = 0; i < state->nr_written; i++)
+		if (!hashcmp(state->written[i]->sha1, sha1))
+			return 1;
+
+	/* This is a new object we need to keep */
+	return 0;
+}
+
+/*
+ * Read the contents from fd for size bytes, streaming it to the
+ * packfile in state while updating the hash in ctx. Signal a failure
+ * by returning a negative value when the resulting pack would exceed
+ * the pack size limit and this is not the first object in the pack,
+ * so that the caller can discard what we wrote from the current pack
+ * by truncating it and opening a new one. The caller will then call
+ * us again after rewinding the input fd.
+ *
+ * The already_hashed_to pointer is kept untouched by the caller to
+ * make sure we do not hash the same byte when we are called
+ * again. This way, the caller does not have to checkpoint its hash
+ * status before calling us just in case we ask it to call us again
+ * with a new pack.
+ */
+static int stream_to_pack(struct bulk_checkin_state *state,
+			  git_SHA_CTX *ctx, off_t *already_hashed_to,
+			  int fd, size_t size, enum object_type type,
+			  const char *path, unsigned flags)
+{
+	git_zstream s;
+	unsigned char obuf[16384];
+	unsigned hdrlen;
+	int status = Z_OK;
+	int write_object = (flags & HASH_WRITE_OBJECT);
+	off_t offset = 0;
+
+	memset(&s, 0, sizeof(s));
+	git_deflate_init(&s, pack_compression_level);
+
+	hdrlen = encode_in_pack_object_header(type, size, obuf);
+	s.next_out = obuf + hdrlen;
+	s.avail_out = sizeof(obuf) - hdrlen;
+
+	while (status != Z_STREAM_END) {
+		unsigned char ibuf[16384];
+
+		if (size && !s.avail_in) {
+			ssize_t rsize = size < sizeof(ibuf) ? size : sizeof(ibuf);
+			if (xread(fd, ibuf, rsize) != rsize)
+				die("failed to read %d bytes from '%s'",
+				    (int)rsize, path);
+			offset += rsize;
+			if (*already_hashed_to < offset) {
+				size_t hsize = offset - *already_hashed_to;
+				if (rsize < hsize)
+					hsize = rsize;
+				if (hsize)
+					git_SHA1_Update(ctx, ibuf, hsize);
+				*already_hashed_to = offset;
+			}
+			s.next_in = ibuf;
+			s.avail_in = rsize;
+			size -= rsize;
+		}
+
+		status = git_deflate(&s, size ? 0 : Z_FINISH);
+
+		if (!s.avail_out || status == Z_STREAM_END) {
+			if (write_object) {
+				size_t written = s.next_out - obuf;
+
+				/* would we bust the size limit? */
+				if (state->nr_written &&
+				    pack_size_limit_cfg &&
+				    pack_size_limit_cfg < state->offset + written) {
+					git_deflate_abort(&s);
+					return -1;
+				}
+
+				sha1write(state->f, obuf, written);
+				state->offset += written;
+			}
+			s.next_out = obuf;
+			s.avail_out = sizeof(obuf);
+		}
+
+		switch (status) {
+		case Z_OK:
+		case Z_BUF_ERROR:
+		case Z_STREAM_END:
+			continue;
+		default:
+			die("unexpected deflate failure: %d", status);
+		}
+	}
+	git_deflate_end(&s);
+	return 0;
+}
+
+/* Lazily create backing packfile for the state */
+static void prepare_to_stream(struct bulk_checkin_state *state,
+			      unsigned flags)
+{
+	if (!(flags & HASH_WRITE_OBJECT) || state->f)
+		return;
+
+	state->f = create_tmp_packfile(&state->pack_tmp_name);
+	reset_pack_idx_option(&state->pack_idx_opts);
+
+	/* Pretend we are going to write only one object */
+	state->offset = write_pack_header(state->f, 1);
+	if (!state->offset)
+		die_errno("unable to write pack header");
+}
+
+static int deflate_to_pack(struct bulk_checkin_state *state,
+			   unsigned char result_sha1[],
+			   int fd, size_t size,
+			   enum object_type type, const char *path,
+			   unsigned flags)
+{
+	off_t seekback, already_hashed_to;
+	git_SHA_CTX ctx;
+	unsigned char obuf[16384];
+	unsigned header_len;
+	struct sha1file_checkpoint checkpoint;
+	struct pack_idx_entry *idx = NULL;
+
+	seekback = lseek(fd, 0, SEEK_CUR);
+	if (seekback == (off_t) -1)
+		return error("cannot find the current offset");
+
+	header_len = sprintf((char *)obuf, "%s %" PRIuMAX,
+			     typename(type), (uintmax_t)size) + 1;
+	git_SHA1_Init(&ctx);
+	git_SHA1_Update(&ctx, obuf, header_len);
+
+	/* Note: idx is non-NULL when we are writing */
+	if ((flags & HASH_WRITE_OBJECT) != 0)
+		idx = xcalloc(1, sizeof(*idx));
+
+	already_hashed_to = 0;
+
+	while (1) {
+		prepare_to_stream(state, flags);
+		if (idx) {
+			sha1file_checkpoint(state->f, &checkpoint);
+			idx->offset = state->offset;
+			crc32_begin(state->f);
+		}
+		if (!stream_to_pack(state, &ctx, &already_hashed_to,
+				    fd, size, type, path, flags))
+			break;
+		/*
+		 * Writing this object to the current pack will make
+		 * it too big; we need to truncate it, start a new
+		 * pack, and write into it.
+		 */
+		if (!idx)
+			die("BUG: should not happen");
+		sha1file_truncate(state->f, &checkpoint);
+		state->offset = checkpoint.offset;
+		finish_bulk_checkin(state);
+		if (lseek(fd, seekback, SEEK_SET) == (off_t) -1)
+			return error("cannot seek back");
+	}
+	git_SHA1_Final(result_sha1, &ctx);
+	if (!idx)
+		return 0;
+
+	idx->crc32 = crc32_end(state->f);
+	if (already_written(state, result_sha1)) {
+		sha1file_truncate(state->f, &checkpoint);
+		state->offset = checkpoint.offset;
+		free(idx);
+	} else {
+		hashcpy(idx->sha1, result_sha1);
+		ALLOC_GROW(state->written,
+			   state->nr_written + 1,
+			   state->alloc_written);
+		state->written[state->nr_written++] = idx;
+	}
+	return 0;
+}
+
+int index_bulk_checkin(unsigned char *sha1,
+		       int fd, size_t size, enum object_type type,
+		       const char *path, unsigned flags)
+{
+	int status = deflate_to_pack(&state, sha1, fd, size, type,
+				     path, flags);
+	if (!state.plugged)
+		finish_bulk_checkin(&state);
+	return status;
+}
+
+void plug_bulk_checkin(void)
+{
+	state.plugged = 1;
+}
+
+void unplug_bulk_checkin(void)
+{
+	state.plugged = 0;
+	if (state.f)
+		finish_bulk_checkin(&state);
+}
diff --git a/bulk-checkin.h b/bulk-checkin.h
new file mode 100644
index 0000000..4f599f8
--- /dev/null
+++ b/bulk-checkin.h
@@ -0,0 +1,16 @@
+/*
+ * Copyright (c) 2011, Google Inc.
+ */
+#ifndef BULK_CHECKIN_H
+#define BULK_CHECKIN_H
+
+#include "cache.h"
+
+extern int index_bulk_checkin(unsigned char sha1[],
+			      int fd, size_t size, enum object_type type,
+			      const char *path, unsigned flags);
+
+extern void plug_bulk_checkin(void);
+extern void unplug_bulk_checkin(void);
+
+#endif
diff --git a/cache.h b/cache.h
index 2e6ad36..4f20861 100644
--- a/cache.h
+++ b/cache.h
@@ -35,6 +35,7 @@ int git_inflate(git_zstream *, int flush);
 void git_deflate_init(git_zstream *, int level);
 void git_deflate_init_gzip(git_zstream *, int level);
 void git_deflate_end(git_zstream *);
+int git_deflate_abort(git_zstream *);
 int git_deflate_end_gently(git_zstream *);
 int git_deflate(git_zstream *, int flush);
 unsigned long git_deflate_bound(git_zstream *, unsigned long);
@@ -598,6 +599,7 @@ extern size_t packed_git_window_size;
 extern size_t packed_git_limit;
 extern size_t delta_base_cache_limit;
 extern unsigned long big_file_threshold;
+extern unsigned long pack_size_limit_cfg;
 extern int read_replace_refs;
 extern int fsync_object_files;
 extern int core_preload_index;
diff --git a/config.c b/config.c
index edf9914..c736802 100644
--- a/config.c
+++ b/config.c
@@ -797,6 +797,10 @@ int git_default_config(const char *var, const char *value, void *dummy)
 		return 0;
 	}
 
+	if (!strcmp(var, "pack.packsizelimit")) {
+		pack_size_limit_cfg = git_config_ulong(var, value);
+		return 0;
+	}
 	/* Add other config variables here and to Documentation/config.txt. */
 	return 0;
 }
diff --git a/environment.c b/environment.c
index 0bee6a7..31e4284 100644
--- a/environment.c
+++ b/environment.c
@@ -60,6 +60,7 @@ char *notes_ref_name;
 int grafts_replace_parents = 1;
 int core_apply_sparse_checkout;
 struct startup_info *startup_info;
+unsigned long pack_size_limit_cfg;
 
 /* Parallel index stat data preload? */
 int core_preload_index = 0;
diff --git a/sha1_file.c b/sha1_file.c
index 27f3b9b..c96e366 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -18,6 +18,7 @@
 #include "refs.h"
 #include "pack-revindex.h"
 #include "sha1-lookup.h"
+#include "bulk-checkin.h"
 
 #ifndef O_NOATIME
 #if defined(__linux__) && (defined(__i386__) || defined(__PPC__))
@@ -2679,10 +2680,8 @@ static int index_core(unsigned char *sha1, int fd, size_t size,
 }
 
 /*
- * This creates one packfile per large blob, because the caller
- * immediately wants the result sha1, and fast-import can report the
- * object name via marks mechanism only by closing the created
- * packfile.
+ * This creates one packfile per large blob unless bulk-checkin
+ * machinery is "plugged".
  *
  * This also bypasses the usual "convert-to-git" dance, and that is on
  * purpose. We could write a streaming version of the converting
@@ -2696,65 +2695,7 @@ static int index_stream(unsigned char *sha1, int fd, size_t size,
 			enum object_type type, const char *path,
 			unsigned flags)
 {
-	struct child_process fast_import;
-	char export_marks[512];
-	const char *argv[] = { "fast-import", "--quiet", export_marks, NULL };
-	char tmpfile[512];
-	char fast_import_cmd[512];
-	char buf[512];
-	int len, tmpfd;
-
-	strcpy(tmpfile, git_path("hashstream_XXXXXX"));
-	tmpfd = git_mkstemp_mode(tmpfile, 0600);
-	if (tmpfd < 0)
-		die_errno("cannot create tempfile: %s", tmpfile);
-	if (close(tmpfd))
-		die_errno("cannot close tempfile: %s", tmpfile);
-	sprintf(export_marks, "--export-marks=%s", tmpfile);
-
-	memset(&fast_import, 0, sizeof(fast_import));
-	fast_import.in = -1;
-	fast_import.argv = argv;
-	fast_import.git_cmd = 1;
-	if (start_command(&fast_import))
-		die_errno("index-stream: git fast-import failed");
-
-	len = sprintf(fast_import_cmd, "blob\nmark :1\ndata %lu\n",
-		      (unsigned long) size);
-	write_or_whine(fast_import.in, fast_import_cmd, len,
-		       "index-stream: feeding fast-import");
-	while (size) {
-		char buf[10240];
-		size_t sz = size < sizeof(buf) ? size : sizeof(buf);
-		ssize_t actual;
-
-		actual = read_in_full(fd, buf, sz);
-		if (actual < 0)
-			die_errno("index-stream: reading input");
-		if (write_in_full(fast_import.in, buf, actual) != actual)
-			die_errno("index-stream: feeding fast-import");
-		size -= actual;
-	}
-	if (close(fast_import.in))
-		die_errno("index-stream: closing fast-import");
-	if (finish_command(&fast_import))
-		die_errno("index-stream: finishing fast-import");
-
-	tmpfd = open(tmpfile, O_RDONLY);
-	if (tmpfd < 0)
-		die_errno("index-stream: cannot open fast-import mark");
-	len = read(tmpfd, buf, sizeof(buf));
-	if (len < 0)
-		die_errno("index-stream: reading fast-import mark");
-	if (close(tmpfd) < 0)
-		die_errno("index-stream: closing fast-import mark");
-	if (unlink(tmpfile))
-		die_errno("index-stream: unlinking fast-import mark");
-	if (len != 44 ||
-	    memcmp(":1 ", buf, 3) ||
-	    get_sha1_hex(buf + 3, sha1))
-		die_errno("index-stream: unexpected fast-import mark: <%s>", buf);
-	return 0;
+	return index_bulk_checkin(sha1, fd, size, type, path, flags);
 }
 
 int index_fd(unsigned char *sha1, int fd, struct stat *st,
diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index deba111..29d6024 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -7,21 +7,97 @@ test_description='adding and checking out large blobs'
 
 test_expect_success setup '
 	git config core.bigfilethreshold 200k &&
-	echo X | dd of=large bs=1k seek=2000
+	echo X | dd of=large1 bs=1k seek=2000 &&
+	echo X | dd of=large2 bs=1k seek=2000 &&
+	echo X | dd of=large3 bs=1k seek=2000 &&
+	echo Y | dd of=huge bs=1k seek=2500
 '
 
-test_expect_success 'add a large file' '
-	git add large &&
-	# make sure we got a packfile and no loose objects
-	test -f .git/objects/pack/pack-*.pack &&
-	test ! -f .git/objects/??/??????????????????????????????????????
+test_expect_success 'add a large file or two' '
+	git add large1 huge large2 &&
+	# make sure we got a single packfile and no loose objects
+	bad= count=0 idx= &&
+	for p in .git/objects/pack/pack-*.pack
+	do
+		count=$(( $count + 1 ))
+		if test -f "$p" && idx=${p%.pack}.idx && test -f "$idx"
+		then
+			continue
+		fi
+		bad=t
+	done &&
+	test -z "$bad" &&
+	test $count = 1 &&
+	cnt=$(git show-index <"$idx" | wc -l) &&
+	test $cnt = 2 &&
+	for l in .git/objects/??/??????????????????????????????????????
+	do
+		test -f "$l" || continue
+		bad=t
+	done &&
+	test -z "$bad" &&
+
+	# attempt to add another copy of the same
+	git add large3 &&
+	bad= count=0 &&
+	for p in .git/objects/pack/pack-*.pack
+	do
+		count=$(( $count + 1 ))
+		if test -f "$p" && idx=${p%.pack}.idx && test -f "$idx"
+		then
+			continue
+		fi
+		bad=t
+	done &&
+	test -z "$bad" &&
+	test $count = 1
 '
 
 test_expect_success 'checkout a large file' '
-	large=$(git rev-parse :large) &&
-	git update-index --add --cacheinfo 100644 $large another &&
+	large1=$(git rev-parse :large1) &&
+	git update-index --add --cacheinfo 100644 $large1 another &&
 	git checkout another &&
-	cmp large another ;# this must not be test_cmp
+	cmp large1 another ;# this must not be test_cmp
+'
+
+test_expect_success 'packsize limit' '
+	test_create_repo mid &&
+	(
+		cd mid &&
+		git config core.bigfilethreshold 64k &&
+		git config pack.packsizelimit 256k &&
+
+		# mid1 and mid2 will fit within 256k limit but
+		# appending mid3 will bust the limit and will
+		# result in a separate packfile.
+		test-genrandom "a" $(( 66 * 1024 )) >mid1 &&
+		test-genrandom "b" $(( 80 * 1024 )) >mid2 &&
+		test-genrandom "c" $(( 128 * 1024 )) >mid3 &&
+		git add mid1 mid2 mid3 &&
+
+		count=0
+		for pi in .git/objects/pack/pack-*.idx
+		do
+			test -f "$pi" && count=$(( $count + 1 ))
+		done &&
+		test $count = 2 &&
+
+		(
+			git hash-object --stdin <mid1
+			git hash-object --stdin <mid2
+			git hash-object --stdin <mid3
+		) |
+		sort >expect &&
+
+		for pi in .git/objects/pack/pack-*.idx
+		do
+			git show-index <"$pi"
+		done |
+		sed -e "s/^[0-9]* \([0-9a-f]*\) .*/\1/" |
+		sort >actual &&
+
+		test_cmp expect actual
+	)
 '
 
 test_done
diff --git a/zlib.c b/zlib.c
index 3c63d48..2b2c0c7 100644
--- a/zlib.c
+++ b/zlib.c
@@ -188,13 +188,20 @@ void git_deflate_init_gzip(git_zstream *strm, int level)
 	    strm->z.msg ? strm->z.msg : "no message");
 }
 
-void git_deflate_end(git_zstream *strm)
+int git_deflate_abort(git_zstream *strm)
 {
 	int status;
 
 	zlib_pre_call(strm);
 	status = deflateEnd(&strm->z);
 	zlib_post_call(strm);
+	return status;
+}
+
+void git_deflate_end(git_zstream *strm)
+{
+	int status = git_deflate_abort(strm);
+
 	if (status == Z_OK)
 		return;
 	error("deflateEnd: %s (%s)", zerr_to_string(status),
-- 
1.7.8.rc4.177.g4d64

^ permalink raw reply related

* [PATCH v3 3/6] new representation types in the packstream
From: Junio C Hamano @ 2011-12-02  0:40 UTC (permalink / raw)
  To: git
In-Reply-To: <1322786449-25753-1-git-send-email-gitster@pobox.com>

In addition to four basic types (commit, tree, blob and tag), the pack
stream can encode a few other "representation" types, such as REF_DELTA
and OFS_DELTA. As we allocate 3 bits in the first byte for this purpose,
we do not have much room to add new representation types in place, but we
do have one value reserved for future expansion.

When encoding a new representation type, the early part of the in-pack
object header is encoded as if its type is OBJ_EXT (= 5) using exactly the
same way as before. That is, the lower 4-bit of the first byte is used for
the lowest 4-bit of the size information, the next 3-bit has the type
information, and the MSB says if the subsequent bytes encodes higher bits
for the size information.

An in-pack object header that records OBJ_EXT as the type is followed by
an integer in the same variable-length encoding as OFS_DELTA offset is
encoded. This value is the real type of the representation minus 8 (as we
do not need to use OBJ_EXT to encode types smaller than 8).  Because we do
not foresee very many representation types, in practice we would have a
single byte with its MSB clear, to represent types 8-135.

The code does not type=8 and upwards for anything yet.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 cache.h      |    4 +++-
 pack-write.c |   23 +++++++++++++++++------
 sha1_file.c  |   11 +++++++++++
 3 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/cache.h b/cache.h
index 4f20861..4a3b421 100644
--- a/cache.h
+++ b/cache.h
@@ -381,12 +381,14 @@ enum object_type {
 	OBJ_TREE = 2,
 	OBJ_BLOB = 3,
 	OBJ_TAG = 4,
-	/* 5 for future expansion */
+	OBJ_EXT = 5,
 	OBJ_OFS_DELTA = 6,
 	OBJ_REF_DELTA = 7,
 	OBJ_ANY,
 	OBJ_MAX
 };
+#define OBJ_LAST_BASE_TYPE OBJ_REF_DELTA
+#define OBJ_LAST_VALID_TYPE OBJ_REF_DELTA

 static inline enum object_type object_type(unsigned int mode)
 {
diff --git a/pack-write.c b/pack-write.c
index 5702cec..9309dd1 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -338,22 +338,33 @@ int encode_in_pack_varint(uintmax_t value, unsigned char *buf)
  */
 int encode_in_pack_object_header(enum object_type type, uintmax_t size, unsigned char *hdr)
 {
-	int n = 1;
+	unsigned char *hdr_base;
 	unsigned char c;
+	enum object_type header_type;

-	if (type < OBJ_COMMIT || type > OBJ_REF_DELTA)
+	if (type < OBJ_COMMIT || OBJ_LAST_VALID_TYPE < type)
 		die("bad type %d", type);
+	else if (OBJ_LAST_BASE_TYPE < type)
+		header_type = OBJ_EXT;
+	else
+		header_type = type;

-	c = (type << 4) | (size & 15);
+	c = (header_type << 4) | (size & 15);
 	size >>= 4;
+	hdr_base = hdr;
 	while (size) {
 		*hdr++ = c | 0x80;
 		c = size & 0x7f;
 		size >>= 7;
-		n++;
 	}
-	*hdr = c;
-	return n;
+	*hdr++ = c;
+	if (header_type != type) {
+		int sz;
+		type = type - (OBJ_LAST_BASE_TYPE + 1);
+		sz = encode_in_pack_varint(type, hdr);
+		hdr += sz;
+	}
+	return hdr - hdr_base;
 }

 struct sha1file *create_tmp_packfile(char **pack_tmp_name)
diff --git a/sha1_file.c b/sha1_file.c
index f066c2b..14902cc 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1275,6 +1275,17 @@ unsigned long unpack_object_header_buffer(const unsigned char *buf,
 		shift += 7;
 	}
 	*sizep = size;
+	if (*type == OBJ_EXT) {
+		const unsigned char *p = buf + used;
+		uintmax_t val = decode_in_pack_varint(&p);
+
+		if (p == buf + used && !val) {
+			error("bad extended object type");
+			return 0;
+		}
+		*type = val + (OBJ_LAST_BASE_TYPE + 1);
+		used = p - buf;
+	}
 	return used;
 }

-- 
1.7.8.rc4.177.g4d64

^ permalink raw reply related

* [PATCH v3 6/6] chunked-object: fallback checkout codepaths
From: Junio C Hamano @ 2011-12-02  0:40 UTC (permalink / raw)
  To: git
In-Reply-To: <1322786449-25753-1-git-send-email-gitster@pobox.com>

This prepares the default codepaths based on the traditional "slurping
everything in-core" model around read_sha1_file() API for objects that use
chunked encoding. Needless to say, these codepaths are unsuitable for the
kind of objects that use chunked encoding and are intended to only serve
as the fallback where specialized "large object API" support is still
lacking.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 sha1_file.c      |   54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 t/t1050-large.sh |   14 +++++++++++++-
 2 files changed, 67 insertions(+), 1 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 14902cc..7355716 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1607,6 +1607,11 @@ static int packed_object_info(struct packed_git *p, off_t obj_offset,
 		if (sizep)
 			*sizep = size;
 		break;
+	case OBJ_CHUNKED_BLOB:
+		if (sizep)
+			*sizep = size;
+		type = OBJ_DEKNUHC(type);
+		break;
 	default:
 		error("unknown object type %i at offset %"PRIuMAX" in %s",
 		      type, (uintmax_t)obj_offset, p->pack_name);
@@ -1648,6 +1653,51 @@ static void *unpack_compressed_entry(struct packed_git *p,
 	return buffer;
 }
 
+static void *unpack_chunked_entry(struct packed_git *p,
+				  struct pack_window **w_curs,
+				  off_t curpos,
+				  unsigned long size)
+{
+	/*
+	 * *NOTE* *NOTE* *NOTE*
+	 *
+	 * In the longer term, we should aim to exercise this codepath
+	 * less and less often, as it defeats the whole purpose of
+	 * chuncked object encoding!
+	 */
+	unsigned char *buffer;
+	const unsigned char *in, *ptr;
+	unsigned long avail, ofs;
+	int chunk_cnt;
+
+	buffer = xmallocz(size);
+	in = use_pack(p, w_curs, curpos, &avail);
+	ptr = in;
+	chunk_cnt = decode_in_pack_varint(&ptr);
+	curpos += ptr - in;
+	ofs = 0;
+	while (chunk_cnt--) {
+		unsigned long csize;
+		unsigned char *data;
+		enum object_type type;
+
+		in = use_pack(p, w_curs, curpos, &avail);
+		data = read_sha1_file(in, &type, &csize);
+		if (!data)
+			die("malformed chunked object contents ('%s' does not exist)",
+			    sha1_to_hex(in));
+		if (type != OBJ_BLOB)
+			die("malformed chunked object contents (not a blob)");
+		if (size < ofs + csize)
+			die("malformed chunked object contents (sizes do not add up)");
+		memcpy(buffer + ofs, data, csize);
+		ofs += csize;
+		curpos += 20;
+		free(data);
+	}
+	return buffer;
+}
+
 #define MAX_DELTA_CACHE (256)
 
 static size_t delta_base_cached;
@@ -1883,6 +1933,10 @@ void *unpack_entry(struct packed_git *p, off_t obj_offset,
 	case OBJ_TAG:
 		data = unpack_compressed_entry(p, &w_curs, curpos, *sizep);
 		break;
+	case OBJ_CHUNKED_BLOB:
+		data = unpack_chunked_entry(p, &w_curs, curpos, *sizep);
+		*type = OBJ_DEKNUHC(*type);
+		break;
 	default:
 		data = NULL;
 		error("unknown object type %i at offset %"PRIuMAX" in %s",
diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index d6cb66d..eea45d1 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -124,8 +124,20 @@ test_expect_success 'split limit' '
 		# switch to a better chunking heuristics.
 		echo cruft >head &&
 		cat split >>head &&
-		git add head
+		git add head &&
 
+		echo blob >expect &&
+		git cat-file -t :split >actual &&
+		test_cmp expect actual &&
+
+		git cat-file -p :split >actual &&
+		# You probably do not want to use test_cmp here...
+		cmp split actual &&
+
+		mv split expect &&
+		git checkout split &&
+		# You probably do not want to use test_cmp here...
+		cmp expect split
 	)
 '
 
-- 
1.7.8.rc4.177.g4d64

^ permalink raw reply related

* [PATCH v3 2/6] varint-in-pack: refactor varint encoding/decoding
From: Junio C Hamano @ 2011-12-02  0:40 UTC (permalink / raw)
  To: git
In-Reply-To: <1322786449-25753-1-git-send-email-gitster@pobox.com>

Refactor encode/decode_in_pack_varint() functions from OFS_DELTA codepaths
to read and write variable-length integers in the pack stream.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin/pack-objects.c |   28 +++++++++++++---------------
 pack-write.c           |   27 +++++++++++++++++++++++++++
 pack.h                 |    2 ++
 sha1_file.c            |   18 ++++++------------
 4 files changed, 48 insertions(+), 27 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index dde913e..72206a9 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -210,7 +210,7 @@ static unsigned long write_object(struct sha1file *f,
 {
 	unsigned long size, limit, datalen;
 	void *buf;
-	unsigned char header[10], dheader[10];
+	unsigned char header[10];
 	unsigned hdrlen;
 	enum object_type type;
 	int usable_delta, to_reuse;
@@ -304,17 +304,16 @@ static unsigned long write_object(struct sha1file *f,
 			 * base from this object's position in the pack.
 			 */
 			off_t ofs = entry->idx.offset - entry->delta->idx.offset;
-			unsigned pos = sizeof(dheader) - 1;
-			dheader[pos] = ofs & 127;
-			while (ofs >>= 7)
-				dheader[--pos] = 128 | (--ofs & 127);
-			if (limit && hdrlen + sizeof(dheader) - pos + datalen + 20 >= limit) {
+			unsigned char dheader[10];
+			unsigned pos = encode_in_pack_varint(ofs, dheader);
+
+			if (limit && hdrlen + pos + datalen + 20 >= limit) {
 				free(buf);
 				return 0;
 			}
 			sha1write(f, header, hdrlen);
-			sha1write(f, dheader + pos, sizeof(dheader) - pos);
-			hdrlen += sizeof(dheader) - pos;
+			sha1write(f, dheader, pos);
+			hdrlen += pos;
 		} else if (type == OBJ_REF_DELTA) {
 			/*
 			 * Deltas with a base reference contain
@@ -369,17 +368,16 @@ static unsigned long write_object(struct sha1file *f,
 
 		if (type == OBJ_OFS_DELTA) {
 			off_t ofs = entry->idx.offset - entry->delta->idx.offset;
-			unsigned pos = sizeof(dheader) - 1;
-			dheader[pos] = ofs & 127;
-			while (ofs >>= 7)
-				dheader[--pos] = 128 | (--ofs & 127);
-			if (limit && hdrlen + sizeof(dheader) - pos + datalen + 20 >= limit) {
+			unsigned char dheader[10];
+			unsigned pos = encode_in_pack_varint(ofs, dheader);
+
+			if (limit && hdrlen + pos + datalen + 20 >= limit) {
 				unuse_pack(&w_curs);
 				return 0;
 			}
 			sha1write(f, header, hdrlen);
-			sha1write(f, dheader + pos, sizeof(dheader) - pos);
-			hdrlen += sizeof(dheader) - pos;
+			sha1write(f, dheader, pos);
+			hdrlen += pos;
 			reused_delta++;
 		} else if (type == OBJ_REF_DELTA) {
 			if (limit && hdrlen + 20 + datalen + 20 >= limit) {
diff --git a/pack-write.c b/pack-write.c
index cadc3e1..5702cec 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -302,6 +302,33 @@ char *index_pack_lockfile(int ip_out)
 	return NULL;
 }
 
+uintmax_t decode_in_pack_varint(const unsigned char **bufp)
+{
+	const unsigned char *buf = *bufp;
+	unsigned char c = *buf++;
+	uintmax_t val = c & 127;
+	while (c & 128) {
+		val += 1;
+		if (!val || MSB(val, 7))
+			return 0; /* overflow */
+		c = *buf++;
+		val = (val << 7) + (c & 127);
+	}
+	*bufp = buf;
+	return val;
+}
+
+int encode_in_pack_varint(uintmax_t value, unsigned char *buf)
+{
+	unsigned char varint[16];
+	unsigned pos = sizeof(varint) - 1;
+	varint[pos] = value & 127;
+	while (value >>= 7)
+		varint[--pos] = 128 | (--value & 127);
+	memcpy(buf, varint + pos, sizeof(varint) - pos);
+	return sizeof(varint) - pos;
+}
+
 /*
  * The per-object header is a pretty dense thing, which is
  *  - first byte: low four bits are "size", then three bits of "type",
diff --git a/pack.h b/pack.h
index cfb0f69..d7dc6ca 100644
--- a/pack.h
+++ b/pack.h
@@ -79,6 +79,8 @@ extern off_t write_pack_header(struct sha1file *f, uint32_t);
 extern void fixup_pack_header_footer(int, unsigned char *, const char *, uint32_t, unsigned char *, off_t);
 extern char *index_pack_lockfile(int fd);
 extern int encode_in_pack_object_header(enum object_type, uintmax_t, unsigned char *);
+extern int encode_in_pack_varint(uintmax_t, unsigned char *);
+extern uintmax_t decode_in_pack_varint(const unsigned char **);
 
 #define PH_ERROR_EOF		(-1)
 #define PH_ERROR_PACK_SIGNATURE	(-2)
diff --git a/sha1_file.c b/sha1_file.c
index c96e366..f066c2b 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1484,20 +1484,14 @@ static off_t get_delta_base(struct packed_git *p,
 	 * is stupid, as then a REF_DELTA would be smaller to store.
 	 */
 	if (type == OBJ_OFS_DELTA) {
-		unsigned used = 0;
-		unsigned char c = base_info[used++];
-		base_offset = c & 127;
-		while (c & 128) {
-			base_offset += 1;
-			if (!base_offset || MSB(base_offset, 7))
-				return 0;  /* overflow */
-			c = base_info[used++];
-			base_offset = (base_offset << 7) + (c & 127);
-		}
-		base_offset = delta_obj_offset - base_offset;
+		const unsigned char *buf = base_info;
+		uintmax_t ofs = decode_in_pack_varint(&buf);
+		if (!ofs && buf == base_info)
+			return 0; /* overflow */
+		base_offset = delta_obj_offset - ofs;
 		if (base_offset <= 0 || base_offset >= delta_obj_offset)
 			return 0;  /* out of bound */
-		*curpos += used;
+		*curpos += buf - base_info;
 	} else if (type == OBJ_REF_DELTA) {
 		/* The base entry _must_ be in the same pack */
 		base_offset = find_pack_entry_one(base_info, p);
-- 
1.7.8.rc4.177.g4d64

^ permalink raw reply related

* [PATCH v3 5/6] bulk-checkin: support chunked-object encoding
From: Junio C Hamano @ 2011-12-02  0:40 UTC (permalink / raw)
  To: git
In-Reply-To: <1322786449-25753-1-git-send-email-gitster@pobox.com>

Instead of recording a huge object as a single entry in the packfile,
record it as a concatenation of smaller blob objects. This is primarily to
make it possible to repack repositories with huge objects and transfer
huge objects out of such repositories.

This is the first step of a long and painful journey. We would still need
to teach many codepaths about the new encoding:

 * The streaming checkout codepath must learn how to read these from the
   object store and write a concatenation of component blob contents to
   the working tree. Note that after a repack, a component blob object
   could be represented as a delta to another blob object.

 * The in-core object reading codepath must learn to notice and at least
   reject reading such objects entirely in-core. It is expected that
   nobody is interested in producing a patch out of these huge objects, at
   least initially.

 * The object connectivity machinery must learn that component blob
   objects are reachable from an object that uses them, so that "gc"
   will not lose them, and "fsck" will not complain about them.

 * The pack-object machinery must learn to copy an object that is encoded
   in chunked-object encoding literally to the destination (while perhaps
   validating the object name).

 * The index-pack and verify-pack machineries need to be told about it.

The split-chunk logic used here is kept deliberately useless in order to
avoid distracting the reviewers, and also make sure that the chunked
encoding machinery does not depend any particular chunk splitting
heuristics. We may want to replace it with a better heuristics, perhaps
the one used in "bup" that is based on a self-synchronizing rolling
checksum logic, or something.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 Makefile         |    1 +
 bulk-checkin.c   |  127 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 bulk-checkin.h   |    1 +
 cache.h          |    9 +++-
 config.c         |    5 ++
 environment.c    |    1 +
 split-chunk.c    |   28 ++++++++++++
 t/t1050-large.sh |   29 ++++++++++++
 8 files changed, 198 insertions(+), 3 deletions(-)
 create mode 100644 split-chunk.c

diff --git a/Makefile b/Makefile
index 418dd2e..7d2fc3a 100644
--- a/Makefile
+++ b/Makefile
@@ -679,6 +679,7 @@ LIB_OBJS += sha1_name.o
 LIB_OBJS += shallow.o
 LIB_OBJS += sideband.o
 LIB_OBJS += sigchain.o
+LIB_OBJS += split-chunk.o
 LIB_OBJS += strbuf.o
 LIB_OBJS += streaming.o
 LIB_OBJS += string-list.o
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 6f1ce58..ebdacb8 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -5,6 +5,10 @@
 #include "csum-file.h"
 #include "pack.h"
 
+#ifndef CHUNK_MAX
+#define CHUNK_MAX 2000
+#endif
+
 static int pack_compression_level = Z_DEFAULT_COMPRESSION;
 
 static struct bulk_checkin_state {
@@ -268,12 +272,131 @@ static int deflate_to_pack(struct bulk_checkin_state *state,
 	return 0;
 }
 
+/* This is only called when actually writing the object out */
+static int store_in_chunks(struct bulk_checkin_state *state,
+			   unsigned char result_sha1[],
+			   int fd, size_t size,
+			   enum object_type type, const char *path,
+			   unsigned flags,
+			   struct chunk_ctx *up)
+{
+	struct pack_idx_entry *idx;
+	struct chunk_ctx ctx;
+	unsigned char chunk[CHUNK_MAX][20];
+	unsigned char header[100];
+	unsigned header_len;
+	unsigned chunk_cnt, i;
+	size_t remainder = size;
+	size_t write_size;
+	int status;
+
+	header_len = sprintf((char *)header, "%s %" PRIuMAX,
+			     typename(type), (uintmax_t)size) + 1;
+
+	memset(&ctx, 0, sizeof(ctx));
+	ctx.up = up;
+	git_SHA1_Init(&ctx.ctx);
+	git_SHA1_Update(&ctx.ctx, header, header_len);
+
+	for (chunk_cnt = 0, remainder = size;
+	     remainder && chunk_cnt < CHUNK_MAX - 1;
+	     chunk_cnt++) {
+		size_t chunk_size = carve_chunk(fd, remainder);
+		status = deflate_to_pack(state, chunk[chunk_cnt], fd,
+					 chunk_size, OBJ_BLOB, path, flags,
+					 &ctx);
+		if (status)
+			return status;
+		remainder -= chunk_size;
+	}
+
+	if (remainder) {
+		if (split_size_limit_cfg && split_size_limit_cfg < remainder)
+			status = store_in_chunks(state, chunk[chunk_cnt], fd,
+						 remainder, OBJ_BLOB, path, flags,
+						 &ctx);
+		else
+			status = deflate_to_pack(state, chunk[chunk_cnt], fd,
+						 remainder, OBJ_BLOB, path, flags,
+						 &ctx);
+		if (status)
+			return status;
+		chunk_cnt++;
+	}
+
+	/*
+	 * Now we have chunk_cnt chunks (the last one may be a large
+	 * blob that itself is represented as concatenation of
+	 * multiple blobs).
+	 */
+	git_SHA1_Final(result_sha1, &ctx.ctx);
+	if (already_written(state, result_sha1))
+		return 0;
+
+	/*
+	 * The standard type & size header is followed by
+	 * - the number of chunks (varint)
+	 * - the object names of the chunks (20 * chunk_cnt bytes)
+	 * - the resulting object name (20 bytes)
+	 */
+	type = OBJ_CHUNKED(type);
+	header_len = encode_in_pack_object_header(type, size, header);
+	header_len += encode_in_pack_varint(chunk_cnt, header + header_len);
+
+	write_size = header_len + 20 * (chunk_cnt + 1);
+
+	prepare_to_stream(state, flags);
+	if (state->nr_written &&
+	    pack_size_limit_cfg &&
+	    pack_size_limit_cfg < (state->offset + write_size)) {
+		finish_bulk_checkin(state);
+		prepare_to_stream(state, flags);
+	}
+
+	idx = xcalloc(1, sizeof(*idx));
+	idx->offset = state->offset;
+
+	crc32_begin(state->f);
+	sha1write(state->f, header, header_len);
+	for (i = 0; i < chunk_cnt; i++)
+		sha1write(state->f, chunk[i], 20);
+	sha1write(state->f, result_sha1, 20);
+	idx->crc32 = crc32_end(state->f);
+	hashcpy(idx->sha1, result_sha1);
+	ALLOC_GROW(state->written,
+		   state->nr_written + 1,
+		   state->alloc_written);
+	state->written[state->nr_written++] = idx;
+	state->offset += write_size;
+
+	return 0;
+}
+
 int index_bulk_checkin(unsigned char *sha1,
 		       int fd, size_t size, enum object_type type,
 		       const char *path, unsigned flags)
 {
-	int status = deflate_to_pack(&state, sha1, fd, size, type,
-				     path, flags, NULL);
+	int status;
+
+	/*
+	 * For now, we only deal with blob objects, as validation
+	 * of a huge tree object that is split into chunks will be
+	 * too cumbersome to be worth it.
+	 *
+	 * Note that we only have to use store_in_chunks() codepath
+	 * when we are actually writing things out. deflate_to_pack()
+	 * codepath can hash arbitrarily huge object without keeping
+	 * everything in core just fine.
+	 */
+	if ((flags & HASH_WRITE_OBJECT) &&
+	    type == OBJ_BLOB &&
+	    split_size_limit_cfg &&
+	    split_size_limit_cfg < size)
+		status = store_in_chunks(&state, sha1, fd, size, type,
+					 path, flags, NULL);
+	else
+		status = deflate_to_pack(&state, sha1, fd, size, type,
+					 path, flags, NULL);
 	if (!state.plugged)
 		finish_bulk_checkin(&state);
 	return status;
diff --git a/bulk-checkin.h b/bulk-checkin.h
index 4f599f8..89f1741 100644
--- a/bulk-checkin.h
+++ b/bulk-checkin.h
@@ -12,5 +12,6 @@ extern int index_bulk_checkin(unsigned char sha1[],
 
 extern void plug_bulk_checkin(void);
 extern void unplug_bulk_checkin(void);
+extern size_t carve_chunk(int fd, size_t size);
 
 #endif
diff --git a/cache.h b/cache.h
index 4a3b421..374f712 100644
--- a/cache.h
+++ b/cache.h
@@ -384,11 +384,17 @@ enum object_type {
 	OBJ_EXT = 5,
 	OBJ_OFS_DELTA = 6,
 	OBJ_REF_DELTA = 7,
+	/* OBJ_CHUNKED_COMMIT = 8 */
+	/* OBJ_CHUNKED_TREE = 9 */
+	OBJ_CHUNKED_BLOB = 10,
+	/* OBJ_CHUNKED_TAG = 11 */
 	OBJ_ANY,
 	OBJ_MAX
 };
 #define OBJ_LAST_BASE_TYPE OBJ_REF_DELTA
-#define OBJ_LAST_VALID_TYPE OBJ_REF_DELTA
+#define OBJ_LAST_VALID_TYPE OBJ_CHUNKED_BLOB
+#define OBJ_CHUNKED(type) ((type) + 7)
+#define OBJ_DEKNUHC(type) ((type) - 7)
 
 static inline enum object_type object_type(unsigned int mode)
 {
@@ -602,6 +608,7 @@ extern size_t packed_git_limit;
 extern size_t delta_base_cache_limit;
 extern unsigned long big_file_threshold;
 extern unsigned long pack_size_limit_cfg;
+extern unsigned long split_size_limit_cfg;
 extern int read_replace_refs;
 extern int fsync_object_files;
 extern int core_preload_index;
diff --git a/config.c b/config.c
index c736802..bdc3be1 100644
--- a/config.c
+++ b/config.c
@@ -801,6 +801,11 @@ int git_default_config(const char *var, const char *value, void *dummy)
 		pack_size_limit_cfg = git_config_ulong(var, value);
 		return 0;
 	}
+
+	if (!strcmp(var, "pack.splitsizelimit")) {
+		split_size_limit_cfg = git_config_ulong(var, value);
+		return 0;
+	}
 	/* Add other config variables here and to Documentation/config.txt. */
 	return 0;
 }
diff --git a/environment.c b/environment.c
index 31e4284..b66df28 100644
--- a/environment.c
+++ b/environment.c
@@ -61,6 +61,7 @@ int grafts_replace_parents = 1;
 int core_apply_sparse_checkout;
 struct startup_info *startup_info;
 unsigned long pack_size_limit_cfg;
+unsigned long split_size_limit_cfg;
 
 /* Parallel index stat data preload? */
 int core_preload_index = 0;
diff --git a/split-chunk.c b/split-chunk.c
new file mode 100644
index 0000000..1b60e63
--- /dev/null
+++ b/split-chunk.c
@@ -0,0 +1,28 @@
+#include "git-compat-util.h"
+#include "bulk-checkin.h"
+
+/* Cut at around 512kB */
+#define TARGET_CHUNK_SIZE_LOG2 19
+#define TARGET_CHUNK_SIZE (1U << TARGET_CHUNK_SIZE_LOG2)
+
+/*
+ * Carve out around 500kB to be stored as a separate blob
+ */
+size_t carve_chunk(int fd, size_t size)
+{
+	size_t chunk_size;
+	off_t seekback = lseek(fd, 0, SEEK_CUR);
+
+	if (seekback == (off_t) -1)
+		die("cannot find the current offset");
+
+	/* Next patch will do something complex to find out where to cut */
+	chunk_size = size;
+	if (TARGET_CHUNK_SIZE < chunk_size)
+		chunk_size = TARGET_CHUNK_SIZE;
+
+	if (lseek(fd, seekback, SEEK_SET) == (off_t) -1)
+		return error("cannot seek back");
+
+	return chunk_size;
+}
diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index 29d6024..d6cb66d 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -100,4 +100,33 @@ test_expect_success 'packsize limit' '
 	)
 '
 
+test_expect_success 'split limit' '
+	test_create_repo split &&
+	(
+		cd split &&
+		git config core.bigfilethreshold 2m &&
+		git config pack.splitsizelimit 1m &&
+
+		test-genrandom "a" $(( 4800 * 1024 )) >split &&
+		git add split &&
+
+		# This should result in a new chunked object "tail"
+		# that shares most of the component blobs in its
+		# early part with "split".
+		cat split >tail &&
+		echo cruft >>tail &&
+		git add tail &&
+
+		# This should result in a new chunked object "head"
+		# that begins with its own unique component blobs
+		# but quickly synchronize and start using the same
+		# component blobs with "split" and "tail", once we
+		# switch to a better chunking heuristics.
+		echo cruft >head &&
+		cat split >>head &&
+		git add head
+
+	)
+'
+
 test_done
-- 
1.7.8.rc4.177.g4d64

^ permalink raw reply related

* [PATCH v3 4/6] bulk-checkin: allow the same data to be multiply hashed
From: Junio C Hamano @ 2011-12-02  0:40 UTC (permalink / raw)
  To: git
In-Reply-To: <1322786449-25753-1-git-send-email-gitster@pobox.com>

This updates stream_to_pack() machinery to feed the data it is writing out
to multiple hash contexts at the same time. Right now we only use a single
git_SHA_CTX, so there is no change in functionality.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 bulk-checkin.c |   33 +++++++++++++++++++++++++--------
 1 files changed, 25 insertions(+), 8 deletions(-)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 6b0b6d4..6f1ce58 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -75,6 +75,20 @@ static int already_written(struct bulk_checkin_state *state, unsigned char sha1[
 	return 0;
 }
 
+struct chunk_ctx {
+	struct chunk_ctx *up;
+	git_SHA_CTX ctx;
+};
+
+static void chunk_SHA1_Update(struct chunk_ctx *ctx,
+			      const unsigned char *buf, size_t size)
+{
+	while (ctx) {
+		git_SHA1_Update(&ctx->ctx, buf, size);
+		ctx = ctx->up;
+	}
+}
+
 /*
  * Read the contents from fd for size bytes, streaming it to the
  * packfile in state while updating the hash in ctx. Signal a failure
@@ -91,7 +105,7 @@ static int already_written(struct bulk_checkin_state *state, unsigned char sha1[
  * with a new pack.
  */
 static int stream_to_pack(struct bulk_checkin_state *state,
-			  git_SHA_CTX *ctx, off_t *already_hashed_to,
+			  struct chunk_ctx *ctx, off_t *already_hashed_to,
 			  int fd, size_t size, enum object_type type,
 			  const char *path, unsigned flags)
 {
@@ -123,7 +137,7 @@ static int stream_to_pack(struct bulk_checkin_state *state,
 				if (rsize < hsize)
 					hsize = rsize;
 				if (hsize)
-					git_SHA1_Update(ctx, ibuf, hsize);
+					chunk_SHA1_Update(ctx, ibuf, hsize);
 				*already_hashed_to = offset;
 			}
 			s.next_in = ibuf;
@@ -185,10 +199,11 @@ static int deflate_to_pack(struct bulk_checkin_state *state,
 			   unsigned char result_sha1[],
 			   int fd, size_t size,
 			   enum object_type type, const char *path,
-			   unsigned flags)
+			   unsigned flags,
+			   struct chunk_ctx *up)
 {
 	off_t seekback, already_hashed_to;
-	git_SHA_CTX ctx;
+	struct chunk_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
 	struct sha1file_checkpoint checkpoint;
@@ -200,8 +215,10 @@ static int deflate_to_pack(struct bulk_checkin_state *state,
 
 	header_len = sprintf((char *)obuf, "%s %" PRIuMAX,
 			     typename(type), (uintmax_t)size) + 1;
-	git_SHA1_Init(&ctx);
-	git_SHA1_Update(&ctx, obuf, header_len);
+	memset(&ctx, 0, sizeof(ctx));
+	ctx.up = up;
+	git_SHA1_Init(&ctx.ctx);
+	git_SHA1_Update(&ctx.ctx, obuf, header_len);
 
 	/* Note: idx is non-NULL when we are writing */
 	if ((flags & HASH_WRITE_OBJECT) != 0)
@@ -232,7 +249,7 @@ static int deflate_to_pack(struct bulk_checkin_state *state,
 		if (lseek(fd, seekback, SEEK_SET) == (off_t) -1)
 			return error("cannot seek back");
 	}
-	git_SHA1_Final(result_sha1, &ctx);
+	git_SHA1_Final(result_sha1, &ctx.ctx);
 	if (!idx)
 		return 0;
 
@@ -256,7 +273,7 @@ int index_bulk_checkin(unsigned char *sha1,
 		       const char *path, unsigned flags)
 {
 	int status = deflate_to_pack(&state, sha1, fd, size, type,
-				     path, flags);
+				     path, flags, NULL);
 	if (!state.plugged)
 		finish_bulk_checkin(&state);
 	return status;
-- 
1.7.8.rc4.177.g4d64

^ permalink raw reply related

* Re: Anyone have a commit hook for forbidding old branches from being merged in?
From: Neal Kreitzinger @ 2011-12-02  1:37 UTC (permalink / raw)
  To: Ævar Arnfjörð Bjarmason; +Cc: Git Mailing List
In-Reply-To: <CACBZZX4LyTaz=fU1vvgpeL904QFjJULCMVSP0uutcuxZT+-vWQ@mail.gmail.com>

On 12/1/2011 9:34 AM, Ævar Arnfjörð Bjarmason wrote:
> I work on a web application that due to underlying database schema
> changes etc only even compiles and runs for a given 2 week moving
> window.
>
> Thus if someone started a branch say 1 month ago, works on it one and
> off, and then merges it back into the mainline it becomes impossible
> to bisect that code if it has a problem. You either have to:
>
> * Revert the whole merge * Manually eyeball the code to see where
> the error might be * Brute-force manually bisect it by checking out
> only the files altered in those commits instead of the commit at a
> given data. Usually individual files are still compatible with the
> new code.
>
> But the whole reason this is a problem is because people don't rebase
> their branches before merging them in, unintentionally causing
> problems.
>
> So before I write a hook to do this, is there anything that
> implements a hook that:
>
> * Checks if you're pushing a merge commit * If so, is that merge
> based off and old version of $MAINBRANCH * Is the base of that
> branch more than N days old? * If so reject the push

It sounds like you're saying that people should rebase before merging to 
main.  That means their merge would be a fast-forward.  You could just 
reject anyone who has not done a current rebase.  Then you could use 
this technique from the pre-rebase.sample hook to enforce up-to-date 
rebases:

only_in_main='git rev-list "^$topic" main'
if test -z "$only-in-main"
then
     exit 0
else
     echo >&2 "error: please rebase on main before merging to main."
     exit 1
fi

v/r,
neal

^ permalink raw reply

* Re: [PATCH] Implement fast hash-collision detection
From: Bill Zaumen @ 2011-12-02  2:59 UTC (permalink / raw)
  To: Jeff King; +Cc: git, gitster, pclouds, spearce, torvalds
In-Reply-To: <20111201052615.GA22141@sigill.intra.peff.net>

On Thu, 2011-12-01 at 00:26 -0500, Jeff King wrote:

> Cryptographically speaking, I think that claim is sound, and you can
> certainly construct attack scenarios where this detection would help.
> However, quantifying the effectiveness is difficult.  What is the
> likelihood of a malicious collided-object replacement being detected
> without your scheme? What is it with it?
> 
> There are many questions that factor into guessing the latter.
> 
> How often does Linus fetch from his own kernel.org repository? He would
> usually push, I would think.  Even if he does fetch, he wouldn't be
> getting the old objects that he already has. I guess this is the reason
> for your digest-of-digests for each commit object sent? 

Yes, - the digest of digests is used to check things that would not be
sent.

>
> But what about
> objects that are no longer in current commits, but are in older ones?

That's a good question, of course. After Linus pushes a commit, he'll
notify others, and if some fetch before the repository is hacked,
they will detect an error on a subsequent fetch.  For fetches, the
server reports a series of commits, and the client responds with 'have'
for those it has, with the CRCs added so the server can check for a
mismatch.  I made minimal changes to fetch-pack.c and upload-pack.c:
just adding the CRC fields to the messages already sent.  The server
asks about more commits than it actually transfers, but all the ones
it asks about are tested.  One could send additional 'have' replies
if necessary (for ones the server didn't mention) but I didn't do
that, partly for simplicity but also because I was looking at the
fetch-pack.c and update-pack.c code for the first time. If desired,
such changes can be added.

I also do some similar checking when a commit is pushed - the server
tells the client the last commit it has and the client will send the
CRCs in its reply to allow the server to cross check those.  I didn't
mention that before because only the latest is really checked. Again,
I just changed a message format (backwards compatible, of course),
but additional checking could be added if desired.

You could also add options to check tips of branches and all commits
that have tags (e.g., a v1.0 tag)  All of that simply requires more
work on commands such as fetch-pack, upload-pack, send-pack and
receive-pack.

> What about the server being more clever about hiding the replacement
> object? E.g., instead of just breaking into kernel.org and inserting a
> replacement object, the attacker runs a malicious git-daemon that
> returns the bogus object to cloners, but the real object to fetchers.

That's really a server-security issue, not a git one.  Perhaps
repositories should be configured so that all the executables are on
read-only partitions.  It's an important question in general of
course, but it is probably useful to distinguish attacks that put
bad data on a server from ones that install new software.

> 
> > It's also possible to write some additional commands to (for example)
> > fetch the SHA-1 hashes and CRCs from all remote repositories you use
> > and compare these to make sure they are all consistent, something that
> > can be run ocassionally.
> 
> But we can already do that. Assume you have an existing repo "foo". To
> verify the copy at git://example.com/foo.git, do a fresh clone to
> "bar", and then compare the objects in "foo" to "bar", either byte-wise
> or by digest.

Of course, but that is an expensive operation - in the case of Git
transferring some 50 MBytes of data per repository.  A command to
fetch the SHA-1 ID and a CRC or message digest for each object would
not only run faster, but should put a much lower load on the server.

Getting back to the birthday attack question (this is an area where
your comments were very useful for me), there's a case I didn't
consider.

Suppose two developers bounce code back and forth by email and one of
them commits it, but the other developer is a bad guy.  The bad guy
would then have had an opportunity to use a birthday attack by sending
back subtly modified code (e.g., changes to how comments are formated,
additional blank lines, etc.)  He can even put a humorous comment at
the end of the file such as "the first 200 Chinese characters I learned"
and then include the Chinese characters (I've tested this with gcc -
the Chinese characters, represented in Unicode, print in an editor and
are ignored by the compiler.)  Unicode is a lot closer to binary data
so you have a lot of bits you can alter in a small amount of space,
with each character requiring multiple bytes to represent it. The
comment will look silly but innocuous.  I think Linus Torvalds once
suggested being suspicious of anything that looked like "line noise"
in a patch.  Non-western unicode characters can serve the same function
but look legitimate, at least to people who don't know the language
and when coupled with some "social engineering" to set expectations.

As an example of how this attack might work, without breaking into a
system, assume two programmers collaborating on a project both have
write-access to the same shared repository.

1. The project is using  Java, with a rule that all classes and methods
that are protected or public be documented (so javadoc can create API
documentation).

2. Programmer A emails some Java source code to programmer B with a
request to edit the comments to improve them or fix any obvious
mistakes.

3. Programmer B fixes the comments, but also creates a modified file
with the same SHA-1 hash as the correct file in order to add some bugs
or security flaws. 

4. Programmer B creates a branch from an earlier version, adds some
tests, puts the contents of the modified file into the directory tree
under an obscure name, adds it and does a commit.  B then pushes it,
creating a new remote branch.

5. Programmer B then tells Programmer A that he'll have the modified
file back quickly, but could he please fetch his new remote branch
and run a test, and answer some questions about what happens as B needs
that information to finish his review of the documentation.

6. Programmer A tells B the results, so B knows that A has fetched the
remote branch, an B then sends A the corrected file (not the modified
one). Programmer A reviews the file, notes that everything seems OK,
specifically that only comments were changed, and runs commit -a
followed by a push.  Because Git tries to be smart, it will (I think)
notice from the SHA-1 hashes and from the remote branches that the
server already has the object so there is no need to send it.

8. Programmer B fetches the changes, deletes his temporary branch, both
locally and on the shared repository.  He tells A that the temporary
branch is deleted so that B should run "git branch update --prune ..."

So what happens?  Hopefully someone finds the problem, either through a
source-code review or some QA testing, but regardless Programmer A may
get the blame as the evidence of any tampering has pretty much been
erased.  In the worst case, a release with a security hole goes out.

Why would Programmer B do that?  Maybe he's leaving the company because
he's hard to work with and is blaming Programmer A for the problem, and
wants to "get back" at Programmer A by harming his career. But in any
case he didn't have to break into the repository to get the effect he
wanted. At least is is extremely hard to do in terms of computational
resources.

Bill

^ permalink raw reply

* Re: Proposal: create meaningful aliases for git reset's hard/soft/mixed
From: Thomas Rast @ 2011-12-02  7:26 UTC (permalink / raw)
  To: Phil Hord; +Cc: Philippe Vaucher, Junio C Hamano, git, Christian Couder
In-Reply-To: <CABURp0rtCUbJXLHtXv_1g6GRKL3mX-T+3vN1=QO4CUibqXdEMg@mail.gmail.com>

Phil Hord wrote:
> 
> Think outside the "reset" command.  Like this:
> 
> From the "most popular" comment on http://progit.org/2011/07/11/reset.html:
> > I remember them as:
> > --soft      -> git uncommit
> > --mixed  -> git unadd
> > --hard     -> git undo
> 
> I don't particular like these names, but conceptually they are helpful.

I think all of these, but the last one in particular, are *very*
dangerous oversimplifications.  Doubly so if you then use "undo" with
a revision argument.

-- 
Thomas Rast
trast@{inf,student}.ethz.ch

^ permalink raw reply

* Re: Proposal: create meaningful aliases for git reset's hard/soft/mixed
From: Miles Bader @ 2011-12-02  7:45 UTC (permalink / raw)
  To: Thomas Rast
  Cc: Phil Hord, Philippe Vaucher, Junio C Hamano, git,
	Christian Couder
In-Reply-To: <201112020826.14114.trast@student.ethz.ch>

Thomas Rast <trast@student.ethz.ch> writes:
>> > I remember them as:
>> > --soft      -> git uncommit
>> > --mixed  -> git unadd
>> > --hard     -> git undo
>> 
>> I don't particular like these names, but conceptually they are helpful.
>
> I think all of these, but the last one in particular, are *very*
> dangerous oversimplifications.  Doubly so if you then use "undo" with
> a revision argument.

I agree.  Not only is it completely wrong when used with a revision
argument, but "undo" is so vague that it's probably useless for _any_
git command, much less one so dangerous as "reset --hard".

-miles

-- 
Friendship, n. A ship big enough to carry two in fair weather, but only one
in foul.

^ permalink raw reply

* Re: Status after 'git clone --no-checkout' ?
From: norbert.nemec @ 2011-12-02  7:53 UTC (permalink / raw)
  To: git
In-Reply-To: <20111201190058.GC2873@sigill.intra.peff.net>

Thanks a lot for this concise explanation -- exactly what I was hoping for!


Am 01.12.11 20:00, schrieb Jeff King:
> On Wed, Nov 30, 2011 at 02:02:22PM +0100, norbert.nemec wrote:
>
>> what exactly is the status after 'git clone --no-checkout'? Is there
>> any straightforward way how one could end up in this state starting
>> from a regularly checked out repository?
>
> You have a HEAD which points to some actual commit, but no index or
> working tree. I don't think there is a particular name for this state.
>
> You can get something similar in an existing repo by deleting all of the
> working tree files and removing .git/index.
>
>> 'git checkout' without any further options serves to move from the
>> aforementioned special state to a regular checked out state.
>> Otherwise it never seems to do anything. Are there any other
>> situations where 'git checkout' on its own would have any effect?
>
> By itself, I don't think so. But you can use "git checkout -f" to
> discard changes in the index and working tree, setting them back to the
> state in HEAD.
>
> At one point, some people used "git checkout" as a no-op, because it
> would print the "ahead/behind" information with respect to the upstream.
> These days, that information is part of "git status", so I suspect
> people use that instead.
>
> -Peff

^ permalink raw reply

* Re: [PATCHv2 0/4] git-p4: small fixes to branches and labels; tests
From: Luke Diamand @ 2011-12-02  8:49 UTC (permalink / raw)
  To: Vitor Antunes; +Cc: Pete Wyckoff, git
In-Reply-To: <CAOpHH-XL5OGpnihEgqnXqUUFsMxXn2wSdLadegnC1epg44vs8A@mail.gmail.com>

On 01/12/11 21:59, Vitor Antunes wrote:
> On Dec 1, 2011 4:03 AM, "Pete Wyckoff"<pw@padd.com>  wrote:
>> I see your point.  P4 labels are the only way that they support
>> tagging, apparently.  I'm okay with leaving label support in
>> git-p4.  And it will be nice if Luke makes it behave a bit
>> better.  But doing heroics to emulate cross-commit tags feels
>> like a lot of work, and the wrong direction.
>
> Agreed. Lets keep it simple.
>

I think I'm going to have to go away and do a bit more work on this. The 
existing label code is still quite buggy (or my understanding is 
broken). Either way I'd rather get it to the point where it actually 
works and passes all its tests.

The two issues I'm seeing are:

- two p4 labels covering the same set of files; only one of them gets 
imported.

- if you have a p4 label on a subset of files then it gets dropped 
(which is fine) but so do most of the other labels (as far as I can tell).

I think if this could be made to work it would actually be really useful 
though.

(Pete - I've found your previous email; not sure why I didn't see it 
before. I'll roll that change in with what I'm doing).

Regards!
Luke

^ permalink raw reply

* Re: Anyone have a commit hook for forbidding old branches from being merged in?
From: Ævar Arnfjörð Bjarmason @ 2011-12-02 11:27 UTC (permalink / raw)
  To: Neal Kreitzinger; +Cc: Git Mailing List
In-Reply-To: <4ED82BCA.5080909@gmail.com>

On Fri, Dec 2, 2011 at 02:37, Neal Kreitzinger <nkreitzinger@gmail.com> wrote:

> It sounds like you're saying that people should rebase before merging to
> main.  That means their merge would be a fast-forward.  You could just
> reject anyone who has not done a current rebase.  Then you could use this
> technique from the pre-rebase.sample hook to enforce up-to-date rebases:

That would be an overly invasive change to people's workflows. It's
fine if you merge something in as long as the initial merge base isn't
N days older than the current state of the mainline.

^ permalink raw reply

* [PATCH v2 2/3] grep: enable threading with -p and -W using lazy attribute lookup
From: Thomas Rast @ 2011-12-02 13:07 UTC (permalink / raw)
  To: René Scharfe; +Cc: Eric Herman, git, Junio C Hamano
In-Reply-To: <cover.1322830368.git.trast@student.ethz.ch>

Lazily load the userdiff attributes in match_funcname().  Use a
separate mutex around this loading to protect the (not thread-safe)
attributes machinery.  This lets us re-enable threading with -p and
-W while reducing the overhead caused by looking up attributes.

Signed-off-by: Thomas Rast <trast@student.ethz.ch>
---
 builtin/grep.c |   10 +++++++-
 grep.c         |   74 ++++++++++++++++++++++++++++++++++----------------------
 grep.h         |    7 +++++
 3 files changed, 61 insertions(+), 30 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index 988ea1d..65b1ffe 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -256,6 +256,7 @@ static void start_threads(struct grep_opt *opt)
 
 	pthread_mutex_init(&grep_mutex, NULL);
 	pthread_mutex_init(&read_sha1_mutex, NULL);
+	pthread_mutex_init(&grep_attr_mutex, NULL);
 	pthread_cond_init(&cond_add, NULL);
 	pthread_cond_init(&cond_write, NULL);
 	pthread_cond_init(&cond_result, NULL);
@@ -303,6 +304,7 @@ static int wait_all(void)
 
 	pthread_mutex_destroy(&grep_mutex);
 	pthread_mutex_destroy(&read_sha1_mutex);
+	pthread_mutex_destroy(&grep_attr_mutex);
 	pthread_cond_destroy(&cond_add);
 	pthread_cond_destroy(&cond_write);
 	pthread_cond_destroy(&cond_result);
@@ -1002,9 +1004,15 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 		opt.regflags |= REG_ICASE;
 
 #ifndef NO_PTHREADS
-	if (online_cpus() == 1 || !grep_threads_ok(&opt))
+	if (online_cpus() == 1)
 		use_threads = 0;
+#else
+	use_threads = 0;
+#endif
 
+	opt.use_threads = use_threads;
+
+#ifndef NO_PTHREADS
 	if (use_threads) {
 		if (opt.pre_context || opt.post_context || opt.file_break ||
 		    opt.funcbody)
diff --git a/grep.c b/grep.c
index 7a070e9..4dd7da2 100644
--- a/grep.c
+++ b/grep.c
@@ -2,6 +2,7 @@
 #include "grep.h"
 #include "userdiff.h"
 #include "xdiff-interface.h"
+#include "thread-utils.h"
 
 void append_header_grep_pattern(struct grep_opt *opt, enum grep_header_field field, const char *pat)
 {
@@ -806,10 +807,46 @@ static void show_line(struct grep_opt *opt, char *bol, char *eol,
 	opt->output(opt, "\n", 1);
 }
 
-static int match_funcname(struct grep_opt *opt, char *bol, char *eol)
+#ifndef NO_PTHREADS
+/*
+ * This lock protects access to the gitattributes machinery, which is
+ * not thread-safe.
+ */
+pthread_mutex_t grep_attr_mutex;
+
+static inline void grep_attr_lock(struct grep_opt *opt)
+{
+	if (opt->use_threads)
+		pthread_mutex_lock(&grep_attr_mutex);
+}
+
+static inline void grep_attr_unlock(struct grep_opt *opt)
+{
+	if (opt->use_threads)
+		pthread_mutex_unlock(&grep_attr_mutex);
+}
+#else
+#define grep_attr_lock(opt)
+#define grep_attr_unlock(opt)
+#endif
+
+static int match_funcname(struct grep_opt *opt, const char *name, char *bol, char *eol)
 {
 	xdemitconf_t *xecfg = opt->priv;
-	if (xecfg && xecfg->find_func) {
+	if (xecfg && !xecfg->find_func) {
+		struct userdiff_driver *drv;
+		grep_attr_lock(opt);
+		drv = userdiff_find_by_path(name);
+		grep_attr_unlock(opt);
+		if (drv && drv->funcname.pattern) {
+			const struct userdiff_funcname *pe = &drv->funcname;
+			xdiff_set_find_func(xecfg, pe->pattern, pe->cflags);
+		} else {
+			xecfg = opt->priv = NULL;
+		}
+	}
+
+	if (xecfg) {
 		char buf[1];
 		return xecfg->find_func(bol, eol - bol, buf, 1,
 					xecfg->find_func_priv) >= 0;
@@ -835,7 +872,7 @@ static void show_funcname_line(struct grep_opt *opt, const char *name,
 		if (lno <= opt->last_shown)
 			break;
 
-		if (match_funcname(opt, bol, eol)) {
+		if (match_funcname(opt, name, bol, eol)) {
 			show_line(opt, bol, eol, name, lno, '=');
 			break;
 		}
@@ -848,7 +885,7 @@ static void show_pre_context(struct grep_opt *opt, const char *name, char *buf,
 	unsigned cur = lno, from = 1, funcname_lno = 0;
 	int funcname_needed = !!opt->funcname;
 
-	if (opt->funcbody && !match_funcname(opt, bol, end))
+	if (opt->funcbody && !match_funcname(opt, name, bol, end))
 		funcname_needed = 2;
 
 	if (opt->pre_context < lno)
@@ -864,7 +901,7 @@ static void show_pre_context(struct grep_opt *opt, const char *name, char *buf,
 		while (bol > buf && bol[-1] != '\n')
 			bol--;
 		cur--;
-		if (funcname_needed && match_funcname(opt, bol, eol)) {
+		if (funcname_needed && match_funcname(opt, name, bol, eol)) {
 			funcname_lno = cur;
 			funcname_needed = 0;
 		}
@@ -942,19 +979,6 @@ static int look_ahead(struct grep_opt *opt,
 	return 0;
 }
 
-int grep_threads_ok(const struct grep_opt *opt)
-{
-	/* If this condition is true, then we may use the attribute
-	 * machinery in grep_buffer_1. The attribute code is not
-	 * thread safe, so we disable the use of threads.
-	 */
-	if ((opt->funcname || opt->funcbody)
-	    && !opt->unmatch_name_only && !opt->status_only && !opt->name_only)
-		return 0;
-
-	return 1;
-}
-
 static void std_output(struct grep_opt *opt, const void *buf, size_t size)
 {
 	fwrite(buf, size, 1, stdout);
@@ -1008,16 +1032,8 @@ static int grep_buffer_1(struct grep_opt *opt, const char *name,
 	}
 
 	memset(&xecfg, 0, sizeof(xecfg));
-	if ((opt->funcname || opt->funcbody)
-	    && !opt->unmatch_name_only && !opt->status_only &&
-	    !opt->name_only && !binary_match_only && !collect_hits) {
-		struct userdiff_driver *drv = userdiff_find_by_path(name);
-		if (drv && drv->funcname.pattern) {
-			const struct userdiff_funcname *pe = &drv->funcname;
-			xdiff_set_find_func(&xecfg, pe->pattern, pe->cflags);
-			opt->priv = &xecfg;
-		}
-	}
+	opt->priv = &xecfg;
+
 	try_lookahead = should_lookahead(opt);
 
 	while (left) {
@@ -1093,7 +1109,7 @@ static int grep_buffer_1(struct grep_opt *opt, const char *name,
 				show_function = 1;
 			goto next_line;
 		}
-		if (show_function && match_funcname(opt, bol, eol))
+		if (show_function && match_funcname(opt, name, bol, eol))
 			show_function = 0;
 		if (show_function ||
 		    (last_hit && lno <= last_hit + opt->post_context)) {
diff --git a/grep.h b/grep.h
index a652800..15d227c 100644
--- a/grep.h
+++ b/grep.h
@@ -115,6 +115,7 @@ struct grep_opt {
 	int show_hunk_mark;
 	int file_break;
 	int heading;
+	int use_threads;
 	void *priv;
 
 	void (*output)(struct grep_opt *opt, const void *data, size_t size);
@@ -131,4 +132,10 @@ struct grep_opt {
 extern struct grep_opt *grep_opt_dup(const struct grep_opt *opt);
 extern int grep_threads_ok(const struct grep_opt *opt);
 
+#ifndef NO_PTHREADS
+/* Mutex used around access to the attributes machinery if
+ * opt->use_threads.  Must be initialized/destroyed by callers! */
+extern pthread_mutex_t grep_attr_mutex;
+#endif
+
 #endif
-- 
1.7.8.rc4.388.ge53ab

^ permalink raw reply related

* [PATCH v2 0/3] grep multithreading and scaling
From: Thomas Rast @ 2011-12-02 13:07 UTC (permalink / raw)
  To: René Scharfe; +Cc: Eric Herman, git, Junio C Hamano
In-Reply-To: <201111291507.04754.trast@student.ethz.ch>

[Eric, I measured some numbers that may be interesting to the
discussion about b2924dc.  See below.]

This round wraps up the original patch I posted, plus the draft patch
I posted inline the other day with René's review taken into account.
I also added a patch that rips out threading in the non-worktree case;
read on for the reasoning.

René Scharfe wrote:
> Hmm, why are [gitattributes lookups] that expensive?
> 
> callgrind tells me that userdiff_find_by_path() contributes only 0.18%
> to the total cost with your first patch.  Timings in my virtual machine
> are very volatile, but it seems that here the difference is in the
> system time while user is basically the same for all combinations of
> patches.

Well, turns out I was measuring something completely stupid.  I had

  git grep --cached -W INITRAMFS_ROOT_UID

where I put the --cached originally because that makes it independent
of the worktree (which in the very first measurements I still had
wiped, as I tend to do for this repo; I checked it out again after
that).  This in fact gives me (~/g/git-grep --cached
INITRAMFS_ROOT_UID, leaving aside -W; best of 10):

  THREADS=8:   2.88user 0.21system 0:02.94elapsed
  THREADS=4:   2.89user 0.29system 0:02.99elapsed
  THREADS=2:   2.83user 0.36system 0:02.87elapsed
  NO_PTHREADS: 2.16user 0.08system 0:02.25elapsed

Uhuh.  Doesn't scale so well after all.  But removing the --cached, as
most people probably would:

  THREADS=8:   0.19user 0.32system 0:00.16elapsed
  THREADS=4:   0.16user 0.34system 0:00.17elapsed
  THREADS=2:   0.18user 0.32system 0:00.26elapsed
  NO_PTHREADS: 0.12user 0.17system 0:00.31elapsed

So I conclude that during any grep that cannot use the worktree,
having any threads hurts.

In addition, during a grep that *can* use the worktree, THREADS=8
still helps somewhat on my dual-core i7, though it goes downhill from
there (12 is again as fast as 4; I verified these details using
best-of-50 timings, and it is reproducible.)

I have also run timings on a 2*6-core workstation running OS X, where
performance is best at 5 cores:

  2 threads:  0.96 real   0.41 user   1.27 sys
  3 threads:  0.68 real   0.41 user   1.30 sys
  4 threads:  0.54 real   0.43 user   1.63 sys
  5 threads:  0.50 real   0.41 user   1.51 sys
  6 threads:  0.54 real   0.43 user   1.63 sys
  7 threads:  0.86 real   0.49 user   1.93 sys
  8 threads:  0.98 real   0.51 user   2.07 sys

I kid you not.  That's best-of-50 and rather stable.  It's on the same
tree as the Linux machine too, except for the problem that the OS X FS
is set to case-insensitive and thus cannot represent the tree exactly.
So from git's POV, there are unstaged changes.

Sadly I do not have access to a Linux box having more than 2 physical
cores.  If you have one, please run some tests :-)

So based on my measurements, I would suggest that unless we have
evidence of it scaling beyond 8 cores on some machine, b2924dc (grep:
detect number of CPUs for thread spawning) be dropped.  For now I'm
ignoring the problem that on OS X it doesn't even scale to 8; I'd
rather check how it fares on Linux first.

I added a third patch on top that disables threading in any case that
does not hit the worktree.  I wonder if I missed something or if it
really is that simple.  The neat part is that it's also a reduction in
code required, and at the same time avoids any issues 2/3 might have
with a future attributes-from-trees implementation.

With this I get

  worktree, 8 threads: 0.15user 0.37system 0:00.17elapsed
  --cached, 8 threads: 2.18user 0.07system 0:02.27elapsed

Of course, we could probably gain a huge boost if the read_sha1
machinery could be made threaded, so that it can unpack several
objects at a time.  In addition, I can well imagine that there are
combinations of delta density, object size, and luck where it pays off
to grep in parallel.  Do we care?

Now I really should do something else than fretting over the
sub-second performance of git-grep...

Thomas Rast (3):
  grep: load funcname patterns for -W
  grep: enable threading with -p and -W using lazy attribute lookup
  grep: disable threading in all but worktree case

 builtin/grep.c  |  153 ++++++++++++++++--------------------------------------
 grep.c          |   73 ++++++++++++++++----------
 grep.h          |    7 +++
 t/t7810-grep.sh |   14 +++++
 4 files changed, 112 insertions(+), 135 deletions(-)

-- 
1.7.8.rc4.388.ge53ab

^ permalink raw reply

* [PATCH v2 1/3] grep: load funcname patterns for -W
From: Thomas Rast @ 2011-12-02 13:07 UTC (permalink / raw)
  To: René Scharfe; +Cc: Eric Herman, git, Junio C Hamano
In-Reply-To: <cover.1322830368.git.trast@student.ethz.ch>

git-grep avoids loading the funcname patterns unless they are needed.
ba8ea74 (grep: add option to show whole function as context,
2011-08-01) forgot to extend this test also to the new funcbody
feature.  Do so.

The catch is that we also have to disable threading when using
userdiff, as explained in grep_threads_ok().  So we must be careful to
introduce the same test there.
---
 grep.c          |    7 ++++---
 t/t7810-grep.sh |   14 ++++++++++++++
 2 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/grep.c b/grep.c
index b29d09c..7a070e9 100644
--- a/grep.c
+++ b/grep.c
@@ -948,8 +948,8 @@ int grep_threads_ok(const struct grep_opt *opt)
 	 * machinery in grep_buffer_1. The attribute code is not
 	 * thread safe, so we disable the use of threads.
 	 */
-	if (opt->funcname && !opt->unmatch_name_only && !opt->status_only &&
-	    !opt->name_only)
+	if ((opt->funcname || opt->funcbody)
+	    && !opt->unmatch_name_only && !opt->status_only && !opt->name_only)
 		return 0;
 
 	return 1;
@@ -1008,7 +1008,8 @@ static int grep_buffer_1(struct grep_opt *opt, const char *name,
 	}
 
 	memset(&xecfg, 0, sizeof(xecfg));
-	if (opt->funcname && !opt->unmatch_name_only && !opt->status_only &&
+	if ((opt->funcname || opt->funcbody)
+	    && !opt->unmatch_name_only && !opt->status_only &&
 	    !opt->name_only && !binary_match_only && !collect_hits) {
 		struct userdiff_driver *drv = userdiff_find_by_path(name);
 		if (drv && drv->funcname.pattern) {
diff --git a/t/t7810-grep.sh b/t/t7810-grep.sh
index 81263b7..7ba5b16 100755
--- a/t/t7810-grep.sh
+++ b/t/t7810-grep.sh
@@ -523,6 +523,20 @@ test_expect_success 'grep -W' '
 	test_cmp expected actual
 '
 
+cat >expected <<EOF
+hello.c=	printf("Hello world.\n");
+hello.c:	return 0;
+hello.c-	/* char ?? */
+EOF
+
+test_expect_success 'grep -W with userdiff' '
+	test_when_finished "rm -f .gitattributes" &&
+	git config diff.custom.xfuncname "(printf.*|})$" &&
+	echo "hello.c diff=custom" >.gitattributes &&
+	git grep -W return >actual &&
+	test_cmp expected actual
+'
+
 test_expect_success 'grep from a subdirectory to search wider area (1)' '
 	mkdir -p s &&
 	(
-- 
1.7.8.rc4.388.ge53ab

^ permalink raw reply related

* [PATCH v2 3/3] grep: disable threading in all but worktree case
From: Thomas Rast @ 2011-12-02 13:07 UTC (permalink / raw)
  To: René Scharfe; +Cc: Eric Herman, git, Junio C Hamano
In-Reply-To: <cover.1322830368.git.trast@student.ethz.ch>

Measuring grep performance showed that in all but the worktree case
(as opposed to --cached, <committish> or <treeish>), threading
actually slows things down.  For example, on my dual-core
hyperthreaded i7 in a linux-2.6.git at v2.6.37-rc2, I got:

Threads       worktree case                 | --cached case
--------------------------------------------------------------------------
8 (default) | 2.17user 0.15sys 0:02.20real  | 0.11user 0.26sys 0:00.11real
4           | 2.06user 0.17sys 0:02.08real  | 0.11user 0.26sys 0:00.12real
2           | 2.02user 0.25sys 0:02.08real  | 0.15user 0.37sys 0:00.28real
NO_PTHREADS | 1.57user 0.05sys 0:01.64real  | 0.09user 0.12sys 0:00.22real

I conjecture that this is caused by contention on read_sha1_mutex.

So disable threading entirely when not scanning the worktree, to get
the NO_PTHREADS performance in that case.  This obsoletes all code
related to grep_sha1_async.  The thread startup must be delayed until
after all arguments have been parsed, but this does not have a
measurable effect.
---
 builtin/grep.c |  157 ++++++++++++++++----------------------------------------
 1 files changed, 44 insertions(+), 113 deletions(-)

diff --git a/builtin/grep.c b/builtin/grep.c
index 65b1ffe..edf6a31 100644
--- a/builtin/grep.c
+++ b/builtin/grep.c
@@ -34,21 +34,13 @@
 		       const char *name);
 static void *load_file(const char *filename, size_t *sz);
 
-enum work_type {WORK_SHA1, WORK_FILE};
-
 /* We use one producer thread and THREADS consumer
  * threads. The producer adds struct work_items to 'todo' and the
  * consumers pick work items from the same array.
  */
 struct work_item {
-	enum work_type type;
 	char *name;
-
-	/* if type == WORK_SHA1, then 'identifier' is a SHA1,
-	 * otherwise type == WORK_FILE, and 'identifier' is a NUL
-	 * terminated filename.
-	 */
-	void *identifier;
+	char *filename;
 	char done;
 	struct strbuf out;
 };
@@ -86,21 +78,6 @@ static inline void grep_unlock(void)
 		pthread_mutex_unlock(&grep_mutex);
 }
 
-/* Used to serialize calls to read_sha1_file. */
-static pthread_mutex_t read_sha1_mutex;
-
-static inline void read_sha1_lock(void)
-{
-	if (use_threads)
-		pthread_mutex_lock(&read_sha1_mutex);
-}
-
-static inline void read_sha1_unlock(void)
-{
-	if (use_threads)
-		pthread_mutex_unlock(&read_sha1_mutex);
-}
-
 /* Signalled when a new work_item is added to todo. */
 static pthread_cond_t cond_add;
 
@@ -114,7 +91,7 @@ static inline void read_sha1_unlock(void)
 
 static int skip_first_line;
 
-static void add_work(enum work_type type, char *name, void *id)
+static void add_work(char *name, char *filename)
 {
 	grep_lock();
 
@@ -122,9 +99,8 @@ static void add_work(enum work_type type, char *name, void *id)
 		pthread_cond_wait(&cond_write, &grep_mutex);
 	}
 
-	todo[todo_end].type = type;
 	todo[todo_end].name = name;
-	todo[todo_end].identifier = id;
+	todo[todo_end].filename = filename;
 	todo[todo_end].done = 0;
 	strbuf_reset(&todo[todo_end].out);
 	todo_end = (todo_end + 1) % ARRAY_SIZE(todo);
@@ -152,19 +128,10 @@ static void add_work(enum work_type type, char *name, void *id)
 	return ret;
 }
 
-static void grep_sha1_async(struct grep_opt *opt, char *name,
-			    const unsigned char *sha1)
-{
-	unsigned char *s;
-	s = xmalloc(20);
-	memcpy(s, sha1, 20);
-	add_work(WORK_SHA1, name, s);
-}
-
 static void grep_file_async(struct grep_opt *opt, char *name,
 			    const char *filename)
 {
-	add_work(WORK_FILE, name, xstrdup(filename));
+	add_work(name, xstrdup(filename));
 }
 
 static void work_done(struct work_item *w)
@@ -194,7 +161,7 @@ static void work_done(struct work_item *w)
 			write_or_die(1, p, len);
 		}
 		free(w->name);
-		free(w->identifier);
+		free(w->filename);
 	}
 
 	if (old_done != todo_done)
@@ -213,29 +180,18 @@ static void work_done(struct work_item *w)
 
 	while (1) {
 		struct work_item *w = get_work();
+		size_t sz;
+		void* data;
+
 		if (!w)
 			break;
 
 		opt->output_priv = w;
-		if (w->type == WORK_SHA1) {
-			unsigned long sz;
-			void* data = load_sha1(w->identifier, &sz, w->name);
-
-			if (data) {
-				hit |= grep_buffer(opt, w->name, data, sz);
-				free(data);
-			}
-		} else if (w->type == WORK_FILE) {
-			size_t sz;
-			void* data = load_file(w->identifier, &sz);
-			if (data) {
-				hit |= grep_buffer(opt, w->name, data, sz);
-				free(data);
-			}
-		} else {
-			assert(0);
+		data = load_file(w->filename, &sz);
+		if (data) {
+			hit |= grep_buffer(opt, w->name, data, sz);
+			free(data);
 		}
-
 		work_done(w);
 	}
 	free_grep_patterns(arg);
@@ -255,7 +211,6 @@ static void start_threads(struct grep_opt *opt)
 	int i;
 
 	pthread_mutex_init(&grep_mutex, NULL);
-	pthread_mutex_init(&read_sha1_mutex, NULL);
 	pthread_mutex_init(&grep_attr_mutex, NULL);
 	pthread_cond_init(&cond_add, NULL);
 	pthread_cond_init(&cond_write, NULL);
@@ -303,7 +258,6 @@ static int wait_all(void)
 	}
 
 	pthread_mutex_destroy(&grep_mutex);
-	pthread_mutex_destroy(&read_sha1_mutex);
 	pthread_mutex_destroy(&grep_attr_mutex);
 	pthread_cond_destroy(&cond_add);
 	pthread_cond_destroy(&cond_write);
@@ -312,9 +266,6 @@ static int wait_all(void)
 	return hit;
 }
 #else /* !NO_PTHREADS */
-#define read_sha1_lock()
-#define read_sha1_unlock()
-
 static int wait_all(void)
 {
 	return 0;
@@ -371,21 +322,11 @@ static int grep_config(const char *var, const char *value, void *cb)
 	return 0;
 }
 
-static void *lock_and_read_sha1_file(const unsigned char *sha1, enum object_type *type, unsigned long *size)
-{
-	void *data;
-
-	read_sha1_lock();
-	data = read_sha1_file(sha1, type, size);
-	read_sha1_unlock();
-	return data;
-}
-
 static void *load_sha1(const unsigned char *sha1, unsigned long *size,
 		       const char *name)
 {
 	enum object_type type;
-	void *data = lock_and_read_sha1_file(sha1, &type, size);
+	void *data = read_sha1_file(sha1, &type, size);
 
 	if (!data)
 		error(_("'%s': unable to read %s"), name, sha1_to_hex(sha1));
@@ -398,6 +339,9 @@ static int grep_sha1(struct grep_opt *opt, const unsigned char *sha1,
 {
 	struct strbuf pathbuf = STRBUF_INIT;
 	char *name;
+	int hit;
+	unsigned long sz;
+	void *data;
 
 	if (opt->relative && opt->prefix_length) {
 		quote_path_relative(filename + tree_name_len, -1, &pathbuf,
@@ -409,25 +353,15 @@ static int grep_sha1(struct grep_opt *opt, const unsigned char *sha1,
 
 	name = strbuf_detach(&pathbuf, NULL);
 
-#ifndef NO_PTHREADS
-	if (use_threads) {
-		grep_sha1_async(opt, name, sha1);
-		return 0;
-	} else
-#endif
-	{
-		int hit;
-		unsigned long sz;
-		void *data = load_sha1(sha1, &sz, name);
-		if (!data)
-			hit = 0;
-		else
-			hit = grep_buffer(opt, name, data, sz);
+	data = load_sha1(sha1, &sz, name);
+	if (!data)
+		hit = 0;
+	else
+		hit = grep_buffer(opt, name, data, sz);
 
-		free(data);
-		free(name);
-		return hit;
-	}
+	free(data);
+	free(name);
+	return hit;
 }
 
 static void *load_file(const char *filename, size_t *sz)
@@ -586,7 +520,7 @@ static int grep_tree(struct grep_opt *opt, const struct pathspec *pathspec,
 			void *data;
 			unsigned long size;
 
-			data = lock_and_read_sha1_file(entry.sha1, &type, &size);
+			data = read_sha1_file(entry.sha1, &type, &size);
 			if (!data)
 				die(_("unable to read tree (%s)"),
 				    sha1_to_hex(entry.sha1));
@@ -616,10 +550,8 @@ static int grep_object(struct grep_opt *opt, const struct pathspec *pathspec,
 		struct strbuf base;
 		int hit, len;
 
-		read_sha1_lock();
 		data = read_object_with_reference(obj->sha1, tree_type,
 						  &size, NULL);
-		read_sha1_unlock();
 
 		if (!data)
 			die(_("unable to read tree (%s)"), sha1_to_hex(obj->sha1));
@@ -1003,26 +935,6 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 	if (!opt.fixed && opt.ignore_case)
 		opt.regflags |= REG_ICASE;
 
-#ifndef NO_PTHREADS
-	if (online_cpus() == 1)
-		use_threads = 0;
-#else
-	use_threads = 0;
-#endif
-
-	opt.use_threads = use_threads;
-
-#ifndef NO_PTHREADS
-	if (use_threads) {
-		if (opt.pre_context || opt.post_context || opt.file_break ||
-		    opt.funcbody)
-			skip_first_line = 1;
-		start_threads(&opt);
-	}
-#else
-	use_threads = 0;
-#endif
-
 	compile_grep_patterns(&opt);
 
 	/* Check revs and then paths */
@@ -1044,6 +956,25 @@ int cmd_grep(int argc, const char **argv, const char *prefix)
 		break;
 	}
 
+#ifndef NO_PTHREADS
+	if (online_cpus() == 1 || cached || list.nr)
+		use_threads = 0;
+#else
+	use_threads = 0;
+#endif
+
+	opt.use_threads = use_threads;
+
+#ifndef NO_PTHREADS
+	if (use_threads) {
+		opt.use_threads = use_threads;
+		if (opt.pre_context || opt.post_context || opt.file_break ||
+		    opt.funcbody)
+			skip_first_line = 1;
+		start_threads(&opt);
+	}
+#endif
+
 	/* The rest are paths */
 	if (!seen_dashdash) {
 		int j;
-- 
1.7.8.rc4.388.ge53ab

^ permalink raw reply related

* Re: Suggestion on hashing
From: Nguyen Thai Ngoc Duy @ 2011-12-02 14:22 UTC (permalink / raw)
  To: Bill Zaumen; +Cc: Jeff King, Git Mailing List
In-Reply-To: <1322813319.4340.109.camel@yos>

(I'm not sure why you dropped git@vger. I see nothing private here so
I bring git@vger back)

On Fri, Dec 2, 2011 at 3:08 PM, Bill Zaumen <bill.zaumen@gmail.com> wrote:
> At one point Nguyen said that "What I'm thinking is whether it's
> possible to decouple two sha-1 roles in git, as object identifier
> and digest, separately. Each sha-1 identifies an object and an extra
> set of digests on the "same" object."
>
> My code pretty much does that (it just uses a CRC instead of a real
> digest, but I can easily change that).

It'd be easier to look at your code if you split it into a series of
smaller patches.

> So the question is whether
> using SHA-1 as an ID and SHA-256(?) as a digest is a better long term
> solution than simply replacing SHA-1.

I would not stick with any algorithm permanently. No one knows when
SHA-256 might be broken.

> If there is some interest in pursuing it further, I could make those
> changes fairly easily.  Then you'd have two message digests, a SHA-1
> and a longer one, with the longer one stored parallel to the actual
> object. Then it becomes easy to compute a digest of all the digests
> in a commit's tree and store that in a commit, if that is what you
> want to do.

I personally would like to see how it works out especially when
computing new digests is much more expensive than SHA-1. And I hope
that by delaying computing new digests (stored outside actual
objects), we could make minimum code changes to git. Though security
concerns may be the killer factor and I haven't worked that out yet.

> Replacing SHA-1 with something like SHA-256 sounds easier to implement,

SHA-1 charateristics (like 20 byte length) are hard coded everywhere
in git, it'd be a big audit.

> but the problem is all the existing repositories.  While rewriting all
> the objects and trees to use new hashes is similar to a rebase in most
> cases, there is a complication - submodules.  Git stores the hash of
> a submodule's commit in its tree because a particular revision of
> a project 'goes' with a particular revision of a submodule. But, a
> submodule can exist in one revision and not in the next or previous
> revision  Furthermore A could be a submodule of B at one point in time,
> and many commits later, B could end up being a submodule of A.
> Fixing it up could be pretty complicated (plus having to deal with
> network failures - to update GitHub for example, you'd have to download
> submodules it uses, possibly from somewhere else and some submodules may
> not be publicly accessible (e.g., a private project kept on GitHub but
> with a critical submodule kept in house behind a corporate firewall).
> Also, you might have to update a git repository and its submodules
> concurrently, so that you always can find a new value when you need
> it.
>
> My guess is that this could be far more complicated than what I did.
> Excluding two files that are not used (the symbol PACKDB is not
> defined), I added two new files, crcdb.h and objd-crcdb.c which store
> CRCs for loose objects - 517 lines total including lots of comments in
> the header file - full documentation for each function.  The other
> changes include 1475 lines of new code in previously existing git files
> and 136 deletions (most trivial).  There were also minor changes to
> the makefile and test scripts.

You'd need to convince git maintainer this is worth doing first,
before talking how big the changes are ;-)

> Bill
-- 
Duy

^ permalink raw reply

* Re: Proposal: create meaningful aliases for git reset's hard/soft/mixed
From: Philippe Vaucher @ 2011-12-02 14:27 UTC (permalink / raw)
  To: Phil Hord; +Cc: Junio C Hamano, git, Christian Couder
In-Reply-To: <CABURp0rtCUbJXLHtXv_1g6GRKL3mX-T+3vN1=QO4CUibqXdEMg@mail.gmail.com>

> > Why worse? I'd understand if you said it's doesn't improve it enough
> > for it to be worth the change tho.
>
> I think that's what "you should aim higher" means.

Yes, but my question was why was the proposal _worse_ in his mind.
Anyway, it's not really important, probably something he typed in a
hurry.


> How about:
>  --soft: git checkout -B <commit>
>  --mixed: git reset -- <paths>
>  --hard:  git checkout --clean

I like the idea... but as other pointed out those are not equivalent.

Maybe we'd start by listing the features we want to be able to do:

- Move git's HEAD to a particular commit without touching the files or the index
- Move git's HEAD to a particular commit and clear the index but
without touching the files
- Move git's HEAD to a particular commit and clear the index and have
all the files match that particular commit files
- Move git's HEAD to a particular commit and clear the index and have
all the files match that particular commit files and remove files that
are unknown to that commit

Is there a scenario I'm missing? Once we have the scenarios nailed
down we can start thinking about how to express them.

Philippe

^ permalink raw reply

* Re: Workflow Recommendation - Probably your 1000th
From: Stephen Bash @ 2011-12-02 15:14 UTC (permalink / raw)
  To: bradford; +Cc: git
In-Reply-To: <CAEbKVFQLvyTq+VL9DJZtp4YZLUgeR56N9u5RrsGqEB=e81O3zQ@mail.gmail.com>

----- Original Message -----
> From: "bradford" <fingermark@gmail.com>
> To: "Stephen Bash" <bash@genarts.com>
> Cc: git@vger.kernel.org
> Sent: Thursday, December 1, 2011 3:46:52 PM
> Subject: Re: Workflow Recommendation - Probably your 1000th
> 
> Thanks, Stephen.   I guess I'm looking for more input on the
> advantages and disadvantages of using a QA and production branch vs
> just doing everything out of master.
> 
> Trying to go through the following:
> http://news.ycombinator.com/item?id=1617425
> scottchacon.com/2011/08/31/github-flow.html
> 
> We have some weeks where we release very frequently and some weeks
> where we release only once a week and have to do production fixes in
> the meantime.  Sure other people have similar experiences.

Before continuing I guess two key assumptions factor into our workflow:
 1) we still work in a traditional major/minor release cycle with potentially weeks or even months between releases
 2) our customers can be running almost any historical version of our software

>From that perspective having a maintenance branch for each major revision of our software gives us a holding area where devs can fix bugs at any time without necessarily going through the entire tag/release/merge process (you can envision a "hot fix branch" that is long-lived).  For example, we often have documentation fixes that will sit on the maintenance branch until a software fix needs to go out.  But other non-critical fixes also end up waiting on something that really requires a maintenance release (or enough fixes pile up and necessitate a release themselves).

HTH,
Stephen

^ permalink raw reply

* Re: Proposal: create meaningful aliases for git reset's hard/soft/mixed
From: Phil Hord @ 2011-12-02 15:28 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Philippe Vaucher, Junio C Hamano, git, Christian Couder
In-Reply-To: <201112020826.14114.trast@student.ethz.ch>

On Fri, Dec 2, 2011 at 2:26 AM, Thomas Rast <trast@student.ethz.ch> wrote:
> Phil Hord wrote:
>>
>> Think outside the "reset" command.  Like this:
>>
>> From the "most popular" comment on http://progit.org/2011/07/11/reset.html:
>> > I remember them as:
>> > --soft      -> git uncommit
>> > --mixed  -> git unadd
>> > --hard     -> git undo
>>
>> I don't particular like these names, but conceptually they are helpful.
>
> I think all of these, but the last one in particular, are *very*
> dangerous oversimplifications.  Doubly so if you then use "undo" with
> a revision argument.

I agree.  That's why I also said this:

> How about:
>  --soft: git checkout -B <commit>
>  --mixed: git reset -- <paths>
>  --hard:  git checkout --clean

But maybe I wasn't clear enough.  I'm not suggesting git-alias for
these.  I am proposing new commands to replace common usages of
git-reset.  These commands would need basic safeguards against
foot-shooting, of course.

Phil

^ permalink raw reply

* Re: Proposal: create meaningful aliases for git reset's hard/soft/mixed
From: Phil Hord @ 2011-12-02 15:38 UTC (permalink / raw)
  To: Philippe Vaucher; +Cc: Junio C Hamano, git, Christian Couder
In-Reply-To: <CAGK7Mr7zdstbm7QsrYq9a6m9ui_r8Ak8XtyWADLQ0n-mXiov4w@mail.gmail.com>

On Fri, Dec 2, 2011 at 9:27 AM, Philippe Vaucher
<philippe.vaucher@gmail.com> wrote:
> Maybe we'd start by listing the features we want to be able to do:
>
> - Move git's HEAD to a particular commit without touching the files or the index
> - Move git's HEAD to a particular commit and clear the index but
> without touching the files
> - Move git's HEAD to a particular commit and clear the index and have
> all the files match that particular commit files
> - Move git's HEAD to a particular commit and clear the index and have
> all the files match that particular commit files and remove files that
> are unknown to that commit
>
> Is there a scenario I'm missing? Once we have the scenarios nailed
> down we can start thinking about how to express them.

Aim higher.

Do not think about the git-reset command and all of its features.
Moreover, do not limit yourself to git-reset's functionality.

Think about why you need to use git-reset.  Why do new users need to
use git-reset?  What is it they are after?

For me, it was the three I mentioned before.

So, let's look at yours:

> - Move git's HEAD to a particular commit without touching the files or the index

I know what this is, but I don't know to describe it without saying
"reset".  It's like teleportation.  "Move me to a new location in the
tree".
git teleport <commit>


> - Move git's HEAD to a particular commit and clear the index but
> without touching the files

git teleport --index <commit>


> - Move git's HEAD to a particular commit and clear the index and have
> all the files match that particular commit files

git checkout --clean <commit>


> - Move git's HEAD to a particular commit and clear the index and have
> all the files match that particular commit files and remove files that
> are unknown to that commit

git checkout --clean <commit> && git clean -fd  # maybe this needs a switch?


One you left out is this:
- Do NOT move git's HEAD; clear the index and workdir

git reset


I think the ability to move git's HEAD is what makes reset dangerous,
especially in the hands of new users.

Phil

^ permalink raw reply

* Re: [PATCH v2 3/3] grep: disable threading in all but worktree case
From: René Scharfe @ 2011-12-02 16:15 UTC (permalink / raw)
  To: Thomas Rast; +Cc: Eric Herman, git, Junio C Hamano
In-Reply-To: <5328add8b32f83b4cdbd2e66283f77c125ec127a.1322830368.git.trast@student.ethz.ch>

Am 02.12.2011 14:07, schrieb Thomas Rast:
> Measuring grep performance showed that in all but the worktree case
> (as opposed to --cached,<committish>  or<treeish>), threading
> actually slows things down.  For example, on my dual-core
> hyperthreaded i7 in a linux-2.6.git at v2.6.37-rc2, I got:
>
> Threads       worktree case                 | --cached case
> --------------------------------------------------------------------------
> 8 (default) | 2.17user 0.15sys 0:02.20real  | 0.11user 0.26sys 0:00.11real
> 4           | 2.06user 0.17sys 0:02.08real  | 0.11user 0.26sys 0:00.12real
> 2           | 2.02user 0.25sys 0:02.08real  | 0.15user 0.37sys 0:00.28real
> NO_PTHREADS | 1.57user 0.05sys 0:01.64real  | 0.09user 0.12sys 0:00.22real

Are the columns mixed up?

> I conjecture that this is caused by contention on read_sha1_mutex.

Yeah, and I wonder why we need to have this lock in the first place. In 
theory, multiple readers shouldn't have to affect each other at all, 
right?  The lock could be pushed down into read_sha1_file(), or a 
thread-safe variant of the function added.

In pratice, however, the code in sha1_file.c etc. scares me. ;-)

> So disable threading entirely when not scanning the worktree, to get
> the NO_PTHREADS performance in that case.  This obsoletes all code
> related to grep_sha1_async.  The thread startup must be delayed until
> after all arguments have been parsed, but this does not have a
> measurable effect.

This is a bit radical.  I think the underlying issue that 
read_sha1_file() is not thread-safe can be solved eventually and then 
we'd need to readd that code.

How about adding a parameter to control the number of threads 
(--threads?) instead that defaults to eight (or five) for the worktree 
and one for the rest?  That would also make benchmarking easier.

René

PS: Patches one and three missed a signoff.

^ permalink raw reply

* Re: Git Install link is broken
From: Konstantin Khomoutov @ 2011-12-02 16:16 UTC (permalink / raw)
  To: Graham Wideman; +Cc: git
In-Reply-To: <20111201235608.EGUQ3756.fed1rmfepo203.cox.net@fed1rmimpo306.cox.net>

On Thu, 01 Dec 2011 15:56:12 -0800
Graham Wideman <initcontact@grahamwideman.com> wrote:

> On this page:
> http://code.google.com/p/msysgit/
> 
> the links to "install msysGit" point to:
> https://git.wiki.kernel.org/index.php/MSysGit:InstallMSysGit
> 
> .. which returns page not found error.
http://groups.google.com/group/msysgit/browse_thread/thread/6412cc38d14b612d/37c8653e45dcc14c

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox