Git development


* [PATCH v3 4/6] bulk-checkin: allow the same data to be multiply hashed
From: Junio C Hamano @ 2011-12-02  0:40 UTC
  To: git
In-Reply-To: <1322786449-25753-1-git-send-email-gitster@pobox.com>

This updates stream_to_pack() machinery to feed the data it is writing out
to multiple hash contexts at the same time. Right now we only use a single
git_SHA_CTX, so there is no change in functionality.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
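A minimal sketch of the chaining idea, for readers who want to see it in
isolation. It is not the patch's code: a toy 64-bit FNV-1a hash stands in
for git_SHA_CTX, and every toy_* name and the two demo helpers are
hypothetical. As in the patch, only the payload flows up the "up" chain,
so a parent that sees two chunks in order ends up with the same hash as a
single context fed the whole buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for git_SHA_CTX: 64-bit FNV-1a, NOT SHA-1. */
struct toy_hash {
	uint64_t h;
};

static void toy_init(struct toy_hash *t)
{
	t->h = 14695981039346656037ULL; /* FNV-1a offset basis */
}

static void toy_update(struct toy_hash *t, const unsigned char *buf, size_t size)
{
	size_t i;

	for (i = 0; i < size; i++) {
		t->h ^= buf[i];
		t->h *= 1099511628211ULL; /* FNV-1a prime */
	}
}

/* Same shape as the patch's struct chunk_ctx. */
struct toy_chunk_ctx {
	struct toy_chunk_ctx *up;
	struct toy_hash ctx;
};

/* Feed buf to this context and to every ancestor reachable via "up". */
static void toy_chunk_update(struct toy_chunk_ctx *ctx,
			     const unsigned char *buf, size_t size)
{
	while (ctx) {
		toy_update(&ctx->ctx, buf, size);
		ctx = ctx->up;
	}
}

/*
 * Hypothetical demo helper: hash data as two chunks whose contexts
 * chain to a shared parent, and return the parent's hash.
 */
static uint64_t hash_via_two_chunks(const unsigned char *data,
				    size_t size, size_t split)
{
	struct toy_chunk_ctx parent, first, second;

	assert(split <= size);
	parent.up = NULL;
	toy_init(&parent.ctx);
	first.up = &parent;
	toy_init(&first.ctx);
	second.up = &parent;
	toy_init(&second.ctx);

	toy_chunk_update(&first, data, split);
	toy_chunk_update(&second, data + split, size - split);
	return parent.ctx.h;
}

/* Reference: hash the whole buffer with a single, unchained context. */
static uint64_t hash_whole(const unsigned char *data, size_t size)
{
	struct toy_hash t;

	toy_init(&t);
	toy_update(&t, data, size);
	return t.h;
}
```

Calling hash_via_two_chunks() with any split point should give the same
value as hash_whole() on the same buffer, which is the property a later
patch relies on when one read of the data has to feed several hashes.
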
 bulk-checkin.c |   33 +++++++++++++++++++++++++--------
 1 files changed, 25 insertions(+), 8 deletions(-)

diff --git a/bulk-checkin.c b/bulk-checkin.c
index 6b0b6d4..6f1ce58 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -75,6 +75,20 @@ static int already_written(struct bulk_checkin_state *state, unsigned char sha1[
 	return 0;
 }
 
+struct chunk_ctx {
+	struct chunk_ctx *up;
+	git_SHA_CTX ctx;
+};
+
+static void chunk_SHA1_Update(struct chunk_ctx *ctx,
+			      const unsigned char *buf, size_t size)
+{
+	while (ctx) {
+		git_SHA1_Update(&ctx->ctx, buf, size);
+		ctx = ctx->up;
+	}
+}
+
 /*
  * Read the contents from fd for size bytes, streaming it to the
  * packfile in state while updating the hash in ctx. Signal a failure
@@ -91,7 +105,7 @@ static int already_written(struct bulk_checkin_state *state, unsigned char sha1[
  * with a new pack.
  */
 static int stream_to_pack(struct bulk_checkin_state *state,
-			  git_SHA_CTX *ctx, off_t *already_hashed_to,
+			  struct chunk_ctx *ctx, off_t *already_hashed_to,
 			  int fd, size_t size, enum object_type type,
 			  const char *path, unsigned flags)
 {
@@ -123,7 +137,7 @@ static int stream_to_pack(struct bulk_checkin_state *state,
 				if (rsize < hsize)
 					hsize = rsize;
 				if (hsize)
-					git_SHA1_Update(ctx, ibuf, hsize);
+					chunk_SHA1_Update(ctx, ibuf, hsize);
 				*already_hashed_to = offset;
 			}
 			s.next_in = ibuf;
@@ -185,10 +199,11 @@ static int deflate_to_pack(struct bulk_checkin_state *state,
 			   unsigned char result_sha1[],
 			   int fd, size_t size,
 			   enum object_type type, const char *path,
-			   unsigned flags)
+			   unsigned flags,
+			   struct chunk_ctx *up)
 {
 	off_t seekback, already_hashed_to;
-	git_SHA_CTX ctx;
+	struct chunk_ctx ctx;
 	unsigned char obuf[16384];
 	unsigned header_len;
 	struct sha1file_checkpoint checkpoint;
@@ -200,8 +215,10 @@ static int deflate_to_pack(struct bulk_checkin_state *state,
 
 	header_len = sprintf((char *)obuf, "%s %" PRIuMAX,
 			     typename(type), (uintmax_t)size) + 1;
-	git_SHA1_Init(&ctx);
-	git_SHA1_Update(&ctx, obuf, header_len);
+	memset(&ctx, 0, sizeof(ctx));
+	ctx.up = up;
+	git_SHA1_Init(&ctx.ctx);
+	git_SHA1_Update(&ctx.ctx, obuf, header_len);
 
 	/* Note: idx is non-NULL when we are writing */
 	if ((flags & HASH_WRITE_OBJECT) != 0)
@@ -232,7 +249,7 @@ static int deflate_to_pack(struct bulk_checkin_state *state,
 		if (lseek(fd, seekback, SEEK_SET) == (off_t) -1)
 			return error("cannot seek back");
 	}
-	git_SHA1_Final(result_sha1, &ctx);
+	git_SHA1_Final(result_sha1, &ctx.ctx);
 	if (!idx)
 		return 0;
 
@@ -256,7 +273,7 @@ int index_bulk_checkin(unsigned char *sha1,
 		       const char *path, unsigned flags)
 {
 	int status = deflate_to_pack(&state, sha1, fd, size, type,
-				     path, flags);
+				     path, flags, NULL);
 	if (!state.plugged)
 		finish_bulk_checkin(&state);
 	return status;
-- 
1.7.8.rc4.177.g4d64


* [PATCH v3 5/6] bulk-checkin: support chunked-object encoding
From: Junio C Hamano @ 2011-12-02  0:40 UTC
  To: git
In-Reply-To: <1322786449-25753-1-git-send-email-gitster@pobox.com>

Instead of recording a huge object as a single entry in the packfile,
record it as a concatenation of smaller blob objects. This is primarily to
make it possible to repack repositories with huge objects and transfer
huge objects out of such repositories.

This is the first step of a long and painful journey. We would still need
to teach many codepaths about the new encoding:

 * The streaming checkout codepath must learn how to read these from the
   object store and write a concatenation of component blob contents to
   the working tree. Note that after a repack, a component blob object
   could be represented as a delta to another blob object.

 * The in-core object reading codepath must learn to notice and at least
   reject reading such objects entirely in-core. It is expected that
   nobody is interested in producing a patch out of these huge objects, at
   least initially.

 * The object connectivity machinery must learn that component blob
   objects are reachable from an object that uses them, so that "gc"
   will not lose them, and "fsck" will not complain about them.

 * The pack-object machinery must learn to copy an object that is encoded
   in chunked-object encoding literally to the destination (while perhaps
   validating the object name).

 * The index-pack and verify-pack machineries need to be told about it.

The split-chunk logic used here is kept deliberately useless, both to
avoid distracting the reviewers and to make sure that the chunked
encoding machinery does not depend on any particular chunk-splitting
heuristic. We may want to replace it with a better heuristic, perhaps
the one used in "bup" that is based on a self-synchronizing rolling
checksum, or something similar.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
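A rough sketch of how the carve loop in store_in_chunks() divides an
object, under the deliberately trivial fixed-size cut used in this patch.
The helper names sketch_plan_chunks() and sketch_names_size() are made up
for illustration; the real code interleaves carving with writing each
component blob, and may chunk an oversized tail again.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: split size into at most max_chunks pieces.
 * The first max_chunks - 1 pieces are at most target bytes each;
 * whatever is left over becomes the final piece. Returns the number
 * of pieces written to out[], which must have room for max_chunks.
 */
static size_t sketch_plan_chunks(size_t size, size_t target,
				 size_t max_chunks, size_t *out)
{
	size_t cnt, remainder = size;

	assert(target > 0 && max_chunks > 0);
	for (cnt = 0; remainder && cnt < max_chunks - 1; cnt++) {
		size_t chunk = remainder < target ? remainder : target;

		out[cnt] = chunk;
		remainder -= chunk;
	}
	if (remainder)
		out[cnt++] = remainder; /* tail; the patch may chunk it again */
	return cnt;
}

/*
 * Bytes that follow the in-pack header and the chunk count:
 * one 20-byte name per chunk plus the 20-byte resulting object name.
 */
static size_t sketch_names_size(size_t chunk_cnt)
{
	return 20 * (chunk_cnt + 1);
}
```

With the 4800kB file from the new test and a 512kB target this plans nine
full chunks and one short one, so the chunked entry itself carries only
220 bytes of object names after its header.
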
 Makefile         |    1 +
 bulk-checkin.c   |  127 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 bulk-checkin.h   |    1 +
 cache.h          |    9 +++-
 config.c         |    5 ++
 environment.c    |    1 +
 split-chunk.c    |   28 ++++++++++++
 t/t1050-large.sh |   29 ++++++++++++
 8 files changed, 198 insertions(+), 3 deletions(-)
 create mode 100644 split-chunk.c

diff --git a/Makefile b/Makefile
index 418dd2e..7d2fc3a 100644
--- a/Makefile
+++ b/Makefile
@@ -679,6 +679,7 @@ LIB_OBJS += sha1_name.o
 LIB_OBJS += shallow.o
 LIB_OBJS += sideband.o
 LIB_OBJS += sigchain.o
+LIB_OBJS += split-chunk.o
 LIB_OBJS += strbuf.o
 LIB_OBJS += streaming.o
 LIB_OBJS += string-list.o
diff --git a/bulk-checkin.c b/bulk-checkin.c
index 6f1ce58..ebdacb8 100644
--- a/bulk-checkin.c
+++ b/bulk-checkin.c
@@ -5,6 +5,10 @@
 #include "csum-file.h"
 #include "pack.h"
 
+#ifndef CHUNK_MAX
+#define CHUNK_MAX 2000
+#endif
+
 static int pack_compression_level = Z_DEFAULT_COMPRESSION;
 
 static struct bulk_checkin_state {
@@ -268,12 +272,131 @@ static int deflate_to_pack(struct bulk_checkin_state *state,
 	return 0;
 }
 
+/* This is only called when actually writing the object out */
+static int store_in_chunks(struct bulk_checkin_state *state,
+			   unsigned char result_sha1[],
+			   int fd, size_t size,
+			   enum object_type type, const char *path,
+			   unsigned flags,
+			   struct chunk_ctx *up)
+{
+	struct pack_idx_entry *idx;
+	struct chunk_ctx ctx;
+	unsigned char chunk[CHUNK_MAX][20];
+	unsigned char header[100];
+	unsigned header_len;
+	unsigned chunk_cnt, i;
+	size_t remainder = size;
+	size_t write_size;
+	int status;
+
+	header_len = sprintf((char *)header, "%s %" PRIuMAX,
+			     typename(type), (uintmax_t)size) + 1;
+
+	memset(&ctx, 0, sizeof(ctx));
+	ctx.up = up;
+	git_SHA1_Init(&ctx.ctx);
+	git_SHA1_Update(&ctx.ctx, header, header_len);
+
+	for (chunk_cnt = 0, remainder = size;
+	     remainder && chunk_cnt < CHUNK_MAX - 1;
+	     chunk_cnt++) {
+		size_t chunk_size = carve_chunk(fd, remainder);
+		status = deflate_to_pack(state, chunk[chunk_cnt], fd,
+					 chunk_size, OBJ_BLOB, path, flags,
+					 &ctx);
+		if (status)
+			return status;
+		remainder -= chunk_size;
+	}
+
+	if (remainder) {
+		if (split_size_limit_cfg && split_size_limit_cfg < remainder)
+			status = store_in_chunks(state, chunk[chunk_cnt], fd,
+						 remainder, OBJ_BLOB, path, flags,
+						 &ctx);
+		else
+			status = deflate_to_pack(state, chunk[chunk_cnt], fd,
+						 remainder, OBJ_BLOB, path, flags,
+						 &ctx);
+		if (status)
+			return status;
+		chunk_cnt++;
+	}
+
+	/*
+	 * Now we have chunk_cnt chunks (the last one may be a large
+	 * blob that itself is represented as concatenation of
+	 * multiple blobs).
+	 */
+	git_SHA1_Final(result_sha1, &ctx.ctx);
+	if (already_written(state, result_sha1))
+		return 0;
+
+	/*
+	 * The standard type & size header is followed by
+	 * - the number of chunks (varint)
+	 * - the object names of the chunks (20 * chunk_cnt bytes)
+	 * - the resulting object name (20 bytes)
+	 */
+	type = OBJ_CHUNKED(type);
+	header_len = encode_in_pack_object_header(type, size, header);
+	header_len += encode_in_pack_varint(chunk_cnt, header + header_len);
+
+	write_size = header_len + 20 * (chunk_cnt + 1);
+
+	prepare_to_stream(state, flags);
+	if (state->nr_written &&
+	    pack_size_limit_cfg &&
+	    pack_size_limit_cfg < (state->offset + write_size)) {
+		finish_bulk_checkin(state);
+		prepare_to_stream(state, flags);
+	}
+
+	idx = xcalloc(1, sizeof(*idx));
+	idx->offset = state->offset;
+
+	crc32_begin(state->f);
+	sha1write(state->f, header, header_len);
+	for (i = 0; i < chunk_cnt; i++)
+		sha1write(state->f, chunk[i], 20);
+	sha1write(state->f, result_sha1, 20);
+	idx->crc32 = crc32_end(state->f);
+	hashcpy(idx->sha1, result_sha1);
+	ALLOC_GROW(state->written,
+		   state->nr_written + 1,
+		   state->alloc_written);
+	state->written[state->nr_written++] = idx;
+	state->offset += write_size;
+
+	return 0;
+}
+
 int index_bulk_checkin(unsigned char *sha1,
 		       int fd, size_t size, enum object_type type,
 		       const char *path, unsigned flags)
 {
-	int status = deflate_to_pack(&state, sha1, fd, size, type,
-				     path, flags, NULL);
+	int status;
+
+	/*
+	 * For now, we only deal with blob objects, as validation
+	 * of a huge tree object that is split into chunks will be
+	 * too cumbersome to be worth it.
+	 *
+	 * Note that we only have to use store_in_chunks() codepath
+	 * when we are actually writing things out. deflate_to_pack()
+	 * codepath can hash arbitrarily huge object without keeping
+	 * everything in core just fine.
+	 */
+	if ((flags & HASH_WRITE_OBJECT) &&
+	    type == OBJ_BLOB &&
+	    split_size_limit_cfg &&
+	    split_size_limit_cfg < size)
+		status = store_in_chunks(&state, sha1, fd, size, type,
+					 path, flags, NULL);
+	else
+		status = deflate_to_pack(&state, sha1, fd, size, type,
+					 path, flags, NULL);
 	if (!state.plugged)
 		finish_bulk_checkin(&state);
 	return status;
diff --git a/bulk-checkin.h b/bulk-checkin.h
index 4f599f8..89f1741 100644
--- a/bulk-checkin.h
+++ b/bulk-checkin.h
@@ -12,5 +12,6 @@ extern int index_bulk_checkin(unsigned char sha1[],
 
 extern void plug_bulk_checkin(void);
 extern void unplug_bulk_checkin(void);
+extern size_t carve_chunk(int fd, size_t size);
 
 #endif
diff --git a/cache.h b/cache.h
index 4a3b421..374f712 100644
--- a/cache.h
+++ b/cache.h
@@ -384,11 +384,17 @@ enum object_type {
 	OBJ_EXT = 5,
 	OBJ_OFS_DELTA = 6,
 	OBJ_REF_DELTA = 7,
+	/* OBJ_CHUNKED_COMMIT = 8 */
+	/* OBJ_CHUNKED_TREE = 9 */
+	OBJ_CHUNKED_BLOB = 10,
+	/* OBJ_CHUNKED_TAG = 11 */
 	OBJ_ANY,
 	OBJ_MAX
 };
 #define OBJ_LAST_BASE_TYPE OBJ_REF_DELTA
-#define OBJ_LAST_VALID_TYPE OBJ_REF_DELTA
+#define OBJ_LAST_VALID_TYPE OBJ_CHUNKED_BLOB
+#define OBJ_CHUNKED(type) ((type) + 7)
+#define OBJ_DEKNUHC(type) ((type) - 7)
 
 static inline enum object_type object_type(unsigned int mode)
 {
@@ -602,6 +608,7 @@ extern size_t packed_git_limit;
 extern size_t delta_base_cache_limit;
 extern unsigned long big_file_threshold;
 extern unsigned long pack_size_limit_cfg;
+extern unsigned long split_size_limit_cfg;
 extern int read_replace_refs;
 extern int fsync_object_files;
 extern int core_preload_index;
diff --git a/config.c b/config.c
index c736802..bdc3be1 100644
--- a/config.c
+++ b/config.c
@@ -801,6 +801,11 @@ int git_default_config(const char *var, const char *value, void *dummy)
 		pack_size_limit_cfg = git_config_ulong(var, value);
 		return 0;
 	}
+
+	if (!strcmp(var, "pack.splitsizelimit")) {
+		split_size_limit_cfg = git_config_ulong(var, value);
+		return 0;
+	}
 	/* Add other config variables here and to Documentation/config.txt. */
 	return 0;
 }
diff --git a/environment.c b/environment.c
index 31e4284..b66df28 100644
--- a/environment.c
+++ b/environment.c
@@ -61,6 +61,7 @@ int grafts_replace_parents = 1;
 int core_apply_sparse_checkout;
 struct startup_info *startup_info;
 unsigned long pack_size_limit_cfg;
+unsigned long split_size_limit_cfg;
 
 /* Parallel index stat data preload? */
 int core_preload_index = 0;
diff --git a/split-chunk.c b/split-chunk.c
new file mode 100644
index 0000000..1b60e63
--- /dev/null
+++ b/split-chunk.c
@@ -0,0 +1,28 @@
+#include "git-compat-util.h"
+#include "bulk-checkin.h"
+
+/* Cut at around 512kB */
+#define TARGET_CHUNK_SIZE_LOG2 19
+#define TARGET_CHUNK_SIZE (1U << TARGET_CHUNK_SIZE_LOG2)
+
+/*
+ * Carve out around 512kB to be stored as a separate blob
+ */
+size_t carve_chunk(int fd, size_t size)
+{
+	size_t chunk_size;
+	off_t seekback = lseek(fd, 0, SEEK_CUR);
+
+	if (seekback == (off_t) -1)
+		die("cannot find the current offset");
+
+	/* Next patch will do something complex to find out where to cut */
+	chunk_size = size;
+	if (TARGET_CHUNK_SIZE < chunk_size)
+		chunk_size = TARGET_CHUNK_SIZE;
+
+	if (lseek(fd, seekback, SEEK_SET) == (off_t) -1)
+		die("cannot seek back");
+
+	return chunk_size;
+}
diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index 29d6024..d6cb66d 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -100,4 +100,33 @@ test_expect_success 'packsize limit' '
 	)
 '
 
+test_expect_success 'split limit' '
+	test_create_repo split &&
+	(
+		cd split &&
+		git config core.bigfilethreshold 2m &&
+		git config pack.splitsizelimit 1m &&
+
+		test-genrandom "a" $(( 4800 * 1024 )) >split &&
+		git add split &&
+
+		# This should result in a new chunked object "tail"
+		# that shares most of the component blobs in its
+		# early part with "split".
+		cat split >tail &&
+		echo cruft >>tail &&
+		git add tail &&
+
+		# This should result in a new chunked object "head"
+		# that begins with its own unique component blobs
+		# but quickly synchronizes and starts using the same
+		# component blobs with "split" and "tail", once we
+		# switch to a better chunking heuristics.
+		echo cruft >head &&
+		cat split >>head &&
+		git add head
+
+	)
+'
+
 test_done
-- 
1.7.8.rc4.177.g4d64


* [PATCH v3 2/6] varint-in-pack: refactor varint encoding/decoding
From: Junio C Hamano @ 2011-12-02  0:40 UTC
  To: git
In-Reply-To: <1322786449-25753-1-git-send-email-gitster@pobox.com>

Factor the encode_in_pack_varint() and decode_in_pack_varint() functions
out of the OFS_DELTA codepaths, to read and write variable-length integers
in the pack stream.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
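For reference, a standalone sketch of the encoding these two helpers
implement, small enough to experiment with outside the tree. The sketch_*
names and the roundtrips() helper are hypothetical, and the overflow test
is spelled out by hand here instead of using the MSB() macro, so treat it
as an illustration and not as the code being moved.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Encode value into buf (at least 16 bytes); return the byte count. */
static int sketch_encode_varint(uintmax_t value, unsigned char *buf)
{
	unsigned char varint[16];
	unsigned pos = sizeof(varint) - 1;

	/* The last byte has its MSB clear; earlier bytes have it set. */
	varint[pos] = value & 127;
	while (value >>= 7)
		varint[--pos] = 128 | (--value & 127);
	memcpy(buf, varint + pos, sizeof(varint) - pos);
	return sizeof(varint) - pos;
}

/*
 * Decode one varint and advance *bufp past it. On overflow, return 0
 * and leave *bufp untouched so the caller can tell it from a real 0.
 */
static uintmax_t sketch_decode_varint(const unsigned char **bufp)
{
	const unsigned char *buf = *bufp;
	unsigned char c = *buf++;
	uintmax_t val = c & 127;

	while (c & 128) {
		val += 1;
		if (!val || val > (UINTMAX_MAX >> 7))
			return 0; /* shifting by 7 would lose bits */
		c = *buf++;
		val = (val << 7) + (c & 127);
	}
	*bufp = buf;
	return val;
}

/* Hypothetical helper: encode, decode, and compare value and length. */
static int roundtrips(uintmax_t value)
{
	unsigned char buf[16];
	const unsigned char *p = buf;
	int len = sketch_encode_varint(value, buf);

	return sketch_decode_varint(&p) == value && p - buf == len;
}
```

The "--value" in the encoder is what makes each length cover a distinct
range: 127 is the largest one-byte value, 128 encodes as 0x80 0x00, and
16511 is the largest two-byte value.
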
 builtin/pack-objects.c |   28 +++++++++++++---------------
 pack-write.c           |   27 +++++++++++++++++++++++++++
 pack.h                 |    2 ++
 sha1_file.c            |   18 ++++++------------
 4 files changed, 48 insertions(+), 27 deletions(-)

diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index dde913e..72206a9 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -210,7 +210,7 @@ static unsigned long write_object(struct sha1file *f,
 {
 	unsigned long size, limit, datalen;
 	void *buf;
-	unsigned char header[10], dheader[10];
+	unsigned char header[10];
 	unsigned hdrlen;
 	enum object_type type;
 	int usable_delta, to_reuse;
@@ -304,17 +304,16 @@ static unsigned long write_object(struct sha1file *f,
 			 * base from this object's position in the pack.
 			 */
 			off_t ofs = entry->idx.offset - entry->delta->idx.offset;
-			unsigned pos = sizeof(dheader) - 1;
-			dheader[pos] = ofs & 127;
-			while (ofs >>= 7)
-				dheader[--pos] = 128 | (--ofs & 127);
-			if (limit && hdrlen + sizeof(dheader) - pos + datalen + 20 >= limit) {
+			unsigned char dheader[10];
+			unsigned pos = encode_in_pack_varint(ofs, dheader);
+
+			if (limit && hdrlen + pos + datalen + 20 >= limit) {
 				free(buf);
 				return 0;
 			}
 			sha1write(f, header, hdrlen);
-			sha1write(f, dheader + pos, sizeof(dheader) - pos);
-			hdrlen += sizeof(dheader) - pos;
+			sha1write(f, dheader, pos);
+			hdrlen += pos;
 		} else if (type == OBJ_REF_DELTA) {
 			/*
 			 * Deltas with a base reference contain
@@ -369,17 +368,16 @@ static unsigned long write_object(struct sha1file *f,
 
 		if (type == OBJ_OFS_DELTA) {
 			off_t ofs = entry->idx.offset - entry->delta->idx.offset;
-			unsigned pos = sizeof(dheader) - 1;
-			dheader[pos] = ofs & 127;
-			while (ofs >>= 7)
-				dheader[--pos] = 128 | (--ofs & 127);
-			if (limit && hdrlen + sizeof(dheader) - pos + datalen + 20 >= limit) {
+			unsigned char dheader[10];
+			unsigned pos = encode_in_pack_varint(ofs, dheader);
+
+			if (limit && hdrlen + pos + datalen + 20 >= limit) {
 				unuse_pack(&w_curs);
 				return 0;
 			}
 			sha1write(f, header, hdrlen);
-			sha1write(f, dheader + pos, sizeof(dheader) - pos);
-			hdrlen += sizeof(dheader) - pos;
+			sha1write(f, dheader, pos);
+			hdrlen += pos;
 			reused_delta++;
 		} else if (type == OBJ_REF_DELTA) {
 			if (limit && hdrlen + 20 + datalen + 20 >= limit) {
diff --git a/pack-write.c b/pack-write.c
index cadc3e1..5702cec 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -302,6 +302,33 @@ char *index_pack_lockfile(int ip_out)
 	return NULL;
 }
 
+uintmax_t decode_in_pack_varint(const unsigned char **bufp)
+{
+	const unsigned char *buf = *bufp;
+	unsigned char c = *buf++;
+	uintmax_t val = c & 127;
+	while (c & 128) {
+		val += 1;
+		if (!val || MSB(val, 7))
+			return 0; /* overflow */
+		c = *buf++;
+		val = (val << 7) + (c & 127);
+	}
+	*bufp = buf;
+	return val;
+}
+
+int encode_in_pack_varint(uintmax_t value, unsigned char *buf)
+{
+	unsigned char varint[16];
+	unsigned pos = sizeof(varint) - 1;
+	varint[pos] = value & 127;
+	while (value >>= 7)
+		varint[--pos] = 128 | (--value & 127);
+	memcpy(buf, varint + pos, sizeof(varint) - pos);
+	return sizeof(varint) - pos;
+}
+
 /*
  * The per-object header is a pretty dense thing, which is
  *  - first byte: low four bits are "size", then three bits of "type",
diff --git a/pack.h b/pack.h
index cfb0f69..d7dc6ca 100644
--- a/pack.h
+++ b/pack.h
@@ -79,6 +79,8 @@ extern off_t write_pack_header(struct sha1file *f, uint32_t);
 extern void fixup_pack_header_footer(int, unsigned char *, const char *, uint32_t, unsigned char *, off_t);
 extern char *index_pack_lockfile(int fd);
 extern int encode_in_pack_object_header(enum object_type, uintmax_t, unsigned char *);
+extern int encode_in_pack_varint(uintmax_t, unsigned char *);
+extern uintmax_t decode_in_pack_varint(const unsigned char **);
 
 #define PH_ERROR_EOF		(-1)
 #define PH_ERROR_PACK_SIGNATURE	(-2)
diff --git a/sha1_file.c b/sha1_file.c
index c96e366..f066c2b 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1484,20 +1484,14 @@ static off_t get_delta_base(struct packed_git *p,
 	 * is stupid, as then a REF_DELTA would be smaller to store.
 	 */
 	if (type == OBJ_OFS_DELTA) {
-		unsigned used = 0;
-		unsigned char c = base_info[used++];
-		base_offset = c & 127;
-		while (c & 128) {
-			base_offset += 1;
-			if (!base_offset || MSB(base_offset, 7))
-				return 0;  /* overflow */
-			c = base_info[used++];
-			base_offset = (base_offset << 7) + (c & 127);
-		}
-		base_offset = delta_obj_offset - base_offset;
+		const unsigned char *buf = base_info;
+		uintmax_t ofs = decode_in_pack_varint(&buf);
+		if (!ofs && buf == base_info)
+			return 0; /* overflow */
+		base_offset = delta_obj_offset - ofs;
 		if (base_offset <= 0 || base_offset >= delta_obj_offset)
 			return 0;  /* out of bound */
-		*curpos += used;
+		*curpos += buf - base_info;
 	} else if (type == OBJ_REF_DELTA) {
 		/* The base entry _must_ be in the same pack */
 		base_offset = find_pack_entry_one(base_info, p);
-- 
1.7.8.rc4.177.g4d64


* [PATCH v3 6/6] chunked-object: fallback checkout codepaths
From: Junio C Hamano @ 2011-12-02  0:40 UTC
  To: git
In-Reply-To: <1322786449-25753-1-git-send-email-gitster@pobox.com>

This prepares the default codepaths, which are built around the
read_sha1_file() API and its traditional "slurping everything in-core"
model, to handle objects that use chunked encoding. Needless to say, these
codepaths are unsuitable for the kind of objects that use chunked
encoding; they are intended to serve only as the fallback where
specialized "large object API" support is still lacking.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
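The fallback amounts to concatenating component blobs into one buffer
while checking that the sizes add up. A simplified sketch, assuming the
components are already in memory: struct sketch_chunk and
sketch_concat_chunks() are hypothetical names, plain malloc() stands in
for xmallocz(), errors return NULL instead of calling die(), and the final
"fell short" check is an addition of this sketch.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* One component blob, already read into memory (an assumption here). */
struct sketch_chunk {
	const unsigned char *data;
	size_t size;
};

/*
 * Concatenate n chunks into a freshly allocated buffer of exactly
 * size bytes plus a trailing NUL. Returns NULL if allocation fails
 * or if the chunk sizes do not add up to size.
 */
static unsigned char *sketch_concat_chunks(const struct sketch_chunk *chunks,
					   size_t n, size_t size)
{
	unsigned char *buffer = malloc(size + 1);
	size_t i, ofs = 0;

	if (!buffer)
		return NULL;
	for (i = 0; i < n; i++) {
		if (size - ofs < chunks[i].size) {
			free(buffer); /* would overrun the declared size */
			return NULL;
		}
		memcpy(buffer + ofs, chunks[i].data, chunks[i].size);
		ofs += chunks[i].size;
	}
	if (ofs != size) {
		free(buffer); /* extra check in this sketch: chunks fell short */
		return NULL;
	}
	buffer[size] = '\0';
	return buffer;
}
```

The caller owns the returned buffer and frees it. Note that the whole
object ends up in core, which is exactly why this path is meant only as a
fallback.
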
 sha1_file.c      |   54 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 t/t1050-large.sh |   14 +++++++++++++-
 2 files changed, 67 insertions(+), 1 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 14902cc..7355716 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1607,6 +1607,11 @@ static int packed_object_info(struct packed_git *p, off_t obj_offset,
 		if (sizep)
 			*sizep = size;
 		break;
+	case OBJ_CHUNKED_BLOB:
+		if (sizep)
+			*sizep = size;
+		type = OBJ_DEKNUHC(type);
+		break;
 	default:
 		error("unknown object type %i at offset %"PRIuMAX" in %s",
 		      type, (uintmax_t)obj_offset, p->pack_name);
@@ -1648,6 +1653,51 @@ static void *unpack_compressed_entry(struct packed_git *p,
 	return buffer;
 }
 
+static void *unpack_chunked_entry(struct packed_git *p,
+				  struct pack_window **w_curs,
+				  off_t curpos,
+				  unsigned long size)
+{
+	/*
+	 * *NOTE* *NOTE* *NOTE*
+	 *
+	 * In the longer term, we should aim to exercise this codepath
+	 * less and less often, as it defeats the whole purpose of
+	 * chunked object encoding!
+	 */
+	unsigned char *buffer;
+	const unsigned char *in, *ptr;
+	unsigned long avail, ofs;
+	int chunk_cnt;
+
+	buffer = xmallocz(size);
+	in = use_pack(p, w_curs, curpos, &avail);
+	ptr = in;
+	chunk_cnt = decode_in_pack_varint(&ptr);
+	curpos += ptr - in;
+	ofs = 0;
+	while (chunk_cnt--) {
+		unsigned long csize;
+		unsigned char *data;
+		enum object_type type;
+
+		in = use_pack(p, w_curs, curpos, &avail);
+		data = read_sha1_file(in, &type, &csize);
+		if (!data)
+			die("malformed chunked object contents ('%s' does not exist)",
+			    sha1_to_hex(in));
+		if (type != OBJ_BLOB)
+			die("malformed chunked object contents (not a blob)");
+		if (size < ofs + csize)
+			die("malformed chunked object contents (sizes do not add up)");
+		memcpy(buffer + ofs, data, csize);
+		ofs += csize;
+		curpos += 20;
+		free(data);
+	}
+	return buffer;
+}
+
 #define MAX_DELTA_CACHE (256)
 
 static size_t delta_base_cached;
@@ -1883,6 +1933,10 @@ void *unpack_entry(struct packed_git *p, off_t obj_offset,
 	case OBJ_TAG:
 		data = unpack_compressed_entry(p, &w_curs, curpos, *sizep);
 		break;
+	case OBJ_CHUNKED_BLOB:
+		data = unpack_chunked_entry(p, &w_curs, curpos, *sizep);
+		*type = OBJ_DEKNUHC(*type);
+		break;
 	default:
 		data = NULL;
 		error("unknown object type %i at offset %"PRIuMAX" in %s",
diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index d6cb66d..eea45d1 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -124,8 +124,20 @@ test_expect_success 'split limit' '
 		# switch to a better chunking heuristics.
 		echo cruft >head &&
 		cat split >>head &&
-		git add head
+		git add head &&
 
+		echo blob >expect &&
+		git cat-file -t :split >actual &&
+		test_cmp expect actual &&
+
+		git cat-file -p :split >actual &&
+		# You probably do not want to use test_cmp here...
+		cmp split actual &&
+
+		mv split expect &&
+		git checkout split &&
+		# You probably do not want to use test_cmp here...
+		cmp expect split
 	)
 '
 
-- 
1.7.8.rc4.177.g4d64


* [PATCH v3 3/6] new representation types in the packstream
From: Junio C Hamano @ 2011-12-02  0:40 UTC
  To: git
In-Reply-To: <1322786449-25753-1-git-send-email-gitster@pobox.com>

In addition to four basic types (commit, tree, blob and tag), the pack
stream can encode a few other "representation" types, such as REF_DELTA
and OFS_DELTA. As we allocate 3 bits in the first byte for this purpose,
we do not have much room to add new representation types in place, but we
do have one value reserved for future expansion.

When encoding a new representation type, the early part of the in-pack
object header is encoded as if its type were OBJ_EXT (= 5), in exactly the
same way as before. That is, the lower 4 bits of the first byte hold the
lowest 4 bits of the size information, the next 3 bits hold the type
information, and the MSB says whether the subsequent bytes encode higher
bits of the size information.

An in-pack object header that records OBJ_EXT as the type is followed by
an integer in the same variable-length encoding that is used for the
OFS_DELTA offset. This value is the real type of the representation minus 8
(as we do not need to use OBJ_EXT to encode types smaller than 8). Because
we do not foresee very many representation types, in practice this would be
a single byte with its MSB clear, representing types 8-135.

The code does not use type=8 and upwards for anything yet.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
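To make the byte layout described above concrete, here is a standalone
sketch of the header encoding and its inverse. The SK_* constants and
sketch_* functions are hypothetical stand-ins, and the sketch assumes the
extended type always fits in a single byte with its MSB clear (types
8-135), which is the common case the message describes but not the general
varint handling in the patch.

```c
#include <assert.h>
#include <stdint.h>

#define SK_OBJ_EXT 5u
#define SK_LAST_BASE_TYPE 7u

/* Encode type and size into hdr; return the number of bytes written. */
static int sketch_encode_header(unsigned type, uintmax_t size,
				unsigned char *hdr)
{
	unsigned char *base = hdr;
	unsigned header_type = type > SK_LAST_BASE_TYPE ? SK_OBJ_EXT : type;
	unsigned char c;

	assert(type >= 1 && type <= 135);
	/* First byte: 3 bits of type, low 4 bits of size. */
	c = (unsigned char)((header_type << 4) | (size & 15));
	size >>= 4;
	while (size) {
		*hdr++ = c | 0x80; /* MSB set: more size bits follow */
		c = size & 0x7f;
		size >>= 7;
	}
	*hdr++ = c;
	if (header_type != type)
		*hdr++ = (unsigned char)(type - (SK_LAST_BASE_TYPE + 1));
	return hdr - base;
}

/* Decode a header written by the function above; return bytes consumed. */
static int sketch_decode_header(const unsigned char *buf,
				unsigned *type, uintmax_t *size)
{
	const unsigned char *p = buf;
	unsigned char c = *p++;
	unsigned shift = 4;

	*type = (c >> 4) & 7;
	*size = c & 15;
	while (c & 0x80) {
		c = *p++;
		*size += (uintmax_t)(c & 0x7f) << shift;
		shift += 7;
	}
	if (*type == SK_OBJ_EXT) {
		assert(!(*p & 0x80)); /* single-byte extension only here */
		*type = *p++ + SK_LAST_BASE_TYPE + 1;
	}
	return p - buf;
}
```

For example, a blob (type 3) of size 5 is the single byte 0x35, while a
type-10 object of the same size becomes 0x55 0x02: OBJ_EXT in the type
bits, followed by 10 - 8.
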
 cache.h      |    4 +++-
 pack-write.c |   23 +++++++++++++++++------
 sha1_file.c  |   11 +++++++++++
 3 files changed, 31 insertions(+), 7 deletions(-)

diff --git a/cache.h b/cache.h
index 4f20861..4a3b421 100644
--- a/cache.h
+++ b/cache.h
@@ -381,12 +381,14 @@ enum object_type {
 	OBJ_TREE = 2,
 	OBJ_BLOB = 3,
 	OBJ_TAG = 4,
-	/* 5 for future expansion */
+	OBJ_EXT = 5,
 	OBJ_OFS_DELTA = 6,
 	OBJ_REF_DELTA = 7,
 	OBJ_ANY,
 	OBJ_MAX
 };
+#define OBJ_LAST_BASE_TYPE OBJ_REF_DELTA
+#define OBJ_LAST_VALID_TYPE OBJ_REF_DELTA

 static inline enum object_type object_type(unsigned int mode)
 {
diff --git a/pack-write.c b/pack-write.c
index 5702cec..9309dd1 100644
--- a/pack-write.c
+++ b/pack-write.c
@@ -338,22 +338,33 @@ int encode_in_pack_varint(uintmax_t value, unsigned char *buf)
  */
 int encode_in_pack_object_header(enum object_type type, uintmax_t size, unsigned char *hdr)
 {
-	int n = 1;
+	unsigned char *hdr_base;
 	unsigned char c;
+	enum object_type header_type;

-	if (type < OBJ_COMMIT || type > OBJ_REF_DELTA)
+	if (type < OBJ_COMMIT || OBJ_LAST_VALID_TYPE < type)
 		die("bad type %d", type);
+	else if (OBJ_LAST_BASE_TYPE < type)
+		header_type = OBJ_EXT;
+	else
+		header_type = type;

-	c = (type << 4) | (size & 15);
+	c = (header_type << 4) | (size & 15);
 	size >>= 4;
+	hdr_base = hdr;
 	while (size) {
 		*hdr++ = c | 0x80;
 		c = size & 0x7f;
 		size >>= 7;
-		n++;
 	}
-	*hdr = c;
-	return n;
+	*hdr++ = c;
+	if (header_type != type) {
+		int sz;
+		type = type - (OBJ_LAST_BASE_TYPE + 1);
+		sz = encode_in_pack_varint(type, hdr);
+		hdr += sz;
+	}
+	return hdr - hdr_base;
 }

 struct sha1file *create_tmp_packfile(char **pack_tmp_name)
diff --git a/sha1_file.c b/sha1_file.c
index f066c2b..14902cc 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1275,6 +1275,17 @@ unsigned long unpack_object_header_buffer(const unsigned char *buf,
 		shift += 7;
 	}
 	*sizep = size;
+	if (*type == OBJ_EXT) {
+		const unsigned char *p = buf + used;
+		uintmax_t val = decode_in_pack_varint(&p);
+
+		if (p == buf + used && !val) {
+			error("bad extended object type");
+			return 0;
+		}
+		*type = val + (OBJ_LAST_BASE_TYPE + 1);
+		used = p - buf;
+	}
 	return used;
 }

-- 
1.7.8.rc4.177.g4d64


* [PATCH v3 1/6] bulk-checkin: replace fast-import based implementation
From: Junio C Hamano @ 2011-12-02  0:40 UTC
  To: git
In-Reply-To: <1322786449-25753-1-git-send-email-gitster@pobox.com>

This extends the earlier approach to stream a large file directly from the
filesystem to its own packfile, and allows "git add" to send large files
directly into a single pack. Older code used to spawn fast-import, but the
new bulk-checkin API replaces it.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
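A toy model of the plug/unplug batching that "git add" relies on here. It
only counts objects and packs; struct sketch_state and the sketch_*
functions are hypothetical, and the assumption that unplugging flushes the
open pack is mine, inferred from how cmd_add() brackets its work with
plug_bulk_checkin() and unplug_bulk_checkin().

```c
#include <assert.h>

/* Toy model of the bulk-checkin state; counts objects and packs only. */
struct sketch_state {
	unsigned plugged;
	unsigned nr_written;	 /* objects in the currently open pack */
	unsigned packs_finished; /* packs closed so far */
};

/* Close the open pack, if it holds anything. */
static void sketch_finish(struct sketch_state *s)
{
	if (!s->nr_written)
		return;
	s->packs_finished++;
	s->nr_written = 0;
}

static void sketch_plug(struct sketch_state *s)
{
	assert(!s->plugged);
	s->plugged = 1;
}

static void sketch_unplug(struct sketch_state *s)
{
	s->plugged = 0;
	sketch_finish(s); /* assumption: unplugging flushes the open pack */
}

/* Check in one object; finish the pack right away unless plugged. */
static void sketch_checkin(struct sketch_state *s)
{
	s->nr_written++;
	if (!s->plugged)
		sketch_finish(s);
}
```

Without plugging, every checked-in object gets a pack of its own; between
a plug and an unplug, any number of objects end up in one pack.
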
 Makefile               |    2 +
 builtin/add.c          |    5 +
 builtin/pack-objects.c |    6 +-
 bulk-checkin.c         |  275 ++++++++++++++++++++++++++++++++++++++++++++++++
 bulk-checkin.h         |   16 +++
 cache.h                |    2 +
 config.c               |    4 +
 environment.c          |    1 +
 sha1_file.c            |   67 +-----------
 t/t1050-large.sh       |   94 +++++++++++++++--
 zlib.c                 |    9 ++-
 11 files changed, 403 insertions(+), 78 deletions(-)
 create mode 100644 bulk-checkin.c
 create mode 100644 bulk-checkin.h

diff --git a/Makefile b/Makefile
index 3139c19..418dd2e 100644
--- a/Makefile
+++ b/Makefile
@@ -505,6 +505,7 @@ LIB_H += argv-array.h
 LIB_H += attr.h
 LIB_H += blob.h
 LIB_H += builtin.h
+LIB_H += bulk-checkin.h
 LIB_H += cache.h
 LIB_H += cache-tree.h
 LIB_H += color.h
@@ -591,6 +592,7 @@ LIB_OBJS += base85.o
 LIB_OBJS += bisect.o
 LIB_OBJS += blob.o
 LIB_OBJS += branch.o
+LIB_OBJS += bulk-checkin.o
 LIB_OBJS += bundle.o
 LIB_OBJS += cache-tree.o
 LIB_OBJS += color.o
diff --git a/builtin/add.c b/builtin/add.c
index c59b0c9..1c42900 100644
--- a/builtin/add.c
+++ b/builtin/add.c
@@ -13,6 +13,7 @@
 #include "diff.h"
 #include "diffcore.h"
 #include "revision.h"
+#include "bulk-checkin.h"
 
 static const char * const builtin_add_usage[] = {
 	"git add [options] [--] <filepattern>...",
@@ -458,11 +459,15 @@ int cmd_add(int argc, const char **argv, const char *prefix)
 		free(seen);
 	}
 
+	plug_bulk_checkin();
+
 	exit_status |= add_files_to_cache(prefix, pathspec, flags);
 
 	if (add_new_files)
 		exit_status |= add_files(&dir, flags);
 
+	unplug_bulk_checkin();
+
  finish:
 	if (active_cache_changed) {
 		if (write_cache(newfd, active_cache, active_nr) ||
diff --git a/builtin/pack-objects.c b/builtin/pack-objects.c
index b458b6d..dde913e 100644
--- a/builtin/pack-objects.c
+++ b/builtin/pack-objects.c
@@ -76,7 +76,7 @@ static struct pack_idx_option pack_idx_opts;
 static const char *base_name;
 static int progress = 1;
 static int window = 10;
-static unsigned long pack_size_limit, pack_size_limit_cfg;
+static unsigned long pack_size_limit;
 static int depth = 50;
 static int delta_search_threads;
 static int pack_to_stdout;
@@ -2009,10 +2009,6 @@ static int git_pack_config(const char *k, const char *v, void *cb)
 			    pack_idx_opts.version);
 		return 0;
 	}
-	if (!strcmp(k, "pack.packsizelimit")) {
-		pack_size_limit_cfg = git_config_ulong(k, v);
-		return 0;
-	}
 	return git_default_config(k, v, cb);
 }
 
diff --git a/bulk-checkin.c b/bulk-checkin.c
new file mode 100644
index 0000000..6b0b6d4
--- /dev/null
+++ b/bulk-checkin.c
@@ -0,0 +1,275 @@
+/*
+ * Copyright (c) 2011, Google Inc.
+ */
+#include "bulk-checkin.h"
+#include "csum-file.h"
+#include "pack.h"
+
+static int pack_compression_level = Z_DEFAULT_COMPRESSION;
+
+static struct bulk_checkin_state {
+	unsigned plugged:1;
+
+	char *pack_tmp_name;
+	struct sha1file *f;
+	off_t offset;
+	struct pack_idx_option pack_idx_opts;
+
+	struct pack_idx_entry **written;
+	uint32_t alloc_written;
+	uint32_t nr_written;
+} state;
+
+static void finish_bulk_checkin(struct bulk_checkin_state *state)
+{
+	unsigned char sha1[20];
+	char packname[PATH_MAX];
+	int i;
+
+	if (!state->f)
+		return;
+
+	if (state->nr_written == 0) {
+		close(state->f->fd);
+		unlink(state->pack_tmp_name);
+		goto clear_exit;
+	} else if (state->nr_written == 1) {
+		sha1close(state->f, sha1, CSUM_FSYNC);
+	} else {
+		int fd = sha1close(state->f, sha1, 0);
+		fixup_pack_header_footer(fd, sha1, state->pack_tmp_name,
+					 state->nr_written, sha1,
+					 state->offset);
+		close(fd);
+	}
+
+	sprintf(packname, "%s/pack/pack-", get_object_directory());
+	finish_tmp_packfile(packname, state->pack_tmp_name,
+			    state->written, state->nr_written,
+			    &state->pack_idx_opts, sha1);
+	for (i = 0; i < state->nr_written; i++)
+		free(state->written[i]);
+
+clear_exit:
+	free(state->written);
+	memset(state, 0, sizeof(*state));
+
+	/* Make objects we just wrote available to ourselves */
+	reprepare_packed_git();
+}
+
+static int already_written(struct bulk_checkin_state *state, unsigned char sha1[])
+{
+	int i;
+
+	/* The object may already exist in the repository */
+	if (has_sha1_file(sha1))
+		return 1;
+
+	/* Might want to keep the list sorted */
+	for (i = 0; i < state->nr_written; i++)
+		if (!hashcmp(state->written[i]->sha1, sha1))
+			return 1;
+
+	/* This is a new object we need to keep */
+	return 0;
+}
+
+/*
+ * Read the contents from fd for size bytes, streaming it to the
+ * packfile in state while updating the hash in ctx. Signal a failure
+ * by returning a negative value when the resulting pack would exceed
+ * the pack size limit and this is not the first object in the pack,
+ * so that the caller can discard what we wrote from the current pack
+ * by truncating it and opening a new one. The caller will then call
+ * us again after rewinding the input fd.
+ *
+ * The already_hashed_to pointer is kept untouched by the caller to
+ * make sure we do not hash the same byte when we are called
+ * again. This way, the caller does not have to checkpoint its hash
+ * status before calling us just in case we ask it to call us again
+ * with a new pack.
+ */
+static int stream_to_pack(struct bulk_checkin_state *state,
+			  git_SHA_CTX *ctx, off_t *already_hashed_to,
+			  int fd, size_t size, enum object_type type,
+			  const char *path, unsigned flags)
+{
+	git_zstream s;
+	unsigned char obuf[16384];
+	unsigned hdrlen;
+	int status = Z_OK;
+	int write_object = (flags & HASH_WRITE_OBJECT);
+	off_t offset = 0;
+
+	memset(&s, 0, sizeof(s));
+	git_deflate_init(&s, pack_compression_level);
+
+	hdrlen = encode_in_pack_object_header(type, size, obuf);
+	s.next_out = obuf + hdrlen;
+	s.avail_out = sizeof(obuf) - hdrlen;
+
+	while (status != Z_STREAM_END) {
+		unsigned char ibuf[16384];
+
+		if (size && !s.avail_in) {
+			ssize_t rsize = size < sizeof(ibuf) ? size : sizeof(ibuf);
+			if (xread(fd, ibuf, rsize) != rsize)
+				die("failed to read %d bytes from '%s'",
+				    (int)rsize, path);
+			offset += rsize;
+			if (*already_hashed_to < offset) {
+				size_t hsize = offset - *already_hashed_to;
+				if (rsize < hsize)
+					hsize = rsize;
+				if (hsize)
+					git_SHA1_Update(ctx, ibuf, hsize);
+				*already_hashed_to = offset;
+			}
+			s.next_in = ibuf;
+			s.avail_in = rsize;
+			size -= rsize;
+		}
+
+		status = git_deflate(&s, size ? 0 : Z_FINISH);
+
+		if (!s.avail_out || status == Z_STREAM_END) {
+			if (write_object) {
+				size_t written = s.next_out - obuf;
+
+				/* would we bust the size limit? */
+				if (state->nr_written &&
+				    pack_size_limit_cfg &&
+				    pack_size_limit_cfg < state->offset + written) {
+					git_deflate_abort(&s);
+					return -1;
+				}
+
+				sha1write(state->f, obuf, written);
+				state->offset += written;
+			}
+			s.next_out = obuf;
+			s.avail_out = sizeof(obuf);
+		}
+
+		switch (status) {
+		case Z_OK:
+		case Z_BUF_ERROR:
+		case Z_STREAM_END:
+			continue;
+		default:
+			die("unexpected deflate failure: %d", status);
+		}
+	}
+	git_deflate_end(&s);
+	return 0;
+}
+
+/* Lazily create backing packfile for the state */
+static void prepare_to_stream(struct bulk_checkin_state *state,
+			      unsigned flags)
+{
+	if (!(flags & HASH_WRITE_OBJECT) || state->f)
+		return;
+
+	state->f = create_tmp_packfile(&state->pack_tmp_name);
+	reset_pack_idx_option(&state->pack_idx_opts);
+
+	/* Pretend we are going to write only one object */
+	state->offset = write_pack_header(state->f, 1);
+	if (!state->offset)
+		die_errno("unable to write pack header");
+}
+
+static int deflate_to_pack(struct bulk_checkin_state *state,
+			   unsigned char result_sha1[],
+			   int fd, size_t size,
+			   enum object_type type, const char *path,
+			   unsigned flags)
+{
+	off_t seekback, already_hashed_to;
+	git_SHA_CTX ctx;
+	unsigned char obuf[16384];
+	unsigned header_len;
+	struct sha1file_checkpoint checkpoint;
+	struct pack_idx_entry *idx = NULL;
+
+	seekback = lseek(fd, 0, SEEK_CUR);
+	if (seekback == (off_t) -1)
+		return error("cannot find the current offset");
+
+	header_len = sprintf((char *)obuf, "%s %" PRIuMAX,
+			     typename(type), (uintmax_t)size) + 1;
+	git_SHA1_Init(&ctx);
+	git_SHA1_Update(&ctx, obuf, header_len);
+
+	/* Note: idx is non-NULL when we are writing */
+	if ((flags & HASH_WRITE_OBJECT) != 0)
+		idx = xcalloc(1, sizeof(*idx));
+
+	already_hashed_to = 0;
+
+	while (1) {
+		prepare_to_stream(state, flags);
+		if (idx) {
+			sha1file_checkpoint(state->f, &checkpoint);
+			idx->offset = state->offset;
+			crc32_begin(state->f);
+		}
+		if (!stream_to_pack(state, &ctx, &already_hashed_to,
+				    fd, size, type, path, flags))
+			break;
+		/*
+		 * Writing this object to the current pack will make
+		 * it too big; we need to truncate it, start a new
+		 * pack, and write into it.
+		 */
+		if (!idx)
+			die("BUG: should not happen");
+		sha1file_truncate(state->f, &checkpoint);
+		state->offset = checkpoint.offset;
+		finish_bulk_checkin(state);
+		if (lseek(fd, seekback, SEEK_SET) == (off_t) -1)
+			return error("cannot seek back");
+	}
+	git_SHA1_Final(result_sha1, &ctx);
+	if (!idx)
+		return 0;
+
+	idx->crc32 = crc32_end(state->f);
+	if (already_written(state, result_sha1)) {
+		sha1file_truncate(state->f, &checkpoint);
+		state->offset = checkpoint.offset;
+		free(idx);
+	} else {
+		hashcpy(idx->sha1, result_sha1);
+		ALLOC_GROW(state->written,
+			   state->nr_written + 1,
+			   state->alloc_written);
+		state->written[state->nr_written++] = idx;
+	}
+	return 0;
+}
+
+int index_bulk_checkin(unsigned char *sha1,
+		       int fd, size_t size, enum object_type type,
+		       const char *path, unsigned flags)
+{
+	int status = deflate_to_pack(&state, sha1, fd, size, type,
+				     path, flags);
+	if (!state.plugged)
+		finish_bulk_checkin(&state);
+	return status;
+}
+
+void plug_bulk_checkin(void)
+{
+	state.plugged = 1;
+}
+
+void unplug_bulk_checkin(void)
+{
+	state.plugged = 0;
+	if (state.f)
+		finish_bulk_checkin(&state);
+}
diff --git a/bulk-checkin.h b/bulk-checkin.h
new file mode 100644
index 0000000..4f599f8
--- /dev/null
+++ b/bulk-checkin.h
@@ -0,0 +1,16 @@
+/*
+ * Copyright (c) 2011, Google Inc.
+ */
+#ifndef BULK_CHECKIN_H
+#define BULK_CHECKIN_H
+
+#include "cache.h"
+
+extern int index_bulk_checkin(unsigned char sha1[],
+			      int fd, size_t size, enum object_type type,
+			      const char *path, unsigned flags);
+
+extern void plug_bulk_checkin(void);
+extern void unplug_bulk_checkin(void);
+
+#endif
diff --git a/cache.h b/cache.h
index 2e6ad36..4f20861 100644
--- a/cache.h
+++ b/cache.h
@@ -35,6 +35,7 @@ int git_inflate(git_zstream *, int flush);
 void git_deflate_init(git_zstream *, int level);
 void git_deflate_init_gzip(git_zstream *, int level);
 void git_deflate_end(git_zstream *);
+int git_deflate_abort(git_zstream *);
 int git_deflate_end_gently(git_zstream *);
 int git_deflate(git_zstream *, int flush);
 unsigned long git_deflate_bound(git_zstream *, unsigned long);
@@ -598,6 +599,7 @@ extern size_t packed_git_window_size;
 extern size_t packed_git_limit;
 extern size_t delta_base_cache_limit;
 extern unsigned long big_file_threshold;
+extern unsigned long pack_size_limit_cfg;
 extern int read_replace_refs;
 extern int fsync_object_files;
 extern int core_preload_index;
diff --git a/config.c b/config.c
index edf9914..c736802 100644
--- a/config.c
+++ b/config.c
@@ -797,6 +797,10 @@ int git_default_config(const char *var, const char *value, void *dummy)
 		return 0;
 	}
 
+	if (!strcmp(var, "pack.packsizelimit")) {
+		pack_size_limit_cfg = git_config_ulong(var, value);
+		return 0;
+	}
 	/* Add other config variables here and to Documentation/config.txt. */
 	return 0;
 }
diff --git a/environment.c b/environment.c
index 0bee6a7..31e4284 100644
--- a/environment.c
+++ b/environment.c
@@ -60,6 +60,7 @@ char *notes_ref_name;
 int grafts_replace_parents = 1;
 int core_apply_sparse_checkout;
 struct startup_info *startup_info;
+unsigned long pack_size_limit_cfg;
 
 /* Parallel index stat data preload? */
 int core_preload_index = 0;
diff --git a/sha1_file.c b/sha1_file.c
index 27f3b9b..c96e366 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -18,6 +18,7 @@
 #include "refs.h"
 #include "pack-revindex.h"
 #include "sha1-lookup.h"
+#include "bulk-checkin.h"
 
 #ifndef O_NOATIME
 #if defined(__linux__) && (defined(__i386__) || defined(__PPC__))
@@ -2679,10 +2680,8 @@ static int index_core(unsigned char *sha1, int fd, size_t size,
 }
 
 /*
- * This creates one packfile per large blob, because the caller
- * immediately wants the result sha1, and fast-import can report the
- * object name via marks mechanism only by closing the created
- * packfile.
+ * This creates one packfile per large blob unless bulk-checkin
+ * machinery is "plugged".
  *
  * This also bypasses the usual "convert-to-git" dance, and that is on
  * purpose. We could write a streaming version of the converting
@@ -2696,65 +2695,7 @@ static int index_stream(unsigned char *sha1, int fd, size_t size,
 			enum object_type type, const char *path,
 			unsigned flags)
 {
-	struct child_process fast_import;
-	char export_marks[512];
-	const char *argv[] = { "fast-import", "--quiet", export_marks, NULL };
-	char tmpfile[512];
-	char fast_import_cmd[512];
-	char buf[512];
-	int len, tmpfd;
-
-	strcpy(tmpfile, git_path("hashstream_XXXXXX"));
-	tmpfd = git_mkstemp_mode(tmpfile, 0600);
-	if (tmpfd < 0)
-		die_errno("cannot create tempfile: %s", tmpfile);
-	if (close(tmpfd))
-		die_errno("cannot close tempfile: %s", tmpfile);
-	sprintf(export_marks, "--export-marks=%s", tmpfile);
-
-	memset(&fast_import, 0, sizeof(fast_import));
-	fast_import.in = -1;
-	fast_import.argv = argv;
-	fast_import.git_cmd = 1;
-	if (start_command(&fast_import))
-		die_errno("index-stream: git fast-import failed");
-
-	len = sprintf(fast_import_cmd, "blob\nmark :1\ndata %lu\n",
-		      (unsigned long) size);
-	write_or_whine(fast_import.in, fast_import_cmd, len,
-		       "index-stream: feeding fast-import");
-	while (size) {
-		char buf[10240];
-		size_t sz = size < sizeof(buf) ? size : sizeof(buf);
-		ssize_t actual;
-
-		actual = read_in_full(fd, buf, sz);
-		if (actual < 0)
-			die_errno("index-stream: reading input");
-		if (write_in_full(fast_import.in, buf, actual) != actual)
-			die_errno("index-stream: feeding fast-import");
-		size -= actual;
-	}
-	if (close(fast_import.in))
-		die_errno("index-stream: closing fast-import");
-	if (finish_command(&fast_import))
-		die_errno("index-stream: finishing fast-import");
-
-	tmpfd = open(tmpfile, O_RDONLY);
-	if (tmpfd < 0)
-		die_errno("index-stream: cannot open fast-import mark");
-	len = read(tmpfd, buf, sizeof(buf));
-	if (len < 0)
-		die_errno("index-stream: reading fast-import mark");
-	if (close(tmpfd) < 0)
-		die_errno("index-stream: closing fast-import mark");
-	if (unlink(tmpfile))
-		die_errno("index-stream: unlinking fast-import mark");
-	if (len != 44 ||
-	    memcmp(":1 ", buf, 3) ||
-	    get_sha1_hex(buf + 3, sha1))
-		die_errno("index-stream: unexpected fast-import mark: <%s>", buf);
-	return 0;
+	return index_bulk_checkin(sha1, fd, size, type, path, flags);
 }
 
 int index_fd(unsigned char *sha1, int fd, struct stat *st,
diff --git a/t/t1050-large.sh b/t/t1050-large.sh
index deba111..29d6024 100755
--- a/t/t1050-large.sh
+++ b/t/t1050-large.sh
@@ -7,21 +7,97 @@ test_description='adding and checking out large blobs'
 
 test_expect_success setup '
 	git config core.bigfilethreshold 200k &&
-	echo X | dd of=large bs=1k seek=2000
+	echo X | dd of=large1 bs=1k seek=2000 &&
+	echo X | dd of=large2 bs=1k seek=2000 &&
+	echo X | dd of=large3 bs=1k seek=2000 &&
+	echo Y | dd of=huge bs=1k seek=2500
 '
 
-test_expect_success 'add a large file' '
-	git add large &&
-	# make sure we got a packfile and no loose objects
-	test -f .git/objects/pack/pack-*.pack &&
-	test ! -f .git/objects/??/??????????????????????????????????????
+test_expect_success 'add a large file or two' '
+	git add large1 huge large2 &&
+	# make sure we got a single packfile and no loose objects
+	bad= count=0 idx= &&
+	for p in .git/objects/pack/pack-*.pack
+	do
+		count=$(( $count + 1 ))
+		if test -f "$p" && idx=${p%.pack}.idx && test -f "$idx"
+		then
+			continue
+		fi
+		bad=t
+	done &&
+	test -z "$bad" &&
+	test $count = 1 &&
+	cnt=$(git show-index <"$idx" | wc -l) &&
+	test $cnt = 2 &&
+	for l in .git/objects/??/??????????????????????????????????????
+	do
+		test -f "$l" || continue
+		bad=t
+	done &&
+	test -z "$bad" &&
+
+	# attempt to add another copy of the same
+	git add large3 &&
+	bad= count=0 &&
+	for p in .git/objects/pack/pack-*.pack
+	do
+		count=$(( $count + 1 ))
+		if test -f "$p" && idx=${p%.pack}.idx && test -f "$idx"
+		then
+			continue
+		fi
+		bad=t
+	done &&
+	test -z "$bad" &&
+	test $count = 1
 '
 
 test_expect_success 'checkout a large file' '
-	large=$(git rev-parse :large) &&
-	git update-index --add --cacheinfo 100644 $large another &&
+	large1=$(git rev-parse :large1) &&
+	git update-index --add --cacheinfo 100644 $large1 another &&
 	git checkout another &&
-	cmp large another ;# this must not be test_cmp
+	cmp large1 another ;# this must not be test_cmp
+'
+
+test_expect_success 'packsize limit' '
+	test_create_repo mid &&
+	(
+		cd mid &&
+		git config core.bigfilethreshold 64k &&
+		git config pack.packsizelimit 256k &&
+
+		# mid1 and mid2 will fit within 256k limit but
+		# appending mid3 will bust the limit and will
+		# result in a separate packfile.
+		test-genrandom "a" $(( 66 * 1024 )) >mid1 &&
+		test-genrandom "b" $(( 80 * 1024 )) >mid2 &&
+		test-genrandom "c" $(( 128 * 1024 )) >mid3 &&
+		git add mid1 mid2 mid3 &&
+
+		count=0
+		for pi in .git/objects/pack/pack-*.idx
+		do
+			test -f "$pi" && count=$(( $count + 1 ))
+		done &&
+		test $count = 2 &&
+
+		(
+			git hash-object --stdin <mid1
+			git hash-object --stdin <mid2
+			git hash-object --stdin <mid3
+		) |
+		sort >expect &&
+
+		for pi in .git/objects/pack/pack-*.idx
+		do
+			git show-index <"$pi"
+		done |
+		sed -e "s/^[0-9]* \([0-9a-f]*\) .*/\1/" |
+		sort >actual &&
+
+		test_cmp expect actual
+	)
 '
 
 test_done
diff --git a/zlib.c b/zlib.c
index 3c63d48..2b2c0c7 100644
--- a/zlib.c
+++ b/zlib.c
@@ -188,13 +188,20 @@ void git_deflate_init_gzip(git_zstream *strm, int level)
 	    strm->z.msg ? strm->z.msg : "no message");
 }
 
-void git_deflate_end(git_zstream *strm)
+int git_deflate_abort(git_zstream *strm)
 {
 	int status;
 
 	zlib_pre_call(strm);
 	status = deflateEnd(&strm->z);
 	zlib_post_call(strm);
+	return status;
+}
+
+void git_deflate_end(git_zstream *strm)
+{
+	int status = git_deflate_abort(strm);
+
 	if (status == Z_OK)
 		return;
 	error("deflateEnd: %s (%s)", zerr_to_string(status),
-- 
1.7.8.rc4.177.g4d64

* [PATCH v3 0/6] Bulk check-in
From: Junio C Hamano @ 2011-12-02  0:40 UTC (permalink / raw)
  To: git
In-Reply-To: <1322699263-14475-6-git-send-email-gitster@pobox.com>

I would declare that the earlier parts of the v2 that are about factoring
out various API pieces from existing code are basically completed, so they
are not part of this iteration.

The bulk-checkin patch from v2 has been tweaked a bit (deflate_to_pack()
initializes "already_hashed_to" pointer to 0, instead of the current file
position "seekback"), and then the rest of the series builds on top of it
to add a new in-pack encoding that I am tentatively calling "chunked".

The basic idea is to represent a large/huge blob as a concatenation of
smaller blobs. An entry in a pack in "chunked" representation records a
list of object names of the component blob objects.  The object name given
to such a blob is computed exactly the same way as before. In other words,
the name of an object does not depend on its representation; we hash "blob
<size> NUL" and the whole large blob contents to come up with its name. It
is *not* the hash of the component blob object names.
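
As a rough illustration of the naming rule described above (a sketch only,
not code from this series): the helper below hashes the "blob <size> NUL"
header followed by the whole contents. The name blob_name is hypothetical,
and the sketch assumes sha1sum from GNU coreutils is on PATH.

```shell
# blob_name is a hypothetical helper, not a git command.
# It prints the object name of a blob holding the contents of file "$1":
# the SHA-1 of the header "blob <size> NUL" followed by the full contents.
# Assumes sha1sum (GNU coreutils) is available.
blob_name () {
	# $(( ... )) strips the padding some wc implementations emit
	size=$(( $(wc -c <"$1") ))
	{
		printf 'blob %d\000' "$size" &&
		cat "$1"
	} | sha1sum | cut -d' ' -f1
}
```

Splitting a file into pieces and concatenating them back before calling
blob_name should give the same name as hashing the original file, which is
the property the "chunked" representation relies on. The result is also
expected to agree with "git hash-object <file>".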

As can be seen in the log message of the "support chunked-object encoding"
patch, many pieces are still missing from this series and filling them
will be a long and tortuous journey. But we need to start somewhere.

I specifically excluded any heuristics to split large objects into chunks
in a self-synchronising way so that a small edit near the beginning of a
large blob results in a handful of new component blobs followed by the
same component blobs as used to represent the same blob before such an
edit, and I do not plan to work on that part myself. My impression from
listening to Avery's plug for "bup" is that it is a solved problem; it should
be reasonably straightforward to lift the logic and plug it into the
framework presented here (once the codebase gets solid enough, that is).

After this series, the next step for me is likely to teach the streaming
interface about "chunked" objects, and then pack-objects to take notice
and reuse "chunked" representation when sending things out (which means
that sending a "chunked" encoded blob would involve sending the component
blobs it uses, among other things), but I expect that it will extend well
into next year.

Junio C Hamano (6):
  bulk-checkin: replace fast-import based implementation
  varint-in-pack: refactor varint encoding/decoding
  new representation types in the packstream
  bulk-checkin: allow the same data to be multiply hashed
  bulk-checkin: support chunked-object encoding
  chunked-object: fallback checkout codepaths

 Makefile               |    3 +
 builtin/add.c          |    5 +
 builtin/pack-objects.c |   34 ++---
 bulk-checkin.c         |  415 ++++++++++++++++++++++++++++++++++++++++++++++++
 bulk-checkin.h         |   17 ++
 cache.h                |   13 ++-
 config.c               |    9 +
 environment.c          |    2 +
 pack-write.c           |   50 +++++-
 pack.h                 |    2 +
 sha1_file.c            |  150 +++++++++---------
 split-chunk.c          |   28 ++++
 t/t1050-large.sh       |  135 +++++++++++++++-
 zlib.c                 |    9 +-
 14 files changed, 760 insertions(+), 112 deletions(-)
 create mode 100644 bulk-checkin.c
 create mode 100644 bulk-checkin.h
 create mode 100644 split-chunk.c

-- 
1.7.8.rc4.177.g4d64

* Git Install link is broken
From: Graham Wideman @ 2011-12-01 23:56 UTC (permalink / raw)
  To: git

Hi folks:

On this page:
http://code.google.com/p/msysgit/

the links to "install msysGit" point to:
https://git.wiki.kernel.org/index.php/MSysGit:InstallMSysGit

.. which returns a page not found error.

-- Graham

* Re: clean bug on ignored subdirectories with no tracked files?
From: Phil Hord @ 2011-12-01 23:35 UTC (permalink / raw)
  To: Phil Hord, Jay Soffian; +Cc: Junio C Hamano, git
In-Reply-To: <CABURp0qv7MB-ZQvvSZQi43nAy1ZaR75-19T2Sd1JBT14Y_dG7w@mail.gmail.com>


On Mon, Nov 21, 2011 at 3:03 PM, Jay Soffian <jaysoffian@gmail.com> wrote:
> On Mon, Nov 21, 2011 at 2:28 PM, Junio C Hamano <gitster@pobox.com> wrote:
>> "clean" without "-x" is meant to preserve untracked but expendable paths
>> (e.g. build products), so if something is removed that is untracked but
>> matches the ignore pattern, then that is a bug to be fixed.  Care to roll
>> a patch to fix it?
> Okay, just confirming it is a bug. I'll add this e-mail to my todo
> list, but I don't have time for a patch anytime soon. 

I think the fix is in dir.c (treat_directory), but I'm not sure how yet.

How's this for starters?

-- >8 --
Subject: [PATCH] clean: Test known breakage of .gitignore and -d

git-clean -d is used to remove untracked directories.  If the
directory still contains .gitignored files it should not be removed.
But git is broken here if neither the ignored files nor the directory
are explicitly ignored.

Document this known breakage with a test for both cases.

Noticed-by: Jay Soffian <jaysoffian@gmail.com>
Signed-off-by: Phil Hord <hordp@cisco.com>
---
 t/t7300-clean.sh |   15 +++++++++++++++
 1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/t/t7300-clean.sh b/t/t7300-clean.sh
index 800b536..e29e383 100755
--- a/t/t7300-clean.sh
+++ b/t/t7300-clean.sh
@@ -14,6 +14,7 @@ test_expect_success 'setup' '
 	mkdir -p src &&
 	touch src/part1.c Makefile &&
 	echo build >.gitignore &&
+	echo /foo/bar >>.gitignore &&
 	echo \*.o >>.gitignore &&
 	git add . &&
 	git commit -m setup &&
@@ -264,6 +265,20 @@ test_expect_success 'git clean -d src/ examples/' '
 
 '
 
+test_expect_failure 'git clean -d leaves .gitignored files alone' '
+
+	mkdir -p objs foo/bar &&
+	touch objs/foo.o &&
+	touch foo/gone &&
+	touch foo/bar/baz &&
+	git clean -nd &&
+	git clean -d &&
+	test ! -f foo/gone &&
+	test -f foo/bar/baz &&
+	test -f objs/foo.o
+
+'
+
 test_expect_success 'git clean -x' '
 
 	mkdir -p build docs &&
-- 
1.7.8.rc4

* Re: [HELP] Adding git awareness to the darning patch management system.
From: Jeff King @ 2011-12-01 23:40 UTC (permalink / raw)
  To: Peter Williams; +Cc: git
In-Reply-To: <4ED80EA2.80805@users.sourceforge.net>

On Fri, Dec 02, 2011 at 09:32:50AM +1000, Peter Williams wrote:

> Yes, I think you're right.  For most of my purposes, I think that it's
> irrelevant whether a change is staged or not and the choices that I
> offer allow the user to do what he thinks is right for a file with
> changes that are staged but uncommitted.  For me to automatically do
> something based on whether the file was staged for a commit would be
> a mistake as I would be reducing the user's options.
> 
> However, the distinction might be worth making in the file tree
> display to remind the user what's staged and what's not?

I'd personally start with ignoring the distinction, and then wait for
some enterprising user to suggest how it would be marked. That takes the
burden off of you for guessing what representation would be best.

> As an aside, I found it easier to delve into git's innards to find
> out how to implement git binary patches than I did finding out how to
> do things from the CLI :-).

Heh. I think that is because the CLI is written by people (myself
included) who want it to give them access to git's innards. ;)

> >In your case, I think "status" is the most convenient level of
> >abstraction for you, because you are interested in looking at
> >differences to both the index and HEAD (i.e., the prior commit). But if
> >you find as you implement that you want more flexibility, you can switch to
> >using the lower-level commands yourself.
> 
> I'll investigate this approach.  How easy is it to distinguish low
> level commands from high level commands?

git(1) has a list of "porcelain" and "plumbing" commands. If you are
scripting, you generally want to stick with plumbing commands, the
lower-level ones whose output and behavior remain stable across releases.

However, some porcelain commands offer a "plumbing" mode (despite the
name, the "--porcelain" flag to some commands is what you want; it is
about offering parseable output that a custom porcelain could use. In
git lingo, your interface is one such porcelain).
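
A minimal sketch of consuming that output, assuming the short "XY path"
line format that "git status --porcelain" prints (see "git help status").
The function name summarize_status and the three state labels are made up
for illustration; renames ("old -> new") and quoted paths are not handled,
and the "-z" option would be the more robust input for a real tool.

```shell
# summarize_status is a hypothetical helper.  It reads
# "git status --porcelain" lines on stdin and prints "<state> <path>",
# folding staged and unstaged changes into a single "modified" state.
summarize_status () {
	while IFS= read -r line
	do
		xy=${line%"${line#??}"}	# the two status characters
		path=${line#???}	# everything after "XY "
		case "$xy" in
		'??') echo "untracked $path" ;;
		'!!') echo "ignored $path" ;;
		*)    echo "modified $path" ;;
		esac
	done
}
```

It would be fed with something like
"git status --porcelain --ignored | summarize_status".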

-Peff

* Re: [HELP] Adding git awareness to the darning patch management system.
From: Peter Williams @ 2011-12-01 23:32 UTC (permalink / raw)
  To: Jeff King; +Cc: git
In-Reply-To: <20111201062733.GB22141@sigill.intra.peff.net>

On 01/12/11 16:27, Jeff King wrote:
> On Thu, Dec 01, 2011 at 10:56:59AM +1000, Peter Williams wrote:
>
>>> I'm not exactly sure what this means.
>>
>> If you look at the screenshots at sourceforge (which were produced on
>> top of a Mercurial repo) you'll notice that file names in the left
>> most tree have letters in front of them and appear in different
>> foreground colours.  These letters are the same as those returned by
>> Mercurial's status command and, hence, give a Mercurial user an easy
>> to understand snapshot of the status of the files in the playground.
>> The colour coding is (relatively) arbitrary (and chosen by me) and is
>> intended to make it easier to detect the different file statuses.
>>
>> My main problem is that I can't find a git file status command (and
>> there are a lot of them to choose from) that gives a snapshot of the
>> statuses of all files in a directory (including those not tracked or
>> ignored).
>
> Thanks, that helps. You probably just want to use "git status
> --porcelain", which will show you the state of file modification with
> respect to the index and the prior commit, as well as any untracked
> files. See the "porcelain format" section in "git help status".
>
> Note that "git status" will not print files which are not modified. You
> may want to also run "git ls-files" to get the full listing of files,
> including unmodified ones.
>
>> A secondary problem is that, if I could cobble together statuses from
>> various commands, mapping git statuses to the Mercurial ones for
>> display would not be a good solution as they would not necessarily
>> make sense to a git user.  (It's fairly clear to me from my inability
>> to make sense of git's CLI that git users think differently to me, a
>> Mercurial user, and it's unlikely that I can, without help, make a
>> file tree display that makes sense to a git user.)
>
> I'm hoping that "git status --porcelain" will give you a fairly close
> mapping of the basic "what happened to this file" concept, based on what
> I see in the second screenshot you mentioned.
>
> The trickiest thing is the index, which represents an in-between state
> that is not usually exposed by other version control systems. If your
> tool does not make use of the index, then it probably makes sense to
> just consider a path as modified if it has modifications staged in the
> index or in the working tree, which maps to other VCS's idea of
> "modified" (because for them, marking something as to-be-committed and
> committing it are part of the same step).

Yes, I think you're right.  For most of my purposes, I think that it's
irrelevant whether a change is staged or not and the choices that I 
offer allow the user to do what he thinks is right for a file with 
changes that are staged but uncommitted.  For me to automatically do 
something based on whether the file was staged for a commit would be a 
mistake as I would be reducing the user's options.

However, the distinction might be worth making in the file tree display 
to remind the user what's staged and what's not?

>
>>> For this, you probably want "git diff-files --name-only", which will
>>> show files with differences in the working tree. Keep in mind that git
>>> has an "index" or "staging area", which means that you have three states
>>> of content for a given path:
>>>
>>>    1. the state of the prior commit (i.e., HEAD)
>>>
>>>    2. the state that is marked to be committed when "git commit" is run
>>>       (i.e., the index)
>>>
>>>    3. the state in the working tree
>>
>> This is a prime example of the different mindset of the git user to
>> the hg user.
>
> You don't have to use those features, of course. It's just that
> something like "git status" is going to report on the differences
> between those states, so as a tool writer you need to know they are
> there (and as I said above, you are free to simplify if it fits into the
> mental model of your tool).
>
>>> You can compare the first two with "git diff-index", and the latter two
>>> with "git diff-files". You can also use "git status --porcelain" to get
>>> a machine-readable output that shows how the three states match up, with
>>> one line per file.
>>
>> This is an example of why I'm confused.  There are too many ways to
>> do (similar) things and it's hard to know which to use.
>
> Git is made of little building blocks. The original way to see the
> differences between the index and the working tree was via diff-files.
> But then people build bigger building blocks out of the smaller ones.
> "git status" is really just a shorthand for:
>
>    git diff-index HEAD &&
>    git diff-files &&
>    git ls-files -o
>
> and is in fact implemented using those building blocks (originally as a
> shell script, though these days it is written in C). So you can choose
> either and get the same information. Choosing a higher-level building
> block may save you some work, if the abstraction matches what you want.
> Otherwise, you can compose what you want from the lower levels.
>
> I know it sometimes leads to an overwhelming number of commands, and I'm
> not trying to excuse git's tendency to confuse people. I'm just hoping
> to unconfuse you in this particular situation.

As an aside, I found it easier to delve into git's innards to find out 
how to implement git binary patches than I did finding out how to do 
things from the CLI :-).

>
> In your case, I think "status" is the most convenient level of
> abstraction for you, because you are interested in looking at
> differences to both the index and HEAD (i.e., the prior commit). But if
> you find as you implement that you want more flexibility, you can switch to
> using the lower-level commands yourself.

I'll investigate this approach.  How easy is it to distinguish low level 
commands from high level commands?

>
>> Maybe an example of why I think the feature is useful might help.
>> Say that you start editing a file and then decide that you want to
>> put this change into a patch rather than committing it.  If you were
>> using quilt you would have to do this manually by any of a number of
>> ways such as:
>>
>> $ <git diff command> file > temp.patch
>> $ <git revert command> file
>> $ quilt new one.patch
>> $ quilt add file
>> $ patch -p1 file < temp.patch
>> $ rm temp.patch
>>
>> In darning, you just do:
>>
>> $ darn new one.patch
>> $ darn add --absorb file
>
> Sure. We have stgit and topgit, which do similar patch management things
> on top of git. I don't personally use either, though, so I don't have
> much to say on how they compare to darning, or whether it is worth
> looking at their implementations.

And there's MQ on top of hg.  I find the idea of doing "temporary" 
commits (which is what these tools are essentially doing) a little risky 
(e.g. what happens if you do a push with temporary commits in place). 
With MQ, I use its hook system to prevent this happening and I imagine 
git provides something similar.

Of course, these tools have the advantage that it's easier to promote a 
patch to a full blown commit than it is for quilt or darning in its 
current form (I'm thinking about how to do this).  At this stage, 
darning is targeted at the user who has to maintain a set of patches on 
top of a third party source tree without the need to eventually commit 
the changes themselves (i.e. distribution managers).

I've never used stgit or topgit but I have used both quilt and MQ a lot.
I find them both quite usable but each with their own set of
advantages and disadvantages hence my attempt to make a tool as much
like them as possible but with a smaller set of disadvantages.

>
>> The interface to the SCM to support this is two functions:
>>
>> 1: get_files_with_uncommitted_changes() which called with no
>> arguments returns a list of the paths of all files with uncommitted
>> changes or when given a list of file paths (the more common case)
>> returns the subset of that list which have uncommitted changes; and
>
> "status" will do this for you, modulo the simplification of the concept
> of the index, as we discussed above.
>
>> 2. copy_clean_version_to(filepath, target_path) which makes a copy of
>> the file as recorded in the prior commit and places it at the
>> target_path (usually where darning stores the "original" for
>> reference when creating diffs).
>
> You probably want:
>
>    git cat-file blob HEAD:filepath >target_path
>

I think I might do this in two stages.  First, just do the bit about 
adding files and pushing (as that is the most useful) and leave the file 
tree as a vanilla tree for the time being (as it looks like it may be 
more complicated).
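
For reference, the two SCM interface functions discussed above might be
sketched on top of git roughly as follows. This is only an illustration
under assumptions: the function names are taken from the description in
this thread, staged and unstaged changes are both treated as
"uncommitted", untracked files are left out, and paths are taken to be
relative to the top of the working tree.

```shell
# Hypothetical wrappers; neither is a git command.

# Print the paths that have uncommitted changes (staged or not).
# With arguments, only those paths are considered.
get_files_with_uncommitted_changes () {
	git status --porcelain --untracked-files=no -- "$@" |
	while IFS= read -r line
	do
		# drop the two status characters and the separating space
		printf '%s\n' "${line#???}"
	done
}

# Write the version of path "$1" recorded in the prior commit to "$2".
copy_clean_version_to () {
	git cat-file blob "HEAD:$1" >"$2"
}
```

Both rely only on "git status --porcelain" and "git cat-file", as
suggested earlier in the thread.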


Thanks for your help,
Peter

* Re: [PATCH] Add MYMETA.yml to perl/.gitignore
From: Jeff King @ 2011-12-01 22:35 UTC (permalink / raw)
  To: Sebastian Morr; +Cc: git
In-Reply-To: <20111201203114.GA12796@thinkpad>

On Thu, Dec 01, 2011 at 09:31:15PM +0100, Sebastian Morr wrote:

> This file is auto-generated in the process of building the Perl
> extension.
>
> [...]
> 
> I just built Git for the first time, issued "git status", and there
> was this untracked file. I guess you could call that an itch. This patch
> fixes that, however, I'm not sure whether this is a relevant issue.

Thanks, I wrote the same patch myself last week. I could swear I sent it
to the list, but it appears that I forgot.

The only thing I would add is the reason this is suddenly coming up now:
generating MYMETA.yml is done only by new-ish versions of
ExtUtils::MakeMaker (it started for me with perl 5.14, which just hit
debian unstable recently). The file just contains extra information
about the environment and arguments to the Makefile-building process,
and can be safely deleted.

-Peff

* Re: [PATCHv2 0/4] git-p4: small fixes to branches and labels; tests
From: Vitor Antunes @ 2011-12-01 21:59 UTC (permalink / raw)
  To: Pete Wyckoff; +Cc: git, Luke Diamand
In-Reply-To: <CAOpHH-UMdLpCPx1+D2dtQJs+=t1+0U2srKfTwBi-TEF4F7EDyw@mail.gmail.com>

On Dec 1, 2011 4:03 AM, "Pete Wyckoff" <pw@padd.com> wrote:
> I see your point.  P4 labels are the only way that they support
> tagging, apparently.  I'm okay with leaving label support in
> git-p4.  And it will be nice if Luke makes it behave a bit
> better.  But doing heroics to emulate cross-commit tags feels
> like a lot of work, and the wrong direction.

Agreed. Let's keep it simple.

-- 
Vitor Antunes

* Re: Proposal: create meaningful aliases for git reset's hard/soft/mixed
From: Phil Hord @ 2011-12-01 21:23 UTC (permalink / raw)
  To: Philippe Vaucher; +Cc: Junio C Hamano, git, Christian Couder
In-Reply-To: <CAGK7Mr5nQoubAw11KDj4WKwQnXrfgteKbMj2=AR-HhsGKi52wQ@mail.gmail.com>

On Wed, Nov 23, 2011 at 6:00 PM, Philippe Vaucher
<philippe.vaucher@gmail.com> wrote:
>> In any case, I think your proposal makes it even worse than the current
>> state, and you should aim higher.
>
> Why worse? I'd understand if you said it doesn't improve it enough
> for it to be worth the change tho.

I think that's what "you should aim higher" means.

> Anyway, my proposal was to get a discussion going, and I'm all for
> aiming higher if there's a way. What do you propose instead? You
> seemed to imply we'd remove --soft and --merge, and make --keep as an
> option for --hard but named differently, something like
> --keep-changes. Maybe I didn't fully understand.

I think there are too many scripts dependent on these switches to
remove them.  But I love the direction you're going in.

Aim higher.

> Mathieu even suggested that it'd have the behavior of --keep by
> default, and that you have to add --force to get today's --hard
> behavior, which sounds like a good idea to me (avoid destructive
> behavior by default).

Think outside the "reset" command.  Like this:

From the "most popular" comment on http://progit.org/2011/07/11/reset.html:
> I remember them as:
> --soft      -> git uncommit
> --mixed  -> git unadd
> --hard     -> git undo

I don't particularly like these names, but conceptually they are helpful.

What other commands can we embellish or create to replace the overloaded
git-reset functionality?

How about:
  --soft: git checkout -B <commit>
  --mixed: git reset -- <paths>
  --hard:  git checkout --clean

Phil

* Re: Proposal: create meaningful aliases for git reset's hard/soft/mixed
From: Phil Hord @ 2011-12-01 21:02 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Philippe Vaucher, git, Christian Couder
In-Reply-To: <7vlir6brjw.fsf@alter.siamese.dyndns.org>

On Wed, Nov 23, 2011 at 1:51 PM, Junio C Hamano <gitster@pobox.com> wrote:
> "git reset --hard HEAD" is an unambiguously descriptive good name for the
> option. It is a "hard reset" like power cycling a machinery to discard
> anything in progress and get back to a clean slate. I do not see anything
> confusing with this mode nor its name.

As a git expert-user, I agree.

But, honestly, as a git new-user, I had a lot of trouble with this
command.  It is mysterious and powerful and new users do not
understand it.  Everyone learns "git reset --hard HEAD" as a single
command.  Only much later (if ever) do they learn about the other
git-reset options.  --hard is the only useful option for the new user,
so it seems superfluous.  HEAD is a foreign concept for the new-user
and makes little sense when this command is first memorized.  And at
the early stages of the git learning curve, that's what it is:
memorized.  _The spelling is what counts; the meaning is mysterious._
(For all its flaws, though, at least "git reset --hard HEAD" serves to
introduce the new-user to the concept of HEAD.)

So, as a git new-user, what I wanted was this:
  git clean-checkout [or "git checkout --clean"]

What I found instead was this:
  git reset --hard HEAD

What does this have to do with "checking out my files from the last
commit" or "discarding my local, uncommitted edits"?  To the new-user,
nothing at all.  reset?  Meaningless.  --hard?  Whatever.  HEAD?
Shrug.

In the end it doesn't even do what I wanted.  What I really wanted was this:
  git reset --hard HEAD && git clean -fd

I think the git-reset modes should be relegated to plumbing.  I can
see how 'git reset --mixed' is useful for resetting changes out of the
index, but reset is so mired in all sorts of extra mumbo-jumbo that
this usage becomes a forgotten detail for me.  I didn't even learn
that usage until later, where it makes loads of sense on its own:

     FTH: This means that git reset <paths> is the opposite of git add <paths>.

That is beautiful, clean and useful.  If that's all it did, it would be perfect.

Problems with "git reset --hard":
 * It has no safety nets (except the reflog, another concept foreign
to new-users)
 * It requires extra switches/arguments to be useful
 * Surprisingly (at first), it can move your branch, but it's not
spelled 'branch' or 'commit' or 'move'

That last one is particularly troubling in light of the description of
'git reset --hard':
     Resets the index and working tree. Any changes to tracked
     files in the working tree since <commit> are discarded.

Maybe we should add "and by the way, your currently checked-out branch
is moved to point to <commit>".
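To make the point about the branch moving concrete, here is a toy model of the three reset modes. It is only a sketch: the Repo class and its fields are invented for illustration, file contents are plain strings, and real git handles many more cases. It does show that all three modes move the current branch, and that --hard leaves untracked files alone (which is why "git clean -fd" was needed on top).

```python
from dataclasses import dataclass


@dataclass
class Repo:
    """Toy repository: every name here is hypothetical."""
    commits: dict    # commit id -> tree, a dict of path -> content
    branch: str      # commit id the checked-out branch points at
    index: dict      # staged path -> content
    worktree: dict   # path -> content on disk


def reset(repo, commit, mode="mixed"):
    """Model `git reset --soft|--mixed|--hard <commit>`."""
    target = repo.commits[commit]
    old_index = repo.index

    # Every mode moves the currently checked-out branch.
    repo.branch = commit

    if mode in ("mixed", "hard"):
        # The index is made to match the target commit.
        repo.index = dict(target)

    if mode == "hard":
        # Tracked files are overwritten or removed; files that were
        # never in the index (untracked) are left in place.
        untracked = {p: c for p, c in repo.worktree.items()
                     if p not in old_index}
        repo.worktree = {**untracked, **target}


def make_repo():
    return Repo(
        commits={"c1": {"a": "1"}, "c2": {"a": "2"}},
        branch="c2",
        index={"a": "3"},
        worktree={"a": "3", "junk": "x"},
    )
```

Calling reset(make_repo(), "c1", "hard") ends with the branch at c1, the index and the tracked file "a" back at the c1 content, and "junk" still present in the work tree.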

</rant>

Phil

* Re: Workflow Recommendation - Probably your 1000th
From: bradford @ 2011-12-01 20:46 UTC (permalink / raw)
  To: Stephen Bash; +Cc: git
In-Reply-To: <363b3901-eee6-4265-adae-267f4662a1f7@mail>

Thanks, Stephen.   I guess I'm looking for more input on the
advantages and disadvantages of using a QA and production branch vs
just doing everything out of master.

Trying to go through the following:
http://news.ycombinator.com/item?id=1617425
scottchacon.com/2011/08/31/github-flow.html

We have some weeks where we release very frequently and some weeks
where we release only once a week and have to do production fixes in
the meantime.  I'm sure other people have similar experiences.

On Thu, Dec 1, 2011 at 1:55 PM, Stephen Bash <bash@genarts.com> wrote:
> ----- Original Message -----
>> From: "bradford" <fingermark@gmail.com>
>> To: git@vger.kernel.org
>> Sent: Thursday, December 1, 2011 1:26:10 PM
>> Subject: Workflow Recommendation - Probably your 1000th
>>
>> You guys probably receive a ton of workflow related questions.  I'm
>> trying to convert from svn to git.  In order to complete, I would
>> like to be able to provide a workflow to our team.  We typically go
>> from dev -> qa -> production (Java and Rails projects).  The problem
>> is that sometimes QA can get backed up and we'll need to release
>> something to production while QA is doing their thing.  What is a
>> good workflow?  I would like to not use git-flow, because it's another
>> tool.
>
> Hey wow...  I read that Driessen's workflow post [1] a long time ago, but hadn't run into the git-flow tools until a few days ago.  Guess I was just oblivious...  Anyway, if it's any consolation, my company runs a model very much inspired by Driessen's post without using git-flow itself.
>
> [1] http://nvie.com/posts/a-successful-git-branching-model/
>
>> I've read suggestions to use environment branches (master,
>> staging, production).  I've also read not to do this and just use
>> master, tagging your production releases.  How well would our setup,
>> where things can get backed up, work with the latter?  Are there any
>> alternative suggestions?
>
> In our workflow we flip Driessen's model on its head.  master is the newest code, while we branch off maintenance branches just before each release.  We tag each release so it's easy to identify which versions in the field contain a given bug or fix (multiple minor versions come off a single maintenance branch).  Our QA guys follow the maintenance branches (they're relatively stable).  We recently had to do a hot-fix release which I think would be similar to your "release to production".  Basically we found the last commit on the maintenance branch that was well tested, created a new branch from there, did the hot fix, QA did some real fast testing (sounds like you'd skip this step), and we shipped that.  As always that hot-fix release gets tagged, so in the future we can still reference that particular build (and in this case the branch merged back into the maintenance branch -- we've had other situations where the branch was simply deleted after tagging).
>
> In the grand scheme of things our model isn't that different than Driessen's; we just name the branches differently.  Commits go on the oldest branch that's safe for them, and then everything merges to the newer branches.  Tags provide easy reference for where on a given branch a release came from.
>
> Hope that helps.
>
> Stephen

* [PATCH] Add MYMETA.yml to perl/.gitignore
From: Sebastian Morr @ 2011-12-01 20:31 UTC (permalink / raw)
  To: git

This file is auto-generated in the process of building the Perl
extension.

Signed-off-by: Sebastian Morr <sebastian@morr.cc>
---

Good day, fine Sirs and Madams,

I just built Git for the first time, issued "git status", and there
was this untracked file. I guess you could call that an itch. This patch
fixes that, however, I'm not sure whether this is a relevant issue.

Didn't know whom to Cc, as the perl directory doesn't seem to get much
attention lately.

Kind regards,
Sebastian

 perl/.gitignore |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/perl/.gitignore b/perl/.gitignore
index 98b2477..9235e73 100644
--- a/perl/.gitignore
+++ b/perl/.gitignore
@@ -1,5 +1,6 @@
 perl.mak
 perl.mak.old
+MYMETA.yml
 blib
 blibdirs
 pm_to_blib
-- 
1.7.8.rc4.dirty

* Re: log: option "--follow" not the default for a single file?
From: Jeff King @ 2011-12-01 20:36 UTC (permalink / raw)
  To: Ralf Thielow; +Cc: git
In-Reply-To: <CAN0XMOKrCovkmmdqu2GjsDof0wehfbf5a0EQuPo0q7GQaJ=GRQ@mail.gmail.com>

On Thu, Dec 01, 2011 at 09:28:31PM +0100, Ralf Thielow wrote:

> > It's possible, but it is changing the meaning of "git log foo". With
> > the current code, even if "foo" is currently a file, it will match
> > "foo/bar" in a prior revision. Switching this to "--follow" will not.
> 
> Why does it actually match both things? I think that's
> maybe wrong.

Because that's what the path argument to "git log" is designed to do --
limit revision traversal based on pathspecs.

You can argue that the "--follow" semantics are more meaningful, but it
doesn't change the fact that it is a behavior change. We have to
consider not only backwards compatibility, but also the confusing-ness
of an inconsistent interface where:

  git log foo bar

will treat "foo" as a pathspec, but:

  git log foo

will treat it as a file.

> Also, I can't use "git log" with a file/folder that doesn't exist
> now but did in another revision. What actually
> exists is the file and that's imho the only thing that should match.

You can:

  git log -- existed-long-ago

As a syntax shortcut, you can drop the "--". However, there is some
ambiguity with revision arguments, so git allows path arguments without
a "--" only when they exist in the filesystem (_not_ in a particular
revision).

-Peff

* Re: log: option "--follow" not the default for a single file?
From: Ralf Thielow @ 2011-12-01 20:28 UTC (permalink / raw)
  To: Jeff King; +Cc: git
In-Reply-To: <20111201185230.GB2873@sigill.intra.peff.net>

> It's possible, but it is changing the meaning of "git log foo". With
> the current code, even if "foo" is currently a file, it will match
> "foo/bar" in a prior revision. Switching this to "--follow" will not.

Why does it actually match both things? I think that's
maybe wrong.
The folder was moved or deleted, so it doesn't exist in my
working directory. Also, I can't use "git log" with a file/folder
that doesn't exist now but did in another revision. What actually
exists is the file and that's imho the only thing that should match.

* Re: Copy branch into master
From: Neal Kreitzinger @ 2011-12-01 19:43 UTC (permalink / raw)
  To: git; +Cc: 'Andrew Eikum', git
In-Reply-To: <CB09450076EA444CA289CDCB995F16A4@bny.us.bosch.com>

On 11/28/2011 1:30 PM, Ron Eggler wrote:
>> Sorry, I have no idea how to use any of the GUI tools. Perhaps your
>> GUI tool has a mailing list where you can ask about merge conflict
>> resolution?
>
> No problem, I actually got it all figured out now, and got my branch
> smoothly merged back into master.
>
That is not what you originally asked for.  What you asked for was:
"Now I would like to copy exactly what I have in that branch back into 
my master to have an exact copy in my master of what got deployed 
without any changes."  If you did a git-merge then what you did was combine 
master with DVT.  That most likely did not make master equal to DVT.  If 
you run the following git-diff the results will likely show they do not 
match:
$ git diff --name-only sha1-of-master-after-DVT-merge sha1-of-DVT-before-merge-to-master

If you merge branch A into branch B it does not make branch B equal to 
branch A.  It makes branch B a combination of branch B and branch A 
(plus your merge conflict resolutions).  If you truly want to make 
master exactly match DVT then I recommend the following: (I'm assuming 
this is not a superproject containing submodules, and that you are using 
linux.  I am using git 1.7.1.)

(Return master to the state it was in when you asked the question)
(1) git checkout master
(2) git branch BKUP-master-DVT-merge (backup your current post-merge 
master to another branch)
(3) git reset --hard sha1-of-master-before-merge

(Return DVT to the state it was in when you asked the question)
(1) git checkout DVT
(2) git branch BKUP-DVT-B4-merge (backup current DVT if it has new work)
(3) git reset --hard sha1-of-DVT-before-merge

(Make master match DVT exactly)
(1) Use the "vendor branch code drop" method described in the git-rm 
manpage* (see note on permissions below).  (Use the git-archive command 
to create your tarball of DVT for this.)
(2) After "vendor branch code drop" is committed, git diff --name-status 
master DVT (they should match, i.e. no results that matter)
(3) git-tag the resulting commit to make it clear this is the version 
you deployed.

*Note: if you are tracking permissions in git (executable vs. 
non-executable bits) then you will need to accommodate and validate 
permissions in your git-archive and/or ensure permissions are set 
properly before committing your vendor branch code drop.  Otherwise, you 
will have permissions changes that do not exactly match DVT.

I recommend trying the above on a test copy of your repo to verify the 
results are really what you want.
(tar up your repo as root to retain permissions, and untar in test dir)
(repo path = /home/me/MY-REPO)
(1) pwd ( = /home/me)
(2) mkdir test
(3) su root
(4) pwd ( = /home/me)
(5) tar -cvzf MY-REPO.tar.gz MY-REPO/
(6) cd test
(7) mv ../MY-REPO.tar.gz .
(8) tar -xvzf MY-REPO.tar.gz
(test repo path = /home/me/test/MY-REPO)
(9) exit
(10) cd /home/me/test/MY-REPO
(11) try out what I said on test/MY-REPO and then decide if you want to 
do it on the real me/MY-REPO.

If you already have additional commits on master after your master-DVT 
merge then they are backed up in the BKUP-master-DVT-merge branch you 
made earlier.  These commits can then be interactively rebased on your 
remediated master.  (Be aware of any possible unique master-DVT merged 
code the new commits are dependent on.  If so, that code needs to be 
reincorporated instead of being unwittingly lost.)

All the above assumes that others have not already pulled your 
master-DVT merge and based their work on it.  If they have already 
pulled your new master and based work on it then it may be too late for 
this to be practical, or additional steps would be needed for others to 
properly remediate.  I'm assuming you have a known set of developers 
pulling from you.  It is possible for them to interactively rebase their 
new work onto your remediated master and take the merged-master out of 
their history.  (Be aware of any possible unique master-DVT merged code 
the new commits are dependent on.  If so, that code needs to be 
reincorporated instead of being unwittingly lost.)

Hope this helps.

v/r,
neal

* Re: Status after 'git clone --no-checkout' ?
From: Jeff King @ 2011-12-01 19:00 UTC (permalink / raw)
  To: norbert.nemec; +Cc: git
In-Reply-To: <jb59h0$p3e$1@dough.gmane.org>

On Wed, Nov 30, 2011 at 02:02:22PM +0100, norbert.nemec wrote:

> what exactly is the status after 'git clone --no-checkout'? Is there
> any straightforward way how one could end up in this state starting
> from a regularly checked out repository?

You have a HEAD which points to some actual commit, but no index or
working tree. I don't think there is a particular name for this state.

You can get something similar in an existing repo by deleting all of the
working tree files and removing .git/index.

> 'git checkout' without any further options serves to move from the
> aforementioned special state to a regular checked out state.
> Otherwise it never seems to do anything. Are there any other
> situations where 'git checkout' on its own would have any effect?

By itself, I don't think so. But you can use "git checkout -f" to
discard changes in the index and working tree, setting them back to the
state in HEAD.

At one point, some people used "git checkout" as a no-op, because it
would print the "ahead/behind" information with respect to the upstream.
These days, that information is part of "git status", so I suspect
people use that instead.

-Peff

* Re: [PATCH/RFC 1/2] pull: pass the --no-ff-only flag through to merge, not fetch
From: Samuel Bronson @ 2011-12-01 18:59 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7vvcq0np35.fsf@alter.siamese.dyndns.org>

On Thu, Dec 1, 2011 at 1:06 PM, Junio C Hamano <gitster@pobox.com> wrote:
> Samuel Bronson <naesten@gmail.com> writes:
>
>> Hmm, yes, I had noticed that it was a tristate (merge.ff clearly is),
>> and I guess --no-ff-only is a pretty ugly flag. I do have to ask,
>> though: why give --ff these new values? Wouldn't it make more sense to
>> reuse the values accepted by merge.ff; namely, 'true' (the implied
>> default), 'false', and 'only'?
>
> The 'true' and 'false' values to merge.ff are carry-over from the days
> when it was a boolean, _not_ a tristate. If we were to make the UI more
> rational by making it clear that this is not a boolean, it is a good time
> for us to aim a bit higher than merely repeating the mistakes we made in
> the past due to historical accident. In other words, we could add a
> synonym for the "default" mode in addition to "--ff=true" (and for the
> "always merge" mode in addition to "--ff=false") that makes it clear that
> the value is _not_ a boolean [*1*]. If we were to go the "--ff=<value>"
> route, we have to add support for other ways to spell boolean 'true'
> (e.g. 'yes', '1', and 'on') anyway, so it is not that much extra work to
> do so, I would think.

Sure, that makes sense. I was just a little worried that you might be
(accidentally) proposing that --ff use a different set of names than
merge.ff for a moment there...

>> Otherwise, this looks like a very nice way to implement what I want: I
>> guess it is probably a mistake that the existing (documented) flags do
>> not behave in this way?
>
> Yeah, right now if you say "merge --ff-only --no-ff", we say these are
> mutually exclusive (which is true), but if you think about the tristate
> nature of the 'ff' option and spell it differently in your head, i.e.
> "merge --ff=only --ff=never", it is reasonable to argue that we should
> apply the usual "last one overrides" rule and behave as if "merge --no-ff"
> were given (for the purpose of "last one overrides", the configured
> defaults can be treated as if they come very early on the command line).
> After all "merge --no-ff --ff" does seem to use the "last one overrides"
> rule.

Yes, I totally agree that it would make more sense that way; I
certainly tried that before I even began to look at any of the code.
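For illustration only, the "last one overrides" idea for a tristate ff option could look roughly like the sketch below. This is not git's option parser: resolve_ff() is a hypothetical function, and the value names "allowed", "never" and "only" are just the spellings floated in this thread. The configured default is treated as if it came first on the command line.

```python
# Hypothetical value tables: boolean-style spellings are kept as
# synonyms for the non-boolean names discussed in the thread.
ALLOWED = {"true", "yes", "on", "1", "allowed"}
NEVER = {"false", "no", "off", "0", "never"}


def parse_ff_value(value):
    """Map a --ff=<value> spelling onto one of the three modes."""
    value = value.lower()
    if value in ALLOWED:
        return "allowed"
    if value in NEVER:
        return "never"
    if value == "only":
        return "only"
    raise ValueError("unknown ff mode: " + value)


def resolve_ff(configured, argv):
    """Return the ff mode after applying options in order.

    `configured` plays the role of merge.ff; later options win.
    """
    mode = parse_ff_value(configured)
    for arg in argv:
        if arg == "--ff":
            mode = "allowed"
        elif arg == "--no-ff":
            mode = "never"
        elif arg == "--ff-only":
            mode = "only"
        elif arg.startswith("--ff="):
            mode = parse_ff_value(arg[len("--ff="):])
    return mode
```

Under this sketch, resolve_ff("true", ["--ff-only", "--no-ff"]) settles on "never" instead of reporting the two options as mutually exclusive.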

> [Footnote]
>
> *1* Perhaps 'allowed' instead of 'normal' (which I wrote out of thin-air;
> I do not have any strong preference on the actual values) may be a better
> choice for such a "this is not a boolean" spelling for the default mode.

* Re: Workflow Recommendation - Probably your 1000th
From: Stephen Bash @ 2011-12-01 18:55 UTC (permalink / raw)
  To: bradford; +Cc: git
In-Reply-To: <CAEbKVFSXn3we7Btb3fN5DUW7BMub_ZrBeUwLUZrRFTmESoW97A@mail.gmail.com>

----- Original Message -----
> From: "bradford" <fingermark@gmail.com>
> To: git@vger.kernel.org
> Sent: Thursday, December 1, 2011 1:26:10 PM
> Subject: Workflow Recommendation - Probably your 1000th
> 
> You guys probably receive a ton of workflow related questions.  I'm
> trying to convert from svn to git.  In order to complete, I would
> like to be able to provide a workflow to our team.  We typically go 
> from dev -> qa -> production (Java and Rails projects).  The problem 
> is that sometimes QA can get backed up and we'll need to release
> something to production while QA is doing their thing.  What is a
> good workflow?  I would like to not use git-flow, because it's another
> tool.  

Hey wow...  I read that Driessen's workflow post [1] a long time ago, but hadn't run into the git-flow tools until a few days ago.  Guess I was just oblivious...  Anyway, if it's any consolation, my company runs a model very much inspired by Driessen's post without using git-flow itself.

[1] http://nvie.com/posts/a-successful-git-branching-model/

> I've read suggestions to use environment branches (master,
> staging, production).  I've also read not to do this and just use
> master, tagging your production releases.  How well would our setup,
> where things can get backed up, work with the latter?  Are there any
> alternative suggestions?

In our workflow we flip Driessen's model on its head.  master is the newest code, while we branch off maintenance branches just before each release.  We tag each release so it's easy to identify which versions in the field contain a given bug or fix (multiple minor versions come off a single maintenance branch).  Our QA guys follow the maintenance branches (they're relatively stable).  We recently had to do a hot-fix release which I think would be similar to your "release to production".  Basically we found the last commit on the maintenance branch that was well tested, created a new branch from there, did the hot fix, QA did some real fast testing (sounds like you'd skip this step), and we shipped that.  As always that hot-fix release gets tagged, so in the future we can still reference that particular build (and in this case the branch merged back into the maintenance branch -- we've had other situations where the branch was simply deleted after tagging).

In the grand scheme of things our model isn't that different than Driessen's; we just name the branches differently.  Commits go on the oldest branch that's safe for them, and then everything merges to the newer branches.  Tags provide easy reference for where on a given branch a release came from.

Hope that helps.

Stephen

* Re: log: option "--follow" not the default for a single file?
From: Jeff King @ 2011-12-01 18:52 UTC (permalink / raw)
  To: Ralf Thielow; +Cc: git
In-Reply-To: <CAN0XMOJGm1frOi7FEke7LfHCSBt2DRn_npkdKe0m3qZ=hQPNHw@mail.gmail.com>

On Wed, Nov 30, 2011 at 07:38:23PM +0100, Ralf Thielow wrote:

> >     pathspec. That may happen to match a single file in the current
> >     revision, but to git it is actually a prefix-limiting pattern, and
> Is it possible to detect the case of a single file in the current
> revision and use "--follow" by default for exactly that?

It's possible, but it is changing the meaning of "git log foo". With
the current code, even if "foo" is currently a file, it will match
"foo/bar" in a prior revision. Switching this to "--follow" will not.
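A simplified sketch of what "prefix-limiting" means here. It assumes a pathspec is either an exact path or a leading directory, ignores globs and the other pathspec features real git supports, and uses made-up names (pathspec_matches, log_limited) plus a made-up history format.

```python
def pathspec_matches(pathspec, path):
    """True if `path` is the pathspec itself or lies underneath it."""
    spec = pathspec.rstrip("/")
    return path == spec or path.startswith(spec + "/")


def log_limited(history, pathspec):
    """Keep the commits that touched anything matching the pathspec.

    `history` is a list of (commit id, changed paths), newest first.
    """
    return [commit for commit, paths in history
            if any(pathspec_matches(pathspec, p) for p in paths)]


history = [
    ("c3", ["foo"]),       # "foo" is a file in the current revision
    ("c2", ["foo/bar"]),   # earlier, "foo" was a directory
    ("c1", ["foobar"]),    # shares a string prefix only; not a match
]
print(log_limited(history, "foo"))
```

The commit that touched "foo/bar" is selected even though "foo" is a plain file today, which is the behavior that would change if a single-file argument silently switched to "--follow".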

-Peff

* Re: git merge strange result
From: Jeff King @ 2011-12-01 18:50 UTC (permalink / raw)
  To: Catalin(ux) M. BOIE; +Cc: git
In-Reply-To: <alpine.LFD.2.02.1112011733070.32682@mail.embedromix.ro>

On Thu, Dec 01, 2011 at 05:36:00PM +0200, Catalin(ux) M. BOIE wrote:

> Below is a script that reproduces what a colleague of mine found.
> It seems that if in a branch we have a commit that is cherry-picked by
> master, then revert that commit in the branch and merge the branch into
> master, the revert is ignored. Is it normal?

Yes, it's by design. When doing a merge, we look at three points: the
tip of the current branch, the tip of the branch to be merged, and the
point at which history diverged (the "merge base"). We don't look
individually at the commits that happened between the merge base and
each tip.

The non-conflicting case for a 3-way merge is that one side makes some
change, but the other side does nothing. In this case, we include the
change in the merge result. But remember that we are only looking at
endpoints. So what the actual merge code sees is that one side's version
of a file is identical to the merge base's version, and that the other
side's version is now different.

In your case, one side makes the change, but then restores the original
state. So from the perspective of the merge code, no change happened at
all on that side.
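The endpoint-only view can be sketched with a whole-file three-way merge. This is a deliberate simplification (git merges at a finer granularity than whole files), and merge_file() plus the sample contents are invented for illustration.

```python
def merge_file(base, ours, theirs):
    """Three-way merge of one file, looking only at the endpoints."""
    if ours == theirs:
        return ours        # both sides agree
    if ours == base:
        return theirs      # only their side changed
    if theirs == base:
        return ours        # only our side changed
    raise ValueError("conflict: both sides changed the file differently")


base = "original\n"

# master cherry-picked the change, so its tip differs from the base.
master_tip = "changed\n"

# The branch made the change and then reverted it; only the last
# state is visible to the merge.
branch_history = ["changed\n", "original\n"]
branch_tip = branch_history[-1]

merged = merge_file(base, ours=master_tip, theirs=branch_tip)
print(repr(merged))
```

Because the branch tip is identical to the merge base, the merge sees a change on master's side only and keeps it; the intermediate commit and its revert never enter the comparison.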

-Peff
