Git development

Git development
 help / color / mirror / Atom feed

* [PATCH 3/3] Abstract out zlib
From: Marco Costalba @ 2008-01-13 13:24 UTC (permalink / raw)
  To: gitster; +Cc: git, Marco Costalba
In-Reply-To: <1200230678-18188-2-git-send-email-mcostalba@gmail.com>

Finally remove the hardcoded link between
compression helpers and zlib.

Association bewteen a stream and the compression library
is done at stream init time through a set of function
pointers. Library choice is based on compression level
parameter passed to compress_alloc(), this allow us to
stay 100% back compatible with current code.

Patch also adds the the necessary zlib engine plugin,
this turns out to be trivial since we have modeled
everything around zlib.

Signed-off-by: Marco Costalba <mcostalba@gmail.com>
---
 compress.c |   55 ++++++++++++++++++++++++++++++++++++++++++++++---------
 compress.h |    6 ++++++
 2 files changed, 52 insertions(+), 9 deletions(-)

diff --git a/compress.c b/compress.c
index b28c389..09a5df0 100644
--- a/compress.c
+++ b/compress.c
@@ -1,6 +1,41 @@
 #include "cache.h"
 #include "compress.h"
 
+/* Default zlib engine plugin definition
+ *
+ * Because everything is modeled after zlib the
+ * corresponding engine wrappers here are trivial
+ */
+static int zlib_deflateInit(z_stream *s, int l) { return deflateInit(s, l); }
+static int zlib_deflate(z_stream *s, int f) { return deflate(s, f); }
+static int zlib_deflateEnd(z_stream *s) { return deflateEnd(s); }
+static unsigned long zlib_deflateBound(z_stream *s, unsigned long sz) { return deflateBound(s, sz); }
+
+static int zlib_inflateInit(z_stream *s) { return inflateInit(s); }
+static int zlib_inflate(z_stream *s, int f) { return inflate(s, f); }
+static int zlib_inflateEnd(z_stream *s) { return inflateEnd(s); }
+
+
+/* link the stream to the compression library functions */
+static int register_engine(ext_stream *stream, int level)
+{
+	switch (level) {
+	case LZO_COMPRESSION:
+		die ("LZO compression still not implemented");
+		break;
+	default: /* assumed to be zlib */
+		stream->deflateInit_fn  = zlib_deflateInit;
+		stream->deflate_fn      = zlib_deflate;
+		stream->deflateEnd_fn   = zlib_deflateEnd;
+		stream->deflateBound_fn = zlib_deflateBound;
+		stream->inflateInit_fn  = zlib_inflateInit;
+		stream->inflate_fn      = zlib_inflate;
+		stream->inflateEnd_fn   = zlib_inflateEnd;
+		break;
+	};
+	return Z_OK;
+}
+
 /*
  *     Compression helpers
  */
@@ -8,8 +43,9 @@
 unsigned long compress_alloc(ext_stream *stream, int level, unsigned long size)
 {
 	memset(stream, 0, sizeof(*stream));
-	deflateInit(&stream->z, level);
-	return deflateBound(&stream->z, size);
+	register_engine(stream, level);
+	stream->deflateInit_fn(&stream->z, level);
+	return stream->deflateBound_fn(&stream->z, size);
 }
 
 int compress_start(ext_stream *stream,
@@ -28,7 +64,7 @@ int compress_next(ext_stream *stream, int flush)
 	int result;
 
 	do {
-		result = deflate(&stream->z, flush);
+		result = stream->deflate_fn(&stream->z, flush);
 	} while (result == Z_OK);
 
 	return result;
@@ -36,7 +72,7 @@ int compress_next(ext_stream *stream, int flush)
 
 unsigned long compress_free(ext_stream *stream)
 {
-	deflateEnd(&stream->z);
+	stream->deflateEnd_fn(&stream->z);
 	return stream->z.total_out;
 }
 
@@ -68,7 +104,8 @@ unsigned long compress_all(int level, unsigned char *in,
 int decompress_alloc(ext_stream *stream)
 {
 	memset(stream, 0, sizeof(*stream));
-	return inflateInit(&stream->z);
+	register_engine(stream, Z_DEFAULT_COMPRESSION); // FIXME for now zlib assumed
+	return stream->inflateInit_fn(&stream->z);
 }
 
 int decompress_from(ext_stream *stream, unsigned char *in, unsigned long in_size)
@@ -87,24 +124,24 @@ int decompress_into(ext_stream *stream, unsigned char *out, unsigned long out_si
 
 int decompress_next(ext_stream *stream, int flush)
 {
-	return inflate(&stream->z, flush);
+	return stream->inflate_fn(&stream->z, flush);
 }
 
 int decompress_next_from(ext_stream *stream, unsigned char *in, unsigned long in_size, int flush)
 {
 	decompress_from(stream, in, in_size);
-	return inflate(&stream->z, flush);
+	return stream->inflate_fn(&stream->z, flush);
 }
 
 int decompress_next_into(ext_stream *stream, unsigned char *out, unsigned long out_size, int flush)
 {
 	decompress_into(stream, out, out_size);
-	return inflate(&stream->z, flush);
+	return stream->inflate_fn(&stream->z, flush);
 }
 
 unsigned long decompress_free(ext_stream *stream)
 {
-	inflateEnd(&stream->z);
+	stream->inflateEnd_fn(&stream->z);
 	return stream->z.total_out;
 }
 
diff --git a/compress.h b/compress.h
index d1de31f..151234a 100644
--- a/compress.h
+++ b/compress.h
@@ -1,6 +1,12 @@
 #ifndef COMPRESS_H
 #define COMPRESS_H
 
+/* Add here custom compression levels. First 0-9
+ * and -1 are reserved values used by zlib
+ */
+#define LZO_COMPRESSION 99
+
+
 /* Any compress/decompress engine must implement all the
  * below functions that are modeled after the zlib ones.
  */
-- 
1.5.4.rc2.98.g58cd2

^ permalink raw reply related

* [PATCH 2/3] Rename z_stream to ext_stream across the sources
From: Marco Costalba @ 2008-01-13 13:24 UTC (permalink / raw)
  To: gitster; +Cc: git, Marco Costalba
In-Reply-To: <1200230678-18188-1-git-send-email-mcostalba@gmail.com>

Also fix the places where z_stream member access
is open coded.

In the future we could hide direct z_stream handling
behind some helpers, but for now leave it like it is
now to better understand what is going on.

Signed-off-by: Marco Costalba <mcostalba@gmail.com>
---
 builtin-pack-objects.c   |    8 +++---
 builtin-unpack-objects.c |    6 ++--
 http-push.c              |   12 +++++-----
 http-walker.c            |    6 ++--
 index-pack.c             |    6 ++--
 sha1_file.c              |   56 +++++++++++++++++++++++-----------------------
 6 files changed, 47 insertions(+), 47 deletions(-)

diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index d2865fe..5165a23 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -300,7 +300,7 @@ static int check_pack_inflate(struct packed_git *p,
 		off_t len,
 		unsigned long expect)
 {
-	z_stream stream;
+	ext_stream stream;
 	unsigned char fakebuf[4096], *in;
 	unsigned int in_size = 0;
 	int st;
@@ -310,12 +310,12 @@ static int check_pack_inflate(struct packed_git *p,
 		decompress_into(&stream, fakebuf, sizeof(fakebuf));
 		in = use_pack(p, w_curs, offset, &in_size);
 		st = decompress_next_from(&stream, in, in_size, Z_FINISH);
-		offset += stream.next_in - in;
+		offset += stream.z.next_in - in;
 	} while (st == Z_OK || st == Z_BUF_ERROR);
 	decompress_free(&stream);
 	return (st == Z_STREAM_END &&
-		stream.total_out == expect &&
-		stream.total_in == len) ? 0 : -1;
+		stream.z.total_out == expect &&
+		stream.z.total_in == len) ? 0 : -1;
 }
 
 static int check_pack_crc(struct packed_git *p, struct pack_window **w_curs,
diff --git a/builtin-unpack-objects.c b/builtin-unpack-objects.c
index c996560..066bf06 100644
--- a/builtin-unpack-objects.c
+++ b/builtin-unpack-objects.c
@@ -61,7 +61,7 @@ static void use(int bytes)
 
 static void *get_data(unsigned long size)
 {
-	z_stream stream;
+	ext_stream stream;
 	unsigned char *buf = xmalloc(size);;
 
 	decompress_alloc(&stream);
@@ -71,8 +71,8 @@ static void *get_data(unsigned long size)
 		/* fill() modifies len, so be sure is evaluated as first */
 		void* tmp = fill(1);
 		int ret = decompress_next_from(&stream, tmp, len, Z_NO_FLUSH);
-		use(len - stream.avail_in);
-		if (stream.total_out == size && ret == Z_STREAM_END)
+		use(len - stream.z.avail_in);
+		if (stream.z.total_out == size && ret == Z_STREAM_END)
 			break;
 		if (ret != Z_OK) {
 			error("decompress returned %d\n", ret);
diff --git a/http-push.c b/http-push.c
index ec0568c..c7ea871 100644
--- a/http-push.c
+++ b/http-push.c
@@ -127,7 +127,7 @@ struct transfer_request
 	long http_code;
 	unsigned char real_sha1[20];
 	SHA_CTX c;
-	z_stream stream;
+	ext_stream stream;
 	int zret;
 	int rename;
 	void *userData;
@@ -209,8 +209,8 @@ static size_t fwrite_sha1_file(void *ptr, size_t eltsize, size_t nmemb,
 		request->zret = decompress_next_into(&request->stream, expn,
 		                                     sizeof(expn), Z_SYNC_FLUSH);
 		SHA1_Update(&request->c, expn,
-			    sizeof(expn) - request->stream.avail_out);
-	} while (request->stream.avail_in && request->zret == Z_OK);
+			    sizeof(expn) - request->stream.z.avail_out);
+	} while (request->stream.z.avail_in && request->zret == Z_OK);
 	data_received++;
 	return size;
 }
@@ -483,7 +483,7 @@ static void start_put(struct transfer_request *request)
 	unsigned long len;
 	int hdrlen;
 	ssize_t size;
-	z_stream stream;
+	ext_stream stream;
 
 	unpacked = read_sha1_file(request->obj->sha1, &type, &len);
 	hdrlen = sprintf(hdr, "%s %lu", typename(type), len) + 1;
@@ -501,8 +501,8 @@ static void start_put(struct transfer_request *request)
 	compress_next(&stream, Z_NO_FLUSH);
 
 	/* Then the data itself.. */
-	stream.next_in = unpacked;
-	stream.avail_in = len;
+	stream.z.next_in = unpacked;
+	stream.z.avail_in = len;
 	compress_next(&stream, Z_FINISH);
 
 	request->buffer.buf.len = compress_free(&stream);
diff --git a/http-walker.c b/http-walker.c
index b1d2a28..1581746 100644
--- a/http-walker.c
+++ b/http-walker.c
@@ -38,7 +38,7 @@ struct object_request
 	long http_code;
 	unsigned char real_sha1[20];
 	SHA_CTX c;
-	z_stream stream;
+	ext_stream stream;
 	int zret;
 	int rename;
 	struct active_request_slot *slot;
@@ -83,8 +83,8 @@ static size_t fwrite_sha1_file(void *ptr, size_t eltsize, size_t nmemb,
 		obj_req->zret = decompress_next_into(&obj_req->stream, expn,
                                              sizeof(expn), Z_SYNC_FLUSH);
 		SHA1_Update(&obj_req->c, expn,
-			    sizeof(expn) - obj_req->stream.avail_out);
-	} while (obj_req->stream.avail_in && obj_req->zret == Z_OK);
+			    sizeof(expn) - obj_req->stream.z.avail_out);
+	} while (obj_req->stream.z.avail_in && obj_req->zret == Z_OK);
 	data_received++;
 	return size;
 }
diff --git a/index-pack.c b/index-pack.c
index 929de39..7881a0b 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -166,7 +166,7 @@ static void bad_object(unsigned long offset, const char *format, ...)
 
 static void *unpack_entry_data(unsigned long offset, unsigned long size)
 {
-	z_stream stream;
+	ext_stream stream;
 	void *buf = xmalloc(size);
 
 	decompress_alloc(&stream);
@@ -176,8 +176,8 @@ static void *unpack_entry_data(unsigned long offset, unsigned long size)
 		/* fill() modifies len, so be sure is evaluated as first */
 		void* tmp = fill(1);
 		int ret = decompress_next_from(&stream, tmp, input_len, Z_NO_FLUSH);
-		use(input_len - stream.avail_in);
-		if (stream.total_out == size && ret == Z_STREAM_END)
+		use(input_len - stream.z.avail_in);
+		if (stream.z.total_out == size && ret == Z_STREAM_END)
 			break;
 		if (ret != Z_OK)
 			bad_object(offset, "decompress returned %d", ret);
diff --git a/sha1_file.c b/sha1_file.c
index 708727a..a978f13 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1052,7 +1052,7 @@ unsigned long unpack_object_header_gently(const unsigned char *buf, unsigned lon
 	return used;
 }
 
-static int unpack_sha1_header(z_stream *stream, unsigned char *map, unsigned long mapsize, void *buffer, unsigned long bufsiz)
+static int unpack_sha1_header(ext_stream *stream, unsigned char *map, unsigned long mapsize, void *buffer, unsigned long bufsiz)
 {
 	unsigned long size, used;
 	static const char valid_loose_object_type[8] = {
@@ -1087,19 +1087,19 @@ static int unpack_sha1_header(z_stream *stream, unsigned char *map, unsigned lon
 	decompress_from(stream, map, mapsize);
 
 	/* And generate the fake traditional header */
-	stream->total_out = 1 + snprintf(buffer, bufsiz, "%s %lu",
+	stream->z.total_out = 1 + snprintf(buffer, bufsiz, "%s %lu",
 					 typename(type), size);
 	return 0;
 }
 
-static void *unpack_sha1_rest(z_stream *stream, void *buffer, unsigned long size, const unsigned char *sha1)
+static void *unpack_sha1_rest(ext_stream *stream, void *buffer, unsigned long size, const unsigned char *sha1)
 {
 	int bytes = strlen(buffer) + 1;
 	unsigned char *buf = xmalloc(1+size);
 	unsigned long n;
 	int status = Z_OK;
 
-	n = stream->total_out - bytes;
+	n = stream->z.total_out - bytes;
 	if (n > size)
 		n = size;
 	memcpy(buf, (char *) buffer + bytes, n);
@@ -1123,14 +1123,14 @@ static void *unpack_sha1_rest(z_stream *stream, void *buffer, unsigned long size
 			status = decompress_next(stream, Z_FINISH);
 	}
 	buf[size] = 0;
-	if (status == Z_STREAM_END && !stream->avail_in) {
+	if (status == Z_STREAM_END && !stream->z.avail_in) {
 		decompress_free(stream);
 		return buf;
 	}
 
 	if (status < 0)
 		error("corrupt loose object '%s'", sha1_to_hex(sha1));
-	else if (stream->avail_in)
+	else if (stream->z.avail_in)
 		error("garbage at end of loose object '%s'",
 		      sha1_to_hex(sha1));
 	free(buf);
@@ -1191,7 +1191,7 @@ static int parse_sha1_header(const char *hdr, unsigned long *sizep)
 static void *unpack_sha1_file(void *map, unsigned long mapsize, enum object_type *type, unsigned long *size, const unsigned char *sha1)
 {
 	int ret;
-	z_stream stream;
+	ext_stream stream;
 	char hdr[8192];
 
 	ret = unpack_sha1_header(&stream, map, mapsize, hdr, sizeof(hdr));
@@ -1207,7 +1207,7 @@ unsigned long get_size_from_delta(struct packed_git *p,
 {
 	const unsigned char *data;
 	unsigned char delta_head[20], *in;
-	z_stream stream;
+	ext_stream stream;
 	int st;
 	unsigned int in_size = 0;
 
@@ -1217,11 +1217,11 @@ unsigned long get_size_from_delta(struct packed_git *p,
 	do {
 		in = use_pack(p, w_curs, curpos, &in_size);
 		st = decompress_next_from(&stream, in, in_size, Z_FINISH);
-		curpos += stream.next_in - in;
+		curpos += stream.z.next_in - in;
 	} while ((st == Z_OK || st == Z_BUF_ERROR) &&
-		 stream.total_out < sizeof(delta_head));
+		 stream.z.total_out < sizeof(delta_head));
 	decompress_free(&stream);
-	if ((st != Z_STREAM_END) && stream.total_out != sizeof(delta_head))
+	if ((st != Z_STREAM_END) && stream.z.total_out != sizeof(delta_head))
 		die("delta data unpack-initial failed");
 
 	/* Examine the initial part of the delta to figure out
@@ -1416,7 +1416,7 @@ static void *unpack_compressed_entry(struct packed_git *p,
 				    unsigned long size)
 {
 	int st;
-	z_stream stream;
+	ext_stream stream;
 	unsigned char *buffer, *in;
 	unsigned int in_size = 0;
 
@@ -1427,10 +1427,10 @@ static void *unpack_compressed_entry(struct packed_git *p,
 	do {
 		in = use_pack(p, w_curs, curpos, &in_size);
 		st = decompress_next_from(&stream, in, in_size, Z_FINISH);
-		curpos += stream.next_in - in;
+		curpos += stream.z.next_in - in;
 	} while (st == Z_OK || st == Z_BUF_ERROR);
 	decompress_free(&stream);
-	if ((st != Z_STREAM_END) || stream.total_out != size) {
+	if ((st != Z_STREAM_END) || stream.z.total_out != size) {
 		free(buffer);
 		return NULL;
 	}
@@ -1762,7 +1762,7 @@ static int sha1_loose_object_info(const unsigned char *sha1, unsigned long *size
 	int status;
 	unsigned long mapsize, size;
 	void *map;
-	z_stream stream;
+	ext_stream stream;
 	char hdr[32];
 
 	map = map_sha1_file(sha1, &mapsize);
@@ -2033,7 +2033,7 @@ int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned cha
 {
 	int size, ret;
 	unsigned char *compressed;
-	z_stream stream;
+	ext_stream stream;
 	unsigned char sha1[20];
 	char *filename;
 	static char tmpfile[PATH_MAX];
@@ -2084,8 +2084,8 @@ int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned cha
 	compress_next(&stream, Z_NO_FLUSH);
 
 	/* Then the data itself.. */
-	stream.next_in = buf;
-	stream.avail_in = len;
+	stream.z.next_in = buf;
+	stream.z.avail_in = len;
 	ret = compress_next(&stream, Z_FINISH);
 	if (ret != Z_STREAM_END)
 		die("unable to deflate new object %s (%d)", sha1_to_hex(sha1), ret);
@@ -2109,7 +2109,7 @@ int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned cha
 static void *repack_object(const unsigned char *sha1, unsigned long *objsize)
 {
 	size_t size;
-	z_stream stream;
+	ext_stream stream;
 	unsigned char *unpacked;
 	unsigned long len;
 	enum object_type type;
@@ -2135,8 +2135,8 @@ static void *repack_object(const unsigned char *sha1, unsigned long *objsize)
 	compress_next(&stream, Z_NO_FLUSH);
 
 	/* Then the data itself.. */
-	stream.next_in = unpacked;
-	stream.avail_in = len;
+	stream.z.next_in = unpacked;
+	stream.z.avail_in = len;
 	compress_next(&stream, Z_FINISH);
 
 	*objsize = compress_free(&stream);
@@ -2167,7 +2167,7 @@ int write_sha1_from_fd(const unsigned char *sha1, int fd, char *buffer,
 {
 	char tmpfile[PATH_MAX];
 	int local;
-	z_stream stream;
+	ext_stream stream;
 	unsigned char real_sha1[20];
 	unsigned char discard[4096];
 	int ret;
@@ -2193,13 +2193,13 @@ int write_sha1_from_fd(const unsigned char *sha1, int fd, char *buffer,
 			do {
 				ret = decompress_next_into(&stream, discard, sizeof(discard), Z_SYNC_FLUSH);
 				SHA1_Update(&c, discard, sizeof(discard) -
-					    stream.avail_out);
-			} while (stream.avail_in && ret == Z_OK);
-			if (write_buffer(local, buffer, *bufposn - stream.avail_in) < 0)
+					    stream.z.avail_out);
+			} while (stream.z.avail_in && ret == Z_OK);
+			if (write_buffer(local, buffer, *bufposn - stream.z.avail_in) < 0)
 				die("unable to write sha1 file");
-			memmove(buffer, buffer + *bufposn - stream.avail_in,
-				stream.avail_in);
-			*bufposn = stream.avail_in;
+			memmove(buffer, buffer + *bufposn - stream.z.avail_in,
+				stream.z.avail_in);
+			*bufposn = stream.z.avail_in;
 			if (ret != Z_OK)
 				break;
 		}
-- 
1.5.4.rc2.98.g58cd2

^ permalink raw reply related

* [PATCH 1/3] Convert compress helpers to use an extended z_stream
From: Marco Costalba @ 2008-01-13 13:24 UTC (permalink / raw)
  To: gitster; +Cc: git, Marco Costalba

Instead of passing as argument to compress/decompress
helpers the zlib native z_stream create a new extended
one called ext_stream and use this instead.

This will allow us to associate the specified compress
algorithm on a 'per stream' base.

The patch aims to be as less intrusive as possible
so that ext_stream is really a (slightly) extended
z_stream and includes a z_stream member.

Choice to use a z_stream variable instead of a z_stream*
pointer as member is to avoid a xmalloc when a local
ext_stream variable is used.

This patch just introduces ext_stream and modifies the
helpers to cope with that. Next patch will rename z_stream
to ext_stream across the sources. Tough this patch will
not compile without the next I splitted the two for easing
the review: the interesting part is here, the next is
just renaming.

Signed-off-by: Marco Costalba <mcostalba@gmail.com>
---

The patch series apply above my last one sent yesterday.

This patch series is intended more as a RFC then to
be applied.

While the previous series introduced a cleanup, this one
extends the compress/decompress helpers framwork to support
different compress backends.

The compress backends are associated to each z_stream, it means
I can have multiple z_stream working with different backends at
the same time.

The rationale for this low level link instead of setting the chosen
comrpess library, say, at repository level, is that pulling/pushing
packs on the net must continue to work with zlib so to not break
things and also that, as Junio pointed out, today we don't
unpackage and repackage the objects that arrive with a git-pull,
but store the incoming pack as is.

So we can have different packs compressed with different backends
in the same repo. To allow this the compression algorithm info should
be linked at the z_stream level. Note that this is _almost_ trasparent
to the user.


Finally this is the first patch series sent with send-mail, so to try
to overcome the lines wrapping damage of my mailer.

BTW it took more time to setup send-mail and send the patches then to
write them! ;-)


 compress.c |   64 ++++++++++++++++++++++++++++++------------------------------
 compress.h |   52 ++++++++++++++++++++++++++++++++++++++----------
 2 files changed, 73 insertions(+), 43 deletions(-)

diff --git a/compress.c b/compress.c
index f73cf2c..b28c389 100644
--- a/compress.c
+++ b/compress.c
@@ -5,46 +5,46 @@
  *     Compression helpers
  */
 
-unsigned long compress_alloc(z_stream *stream, int level, unsigned long size)
+unsigned long compress_alloc(ext_stream *stream, int level, unsigned long size)
 {
 	memset(stream, 0, sizeof(*stream));
-	deflateInit(stream, level);
-	return deflateBound(stream, size);
+	deflateInit(&stream->z, level);
+	return deflateBound(&stream->z, size);
 }
 
-int compress_start(z_stream *stream,
+int compress_start(ext_stream *stream,
                    unsigned char *in, unsigned long in_size,
                    unsigned char *out, unsigned long out_size)
 {
-	stream->next_out = out;
-	stream->avail_out = out_size;
-	stream->next_in = in;
-	stream->avail_in = in_size;
+	stream->z.next_out = out;
+	stream->z.avail_out = out_size;
+	stream->z.next_in = in;
+	stream->z.avail_in = in_size;
 	return Z_OK;
 }
 
-int compress_next(z_stream *stream, int flush)
+int compress_next(ext_stream *stream, int flush)
 {
 	int result;
 
 	do {
-		result = deflate(stream, flush);
+		result = deflate(&stream->z, flush);
 	} while (result == Z_OK);
 
 	return result;
 }
 
-unsigned long compress_free(z_stream *stream)
+unsigned long compress_free(ext_stream *stream)
 {
-	deflateEnd(stream);
-	return stream->total_out;
+	deflateEnd(&stream->z);
+	return stream->z.total_out;
 }
 
 unsigned long compress_all(int level, unsigned char *in,
                            unsigned long in_size, unsigned char **out)
 {
 	unsigned long out_size;
-	z_stream stream;
+	ext_stream stream;
 
 	out_size = compress_alloc(&stream, level, in_size);
 	*out = xmalloc(out_size);
@@ -65,47 +65,47 @@ unsigned long compress_all(int level, unsigned char *in,
  *     Decompression helpers
  */
 
-int decompress_alloc(z_stream *stream)
+int decompress_alloc(ext_stream *stream)
 {
 	memset(stream, 0, sizeof(*stream));
-	return inflateInit(stream);
+	return inflateInit(&stream->z);
 }
 
-int decompress_from(z_stream *stream, unsigned char *in, unsigned long in_size)
+int decompress_from(ext_stream *stream, unsigned char *in, unsigned long in_size)
 {
-	stream->next_in = in;
-	stream->avail_in = in_size;
+	stream->z.next_in = in;
+	stream->z.avail_in = in_size;
 	return Z_OK;
 }
 
-int decompress_into(z_stream *stream, unsigned char *out, unsigned long out_size)
+int decompress_into(ext_stream *stream, unsigned char *out, unsigned long out_size)
 {
-	stream->next_out = out;
-	stream->avail_out = out_size;
+	stream->z.next_out = out;
+	stream->z.avail_out = out_size;
 	return Z_OK;
 }
 
-int decompress_next(z_stream *stream, int flush)
+int decompress_next(ext_stream *stream, int flush)
 {
-	return inflate(stream, flush);
+	return inflate(&stream->z, flush);
 }
 
-int decompress_next_from(z_stream *stream, unsigned char *in, unsigned long in_size, int flush)
+int decompress_next_from(ext_stream *stream, unsigned char *in, unsigned long in_size, int flush)
 {
 	decompress_from(stream, in, in_size);
-	return inflate(stream, flush);
+	return inflate(&stream->z, flush);
 }
 
-int decompress_next_into(z_stream *stream, unsigned char *out, unsigned long out_size, int flush)
+int decompress_next_into(ext_stream *stream, unsigned char *out, unsigned long out_size, int flush)
 {
 	decompress_into(stream, out, out_size);
-	return inflate(stream, flush);
+	return inflate(&stream->z, flush);
 }
 
-unsigned long decompress_free(z_stream *stream)
+unsigned long decompress_free(ext_stream *stream)
 {
-	inflateEnd(stream);
-	return stream->total_out;
+	inflateEnd(&stream->z);
+	return stream->z.total_out;
 }
 
 unsigned long decompress_all(unsigned char *in, unsigned long in_size,
@@ -113,7 +113,7 @@ unsigned long decompress_all(unsigned char *in, unsigned long in_size,
 {
 /* caller should check for return value != 0 */
 
-	z_stream stream;
+	ext_stream stream;
 	int st;
 
 	if (decompress_alloc(&stream) != Z_OK)
diff --git a/compress.h b/compress.h
index a81d006..d1de31f 100644
--- a/compress.h
+++ b/compress.h
@@ -1,25 +1,55 @@
 #ifndef COMPRESS_H
 #define COMPRESS_H
 
-extern unsigned long compress_alloc(z_stream *stream, int level, unsigned long size);
-extern int compress_start(z_stream *stream, unsigned char *in, unsigned long in_size,
+/* Any compress/decompress engine must implement all the
+ * below functions that are modeled after the zlib ones.
+ */
+typedef int (*deflateInit_fn_t)(z_stream *stream, int level);
+typedef int (*deflate_fn_t)(z_stream *stream, int flush);
+typedef int (*deflateEnd_fn_t)(z_stream *stream);
+typedef unsigned long (*deflateBound_fn_t)(z_stream *stream, unsigned long size);
+
+typedef int (*inflateInit_fn_t)(z_stream *stream);
+typedef int (*inflate_fn_t)(z_stream *stream, int flush);
+typedef int (*inflateEnd_fn_t)(z_stream *stream);
+
+
+/* Extended struct used instead of the zlib native to
+ * call the compress/decompress helpers. It's just a
+ * thin extension of the zlib native one.
+ */
+typedef struct ext_stream_s {
+	z_stream z; /* defined in zlib.h to store stream state */
+
+	/* pointers to low level compress library functions */
+	deflateInit_fn_t  deflateInit_fn;
+	deflate_fn_t      deflate_fn;
+	deflateEnd_fn_t   deflateEnd_fn;
+	deflateBound_fn_t deflateBound_fn;
+	inflateInit_fn_t  inflateInit_fn;
+	inflate_fn_t      inflate_fn;
+	inflateEnd_fn_t   inflateEnd_fn;
+} ext_stream;
+
+extern unsigned long compress_alloc(ext_stream *stream, int level, unsigned long size);
+extern int compress_start(ext_stream *stream, unsigned char *in, unsigned long in_size,
                            unsigned char *out, unsigned long out_size);
-extern int compress_next(z_stream *stream, int flush);
-extern unsigned long compress_free(z_stream *stream);
+extern int compress_next(ext_stream *stream, int flush);
+extern unsigned long compress_free(ext_stream *stream);
 extern unsigned long compress_all(int level, unsigned char *in, unsigned long in_size,
                                   unsigned char **out);
 
 
-extern int decompress_alloc(z_stream *stream);
+extern int decompress_alloc(ext_stream *stream);
 
-extern int decompress_from(z_stream *stream, unsigned char *in, unsigned long in_size);
-extern int decompress_into(z_stream *stream, unsigned char *out, unsigned long out_size);
+extern int decompress_from(ext_stream *stream, unsigned char *in, unsigned long in_size);
+extern int decompress_into(ext_stream *stream, unsigned char *out, unsigned long out_size);
 
-extern int decompress_next(z_stream *stream, int flush);
-extern int decompress_next_from(z_stream *stream, unsigned char *in, unsigned long in_size, int flush);
-extern int decompress_next_into(z_stream *stream, unsigned char *out, unsigned long out_size, int flush);
+extern int decompress_next(ext_stream *stream, int flush);
+extern int decompress_next_from(ext_stream *stream, unsigned char *in, unsigned long in_size, int flush);
+extern int decompress_next_into(ext_stream *stream, unsigned char *out, unsigned long out_size, int flush);
 
-extern unsigned long decompress_free(z_stream *stream);
+extern unsigned long decompress_free(ext_stream *stream);
 
 extern unsigned long decompress_all(unsigned char *in, unsigned long in_size,
                                     unsigned char *out, unsigned long out_size);
-- 
1.5.4.rc2.98.g58cd2

^ permalink raw reply related

* Re: performance problem: "git commit filename"
From: Junio C Hamano @ 2008-01-13 11:09 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Kristian H?gsberg
In-Reply-To: <7vd4s6qal0.fsf@gitster.siamese.dyndns.org>

Junio C Hamano <gitster@pobox.com> writes:

> The fact that we write out a temporary index and then rebuild
> the real index means CE_UPTODATE flag we populate in the
> temporary index is lost and we still need to lstat(2) while
> building the real index, which is a bit unfortunate.  I suspect
> that we can use the one-way merge to reset the index when
> building the real index after we are done building the temporary
> index, instead of discarding the in-core temporary index and
> re-reading the real index.

This comment is completely bogus.  With your earlier one-way
merge fix, as the way CE_UPTODATE patch was written we preserve
in-core CE_UPTODATE bit across write_index(), the code already
should be taking advantage of an earlier lstat(2).

^ permalink raw reply

* [PATCH] builtin-commit.c: do not lstat(2) partially committed paths twice.
From: Junio C Hamano @ 2008-01-13 10:54 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Kristian Høgsberg
In-Reply-To: <7vd4s6qal0.fsf@gitster.siamese.dyndns.org>

Junio C Hamano <gitster@pobox.com> writes:

>  * In the partial commit codepath, we run one file_exists() and
>    then one add_file_to_index(), each of which has lstat(2) on
>    the same pathname, for the paths to be committed.  This can
>    be reduced to one trivially by changing the calling
>    convention of add_file_to_index()....

And this is the "trivial" lstat(2) reduction.

I do not know if it is worth it, though.  Majority of lstat(2)
must be coming from the paths in the index (and not committed),
not from the paths that are listed on the command line to be
partially committed.

---
 builtin-commit.c |    6 ++++--
 cache.h          |    2 ++
 read-cache.c     |   34 ++++++++++++++++++++++------------
 3 files changed, 28 insertions(+), 14 deletions(-)

diff --git a/builtin-commit.c b/builtin-commit.c
index 6d2ca80..770bd25 100644
--- a/builtin-commit.c
+++ b/builtin-commit.c
@@ -170,9 +170,11 @@ static void add_remove_files(struct path_list *list)
 {
 	int i;
 	for (i = 0; i < list->nr; i++) {
+		struct stat st;
 		struct path_list_item *p = &(list->items[i]);
-		if (file_exists(p->path))
-			add_file_to_cache(p->path, 0);
+
+		if (!lstat(p->path, &st))
+			add_file_to_cache_with_stat(p->path, &st, 0);
 		else
 			remove_file_from_cache(p->path);
 	}
diff --git a/cache.h b/cache.h
index 39331c2..73cc83b 100644
--- a/cache.h
+++ b/cache.h
@@ -174,6 +174,7 @@ extern struct index_state the_index;
 #define remove_cache_entry_at(pos) remove_index_entry_at(&the_index, (pos))
 #define remove_file_from_cache(path) remove_file_from_index(&the_index, (path))
 #define add_file_to_cache(path, verbose) add_file_to_index(&the_index, (path), (verbose))
+#define add_file_to_cache_with_stat(path, st, verbose) add_file_to_index_with_stat(&the_index, (path), (st), (verbose))
 #define refresh_cache(flags) refresh_index(&the_index, (flags), NULL, NULL)
 #define ce_match_stat(ce, st, options) ie_match_stat(&the_index, (ce), (st), (options))
 #define ce_modified(ce, st, options) ie_modified(&the_index, (ce), (st), (options))
@@ -273,6 +274,7 @@ extern struct cache_entry *refresh_cache_entry(struct cache_entry *ce, int reall
 extern int remove_index_entry_at(struct index_state *, int pos);
 extern int remove_file_from_index(struct index_state *, const char *path);
 extern int add_file_to_index(struct index_state *, const char *path, int verbose);
+extern int add_file_to_index_with_stat(struct index_state *, const char *path, struct stat *, int verbose);
 extern struct cache_entry *make_cache_entry(unsigned int mode, const unsigned char *sha1, const char *path, int stage, int refresh);
 extern int ce_same_name(struct cache_entry *a, struct cache_entry *b);
 
diff --git a/read-cache.c b/read-cache.c
index 7db5588..928f49b 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -384,21 +384,22 @@ static int index_name_pos_also_unmerged(struct index_state *istate,
 	return pos;
 }
 
-int add_file_to_index(struct index_state *istate, const char *path, int verbose)
+int add_file_to_index_with_stat(struct index_state *istate,
+				const char *path,
+				struct stat *st,
+				int verbose)
 {
 	int size, namelen, pos;
-	struct stat st;
 	struct cache_entry *ce;
 	unsigned ce_option = CE_MATCH_IGNORE_VALID|CE_MATCH_RACY_IS_DIRTY;
 
-	if (lstat(path, &st))
-		die("%s: unable to stat (%s)", path, strerror(errno));
-
-	if (!S_ISREG(st.st_mode) && !S_ISLNK(st.st_mode) && !S_ISDIR(st.st_mode))
+	if (!S_ISREG(st->st_mode) &&
+	    !S_ISLNK(st->st_mode) &&
+	    !S_ISDIR(st->st_mode))
 		die("%s: can only add regular files, symbolic links or git-directories", path);
 
 	namelen = strlen(path);
-	if (S_ISDIR(st.st_mode)) {
+	if (S_ISDIR(st->st_mode)) {
 		while (namelen && path[namelen-1] == '/')
 			namelen--;
 	}
@@ -406,10 +407,10 @@ int add_file_to_index(struct index_state *istate, const char *path, int verbose)
 	ce = xcalloc(1, size);
 	memcpy(ce->name, path, namelen);
 	ce->ce_flags = htons(namelen);
-	fill_stat_cache_info(ce, &st);
+	fill_stat_cache_info(ce, st);
 
 	if (trust_executable_bit && has_symlinks)
-		ce->ce_mode = create_ce_mode(st.st_mode);
+		ce->ce_mode = create_ce_mode(st->st_mode);
 	else {
 		/* If there is an existing entry, pick the mode bits and type
 		 * from it, otherwise assume unexecutable regular file.
@@ -418,19 +419,19 @@ int add_file_to_index(struct index_state *istate, const char *path, int verbose)
 		int pos = index_name_pos_also_unmerged(istate, path, namelen);
 
 		ent = (0 <= pos) ? istate->cache[pos] : NULL;
-		ce->ce_mode = ce_mode_from_stat(ent, st.st_mode);
+		ce->ce_mode = ce_mode_from_stat(ent, st->st_mode);
 	}
 
 	pos = index_name_pos(istate, ce->name, namelen);
 	if (0 <= pos &&
 	    !ce_stage(istate->cache[pos]) &&
-	    !ie_match_stat(istate, istate->cache[pos], &st, ce_option)) {
+	    !ie_match_stat(istate, istate->cache[pos], st, ce_option)) {
 		/* Nothing changed, really */
 		free(ce);
 		return 0;
 	}
 
-	if (index_path(ce->sha1, path, &st, 1))
+	if (index_path(ce->sha1, path, st, 1))
 		die("unable to index file %s", path);
 	if (add_index_entry(istate, ce, ADD_CACHE_OK_TO_ADD|ADD_CACHE_OK_TO_REPLACE))
 		die("unable to add %s to index",path);
@@ -439,6 +440,15 @@ int add_file_to_index(struct index_state *istate, const char *path, int verbose)
 	return 0;
 }
 
+int add_file_to_index(struct index_state *istate, const char *path, int verbose)
+{
+	struct stat st;
+	if (lstat(path, &st))
+		die("%s: unable to stat (%s)", path, strerror(errno));
+
+	return add_file_to_index_with_stat(istate, path, &st, verbose);
+}
+
 struct cache_entry *make_cache_entry(unsigned int mode,
 		const unsigned char *sha1, const char *path, int stage,
 		int refresh)

^ permalink raw reply related

* [PATCH] builtin-commit.c: remove useless check added by faulty cut and paste
From: Junio C Hamano @ 2008-01-13 10:38 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Kristian H?gsberg
In-Reply-To: <7vtzliqh3u.fsf@gitster.siamese.dyndns.org>

When I did 2888605c649ccd423232161186d72c0e6c458a48
(builtin-commit: fix partial-commit support), I mindlessly cut
and pasted from builtin-ls-files.c, and included the part that
was meant to exclude redundant path after "ls-files --with-tree"
overlayed the HEAD commit on top of the index.  This logic does
not apply to what git-commit does and should not have been
copied, even though it would not hurt.

Signed-off-by: Junio C Hamano <gitster@pobox.com>
---
 builtin-commit.c |    2 --
 1 files changed, 0 insertions(+), 2 deletions(-)

diff --git a/builtin-commit.c b/builtin-commit.c
index 6d2ca80..265ba6b 100644
--- a/builtin-commit.c
+++ b/builtin-commit.c
@@ -156,8 +156,6 @@ static int list_paths(struct path_list *list, const char *with_tree,

 	for (i = 0; i < active_nr; i++) {
 		struct cache_entry *ce = active_cache[i];
-		if (ce->ce_flags & htons(CE_UPDATE))
-			continue;
 		if (!pathspec_match(pattern, m, ce->name, 0))
 			continue;
 		path_list_insert(ce->name, list);

^ permalink raw reply related

* Re: performance problem: "git commit filename"
From: Junio C Hamano @ 2008-01-13 10:33 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Kristian H?gsberg
In-Reply-To: <7vtzliqh3u.fsf@gitster.siamese.dyndns.org>

Junio C Hamano <gitster@pobox.com> writes:

> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
>> HOWEVER. This was just a quick hack, and while it all looks sane, this is 
>> some damn core code. Somebody else should double- and triple-check this.
>
> Double-checked.  The patch looks sane.
>
>> [ That 4x lstat thing bothers me. I think we should add a flag to the 
>>   index saying "we checked this once already, it's clean", so that if we 
>>   do multiple passes over the index, we can still do just a single lstat() 
>>   on just the first pass. But that's a separate issue.
>
> I've thought about this in a different context before, but it
> seemed quite tricky, as some codepaths in more complex commands
> (commit being one of them) tend to use the cache and discard to
> use it for different purpose (like creating temporary index and
> then reading the real index).  Besides, I had an impression that
> we ran out of the bits of ce_flags in the cache entry, although
> we could shorten the maximum path we support from 4k down to 2k
> bytes.  I'll have to think about this a bit more.

Aside from the lstat(2) done for work tree files, there are
quite many lstat(2) calls in refname dwimming codepath.  I am
not currently looking into reducing them.

Some observations.

 * In the partial commit codepath, we run one file_exists() and
   then one add_file_to_index(), each of which has lstat(2) on
   the same pathname, for the paths to be committed.  This can
   be reduced to one trivially by changing the calling
   convention of add_file_to_index().  This is done in order to
   build the temporary index file to be written out as a tree
   for committing (and the index file needs to be written to be
   shown to the pre-commit hook).

 * Then refresh_cache_ent(), which has one lstat(2), is called
   for everybody in the temporary index, in order to refresh the
   temporary index.  Again this cannot be avoided because we
   promise to show the pre-commit hook a refreshed index.

 * Then we run file_exists() and add_file_to_index() to update
   the real index.

 * And refresh_cache_ent() is run for everybody in the real
   index.

 * The final diffstat phase runs lstat(2) from
   reuse_worktree_file() codepath to see if the work tree is up
   to date, and read from the work tree files instead of having
   to explode them from the object store.

The attached is a quick and dirty hack which may or may not
help.  It all looks sane, this also is some core code, and meant
only for discussion and not application.

 * It adds a new ce_flag, CE_UPTODATE, that is meant to mark the
   cache entries that record a regular file blob that is up to
   date in the work tree.

 * I had to reduce the maximum length of allowed pathname from
   4k down to 2k for the above.  Incidentally I noticed that we
   do not check the length of pathname does not exceed what we
   can express with CE_NAMEMASK, which we may want to fix (and
   make it barf if somebody tries to add too long a path)
   independently from this issue.  A low hanging fruit for
   janitors -- hint, hint.

 * fill_stat_cache_info() marks the cache entry it just added
   with CE_UPTODATE.  This has the effect of marking the paths
   we write out of the index and lstat(2) immediately as "no
   need to lstat -- we know it is up-to-date", from quite a lot
   fo callers:

    - git-apply --index
    - git-update-index
    - git-checkout-index
    - git-add (uses add_file_to_index())
    - git-commit (ditto)
    - git-mv (ditto)

 * write_index is changed not to write CE_UPTODATE out to the
   index file, because CE_UPTODATE is meant to be transient only
   in core.  For the same reason, CE_UPDATE is not written to
   prevent an accident from happening.

The fact that we write out a temporary index and then rebuild
the real index means CE_UPTODATE flag we populate in the
temporary index is lost and we still need to lstat(2) while
building the real index, which is a bit unfortunate.  I suspect
that we can use the one-way merge to reset the index when
building the real index after we are done building the temporary
index, instead of discarding the in-core temporary index and
re-reading the real index.

---
 cache.h      |    3 ++-
 diff.c       |    4 ++++
 read-cache.c |   10 ++++++++++
 3 files changed, 16 insertions(+), 1 deletions(-)

diff --git a/cache.h b/cache.h
index 39331c2..9b950fc 100644
--- a/cache.h
+++ b/cache.h
@@ -108,7 +108,8 @@ struct cache_entry {
 	char name[FLEX_ARRAY]; /* more */
 };

-#define CE_NAMEMASK  (0x0fff)
+#define CE_NAMEMASK  (0x07ff)
+#define CE_UPTODATE  (0x0800)
 #define CE_STAGEMASK (0x3000)
 #define CE_UPDATE    (0x4000)
 #define CE_VALID     (0x8000)
diff --git a/diff.c b/diff.c
index b18c140..41847ae 100644
--- a/diff.c
+++ b/diff.c
@@ -1510,6 +1510,10 @@ static int reuse_worktree_file(const char *name, const unsigned char *sha1, int
 	if (pos < 0)
 		return 0;
 	ce = active_cache[pos];
+
+	if (ce->ce_flags & htons(CE_UPTODATE))
+		return 1;
+
 	if ((lstat(name, &st) < 0) ||
 	    !S_ISREG(st.st_mode) || /* careful! */
 	    ce_match_stat(ce, &st, 0) ||
diff --git a/read-cache.c b/read-cache.c
index 7db5588..3a90db1 100644
--- a/read-cache.c
+++ b/read-cache.c
@@ -44,6 +44,9 @@ void fill_stat_cache_info(struct cache_entry *ce, struct stat *st)

 	if (assume_unchanged)
 		ce->ce_flags |= htons(CE_VALID);
+
+	if (S_ISREG(st->st_mode))
+		ce->ce_flags |= htons(CE_UPTODATE);
 }

 static int ce_compare_data(struct cache_entry *ce, struct stat *st)
@@ -794,6 +797,9 @@ static struct cache_entry *refresh_cache_ent(struct index_state *istate,
 	int changed, size;
 	int ignore_valid = options & CE_MATCH_IGNORE_VALID;

+	if (ce->ce_flags & htons(CE_UPTODATE))
+		return ce;
+
 	if (lstat(ce->name, &st) < 0) {
 		if (err)
 			*err = errno;
@@ -1170,13 +1176,17 @@ int write_index(struct index_state *istate, int newfd)

 	for (i = 0; i < entries; i++) {
 		struct cache_entry *ce = cache[i];
+		unsigned short ce_flags;
 		if (!ce->ce_mode)
 			continue;
 		if (istate->timestamp &&
 		    istate->timestamp <= ntohl(ce->ce_mtime.sec))
 			ce_smudge_racily_clean_entry(ce);
+		ce_flags = ce->ce_flags;
+		ce->ce_flags &= htons(~(CE_UPDATE | CE_UPTODATE));
 		if (ce_write(&c, newfd, ce, ce_size(ce)) < 0)
 			return -1;
+		ce->ce_flags = ce_flags;
 	}

 	/* Write extension data here */

^ permalink raw reply related

* [WIP v2] safecrlf: Add mechanism to warn about irreversible crlf conversions
From: Steffen Prohaska @ 2008-01-13  9:05 UTC (permalink / raw)
  To: dpotapov, git; +Cc: torvalds, Steffen Prohaska
In-Reply-To: <20080112191429.GI2963@dpotapov.dyndns.org>

This version gets the naked LF/autocrlf=true case right.
However, different from what Dimitry suggested, the safety check
is run for all cases that are irreversible.  Dimitry suggested to
run it only for the CRLF_GUESS case.  I believe this is not
sufficient: the explicit CFLF_TEXT case should also be checked.
The user explicitly marked the file as text but the conversion is
nonetheless irreversible in the current setting.  This might be
unexpected and we should warn about it.  Paranoid users can even
ask git to fail in this case.  Such users would need to manually
fix the file, e.g. running dos2unix.

I also added basic tests.

A documentation is yet missing.

    Steffen

---- snip snap ---

CRLF conversion bears a slight chance of corrupting data.
autocrlf=true will convert CRLF to LF during commit and LF to
CRLF during checkout.  A file that containes a mixture of LF and
CRLF before the commit cannot be recreated by git.  For text
files this does not really matter because we do not care about
the line endings anyway; but for binary files that are
accidentally classified as text the conversion can result in
corrupted data.

If you recognize such corruption during commit you can easily fix
it by setting the conversion type explicitly in .gitattributes.
Right after committing you still have the original file in your
work tree and this file is not yet corrupted.

However, in mixed Windows/Unix environments text files quite
easily can end up containing a mixture of CRLF and LF line
endings and git should handle such situations gracefully.  For
example a user could copy a CRLF file from Windows to Unix and
mix it with an existing LF file there.  The result would contain
both types of line endings.

Unfortunately, the desired effect of cleaning up text files
with mixed lineendings and undesired effect of corrupting binary
files can not be distinguished.  In both cases CRLF are removed
in an irreversible way.  For text files this is the right thing
to do, while for binary file its corrupting data.

In a sane environment committing and checking out the same file
should not modify the origin file in the work tree.  For
autocrlf=input the original file must not contain CRLF.  For
autocrlf=true the original file must not contain LF without
preceding CR.  Otherwise the conversion is irreversible.  Note,
git might be able to recreate the original file with different
autocrlf settings, but in the current environment checking out
will yield a file that differs from the file before the commit.

This patch adds a mechanism that can either warn the user about
an irreversible conversion or can even refuse to convert.  The
mechanism is controlled by the variable core.safecrlf, with the
following values
 - false: disable safecrlf mechanism
 - warn: warn about irreversible conversions
 - true: refuse irreversible conversions

The default is to warn.

The concept of a safety check was originally proposed in a similar
way by Linus Torvalds.  Thanks to Dimitry Potapov for insisting
on getting the naked LF/autocrlf=true case right.

Signed-off-by: Steffen Prohaska <prohaska@zib.de>
---
 cache.h         |    8 ++++++++
 config.c        |    9 +++++++++
 convert.c       |   28 +++++++++++++++++++++++++---
 environment.c   |    1 +
 t/t0020-crlf.sh |   45 +++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 88 insertions(+), 3 deletions(-)

diff --git a/cache.h b/cache.h
index 39331c2..4e03e3d 100644
--- a/cache.h
+++ b/cache.h
@@ -330,6 +330,14 @@ extern size_t packed_git_limit;
 extern size_t delta_base_cache_limit;
 extern int auto_crlf;

+enum safe_crlf {
+	SAFE_CRLF_FALSE = 0,
+	SAFE_CRLF_FAIL = 1,
+	SAFE_CRLF_WARN = 2,
+};
+
+extern enum safe_crlf safe_crlf;
+
 #define GIT_REPO_VERSION 0
 extern int repository_format_version;
 extern int check_repository_format(void);
diff --git a/config.c b/config.c
index 857deb6..0a46046 100644
--- a/config.c
+++ b/config.c
@@ -407,6 +407,15 @@ int git_default_config(const char *var, const char *value)
 		return 0;
 	}

+	if (!strcmp(var, "core.safecrlf")) {
+		if (value && !strcasecmp(value, "warn")) {
+			safe_crlf = SAFE_CRLF_WARN;
+			return 0;
+		}
+		safe_crlf = git_config_bool(var, value);
+		return 0;
+	}
+
 	if (!strcmp(var, "user.name")) {
 		strlcpy(git_default_name, value, sizeof(git_default_name));
 		return 0;
diff --git a/convert.c b/convert.c
index 4df7559..c9678ee 100644
--- a/convert.c
+++ b/convert.c
@@ -90,9 +90,6 @@ static int crlf_to_git(const char *path, const char *src, size_t len,
 		return 0;

 	gather_stats(src, len, &stats);
-	/* No CR? Nothing to convert, regardless. */
-	if (!stats.cr)
-		return 0;

 	if (action == CRLF_GUESS) {
 		/*
@@ -110,6 +107,20 @@ static int crlf_to_git(const char *path, const char *src, size_t len,
 			return 0;
 	}

+	if (safe_crlf && auto_crlf > 0 && action != CRLF_INPUT) {
+		/* CRLFs would be added by checkout: check if we have "naked" LFs */
+		if (stats.lf != stats.crlf) {
+			if (safe_crlf == SAFE_CRLF_WARN)
+				warning("Checkout will replace LFs with CRLF in %s", path);
+			else
+				die("Checkout would replace LFs with CRLF in %s", path);
+		}
+	}
+
+	/* Optimization: No CR? Nothing to convert, regardless. */
+	if (!stats.cr)
+		return 0;
+
 	/* only grow if not in place */
 	if (strbuf_avail(buf) + buf->len < len)
 		strbuf_grow(buf, len - buf->len);
@@ -132,6 +143,17 @@ static int crlf_to_git(const char *path, const char *src, size_t len,
 				*dst++ = c;
 		} while (--len);
 	}
+
+	if (safe_crlf && (action == CRLF_INPUT || auto_crlf <= 0)) {
+		/* CRLFs would not be restored by checkout: check if we removed CRLFs */
+		if (buf->len != dst - buf->buf) {
+			if (safe_crlf == SAFE_CRLF_WARN)
+				warning("Stripped CRLF from %s.", path);
+			else
+				die("Refusing to strip CRLF from %s.", path);
+		}
+	}
+
 	strbuf_setlen(buf, dst - buf->buf);
 	return 1;
 }
diff --git a/environment.c b/environment.c
index 18a1c4e..e351e99 100644
--- a/environment.c
+++ b/environment.c
@@ -35,6 +35,7 @@ int pager_use_color = 1;
 char *editor_program;
 char *excludes_file;
 int auto_crlf = 0;	/* 1: both ways, -1: only when adding git objects */
+enum safe_crlf safe_crlf = SAFE_CRLF_WARN;
 unsigned whitespace_rule_cfg = WS_DEFAULT_RULE;

 /* This is set by setup_git_dir_gently() and/or git_default_config() */
diff --git a/t/t0020-crlf.sh b/t/t0020-crlf.sh
index 89baebd..e2e0f7b 100755
--- a/t/t0020-crlf.sh
+++ b/t/t0020-crlf.sh
@@ -8,6 +8,10 @@ q_to_nul () {
 	tr Q '\000'
 }

+q_to_cr () {
+	tr Q '\015'
+}
+
 append_cr () {
 	sed -e 's/$/Q/' | tr Q '\015'
 }
@@ -42,6 +46,47 @@ test_expect_success setup '
 	echo happy.
 '

+test_expect_failure 'safecrlf: autocrlf=input, all CRLF' '
+
+	git repo-config core.autocrlf input &&
+	git repo-config core.safecrlf true &&
+
+	for w in I am all CRLF; do echo $w; done | append_cr >allcrlf &&
+	git add allcrlf
+'
+
+test_expect_failure 'safecrlf: autocrlf=input, mixed LF/CRLF' '
+
+	git repo-config core.autocrlf input &&
+	git repo-config core.safecrlf true &&
+
+	for w in Oh here is CRLFQ in text; do echo $w; done | q_to_cr >mixed &&
+	git add mixed
+'
+
+test_expect_failure 'safecrlf: autocrlf=true, all LF' '
+
+	git repo-config core.autocrlf true &&
+	git repo-config core.safecrlf true &&
+
+	for w in I am all LF; do echo $w; done >alllf &&
+	git add alllf
+'
+
+test_expect_failure 'safecrlf: autocrlf=true mixed LF/CRLF' '
+
+	git repo-config core.autocrlf true &&
+	git repo-config core.safecrlf true &&
+
+	for w in Oh here is CRLFQ in text; do echo $w; done | q_to_cr >mixed &&
+	git add mixed
+'
+
+test_expect_success 'switch off autocrlf, safecrlf' '
+	git repo-config core.autocrlf false &&
+	git repo-config core.safecrlf false
+'
+
 test_expect_success 'update with autocrlf=input' '

 	rm -f tmp one dir/two three &&
-- 
1.5.4.rc2.60.g46ee

^ permalink raw reply related

* Re: performance problem: "git commit filename"
From: Junio C Hamano @ 2008-01-13  8:14 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Linus Torvalds, Git Mailing List, Kristian H?gsberg
In-Reply-To: <alpine.LNX.1.00.0801130028460.13593@iabervon.org>

Daniel Barkalow <barkalow@iabervon.org> writes:

> On Sat, 12 Jan 2008, Linus Torvalds wrote:
>
>> It makes builtin-commit.c use the same logic that "git read-tree -i -m" 
>> does (which is what the old shell script did), and it seems to pass the 
>> test-suite, and it looks pretty obvious.
>
> The only issue I know about with using unpack_trees in C as a replacement 
> for read-tree in shell is that unpack_trees leaves "deletion" index 
> entries in memory which are not written to disk,...

I do not think you have to worry about that one.  That "to be
deleted" was a Linus invention and he surely remembers it.

write_index() function of course knows about skipping them (they
are marked as !ce->ce_mode).  I think the patch is safe.

^ permalink raw reply

* Re: performance problem: "git commit filename"
From: Junio C Hamano @ 2008-01-13  8:12 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List, Kristian H?gsberg
In-Reply-To: <alpine.LFD.1.00.0801121949180.2806@woody.linux-foundation.org>

Linus Torvalds <torvalds@linux-foundation.org> writes:

> HOWEVER. This was just a quick hack, and while it all looks sane, this is 
> some damn core code. Somebody else should double- and triple-check this.

Double-checked.  The patch looks sane.

> [ That 4x lstat thing bothers me. I think we should add a flag to the 
>   index saying "we checked this once already, it's clean", so that if we 
>   do multiple passes over the index, we can still do just a single lstat() 
>   on just the first pass. But that's a separate issue.

I've thought about this in a different context before, but it
seemed quite tricky, as some codepaths in more complex commands
(commit being one of them) tend to use the cache and discard to
use it for different purpose (like creating temporary index and
then reading the real index).  Besides, I had an impression that
we ran out of the bits of ce_flags in the cache entry, although
we could shorten the maximum path we support from 4k down to 2k
bytes.  I'll have to think about this a bit more.

^ permalink raw reply

* How to handle confilicts when merging submodules in the process of rebase
From: Ping Yin @ 2008-01-13  6:54 UTC (permalink / raw)
  To: Git Mailing List

In such a case, the blob sha1 for the conflicting submodule in working
tree and 3 stages of index may be different from each other.
So git add will not work if I want to use stage 1 or stage 2 as the
merging result. Then how?

-- 
Ping Yin

^ permalink raw reply

* Re: [PATCH 3/5] git-submodule: New subcommand 'summary' (3) - limit summary size
From: Ping Yin @ 2008-01-13  6:38 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7v4pdiua4k.fsf@gitster.siamese.dyndns.org>

On Jan 13, 2008 3:17 AM, Junio C Hamano <gitster@pobox.com> wrote:
> "Ping Yin" <pkufranky@gmail.com> writes:
>
> > On Jan 12, 2008 4:36 PM, Junio C Hamano <gitster@pobox.com> wrote:
> >> Ping Yin <pkufranky@gmail.com> writes:
> >>
> >> > @@ -265,6 +267,10 @@ set_name_rev () {
> >> >  #
> >> >  modules_summary()
> >> >  {
> >> > +     summary_limit=${summary_limit:-1000000}
> >>
> >> Why a million?
> > Because i think a million is big enough. I'd better define a constant
> > for unlimited number.
>
> I think that is a wrong approach to begin with.  You are
> assuming that you will always limit and by using improbably
> large limit to pretend it is unlimited.  Why not making the
> summary list generator truely capable of produce an unlimited
> list?
>
I used a pseudo unlimited number just to make code shorter.
After considering it again, i find that the code is still brief  by using
a real unlimited number. So i'll correct it.

> I also think using 100 or so as a sane default, allowing the
> user to override to say "I do not want any limitation", is a
> much better default.
>
Reasonable



-- 
Ping Yin

^ permalink raw reply

* Re: [PATCH 2/5] git-submodule: New subcommand 'summary' (2) - hard work
From: Ping Yin @ 2008-01-13  6:28 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7vve5ysv72.fsf@gitster.siamese.dyndns.org>

On Jan 13, 2008 3:25 AM, Junio C Hamano <gitster@pobox.com> wrote:
> "Ping Yin" <pkufranky@gmail.com> writes:
>
> >> +                       echo "* $name $sha1_src...$sha1_dst:"
> >
> > If it's a type change (head submodule but index blob, or the the
> > reverse), $sha1_dst or $sha1_src will be the sha1 of the blob. It's
> > inapprociate to be shown as if it's a commit in the submodule. May
> > 00000000 should be shown instead of the blob sha1?
>
> I do not think that adds much value.  When A or B is a
> non-commit, you know that A...B notation does not apply, and
> because it is probably a rare situation you would want to make
> it even more clearer to the reader by using a different
> notation.  Like
>
>     echo "* $name have changed from submodule $sha1_src to blob $sha1_dst!!".
>
> perhaps in red bold letter in larger font ;-)
>
OK, i will think over.



-- 
Ping Yin

^ permalink raw reply

* Re: performance problem: "git commit filename"
From: Daniel Barkalow @ 2008-01-13  5:38 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List, Kristian H?gsberg
In-Reply-To: <alpine.LFD.1.00.0801121949180.2806@woody.linux-foundation.org>

On Sat, 12 Jan 2008, Linus Torvalds wrote:

> It makes builtin-commit.c use the same logic that "git read-tree -i -m" 
> does (which is what the old shell script did), and it seems to pass the 
> test-suite, and it looks pretty obvious.

The only issue I know about with using unpack_trees in C as a replacement 
for read-tree in shell is that unpack_trees leaves "deletion" index 
entries in memory which are not written to disk, but may surprise some 
code (these are used to allow -u to remove the files from the working 
tree). So you may want to make sure that you don't get any weird results 
out of a commit of particular files that involves not committing some 
newly-added files:

$ git add new-file
$ (edit old-file)
$ git commit old-file

This may cause the unpack_trees to leave a misleading entry for new-file 
that the code doesn't expect. I've got a patch to make it saner as part of 
my builtin-checkout series, but I can't say for sure that that change 
won't either confuse something else or have performance problems without a 
bunch of analysis I haven't done recently.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply

* Re: Digging through old vendor code
From: Linus Torvalds @ 2008-01-13  4:41 UTC (permalink / raw)
  To: Jon Smirl; +Cc: Git Mailing List
In-Reply-To: <9e4733910801122009k5658488bx69c04a5cbf7d832a@mail.gmail.com>



On Sat, 12 Jan 2008, Jon Smirl wrote:
>
> I have a file that a vendor has modified. It's a serial driver so I
> know which directory the original file came from. Is there a way to
> ask git to search through all of the past versions of all of the files
> in this directory and give me the top two or three choices as to what
> file the vendor originally copied before staring to edit? This is the
> same problem as picking the best diff base.

Heh. Maybe you could just use the rename logic?

Example of what *might* work:

 - go to the directory in the git tree you think the file came from

 - delete all the files that are *potential* sources (eg "rm *.c")

 - add the new file to that directory

 - commit the end result

 - ask git to find the best rename possibility, using a low rename 
   detection score, something like:

	git show -M1%

(and no, I don't guarantee that that "-M1%" is the right syntax, and in 
general you might well want to play with this concept a bit..)

No guarantees, but it just might work..

		Linus

^ permalink raw reply

* Digging through old vendor code
From: Jon Smirl @ 2008-01-13  4:09 UTC (permalink / raw)
  To: Git Mailing List

I have a file that a vendor has modified. It's a serial driver so I
know which directory the original file came from. Is there a way to
ask git to search through all of the past versions of all of the files
in this directory and give me the top two or three choices as to what
file the vendor originally copied before staring to edit? This is the
same problem as picking the best diff base.

-- 
Jon Smirl
jonsmirl@gmail.com

^ permalink raw reply

* Re: performance problem: "git commit filename"
From: Linus Torvalds @ 2008-01-13  4:04 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List; +Cc: Kristian H?gsberg
In-Reply-To: <alpine.LFD.1.00.0801121735020.2806@woody.linux-foundation.org>

On Sat, 12 Jan 2008, Linus Torvalds wrote:
> 
> HOWEVER. When that logic was converted from that shell-script into a 
> builtin-commit.c, that conversion was not done correctly. The old "git 
> read-tree -i -m" was not translated as a "unpack_trees()" call, but as 
> this in prepare_index():
> 
> 	discard_cache()
> 	..
> 	tree = parse_tree_indirect(head_sha1);
> 	..
> 	read_tree(tree, 0, NULL)
> 
> which is very wrong, because it replaces the old index entirely, and 
> doesn't do that stat information merging.

This patch may or may not fix it.

It makes builtin-commit.c use the same logic that "git read-tree -i -m" 
does (which is what the old shell script did), and it seems to pass the 
test-suite, and it looks pretty obvious.

It also brings down the number of open/mmap/munmap/close calls to where it 
should be, although it still does *way* too many "lstat()" operations (ie 
it does 4*lstat for each file in the index - one more than the 
non-filename one does).

With that fixed, performance is also roughly where it should be (ie the 
17-18s for the cold-cache case), because it no longer needs to rehash all 
the files!

HOWEVER. This was just a quick hack, and while it all looks sane, this is 
some damn core code. Somebody else should double- and triple-check this.

[ That 4x lstat thing bothers me. I think we should add a flag to the 
  index saying "we checked this once already, it's clean", so that if we 
  do multiple passes over the index, we can still do just a single lstat() 
  on just the first pass. But that's a separate issue.

  On Linux, a cached lstat() is almost free. Well, at least compared to 
  all the crap operating systems out there. And obviously, if you do 
  multiple lstat's per file, all but the first one *will* be cached.

  However, "almost free" still isn't zero, and with the kernel having 23k 
  files in it, doing almost a hundred thousand lstat's is still something 
  that only takes about half a second or so for me. We _really_ should do 
  only ~23k or so of them, and the cached cache should take on the order 
  of 0.15s, rather than half a second!

  So this is worth optimizing. With bigger repositories, it's going to be 
  more noticeable, and with other operating systems, all those lstat()'s 
  will cost much _much_ more. Of course, any IO overhead will be much 
  bigger, so this is mostly a cached-case issue, but cached-case is still 
  important.. ]

Anyway, consider this being conditionally signed-off-by: me, assuming 
a few other people spend a bit of time double-checking all my logic.

Please?

			Linus

---
 builtin-commit.c |   37 ++++++++++++++++++++++++++++---------
 1 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/builtin-commit.c b/builtin-commit.c
index 73f1e35..cc5134e 100644
--- a/builtin-commit.c
+++ b/builtin-commit.c
@@ -21,6 +21,7 @@
 #include "utf8.h"
 #include "parse-options.h"
 #include "path-list.h"
+#include "unpack-trees.h"

 static const char * const builtin_commit_usage[] = {
 	"git-commit [options] [--] <filepattern>...",
@@ -177,10 +178,34 @@ static void add_remove_files(struct path_list *list)
 	}
 }

+static void create_base_index(void)
+{
+	struct tree *tree;
+	struct unpack_trees_options opts;
+	struct tree_desc t;
+
+	if (initial_commit) {
+		discard_cache();
+		return;
+	}
+
+	memset(&opts, 0, sizeof(opts));
+	opts.head_idx = 1;
+	opts.index_only = 1;
+	opts.merge = 1;
+	
+	opts.fn = oneway_merge;
+	tree = parse_tree_indirect(head_sha1);
+	if (!tree)
+		die("failed to unpack HEAD tree object");
+	parse_tree(tree);
+	init_tree_desc(&t, tree->buffer, tree->size);
+	unpack_trees(1, &t, &opts);
+}
+
 static char *prepare_index(int argc, const char **argv, const char *prefix)
 {
 	int fd;
-	struct tree *tree;
 	struct path_list partial;
 	const char **pathspec = NULL;

@@ -278,14 +303,8 @@ static char *prepare_index(int argc, const char **argv, const char *prefix)

 	fd = hold_lock_file_for_update(&false_lock,
 				       git_path("next-index-%d", getpid()), 1);
-	discard_cache();
-	if (!initial_commit) {
-		tree = parse_tree_indirect(head_sha1);
-		if (!tree)
-			die("failed to unpack HEAD tree object");
-		if (read_tree(tree, 0, NULL))
-			die("failed to read HEAD tree object");
-	}
+
+	create_base_index();
 	add_remove_files(&partial);
 	refresh_cache(REFRESH_QUIET);

^ permalink raw reply related

* Re: performance problem: "git commit filename"
From: Linus Torvalds @ 2008-01-13  1:46 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List; +Cc: Kristian H?gsberg
In-Reply-To: <alpine.LFD.1.00.0801121426510.2806@woody.linux-foundation.org>

On Sat, 12 Jan 2008, Linus Torvalds wrote:
> 
> I thought we had fixed this long long ago, but if we did, it has 
> re-surfaced.

It's new, and yes, it seems to be due to the new builtin-commit.c.

I think I know what is going on.

In the old git-commit.sh, this case used to be handled with

	TMP_INDEX="$GIT_DIR/tmp-index$$"

	GIT_INDEX_FILE="$THIS_INDEX" \
	git read-tree --index-output="$TMP_INDEX" -i -m HEAD

which is a one-way merge of the *old* index and HEAD, taking the index 
information from the old index, but the actual file information from HEAD 
(to then later be updated by the named files).

This logic is implemented by builtin-read-tree.c with

	struct unpack_trees_options opts;
	..
	opts.fn = oneway_merge;
	..
	unpack_trees(nr_trees, t, &opts);

where all the magic is done by that "oneway_merge()" function being called 
for each entry by unpack_trees(). This does everything right, and the 
result is that any index entry that was up-to-date in the old index and 
unchanged in the base tree will be up-to-date in the new index too

HOWEVER. When that logic was converted from that shell-script into a 
builtin-commit.c, that conversion was not done correctly. The old "git 
read-tree -i -m" was not translated as a "unpack_trees()" call, but as 
this in prepare_index():

	discard_cache()
	..
	tree = parse_tree_indirect(head_sha1);
	..
	read_tree(tree, 0, NULL)

which is very wrong, because it replaces the old index entirely, and 
doesn't do that stat information merging.

As a result, the index that is created by read-tree is totally bogus in 
the stat cache, and yes, everything will have to be re-computed.

Kristian?

			Linus

^ permalink raw reply

* Re: Adding Git to Better SCM Initiative : Comparison
From: Jakub Narebski @ 2008-01-13  0:44 UTC (permalink / raw)
  To: Shlomi Fish
  Cc: git, Eyvind Bernhardsen, David Kastrup, Florian Weimer,
	Chris Shoemaker
In-Reply-To: <200801071057.27710.shlomif@iglu.org.il>

On Mon, 7 Jan 2008, Shlomi Fish wrote:
> 
> I'm CCing all the correspondents, because I'm banned from the vger.kernel.org 
> mail. This has been an obstacle for me in several legitimate occassions and 
> this one is the latest. I'm still CCing it, so the people in the mailing list 
> will receive the replies.
> 
> On Monday 10 December 2007, Jakub Narebski wrote:
> > I have noticed that your SCM comparison at "Better SCM Initiative"
> > website
> >   http://better-scm.berlios.de/comparison/comparison.html
> > misses one of the Git, version control system which is used to manage
> > Linux kernel, and one of the main open source (distributed) version
> > control systems (among Mercurial, Bazaar-NG, Monotone and Darcs).
> >
> 
> Indeed git is absent. That's because no one until you has volunteered to send 
> a patch that adds it to the comparison. Another requirement is for someone to 
> volunteer to become a "champion" for the version control system and maintain 
> it into the future. So who is going to be the champion?

I can be git champion for "Better SCM Initiative" comparison... although
I'd rather somebody else was it.

[...] 
> > Below there is (slightly doctored) patch to the sources for the site.
> >
> 
> Despite the fact that I the comparison was recently patched to add Bazaar and 
> fix some grammatical problems, the patch still applies cleanly. However, I 
> saw that some people commented on it here. Can you send me a new patch 
> integrating all this commentary?

I'll try to send revised patch soon. Integrating commentary is a bit
harder that it could be because some responses were sent _only_ to
git mailing list, so I'd have to browse through git mailing list
archives.

BTW. some of the questions / comments were caused by the fact that the
features listed in Better SCM Initiative: Comparison are a bit ambiguous.

What does for example "Atomic Commit" mean? Does it mean that if we
interrupt commit in the middle we would always get full commit or none,
and not some f**d-up intermediate state? Hos CVS can have atomic commits
then?

What does "Renames Support" mean? Does it mean that when browsing history
we [can] show file / directory renames? Does it mean that log of file or
directory history [can] follow renames? Does it mean that line-wise file
history [can] follow renames? Renames support in merges is as TODO, so
I don't think that this one matters in this question. Because the answer,
especially in the case of git which is a bit different in that it does
rename detection and not rename tracking (using inodes / file-ids),
depends on that...

-- 
Jakub Narebski
Poland

^ permalink raw reply

* Re: [PATCH] Teach remote machinery about remotes.default config variable
From: Junio C Hamano @ 2008-01-12 22:48 UTC (permalink / raw)
  To: Mark Levedahl; +Cc: git
In-Reply-To: <47893E1A.5020702@gmail.com>

Mark Levedahl <mlevedahl@gmail.com> writes:

> Basically, I think an important (but not complete) test of the design
> is that
>
>    git clone -o frotz git://frotz.foo.bar/myproject.git
>    cd myproject
>    git submodule init
>    git submodule update
>
> work, with origin = frotz throughout the submodules, and with the
> whole project correctly checked out even if the entire project was
> rehosted onto a different server.

I like that.  This is a very good argument, especially because
it clarifies very well that the issue is not about "'submodule
init' misbehaves" but "fetch/pull/merge does not play well with
clone -o".

The only remaining (minor) doubt I have (not in the sense that
"I object to it!", but in the sense that "I wish there could be
a better alternative, but I do not think of one offhand") is
polluting the core.* namespace with this configuration variable.

Looking at Documentation/config.txt, I realize that we already
have made a mistake of allowing core.gitproxy, but other than
that single mistake, everything in core.* is still about things
that apply to the use of git even when the repository does not
talk with any other repository.  If we deprecate and rename away
that one mistake, we can again make core.* to mean things that
are _really_ core, but using core.origin for "the default remote
is not called 'origin' but 'frotz' here" is a step backwards
from that ideal.

But that's a minor naming issue.

^ permalink raw reply

* performance problem: "git commit filename"
From: Linus Torvalds @ 2008-01-12 22:46 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List; +Cc: Kristian Høgsberg

I thought we had fixed this long long ago, but if we did, it has 
re-surfaced.

Using an explicit filename with "git commit" is _extremely_ slow. Lookie 
here:

	[torvalds@woody linux]$ time git commit fs/exec.c
	no changes added to commit (use "git add" and/or "git commit -a")

	real    0m1.671s
	user    0m1.200s
	sys     0m0.328s

that's closer to two seconds on a fast machine, with the whole tree 
cached!

And for the uncached case, it's just unbearably slow: two and a half 
*minutes*.

In contrast, without the filename, it's much faster:

	[torvalds@woody linux]$ time git commit
	no changes added to commit (use "git add" and/or "git commit -a")

	real    0m0.387s
	user    0m0.220s
	sys     0m0.168s

with the cold-cache case now being "just" 18s (which is still long, but 
we're talking eight times faster, and certainly not unbearable!)

Doing an "strace -c" on the thing shows why. In the filename case, we 
have:

	% time     seconds  usecs/call     calls    errors syscall
	------ ----------- ----------- --------- --------- ----------------
	 32.69    0.000868           0     92299        37 lstat
	 17.40    0.000462           0     29958      3993 open
	 15.78    0.000419           0      5522           getdents
	 15.56    0.000413           0     23165           mmap
	 11.37    0.000302           0     23118           munmap
	  5.76    0.000153           0     25966         2 close
	  1.43    0.000038           0      2845           fstat
	...

and in the non-filename case we have

	% time     seconds  usecs/call     calls    errors syscall
	------ ----------- ----------- --------- --------- ----------------
	 53.67    0.000600           0     69227        31 lstat
	 23.35    0.000261           0      5522           getdents
	 11.09    0.000124           2        55           munmap
	  4.20    0.000047           0       285           write
	  3.31    0.000037           0      5537      2638 open
	  2.33    0.000026           0      2899         1 close
	  2.06    0.000023           0      2844           fstat
	...

notice how the expensive case has a lot of successful open/mmap/munmap 
calls: it is *literally* ignoring the valid entries in the old index 
entirely, and re-hashing every single file in the tree! No wonder it is 
slow!

Just counting "lstat()" calls, it's worth noticing that the non-filename 
case seems to do three lstat's for each index entry (and yes, that's two
too many), but the named file case has upped that to *four* lstats per 
entry, and then added the one open/mmap/munmap/close on top of that!

I'm pretty sure we didn't use to do things this badly. And if this is a 
regression like I think it is, it should be fixed before a real 1.5.4 
release.

I'll try to see if I can see what's up, but I thought I'd better let 
others know too, in case I don't have time. I *suspect* (but have nothing 
what-so-ever to back that up) that this happened as part of making commit 
a builtin.

			Linus

^ permalink raw reply

* Re: [PATCH] Teach remote machinery about remotes.default config variable
From: Mark Levedahl @ 2008-01-12 22:29 UTC (permalink / raw)
  To: Johannes Schindelin; +Cc: Junio C Hamano, git
In-Reply-To: <alpine.LSU.1.00.0801122123430.8333@wbgn129.biozentrum.uni-wuerzburg.de>

Johannes Schindelin wrote:
> Why is your patch then not about git-submodule?
>
> And I still fail to see -- even for submodules -- how you begin to tackle 
> that lookup problem.
>
> Ciao,
> Dscho
>
>   
Because git-submodule is a wrapper around git-fetch and git-clone and 
git-remote, and those lacked the mechanism to honor the fact that when I 
said

       git clone -o  frotz frontz.foo.bar/foo.git

I *defined* the upstream's nickname as "frotz", not "origin", and origin 
is *not* defined so don't try to use it. As sub-modules are git 
projects, fixing this in a sub-module necessarily fixes it in any git 
project.

Mark

^ permalink raw reply

* Re: [PATCH] Teach remote machinery about remotes.default config variable
From: Mark Levedahl @ 2008-01-12 22:24 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git
In-Reply-To: <7vbq7qssd7.fsf@gitster.siamese.dyndns.org>

Junio C Hamano wrote:
> If it is truly only about "submodule update" then the change
> seems too intrusive, especially "remotes.default" variable that
> affects the way how fetch and merge works in situations that do
> not involve submodules.
> If it is not limited to "submodule update" but equally valid fix
> to non-submodule situations, the changes to the other parts may
> very well be justifiable, but that would mean your "Yes" is a
> lie and instead should be "No, but these situations are helped
> by these changes because...".
>
>   
First, I resent the patch series last night, it now uses core.origin to 
avoid touching remotes.* namespace.

The changes *do* fix a nit when on a non-tracking branch. With this, 
fetch / merge / pull will now honor that the user said (via git clone -o 
frotz) "my upstream is nicknamed frotz" and not try to use origin when 
origin was never defined.

So, while fixing this minor aggravation wasn't my motivation, I view 
this as a nice side-benefit :^).

The driving issues:
1) I deal with too many servers for "origin" to be a useful nick name, 
and we have an agreed set of nickname / server pairings across my project.
2) Therefore, we always do git clone -o frotz  frotz.foo.bar/path_to_git.
3) Because of 2, for top-level, "origin" is not defined, tracking 
branches set up via git branch --track point to the correct remote, and 
we basically understand branch names as <nickname>/branch. In other 
words, we *are* aware of what server we are using.
4) git-submodule update breaks the above:
- a) it invokes git clone frotz.foo.bar/path_to_git thus defining 
"origin" as the nickname for frotz.foo.bar.
  b) it invokes bare git-fetch on a detached head, so the upstream *has* 
to be origin.

> If your top-level repository needs to access a specific server
> "frotz.foo.bar" for updates, then you would have bootstrapped
> the whole thing with:
>
> 	$ git clone git://frotz.foo.bar/toplevel.git
>
> and in that particular instance of the repository, the source
> repository on frotz.foo.bar would have been known as 'origin',
> right? 
Nope, we did it with git clone -o frotz git://frotz.foo.bar/toplevel.git
We *never* define origin, frozt.foo.bar is *always* frotz.

>  I would not object if you also gave another nickname
> 'frotz' to the same repository for consistency across
> developers.
>   
good. We are making (some) progress. :^)
> If that is the case, I am wondering why your subprojects are not
> pointing at the corresponding repository on that same
> 'frotz.foo.bar' machine as 'origin'.  I suspect the reason is
> that .gitmodules do not say 'frotz.foo.bar' but name some other
> machine.
>   
Actually,
1) We don't use origin because we avoid having to wonder "Is 
frotz.foo.bar named "origin" or "frotz" on this client, and thus how do 
I get data from frotz?
2) I submitted the change allowing submodules to be recorded into 
.gitmodules with a relative url (e.g., ./path_from_parent_to_submodule) 
rather than an absolute, so we record the relative path only.
3) Thus, git submodule has set up the submodules to point at the parent 
project's default remote. However, in the parent the server is nicknamed 
"frotz", but now in the submodule the server is nicknamed "origin"  Oops.

With my patches, parent and submodule both refer to frotz.foo.bar as frotz.

> And in-tree .gitmodules can name only one URL, as it is project
> global and shared by everybody.  There is no escaping it.
> At least as things were designed, "git submodule init" takes URL
> recorded in .gitmodules as a hint, but this is for the user to
> override in .git/config in the top-level.  Maybe the UI to allow
> this overriding is not easy enough to use, and your submodules
> ended up pointing at wrong (from the machine's point of view)
> URL as 'origin'.  And perhaps that is the root cause of this
> issue?
>
>   
Again, the relative-url patch was to address this so that a project that 
is mirrored to another server remains valid on the new server without 
modifying the .gitmodules in-tree. (Yes, I know you *can* modify 
information in a given clones .git/config, but I'm trying to avoid such 
manual per clone/checkout modifications where it can reasonably be done.).

Basically, I think an important (but not complete) test of the design is 
that

    git clone -o frotz git://frotz.foo.bar/myproject.git
    cd myproject
    git submodule init
    git submodule update

work, with origin = frotz throughout the submodules, and with the whole 
project correctly checked out even if the entire project was rehosted 
onto a different server. With relative urls and my latest patch series 
last night, this all works, and of course upstream can still be "origin" 
if that is what is desired.

While our overall project exists on many servers, mirroring is an 
incorrect term. Rather, only certain branches of  various parts exist 
everywhere, many other branches are specific to a given server, so we 
really name branches using servername/branchname. It is this aspect of 
the project that causes us to be aware of the server in use, and thus 
makes use of "origin" as a generic upstream not useful.

> I am looking at the discussion on the list archive when we
> discussed the initial design of .gitmodules:
>
>     http://thread.gmane.org/gmane.comp.version-control.git/47466/focus=47502
>     http://thread.gmane.org/gmane.comp.version-control.git/47466/focus=47548
>     http://thread.gmane.org/gmane.comp.version-control.git/47466/focus=47621
>
> I do not think we are there yet, and suspect that the current
> "git submodule init" does not give the user a chance to say "the
> URL recorded in the in-tree .gitmodules corresponds to this URL
> in this repository for administrative or network connectivity or
> whatever reasons".
>   
> Maybe that is the real issue that we should be tackling.  I
> dunno.
>
> Although I _think_ being able to use nickname other than
> hardcoded 'origin' for fetch/merge is a good change, if my above
> suspicion is correct, that change alone would not make the life
> easier to people who _use_ submodules, as the need for them to
> set up extra nicknames (like 'frotz') and configure the
> submodule repositories to use that specific nickname instead of
> 'origin' would not change.
>
>   

git-submodule right now supports two different layouts (urls relative to 
the parent, and absolute urls such that each sub-module is on an 
independent server). The management approaches to these are going to be 
different.

I also suspect there are two basic use cases here: accumulation of a 
number of independently managed projects vs. splitting a single major 
project into a number of smaller pieces to allow some decoupling, but 
still managing the set as a composite whole.

There may be some direct correlation of use-case and submodule layout, 
don't know. My project uses relative-urls, and I am managing a large 
project that has been split into a number of components. So, my 
suggestions are focused entirely upon this design and use-case, and I 
don't expect I am addressing the others at all. (As usual, this requires 
someone who needs the other model(s) to step up and drive).

For *my* uses (relative urls, single logical project):

1) There are times when the parent's branch.<name>.remote should be 
flowed down to all subprojects for git submodule update, of course this 
would require that the remote be defined for all.
2) Thus, there needs to be a way to define a new remote globally for the 
project, and have it be correctly interpreted by each submodule (e.g., a 
repeat of the relative-url dereferencing now done by submodule init, but 
applied later to all submodules to define a new remote). Yes, this could 
be accomplished by going into each submodule independently and issuing 
appropriate commands, but administration would be much easier given a 
top-level command that could recurse and "do the right thing" per 
sub-project.

I *suspect* that origin is a much more useful concept for the alternate 
construct (absolute urls, loose alliance of separately managed 
projects), but as I said that is not my problem so please ask folks who 
have that model to define what works for them.
> For communication purposes, I would agree with Dscho that the
> name 'origin' that names different things for different people
> is wrong and using specific name 'frotz' would solve
> communication issues.  But when using the repository and doing
> actual work, wouldn't it be _much_ better if you can
> consistently go to a repository on a random machine and always
> can say 'origin' to mean the other repository this repository
> usually gets new objects from (and sends its new objects to)?
>
>
>   
(Acutally, I thought I was the one arguing that using origin when it 
means different things to different folks is not good. That's the root 
of my problems. :^) )

Anyway, I have not found any use of "origin" on my project really 
useful. We have to be and *are* aware of the server/branchname in use, 
not just the branch. Partly this is because different subgroups have 
different natural gathering points (we tend to exchange data via ad hoc 
"mob" branches on whatever server is most accessible to the particular 
group), and partly because some information simply cannot be allowed on 
some servers, but basically the more accessible a server is, the less 
information that server can have. I believe "origin" is really useful 
only when it has just one meaning, or when all values are effectively 
identical (e.g., you have several mirrors for load balancing, etc, but 
all are identical modulo mirroring delays).

OTOH, a reasonable change to the semantics of "origin" might be to have:
1) core.origin name the remote that is the "normal" upstream.
2) Reserve and allow use of the name "origin" to mean $core.origin, 
e.g., in shell scripts replace all references to remote "origin" with 
$(git config core.origin). Of course, if core.origin = origin, then no 
user visible change occurs.

In this way, git would  not record the same remote's branches in two 
ways (as origin/master and as frotz/master), but rather dereference 
origin -> frotz and then get frotz/master. Dunno, no matter how you 
slice it, having more than one way to refer to the same remote is going 
to be confusing, and that's why we don't use origin.

Mark

^ permalink raw reply

* Re: [PATCH] Teach remote machinery about remotes.default config variable
From: Junio C Hamano @ 2008-01-12 20:26 UTC (permalink / raw)
  To: Mark Levedahl; +Cc: git
In-Reply-To: <47891658.3090604@gmail.com>

Mark Levedahl <mlevedahl@gmail.com> writes:

> Junio C Hamano wrote:
>> Ahh.
>>
>> Does that suggest the new configuration thing is only about the
>> "submodule update" command, not "remotes.default" that affects
>> how the non-submodule merge and fetch works?
>>
>>
> Yes - this patch set was inspired by the single question of "how do I
> avoid needing to define origin as opposed to a server-specific
> nickname now that I am using sub-modules?"

If it is truly only about "submodule update" then the change
seems too intrusive, especially "remotes.default" variable that
affects the way how fetch and merge works in situations that do
not involve submodules.

If it is not limited to "submodule update" but equally valid fix
to non-submodule situations, the changes to the other parts may
very well be justifiable, but that would mean your "Yes" is a
lie and instead should be "No, but these situations are helped
by these changes because...".

In any case, let's step back a bit.

Earlier you said in a response to Dscho that your servers are
named consistently across repositories.  servername.foo.bar has
nickname servername everywhere.

If your top-level repository needs to access a specific server
"frotz.foo.bar" for updates, then you would have bootstrapped
the whole thing with:

	$ git clone git://frotz.foo.bar/toplevel.git

and in that particular instance of the repository, the source
repository on frotz.foo.bar would have been known as 'origin',
right?  I would not object if you also gave another nickname
'frotz' to the same repository for consistency across
developers.

If that is the case, I am wondering why your subprojects are not
pointing at the corresponding repository on that same
'frotz.foo.bar' machine as 'origin'.  I suspect the reason is
that .gitmodules do not say 'frotz.foo.bar' but name some other
machine.

And in-tree .gitmodules can name only one URL, as it is project
global and shared by everybody.  There is no escaping it.

At least as things were designed, "git submodule init" takes URL
recorded in .gitmodules as a hint, but this is for the user to
override in .git/config in the top-level.  Maybe the UI to allow
this overriding is not easy enough to use, and your submodules
ended up pointing at wrong (from the machine's point of view)
URL as 'origin'.  And perhaps that is the root cause of this
issue?

I am looking at the discussion on the list archive when we
discussed the initial design of .gitmodules:

    http://thread.gmane.org/gmane.comp.version-control.git/47466/focus=47502
    http://thread.gmane.org/gmane.comp.version-control.git/47466/focus=47548
    http://thread.gmane.org/gmane.comp.version-control.git/47466/focus=47621

I do not think we are there yet, and suspect that the current
"git submodule init" does not give the user a chance to say "the
URL recorded in the in-tree .gitmodules corresponds to this URL
in this repository for administrative or network connectivity or
whatever reasons".

Maybe that is the real issue that we should be tackling.  I
dunno.

Although I _think_ being able to use nickname other than
hardcoded 'origin' for fetch/merge is a good change, if my above
suspicion is correct, that change alone would not make the life
easier to people who _use_ submodules, as the need for them to
set up extra nicknames (like 'frotz') and configure the
submodule repositories to use that specific nickname instead of
'origin' would not change.

For communication purposes, I would agree with Dscho that the
name 'origin' that names different things for different people
is wrong and using specific name 'frotz' would solve
communication issues.  But when using the repository and doing
actual work, wouldn't it be _much_ better if you can
consistently go to a repository on a random machine and always
can say 'origin' to mean the other repository this repository
usually gets new objects from (and sends its new objects to)?

^ permalink raw reply

* Re: [PATCH] Teach remote machinery about remotes.default config variable
From: Johannes Schindelin @ 2008-01-12 20:24 UTC (permalink / raw)
  To: Mark Levedahl; +Cc: Junio C Hamano, git
In-Reply-To: <47891658.3090604@gmail.com>

Hi,

On Sat, 12 Jan 2008, Mark Levedahl wrote:

> Junio C Hamano wrote:
> > Ahh.
> > 
> > Does that suggest the new configuration thing is only about the 
> > "submodule update" command, not "remotes.default" that affects how the 
> > non-submodule merge and fetch works?
>
> Yes - this patch set was inspired by the single question of "how do I 
> avoid needing to define origin as opposed to a server-specific nickname 
> now that I am using sub-modules?"

Why is your patch then not about git-submodule?

And I still fail to see -- even for submodules -- how you begin to tackle 
that lookup problem.

Ciao,
Dscho

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox