* [PATCH] git-{repack,pack-objects} accept --{pack,blob}-limit to control pack size
@ 2007-04-04 20:40 Dana How
2007-04-04 22:04 ` Nicolas Pitre
From: Dana How @ 2007-04-04 20:40 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Git Mailing List, danahow
[-- Attachment #1: Type: text/plain, Size: 2071 bytes --]
The motivations are to better support portable media,
older filesystems, and larger repositories without
awkward enormous packfiles.
When --pack-limit[=N] is specified and --stdout is not,
all bytes in the resulting packfile(s) appear at offsets
less than N (which defaults to 1<<31). The default
guarantees mmap(2) on 32b systems never sees signed off_t's.
The object stream may be broken into multiple packfiles
as a result, each properly and conventionally built.
When --stdout is also specified, all objects in the
resulting packfile(s) _start_ at offsets less than N.
All the packfiles appear concatenated on stdout,
and each has its object count set to 0. The behavior
without --stdout cannot be duplicated here since
lseek(2) is not generally possible on stdout. The end
of each pack in the concatenated whole can be detected by
asking, just before reading each object, if the next
20 bytes match the SHA-1 of what came before.
A future change might insert a null-byte EOF marker
(i.e. type=OBJ_NONE/length=0) before each pack's final SHA-1
or before only the final pack's SHA-1.
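As an illustration, a reader of the concatenated stream can keep a
running SHA_CTX over every byte consumed since the current pack's
header and, before each object, peek at the next 20 bytes; a minimal
sketch (assuming OpenSSL's SHA1 interface; buffering is left to the
caller):

#include <string.h>
#include <openssl/sha.h>

/* Return 1 if the next 20 bytes are the trailer of the pack read
 * so far, i.e. the SHA-1 of every byte consumed since its header. */
static int at_pack_boundary(const SHA_CTX *running,
                            const unsigned char *next20)
{
	SHA_CTX copy = *running; /* finalize a copy, keep the original going */
	unsigned char sha1[20];

	SHA1_Final(sha1, &copy);
	return !memcmp(sha1, next20, 20);
}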
When --blob-limit=N is specified, blobs whose uncompressed
size is greater than or equal to N are omitted from the pack(s).
If --pack-limit is specified, --blob-limit is not, and
--stdout is not, then --blob-limit defaults to 1/4
of the --pack-limit.
To enable this, csum-file.c now has the ability to rollback
a checksummed write (see sha1mark/sha1undo).
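In outline the writer uses the pair like this -- a condensed sketch
of this patch's write_one(), not the verbatim code:

/* f is the pack being written, e the next object, offset the current
 * size of the pack; offset_limit and write_object() as in the patch. */
static off_t write_bounded(struct sha1file *f, struct object_entry *e,
                           off_t offset)
{
	struct sha1posn posn;

	sha1mark(f, &posn);	/* snapshot SHA_CTX and buffer offset */
	e->offset = offset;
	offset += write_object(f, e);
	/* leave room for the pack's trailing 20-byte SHA-1 */
	if (offset >= offset_limit - 20) {
		sha1undo(f, &posn, offset, e->offset); /* rewind hash, truncate file */
		e->offset = 0;
		return 0;	/* caller starts the next pack */
	}
	return offset;
}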
Signed-off-by: Dana L. How <danahow@gmail.com>
---
builtin-pack-objects.c | 300 ++++++++++++++++++++++++++++++++++++++--------
builtin-unpack-objects.c | 2 +-
csum-file.c | 41 +++++++
csum-file.h | 6 +
git-repack.sh | 12 ++-
http-fetch.c | 2 +-
http-push.c | 2 +-
index-pack.c | 2 +-
sha1_file.c | 2 +-
9 files changed, 307 insertions(+), 62 deletions(-)
[Full patch attached since gmail insists on wordwrap and eating ws]
Thanks,
--
Dana L. How danahow@gmail.com +1 650 804 5991 cell
[-- Attachment #2: 0001-git-repack-pack-objects-accept-pack-blob-limit.patch.txt --]
[-- Type: text/plain, Size: 22247 bytes --]
From f9da90fb81a8ed3336829c824baf3fd5c08cf74b Mon Sep 17 00:00:00 2001
From: Dana L. How <danahow@gmail.com>
Date: Wed, 4 Apr 2007 13:15:45 -0700
Subject: [PATCH] git-{repack,pack-objects} accept --{pack,blob}-limit to control pack size
The motivations are to better support portable media,
older filesystems, and larger repositories without
awkward enormous packfiles.
When --pack-limit[=N] is specified and --stdout is not,
all bytes in the resulting packfile(s) appear at offsets
less than N (which defaults to 1<<31). The default
guarantees mmap(2) on 32b systems never sees signed off_t's.
The object stream may be broken into multiple packfiles
as a result, each properly and conventionally built.
When --stdout is also specified, all objects in the
resulting packfile(s) _start_ at offsets less than N.
All the packfiles appear concatenated on stdout,
and each has its object count set to 0. The behavior
without --stdout cannot be duplicated here since
lseek(2) is not generally possible on stdout. The end
of each pack in the concatenated whole can be detected by
asking, just before reading each object, if the next
20 bytes match the SHA-1 of what came before.
A future change might insert a null-byte EOF marker
(i.e. type=OBJ_NONE/length=0) before each pack's final SHA-1
or before only the final pack's SHA-1.
When --blob-limit=N is specified, blobs whose uncompressed
size is greater than or equal to N are omitted from the pack(s).
If --pack-limit is specified, --blob-limit is not, and
--stdout is not, then --blob-limit defaults to 1/4
of the --pack-limit.
To enable this, csum-file.c now has the ability to rollback
a checksummed write (see sha1mark/sha1undo).
Signed-off-by: Dana L. How <danahow@gmail.com>
---
builtin-pack-objects.c | 300 ++++++++++++++++++++++++++++++++++++++--------
builtin-unpack-objects.c | 2 +-
csum-file.c | 41 +++++++
csum-file.h | 6 +
git-repack.sh | 12 ++-
http-fetch.c | 2 +-
http-push.c | 2 +-
index-pack.c | 2 +-
sha1_file.c | 2 +-
9 files changed, 307 insertions(+), 62 deletions(-)
diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index b5f9648..0a11113 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -14,8 +14,8 @@
#include "list-objects.h"
static const char pack_usage[] = "\
-git-pack-objects [{ -q | --progress | --all-progress }] \n\
- [--local] [--incremental] [--window=N] [--depth=N] \n\
+git-pack-objects [{ -q | --progress | --all-progress }] [--pack-limit[=N]]\n\
+ [--blob-limit=N] [--local] [--incremental] [--window=N] [--depth=N]\n\
[--no-reuse-delta] [--delta-base-offset] [--non-empty] \n\
[--revs [--unpacked | --all]*] [--reflog] [--stdout | base-name] \n\
[<ref-list | <object-list]";
@@ -29,6 +29,7 @@ struct object_entry {
unsigned int depth; /* delta depth */
unsigned int delta_limit; /* base adjustment for in-pack delta */
unsigned int hash; /* name hint hash */
+ char no_write; /* flag: written to previous pack OR too big */
enum object_type type;
enum object_type in_pack_type; /* could be delta */
unsigned long delta_size; /* delta data size (uncompressed) */
@@ -68,7 +69,10 @@ static int allow_ofs_delta;
static struct object_entry **sorted_by_sha, **sorted_by_type;
static struct object_entry *objects;
-static uint32_t nr_objects, nr_alloc, nr_result;
+static struct object_entry **written_list;
+static uint32_t nr_objects, nr_alloc, nr_result, nr_written, nr_skipped = 0;
+static uint32_t offset_limit = 0;
+static uint32_t blob_limit = 0;
static const char *base_name;
static unsigned char pack_file_sha1[20];
static int progress = 1;
@@ -415,13 +419,17 @@ static off_t write_object(struct sha1file *f,
}
if (!to_reuse) {
+ int usable_delta = !entry->delta ? 0 :
+ !offset_limit ? 1 :
+ entry->delta->no_write ? 0 :
+ entry->delta->offset ? 1 : 0;
buf = read_sha1_file(entry->sha1, &type, &size);
if (!buf)
die("unable to read %s", sha1_to_hex(entry->sha1));
if (size != entry->size)
die("object %s size inconsistency (%lu vs %lu)",
sha1_to_hex(entry->sha1), size, entry->size);
- if (entry->delta) {
+ if (usable_delta) {
buf = delta_against(buf, size, entry);
size = entry->delta_size;
obj_type = (allow_ofs_delta && entry->delta->offset) ?
@@ -503,62 +511,212 @@ static off_t write_one(struct sha1file *f,
struct object_entry *e,
off_t offset)
{
- if (e->offset || e->preferred_base)
+ struct sha1posn posn;
+ uint32_t save_written = 0, save_written_delta = 0;
+ uint32_t save_reused = 0, save_reused_delta = 0;
+ if (e->offset || e->preferred_base || e->no_write)
/* offset starts from header size and cannot be zero
* if it is written already.
*/
return offset;
/* if we are deltified, write out its base object first. */
- if (e->delta)
+ if (e->delta) {
offset = write_one(f, e->delta, offset);
+ if (!offset)
+ return offset;
+ }
+ if (offset_limit) {
+ if ( !pack_to_stdout ) {
+ /* save state before write for possible later seekback */
+ save_written = written, save_written_delta = written_delta;
+ save_reused = reused, save_reused_delta = reused_delta;
+ sha1mark(f, &posn);
+ } else if ( (unsigned long)offset >= (unsigned long)offset_limit )
+ /*
+ * This ensures that no object's offset in the pack
+ * exceeds or is equal to the offset_limit. It is
+ * a looser way of limiting packsize w/o seeking and
+ * is used in --stdout mode.
+ */
+ return 0;
+ }
e->offset = offset;
- return offset + write_object(f, e);
+ offset += write_object(f, e);
+ /*
+ * This ensures that the packfile size never exceeds or matches
+ * the offset_limit if supplied. The "20" is for the final SHA1.
+ * This limit isn't used with --stdout since it requires seeking.
+ */
+ if (offset_limit && !pack_to_stdout &&
+ (unsigned long)offset >= (unsigned long)(offset_limit - 20)) {
+ written = save_written, written_delta = save_written_delta;
+ reused = save_reused, reused_delta = save_reused_delta;
+ sha1undo(f, &posn, offset, e->offset);
+ e->offset = 0;
+ return 0;
+ }
+ *written_list++ = e;
+ return offset;
+}
+
+/*
+ * Move this, the version in fast-import.c,
+ * and index_pack.c:readjust_pack_header_and_sha1 into sha1_file.c ?
+ */
+static void fixup_header_footer(int pack_fd, unsigned char *pack_file_sha1,
+ char *pack_name, uint32_t object_count)
+{
+ static const int buf_sz = 128 * 1024;
+ SHA_CTX c;
+ struct pack_header hdr;
+ char *buf;
+
+ if (lseek(pack_fd, 0, SEEK_SET) != 0)
+ die("Failed seeking to start: %s", strerror(errno));
+ if (read_in_full(pack_fd, &hdr, sizeof(hdr)) != sizeof(hdr))
+ die("Unable to reread header of %s", pack_name);
+ if (lseek(pack_fd, 0, SEEK_SET) != 0)
+ die("Failed seeking to start: %s", strerror(errno));
+ hdr.hdr_entries = htonl(object_count);
+ write_or_die(pack_fd, &hdr, sizeof(hdr));
+
+ SHA1_Init(&c);
+ SHA1_Update(&c, &hdr, sizeof(hdr));
+
+ buf = xmalloc(buf_sz);
+ for (;;) {
+ size_t n = xread(pack_fd, buf, buf_sz);
+ if (!n)
+ break;
+ if (n < 0)
+ die("Failed to checksum %s", pack_name);
+ SHA1_Update(&c, buf, n);
+ }
+ free(buf);
+
+ SHA1_Final(pack_file_sha1, &c);
+ write_or_die(pack_fd, pack_file_sha1, 20);
+ close(pack_fd);
}
+typedef int (*entry_sort_t)(const struct object_entry *, const struct object_entry *);
+
+static entry_sort_t current_sort;
+
+/* forward declarations for write_pack_file */
+/* (probably should move sorting stuff up here) */
+static int sort_comparator(const void *_a, const void *_b);
+static int sha1_sort(const struct object_entry *a, const struct object_entry *b);
+static void write_index_file(void);
+
static void write_pack_file(void)
{
- uint32_t i;
+ uint32_t i, j;
struct sha1file *f;
off_t offset;
struct pack_header hdr;
unsigned last_percent = 999;
- int do_progress = progress;
+ int do_progress = progress >> !base_name;
+ char oldname[PATH_MAX];
+ int pack_fd;
+ struct object_entry **list;
+ SHA_CTX ctx;
+ uint32_t nr_actual = nr_result - nr_skipped;
- if (!base_name) {
- f = sha1fd(1, "<stdout>");
- do_progress >>= 1;
- }
- else
- f = sha1create("%s-%s.%s", base_name,
- sha1_to_hex(object_list_sha1), "pack");
if (do_progress)
- fprintf(stderr, "Writing %u objects.\n", nr_result);
-
- hdr.hdr_signature = htonl(PACK_SIGNATURE);
- hdr.hdr_version = htonl(PACK_VERSION);
- hdr.hdr_entries = htonl(nr_result);
- sha1write(f, &hdr, sizeof(hdr));
- offset = sizeof(hdr);
- if (!nr_result)
- goto done;
- for (i = 0; i < nr_objects; i++) {
- offset = write_one(f, objects + i, offset);
- if (do_progress) {
- unsigned percent = written * 100 / nr_result;
- if (progress_update || percent != last_percent) {
- fprintf(stderr, "%4u%% (%u/%u) done\r",
- percent, written, nr_result);
- progress_update = 0;
- last_percent = percent;
+ fprintf(stderr, "Writing %u objects.\n", nr_actual);
+ written_list = list = xmalloc(nr_objects * sizeof(struct object_entry *));
+
+ for (i = 0; i < nr_objects;) {
+ if (!base_name) {
+ f = sha1fd(pack_fd = 1, "<stdout>");
+ }
+ else {
+ int len = snprintf(oldname, sizeof oldname, "%s-XXXXXX", base_name);
+ if (len >= PATH_MAX)
+ die("excessive pathname length for initial packfile name");
+ pack_fd = mkstemp(oldname);
+ if (pack_fd < 0)
+ die("can't create %s: %s", oldname, strerror(errno));
+ f = sha1fd(pack_fd, oldname);
+ }
+
+ hdr.hdr_signature = htonl(PACK_SIGNATURE);
+ hdr.hdr_version = htonl(PACK_VERSION);
+ hdr.hdr_entries = htonl(!base_name && offset_limit ? 0 : nr_actual);
+ sha1write(f, &hdr, sizeof(hdr));
+ offset = sizeof(hdr);
+ for (; i < nr_objects; i++) {
+ off_t offset_one = write_one(f, objects + i, offset);
+ if (!offset_one)
+ break;
+ offset = offset_one;
+ if (do_progress) {
+ unsigned percent = written * 100 / nr_actual;
+ if (progress_update || percent != last_percent) {
+ fprintf(stderr, "%4u%% (%u/%u) done\r",
+ percent, written, nr_actual);
+ progress_update = 0;
+ last_percent = percent;
+ }
}
}
+ nr_written = written_list - list;
+ written_list = list;
+
+ /*
+ * Write terminator record here if desired:
+ * type=OBJ_NONE, len=0; this is a zero byte.
+ */
+
+ /*
+ * Did we write the wrong # entries in the header?
+ * If so, rewrite it like in fast-import (gackk).
+ */
+ if ( !base_name || nr_written == nr_actual ) {
+ sha1close(f, pack_file_sha1, 1);
+ } else {
+ sha1close(f, pack_file_sha1, -1);
+ fixup_header_footer(pack_fd, pack_file_sha1, oldname, nr_written);
+ }
+
+ /*
+ * compute object_list_sha1 of sorted sha's we just wrote out;
+ * we also mark these objects as written
+ */
+ current_sort = sha1_sort;
+ qsort(list, nr_written, sizeof(struct object_entry *), sort_comparator);
+ SHA1_Init(&ctx);
+ for (j = 0; j < nr_written; j++) {
+ struct object_entry *entry = *list++;
+ entry->no_write = 1;
+ SHA1_Update(&ctx, entry->sha1, 20);
+ }
+ SHA1_Final(object_list_sha1, &ctx);
+ list = written_list;
+ /*
+ * now we can rename the pack correctly and write the index file
+ */
+ if (base_name) {
+ char newname[PATH_MAX];
+ int len = snprintf(newname, sizeof newname, "%s-%s.%s",
+ base_name, sha1_to_hex(object_list_sha1), "pack");
+ if (len >= PATH_MAX)
+ die("excessive pathname length for final packfile name");
+ if (rename(oldname, newname) < 0)
+ die("could not rename the pack file");
+ }
+ if (!pack_to_stdout) {
+ write_index_file();
+ puts(sha1_to_hex(object_list_sha1));
+ }
}
- if (do_progress)
+
+ free(written_list);
+ if (nr_actual && do_progress)
fputc('\n', stderr);
- done:
- if (written != nr_result)
- die("wrote %u objects while expecting %u", written, nr_result);
- sha1close(f, pack_file_sha1, 1);
+ if (written != nr_actual)
+ die("wrote %u objects while expecting %u", written, nr_actual);
}
static void write_index_file(void)
@@ -566,8 +724,8 @@ static void write_index_file(void)
uint32_t i;
struct sha1file *f = sha1create("%s-%s.%s", base_name,
sha1_to_hex(object_list_sha1), "idx");
- struct object_entry **list = sorted_by_sha;
- struct object_entry **last = list + nr_result;
+ struct object_entry **list = written_list;
+ struct object_entry **last = list + nr_written;
uint32_t array[256];
/*
@@ -583,7 +741,7 @@ static void write_index_file(void)
break;
next++;
}
- array[i] = htonl(next - sorted_by_sha);
+ array[i] = htonl(next - written_list);
list = next;
}
sha1write(f, array, 256 * 4);
@@ -591,8 +749,8 @@ static void write_index_file(void)
/*
* Write the actual SHA1 entries..
*/
- list = sorted_by_sha;
- for (i = 0; i < nr_result; i++) {
+ list = written_list;
+ for (i = 0; i < nr_written; i++) {
struct object_entry *entry = *list++;
uint32_t offset = htonl(entry->offset);
sha1write(f, &offset, 4);
@@ -1014,7 +1172,7 @@ static void check_object(struct object_entry *entry)
ofs = c & 127;
while (c & 128) {
ofs += 1;
- if (!ofs || ofs & ~(~0UL >> 7))
+ if (!ofs || ofs & ~(~0UL >> 1))
die("delta base offset overflow in pack for %s",
sha1_to_hex(entry->sha1));
c = buf[used_0++];
@@ -1058,6 +1216,7 @@ static void check_object(struct object_entry *entry)
}
entry->type = sha1_object_info(entry->sha1, &entry->size);
+ nr_skipped += entry->no_write = blob_limit && (unsigned long)entry->size >= blob_limit;
if (entry->type < 0)
die("unable to get type of object %s",
sha1_to_hex(entry->sha1));
@@ -1103,10 +1262,6 @@ static void get_object_details(void)
}
}
-typedef int (*entry_sort_t)(const struct object_entry *, const struct object_entry *);
-
-static entry_sort_t current_sort;
-
static int sort_comparator(const void *_a, const void *_b)
{
struct object_entry *a = *(struct object_entry **)_a;
@@ -1197,6 +1352,12 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
if (trg_entry->type != src_entry->type)
return -1;
+ /* Don't try deltas involving excessively large objects when
+ * pack size is limited
+ */
+ if (trg_entry->no_write || src_entry->no_write)
+ return -1;
+
/* We do not compute delta to *create* objects we are not
* going to pack.
*/
@@ -1538,6 +1699,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
const char **rp_av;
int rp_ac_alloc = 64;
int rp_ac;
+ int added = 0;
rp_av = xcalloc(rp_ac_alloc, sizeof(*rp_av));
@@ -1566,6 +1728,24 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
incremental = 1;
continue;
}
+ if (!strcmp("--pack-limit", arg)) {
+ offset_limit = 1UL << 31;
+ continue;
+ }
+ if (!prefixcmp(arg, "--pack-limit=")) {
+ char *end;
+ offset_limit = strtoul(arg+13, &end, 0);
+ if (!arg[13] || *end)
+ usage(pack_usage);
+ continue;
+ }
+ if (!prefixcmp(arg, "--blob-limit=")) {
+ char *end;
+ blob_limit = strtoul(arg+13, &end, 0);
+ if (!arg[13] || *end)
+ usage(pack_usage);
+ continue;
+ }
if (!prefixcmp(arg, "--window=")) {
char *end;
window = strtoul(arg+9, &end, 0);
@@ -1629,6 +1809,24 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
}
usage(pack_usage);
}
+ if ( offset_limit && !blob_limit && !pack_to_stdout ) {
+ /* need to limit blob size when creating bounded packs on disk */
+ blob_limit = offset_limit >> 2;
+ added |= 2;
+ }
+ if ( offset_limit && !no_reuse_delta ) {
+ /* didn't audit this case yet */
+ no_reuse_delta = 1;
+ added |= 1;
+ }
+ if ( added ) {
+ fprintf(stderr, "Added to command line:");
+ if ( added & 1 )
+ fprintf(stderr, " --no-reuse-delta");
+ if ( added & 2 )
+ fprintf(stderr, " --blob-limit=%u", blob_limit);
+ fprintf(stderr, "\n");
+ }
/* Traditionally "pack-objects [options] base extra" failed;
* we would however want to take refs parameter that would
@@ -1695,10 +1893,6 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
progress_update = 0;
}
write_pack_file();
- if (!pack_to_stdout) {
- write_index_file();
- puts(sha1_to_hex(object_list_sha1));
- }
}
if (progress)
fprintf(stderr, "Total %u (delta %u), reused %u (delta %u)\n",
diff --git a/builtin-unpack-objects.c b/builtin-unpack-objects.c
index 3956c56..84a6c5c 100644
--- a/builtin-unpack-objects.c
+++ b/builtin-unpack-objects.c
@@ -209,7 +209,7 @@ static void unpack_delta_entry(enum object_type type, unsigned long delta_size,
base_offset = c & 127;
while (c & 128) {
base_offset += 1;
- if (!base_offset || base_offset & ~(~0UL >> 7))
+ if (!base_offset || base_offset & ~(~0UL >> 1))
die("offset value overflow for delta base object");
pack = fill(1);
c = *pack;
diff --git a/csum-file.c b/csum-file.c
index b7174c6..e2bef75 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -35,7 +35,10 @@ int sha1close(struct sha1file *f, unsigned char *result, int update)
if (offset) {
SHA1_Update(&f->ctx, f->buffer, offset);
sha1flush(f, offset);
+ f->offset = 0;
}
+ if (update < 0)
+ return 0; /* only want to flush (no checksum write, no close) */
SHA1_Final(f->buffer, &f->ctx);
if (result)
hashcpy(result, f->buffer);
@@ -69,6 +72,44 @@ int sha1write(struct sha1file *f, void *buf, unsigned int count)
return 0;
}
+/*
+ * Save current position/state in output file
+ * (needs to be fast -- no system calls!)
+ */
+void sha1mark(struct sha1file *f, struct sha1posn *p)
+{
+ p->offset = f->offset;
+ p->ctx = f->ctx; /* larger than I'd like */
+}
+
+/*
+ * Restore previous position/state in output file
+ * (can be slow)
+ */
+void sha1undo(struct sha1file *f, struct sha1posn *p, long new, long old)
+{
+ if (new - old == (long)f->offset - (long)p->offset) {
+ f->ctx = p->ctx;
+ f->offset = p->offset;
+ return;
+ }
+ if (lseek(f->fd, (off_t)old - (off_t)p->offset, SEEK_SET) == (off_t)-1)
+ die("sha1 file '%s' undo seekback error (%s)", f->name, strerror(errno));
+ f->ctx = p->ctx;
+ while (p->offset) {
+ int ret = xread(f->fd, f->buffer, p->offset);
+ if (!ret)
+ die("sha1 file '%s' undo readback error. No data", f->name);
+ if (ret < 0)
+ die("sha1 file '%s' undo readback error (%s)", f->name, strerror(errno));
+ SHA1_Update(&f->ctx, f->buffer, ret);
+ p->offset -= ret;
+ }
+ if (ftruncate(f->fd, (off_t)old))
+ die("sha1 file '%s' undo truncate error (%s)", f->name, strerror(errno));
+ f->offset = 0;
+}
+
struct sha1file *sha1create(const char *fmt, ...)
{
struct sha1file *f;
diff --git a/csum-file.h b/csum-file.h
index 3ad1a99..780df17 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -9,11 +9,17 @@ struct sha1file {
char name[PATH_MAX];
unsigned char buffer[8192];
};
+struct sha1posn {
+ unsigned int offset;
+ SHA_CTX ctx;
+};
extern struct sha1file *sha1fd(int fd, const char *name);
extern struct sha1file *sha1create(const char *fmt, ...) __attribute__((format (printf, 1, 2)));
extern int sha1close(struct sha1file *, unsigned char *, int);
extern int sha1write(struct sha1file *, void *, unsigned int);
extern int sha1write_compressed(struct sha1file *, void *, unsigned int);
+extern void sha1mark(struct sha1file *, struct sha1posn *);
+extern void sha1undo(struct sha1file *, struct sha1posn *, long, long);
#endif
diff --git a/git-repack.sh b/git-repack.sh
index ddfa8b4..0299ff1 100755
--- a/git-repack.sh
+++ b/git-repack.sh
@@ -18,6 +18,9 @@ do
-q) quiet=-q ;;
-f) no_reuse_delta=--no-reuse-delta ;;
-l) local=--local ;;
+ --pack-limit) extra="$extra $1" ;;
+ --pack-limit=*) extra="$extra $1" ;;
+ --blob-limit=*) extra="$extra $1" ;;
--window=*) extra="$extra $1" ;;
--depth=*) extra="$extra $1" ;;
*) usage ;;
@@ -62,11 +65,12 @@ case ",$all_into_one," in
esac
args="$args $local $quiet $no_reuse_delta$extra"
-name=$(git-pack-objects --non-empty --all --reflog $args </dev/null "$PACKTMP") ||
+names=$(git-pack-objects --non-empty --all --reflog $args </dev/null "$PACKTMP") ||
exit 1
-if [ -z "$name" ]; then
+if [ -z "$names" ]; then
echo Nothing new to pack.
-else
+fi
+for name in $names ; do
chmod a-w "$PACKTMP-$name.pack"
chmod a-w "$PACKTMP-$name.idx"
if test "$quiet" != '-q'; then
@@ -92,7 +96,7 @@ else
exit 1
}
rm -f "$PACKDIR/old-pack-$name.pack" "$PACKDIR/old-pack-$name.idx"
-fi
+done
if test "$remove_redundant" = t
then
diff --git a/http-fetch.c b/http-fetch.c
index 557b403..09baedc 100644
--- a/http-fetch.c
+++ b/http-fetch.c
@@ -198,7 +198,7 @@ static void start_object_request(struct object_request *obj_req)
SHA1_Init(&obj_req->c);
if (prev_posn>0) {
prev_posn = 0;
- lseek(obj_req->local, SEEK_SET, 0);
+ lseek(obj_req->local, 0, SEEK_SET);
ftruncate(obj_req->local, 0);
}
}
diff --git a/http-push.c b/http-push.c
index 724720c..e3f7675 100644
--- a/http-push.c
+++ b/http-push.c
@@ -312,7 +312,7 @@ static void start_fetch_loose(struct transfer_request *request)
SHA1_Init(&request->c);
if (prev_posn>0) {
prev_posn = 0;
- lseek(request->local_fileno, SEEK_SET, 0);
+ lseek(request->local_fileno, 0, SEEK_SET);
ftruncate(request->local_fileno, 0);
}
}
diff --git a/index-pack.c b/index-pack.c
index 6284fe3..d35c4c4 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -249,7 +249,7 @@ static void *unpack_raw_entry(struct object_entry *obj, union delta_base *delta_
base_offset = c & 127;
while (c & 128) {
base_offset += 1;
- if (!base_offset || base_offset & ~(~0UL >> 7))
+ if (!base_offset || base_offset & ~(~0UL >> 1))
bad_object(obj->offset, "offset value overflow for delta base object");
p = fill(1);
c = *p;
diff --git a/sha1_file.c b/sha1_file.c
index 9c26038..7082d3e 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1149,7 +1149,7 @@ static off_t get_delta_base(struct packed_git *p,
base_offset = c & 127;
while (c & 128) {
base_offset += 1;
- if (!base_offset || base_offset & ~(~0UL >> 7))
+ if (!base_offset || base_offset & ~(~0UL >> 1))
die("offset value overflow for delta base object");
c = base_info[used++];
base_offset = (base_offset << 7) + (c & 127);
--
1.5.1.rc2.18.g9c88-dirty
* Re: [PATCH] git-{repack,pack-objects} accept --{pack,blob}-limit to control pack size
2007-04-04 20:40 [PATCH] git-{repack,pack-objects} accept --{pack,blob}-limit to control pack size Dana How
@ 2007-04-04 22:04 ` Nicolas Pitre
2007-04-04 22:55 ` Dana How
From: Nicolas Pitre @ 2007-04-04 22:04 UTC (permalink / raw)
To: Dana How; +Cc: Junio C Hamano, Git Mailing List
On Wed, 4 Apr 2007, Dana How wrote:
> The motivations are to better support portable media,
> older filesystems, and larger repositories without
> awkward enormous packfiles.
I wouldn't qualify "enormous" pack files as "awkward".
It will always be more efficient to have only one pack to deal with
(when possible of course).
> When --pack-limit[=N] is specified and --stdout is not,
> all bytes in the resulting packfile(s) appear at offsets
> less than N (which defaults to 1<<31). The default
> guarantees mmap(2) on 32b systems never sees signed off_t's.
> The object stream may be broken into multiple packfiles
> as a result, each properly and conventionally built.
>
This sounds fine. *However* how do you ensure that the second pack (or
subsequent packs) is self contained with regards to delta base objects
when it is _not_ meant to be a thin pack?
> When --stdout is also specified, all objects in the
> resulting packfile(s) _start_ at offsets less than N.
> All the packfiles appear concatenated on stdout,
> and each has its object count set to 0. The behavior
> without --stdout cannot be duplicated here since
> lseek(2) is not generally possible on stdout.
Please scrap that. There is simply no point making --pack-limit and
--stdout work together. If the amount of data to send over the GIT
protocol exceeds 4G (or whatever) it is the receiving end's business to
split it up _if_ it wants/has to. The alternative is just too ugly.
> When --blob-limit=N is specified, blobs whose uncompressed
> size is greater than or equal to N are omitted from the pack(s).
> If --pack-limit is specified, --blob-limit is not, and
> --stdout is not, then --blob-limit defaults to 1/4
> of the --pack-limit.
Is this really useful?
If you have a pack size limit and a blob cannot make it even in a
pack of its own then you're screwed anyway. It is much better to
simply fail the operation than leaving some blobs behind. IOW I
don't see the usefulness of this feature.
Nicolas
* Re: [PATCH] git-{repack,pack-objects} accept --{pack,blob}-limit to control pack size
2007-04-04 22:04 ` Nicolas Pitre
@ 2007-04-04 22:55 ` Dana How
2007-04-05 3:17 ` Nicolas Pitre
2007-04-05 6:54 ` Shawn O. Pearce
From: Dana How @ 2007-04-04 22:55 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Junio C Hamano, Git Mailing List, danahow
On 4/4/07, Nicolas Pitre <nico@cam.org> wrote:
> On Wed, 4 Apr 2007, Dana How wrote:
> > The motivations are to better support portable media,
> > older filesystems, and larger repositories without
> > awkward enormous packfiles.
>
> I wouldn't qualify "enormous" pack files as "awkward".
>
> It will always be more efficient to have only one pack to deal with
> (when possible of course).
Yes. "(when possible of course)" refers to the remaining motivations
I didn't explicitly mention: the 32b offset limit in .idx files,
and keeping the mmap code working on a 32b system.
I realize there are better solutions in the pipeline,
but I'd like to address this now (for my own use) and hopefully
also create something useful for 4GB-limited filesystems,
USB sticks, etc.
> > When --pack-limit[=N] is specified and --stdout is not,
> > all bytes in the resulting packfile(s) appear at offsets
> > less than N (which defaults to 1<<31). The default
> > guarantees mmap(2) on 32b systems never sees signed off_t's.
> > The object stream may be broken into multiple packfiles
> > as a result, each properly and conventionally built.
> >
>
> This sounds fine. *However* how do you ensure that the second pack (or
> subsequent packs) is self contained with regards to delta base objects
> when it is _not_ meant to be a thin pack?
Good question. Search for "int usable_delta" in the patch.
With --pack-limit (offset_limit in C), you can use a delta if the base
is in the same pack and already written out. The first condition
addresses your concern, and the second handles the case
where the base object gets pushed to the next pack.
These restrictions should be loosened for --thin-pack
but I didn't do that yet.
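To make that concrete, here is the test from the patch annotated,
one line per condition (all names are from builtin-pack-objects.c):

int usable_delta =
	!entry->delta          ? 0 :	/* not deltified at all */
	!offset_limit          ? 1 :	/* single pack: always safe */
	entry->delta->no_write ? 0 :	/* base skipped or in an earlier pack */
	entry->delta->offset   ? 1 : 0;	/* base already written to THIS pack */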
Also, --pack-limit turns on --no-reuse-delta.
This is not necessary, but not doing it would have meant
hacking up even more conditions which I didn't want to do
on a newbie submission.
> > When --stdout is also specified, all objects in the
> > resulting packfile(s) _start_ at offsets less than N.
> > All the packfiles appear concatenated on stdout,
> > and each has its object count set to 0. The behavior
> > without --stdout cannot be duplicated here since
> > lseek(2) is not generally possible on stdout.
>
> Please scrap that. There is simply no point making --pack-limit and
> --stdout work together. If the amount of data to send over the GIT
> protocol exceeds 4G (or whatever) it is the receiving end's business to
> split it up _if_ it wants/has to. The alternative is just too ugly.
I have a similar but much weaker reaction, but Linus specifically asked for
this combination to work. So I made it work as well as possible
given no seeking.
> > When --blob-limit=N is specified, blobs whose uncompressed
> > size is greater than or equal to N are omitted from the pack(s).
> > If --pack-limit is specified, --blob-limit is not, and
> > --stdout is not, then --blob-limit defaults to 1/4
> > of the --pack-limit.
> Is this really useful?
>
> If you have a pack size limit and a blob cannot make it even in a
> pack of its own then you're screwed anyway. It is much better to
> simply fail the operation than leaving some blobs behind. IOW I
> don't see the usefulness of this feature.
I agree if --stdout is specified. This is why --pack-limit && --stdout
DON'T turn on --blob-limit if not specified.
However, if I'm building packs inside a non-(web-)published
repository, I find this useful. First of all, if there's some blob bigger
than the --pack-limit I must drop it anyway -- it's not clear to me that
the mmap window code works on 32b systems
with >2GB-sized objects in packs. An "all-or-nothing" limitation
wouldn't be helpful to me.
But blobs even close to the packfile limit don't seem all that useful
to pack either (this of course is a weaker argument).
In the sample (p4) checkout I'm testing on [i.e. no history],
I have 56K+ objects consuming ~55GB uncompressed;
there are 9 blobs over 500MB each uncompressed.
I'm guessing packing them is not a performance advantage,
and I certainly wouldn't want frequently-used objects to be
stuck between them. [ I guess my repo stats are going to
be a bit strange ;-) ]
Packing plays two roles: archive storage (long life) and
transmission (possibly short life).
These seem to pull the packing code in different directions.
Thanks,
--
Dana L. How danahow@gmail.com +1 650 804 5991 cell
* Re: [PATCH] git-{repack,pack-objects} accept --{pack,blob}-limit to control pack size
2007-04-04 22:55 ` Dana How
@ 2007-04-05 3:17 ` Nicolas Pitre
2007-04-05 7:15 ` Shawn O. Pearce
2007-04-05 6:54 ` Shawn O. Pearce
From: Nicolas Pitre @ 2007-04-05 3:17 UTC (permalink / raw)
To: Dana How; +Cc: Junio C Hamano, Git Mailing List
On Wed, 4 Apr 2007, Dana How wrote:
> On 4/4/07, Nicolas Pitre <nico@cam.org> wrote:
> > On Wed, 4 Apr 2007, Dana How wrote:
> > > The motivations are to better support portable media,
> > > older filesystems, and larger repositories without
> > > awkward enormous packfiles.
> >
> > I wouldn't qualify "enormous" pack files as "awkward".
> >
> > It will always be more efficient to have only one pack to deal with
> > (when possible of course).
> Yes. "(when possible of course)" refers to the remaining motivations
> I didn't explicitly mention: the 32b offset limit in .idx files,
> and keeping the mmap code working on a 32b system.
> I realize there are better solutions in the pipeline,
> but I'd like to address this now (for my own use) and hopefully
> also create something useful for 4GB-limited filesystems,
> USB sticks, etc.
I think this is a valid feature to have, no problem there.
> > > When --pack-limit[=N] is specified and --stdout is not,
> > > all bytes in the resulting packfile(s) appear at offsets
> > > less than N (which defaults to 1<<31). The default
> > > guarantees mmap(2) on 32b systems never sees signed off_t's.
> > > The object stream may be broken into multiple packfiles
> > > as a result, each properly and conventionally built.
> > >
> >
> > This sounds fine. *However* how do you ensure that the second pack (or
> > subsequent packs) is self contained with regards to delta base objects
> > when it is _not_ meant to be a thin pack?
> Good question. Search for "int usable_delta" in the patch.
> With --pack-limit (offset_limit in C), you can use a delta if the base
> is in the same pack and already written out. The first condition
> addresses your concern, and the second handles the case
> where the base object gets pushed to the next pack.
> These restrictions should be loosened for --thin-pack
> but I didn't do that yet.
OK.
> Also, --pack-limit turns on --no-reuse-delta.
> This is not necessary, but not doing it would have meant
> hacking up even more conditions which I didn't want to do
> on a newbie submission.
Thing is, delta reusing is one of the most important features for good
performance, so it has to work.
But let's take a moment to talk about your "newbie submission". This
patch is _way_ too large and covers too many things at once. You'll
have to split it into many smaller logical units, pretty please. For
example, if you have to borrow code from fast-import then make a patch
that performs that code movement _only_. Then you can have another patch
that moves code around within builtin-pack-objects.c without changing any
functionality, just to prepare the way for the next patch which would
concentrate on the new feature only. Then make sure you have the
addition of the new capability separate from the patch that lets other
parts of GIT use it. Etc.
It would be much easier for us to comment on smaller patches, and
especially easier for you to rework one of the smaller patches than the
big one if need be.
I looked at your patch and there are things I like and other things I
don't like at all. Because it is a single large patch I can only NAK
the whole of it.
> > > When --stdout is also specified, all objects in the
> >
> > Please scrap that. There is simply no point making --pack-limit and
> > --stdout work together. If the amount of data to send over the GIT
> > protocol exceeds 4G (or whatever) it is the receiving end's business to
> > split it up _if_ it wants/has to. The alternative is just too ugly.
> I have a similar but much weaker reaction, but Linus specifically asked for
> this combination to work.
Linus is a total incompetent who knows nothing about programming or good
taste. So never ever listen to what he says. He is wrong, always.
And in this case he's more wrong than usual.
;-)
> So I made it work as well as possible
> given no seeking.
The fact is that there is no point in imposing split packs on the
receiving end if it can accept a single pack just fine. Pack layout is
and must remain a local policy, not something that the remote end felt
was good for the peer. This is true for the threshold value deciding
whether fetched packs are kept as is or unpacked as loose objects. The
same goes for the pack size limit.
And as you discovered yourself, it is quite messy to implement in
pack-objects when the output is a stream because you don't know in
advance how many objects will be sent so you have to invent special
markers to indicate the end of a pack. This in turn keeps index-pack
from knowing how much to expect or providing any progress report.
The appropriate location for the splitting of packs in a fetch
context is really within index-pack not in pack-objects. And it is
probably so much easier to do in index-pack anyway.
> > > When --blob-limit=N is specified, blobs whose uncompressed
> > > size is greater than or equal to N are omitted from the pack(s).
> > > If --pack-limit is specified, --blob-limit is not, and
> > > --stdout is not, then --blob-limit defaults to 1/4
> > > of the --pack-limit.
> > Is this really useful?
> >
> > If you have a pack size limit and a blob cannot make it even in a
> > pack of its own then you're screwed anyway. It is much better to
> > simply fail the operation than leaving some blobs behind. IOW I
> > don't see the usefulness of this feature.
> I agree if --stdout is specified. This is why --pack-limit && --stdout
> DON'T turn on --blob-limit if not specified.
Let's forget about --stdout. I hope I convinced you that it doesn't
make sense.
> However, if I'm building packs inside a non-(web-)published
> repository, I find this useful. First of all, if there's some blob bigger
> than the --pack-limit I must drop it anyway -- it's not clear to me that
> the mmap window code works on 32b systems
> with >2GB-sized objects in packs. An "all-or-nothing" limitation
> wouldn't be helpful to me.
But do you realize that if you drop even a single object you are
screwed? The fetching of a repo that is missing objects is a corrupt
fetch. You cannot just send a bunch of objects and leave a couple
behind in the hope that the peer will never need them. For one, the peer
would never be able to perform a successful git-fsck.
If you care about your local usage only then this feature is bogus as
well. The blob _has_ to exist as a loose object before ever being
packed. If you have blobs larger than 2GB or 4GB and your filesystem is
unable to cope with that then you're screwed already. You won't be able
to check them out, etc.
> But blobs even close to the packfile limit don't seem all that useful
> to pack either (this of course is a weaker argument).
In the extreme case where you have a blob that is near the pack size
limit then you'll end up with one pack per blob. No problem there. If
you're lucky then you might have 10 big blobs whose size is near the
pack size limit, but because they delta well against each other you
might be able to stuff them all in a single pack.
And if a blob is larger than the pack limit then either you should
increase your pack size limit, or if the pack limit is already near the
filesystem capacity for a single file then the blob cannot exist on that
filesystem in the first place.
So you still have to convince me this is a useful feature.
> Packing plays two roles: archive storage (long life) and
> transmission (possibly short life).
Packing is also a _huge_ performance boost. Don't underestimate that.
Nicolas
* Re: [PATCH] git-{repack,pack-objects} accept --{pack,blob}-limit to control pack size
2007-04-05 3:17 ` Nicolas Pitre
@ 2007-04-05 7:15 ` Shawn O. Pearce
2007-04-05 15:52 ` Nicolas Pitre
From: Shawn O. Pearce @ 2007-04-05 7:15 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Dana How, Junio C Hamano, Git Mailing List
Nicolas Pitre <nico@cam.org> wrote:
> But let's take a moment to talk about your "newbie submission". This
> patch is _way_ too large and covers too many things at once. You'll
> have to split it in many smaller logical units pretty please.
Yes, absolutely! If I was doing this series I'd probably have this
broken into at least 5, maybe even 8 patches. There's a lot of
stuff here, and some of it was completely unrelated to each other.
Like a fix for lseek arguments? Uhh, that's so not what this patch
was about, but is still a good fix. Nice catch. Let's get it in,
but properly please.
I also noticed a lot of "/* and here we could do... */" markers.
I think these belong more in the commit message than in the code
itself, as the code gets really cluttered after a while with the
todo markers that nobody has done yet. Odds are, if you don't do
it in this series, it won't get done for quite a while, if ever.
Mention it in your commit message, so others are aware of it,
and leave it at that.
> On Wed, 4 Apr 2007, Dana How wrote:
> > I have a similar but much weaker reaction, but Linus specifically asked for
> > this combination to work.
>
> Linus is a total incompetent who knows nothing about programming or good
> taste. So never ever listen to what he says. He is wrong, always.
Now now, I wouldn't go that far...
> And in this case he's more wrong than usual.
But yes, in this case I certainly I agree with you Nico. ;-)
> The appropriate location for the splitting of packs in a fetch
> context is really within index-pack not in pack-objects. And it is
> probably so much easier to do in index-pack anyway.
Or in a push context. I have started down the road of doing the
pack splitting in index-pack, but never was able to finish it.
It is totally doable in index-pack, but it turned out to be more
work than I had time for when I started it.
I still have the topic laying in my repository, and it is pushed
to my fastimport.git fork on repo.or.cz, if someone wants to take
a look at it. The patch is far from complete, hence no patch to
mailing list.
> > But blobs even close to the packfile limit don't seem all that useful
> > to pack either (this of course is a weaker argument).
>
> So you still have to convince me this is a useful feature.
Me too. Nico's right. And let's just assume we are in a bad case
where we need to put one (and only one) such blob into each packfile.
The packfile on disk will be an additional 32 bytes larger than if
it were to stay a loose file, and that's assuming the loose file was
created using the newer-style loose file format. This is unlikely
to cause a large disk space overhead.
Yeah, we do have an extra SHA-1 to compute when we create
the file, but then we also have another SHA-1 to verify that the
zlib stream is not corrupt, *without* unpacking the zlib stream.
That in and of itself could be useful for an fsck: we could have a
faster rule that says "if there's only a couple of blobs in this
packfile, and the thing is *big*, just do a check of the packfile
SHA-1 and assume the blobs are OK". So there actually might be
value to that extra SHA-1 after all.
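Something along these lines -- entirely hypothetical, since neither
verify_pack_trailer() nor verify_all_objects() exists today:

static int quick_fsck(struct packed_git *p, uint32_t nr_objects)
{
	/* a huge pack holding only a couple of objects: trust its
	 * trailer, a SHA-1 over the whole file, instead of inflating
	 * the few giant blobs inside it */
	if (nr_objects <= 2 && p->pack_size > (1 << 30))
		return verify_pack_trailer(p);
	return verify_all_objects(p);
}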
--
Shawn.
* Re: [PATCH] git-{repack,pack-objects} accept --{pack,blob}-limit to control pack size
2007-04-05 7:15 ` Shawn O. Pearce
@ 2007-04-05 15:52 ` Nicolas Pitre
From: Nicolas Pitre @ 2007-04-05 15:52 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: Dana How, Junio C Hamano, Git Mailing List
On Thu, 5 Apr 2007, Shawn O. Pearce wrote:
> Nicolas Pitre <nico@cam.org> wrote:
> > On Wed, 4 Apr 2007, Dana How wrote:
> > > I have a similar but much weaker reaction, but Linus specifically asked for
> > > this combination to work.
> >
> > Linus is a total incompetent who knows nothing about programming or good
> > taste. So never ever listen to what he says. He is wrong, always.
>
> Now now, I wouldn't go that far...
There was a well meant smiley along with my diatribe of course.
Nicolas
* Re: [PATCH] git-{repack,pack-objects} accept --{pack,blob}-limit to control pack size
2007-04-04 22:55 ` Dana How
2007-04-05 3:17 ` Nicolas Pitre
@ 2007-04-05 6:54 ` Shawn O. Pearce
2007-04-05 15:34 ` Linus Torvalds
From: Shawn O. Pearce @ 2007-04-05 6:54 UTC (permalink / raw)
To: Dana How; +Cc: Nicolas Pitre, Junio C Hamano, Git Mailing List
Dana How <danahow@gmail.com> wrote:
> it's not clear to me that
> the mmap window code works on 32b systems
> with >2GB-sized objects in packs.
Hmmph. Depends on the system.
For glibc we do try to set _FILE_OFFSET_BITS to 64, so that
even on 32 bit glibcs we pick up a 64 bit off_t. But really old
glibcs won't have any 64 bit off_t support no matter what we do.
And some systems don't know what _FILE_OFFSET_BITS is, so they
give us whatever size off_t they want.
Forget mmap. In a 32 bit off_t case there is absolutely no way
that open_packed_git_1 can verify the packfile signature (last
20 bytes) against the index if the packfile is larger than 2 GiB.
In this case we will have an off_t that is negative; subtract 20
from it, and it's still probably negative. I doubt SEEK_SET will
like a negative offset. This of course assumes that the earlier
fstat call actually succeeded on such a large file with
such a small off_t.
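The arithmetic is easy to see in isolation (an illustration, not git
code):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	/* a 3 GiB pack as seen through a 32-bit signed off_t */
	int32_t st_size = (int32_t)(3u * 1024 * 1024 * 1024);

	printf("st_size = %d, trailer offset = %d\n",
	       st_size, st_size - 20);
	return 0;	/* both values print as negative numbers */
}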
If we get through that open_packed_git_1 and actually verify the
signature, we either have some random sequence in the middle of the
packfile that matches the signature in the index (sort of unlikely),
or our off_t is actually large enough to handle the window position
tests that the use_pack and in_window functions perform. In this
latter case we shouldn't have any problems with the mmap code.
--
Shawn.
* Re: [PATCH] git-{repack,pack-objects} accept --{pack,blob}-limit to control pack size
2007-04-05 6:54 ` Shawn O. Pearce
@ 2007-04-05 15:34 ` Linus Torvalds
2007-04-05 15:53 ` Shawn O. Pearce
From: Linus Torvalds @ 2007-04-05 15:34 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: Dana How, Nicolas Pitre, Junio C Hamano, Git Mailing List
On Thu, 5 Apr 2007, Shawn O. Pearce wrote:
>
> For glibc we do try to set _FILE_OFFSET_BITS to 64
I repeat: that's _broken_.
It's in no way portable. It's a glibc horror. It should not be used.
It was a quick hack, but the real way to do it is to use "loff_t" and
"llseek".
But there simply isn't any way to do mmap() or pread() portably outside
the 32-bit area. So there are good reasons why we should just limit
pack-files to 32-bits on 32-bit architectures.
So I think that Dana's approach is just fundamentally correct. Yeah, we
should probably have a 64-bit index as a *possibility*, but it simply
isn't a replacement for "keep packs under 2GB in size".
Linus
* Re: [PATCH] git-{repack,pack-objects} accept --{pack,blob}-limit to control pack size
2007-04-05 15:34 ` Linus Torvalds
@ 2007-04-05 15:53 ` Shawn O. Pearce
2007-04-05 16:21 ` Linus Torvalds
From: Shawn O. Pearce @ 2007-04-05 15:53 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Dana How, Nicolas Pitre, Junio C Hamano, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Thu, 5 Apr 2007, Shawn O. Pearce wrote:
> >
> > For glibc we do try to set _FILE_OFFSET_BITS to 64
>
> I repeat: that's _broken_.
>
> It's in no way portable. It's a glibc horror. It should not be used.
>
> It was a quick hack, but the real way to do it is to use "loff_t" and
> "llseek".
Sure, OK, but that libc function doesn't exist on Mac OS X:
man llseek:
This function is Linux-specific, and should not be used in programs
intended to be portable.
So we'd need our own horror to wrap llseek as an lseek fake-alike
anyway. That's what that glibc horror does, and we didn't have to
write that code. :-)
> But there simply isn't any way to do mmap() or pread() portably outside
> the 32-bit area. So there are good reasons why we should just limit
> pack-files to 32-bits on 32-bit architectures.
Not unless your off_t is 64 bits, no. If it is 64 bits then you
should be able to do a pread or mmap starting past the first 2 GiB.
You just might not be able to ask for a mmap that exceeds 2 GiB in
size, as your size_t may not be that large. E.g., Darwin/Mac OS X.
Hence the sliding window mmap.
> So I think that Dana's approach is just fundamentally correct. Yeah, we
> should probably have a 64-bit index as a *possibility*, but it simply
> isn't a replacement for "keep packs under 2GB in size".
I'm not disagreeing. Some filesystems (FAT on a USB stick, Dana's
example) just don't want files larger than 2 GiB. So keeping them
small has a number of advantages. Plus they are easier to burn on
DVD: 2 packs per DVD. ;-)
I was simply trying to point out that the mmap code isn't broken
if the off_t is able to handle a file of that size; and if it can't
then other things are broken, like a simple lseek.
--
Shawn.
* Re: [PATCH] git-{repack,pack-objects} accept --{pack,blob}-limit to control pack size
2007-04-05 15:53 ` Shawn O. Pearce
@ 2007-04-05 16:21 ` Linus Torvalds
2007-04-05 17:14 ` Shawn O. Pearce
From: Linus Torvalds @ 2007-04-05 16:21 UTC (permalink / raw)
To: Shawn O. Pearce; +Cc: Dana How, Nicolas Pitre, Junio C Hamano, Git Mailing List
On Thu, 5 Apr 2007, Shawn O. Pearce wrote:
>
> Sure, OK, but that libc function doesn't exist on Mac OS X:
My bad. It's *not* Linux-specific like the OSX man-page apparently says;
it's very traditional. But the right name is "lseek64()" (and off64_t for
the size).
Of course, OSX didn't have some of the backwards-compatibility issues from
decades ago, so they just made off_t 64-bit by default. Maybe they don't
even bother to do the trivial portability things to support programs that
try to be portable..
Anyway, we should use open64(), lseek64() and friends to be as portable as
possible.. And if some system doesn't have them, just use the normal ops,
and pray that they are already 64-bit safe..
Linus
* Re: [PATCH] git-{repack,pack-objects} accept --{pack,blob}-limit to control pack size
2007-04-05 16:21 ` Linus Torvalds
@ 2007-04-05 17:14 ` Shawn O. Pearce
2007-04-05 21:17 ` Florian Weimer
From: Shawn O. Pearce @ 2007-04-05 17:14 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Dana How, Nicolas Pitre, Junio C Hamano, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> On Thu, 5 Apr 2007, Shawn O. Pearce wrote:
> >
> > Sure, OK, but that libc function doesn't exist on Mac OS X:
>
> My bad. It's *not* linux-specific like the OSX man-page apparently says,
> it's very traditional. But the right name is "lseek64()" (and offt64_t for
> the size).
Sorry for the confusion, that manpage snippet came from a Gentoo/x86
system. But I digress; you are right, the right interfaces to be
using here are lseek64 and open64.
Now those also don't exist on OSX, because as you pointed out, they
have no legacy to deal with and are just using sizeof off_t == 8.
So we'd probably have to do something like:
#ifndef _LFS_LARGEFILE
#define open64 open
#define lseek64 lseek
#endif
and then start using the open64/lseek64 variants instead. Or do
the reverse #define's. ;-)
--
Shawn.
* Re: [PATCH] git-{repack,pack-objects} accept --{pack,blob}-limit to control pack size
2007-04-05 17:14 ` Shawn O. Pearce
@ 2007-04-05 21:17 ` Florian Weimer
From: Florian Weimer @ 2007-04-05 21:17 UTC (permalink / raw)
To: git
* Shawn O. Pearce:
> So we'd probably have to do something like:
>
> #ifndef _LFS_LARGEFILE
> #define open64 open
> #define lseek64 lseek
> #endif
>
> and then start using the open64/lseek64 variants instead. Or do
> the reverse #define's. ;-)
This is actually what "#define _FILE_OFFSET_BITS 64" does. It's
usually not a bad idea per se, but you must not use off_t in library
header files if you do this.
lseek64 and friends are under -D_LARGEFILE64_SOURCE, it seems. Pity
we couldn't get rid of this mess when switching to libc6. 8-(
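The header pitfall in miniature -- an illustrative fragment, not from
any real library:

/* some-lib.h: public header of a library built WITHOUT
 * _FILE_OFFSET_BITS=64, so off_t is 4 bytes here ... */
#include <sys/types.h>

struct lib_stream {
	int   fd;
	off_t position;	/* ... but 8 bytes in an application compiled
			 * with -D_FILE_OFFSET_BITS=64, so the struct
			 * layouts no longer match across the ABI */
};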