From: "Dana How" <danahow@gmail.com>
To: "Linus Torvalds" <torvalds@linux-foundation.org>
Cc: git@vger.kernel.org, danahow@gmail.com
Subject: Re: [RFC] Packing large repositories
Date: Mon, 2 Apr 2007 14:19:29 -0700 [thread overview]
Message-ID: <56b7f5510704021419s4f8635abs8544df2f1065a5d4@mail.gmail.com> (raw)
In-Reply-To: <Pine.LNX.4.64.0703280943450.6730@woody.linux-foundation.org>
[-- Attachment #1: Type: text/plain, Size: 6615 bytes --]
On 3/28/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > I just started experimenting with using git ...
> > Part of a checkout is about 55GB;
> > after an initial commit and packing I have a 20GB+ packfile.
> > Of course this is unusable, ... . I conclude that
> > for such large projects, git-repack/git-pack-objects would need
> > new options to control maximum packfile size.
>
> Either that, or update the index file format. I think that your approach
> of having a size limiter is actually the *better* one, though.
>
> > [ I don't think this affects git-{fetch,receive,send}-pack
> > since apparently only the pack is transferred and it only uses
> > the variable-length size and delta base offset encodings
> > (of course the accumulation of the 7 bit chunks in a 32b
> > variable would need to be corrected, but at least the data
> > format doesn't change).]
>
> Well, it does affect fetching, in that "git index-pack" obviously would
> also need to be taught how to split the resulting indexed packs up into
> multiple smaller ones from one large incoming one. But that shouldn't be
> fundamentally hard either, apart from the inconvenience of having to
> rewrite the object count in the pack headers..
>
> To avoid that issue, it may be that it's actually better to split things
> up at pack-generation time *even* for the case of --stdout, exactly so
> that "git index-pack" wouldn't have to split things up (we potentially
> know a lot more about object sizes up-front at pack-generation time than
> we know at re-indexing).

The attached patch adds a --pack-limit[=N] option to
git-repack/git-pack-objects. N defaults to 1<<31, and with
--pack-limit in effect no packfile can be equal to or larger
than N bytes. A --blob-limit=N option is also added (see below).

My original plan was simply to ensure that no object *started*
at a file offset not representable in 31 bits. However, I became
concerned about the arithmetic involved when mmap'ing a pack, so
I decided to make sure *all* bytes live at offsets representable
in 31 bits. Consequently, after an object is written out, the
new offset is checked; if the limit has been exceeded, the write
is rolled back (see sha1mark/sha1undo). This is awkward and
inefficient, but it yields packs closer to the limit and happens
too infrequently to have much impact.
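
In outline, the disk path boils down to this (a condensed sketch
of the logic in write_one below, not the exact code; error
handling and statistics bookkeeping omitted):

    /* Condensed from write_one() in the attached patch: write one
     * object, then roll the pack back if it grew past the limit.
     * Returns the new end offset, or 0 meaning "start a new pack".
     */
    static off_t write_one_bounded(struct sha1file *f,
                                   struct object_entry *e,
                                   off_t offset)
    {
        struct sha1posn posn;

        sha1mark(f, &posn);            /* snapshot offset + SHA_CTX */
        e->offset = offset;
        offset += write_object(f, e);  /* optimistic write */
        /* "- 20" reserves room for the trailing pack SHA1 */
        if ((unsigned long)offset >= (unsigned long)(offset_limit - 20)) {
            sha1undo(f, &posn, offset, e->offset); /* truncate + restore */
            e->offset = 0;
            return 0;
        }
        return offset;
    }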

However, there are really two modes when packing: packing to
disk and packing to stdout. Since you can't roll back a write on
stdout, the original file-offset-limit technique is used when
--stdout is specified. [Note: I did not *test* the
--pack-limit && --stdout combination.]

To fully guarantee that a pack file doesn't exceed a certain
size, objects above that size must not be packed into it. But I
think this makes sense -- I don't see much advantage in packing
a 100MB+ object into a pack, except for fetch/send, which is a
serial stream without an index anyway. Thus, when --stdout is
not specified, this patch automatically excludes any object
whose uncompressed size is 1/4 or more of the packfile size
limit. This behavior can be altered with an explicit
--blob-limit=N option.

Two interesting twists presented themselves. First, the header
at the beginning of the packfile contains the number of objects,
and this header is included in the final SHA1. But I don't know
the final count until the limit is reached. Consequently the
header must be rewritten and the entire file rescanned to
compute the correct checksum (see fixup_header_footer in the
patch). This already happens in two other places in git.

Secondly, when using --pack-limit with --stdout, the header
can't be rewritten. Instead the object count in the header is
left at 0 to flag that it's wrong. The end of an individual pack
inside a multi-pack stream COULD then be detected by checking,
after each object, whether the next 20 bytes are equal to the
SHA1 of everything that's come before. I've made no additional
effort beyond this minimal solution because it's not clear that
splitting a pack up at the transmitter is better than at the
receiver. An alternative method is to add, before the final
SHA1, a last object of type OBJ_NONE and length 0 (thus a single
zero byte). This would function as an EOF marker. I've indicated
where it would go in write_pack_file, but didn't put it in since
the current code doesn't tolerate a 0 object count in the header
anyway (yet?).
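
For illustration, a receiver could recognize that boundary
roughly like this (an untested sketch, not part of the patch;
it assumes the reader keeps a running SHA1 of the current pack's
bytes and can peek at the next 20 bytes of the stream):

    #include <string.h>
    #include <openssl/sha.h>

    /* Does the running SHA1 of this pack's bytes match the next
     * 20 bytes of the stream?  If so, those bytes are the pack
     * trailer and a new pack (or EOF) follows.
     */
    static int at_pack_boundary(const SHA_CTX *running,
                                const unsigned char *next20)
    {
        SHA_CTX copy = *running; /* keep the running hash intact */
        unsigned char sha1[20];

        SHA1_Final(sha1, &copy);
        return !memcmp(sha1, next20, 20);
    }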

[Note: I have *not* started on teaching git-index-pack etc. how
to read such concatenated split packs, since (a) I'd like to see
which way people prefer, and (b) I don't plan on using the
feature anyway -- and I'm wondering if I'm alone in that
reaction.]

Some code has been added, but very few function relationships
have changed. The exception is that write_pack_file now calls
write_index_file directly, since write_pack_file decides when to
split packs and thus must write each index before moving on to
the next pack.

In response to my original post, I've seen some emails about
changing the pack file/index file format. This is exactly what I
*didn't* want to do, since (1) it would delay a feature I'd like
to use now, (2) the current format is better than people seem to
realize, and (3) it would create yet another flag in the config
file to help phase in a new feature over a year or two.

If, however, there are other pent-up reasons for changing the
format which might make it happen sometime soon, I can see some
small tweaks that could be useful:
* [For stdout/serial access:] Tolerate "0" for object count in a .pack
file; it would mean look for the pack end by either matching a SHA1 or
looking for an OBJ_NONE/0 record, all as explained above.
(The point is to avoid any need to rescan a file to rebuild checksums.)

* [For disk/random access:] Don't change the current .pack/.idx
files, but add a third file type, a "super index", with a format
similar to .idx. It would map sorted SHA1s to (pack#,offset)
pairs, either in one table of triples or in two parallel tables,
one of SHA1s and the other of pairs. It would probably only be
used if mentioned in objects/info/packs (and automatically
ignored if older than objects/info/packs?). It could be searched
by taking advantage of the uniform SHA1 distribution recently
discussed. There would be at most one such file per repository;
perhaps the .idx files from which it was generated could then be
removed. For safety the "super index" could contain a small
table of the SHA1s of the packs it indexes. A rough sketch of
such a layout follows this list.
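
As a strawman (every name and field width here is just my guess
for illustration -- nothing below exists in git today):

    #include <stdint.h>

    /* All integers big-endian, as in .idx files. */
    struct super_index_header {
        uint32_t signature;  /* e.g. "SIDX"                      */
        uint32_t version;    /* 1                                */
        uint32_t nr_packs;   /* packs covered by this index      */
        uint32_t nr_objects; /* total objects across those packs */
    };
    /* Followed by:
     *  - nr_packs   x 20-byte pack SHA1s (the safety table)
     *  - 256        x uint32_t fanout entries, as in .idx
     *  - nr_objects x 20-byte object SHA1s, sorted
     *  - nr_objects x { uint32_t pack_nr; uint32_t offset; }
     *    in the same order as the SHA1 table
     */
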
Thanks,
--
Dana L. How danahow@gmail.com +1 650 804 5991 cell
cat GIT-VERSION-FILE
GIT_VERSION = 1.5.1.rc2.18.g9c88-dirty
[-- Attachment #2: large.patch --]
[-- Type: application/octet-stream, Size: 20042 bytes --]
diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index b5f9648..0a11113 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -14,8 +14,8 @@
#include "list-objects.h"
static const char pack_usage[] = "\
-git-pack-objects [{ -q | --progress | --all-progress }] \n\
- [--local] [--incremental] [--window=N] [--depth=N] \n\
+git-pack-objects [{ -q | --progress | --all-progress }] [--pack-limit[=N]]\n\
+ [--blob-limit=N] [--local] [--incremental] [--window=N] [--depth=N]\n\
[--no-reuse-delta] [--delta-base-offset] [--non-empty] \n\
[--revs [--unpacked | --all]*] [--reflog] [--stdout | base-name] \n\
[<ref-list | <object-list]";
@@ -29,6 +29,7 @@ struct object_entry {
unsigned int depth; /* delta depth */
unsigned int delta_limit; /* base adjustment for in-pack delta */
unsigned int hash; /* name hint hash */
+ char no_write; /* flag: written to previous pack OR too big */
enum object_type type;
enum object_type in_pack_type; /* could be delta */
unsigned long delta_size; /* delta data size (uncompressed) */
@@ -68,7 +69,10 @@ static int allow_ofs_delta;
static struct object_entry **sorted_by_sha, **sorted_by_type;
static struct object_entry *objects;
-static uint32_t nr_objects, nr_alloc, nr_result;
+static struct object_entry **written_list;
+static uint32_t nr_objects, nr_alloc, nr_result, nr_written, nr_skipped = 0;
+static uint32_t offset_limit = 0;
+static uint32_t blob_limit = 0;
static const char *base_name;
static unsigned char pack_file_sha1[20];
static int progress = 1;
@@ -415,13 +419,17 @@ static off_t write_object(struct sha1file *f,
}
if (!to_reuse) {
+ int usable_delta = !entry->delta ? 0 :
+ !offset_limit ? 1 :
+ entry->delta->no_write ? 0 :
+ entry->delta->offset ? 1 : 0;
buf = read_sha1_file(entry->sha1, &type, &size);
if (!buf)
die("unable to read %s", sha1_to_hex(entry->sha1));
if (size != entry->size)
die("object %s size inconsistency (%lu vs %lu)",
sha1_to_hex(entry->sha1), size, entry->size);
- if (entry->delta) {
+ if (usable_delta) {
buf = delta_against(buf, size, entry);
size = entry->delta_size;
obj_type = (allow_ofs_delta && entry->delta->offset) ?
@@ -503,62 +511,212 @@ static off_t write_one(struct sha1file *f,
struct object_entry *e,
off_t offset)
{
- if (e->offset || e->preferred_base)
+ struct sha1posn posn;
+ uint32_t save_written = 0, save_written_delta = 0;
+ uint32_t save_reused = 0, save_reused_delta = 0;
+ if (e->offset || e->preferred_base || e->no_write)
/* offset starts from header size and cannot be zero
* if it is written already.
*/
return offset;
/* if we are deltified, write out its base object first. */
- if (e->delta)
+ if (e->delta) {
offset = write_one(f, e->delta, offset);
+ if (!offset)
+ return offset;
+ }
+ if (offset_limit) {
+ if ( !pack_to_stdout ) {
+ /* save state before write for possible later seekback */
+ save_written = written, save_written_delta = written_delta;
+ save_reused = reused, save_reused_delta = reused_delta;
+ sha1mark(f, &posn);
+ } else if ( (unsigned long)offset >= (unsigned long)offset_limit )
+ /*
+ * This ensures that no object's offset in the pack
+ * exceeds or is equal to the offset_limit. It is
+ * a looser way of limiting packsize w/o seeking and
+ * is used in --stdout mode.
+ */
+ return 0;
+ }
e->offset = offset;
- return offset + write_object(f, e);
+ offset += write_object(f, e);
+ /*
+ * This ensures that the packfile size never exceeds or matches
+ * the offset_limit if supplied. The "20" is for the final SHA1.
+ * This limit isn't used with --stdout since it requires seeking.
+ */
+ if (offset_limit && !pack_to_stdout &&
+ (unsigned long)offset >= (unsigned long)(offset_limit - 20)) {
+ written = save_written, written_delta = save_written_delta;
+ reused = save_reused, reused_delta = save_reused_delta;
+ sha1undo(f, &posn, offset, e->offset);
+ e->offset = 0;
+ return 0;
+ }
+ *written_list++ = e;
+ return offset;
+}
+
+/*
+ * Move this, the version in fast-import.c,
+ * and index_pack.c:readjust_pack_header_and_sha1 into sha1_file.c ?
+ */
+static void fixup_header_footer(int pack_fd, unsigned char *pack_file_sha1,
+ char *pack_name, uint32_t object_count)
+{
+ static const int buf_sz = 128 * 1024;
+ SHA_CTX c;
+ struct pack_header hdr;
+ char *buf;
+
+ if (lseek(pack_fd, 0, SEEK_SET) != 0)
+ die("Failed seeking to start: %s", strerror(errno));
+ if (read_in_full(pack_fd, &hdr, sizeof(hdr)) != sizeof(hdr))
+ die("Unable to reread header of %s", pack_name);
+ if (lseek(pack_fd, 0, SEEK_SET) != 0)
+ die("Failed seeking to start: %s", strerror(errno));
+ hdr.hdr_entries = htonl(object_count);
+ write_or_die(pack_fd, &hdr, sizeof(hdr));
+
+ SHA1_Init(&c);
+ SHA1_Update(&c, &hdr, sizeof(hdr));
+
+ buf = xmalloc(buf_sz);
+ for (;;) {
+ ssize_t n = xread(pack_fd, buf, buf_sz);
+ if (!n)
+ break;
+ if (n < 0)
+ die("Failed to checksum %s", pack_name);
+ SHA1_Update(&c, buf, n);
+ }
+ free(buf);
+
+ SHA1_Final(pack_file_sha1, &c);
+ write_or_die(pack_fd, pack_file_sha1, 20);
+ close(pack_fd);
}
+typedef int (*entry_sort_t)(const struct object_entry *, const struct object_entry *);
+
+static entry_sort_t current_sort;
+
+/* forward declarations for write_pack_file */
+/* (probably should move sorting stuff up here) */
+static int sort_comparator(const void *_a, const void *_b);
+static int sha1_sort(const struct object_entry *a, const struct object_entry *b);
+static void write_index_file(void);
+
static void write_pack_file(void)
{
- uint32_t i;
+ uint32_t i, j;
struct sha1file *f;
off_t offset;
struct pack_header hdr;
unsigned last_percent = 999;
- int do_progress = progress;
+ int do_progress = progress >> !base_name;
+ char oldname[PATH_MAX];
+ int pack_fd;
+ struct object_entry **list;
+ SHA_CTX ctx;
+ uint32_t nr_actual = nr_result - nr_skipped;
- if (!base_name) {
- f = sha1fd(1, "<stdout>");
- do_progress >>= 1;
- }
- else
- f = sha1create("%s-%s.%s", base_name,
- sha1_to_hex(object_list_sha1), "pack");
if (do_progress)
- fprintf(stderr, "Writing %u objects.\n", nr_result);
-
- hdr.hdr_signature = htonl(PACK_SIGNATURE);
- hdr.hdr_version = htonl(PACK_VERSION);
- hdr.hdr_entries = htonl(nr_result);
- sha1write(f, &hdr, sizeof(hdr));
- offset = sizeof(hdr);
- if (!nr_result)
- goto done;
- for (i = 0; i < nr_objects; i++) {
- offset = write_one(f, objects + i, offset);
- if (do_progress) {
- unsigned percent = written * 100 / nr_result;
- if (progress_update || percent != last_percent) {
- fprintf(stderr, "%4u%% (%u/%u) done\r",
- percent, written, nr_result);
- progress_update = 0;
- last_percent = percent;
+ fprintf(stderr, "Writing %u objects.\n", nr_actual);
+ written_list = list = xmalloc(nr_objects * sizeof(struct object_entry *));
+
+ for (i = 0; i < nr_objects;) {
+ if (!base_name) {
+ f = sha1fd(pack_fd = 1, "<stdout>");
+ }
+ else {
+ int len = snprintf(oldname, sizeof oldname, "%s-XXXXXX", base_name);
+ if (len >= PATH_MAX)
+ die("excessive pathname length for initial packfile name");
+ pack_fd = mkstemp(oldname);
+ if (pack_fd < 0)
+ die("can't create %s: %s", oldname, strerror(errno));
+ f = sha1fd(pack_fd, oldname);
+ }
+
+ hdr.hdr_signature = htonl(PACK_SIGNATURE);
+ hdr.hdr_version = htonl(PACK_VERSION);
+ hdr.hdr_entries = htonl(!base_name && offset_limit ? 0 : nr_actual);
+ sha1write(f, &hdr, sizeof(hdr));
+ offset = sizeof(hdr);
+ for (; i < nr_objects; i++) {
+ off_t offset_one = write_one(f, objects + i, offset);
+ if (!offset_one)
+ break;
+ offset = offset_one;
+ if (do_progress) {
+ unsigned percent = written * 100 / nr_actual;
+ if (progress_update || percent != last_percent) {
+ fprintf(stderr, "%4u%% (%u/%u) done\r",
+ percent, written, nr_actual);
+ progress_update = 0;
+ last_percent = percent;
+ }
}
}
+ nr_written = written_list - list;
+ written_list = list;
+
+ /*
+ * Write terminator record here if desired:
+ * type=OBJ_NONE, len=0; this is a zero byte.
+ */
+
+ /*
+ * Did we write the wrong # entries in the header?
+ * If so, rewrite it like in fast-import (gackk).
+ */
+ if ( !base_name || nr_written == nr_actual ) {
+ sha1close(f, pack_file_sha1, 1);
+ } else {
+ sha1close(f, pack_file_sha1, -1);
+ fixup_header_footer(pack_fd, pack_file_sha1, oldname, nr_written);
+ }
+
+ /*
+ * compute object_list_sha1 of sorted sha's we just wrote out;
+ * we also mark these objects as written
+ */
+ current_sort = sha1_sort;
+ qsort(list, nr_written, sizeof(struct object_entry *), sort_comparator);
+ SHA1_Init(&ctx);
+ for (j = 0; j < nr_written; j++) {
+ struct object_entry *entry = *list++;
+ entry->no_write = 1;
+ SHA1_Update(&ctx, entry->sha1, 20);
+ }
+ SHA1_Final(object_list_sha1, &ctx);
+ list = written_list;
+ /*
+ * now we can rename the pack correctly and write the index file
+ */
+ if (base_name) {
+ char newname[PATH_MAX];
+ int len = snprintf(newname, sizeof newname, "%s-%s.%s",
+ base_name, sha1_to_hex(object_list_sha1), "pack");
+ if (len >= PATH_MAX)
+ die("excessive pathname length for final packfile name");
+ if (rename(oldname, newname) < 0)
+ die("could not rename the pack file");
+ }
+ if (!pack_to_stdout) {
+ write_index_file();
+ puts(sha1_to_hex(object_list_sha1));
+ }
}
- if (do_progress)
+
+ free(written_list);
+ if (nr_actual && do_progress)
fputc('\n', stderr);
- done:
- if (written != nr_result)
- die("wrote %u objects while expecting %u", written, nr_result);
- sha1close(f, pack_file_sha1, 1);
+ if (written != nr_actual)
+ die("wrote %u objects while expecting %u", written, nr_actual);
}
static void write_index_file(void)
@@ -566,8 +724,8 @@ static void write_index_file(void)
uint32_t i;
struct sha1file *f = sha1create("%s-%s.%s", base_name,
sha1_to_hex(object_list_sha1), "idx");
- struct object_entry **list = sorted_by_sha;
- struct object_entry **last = list + nr_result;
+ struct object_entry **list = written_list;
+ struct object_entry **last = list + nr_written;
uint32_t array[256];
/*
@@ -583,7 +741,7 @@ static void write_index_file(void)
break;
next++;
}
- array[i] = htonl(next - sorted_by_sha);
+ array[i] = htonl(next - written_list);
list = next;
}
sha1write(f, array, 256 * 4);
@@ -591,8 +749,8 @@ static void write_index_file(void)
/*
* Write the actual SHA1 entries..
*/
- list = sorted_by_sha;
- for (i = 0; i < nr_result; i++) {
+ list = written_list;
+ for (i = 0; i < nr_written; i++) {
struct object_entry *entry = *list++;
uint32_t offset = htonl(entry->offset);
sha1write(f, &offset, 4);
@@ -1014,7 +1172,7 @@ static void check_object(struct object_entry *entry)
ofs = c & 127;
while (c & 128) {
ofs += 1;
- if (!ofs || ofs & ~(~0UL >> 7))
+ if (!ofs || ofs & ~(~0UL >> 1))
die("delta base offset overflow in pack for %s",
sha1_to_hex(entry->sha1));
c = buf[used_0++];
@@ -1058,6 +1216,7 @@ static void check_object(struct object_entry *entry)
}
entry->type = sha1_object_info(entry->sha1, &entry->size);
+ nr_skipped += entry->no_write = blob_limit && (unsigned long)entry->size >= blob_limit;
if (entry->type < 0)
die("unable to get type of object %s",
sha1_to_hex(entry->sha1));
@@ -1103,10 +1262,6 @@ static void get_object_details(void)
}
}
-typedef int (*entry_sort_t)(const struct object_entry *, const struct object_entry *);
-
-static entry_sort_t current_sort;
-
static int sort_comparator(const void *_a, const void *_b)
{
struct object_entry *a = *(struct object_entry **)_a;
@@ -1197,6 +1352,12 @@ static int try_delta(struct unpacked *trg, struct unpacked *src,
if (trg_entry->type != src_entry->type)
return -1;
+ /* Don't try deltas involving excessively large objects when
+ * pack size is limited
+ */
+ if (trg_entry->no_write || src_entry->no_write)
+ return -1;
+
/* We do not compute delta to *create* objects we are not
* going to pack.
*/
@@ -1538,6 +1699,7 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
const char **rp_av;
int rp_ac_alloc = 64;
int rp_ac;
+ int added = 0;
rp_av = xcalloc(rp_ac_alloc, sizeof(*rp_av));
@@ -1566,6 +1728,24 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
incremental = 1;
continue;
}
+ if (!strcmp("--pack-limit", arg)) {
+ offset_limit = 1UL << 31;
+ continue;
+ }
+ if (!prefixcmp(arg, "--pack-limit=")) {
+ char *end;
+ offset_limit = strtoul(arg+13, &end, 0);
+ if (!arg[13] || *end)
+ usage(pack_usage);
+ continue;
+ }
+ if (!prefixcmp(arg, "--blob-limit=")) {
+ char *end;
+ blob_limit = strtoul(arg+13, &end, 0);
+ if (!arg[13] || *end)
+ usage(pack_usage);
+ continue;
+ }
if (!prefixcmp(arg, "--window=")) {
char *end;
window = strtoul(arg+9, &end, 0);
@@ -1629,6 +1809,24 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
}
usage(pack_usage);
}
+ if ( offset_limit && !blob_limit && !pack_to_stdout ) {
+ /* need to limit blob size when creating bounded packs on disk */
+ blob_limit = offset_limit >> 2;
+ added |= 2;
+ }
+ if ( offset_limit && !no_reuse_delta ) {
+ /* didn't audit this case yet */
+ no_reuse_delta = 1;
+ added |= 1;
+ }
+ if ( added ) {
+ fprintf(stderr, "Added to command line:");
+ if ( added & 1 )
+ fprintf(stderr, " --no-reuse-delta");
+ if ( added & 2 )
+ fprintf(stderr, " --blob-limit=%u", blob_limit);
+ fprintf(stderr, "\n");
+ }
/* Traditionally "pack-objects [options] base extra" failed;
* we would however want to take refs parameter that would
@@ -1695,10 +1893,6 @@ int cmd_pack_objects(int argc, const char **argv, const char *prefix)
progress_update = 0;
}
write_pack_file();
- if (!pack_to_stdout) {
- write_index_file();
- puts(sha1_to_hex(object_list_sha1));
- }
}
if (progress)
fprintf(stderr, "Total %u (delta %u), reused %u (delta %u)\n",
diff --git a/builtin-unpack-objects.c b/builtin-unpack-objects.c
index 3956c56..84a6c5c 100644
--- a/builtin-unpack-objects.c
+++ b/builtin-unpack-objects.c
@@ -209,7 +209,7 @@ static void unpack_delta_entry(enum object_type type, unsigned long delta_size,
base_offset = c & 127;
while (c & 128) {
base_offset += 1;
- if (!base_offset || base_offset & ~(~0UL >> 7))
+ if (!base_offset || base_offset & ~(~0UL >> 1))
die("offset value overflow for delta base object");
pack = fill(1);
c = *pack;
diff --git a/csum-file.c b/csum-file.c
index b7174c6..e2bef75 100644
--- a/csum-file.c
+++ b/csum-file.c
@@ -35,7 +35,10 @@ int sha1close(struct sha1file *f, unsigned char *result, int update)
if (offset) {
SHA1_Update(&f->ctx, f->buffer, offset);
sha1flush(f, offset);
+ f->offset = 0;
}
+ if (update < 0)
+ return 0; /* only want to flush (no checksum write, no close) */
SHA1_Final(f->buffer, &f->ctx);
if (result)
hashcpy(result, f->buffer);
@@ -69,6 +72,44 @@ int sha1write(struct sha1file *f, void *buf, unsigned int count)
return 0;
}
+/*
+ * Save current position/state in output file
+ * (needs to be fast -- no system calls!)
+ */
+void sha1mark(struct sha1file *f, struct sha1posn *p)
+{
+ p->offset = f->offset;
+ p->ctx = f->ctx; /* larger than I'd like */
+}
+
+/*
+ * Restore previous position/state in output file
+ * (can be slow)
+ */
+void sha1undo(struct sha1file *f, struct sha1posn *p, long new, long old)
+{
+ if (new - old == (long)f->offset - (long)p->offset) {
+ f->ctx = p->ctx;
+ f->offset = p->offset;
+ return;
+ }
+ if (lseek(f->fd, (off_t)old - (off_t)p->offset, SEEK_SET) == (off_t)-1)
+ die("sha1 file '%s' undo seekback error (%s)", f->name, strerror(errno));
+ f->ctx = p->ctx;
+ while (p->offset) {
+ int ret = xread(f->fd, f->buffer, p->offset);
+ if (!ret)
+ die("sha1 file '%s' undo readback error. No data", f->name);
+ if (ret < 0)
+ die("sha1 file '%s' undo readback error (%s)", f->name, strerror(errno));
+ SHA1_Update(&f->ctx, f->buffer, ret);
+ p->offset -= ret;
+ }
+ if (ftruncate(f->fd, (off_t)old))
+ die("sha1 file '%s' undo truncate error (%s)", f->name, strerror(errno));
+ f->offset = 0;
+}
+
struct sha1file *sha1create(const char *fmt, ...)
{
struct sha1file *f;
diff --git a/csum-file.h b/csum-file.h
index 3ad1a99..780df17 100644
--- a/csum-file.h
+++ b/csum-file.h
@@ -9,11 +9,17 @@ struct sha1file {
char name[PATH_MAX];
unsigned char buffer[8192];
};
+struct sha1posn {
+ unsigned int offset;
+ SHA_CTX ctx;
+};
extern struct sha1file *sha1fd(int fd, const char *name);
extern struct sha1file *sha1create(const char *fmt, ...) __attribute__((format (printf, 1, 2)));
extern int sha1close(struct sha1file *, unsigned char *, int);
extern int sha1write(struct sha1file *, void *, unsigned int);
extern int sha1write_compressed(struct sha1file *, void *, unsigned int);
+extern void sha1mark(struct sha1file *, struct sha1posn *);
+extern void sha1undo(struct sha1file *, struct sha1posn *, long, long);
#endif
diff --git a/git-repack.sh b/git-repack.sh
index ddfa8b4..0299ff1 100755
--- a/git-repack.sh
+++ b/git-repack.sh
@@ -18,6 +18,9 @@ do
-q) quiet=-q ;;
-f) no_reuse_delta=--no-reuse-delta ;;
-l) local=--local ;;
+ --pack-limit) extra="$extra $1" ;;
+ --pack-limit=*) extra="$extra $1" ;;
+ --blob-limit=*) extra="$extra $1" ;;
--window=*) extra="$extra $1" ;;
--depth=*) extra="$extra $1" ;;
*) usage ;;
@@ -62,11 +65,12 @@ case ",$all_into_one," in
esac
args="$args $local $quiet $no_reuse_delta$extra"
-name=$(git-pack-objects --non-empty --all --reflog $args </dev/null "$PACKTMP") ||
+names=$(git-pack-objects --non-empty --all --reflog $args </dev/null "$PACKTMP") ||
exit 1
-if [ -z "$name" ]; then
+if [ -z "$names" ]; then
echo Nothing new to pack.
-else
+fi
+for name in $names ; do
chmod a-w "$PACKTMP-$name.pack"
chmod a-w "$PACKTMP-$name.idx"
if test "$quiet" != '-q'; then
@@ -92,7 +96,7 @@ else
exit 1
}
rm -f "$PACKDIR/old-pack-$name.pack" "$PACKDIR/old-pack-$name.idx"
-fi
+done
if test "$remove_redundant" = t
then
diff --git a/http-fetch.c b/http-fetch.c
index 557b403..09baedc 100644
--- a/http-fetch.c
+++ b/http-fetch.c
@@ -198,7 +198,7 @@ static void start_object_request(struct object_request *obj_req)
SHA1_Init(&obj_req->c);
if (prev_posn>0) {
prev_posn = 0;
- lseek(obj_req->local, SEEK_SET, 0);
+ lseek(obj_req->local, 0, SEEK_SET);
ftruncate(obj_req->local, 0);
}
}
diff --git a/http-push.c b/http-push.c
index 724720c..e3f7675 100644
--- a/http-push.c
+++ b/http-push.c
@@ -312,7 +312,7 @@ static void start_fetch_loose(struct transfer_request *request)
SHA1_Init(&request->c);
if (prev_posn>0) {
prev_posn = 0;
- lseek(request->local_fileno, SEEK_SET, 0);
+ lseek(request->local_fileno, 0, SEEK_SET);
ftruncate(request->local_fileno, 0);
}
}
diff --git a/index-pack.c b/index-pack.c
index 6284fe3..d35c4c4 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -249,7 +249,7 @@ static void *unpack_raw_entry(struct object_entry *obj, union delta_base *delta_
base_offset = c & 127;
while (c & 128) {
base_offset += 1;
- if (!base_offset || base_offset & ~(~0UL >> 7))
+ if (!base_offset || base_offset & ~(~0UL >> 1))
bad_object(obj->offset, "offset value overflow for delta base object");
p = fill(1);
c = *p;
diff --git a/sha1_file.c b/sha1_file.c
index 9c26038..7082d3e 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1149,7 +1149,7 @@ static off_t get_delta_base(struct packed_git *p,
base_offset = c & 127;
while (c & 128) {
base_offset += 1;
- if (!base_offset || base_offset & ~(~0UL >> 7))
+ if (!base_offset || base_offset & ~(~0UL >> 1))
die("offset value overflow for delta base object");
c = base_info[used++];
base_offset = (base_offset << 7) + (c & 127);