git.vger.kernel.org archive mirror
* Re: CAREFUL! No more delta object support!
  2005-06-28  1:14 CAREFUL! No more delta object support! Linus Torvalds
@ 2005-06-27 23:58 ` Christopher Li
  2005-06-28  3:30   ` Linus Torvalds
  2005-06-28  2:01 ` CAREFUL! No more delta object support! Junio C Hamano
  2005-06-28  8:49 ` [PATCH] Adjust fsck-cache to packed GIT and alternate object pool Junio C Hamano
  2 siblings, 1 reply; 38+ messages in thread
From: Christopher Li @ 2005-06-27 23:58 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

On Mon, Jun 27, 2005 at 06:14:40PM -0700, Linus Torvalds wrote:
> 
> The reason? The new git understands packed files natively, which ends up 
> being a much bigger win in many many ways.

Interesting. I took a look at your change; it still supports delta objects
inside the pack file, right? For a second I wondered whether you had dropped
the delta feature completely.

Chris

^ permalink raw reply	[flat|nested] 38+ messages in thread

* CAREFUL! No more delta object support!
@ 2005-06-28  1:14 Linus Torvalds
  2005-06-27 23:58 ` Christopher Li
                   ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-06-28  1:14 UTC (permalink / raw)
  To: Git Mailing List


Some people may have noticed already (hopefully not the hard way) that the 
current git code doesn't support delta objects lying around in the object 
directory any more.

In other words, if you have delta objects, you need to un-deltify your 
repository _before_ you upgrade your git binaries, or they won't be able 
to read your objects any more.

The reason? The new git understands packed files natively, which ends up 
being a much bigger win in many many ways.

You should be very careful about using packed files (since they are a very 
recent addition), but what you can do to try them out is to do so in a 
separate repository.

Starting to use a packed repository is very simple indeed, and here's what 
you need to do for git, for example:

In your regular "git" directory (once you have updated your git to a 
recent version, in particular you need to have the "csum-file: fix missing 
buf pointer update" commit), do:

	git-rev-list --objects HEAD | git-pack-objects --window=50 --depth=50 out

which will say something like "Packing 3741 objects" and result in two new 
files a few seconds later:

	torvalds@ppc970:~/git> ls -lh out*
	-rw-r--r--  1 torvalds torvalds  89K Jun 27 17:59 out.idx
	-rw-r--r--  1 torvalds torvalds 1.3M Jun 27 17:59 out.pack

now, don't do anything with those files, but instead go and create a 
directory somewhere else:

	cd ~
	mkdir packed-git-trial
	cd packed-git-trial
	git-init-db

you have now obviously created a totally empty repository. Now, let's 
populate that empty repository with _just_ the pack files:

	mkdir .git/objects/pack
	mv ~/git/out.* .git/objects/pack

and then, move over your tags, in particular the HEAD pointer, with 
something like

	cat ~/git/.git/HEAD > .git/HEAD

and voila, you're done. Try "gitk", for example. Or "git log".
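
The steps above can be collected into one function. This is only a sketch, and it assumes a current git rather than the one described in this mail: the commands are spelled "git <cmd>", pack-objects names its output out-<hash>.* and prints the hash on stdout, and HEAD is a symbolic ref, so the tip is copied with update-ref instead of "cat". The function name make_packed_trial is made up for illustration.

```shell
#!/bin/sh
# Sketch: build a trial repository that contains nothing but a pack.
# Usage: make_packed_trial <source-repo> <new-repo>
make_packed_trial() {
	src=$1 dest=$2
	tmp=$(mktemp -d) || return 1

	# Pack everything reachable from HEAD; the pack hash is printed on stdout.
	hash=$(git -C "$src" rev-list --objects HEAD |
		git -C "$src" pack-objects -q --window=50 --depth=50 "$tmp/out") || return 1

	# Create the empty repository and populate it with just the pack files.
	git init -q "$dest" || return 1
	mv "$tmp"/out-"$hash".* "$dest/.git/objects/pack/" || return 1
	rmdir "$tmp"

	# Point the new repository's current branch at the same commit.
	git -C "$dest" update-ref HEAD "$(git -C "$src" rev-parse HEAD)"
}
```

Something like "make_packed_trial ~/git ~/packed-git-trial" followed by "git log" in the new directory should then show the old history served from the pack alone.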

Now, what's even cooler is how you can just start using this packed tree: 
feel free to do a test-commit or something, and notice how git starts 
populating the empty .git/objects/xx/ subdirectories with new objects. But 
it still relies on the pack-file for the old history.

Now, there's still a misfeature there, which is that when you create a new
object, it doesn't check whether that object already exists in the
pack-file, so you'll end up with a few recent objects that you really
don't need (notably tree objects), and we'll fix that eventually. But
notice how you started with a 17MB .git/objects/ directory in your
original tree, and you now have just a 1.3MB pack-file and a 90kB index
file that replaces all that?
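
The missing check can be sketched in shell. An object's name is the SHA-1 of a "<type> <length>" header, a NUL byte, and the data, so a writer can work out the name first and only then decide whether a loose file is needed. This is a rough illustration, not git's code: it assumes sha1sum is installed, the function names are made up, and "packed_ids" is a hypothetical text file with one object name per line that stands in for a real pack index lookup.

```shell
#!/bin/sh
# Object name = SHA-1 over "<type> <length>\0" followed by the data.
object_id() {
	kind=$1 file=$2
	len=$(( $(wc -c < "$file") ))
	{ printf '%s %s\0' "$kind" "$len"; cat "$file"; } | sha1sum | cut -d' ' -f1
}

# Loose objects live under <objdir>/xx/<remaining 38 hex digits>.
loose_object_path() {
	objdir=$1 id=$2
	printf '%s/%s/%s\n' "$objdir" \
		"$(printf '%s' "$id" | cut -c1-2)" \
		"$(printf '%s' "$id" | cut -c3-)"
}

# Succeeds only when a loose file would actually have to be written.
# "packed_ids" is a hypothetical stand-in for a pack index lookup.
needs_loose_write() {
	objdir=$1 id=$2 packed_ids=$3
	[ -e "$(loose_object_path "$objdir" "$id")" ] && return 1
	[ -f "$packed_ids" ] && grep -qx "$id" "$packed_ids" && return 1
	return 0
}
```

With this ordering, an object that is already in the pack never produces a duplicate loose file.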

There are some other issues too, like the fact that "git-fsck-cache"  
doesn't know about the pack-files yet, so it will complain about missing
objects etc. Also, please note that the pack-file _only_ packs the commits
and the things reachable from them: things like tags (and your references
in your .git/refs directory) need to be copied over separately.

So this is all very rough, still, but the basics do actually seem to work
(ie anything that doesn't look directly at the object files - which is
pretty much all of it except for fsck and the direct-filesystem-access 
things like "rsync" and "git-local-pull").

Maybe you might not want to switch over yet, and as mentioned, rsync then
ends up not being a good way to sync (nor git-local-pull), but the
"git-http/ssh-pull" family should hopefully just work.

I've used a packed kernel tree too, so this has gotten _some_ testing even 
on really quite big git trees. 

			Linus


* Re: CAREFUL! No more delta object support!
  2005-06-28  1:14 CAREFUL! No more delta object support! Linus Torvalds
  2005-06-27 23:58 ` Christopher Li
@ 2005-06-28  2:01 ` Junio C Hamano
  2005-06-28  2:03   ` [PATCH] Skip writing out sha1 files for objects in packed git Junio C Hamano
  2005-06-28  2:13   ` CAREFUL! No more delta object support! Linus Torvalds
  2005-06-28  8:49 ` [PATCH] Adjust fsck-cache to packed GIT and alternate object pool Junio C Hamano
  2 siblings, 2 replies; 38+ messages in thread
From: Junio C Hamano @ 2005-06-28  2:01 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> Now, there's still a misfeature there, which is that when you create a new
LT> object, it doesn't check whether that object already exists in the
LT> pack-file, so you'll end up with a few recent objects that you really
LT> don't need (notably tree objects), and we'll fix that eventually.

Patch will be sent separately.

LT> ... Also, please note that the pack-file _only_ packs the commits
LT> and the things reachable from them ...

Shouldn't feeding "git-rev-list --objects" output plus
handcrafted list of objects in 2.6.11 tree object to
git-pack-objects just work???

LT> Maybe you might not want to switch over yet, and as mentioned, rsync then
LT> ends up not being a good way to sync (nor git-local-pull), but the
LT> "git-http/ssh-pull" family should hopefully just work.

No.  The pull protocol Dan did expects to throw compressed
representation around on the wire (which is valid if you assume
uncompressed transfer) and does not use read-sha1-file --
write-sha1-file pair, so all three do not work.


* [PATCH] Skip writing out sha1 files for objects in packed git.
  2005-06-28  2:01 ` CAREFUL! No more delta object support! Junio C Hamano
@ 2005-06-28  2:03   ` Junio C Hamano
  2005-06-28  2:43     ` Linus Torvalds
  2005-06-28  2:13   ` CAREFUL! No more delta object support! Linus Torvalds
  1 sibling, 1 reply; 38+ messages in thread
From: Junio C Hamano @ 2005-06-28  2:03 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Now, there's still a misfeature there, which is that when you
create a new object, it doesn't check whether that object
already exists in the pack-file, so you'll end up with a few
recent objects that you really don't need (notably tree
objects), and this patch fixes it.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

 apply.c          |    2 +-
 cache.h          |    2 +-
 commit-tree.c    |    2 +-
 convert-cache.c  |    6 +++---
 mktag.c          |    2 +-
 sha1_file.c      |   44 ++++++++++++++++++++++++++++++--------------
 unpack-objects.c |    4 ++--
 update-cache.c   |    2 +-
 write-tree.c     |    2 +-
 9 files changed, 41 insertions(+), 25 deletions(-)

f4f76b275cdabc038bcb4f3c7ca0d443638df88d
diff --git a/apply.c b/apply.c
--- a/apply.c
+++ b/apply.c
@@ -1221,7 +1221,7 @@ static void add_index_file(const char *p
 	if (lstat(path, &st) < 0)
 		die("unable to stat newly created file %s", path);
 	fill_stat_cache_info(ce, &st);
-	if (write_sha1_file(buf, size, "blob", ce->sha1) < 0)
+	if (write_sha1_file(buf, size, "blob", ce->sha1, 0) < 0)
 		die("unable to create backing store for newly created file %s", path);
 	if (add_cache_entry(ce, ADD_CACHE_OK_TO_ADD) < 0)
 		die("unable to add cache entry for %s", path);
diff --git a/cache.h b/cache.h
--- a/cache.h
+++ b/cache.h
@@ -165,7 +165,7 @@ extern int parse_sha1_header(char *hdr, 
 extern int sha1_object_info(const unsigned char *, char *, unsigned long *);
 extern void * unpack_sha1_file(void *map, unsigned long mapsize, char *type, unsigned long *size);
 extern void * read_sha1_file(const unsigned char *sha1, char *type, unsigned long *size);
-extern int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned char *return_sha1);
+extern int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned char *return_sha1, int do_expand);
 
 extern int check_sha1_signature(const unsigned char *sha1, void *buf, unsigned long size, const char *type);
 
diff --git a/commit-tree.c b/commit-tree.c
--- a/commit-tree.c
+++ b/commit-tree.c
@@ -191,7 +191,7 @@ int main(int argc, char **argv)
 	while (fgets(comment, sizeof(comment), stdin) != NULL)
 		add_buffer(&buffer, &size, "%s", comment);
 
-	write_sha1_file(buffer, size, "commit", commit_sha1);
+	write_sha1_file(buffer, size, "commit", commit_sha1, 0);
 	printf("%s\n", sha1_to_hex(commit_sha1));
 	return 0;
 }
diff --git a/convert-cache.c b/convert-cache.c
--- a/convert-cache.c
+++ b/convert-cache.c
@@ -111,7 +111,7 @@ static int write_subdirectory(void *buff
 		buffer += len;
 	}
 
-	write_sha1_file(new, newlen, "tree", result_sha1);
+	write_sha1_file(new, newlen, "tree", result_sha1, 0);
 	free(new);
 	return used;
 }
@@ -251,7 +251,7 @@ static void convert_date(void *buffer, u
 	memcpy(new + newlen, buffer, size);
 	newlen += size;
 
-	write_sha1_file(new, newlen, "commit", result_sha1);
+	write_sha1_file(new, newlen, "commit", result_sha1, 0);
 	free(new);	
 }
 
@@ -286,7 +286,7 @@ static struct entry * convert_entry(unsi
 	memcpy(buffer, data, size);
 	
 	if (!strcmp(type, "blob")) {
-		write_sha1_file(buffer, size, "blob", entry->new_sha1);
+		write_sha1_file(buffer, size, "blob", entry->new_sha1, 0);
 	} else if (!strcmp(type, "tree"))
 		convert_tree(buffer, size, entry->new_sha1);
 	else if (!strcmp(type, "commit"))
diff --git a/mktag.c b/mktag.c
--- a/mktag.c
+++ b/mktag.c
@@ -123,7 +123,7 @@ int main(int argc, char **argv)
 	if (verify_tag(buffer, size) < 0)
 		die("invalid tag signature file");
 
-	if (write_sha1_file(buffer, size, "tag", result_sha1) < 0)
+	if (write_sha1_file(buffer, size, "tag", result_sha1, 0) < 0)
 		die("unable to write tag file");
 	printf("%s\n", sha1_to_hex(result_sha1));
 	return 0;
diff --git a/sha1_file.c b/sha1_file.c
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -891,31 +891,47 @@ void *read_object_with_reference(const u
 	}
 }
 
-int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned char *returnsha1)
+static char *write_sha1_file_prepare(void *buf,
+				     unsigned long len,
+				     const char *type,
+				     unsigned char *sha1,
+				     unsigned char *hdr,
+				     int *hdrlen)
 {
-	int size;
-	unsigned char *compressed;
-	z_stream stream;
-	unsigned char sha1[20];
 	SHA_CTX c;
-	char *filename;
-	static char tmpfile[PATH_MAX];
-	unsigned char hdr[50];
-	int fd, hdrlen, ret;
 
 	/* Generate the header */
-	hdrlen = sprintf((char *)hdr, "%s %lu", type, len)+1;
+	*hdrlen = sprintf((char *)hdr, "%s %lu", type, len)+1;
 
 	/* Sha1.. */
 	SHA1_Init(&c);
-	SHA1_Update(&c, hdr, hdrlen);
+	SHA1_Update(&c, hdr, *hdrlen);
 	SHA1_Update(&c, buf, len);
 	SHA1_Final(sha1, &c);
 
+	return sha1_file_name(sha1);
+}
+
+int write_sha1_file(void *buf, unsigned long len, const char *type,
+		    unsigned char *returnsha1, int do_expand)
+{
+	int size;
+	unsigned char *compressed;
+	z_stream stream;
+	unsigned char sha1[20];
+	char *filename;
+	static char tmpfile[PATH_MAX];
+	unsigned char hdr[50];
+	int fd, hdrlen, ret;
+
+	/* Normally if we have it in the pack then we do not bother writing
+	 * it out into .git/objects/??/?{38} file.
+	 */
+	filename = write_sha1_file_prepare(buf, len, type, sha1, hdr, &hdrlen);
 	if (returnsha1)
 		memcpy(returnsha1, sha1, 20);
-
-	filename = sha1_file_name(sha1);
+	if (!do_expand && has_sha1_file(sha1))
+		return 0;
 	fd = open(filename, O_RDONLY);
 	if (fd >= 0) {
 		/*
@@ -1082,7 +1098,7 @@ int index_fd(unsigned char *sha1, int fd
 	if ((int)(long)buf == -1)
 		return -1;
 
-	ret = write_sha1_file(buf, size, "blob", sha1);
+	ret = write_sha1_file(buf, size, "blob", sha1, 0);
 	if (size)
 		munmap(buf, size);
 	return ret;
diff --git a/unpack-objects.c b/unpack-objects.c
--- a/unpack-objects.c
+++ b/unpack-objects.c
@@ -126,7 +126,7 @@ static int unpack_non_delta_entry(struct
 	case 'B': type_s = "blob"; break;
 	default: goto err_finish;
 	}
-	if (write_sha1_file(buffer, size, type_s, sha1) < 0)
+	if (write_sha1_file(buffer, size, type_s, sha1, 1) < 0)
 		die("failed to write %s (%s)",
 		    sha1_to_hex(entry->sha1), type_s);
 	printf("%s %s\n", sha1_to_hex(sha1), type_s);
@@ -223,7 +223,7 @@ static int unpack_delta_entry(struct pac
 		die("failed to apply delta");
 	free(delta_data);
 
-	if (write_sha1_file(result, result_size, type, sha1) < 0)
+	if (write_sha1_file(result, result_size, type, sha1, 1) < 0)
 		die("failed to write %s (%s)",
 		    sha1_to_hex(entry->sha1), type);
 	free(result);
diff --git a/update-cache.c b/update-cache.c
--- a/update-cache.c
+++ b/update-cache.c
@@ -77,7 +77,7 @@ static int add_file_to_cache(char *path)
 			free(target);
 			return -1;
 		}
-		if (write_sha1_file(target, st.st_size, "blob", ce->sha1))
+		if (write_sha1_file(target, st.st_size, "blob", ce->sha1, 0))
 			return -1;
 		free(target);
 		break;
diff --git a/write-tree.c b/write-tree.c
--- a/write-tree.c
+++ b/write-tree.c
@@ -76,7 +76,7 @@ static int write_tree(struct cache_entry
 		nr++;
 	}
 
-	write_sha1_file(buffer, offset, "tree", returnsha1);
+	write_sha1_file(buffer, offset, "tree", returnsha1, 0);
 	free(buffer);
 	return nr;
 }
------------


* Re: CAREFUL! No more delta object support!
  2005-06-28  2:01 ` CAREFUL! No more delta object support! Junio C Hamano
  2005-06-28  2:03   ` [PATCH] Skip writing out sha1 files for objects in packed git Junio C Hamano
@ 2005-06-28  2:13   ` Linus Torvalds
  2005-06-28  2:32     ` Junio C Hamano
                       ` (2 more replies)
  1 sibling, 3 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-06-28  2:13 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git



On Mon, 27 Jun 2005, Junio C Hamano wrote:
> 
> LT> ... Also, please note that the pack-file _only_ packs the commits
> LT> and the things reachable from them ...
> 
> Shouldn't feeding "git-rev-list --objects" output plus
> handcrafted list of objects in 2.6.11 tree object to
> git-pack-objects just work???

You could do that. And yes, we can add support for "tag" objects too 
(which the packing doesn't do at all right now). So this is not a 
"fundamental" problem, it's just a practical one right now.

> > [..  git-ssh-pull hopefully working ..]
>
> No.  The pull protocol Dan did expects to throw compressed
> representation around on the wire (which is valid if you assume
> uncompressed transfer) and does not use read-sha1-file --
> write-sha1-file pair, so all three do not work.

Fair enough. I'd prefer for the pull/push to push object packs around 
anyway, so there's some more work there..

		Linus


* Re: CAREFUL! No more delta object support!
  2005-06-28  2:13   ` CAREFUL! No more delta object support! Linus Torvalds
@ 2005-06-28  2:32     ` Junio C Hamano
  2005-06-28  2:37       ` [PATCH] Adjust to git-init-db creating $GIT_OBJECT_DIRECTORY/pack Junio C Hamano
  2005-06-28  2:48       ` CAREFUL! No more delta object support! Linus Torvalds
  2005-06-28  5:09     ` Daniel Barkalow
  2005-06-29 18:59     ` Linus Torvalds
  2 siblings, 2 replies; 38+ messages in thread
From: Junio C Hamano @ 2005-06-28  2:32 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> Fair enough. I'd prefer for the pull/push to push object packs around 
LT> anyway, so there's some more work there..

Yes, I'd prefer that too.

By the way, you broke t/t0000 with the last commit.  Now an
empty GIT_OBJECT_DIRECTORY has 257 subdirectories.
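
The count is easy to reproduce without git. The sketch below builds the same skeleton by hand (256 fan-out directories plus "pack") and counts it the way t0000 does; the function name is made up, and the layout is taken from this thread rather than from git-init-db itself.

```shell
#!/bin/sh
# Build 00..ff plus "pack" under <root>/objects and count directories.
count_object_dirs() {
	root=$1
	mkdir -p "$root/objects/pack" || return 1
	i=0
	while [ "$i" -lt 256 ]; do
		mkdir -p "$root/objects/$(printf '%02x' "$i")" || return 1
		i=$((i + 1))
	done
	# find also prints "objects" itself, hence 257 + 1 lines.
	echo $(( $(find "$root/objects" -type d -print | wc -l) ))
}
```

Since find lists "objects" itself as well, the number to test for is one more than the number of subdirectories.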


* [PATCH] Adjust to git-init-db creating $GIT_OBJECT_DIRECTORY/pack
  2005-06-28  2:32     ` Junio C Hamano
@ 2005-06-28  2:37       ` Junio C Hamano
  2005-06-28  2:48       ` CAREFUL! No more delta object support! Linus Torvalds
  1 sibling, 0 replies; 38+ messages in thread
From: Junio C Hamano @ 2005-06-28  2:37 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

Some tests expected the directory not to exist by default.
Updated git-init-db prepares it properly so adjust tests to
match that behaviour.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

 t/t0000-basic.sh       |    6 +++---
 t/t5300-pack-object.sh |    1 -
 2 files changed, 3 insertions(+), 4 deletions(-)

de500ab0379e4db18d1511cbe91ace106eee7830
diff --git a/t/t0000-basic.sh b/t/t0000-basic.sh
--- a/t/t0000-basic.sh
+++ b/t/t0000-basic.sh
@@ -28,11 +28,11 @@ test_expect_success \
     '.git/objects should be empty after git-init-db in an empty repo.' \
     'cmp -s /dev/null should-be-empty' 
 
-# also it should have 256 subdirectories.  257 is counting "objects"
+# also it should have 257 subdirectories.  258 is counting "objects"
 find .git/objects -type d -print >full-of-directories
 test_expect_success \
-    '.git/objects should have 256 subdirectories.' \
-    'test $(wc -l < full-of-directories) = 257'
+    '.git/objects should have 257 subdirectories.' \
+    'test $(wc -l < full-of-directories) = 258'
 
 ################################################################
 # Basics of the basics
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -99,7 +99,6 @@ test_expect_success \
     'GIT_OBJECT_DIRECTORY=.git2/objects &&
      export GIT_OBJECT_DIRECTORY &&
      git-init-db &&
-     mkdir .git2/objects/pack &&
      cp test-1.pack test-1.idx .git2/objects/pack && {
 	 git-diff-tree --root -p $commit &&
 	 while read object
------------


* Re: [PATCH] Skip writing out sha1 files for objects in packed git.
  2005-06-28  2:03   ` [PATCH] Skip writing out sha1 files for objects in packed git Junio C Hamano
@ 2005-06-28  2:43     ` Linus Torvalds
  2005-06-28  3:33       ` Junio C Hamano
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2005-06-28  2:43 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git



On Mon, 27 Jun 2005, Junio C Hamano wrote:
>
> Now, there's still a misfeature there, which is that when you
> create a new object, it doesn't check whether that object
> already exists in the pack-file, so you'll end up with a few
> recent objects that you really don't need (notably tree
> objects), and this patch fixes it.
> 
> Signed-off-by: Junio C Hamano <junkio@cox.net>

Actually, I don't think that "do_expand" flag should exist.

If we want to expand a packed file and really write the objects to the 
.git/objects directories, we should just not have that packed file in the 
.git/objects/pack directory.

And if we have a pack-file in .git/objects/ that already has the object, 
that may not be the _same_ pack-file that we're expanding at all, so if 
that pack file already has the object, then not writing it out is actually 
the right thing to do.

That will also simplify your patch a bit. I'll fix it up.

		Linus


* Re: CAREFUL! No more delta object support!
  2005-06-28  2:32     ` Junio C Hamano
  2005-06-28  2:37       ` [PATCH] Adjust to git-init-db creating $GIT_OBJECT_DIRECTORY/pack Junio C Hamano
@ 2005-06-28  2:48       ` Linus Torvalds
  1 sibling, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-06-28  2:48 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git



On Mon, 27 Jun 2005, Junio C Hamano wrote:
> 
> By the way, you broke t/t0000 with the last commit.  Now an
> empty GIT_OBJECT_DIRECTORY has 257 subdirectories.

Yup, I noticed that. Fix pushed out (along with another one that was 
failing because it wanted to create the "pack" directory itself, and was 
unhappy when it already existed).

		Linus


* Re: CAREFUL! No more delta object support!
  2005-06-27 23:58 ` Christopher Li
@ 2005-06-28  3:30   ` Linus Torvalds
  2005-06-28  9:40     ` Junio C Hamano
  2005-06-28 10:38     ` Christopher Li
  0 siblings, 2 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-06-28  3:30 UTC (permalink / raw)
  To: Christopher Li; +Cc: Git Mailing List



On Mon, 27 Jun 2005, Christopher Li wrote:
> On Mon, Jun 27, 2005 at 06:14:40PM -0700, Linus Torvalds wrote:
> > 
> > The reason? The new git understands packed files natively, which ends up 
> > being a much bigger win in many many ways.
> 
> Interesting. I took a look at your change; it still supports delta objects
> inside the pack file, right? For a second I wondered whether you had dropped
> the delta feature completely.

Deltas do exist inside pack-files, yes. They just don't exist as 
independent objects any more, so you can never get into the situation that 
you find a delta but you don't find the delta it points to.

Because in the pack-files, there are only deltas _within_ a pack-file. You 
can't have a delta that points to outside the pack.

This means that pack-files with few objects will inevitably be larger than
they could otherwise be (ie you can never have a pack file that _only_
contains deltas to the outside world), but it's just incredibly reassuring 
to me that a pack-file is always self-sufficient. 
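
The self-sufficiency rule is simple enough to state as a check. The sketch below works on a made-up listing format, one "<object> <base>" pair per line with "-" as the base of an object stored in full; it is not the output of any git tool, and the function name is hypothetical.

```shell
#!/bin/sh
# Succeeds only if every delta's base is itself listed in the same file,
# i.e. no delta points outside the pack.
pack_is_self_contained() {
	awk '
		NR == FNR { present[$1] = 1; next }           # first pass: collect objects
		$2 != "-" && !($2 in present) { missing = 1 } # second pass: check bases
		END { exit missing + 0 }
	' "$1" "$1"
}
```

A listing of "a -", "b a", "c b" passes, while "a -", "b z" fails because z lives outside the pack.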

So when/if we start using pack-files for doing "git pull" etc, the 
pack-file won't actually help pack things for small updates: small updates 
will probably contain the whole changed file, unless the update has 
several changes to the same file (which is not unusual, of course), in 
which case it will only contain one version and then deltas from that.

But the savings get increasingly bigger the more history we have. That's
also why the packed git archive is about 1/14th of the size of the fully
unpacked disk usage of the git project, but a packed kernel archive "only"  
achieves a packing rate of 1/5th of the fully unpacked kernel archive. The
git archive is all history, while the kernel archive just "appears", and
2/3 of the files have only one single version and thus don't delta-
compress at all.

(Another reason is probably that the kernel has bigger files, which means
that it thus has relatively less loss in filesystem block padding).

But not having any outside deltas not only makes me feel safer, it also
means that you can fully validate a pack archive consistency without even
knowing what project it is from - you can check the SHA1 results of every
file in the pack against the index of the pack, and check that the SHA1's
of the pack files themselves are valid. Again, this is just a data
_consistency_ check, of course - it means that you can validate that it
downloaded fine, and that you don't have disk corruption, but it doesn't
mean that the data isn't evil and nasty and buggy ;)

			Linus


* Re: [PATCH] Skip writing out sha1 files for objects in packed git.
  2005-06-28  2:43     ` Linus Torvalds
@ 2005-06-28  3:33       ` Junio C Hamano
  2005-06-28 15:45         ` Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: Junio C Hamano @ 2005-06-28  3:33 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> If we want to expand a packed file and really write the objects to the 
LT> .git/objects directories, we should just not have that packed file in the 
LT> .git/objects/pack directory.

What I was aiming for was this:

 (1) Introduce an interface to sha1_file.c that lets you say
     "use this file as one of the packs, although it is not
     under .git/objects/pack";

 (2) Introduce another interface to sha1_file.c that lets you
     enumerate the index entries for a given pack file.

 (3) Remove the unpacking logic from unpack-objects.c; instead
     call the above interfaces to register the pack and
     enumerate entries, and call read_sha1_file() followed by
     write_sha1_file() with do_expand repeatedly.

However, the infrastructure (1) and (2) may end up being a
special case only to support unpack-objects (and removing the
code duplication for unpacking), in which case what you suggest
would make more sense.

LT> And if we have a pack-file in .git/objects/ that already has
LT> the object, that may not be the _same_ pack-file that we're
LT> expanding at all, so if that pack file already has the
LT> object, then not writing it out is actually the right thing
LT> to do.

This I have to think about a bit.


* Re: CAREFUL! No more delta object support!
  2005-06-28  2:13   ` CAREFUL! No more delta object support! Linus Torvalds
  2005-06-28  2:32     ` Junio C Hamano
@ 2005-06-28  5:09     ` Daniel Barkalow
  2005-06-28 15:49       ` Linus Torvalds
  2005-06-29 18:59     ` Linus Torvalds
  2 siblings, 1 reply; 38+ messages in thread
From: Daniel Barkalow @ 2005-06-28  5:09 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, git

On Mon, 27 Jun 2005, Linus Torvalds wrote:

> > > [..  git-ssh-pull hopefully working ..]
> >
> > No.  The pull protocol Dan did expects to throw compressed
> > representation around on the wire (which is valid if you assume
> > uncompressed transfer) and does not use read-sha1-file --
> > write-sha1-file pair, so all three do not work.
> 
> Fair enough. I'd prefer for the pull/push to push object packs around 
> anyway, so there's some more work there..

It shouldn't be hard to add; the main issue is determining when
transferring a pack file is a good idea, because it probably doesn't make
sense to transfer a pack file just because the source side has an object
that the target side wants in that pack. (If you pull from someone who
packed up the whole history of everything, which you already have, into a
file with one new commit, you'd be sad to get the huge thing; you really
want a little custom (or just limited) pack file.)

The ideal thing is probably to pick up some tricks from Mercurial in
figuring out what needs to be transferred, and have the source side write
a pack file directly to the connection, which the target side would then
save directly. I never worked out exactly what those tricks were, though.

The next trick would be to put something in place of cleverly-chosen
objects to specify what pack file they're in, so that the HTTP client
could find things from a packed repository. (Or we could just have an
option to unpack post-transfer.)

	-Daniel
*This .sig left intentionally blank*


* [PATCH] Adjust fsck-cache to packed GIT and alternate object pool.
  2005-06-28  1:14 CAREFUL! No more delta object support! Linus Torvalds
  2005-06-27 23:58 ` Christopher Li
  2005-06-28  2:01 ` CAREFUL! No more delta object support! Junio C Hamano
@ 2005-06-28  8:49 ` Junio C Hamano
  2005-06-28 21:56   ` [PATCH] Expose packed_git and alt_odb Junio C Hamano
  2005-06-28 21:58   ` [PATCH 3/3] Update fsck-cache (take 2) Junio C Hamano
  2 siblings, 2 replies; 38+ messages in thread
From: Junio C Hamano @ 2005-06-28  8:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> There are some other issues too, like the fact that "git-fsck-cache"  
LT> doesn't know about the pack-files yet, so it will complain about missing
LT> objects etc.

And here is a patch to fix it.  It is interesting to know that
the same problem existed for a long time in a different form and
nobody has complained: GIT_ALTERNATE_OBJECT_DIRECTORIES.

Maybe the alternate object pool mechanism is not so widely used
and probably not very useful for everyday use.  I donno.

------------
The fsck-cache complains if objects referred to by files in
.git/refs/ or objects stored in files under .git/objects/??/ are
not found as stand-alone SHA1 files (i.e. found in alternate
object pools GIT_ALTERNATE_OBJECT_DIRECTORIES or packed archives
stored under .git/objects/pack).

Although these are good semantics for maintaining the consistency of a
single .git/objects directory as a self-contained set of
objects, it is sometimes useful to consider it OK as long as
these "outside" objects are available.

This commit introduces a new flag, --standalone, to
git-fsck-cache.  When it is not specified, connectivity checks
and .git/refs pointer checks are taught that it is OK when
expected objects do not exist under .git/objects/?? hierarchy
but are available from a packed archive or in an alternate
object pool.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

 fsck-cache.c |   20 ++++++++++++++++----
 1 files changed, 16 insertions(+), 4 deletions(-)

ea4255429bb0b4b760ba2fe327f5806d8d24d8a6
diff --git a/fsck-cache.c b/fsck-cache.c
--- a/fsck-cache.c
+++ b/fsck-cache.c
@@ -12,6 +12,7 @@
 static int show_root = 0;
 static int show_tags = 0;
 static int show_unreachable = 0;
+static int standalone = 0;
 static int keep_cache_objects = 0; 
 static unsigned char head_sha1[20];
 
@@ -25,13 +26,17 @@ static void check_connectivity(void)
 		struct object_list *refs;
 
 		if (!obj->parsed) {
-			printf("missing %s %s\n",
-			       obj->type, sha1_to_hex(obj->sha1));
+			if (!standalone && has_sha1_file(obj->sha1))
+				; /* it is in pack */
+			else
+				printf("missing %s %s\n",
+				       obj->type, sha1_to_hex(obj->sha1));
 			continue;
 		}
 
 		for (refs = obj->refs; refs; refs = refs->next) {
-			if (refs->item->parsed)
+			if (refs->item->parsed ||
+			    (!standalone && has_sha1_file(refs->item->sha1)))
 				continue;
 			printf("broken link from %7s %s\n",
 			       obj->type, sha1_to_hex(obj->sha1));
@@ -315,8 +320,11 @@ static int read_sha1_reference(const cha
 		return -1;
 
 	obj = lookup_object(sha1);
-	if (!obj)
+	if (!obj) {
+		if (!standalone && has_sha1_file(sha1))
+			return 0; /* it is in pack */
 		return error("%s: invalid sha1 pointer %.40s", path, hexname);
+	}
 
 	obj->used = 1;
 	mark_reachable(obj, REACHABLE);
@@ -390,6 +398,10 @@ int main(int argc, char **argv)
 			keep_cache_objects = 1;
 			continue;
 		}
+		if (!strcmp(arg, "--standalone")) {
+			standalone = 1;
+			continue;
+		}
 		if (*arg == '-')
 			usage("git-fsck-cache [--tags] [[--unreachable] [--cache] <head-sha1>*]");
 	}
------------


* Re: CAREFUL! No more delta object support!
  2005-06-28  3:30   ` Linus Torvalds
@ 2005-06-28  9:40     ` Junio C Hamano
  2005-06-28 11:06       ` Christopher Li
  2005-06-28 14:46       ` Jan Harkes
  2005-06-28 10:38     ` Christopher Li
  1 sibling, 2 replies; 38+ messages in thread
From: Junio C Hamano @ 2005-06-28  9:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> But the savings get increasingly bigger the more history we have. That's
LT> also why the packed git archive is about 1/14th of the size of the fully
LT> unpacked disk usage of the git project,...

GIT archive may be an odd-ball because the project itself is so
small, but a fair comparison should include the disk usage of
256 fan-out directories.  Counting them, empty .git/objects/
with the 1.4MB packed archive and 90KB index file ends up being
somewhere around 2.4MB on my machine, compared with 17MB for the
traditional one.

Still a good space reduction.  Good job!

I am now dreaming if we someday would enhance the mechanism with
append-only updates to the *.pack files with complete rewrite of
the *.idx files, and get rid of files under .git/objects totally.

This would make things reasonably friendly to rsync.  The kernel
pack has around 60M pack with 1.1M index, so everyday use would
involve incremental updates to the pack [*1*] and full download
of the index file.

[Footnote]

*1* Presumably many objects are deltified against older objects
which is suboptimal.  Most likely the newer objects are accessed
far more often and they are what we would want to keep in full
not as delta.  So even with this scheme we would want to have
weekly repacking.  Interestingly enough, pack-objects gets the
objects via usual read_sha1_file() interface so it can produce a
new pack from an existing pack.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: CAREFUL! No more delta object support!
  2005-06-28  3:30   ` Linus Torvalds
  2005-06-28  9:40     ` Junio C Hamano
@ 2005-06-28 10:38     ` Christopher Li
  2005-06-28 16:45       ` Linus Torvalds
  1 sibling, 1 reply; 38+ messages in thread
From: Christopher Li @ 2005-06-28 10:38 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Git Mailing List

That is all a nice improvement to address the space usage issue.

Should people just run a repack once in a while, or are new objects
automatically added to the pack file?

Chris


On Mon, Jun 27, 2005 at 08:30:22PM -0700, Linus Torvalds wrote:
> 
> Deltas do exist inside pack-files, yes. They just don't exist as 
> independent objects any more, so you can never get into the situation that 
> you find a delta but you don't find the delta it points to.
> 
> Because in the pack-files, there are only deltas _within_ a pack-file. You 
> can't have a delta that points to outside the pack.
> 
> This means that pack-files with few objects will inevitably be larger than
> they could otherwise be (ie you can never have a pack file that _only_
> contains deltas to the outside world), but it's just incredibly reassuring 
> to me that a pack-file is always self-sufficient. 
> 
> So when/if we start using pack-files for doing "git pull" etc, the 
> pack-file won't actually help pack things for small updates: small updates 
> will probably contain the whole changed file, unless the update has 
> several changes to the same file (which is not unusual, of course), in 
> which case it will only contain one version and then deltas from that.
> 
> But the savings get increasingly bigger the more history we have. That's
> also why the packed git archive is about 1/14th of the size of the fully
> unpacked disk usage of the git project, but a packed kernel archive "only"  
> achieves a packing rate of 1/5th of the fully unpacked kernel archive. The
> git archive is all history, while the kernel archive just "appears", and
> 2/3 of the files have only one single version and thus don't delta-
> compress at all.
> 
> (Another reason is probably that the kernel has bigger files, which means
> that it thus has relatively less loss in filesystem block padding).
> 
> But not having any outside deltas not only makes me feel safer, it also
> means that you can fully validate a pack archive consistency without even
> knowing what project it is from - you can check the SHA1 results of every
> file in the pack against the index of the pack, and check that the SHA1's
> of the pack files themselves are valid. Again, this is just a data
> _consistency_ check, of course - it means that you can validate that it
> downloaded fine, and that you don't have disk corruption, but it doesn't
> mean that the data isn't evil and nasty and buggy ;)
> 
> 			Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: CAREFUL! No more delta object support!
  2005-06-28  9:40     ` Junio C Hamano
@ 2005-06-28 11:06       ` Christopher Li
  2005-06-28 14:52         ` Petr Baudis
  2005-06-28 14:46       ` Jan Harkes
  1 sibling, 1 reply; 38+ messages in thread
From: Christopher Li @ 2005-06-28 11:06 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, git

On Tue, Jun 28, 2005 at 02:40:56AM -0700, Junio C Hamano wrote:
> >>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:
> Still a good space reduction.  Good job!
> 
> I am now dreaming if we someday would enhance the mechanism with
> append-only updates to the *.pack files with complete rewrite of
> the *.idx files, and get rid of files under .git/objects totally.

No offense, my friend, but this has been done. Its name is Mercurial.

> This would make things reasonably friendly to rsync.  The kernel
> pack has around 60M pack with 1.1M index, so everyday use would
> involve incremental updates to the pack [*1*] and full download
> of the index file.

It still has other open issues. Now it would be harder to avoid syncing
all the heads. If I just want the clean Linus-2.6 tree, I have to
dig it out of the pack file, which mixes it with other heads.

You could host different projects, each with its own pack file. That
would lose the space savings of co-hosting projects.

So I am not convinced rsync is the way to go in the long run. You need
to have your own network syncing method.

> 
> [Footnote]
> 
> *1* Presumably many objects are deltified against older objects
> which is suboptimal.  Most likely the newer objects are accessed
> far more often and they are what we would want to keep in full
> not as delta.  So even with this scheme we would want to have
> weekly repacking.  Interestingly enough, pack-objects gets the
> objects via usual read_sha1_file() interface so it can produce a
> new pack from an existing pack.

It sounds like you are suggesting backward deltas: keeping the
latest node in full and using deltas to reach the older ones. It should
work, but it will lose the append-only property.

Chris

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: CAREFUL! No more delta object support!
  2005-06-28  9:40     ` Junio C Hamano
  2005-06-28 11:06       ` Christopher Li
@ 2005-06-28 14:46       ` Jan Harkes
  1 sibling, 0 replies; 38+ messages in thread
From: Jan Harkes @ 2005-06-28 14:46 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, git

On Tue, Jun 28, 2005 at 02:40:56AM -0700, Junio C Hamano wrote:
> I am now dreaming if we someday would enhance the mechanism with
> append-only updates to the *.pack files with complete rewrite of
> the *.idx files, and get rid of files under .git/objects totally.

Stop dreaming, please.

The current separate objects setup might not be space efficient, but it
has many other advantages.

- Objects are written only once, and from then on are only read.
  This works well on filesystems that provide session semantics, as
  opposed to unix semantics. And the resulting objects are perfectly
  cacheable since they are only invalidated if someone ever decides to
  pack the repository.

- The hierarchy and the way the objects directories are updated works
  very well in combination with AFS style directory acls. What surprised
  me was that subdirectories in refs/heads work perfectly with all the
  core git tools, branchnames simply become 'user/branch'.

- Objects that differ in content have different naming, as a result
  multiple developers can safely commit into a shared repository without
  requiring locks. This is also why it is safe to pull from another
  repository without clobbering your own history. Imagine if you
  appended some local changes to a packed archive and the next rsync
  wipes your local commits.

I've been trying to keep an up to date document on how (and why) I use
git on Coda. It started out pretty much identical to jgarzik's HOWTO.
But it ended up a lot more complicated, to a point where I needed my own
scripts for just about every action. Until I discovered that the
alternate objects pool would work well in my environment.

    http://www.coda.cs.cmu.edu/git.html

Jan

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: CAREFUL! No more delta object support!
  2005-06-28 11:06       ` Christopher Li
@ 2005-06-28 14:52         ` Petr Baudis
  2005-06-28 16:35           ` Benjamin LaHaise
  0 siblings, 1 reply; 38+ messages in thread
From: Petr Baudis @ 2005-06-28 14:52 UTC (permalink / raw)
  To: Christopher Li; +Cc: Junio C Hamano, Linus Torvalds, git

Dear diary, on Tue, Jun 28, 2005 at 01:06:25PM CEST, I got a letter
where Christopher Li <git@chrisli.org> told me that...
> On Tue, Jun 28, 2005 at 02:40:56AM -0700, Junio C Hamano wrote:
> > >>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:
> > Still a good space reduction.  Good job!
> > 
> > I am now dreaming if we someday would enhance the mechanism with
> > append-only updates to the *.pack files with complete rewrite of
> > the *.idx files, and get rid of files under .git/objects totally.
> 
> No offense, my friend, but this has been done. Its name is Mercurial.
> 
> > This would make things reasonably friendly to rsync.  The kernel
> > pack has around 60M pack with 1.1M index, so everyday use would
> > involve incremental updates to the pack [*1*] and full download
> > of the index file.
> 
> It still has other open issues. Now it would be harder to avoid syncing
> all the heads. If I just want the clean Linus-2.6 tree, I have to
> dig it out of the pack file, which mixes it with other heads.
> 
> You could host different projects, each with its own pack file. That
> would lose the space savings of co-hosting projects.
> 
> So I am not convinced rsync is the way to go in the long run. You need
> to have your own network syncing method.

I think the git-*-pull tools are actually just fine. You will only need
to have some server-side CGI gadget to frontend the file, but we need
that anyway to make the pull reasonably effective.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
<Espy> be careful, some twit might quote you out of context..

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: [PATCH] Skip writing out sha1 files for objects in packed git.
  2005-06-28  3:33       ` Junio C Hamano
@ 2005-06-28 15:45         ` Linus Torvalds
  0 siblings, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-06-28 15:45 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: git



On Mon, 27 Jun 2005, Junio C Hamano wrote:
> 
> LT> And if we have a pack-file in .git/objects/ that already has
> LT> the object, that may not be the _same_ pack-file that we're
> LT> expanding at all, so if that pack file already has the
> LT> object, then not writing it out is actually the right thing
> LT> to do.
> 
> This I have to think about a bit.

The most trivial example is doing a "git pull" of a small pack-file 
update.

We probably don't want to leave it around as a pack-file (we'll re-pack 
everything at some later date, but we also don't want to expand the stuff 
we already have in our _real_ pack-file).

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: CAREFUL! No more delta object support!
  2005-06-28  5:09     ` Daniel Barkalow
@ 2005-06-28 15:49       ` Linus Torvalds
  2005-06-28 16:21         ` Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2005-06-28 15:49 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Junio C Hamano, git



On Tue, 28 Jun 2005, Daniel Barkalow wrote:
> 
> It shouldn't be hard to add; the main issue is determining when
> transfering a pack file is a good idea, because it probably doesn't make
> sense to transfer a pack file just because the source side has an object
> that the target side wants in that pack.

Oh, you'd never just transfer the whole big pack-file at all: you'd just 
create a new one. And creating a new one is just a matter of finding the 
common parent, and then doing

	git-rev-list --objects common..HEAD | git-pack-file .git/tmp-pack

and then you send the result to the other side..

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: CAREFUL! No more delta object support!
  2005-06-28 15:49       ` Linus Torvalds
@ 2005-06-28 16:21         ` Linus Torvalds
  2005-06-28 17:04           ` Daniel Barkalow
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2005-06-28 16:21 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Junio C Hamano, git



On Tue, 28 Jun 2005, Linus Torvalds wrote:
> 
> Oh, you'd never just transfer the whole big pack-file at all: you'd just 
> create a new one. And creating a new one is just a matter of finding the 
> common parent, and then doing
> 
> 	git-rev-list --objects common..HEAD | git-pack-file .git/tmp-pack
> 
> and then you send the result to the other side..

To clarify: this also works with objects that are already in another
pack-file (now that Junio fixed the "get size of a deltified packed
entry"), so you can have any number of unpacked objects in your objects
directory, _and_ a pack-file (or several), and you can generate a new
temporary pack-file just for sending somewhere else that contains
arbitrary parts of that (ie a mix of objects that are in your "main"
packfiles and objects that are unpacked).

You don't have to use "git-rev-list" to generate the objects, btw,
git-pack-file takes an arbitrary list of object ID's (plus a "packing
hint" in the form of a filename that is not required, but that can help
the packing heuristics, and that git-rev-list does provide).

I'll also fix up git-pack-file to be able to pack tag objects (and the
unpacking to understand them), so that any valid object can be packed. 
Right now it only handles the objects that git-rev-list knows about.

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: CAREFUL! No more delta object support!
  2005-06-28 14:52         ` Petr Baudis
@ 2005-06-28 16:35           ` Benjamin LaHaise
  2005-06-28 20:30             ` Petr Baudis
  0 siblings, 1 reply; 38+ messages in thread
From: Benjamin LaHaise @ 2005-06-28 16:35 UTC (permalink / raw)
  To: Petr Baudis; +Cc: Christopher Li, Junio C Hamano, Linus Torvalds, git

On Tue, Jun 28, 2005 at 04:52:56PM +0200, Petr Baudis wrote:
> I think the git-*-pull tools are actually just fine. You will only need
> to have some server-side CGI gadget to frontend the file, but we need
> that anyway to make the pull reasonably effective.

Not really -- the use of rsync for the objects fails horribly on slow 
links when the project scales in the number of commits.  The rsync 
protocol has to transfer the names of each file and some information 
about it, and that information isn't delta compressed.  This is where 
kernel.org is falling over, as well as what makes the kernel tree very 
painful to use over a dialup modem link.

		-ben
-- 
"Time is what keeps everything from happening all at once." -- John Wheeler

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: CAREFUL! No more delta object support!
  2005-06-28 10:38     ` Christopher Li
@ 2005-06-28 16:45       ` Linus Torvalds
  2005-06-29  0:49         ` [PATCH] Emit base objects of a delta chain when the delta is output Junio C Hamano
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2005-06-28 16:45 UTC (permalink / raw)
  To: Christopher Li; +Cc: Git Mailing List



On Tue, 28 Jun 2005, Christopher Li wrote:
>
> That is all a nice improvement to address the space usage issue.
> 
> Should people just run a repack once in a while, or are new objects
> automatically added to the pack file?

While adding a new object to a pack file is _possible_ (you add it to the
end of the pack-file, and re-generate the index file), I would strongly
suggest against it for several reasons:

 - It's a lot more complex and expensive than just writing a new file.  
   Much better to make the pack generation be an off-line thing, and make 
   new object creation really cheap.

 - it has serious locking issues, and if something goes wrong you are just 
   horribly screwed. This implies, for example, that to be safe you really 
   have to use fsync() etc at every point (and be careful about writing 
   the index), making the update even _more_ expensive. Over NFS you need 
   to be extremely careful to make sure that everybody got the right lock, 
   yadda yadda.

   Packing things off-line just means that _all_ of these problems go 
   away.

 - There are operations that want to remove objects (I do that all the 
   time: I do something stupid, and decide to undo it, or I just do a 
   "git-update-cache" and notice that I need to do more work so I edit it 
   some more and actually never commit the first version)

   If _adding_ to the file had some serious correctness issues, _removing_ 
   an object from a file is even worse. MUCH worse. Now you don't just 
   have to lock against other people creating new objects, now you have to 
   lock against updates (or totally re-write the whole big file and do an 
   atomic "rename").

 - it can actually generate worse packing. The current "offline" method 
   means that we can pack any version of a file against any other version 
   of a file, and we do. We pick the closest version we can find, and we 
   try to always pack against the bigger one (deletes are smaller deltas, 
   and the biggest one tends to be the latest version, so this not only
   means that the delta is denser, it also means that the latest version -
   which is likely to be the biggest and most often used - tends to be
   non-delta).

   In contrast, updating the pack file means that you always write the 
   latest version as a delta, which means that you're doing things 
   _exactly_ the wrong way around both for performance and size.

 - Finally: packing allows us to optimize for locality. In particular, 
   I write out the pack file in "recency" order, ie the top-most objects 
   go first, and in particular, the "commit" objects go at the very top of 
   the file. Why? Because it means that the commit objects (which are 
   heavily used for the history generation by pretty much anything, since 
   "git-rev-list" will access them) are packed together, and in the right 
   order.

   Again, you can't do that if you do on-line updates as opposed to 
   offline packing.

So the usage pattern I envision is to pack stuff maybe once a month
(depending on how much changes, of course), because then you really do get
the best of both worlds: the simplicity of individual objects for recent
work and the optimal packing and ordering that you can really work on for
the longer range case. And your project never grows very big.

Btw, I'm not claiming that my current pack format is "optimal" of course.  
For example, while I write all objects in recency order, right now that
means that if a recent object has been written as a delta that depends on
an older one, I actually write the delta first (correct) but I won't write
the older object until its recency ordering (wrong).

That kind of thing is trivial to fix (eventually), but it's an example of
where ordering matters (ie if it's the other way around: the delta is the
older object, it's probably better to leave it at the end of the file,
since it's probably not going to be accessed much, making the effective
packing at the head more efficient). It's also an example of the kinds
of things we can do exactly because we're doing the packing off-line.
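
The ordering fix described above (write a delta at its recency position,
then emit its base right behind it instead of waiting for the base's own
position) can be sketched roughly as follows. This is only an
illustration, not the actual packing code: "struct obj", "write_order"
and the use of array indices in place of SHA1 names are all assumptions
made for the example.

```c
#include <stddef.h>

/* Hypothetical stand-in for a packed object; index == recency rank. */
struct obj {
	int base;	/* index of the delta base, or -1 if stored in full */
	int written;	/* set once the object has been emitted */
};

/*
 * Fill order[] with object indices in the order they would be written.
 * objs[] is assumed to be sorted most-recent-first. Each object is
 * emitted at its recency position, immediately followed by any
 * not-yet-written objects on its delta chain.
 */
size_t write_order(struct obj *objs, size_t nr, int *order)
{
	size_t n = 0, i;

	for (i = 0; i < nr; i++) {
		int cur = (int)i;

		/* Emit the object, then walk down its delta chain. */
		while (cur >= 0 && !objs[cur].written) {
			objs[cur].written = 1;
			order[n++] = cur;
			cur = objs[cur].base;
		}
	}
	return n;
}
```

With four objects where only the newest (index 0) is a delta against the
oldest (index 3), this yields the order 0, 3, 1, 2: the base is pulled
forward so that reading the recent object does not need a seek to the
end of the file.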

			Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: CAREFUL! No more delta object support!
  2005-06-28 16:21         ` Linus Torvalds
@ 2005-06-28 17:04           ` Daniel Barkalow
  2005-06-28 17:36             ` Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: Daniel Barkalow @ 2005-06-28 17:04 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, git

On Tue, 28 Jun 2005, Linus Torvalds wrote:

> I'll also fix up git-pack-file to be able to pack tag objects (and the
> unpacking to understand them), so that any valid object can be packed. 
> Right now it only handles the objects that git-rev-list knows about.

Actually, the ideal thing would be to move the packing code into an object
file that git-ssh-push can include; that way it can write directly to the
socket instead of going through disk, and it can also go from getting the
remote end's list of common ancestors to having a pack to send without
needing to exec a script.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: CAREFUL! No more delta object support!
  2005-06-28 17:04           ` Daniel Barkalow
@ 2005-06-28 17:36             ` Linus Torvalds
  2005-06-28 18:17               ` Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2005-06-28 17:36 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Junio C Hamano, git



On Tue, 28 Jun 2005, Daniel Barkalow wrote:
> 
> Actually, the ideal thing would be to move the packing code into an object
> file that git-ssh-push can include; that way it can write directly to the
> socket instead of going through disk

It doesn't work very easily that way because the index file (which
contains the object list and the offsets into the pack file) cannot be
created until after the pack file has been created (and we don't want to
evaluate that one in memory, since it can be quite big).

Now, what we could do is to stream out the pack file first to stdout, and
write the index file afterwards. But since we don't know how big the pack
file will be when we start packing, and the pack-file can contain
basically arbitrary patterns, that requires that the receiver actually 
parse the pack-file as it comes in.

The format of the pack-file is a fairly trivial data stream of

 - rinse and repeat for each object:

     - one character of type of file (C, T, B, G, D for "commit", "tree", 
       "blob", "tag" or "delta" respectively)

     - four bytes of network-order unpacked data length

     - [ if delta: 20 bytes of delta object ID ]

     - zlib-packed data (length unknown, except we know how much we want 
       it to unpack to)

 - Finally at the end: 20 bytes of SHA1 of the pack-file contents (up to 
   the SHA1)

so it's actually possible to pick up the objects as they come off the 
stream, since the SHA1 name is defined by the contents and you don't need 
the index file unless you want to look things up.
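
A minimal sketch of decoding the per-object header laid out above (type
character, four bytes of network-order length, and the 20-byte base ID
for deltas). Only the byte layout is taken from the description; the
names "struct entry_header" and "parse_entry_header" are hypothetical,
and the zlib data that follows the header is not touched here.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical in-memory form of one entry header. */
struct entry_header {
	char type;			/* 'C', 'T', 'B', 'G' or 'D' */
	uint32_t unpacked_size;		/* decoded from network byte order */
	unsigned char base_sha1[20];	/* only meaningful when type == 'D' */
	size_t header_len;		/* bytes consumed before the zlib data */
};

/*
 * Parse one entry header from buf[0..len).
 * Returns 0 on success, -1 if the type character is not one of the
 * five listed above or the buffer is too short for the header.
 */
int parse_entry_header(const unsigned char *buf, size_t len,
		       struct entry_header *out)
{
	size_t need = 5;	/* type byte + 4 length bytes */

	if (len < need)
		return -1;
	switch (buf[0]) {
	case 'C': case 'T': case 'B': case 'G':
		break;
	case 'D':
		need += 20;	/* delta entries carry the base object ID */
		break;
	default:
		return -1;
	}
	if (len < need)
		return -1;

	out->type = (char)buf[0];
	out->unpacked_size = ((uint32_t)buf[1] << 24) |
			     ((uint32_t)buf[2] << 16) |
			     ((uint32_t)buf[3] << 8) |
			     (uint32_t)buf[4];
	memset(out->base_sha1, 0, sizeof(out->base_sha1));
	if (out->type == 'D')
		memcpy(out->base_sha1, buf + 5, 20);
	out->header_len = need;
	return 0;
}
```

Note that a rejected type character is exactly the case where a
streaming receiver would have to consider that it has reached the
trailing SHA1 rather than another object.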

So the receiver side could try this algorithm:

 - unpack each object in memory on the receiving side

	If the unpack failed, it must have been the SHA1 at the end, so 
	verify it!

 - if it's a delta object and you haven't seen the object it's a delta 
   against, keep it in memory.

 - if it's a non-delta object, just write it to the object store, and try 
   to resolve any delta objects you have pending that this new object 
   satisfies. That in turn creates other objects that may have more deltas 
   they satisfy etc.

which looks quite doable. The delta objects are small, so keeping them in 
memory shouldn't be a problem (especially since we _tend_ to write deltas 
after the object they depend on).
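
The bookkeeping for that receiver loop might look something like the
sketch below. It is a simplification under stated assumptions: small
integers stand in for SHA1 names, "stored" stands in for writing to the
object store, and a delta is given its final ID up front (in the real
format the ID is only known after the delta has been applied). The names
"struct receiver" and "receive_entry" are made up for the example.

```c
#include <stddef.h>

#define MAX_OBJECTS 64

struct incoming {
	int id;		/* stand-in for the object's SHA1 */
	int base;	/* id of the delta base, or -1 for a full object */
};

struct receiver {
	unsigned char stored[MAX_OBJECTS];	/* 1 once in the object store */
	struct incoming pending[MAX_OBJECTS];	/* deltas waiting for a base */
	size_t nr_pending;
};

/* Mark id as stored, then resolve pending deltas until nothing changes. */
void store_and_resolve(struct receiver *r, int id)
{
	int progress = 1;

	r->stored[id] = 1;
	while (progress) {
		size_t i = 0;

		progress = 0;
		while (i < r->nr_pending) {
			if (r->stored[r->pending[i].base]) {
				/* Base is available: the delta becomes an object. */
				r->stored[r->pending[i].id] = 1;
				r->pending[i] = r->pending[--r->nr_pending];
				progress = 1;
			} else {
				i++;
			}
		}
	}
}

/*
 * Feed one entry from the stream. Returns the number of deltas still
 * waiting for their base, or -1 if the entry does not fit the tables.
 */
int receive_entry(struct receiver *r, struct incoming e)
{
	if (e.id < 0 || e.id >= MAX_OBJECTS || e.base >= MAX_OBJECTS)
		return -1;
	if (e.base < 0 || r->stored[e.base]) {
		store_and_resolve(r, e.id);
	} else {
		if (r->nr_pending >= MAX_OBJECTS)
			return -1;
		r->pending[r->nr_pending++] = e;
	}
	return (int)r->nr_pending;
}
```

If anything is still pending when the trailing SHA1 arrives, the stream
referred to a base it never delivered, which a self-sufficient pack
should not do.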

I can certainly add an option to git-pack-file that disables writing of
the index file, and just writes the pack-file to stdout. I'm not sure I
want to write the "parse incoming pack-file" thing, but git-unpack-objects
comes _reasonably_ close (but right now it seeks around using the index
file to resolve deltas, instead of keeping them in memory and resolving
them when possible). But I can make the infrastructure ready for it.

Sounds like a plan.

			Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: CAREFUL! No more delta object support!
  2005-06-28 17:36             ` Linus Torvalds
@ 2005-06-28 18:17               ` Linus Torvalds
  2005-06-28 19:49                 ` Matthias Urlichs
                                   ` (2 more replies)
  0 siblings, 3 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-06-28 18:17 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Junio C Hamano, git



On Tue, 28 Jun 2005, Linus Torvalds wrote:
> 
> I can certainly add an option to git-pack-file that disables writing of
> the index file, and just writes the pack-file to stdout.

Done.

>						 I'm not sure I
> want to write the "parse incoming pack-file" thing, but git-unpack-objects
> comes _reasonably_ close (but right now it seeks around using the index
> file to resolve deltas, instead of keeping them in memory and resolving
> them when possible).

I'm still thinking about this one. I think I'll just do it.

One problem here is that since we don't know how big the incoming
pack-file will be, in a streaming input environment the receiver needs to
either make the pack-file reception be the last thing it sees, or it will
have to live with the fact that "git-unpack-objects" will read some more
than it needs before it notices that it got it all...

We can handle the latter either by padding (make the rule be that
git-unpack-file will always read in chunks of 4kB max, and pad the output
with 4kB of zero bytes or something, and then you can execute
git-unpack-objects and continue reading stdin afterwards, removing any
zeroes that git-unpack-file didn't eat), or by having git-unpack-objects 
flush anything after the final SHA1 to _its_ stdout, so that you can get 
the following data/commands in the stream from the unpack-file thing. 
Ugly, in any case.

		Linus

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: CAREFUL! No more delta object support!
  2005-06-28 18:17               ` Linus Torvalds
@ 2005-06-28 19:49                 ` Matthias Urlichs
  2005-06-28 20:18                   ` Matthias Urlichs
  2005-06-28 20:01                 ` Daniel Barkalow
  2005-06-29  3:53                 ` Linus Torvalds
  2 siblings, 1 reply; 38+ messages in thread
From: Matthias Urlichs @ 2005-06-28 19:49 UTC (permalink / raw)
  To: git

Hi, Linus Torvalds wrote:

> Ugly, in any case.

Why not chunk the thing?

In other words, the stream shouldn't be

	"here's a big-ass packfile of unknown size"

but an arbitrary number of

	"here's a N-byte sized chunk of the current pack file"
snippets, followed by a
	"here's the SHA1 of the whole thing"
packet.
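
One way such a chunked stream could be framed is sketched below. The
thread does not specify an encoding, so everything here is an
assumption: each chunk is prefixed with a 4-byte network-order length, a
zero length marks the end of the pack data, and the 20-byte SHA1 follows
the terminator. The function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Append one chunk (4-byte big-endian length + data); returns new offset. */
size_t put_chunk(unsigned char *out, size_t off, const void *data, uint32_t n)
{
	out[off++] = (unsigned char)(n >> 24);
	out[off++] = (unsigned char)(n >> 16);
	out[off++] = (unsigned char)(n >> 8);
	out[off++] = (unsigned char)n;
	if (n)
		memcpy(out + off, data, n);
	return off + n;
}

/* Append the terminator: a zero-length chunk followed by the SHA1. */
size_t put_trailer(unsigned char *out, size_t off, const unsigned char sha1[20])
{
	off = put_chunk(out, off, NULL, 0);
	memcpy(out + off, sha1, 20);
	return off + 20;
}

/*
 * Reassemble the pack data from a chunked stream.
 * Returns the number of stream bytes consumed, so the caller knows
 * exactly where the next command starts, or 0 if the input is
 * truncated or the pack buffer is too small.
 */
size_t read_chunked(const unsigned char *in, size_t len,
		    unsigned char *pack, size_t cap, size_t *pack_len,
		    unsigned char sha1[20])
{
	size_t off = 0, n_out = 0;

	for (;;) {
		uint32_t n;

		if (len - off < 4)
			return 0;
		n = ((uint32_t)in[off] << 24) | ((uint32_t)in[off + 1] << 16) |
		    ((uint32_t)in[off + 2] << 8) | (uint32_t)in[off + 3];
		off += 4;
		if (n == 0)
			break;		/* terminator reached */
		if (len - off < n || cap - n_out < n)
			return 0;
		memcpy(pack + n_out, in + off, n);
		off += n;
		n_out += n;
	}
	if (len - off < 20)
		return 0;
	memcpy(sha1, in + off, 20);
	*pack_len = n_out;
	return off + 20;
}
```

Because every read is bounded by a length that was announced in advance,
the receiver never consumes bytes that belong to whatever follows the
pack, which is the over-read problem discussed earlier in the thread.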

-- 
Matthias Urlichs   |   {M:U} IT Design @ m-u-it.de   |  smurf@smurf.noris.de
Disclaimer: The quote was selected randomly. Really. | http://smurf.noris.de
 - -
Be like a duck -- keep calm and unruffled on the surface but paddle like the
devil under water.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: CAREFUL! No more delta object support!
  2005-06-28 18:17               ` Linus Torvalds
  2005-06-28 19:49                 ` Matthias Urlichs
@ 2005-06-28 20:01                 ` Daniel Barkalow
  2005-06-29  3:53                 ` Linus Torvalds
  2 siblings, 0 replies; 38+ messages in thread
From: Daniel Barkalow @ 2005-06-28 20:01 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, git

On Tue, 28 Jun 2005, Linus Torvalds wrote:

> On Tue, 28 Jun 2005, Linus Torvalds wrote:
> > 
> > I can certainly add an option to git-pack-file that disables writing of
> > the index file, and just writes the pack-file to stdout.
> 
> Done.

What I actually meant was that it would be useful for git-ssh-push to be
able to pack stuff as a function call rather than execing an external
program, because just sticking git-ssh-push at the end of a pipeline
doesn't work if you don't remember what the remote side has.

> >						 I'm not sure I
> > want to write the "parse incoming pack-file" thing, but git-unpack-objects
> > comes _reasonably_ close (but right now it seeks around using the index
> > file to resolve deltas, instead of keeping them in memory and resolving
> > them when possible).
> 
> I'm still thinking about this one. I think I'll just do it.

One possibility would be to put a special type tag (like '\0') before the
hash, so that the format is more deterministic.

> One problem here is that since we don't know how big the incoming
> pack-file will be, in a streaming input environment the receiver needs to
> either make the pack-file reception be the last thing it sees, or it will
> have to live with the fact that "git-unpack-objects" will read some more
> than it needs before it notices that it got it all...

In a completely streaming environment, yes; but the receiving side is the
one sending commands, so you don't run into the next thing unless you're
overlapping requests. Failing that, we can just keep a 4k buffer of stuff
we've already read around; we don't have to worry about reading into
something we won't want to read at all.

	-Daniel
*This .sig left intentionally blank*

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: CAREFUL! No more delta object support!
  2005-06-28 19:49                 ` Matthias Urlichs
@ 2005-06-28 20:18                   ` Matthias Urlichs
  0 siblings, 0 replies; 38+ messages in thread
From: Matthias Urlichs @ 2005-06-28 20:18 UTC (permalink / raw)
  To: git

I wrote:

> Linus Torvalds wrote:
> 
>> Ugly, in any case.
> 
> Why not chunk the thing?

Having the number of files sent first would work too, I'd think.

I'm wary of trying to interpret something non-decompressible as a sha1
chunk, however -- the set of random bytes that, to zlib, look like a
sufficiently valid zip header that it wants to read more than 20 of them
before punting is certainly not zero.

-- 
Matthias Urlichs   |   {M:U} IT Design @ m-u-it.de   |  smurf@smurf.noris.de
Disclaimer: The quote was selected randomly. Really. | http://smurf.noris.de
 - -
I was sure the old fellow would never make it
to the other side of the curb when I struck him.

^ permalink raw reply	[flat|nested] 38+ messages in thread

* Re: CAREFUL! No more delta object support!
  2005-06-28 16:35           ` Benjamin LaHaise
@ 2005-06-28 20:30             ` Petr Baudis
  0 siblings, 0 replies; 38+ messages in thread
From: Petr Baudis @ 2005-06-28 20:30 UTC (permalink / raw)
  To: Benjamin LaHaise; +Cc: Christopher Li, Junio C Hamano, Linus Torvalds, git

Dear diary, on Tue, Jun 28, 2005 at 06:35:51PM CEST, I got a letter
where Benjamin LaHaise <bcrl@kvack.org> told me that...
> On Tue, Jun 28, 2005 at 04:52:56PM +0200, Petr Baudis wrote:
> > I think the git-*-pull tools are actually just fine. You will only need
> > to have some server-side CGI gadget to frontend the file, but we need
> > that anyway to make the pull reasonably effective.
> 
> Not really -- the use of rsync for the objects fails horribly on slow 
> links when the project scales in the number of commits.  The rsync 
> protocol has to transfer the names of each file and some information 
> about it, and that information isn't delta compressed.  This is where 
> kernel.org is falling over, as well as what makes the kernel tree very 
> painful to use over a dialup modem link.

Yes. But isn't that what I'm after all saying too? git-*-pull tools
shouldn't have that problem since they have much less overhead and only
pull stuff you need.

-- 
				Petr "Pasky" Baudis
Stuff: http://pasky.or.cz/
<Espy> be careful, some twit might quote you out of context..

^ permalink raw reply	[flat|nested] 38+ messages in thread

* [PATCH] Expose packed_git and alt_odb.
  2005-06-28  8:49 ` [PATCH] Adjust fsck-cache to packed GIT and alternate object pool Junio C Hamano
@ 2005-06-28 21:56   ` Junio C Hamano
  2005-06-28 21:58   ` [PATCH 3/3] Update fsck-cache (take 2) Junio C Hamano
  1 sibling, 0 replies; 38+ messages in thread
From: Junio C Hamano @ 2005-06-28 21:56 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

The commands git-fsck-cache and probably git-*-pull need to
have a way to enumerate objects contained in packed GIT archives
and alternate object pools.  This commit exposes the data
structure used to keep track of them from sha1_file.c, and adds
a couple of accessor interface functions for use by the enhanced
git-fsck-cache command.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

 cache.h     |   19 +++++++++++++++++++
 sha1_file.c |   43 ++++++++++++++++++++++++-------------------
 2 files changed, 43 insertions(+), 19 deletions(-)

da37711700d11f8c7f44fcb6819c724978c840b7
diff --git a/cache.h b/cache.h
--- a/cache.h
+++ b/cache.h
@@ -233,4 +233,23 @@ struct checkout {
 
 extern int checkout_entry(struct cache_entry *ce, struct checkout *state);
 
+extern struct alternate_object_database {
+	char *base;
+	char *name;
+} *alt_odb;
+extern void prepare_alt_odb(void);
+
+extern struct packed_git {
+	struct packed_git *next;
+	unsigned long index_size;
+	unsigned long pack_size;
+	unsigned int *index_base;
+	void *pack_base;
+	unsigned int pack_last_used;
+	char pack_name[0]; /* something like ".git/objects/pack/xxxxx.pack" */
+} *packed_git;
+extern void prepare_packed_git(void);
+extern int num_packed_objects(const struct packed_git *p);
+extern int nth_packed_object_sha1(const struct packed_git *, int, unsigned char*);
+
 #endif /* CACHE_H */
diff --git a/sha1_file.c b/sha1_file.c
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -184,10 +184,7 @@ char *sha1_file_name(const unsigned char
 	return base;
 }
 
-static struct alternate_object_database {
-	char *base;
-	char *name;
-} *alt_odb;
+struct alternate_object_database *alt_odb;
 
 /*
  * Prepare alternate object database registry.
@@ -205,13 +202,15 @@ static struct alternate_object_database 
  * pointed by base fields of the array elements with one xmalloc();
  * the string pool immediately follows the array.
  */
-static void prepare_alt_odb(void)
+void prepare_alt_odb(void)
 {
 	int pass, totlen, i;
 	const char *cp, *last;
 	char *op = NULL;
 	const char *alt = gitenv(ALTERNATE_DB_ENVIRONMENT) ? : "";
 
+	if (alt_odb)
+		return;
 	/* The first pass counts how large an area to allocate to
 	 * hold the entire alt_odb structure, including array of
 	 * structs and path buffers for them.  The second pass fills
@@ -258,8 +257,7 @@ static char *find_sha1_file(const unsign
 
 	if (!stat(name, st))
 		return name;
-	if (!alt_odb)
-		prepare_alt_odb();
+	prepare_alt_odb();
 	for (i = 0; (name = alt_odb[i].name) != NULL; i++) {
 		fill_sha1_path(name, sha1);
 		if (!stat(alt_odb[i].base, st))
@@ -271,15 +269,7 @@ static char *find_sha1_file(const unsign
 #define PACK_MAX_SZ (1<<26)
 static int pack_used_ctr;
 static unsigned long pack_mapped;
-static struct packed_git {
-	struct packed_git *next;
-	unsigned long index_size;
-	unsigned long pack_size;
-	unsigned int *index_base;
-	void *pack_base;
-	unsigned int pack_last_used;
-	char pack_name[0]; /* something like ".git/objects/pack/xxxxx.pack" */
-} *packed_git;
+struct packed_git *packed_git;
 
 struct pack_entry {
 	unsigned int offset;
@@ -430,7 +420,7 @@ static void prepare_packed_git_one(char 
 	}
 }
 
-static void prepare_packed_git(void)
+void prepare_packed_git(void)
 {
 	int i;
 	static int run_once = 0;
@@ -439,8 +429,7 @@ static void prepare_packed_git(void)
 		return;
 
 	prepare_packed_git_one(get_object_directory());
-	if (!alt_odb)
-		prepare_alt_odb();
+	prepare_alt_odb();
 	for (i = 0; alt_odb[i].base != NULL; i++) {
 		alt_odb[i].name[0] = 0;
 		prepare_packed_git_one(alt_odb[i].base);
@@ -750,6 +739,22 @@ static void *unpack_entry(struct pack_en
 	return unpack_non_delta_entry(pack+5, size, left);
 }
 
+int num_packed_objects(const struct packed_git *p)
+{
+	/* See check_packed_git_idx and pack-objects.c */
+	return (p->index_size - 20 - 20 - 4*256) / 24;
+}
+
+int nth_packed_object_sha1(const struct packed_git *p, int n,
+			   unsigned char* sha1)
+{
+	void *index = p->index_base + 256;
+	if (n < 0 || num_packed_objects(p) <= n)
+		return -1;
+	memcpy(sha1, (index + 24 * n + 4), 20);
+	return 0;
+}
+
 static int find_pack_entry_1(const unsigned char *sha1,
 			     struct pack_entry *e, struct packed_git *p)
 {
------------
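
The index arithmetic used by the two accessors above can be sketched in
isolation.  This is only an illustration, assuming the layout implied by
the patch: a 256-entry fan-out table of 4-byte counters, then one 24-byte
record per object (a 4-byte offset followed by a 20-byte SHA1), then two
20-byte trailers.  The names idx_num_objects and idx_nth_sha1 and the
IDX_* constants are hypothetical, and the sketch indexes through an
unsigned char pointer rather than doing arithmetic on a void pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum {
	IDX_FANOUT_BYTES = 4 * 256,  /* 256 four-byte fan-out counters */
	IDX_ENTRY_BYTES = 4 + 20,    /* pack offset followed by SHA1 */
	IDX_TRAILER_BYTES = 20 + 20  /* the patch subtracts 20 twice */
};

/* Number of records implied by the size of the index file. */
static size_t idx_num_objects(size_t index_size)
{
	assert(index_size >= IDX_FANOUT_BYTES + IDX_TRAILER_BYTES);
	return (index_size - IDX_FANOUT_BYTES - IDX_TRAILER_BYTES)
		/ IDX_ENTRY_BYTES;
}

/* Copy the SHA1 of record n into out; return -1 if n is out of range. */
static int idx_nth_sha1(const unsigned char *index, size_t index_size,
			size_t n, unsigned char out[20])
{
	if (n >= idx_num_objects(index_size))
		return -1;
	/* Skip the fan-out table, n whole records, and the 4-byte offset. */
	memcpy(out, index + IDX_FANOUT_BYTES + IDX_ENTRY_BYTES * n + 4, 20);
	return 0;
}
```

A caller would loop n from 0 up to idx_num_objects() and check each
returned SHA1, which is what the --full mode of fsck-cache does with the
real accessors.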


* [PATCH 3/3] Update fsck-cache (take 2)
  2005-06-28  8:49 ` [PATCH] Adjust fsck-cache to packed GIT and alternate object pool Junio C Hamano
  2005-06-28 21:56   ` [PATCH] Expose packed_git and alt_odb Junio C Hamano
@ 2005-06-28 21:58   ` Junio C Hamano
  1 sibling, 0 replies; 38+ messages in thread
From: Junio C Hamano @ 2005-06-28 21:58 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

The fsck-cache complains if objects referred to by files in
.git/refs/ or objects stored in files under .git/objects/??/ are
not found as stand-alone SHA1 files (i.e. they are found only in
alternate object pools GIT_ALTERNATE_OBJECT_DIRECTORIES or in
packed archives stored under .git/objects/pack).

Although these are good semantics for keeping a single
.git/objects directory consistent as a self-contained set of
objects, it is sometimes useful to treat the repository as OK
as long as these "outside" objects are available.

This commit introduces a new flag, --standalone, to
git-fsck-cache.  When it is not specified, connectivity checks
and .git/refs pointer checks are taught that it is OK when
expected objects do not exist under .git/objects/?? hierarchy
but are available from a packed archive or in an alternate
object pool.

Another new flag, --full, makes git-fsck-cache check not only
the current GIT_OBJECT_DIRECTORY but also objects found in
alternate object pools and packed GIT archives.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

*** This completes "the other half" that the fsck update I did
*** last night was missing.  Please discard that one and use this
*** instead.

 Documentation/git-fsck-cache.txt |   18 +++++++++-
 fsck-cache.c                     |   71 ++++++++++++++++++++++++++++++++------
 2 files changed, 76 insertions(+), 13 deletions(-)

5cae1fa43bfeae6722d916aa764fa75d9ce1839a
diff --git a/Documentation/git-fsck-cache.txt b/Documentation/git-fsck-cache.txt
--- a/Documentation/git-fsck-cache.txt
+++ b/Documentation/git-fsck-cache.txt
@@ -9,7 +9,7 @@ git-fsck-cache - Verifies the connectivi
 
 SYNOPSIS
 --------
-'git-fsck-cache' [--tags] [--root] [--unreachable] [--cache] [<object>*]
+'git-fsck-cache' [--tags] [--root] [--unreachable] [--cache] [--standalone | --full] [<object>*]
 
 DESCRIPTION
 -----------
@@ -37,6 +37,22 @@ OPTIONS
 	Consider any object recorded in the cache also as a head node for
 	an unreachability trace.
 
+--standalone::
+	Limit checks to the contents of GIT_OBJECT_DIRECTORY
+	(.git/objects), making sure that it is consistent and
+	complete without referring to objects found in alternate
+	object pools listed in GIT_ALTERNATE_OBJECT_DIRECTORIES,
+	nor packed GIT archives found in .git/objects/pack;
+	cannot be used with --full.
+
+--full::
+	Check not just objects in GIT_OBJECT_DIRECTORY
+	(.git/objects), but also the ones found in alternate
+	object pools listed in GIT_ALTERNATE_OBJECT_DIRECTORIES,
+	and in packed GIT archives found in .git/objects/pack
+	and corresponding pack subdirectories in alternate
+	object pools; cannot be used with --standalone.
+
 It tests SHA1 and general object sanity, and it does full tracking of
 the resulting reachability and everything else. It prints out any
 corruption it finds (missing or bad objects), and if you use the
diff --git a/fsck-cache.c b/fsck-cache.c
--- a/fsck-cache.c
+++ b/fsck-cache.c
@@ -12,6 +12,8 @@
 static int show_root = 0;
 static int show_tags = 0;
 static int show_unreachable = 0;
+static int standalone = 0;
+static int check_full = 0;
 static int keep_cache_objects = 0; 
 static unsigned char head_sha1[20];
 
@@ -25,13 +27,17 @@ static void check_connectivity(void)
 		struct object_list *refs;
 
 		if (!obj->parsed) {
-			printf("missing %s %s\n",
-			       obj->type, sha1_to_hex(obj->sha1));
+			if (!standalone && has_sha1_file(obj->sha1))
+				; /* it is in pack */
+			else
+				printf("missing %s %s\n",
+				       obj->type, sha1_to_hex(obj->sha1));
 			continue;
 		}
 
 		for (refs = obj->refs; refs; refs = refs->next) {
-			if (refs->item->parsed)
+			if (refs->item->parsed ||
+			    (!standalone && has_sha1_file(refs->item->sha1)))
 				continue;
 			printf("broken link from %7s %s\n",
 			       obj->type, sha1_to_hex(obj->sha1));
@@ -315,8 +321,11 @@ static int read_sha1_reference(const cha
 		return -1;
 
 	obj = lookup_object(sha1);
-	if (!obj)
+	if (!obj) {
+		if (!standalone && has_sha1_file(sha1))
+			return 0; /* it is in pack */
 		return error("%s: invalid sha1 pointer %.40s", path, hexname);
+	}
 
 	obj->used = 1;
 	mark_reachable(obj, REACHABLE);
@@ -366,10 +375,20 @@ static void get_default_heads(void)
 		die("No default references");
 }
 
+static void fsck_object_dir(const char *path)
+{
+	int i;
+	for (i = 0; i < 256; i++) {
+		static char dir[4096];
+		sprintf(dir, "%s/%02x", path, i);
+		fsck_dir(i, dir);
+	}
+	fsck_sha1_list();
+}
+
 int main(int argc, char **argv)
 {
 	int i, heads;
-	char *sha1_dir;
 
 	for (i = 1; i < argc; i++) {
 		const char *arg = argv[i];
@@ -390,17 +409,45 @@ int main(int argc, char **argv)
 			keep_cache_objects = 1;
 			continue;
 		}
+		if (!strcmp(arg, "--standalone")) {
+			standalone = 1;
+			continue;
+		}
+		if (!strcmp(arg, "--full")) {
+			check_full = 1;
+			continue;
+		}
 		if (*arg == '-')
-			usage("git-fsck-cache [--tags] [[--unreachable] [--cache] <head-sha1>*]");
+			usage("git-fsck-cache [--tags] [[--unreachable] [--cache] [--standalone | --full] <head-sha1>*]");
 	}
 
-	sha1_dir = get_object_directory();
-	for (i = 0; i < 256; i++) {
-		static char dir[4096];
-		sprintf(dir, "%s/%02x", sha1_dir, i);
-		fsck_dir(i, dir);
+	if (standalone && check_full)
+		die("Only one of --standalone or --full can be used.");
+	if (standalone)
+		unsetenv("GIT_ALTERNATE_OBJECT_DIRECTORIES");
+
+	fsck_object_dir(get_object_directory());
+	if (check_full) {
+		int j;
+		struct packed_git *p;
+		prepare_alt_odb();
+		for (j = 0; alt_odb[j].base; j++) {
+			alt_odb[j].name[-1] = 0; /* was slash */
+			fsck_object_dir(alt_odb[j].base);
+			alt_odb[j].name[-1] = '/';
+		}
+		prepare_packed_git();
+		for (p = packed_git; p; p = p->next) {
+			int num = num_packed_objects(p);
+			for (i = 0; i < num; i++) {
+				unsigned char sha1[20];
+				nth_packed_object_sha1(p, i, sha1);
+				if (fsck_sha1(sha1) < 0)
+					fprintf(stderr, "bad sha1 entry '%s'\n", sha1_to_hex(sha1));
+
+			}
+		}
 	}
-	fsck_sha1_list();
 
 	heads = 0;
 	for (i = 1; i < argc; i++) {
------------
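
The decision the patch makes for an object that was not parsed from the
local .git/objects/?? hierarchy can be summarised in a few lines.  This
is a sketch of the rule only, not the fsck-cache code: struct fsck_mode,
fsck_mode_valid and report_missing are hypothetical names, and
available_elsewhere stands in for what has_sha1_file() answers when the
object lives in a pack or an alternate pool.

```c
#include <assert.h>
#include <stdbool.h>

struct fsck_mode {
	bool standalone;  /* --standalone was given */
	bool full;        /* --full was given */
};

/* The two flags are mutually exclusive. */
static bool fsck_mode_valid(struct fsck_mode m)
{
	return !(m.standalone && m.full);
}

/*
 * Should an object be reported as missing?  Without --standalone an
 * object that is absent locally but present in a pack or an alternate
 * object pool is accepted silently.
 */
static bool report_missing(struct fsck_mode m, bool parsed_locally,
			   bool available_elsewhere)
{
	assert(fsck_mode_valid(m));
	if (parsed_locally)
		return false;
	if (!m.standalone && available_elsewhere)
		return false;
	return true;
}
```

Note that --full does not change this rule; it only widens the set of
objects that get checked in the first place.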


* [PATCH] Emit base objects of a delta chain when the delta is output.
  2005-06-28 16:45       ` Linus Torvalds
@ 2005-06-29  0:49         ` Junio C Hamano
  0 siblings, 0 replies; 38+ messages in thread
From: Junio C Hamano @ 2005-06-29  0:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: git

>>>>> "LT" == Linus Torvalds <torvalds@osdl.org> writes:

LT> While adding a new object to a pack file is _possible_ (you add it to the
LT> end of the pack-file, and re-generate the index file), I would strongly
LT> suggest against it for several reasons:

OK, people have convinced me not to dream on ;-).

LT> Btw, I'm not claiming that my current pack format is "optimal" of course.  
LT> For example, while I write all objects in recency order, right now that
LT> means that if a recent object has been written as a delta that depends on
LT> an older one, I actually write the delta first (correct) but I won't write
LT> the older object until its recency ordering (wrong).

I agree.  

How does this one look?  Lightly tested by packing, unpacking
without -n and fsck'ing, and also by not unpacking but placing
the pack under .git/objects/pack and running fsck with --full,
all using the current GIT repo.

------------
Deltas are useless by themselves and when you use them you need
to get to their base objects.  A base object should inherit
recency from the most recent deltified object that is based on
it and that is what this patch teaches git-pack-objects.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---
cd /opt/packrat/playpen/public/in-place/git/git.junio/
jit-diff
# - master: Use enhanced diff_delta() in the similarity estimator.
# + (working tree)
diff --git a/pack-objects.c b/pack-objects.c
--- a/pack-objects.c
+++ b/pack-objects.c
@@ -118,6 +118,23 @@ static unsigned long write_object(struct
 	return hdrlen + datalen;
 }
 
+static unsigned long write_one(struct sha1file *f,
+			       struct object_entry *e,
+			       unsigned long offset)
+{
+	if (e->offset)
+		/* offset starts from header size and cannot be zero
+		 * if it is written already.
+		 */
+		return offset;
+	e->offset = offset;
+	offset += write_object(f, e);
+	/* if we are deltified, write out its base object. */
+	if (e->delta)
+		offset = write_one(f, e->delta, offset);
+	return offset;
+}
+
 static void write_pack_file(void)
 {
 	int i;
@@ -135,11 +152,9 @@ static void write_pack_file(void)
 	hdr.hdr_entries = htonl(nr_objects);
 	sha1write(f, &hdr, sizeof(hdr));
 	offset = sizeof(hdr);
-	for (i = 0; i < nr_objects; i++) {
-		struct object_entry *entry = objects + i;
-		entry->offset = offset;
-		offset += write_object(f, entry);
-	}
+	for (i = 0; i < nr_objects; i++)
+		offset = write_one(f, objects + i, offset);
+
 	sha1close(f, pack_file_sha1, 1);
 	mb = offset >> 20;
 	offset &= 0xfffff;

Compilation finished at Tue Jun 28 17:43:31
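
The effect of write_one() on output order can be seen with a toy model
that leaves out the actual writing.  This is a sketch under stated
assumptions: struct entry, emit_one and emit_all are hypothetical names,
objects are identified by their position in a recency-sorted array, a
separate written flag replaces the "offset is nonzero" test of the
patch, and delta chains are assumed to be free of cycles.

```c
#include <assert.h>
#include <stddef.h>

struct entry {
	int base;     /* index of the delta base, or -1 if stored whole */
	int written;  /* nonzero once the entry has been emitted */
};

/* Emit entry i, then the chain of bases it depends on. */
static size_t emit_one(struct entry *objs, int i, int *order, size_t n)
{
	if (objs[i].written)
		return n;
	objs[i].written = 1;
	order[n++] = i;
	if (objs[i].base >= 0)
		n = emit_one(objs, objs[i].base, order, n);
	return n;
}

/* Walk the array in recency order; bases inherit their delta's slot. */
static size_t emit_all(struct entry *objs, size_t count, int *order)
{
	size_t n = 0;
	for (size_t i = 0; i < count; i++)
		n = emit_one(objs, (int)i, order, n);
	assert(n == count);
	return n;
}
```

With four objects where 0 is a delta against 2 and 2 is a delta against
3, the resulting order is 0, 2, 3, 1: the old bases move up next to the
recent delta instead of waiting for their own recency slot.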


* Re: CAREFUL! No more delta object support!
  2005-06-28 18:17               ` Linus Torvalds
  2005-06-28 19:49                 ` Matthias Urlichs
  2005-06-28 20:01                 ` Daniel Barkalow
@ 2005-06-29  3:53                 ` Linus Torvalds
  2 siblings, 0 replies; 38+ messages in thread
From: Linus Torvalds @ 2005-06-29  3:53 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Junio C Hamano, git



On Tue, 28 Jun 2005, Linus Torvalds wrote:
> 
> >						 I'm not sure I
> > want to write the "parse incoming pack-file" thing, but git-unpack-objects
> > comes _reasonably_ close (but right now it seeks around using the index
> > file to resolve deltas, instead of keeping them in memory and resolving
> > them when possible).
> 
> I'm still thinking about this one. I think I'll just do it.

Ok, done. I had to basically rewrite that unpacking logic, but the end 
result is actually slightly smaller and cleaner, and it can now unpack 
from a stream. That stream reading logic that uncompresses directly from 
the stream buffer might be considered a bit too subtle (and somebody 
should really double-check it), but hey, it works for me.

In fact, I just did this:

	#
	# Create empty git archive "~/unpack"	
	#
	mkdir ~/unpack
	cd ~/unpack
	git-init-db

	#
	# Copy the git archive there over a pipe
	#
	cd ~/git
	git-rev-list --objects HEAD | git-pack-objects --depth=50 --window=50 --stdout | (cd ~/unpack ; git-unpack-objects)

	#
	# Go to new archive, set up the head, and fsck to verify
	#
	cd ~/unpack
	cat ~/git/.git/HEAD > .git/HEAD 
	git-fsck-cache --unreachable

Now, the above is a silly example, since I _could_ just have moved the
pack file into .git/objects/pack, but that was not the point of this whole
thing. The point was to do what a "git-ssh-push" would basically boil down
to.

I'd like somebody who knows zlib intimately to take a look at how I do the 
streaming input thing (in particular, the "use(len - stream.avail_in);" 
part in the inflate loop in the "get_data()" function).

			Linus


* Re: CAREFUL! No more delta object support!
  2005-06-28  2:13   ` CAREFUL! No more delta object support! Linus Torvalds
  2005-06-28  2:32     ` Junio C Hamano
  2005-06-28  5:09     ` Daniel Barkalow
@ 2005-06-29 18:59     ` Linus Torvalds
  2005-06-29 21:05       ` Daniel Barkalow
  2 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2005-06-29 18:59 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Git Mailing List, Daniel Barkalow



On Mon, 27 Jun 2005, Linus Torvalds wrote:
> 
> On Mon, 27 Jun 2005, Junio C Hamano wrote:
> > 
> > Shouldn't feeding "git-rev-list --objects" output plus
> > handcrafted list of objects in 2.6.11 tree object to
> > git-pack-objects just work???
> 
> You could do that. And yes, we can add support for "tag" objects too 
> (which the packing doesn't do at all right now). So this is not a 
> "fundamental" problem, it's just a practical one right now.

Ok, I've added the logic to "git-rev-list --objects" to handle arbitrary 
object dependencies.

So you can do things like this, if you want to:

	git-rev-list --objects HEAD ^v2.6.11-tree

which basically generates the complete list of every object reachable from 
HEAD, but not reachable from the v2.6.11 tree. It also understands about 
tags, so if you do

	git-rev-list --objects v2.6.12 ^v2.6.11-tree

the end result will have the "v2.6.12" tag in it (along with all the
objects reachable from it, but not reachable from v2.6.11-tree).

What does this mean? It means that you can do a "push" from repository "a" 
to repository "b" by doing

 - in "b", do

	refs_in_b=($(find .git/refs -type f | xargs cat))


 - in "a" do

	refs_in_a=($(find .git/refs -type f | xargs cat))

 - then, in "a", do

	git-rev-list --objects "${refs_in_a[@]}" --not "${refs_in_b[@]}" |
		git-pack-objects --stdout > push.pack

   to generate the objects pack in "push.pack"

 - then, in "b", do

	git-unpack-objects < push.pack

and you now have moved over _all_ the objects that were referenced in "a",
but not in "b". Including tags etc. So after that last stage, when you've
unpacked the objects, the only thing left to do is to make the refs in "b"  
point to the new references from "a" (which basically boils down to a
"cp", except it would be good to verify that the refs in "b" still have
the same values as they did before we did the object push).
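
The set that ends up in push.pack is "everything reachable from the refs
in a, minus everything reachable from the refs in b".  A minimal sketch
of that computation on a toy graph, assuming each node references at
most two others; struct node, the flag names and objects_to_send are
hypothetical and are not how git-rev-list is implemented:

```c
#include <assert.h>
#include <stddef.h>

enum { UNINTERESTING = 1, WANTED = 2 };

struct node {
	int parent[2];   /* indices of referenced nodes, -1 when unused */
	unsigned flags;
};

/* Flag node i and everything reachable from it as already present. */
static void mark_uninteresting(struct node *g, int i)
{
	if (i < 0 || (g[i].flags & UNINTERESTING))
		return;
	g[i].flags |= UNINTERESTING;
	mark_uninteresting(g, g[i].parent[0]);
	mark_uninteresting(g, g[i].parent[1]);
}

/* Collect node i and its ancestry, stopping at flagged nodes. */
static size_t collect(struct node *g, int i, int *out, size_t n)
{
	if (i < 0 || (g[i].flags & (UNINTERESTING | WANTED)))
		return n;
	g[i].flags |= WANTED;
	out[n++] = i;
	n = collect(g, g[i].parent[0], out, n);
	n = collect(g, g[i].parent[1], out, n);
	return n;
}

/* have: refs of the receiver; want: refs of the sender. */
static size_t objects_to_send(struct node *g,
			      const int *have, size_t nhave,
			      const int *want, size_t nwant, int *out)
{
	size_t n = 0;
	assert(g && out);
	for (size_t i = 0; i < nhave; i++)
		mark_uninteresting(g, have[i]);
	for (size_t i = 0; i < nwant; i++)
		n = collect(g, want[i], out, n);
	return n;
}
```

Because every node is collected at most once, passing all refs of "a" in
a single call lists each shared object once, which is the "all in one
go" case.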

Daniel (or anybody else), interested? Please?

Of course, you can do this one branch at a time, too, if you want to, but
the above was meant as an example of how you can actually do all the
branches in one single pack-file, which is a lot more efficient (if you do
it one branch at a time, you'll quite possibly end up transferring objects
that are reachable in other branches multiple times, while the "all in one
go" thing will pack each object just once).

Now, have I actually _tested_ the above? Hell no. But all the heavy 
lifting should now be done for doing an efficient "git push" that pushes 
all branches in one go (or one at a time, it's your choice on how you end 
up using git-rev-list).

		Linus


* Re: CAREFUL! No more delta object support!
  2005-06-29 18:59     ` Linus Torvalds
@ 2005-06-29 21:05       ` Daniel Barkalow
  2005-06-29 21:38         ` Linus Torvalds
  0 siblings, 1 reply; 38+ messages in thread
From: Daniel Barkalow @ 2005-06-29 21:05 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List

On Wed, 29 Jun 2005, Linus Torvalds wrote:

> and you now have moved over _all_ the objects that were referenced in "a",
> but not in "b". Including tags etc. So after that last stage, when you've
> unpacked the objects, the only thing left to do is to make the refs in "b"  
> point to the new references from "a" (which basically boils down to a
> "cp", except it would be good to verify that the refs in "b" still have
> the same values as they did before we did the object push).
> 
> Daniel (or anybody else), interested? Please?

I'll probably get to this over the weekend.

> Of course, you can do this one branch at a time, too, if you want to, but
> the above was meant as an example of how you can actually do all the
> branches in one single pack-file, which is a lot more efficient (if you do
> it one branch at a time, you'll quite possibly end up transferring objects
> that are reachable in other branches multiple times, while the "all in one
> go" thing will pack each object just once).

It should transfer each only once if you recalculate "refs_in_b" after
each push, right? Or is the marking for "--objects ^commit" still not
tight wrt object and tree files? I think branch-at-a-time is preferable
for the case where the source doesn't want to send quite everything, and
the target doesn't necessarily want everything named the same.

> Now, have I actually _tested_ the above? Hell no. But all the heavy 
> lifting should now be done for doing an efficient "git push" that pushes 
> all branches in one go (or one at a time, it's your choice on how you end 
> up using git-rev-list).

The one thing I can think of is whether things will blow up if the target
repository has heads that aren't in the source, at which point the source
has no clue what to exclude. I.e.:

parent -- new-b
  \
   new-a

If I've moved the head on b forward to new-b, and a wants to push new-a
(as a new branch, perhaps), refs_in_b has only new-b, refs_in_a has parent
and new-a, and git-rev-list in a can't see that b has parent (and
everything upwards of that). You probably just don't want to do this, but
I bet that some people will (e.g. projects that synchronize through a
shared-owner repository).
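
To make the scenario concrete, here is a small sketch with commits
numbered as bits: parent is 0, new-a is 1, and new-b is 2 but exists
only in b.  The function to_send and the closure table (each commit plus
all its ancestors, as a bitmask) are hypothetical; the point is only
that a ref the sender has never seen cannot exclude anything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * closure[i] has bit j set when commit j is commit i or one of its
 * ancestors.  Only the first nknown commits exist on the sending side.
 */
static uint32_t to_send(const uint32_t *closure, size_t nknown,
			const int *want, size_t nwant,
			const int *have, size_t nhave)
{
	uint32_t send = 0, exclude = 0;

	assert(nknown <= 32);
	for (size_t i = 0; i < nwant; i++) {
		assert(want[i] >= 0 && (size_t)want[i] < nknown);
		send |= closure[want[i]];
	}
	for (size_t i = 0; i < nhave; i++) {
		/* An unknown remote head gives us nothing to exclude. */
		if (have[i] < 0 || (size_t)have[i] >= nknown)
			continue;
		exclude |= closure[have[i]];
	}
	return send & ~exclude;
}
```

With closure = { 0x1, 0x3 }, want = { 0, 1 } and have = { 2 }, the
result is 0x3: parent gets sent again even though b already has it.  If
b had still advertised parent, have = { 0 } would have reduced the
result to 0x2, i.e. just new-a.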

	-Daniel
*This .sig left intentionally blank*


* Re: CAREFUL! No more delta object support!
  2005-06-29 21:05       ` Daniel Barkalow
@ 2005-06-29 21:38         ` Linus Torvalds
  2005-06-29 22:24           ` Daniel Barkalow
  0 siblings, 1 reply; 38+ messages in thread
From: Linus Torvalds @ 2005-06-29 21:38 UTC (permalink / raw)
  To: Daniel Barkalow; +Cc: Junio C Hamano, Git Mailing List



On Wed, 29 Jun 2005, Daniel Barkalow wrote:
> 
> > Of course, you can do this one branch at a time, too, if you want to, but
> > the above was meant as an example of how you can actually do all the
> > branches in one single pack-file, which is a lot more efficient (if you do
> > it one branch at a time, you'll quite possibly end up transferring objects
> > that are reachable in other branches multiple times, while the "all in one
> > go" thing will pack each object just once).
> 
> It should transfer each only once if you recalculate "refs_in_b" after
> each push, right?

Yes, you can do it that way too. It will possibly not pack as well due to
giving you fewer opportunities for deltas, but that's likely not a huge 
issue.

> The one thing I can think of is whether things will blow up if the target
> repository has heads that aren't in the source

Right. I think that's a "feature" of pushing: you cannot push to an 
archive that has state that you don't know about. Ie you can only push to 
something that is a proper subset of what you are (on a per-branch basis, 
of course - not necessarily on a "global" stage - so you could push just 
_one_ branch, even if another branch was ahead of where you are).
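
Stated as a check, the per-branch rule is that the remote branch head
must be something the pusher knows and already has in the history of
what it pushes.  A hedged sketch, reusing the bitmask idea where
closure[i] is commit i plus all its ancestors; push_allowed is a
hypothetical name, and creating a brand-new branch on the remote is
deliberately left out:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * A push of new_head onto a branch currently at remote_head is allowed
 * only if remote_head is known locally and is part of new_head's
 * history.
 */
static bool push_allowed(const uint32_t *closure, size_t nknown,
			 int remote_head, int new_head)
{
	assert(nknown <= 32);
	assert(new_head >= 0 && (size_t)new_head < nknown);
	if (remote_head < 0 || (size_t)remote_head >= nknown)
		return false;  /* remote has state we do not know about */
	return (closure[new_head] >> remote_head) & 1u;
}
```

The check is made for the one branch being pushed, so another remote
branch being ahead of the pusher does not get in the way.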

			Linus


* Re: CAREFUL! No more delta object support!
  2005-06-29 21:38         ` Linus Torvalds
@ 2005-06-29 22:24           ` Daniel Barkalow
  0 siblings, 0 replies; 38+ messages in thread
From: Daniel Barkalow @ 2005-06-29 22:24 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List

On Wed, 29 Jun 2005, Linus Torvalds wrote:

> On Wed, 29 Jun 2005, Daniel Barkalow wrote:
> > The one thing I can think of is whether things will blow up if the target
> > repository has heads that aren't in the source
> 
> Right. I think that's a "feature" of pushing: you cannot push to an 
> archive that has state that you don't know about. Ie you can only push to 
> something that is a proper subset of what you are (on a per-branch basis, 
> of course - not necessarily on a "global" stage - so you could push just 
> _one_ branch, even if another branch was ahead of where you are).

The issue is really distinguishing the "other" branches I don't care about
from the one that I do care about. With -w, I almost certainly care about
the ref I'm writing, but that doesn't help for refs that are new (new
branches or tags), for which I care about some other thing. Also, the
failure is a bit hard to detect, I think, in that I could find I do
recognize some ancient thing that's barely useful for exclusion, and miss
something that should exclude almost everything but it's been updated. In
any case, when things go wrong we simply send stuff the recipient already
has, so it's not the end of the world. (And there's probably some clever
way of dealing with it)

	-Daniel
*This .sig left intentionally blank*


Thread overview: 38+ messages
2005-06-28  1:14 CAREFUL! No more delta object support! Linus Torvalds
2005-06-27 23:58 ` Christopher Li
2005-06-28  3:30   ` Linus Torvalds
2005-06-28  9:40     ` Junio C Hamano
2005-06-28 11:06       ` Christopher Li
2005-06-28 14:52         ` Petr Baudis
2005-06-28 16:35           ` Benjamin LaHaise
2005-06-28 20:30             ` Petr Baudis
2005-06-28 14:46       ` Jan Harkes
2005-06-28 10:38     ` Christopher Li
2005-06-28 16:45       ` Linus Torvalds
2005-06-29  0:49         ` [PATCH] Emit base objects of a delta chain when the delta is output Junio C Hamano
2005-06-28  2:01 ` CAREFUL! No more delta object support! Junio C Hamano
2005-06-28  2:03   ` [PATCH] Skip writing out sha1 files for objects in packed git Junio C Hamano
2005-06-28  2:43     ` Linus Torvalds
2005-06-28  3:33       ` Junio C Hamano
2005-06-28 15:45         ` Linus Torvalds
2005-06-28  2:13   ` CAREFUL! No more delta object support! Linus Torvalds
2005-06-28  2:32     ` Junio C Hamano
2005-06-28  2:37       ` [PATCH] Adjust to git-init-db creating $GIT_OBJECT_DIRECTORY/pack Junio C Hamano
2005-06-28  2:48       ` CAREFUL! No more delta object support! Linus Torvalds
2005-06-28  5:09     ` Daniel Barkalow
2005-06-28 15:49       ` Linus Torvalds
2005-06-28 16:21         ` Linus Torvalds
2005-06-28 17:04           ` Daniel Barkalow
2005-06-28 17:36             ` Linus Torvalds
2005-06-28 18:17               ` Linus Torvalds
2005-06-28 19:49                 ` Matthias Urlichs
2005-06-28 20:18                   ` Matthias Urlichs
2005-06-28 20:01                 ` Daniel Barkalow
2005-06-29  3:53                 ` Linus Torvalds
2005-06-29 18:59     ` Linus Torvalds
2005-06-29 21:05       ` Daniel Barkalow
2005-06-29 21:38         ` Linus Torvalds
2005-06-29 22:24           ` Daniel Barkalow
2005-06-28  8:49 ` [PATCH] Adjust fsck-cache to packed GIT and alternate object pool Junio C Hamano
2005-06-28 21:56   ` [PATCH] Expose packed_git and alt_odb Junio C Hamano
2005-06-28 21:58   ` [PATCH 3/3] Update fsck-cache (take 2) Junio C Hamano
