* git-index-pack really does suck..
@ 2007-04-03 15:15 Linus Torvalds
0 siblings, 5 replies; 58+ messages in thread
From: Linus Torvalds @ 2007-04-03 15:15 UTC (permalink / raw)
To: Junio C Hamano, Nicolas Pitre, Git Mailing List
Junio, Nico,
I think we need to do something about it.
CLee was complaining about git-index-pack on #irc with the partial KDE
repo, and while I don't have the KDE repo, I decided to investigate a bit.
Even with just the kernel repo (with a single 170MB pack-file), I can do
git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack
and it uses 52s of CPU-time, and on my 4GB machine it actually started
doing IO and swapping, because git-index-pack grew to 4.8GB in size. So
while I initially thought I'd want a bigger test-case to see the problem,
I sure as heck don't.
The 52s of CPU time exploded into almost three minutes of actual
real-time:
47.33user 5.79system 2:41.65elapsed 32%CPU
2117major+1245763minor
And that's on a good system with a powerful CPU, "enough memory" for any
reasonable development, and good disks! Very much ungood-plus-plus.
I haven't looked into exactly why yet, but I bet it's just that we keep
every single object expanded in memory. We do need to keep the objects
around, so that we can resolve deltas, but we can certainly do it other
ways.
Two suggestions for other ways:
- simple one: don't keep unexploded objects around, just keep the deltas,
and spend tons of CPU-time just re-expanding them if required.
We *should* be able to do it with just keeping the original 170MB
pack-file in memory, not expanding it to 3.8GB!
Still, even this will be painful once you have a big pack-file, and the
CPU waste is nasty (although a delta-base cache like we do in
sha1_file.c would probably fix it 99% - at that point it's getting
less simple, and the "best" solution below looks more palatable)
- best one: when writing out the pack-file, we incrementally keep a
"struct packed_git" around, and update the index for it dynamically,
and totally get rid of all objects that we've written out, because we
can re-create them.
This means that we should have _zero_ memory footprint except for the
one object that we're working on right then and there, and any
unresolved deltas where we've not seen the base at all (and the latter
generally shouldn't happen any more with most pack-files)
The "best one" wouldn't seem to be *that* painful, but as mentioned, I
haven't even started looking at the code yet, I thought I'd try to rope
Nico into looking at this first ;)
Linus
* Re: git-index-pack really does suck..
From: Linus Torvalds @ 2007-04-03 16:21 UTC (permalink / raw)
To: Junio C Hamano, Nicolas Pitre, Git Mailing List

On Tue, 3 Apr 2007, Linus Torvalds wrote:
>
> and it uses 52s of CPU-time, and on my 4GB machine it actually started
> doing IO and swapping, because git-index-pack grew to 4.8GB in size.

Ahh. False alarm.

The problem is actually largely a really stupid memory leak in the SHA1
collision checking (which wouldn't trigger on a normal pull, but obviously
does trigger for every single object when testing!)

This trivial patch fixes most of it. git-index-pack still uses too much
memory, but it does a *lot* better.

Junio, please get this into 1.5.1 (I *think* the SHA1 checking is new, but
if it exists in 1.5.0 too, it obviously needs the same fix).

It still grows, but it grew to just 287M in size now for the 170M kernel
object:

	41.59user 1.39system 0:43.64elapsed
	0major+73552minor

which is quite a lot better.

Duh.

		Linus

---
 index-pack.c | 1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/index-pack.c b/index-pack.c
index 6284fe3..3c768fb 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -358,6 +358,7 @@ static void sha1_object(const void *data, unsigned long size,
 		if (size != has_size || type != has_type ||
 		    memcmp(data, has_data, size) != 0)
 			die("SHA1 COLLISION FOUND WITH %s !", sha1_to_hex(sha1));
+		free(has_data);
 	}
 }
* Re: git-index-pack really does suck..
From: Nicolas Pitre @ 2007-04-03 16:40 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List

On Tue, 3 Apr 2007, Linus Torvalds wrote:
>
> On Tue, 3 Apr 2007, Linus Torvalds wrote:
> >
> > and it uses 52s of CPU-time, and on my 4GB machine it actually started
> > doing IO and swapping, because git-index-pack grew to 4.8GB in size.
>
> Ahh. False alarm.
>
> The problem is actually largely a really stupid memory leak in the SHA1
> collision checking (which wouldn't trigger on a normal pull, but obviously
> does trigger for every single object when testing!)

Damn! Please don't report those things when I'm out for lunch so I could
have a chance to fix my own stupidities myself in time! :-)

> This trivial patch fixes most of it. git-index-pack still uses too much
> memory, but it does a *lot* better.
>
> Junio, please get this into 1.5.1 (I *think* the SHA1 checking is new, but
> if it exists in 1.5.0 too, it obviously needs the same fix).

No, it is new to 1.5.1.

> It still grows, but it grew to just 287M in size now for the 170M kernel
> object:
>
> 	41.59user 1.39system 0:43.64elapsed
> 	0major+73552minor
>
> which is quite a lot better.
>
> Duh.

Indeed.

Acked-by: Nicolas Pitre <nico@cam.org>

>  index-pack.c | 1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/index-pack.c b/index-pack.c
> index 6284fe3..3c768fb 100644
> --- a/index-pack.c
> +++ b/index-pack.c
> @@ -358,6 +358,7 @@ static void sha1_object(const void *data, unsigned long size,
>  		if (size != has_size || type != has_type ||
>  		    memcmp(data, has_data, size) != 0)
>  			die("SHA1 COLLISION FOUND WITH %s !", sha1_to_hex(sha1));
> +		free(has_data);
>  	}
>  }

Nicolas
* Re: git-index-pack really does suck..
From: Nicolas Pitre @ 2007-04-03 16:33 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List

On Tue, 3 Apr 2007, Linus Torvalds wrote:
>
> Junio, Nico,
>  I think we need to do something about it.

Sure. Mea culpa.

> CLee was complaining about git-index-pack on #irc with the partial KDE
> repo, and while I don't have the KDE repo, I decided to investigate a bit.
>
> Even with just the kernel repo (with a single 170MB pack-file), I can do
>
> 	git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack
>
> and it uses 52s of CPU-time, and on my 4GB machine it actually started
> doing IO and swapping, because git-index-pack grew to 4.8GB in size.

Right.

> Two suggestions for other ways:
>
>  - simple one: don't keep unexploded objects around, just keep the deltas,
>    and spend tons of CPU-time just re-expanding them if required.
>
>    We *should* be able to do it with just keeping the original 170MB
>    pack-file in memory, not expanding it to 3.8GB!
>
>    Still, even this will be painful once you have a big pack-file, and the
>    CPU waste is nasty (although a delta-base cache like we do in
>    sha1_file.c would probably fix it 99% - at that point it's getting
>    less simple, and the "best" solution below looks more palatable)
>
>  - best one: when writing out the pack-file, we incrementally keep a
>    "struct packed_git" around, and update the index for it dynamically,
>    and totally get rid of all objects that we've written out, because we
>    can re-create them.
>
>    This means that we should have _zero_ memory footprint except for the
>    one object that we're working on right then and there, and any
>    unresolved deltas where we've not seen the base at all (and the latter
>    generally shouldn't happen any more with most pack-files)

Even better:

 - Fix my own stupid mistake with a _single_ line of code:

diff --git a/index-pack.c b/index-pack.c
index 6284fe3..3c768fb 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -358,6 +358,7 @@ static void sha1_object(const void *data, unsigned long size,
 		if (size != has_size || type != has_type ||
 		    memcmp(data, has_data, size) != 0)
 			die("SHA1 COLLISION FOUND WITH %s !", sha1_to_hex(sha1));
+		free(has_data);
 	}
 }

The thing is, that code path is executed _only_ when index-pack
encounters an object already in the repository, in order to protect
against possible SHA1 collision attacks. See the commit 8685da42561d log
for the full story.

Normally this should not happen in normal usage scenarios, because the
objects you fetch are those that you don't already have. But if you
manually run index-pack inside an existing repository, then you already
have _all_ those objects, which explains the high CPU usage. That is no
excuse for not freeing the data, though.

Nicolas
* Re: git-index-pack really does suck..
From: Chris Lee @ 2007-04-03 19:27 UTC (permalink / raw)
To: Linus Torvalds, git

On 4/3/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Junio, Nico,
>  I think we need to do something about it.
>
> CLee was complaining about git-index-pack on #irc with the partial KDE
> repo, and while I don't have the KDE repo, I decided to investigate a bit.
>
> Even with just the kernel repo (with a single 170MB pack-file), I can do
>
> 	git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack
>
> and it uses 52s of CPU-time, and on my 4GB machine it actually started
> doing IO and swapping, because git-index-pack grew to 4.8GB in size. So
> while I initially thought I'd want a bigger test-case to see the problem,
> I sure as heck don't.

There's another issue here.

I'm running git-index-pack as part of a workflow like so:

$ git-verify-pack -v .git/objects/pack/*.idx > /tmp/all-objects
$ grep 'blob' /tmp/all-objects > /tmp/blob-objects
$ cat /tmp/blob-objects | awk '{print $1;}' | git-pack-objects
--delta-base-offset --all-progress --stdout > blob.pack
$ git-index-pack -v blob.pack

Now, when I run 'git-index-pack' on blob.pack in the current
directory, memory usage is pretty horrific (even with the applied
patch to not leak everything). Shawn tells me that index-pack
should only be decompressing the object twice - once from the repo and
once from blob.pack - iff I call git-index-pack with --stdin, which I
am not.

If I move the blob.pack into /tmp, and run git-index-pack on it there,
it completes much faster and the memory usage never exceeds 200MB.
(Inside the repo, it takes up over 3G of RES according to top.)

By "much faster", I mean: the entire pack was indexed and completed in
17:42.40, whereas I cancelled the inside-the-repo index because after
56 minutes it was only at 46%.

(And, as far as getting this huge repo published - I have it burned to
a DVD, and I'm going to drop it in the mail to hpa today on my lunch
break, which I should be taking soon.)

-clee
* Re: git-index-pack really does suck..
From: Nicolas Pitre @ 2007-04-03 19:49 UTC (permalink / raw)
To: Chris Lee; +Cc: Linus Torvalds, git

On Tue, 3 Apr 2007, Chris Lee wrote:

> There's another issue here.
>
> I'm running git-index-pack as part of a workflow like so:
>
> $ git-verify-pack -v .git/objects/pack/*.idx > /tmp/all-objects
> $ grep 'blob' /tmp/all-objects > /tmp/blob-objects
> $ cat /tmp/blob-objects | awk '{print $1;}' | git-pack-objects
> --delta-base-offset --all-progress --stdout > blob.pack
> $ git-index-pack -v blob.pack

Instead of using --stdout with git-pack-objects, you should provide it
with a suitable base name for the resulting pack, and the index will be
created automatically alongside the pack for you. No need to use
index-pack for that.

> Now, when I run 'git-index-pack' on blob.pack in the current
> directory, memory usage is pretty horrific (even with the applied
> patch to not leak everything). Shawn tells me that index-pack
> should only be decompressing the object twice - once from the repo and
> once from blob.pack - iff I call git-index-pack with --stdin, which I
> am not.
>
> If I move the blob.pack into /tmp, and run git-index-pack on it there,
> it completes much faster and the memory usage never exceeds 200MB.
> (Inside the repo, it takes up over 3G of RES according to top.)

The 3G should definitely be fixed with the added free(). The CPU usage
is explained by the fact that you're running index-pack on objects that
are all already found in your repo, so the collision check is triggered.
This is more or less the same issue as if you tried to run unpack-objects
on the same pack, where none of your objects will actually be unpacked.

Nicolas
* Re: git-index-pack really does suck..
From: Chris Lee @ 2007-04-03 19:54 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Linus Torvalds, git

On 4/3/07, Nicolas Pitre <nico@cam.org> wrote:
> On Tue, 3 Apr 2007, Chris Lee wrote:
>
> > There's another issue here.
> >
> > I'm running git-index-pack as part of a workflow like so:
> >
> > $ git-verify-pack -v .git/objects/pack/*.idx > /tmp/all-objects
> > $ grep 'blob' /tmp/all-objects > /tmp/blob-objects
> > $ cat /tmp/blob-objects | awk '{print $1;}' | git-pack-objects
> > --delta-base-offset --all-progress --stdout > blob.pack
> > $ git-index-pack -v blob.pack
>
> Instead of using --stdout with git-pack-objects, you should provide it
> with a suitable base name for the resulting pack, and the index will be
> created automatically alongside the pack for you. No need to use
> index-pack for that.

Right. But then I wouldn't have discovered how much git-index-pack
sucks. :)

> > Now, when I run 'git-index-pack' on blob.pack in the current
> > directory, memory usage is pretty horrific (even with the applied
> > patch to not leak everything). Shawn tells me that index-pack
> > should only be decompressing the object twice - once from the repo and
> > once from blob.pack - iff I call git-index-pack with --stdin, which I
> > am not.
> >
> > If I move the blob.pack into /tmp, and run git-index-pack on it there,
> > it completes much faster and the memory usage never exceeds 200MB.
> > (Inside the repo, it takes up over 3G of RES according to top.)
>
> The 3G should definitely be fixed with the added free().

Not really. This packfile is 2.6GB in size, and apparently it gets
mmap'd. (Yesterday, my machine ran out of memory trying to do
index-pack when the memleak still existed; I have 4G of RAM and,
normally, 4G of swap, but I upped it to 32G of swap and it still ran
out of memory.)

> The CPU usage is explained by the fact that you're running index-pack on
> objects that are all already found in your repo, so the collision check
> is triggered. This is more or less the same issue as if you tried to
> run unpack-objects on the same pack, where none of your objects will
> actually be unpacked.

Right, and if I was using --stdin, I would expect that. But I'm not.
And, according to Shawn anyway, the current behaviour is not what was
intended.

-clee
* Re: git-index-pack really does suck..
From: Linus Torvalds @ 2007-04-03 20:18 UTC (permalink / raw)
To: Chris Lee, Junio C Hamano; +Cc: Git Mailing List

On Tue, 3 Apr 2007, Chris Lee wrote:
>
> There's another issue here.
>
> I'm running git-index-pack as part of a workflow like so:
>
> $ git-verify-pack -v .git/objects/pack/*.idx > /tmp/all-objects
> $ grep 'blob' /tmp/all-objects > /tmp/blob-objects
> $ cat /tmp/blob-objects | awk '{print $1;}' | git-pack-objects
> --delta-base-offset --all-progress --stdout > blob.pack
> $ git-index-pack -v blob.pack
>
> Now, when I run 'git-index-pack' on blob.pack in the current
> directory, memory usage is pretty horrific (even with the applied
> patch to not leak everything). Shawn tells me that index-pack
> should only be decompressing the object twice - once from the repo and
> once from blob.pack - iff I call git-index-pack with --stdin, which I
> am not.
>
> If I move the blob.pack into /tmp, and run git-index-pack on it there,
> it completes much faster and the memory usage never exceeds 200MB.
> (Inside the repo, it takes up over 3G of RES according to top.)

Yeah. What happens is that inside the repo, because we do all the
duplicate object checks (verifying that there are no evil hash collisions)
even after fixing the memory leak, we end up keeping *track* of all those
objects. And with a large repository, it's quite the expensive operation.

That whole "verify no SHA1 hash collision" code is really pretty damn
paranoid. Maybe we shouldn't have it enabled by default.

So how about this updated patch? We could certainly make "git pull" imply
"--paranoid" if we want to, but even that is likely pretty unnecessary.
It's not like anybody has ever shown a SHA1 collision, and if the *local*
repository is corrupt (and has an object with the wrong SHA1 - that's what
the testsuite checks for), then it's probably good to get the valid object
from the remote..

This includes the previous one-liner, but also adds the "--paranoid" flag
and fixes up the Documentation and tests to match.

Junio, your choice, but regardless which one you choose:

	Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Thanks,

		Linus

---
 Documentation/git-index-pack.txt |    3 +--
 index-pack.c                     |    6 +++++-
 t/t5300-pack-object.sh           |    4 ++--
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-index-pack.txt b/Documentation/git-index-pack.txt
index 2229ee8..7d8d33b 100644
--- a/Documentation/git-index-pack.txt
+++ b/Documentation/git-index-pack.txt
@@ -8,8 +8,7 @@ git-index-pack - Build pack index file for an existing packed archive

 SYNOPSIS
 --------
-'git-index-pack' [-v] [-o <index-file>] <pack-file>
-'git-index-pack' --stdin [--fix-thin] [--keep] [-v] [-o <index-file>] [<pack-file>]
+'git-index-pack' [--stdin [--fix-thin] [--keep]] [-v] [--paranoid] [-o <index-file>] <pack-file>

 DESCRIPTION
diff --git a/index-pack.c b/index-pack.c
index 6284fe3..8a4c27a 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -45,6 +45,7 @@ static int nr_resolved_deltas;

 static int from_stdin;
 static int verbose;
+static int paranoid;

 static volatile sig_atomic_t progress_update;

@@ -348,7 +349,7 @@ static void sha1_object(const void *data, unsigned long size,
 			enum object_type type, unsigned char *sha1)
 {
 	hash_sha1_file(data, size, typename(type), sha1);
-	if (has_sha1_file(sha1)) {
+	if (paranoid && has_sha1_file(sha1)) {
 		void *has_data;
 		enum object_type has_type;
 		unsigned long has_size;
@@ -358,6 +359,7 @@ static void sha1_object(const void *data, unsigned long size,
 		if (size != has_size || type != has_type ||
 		    memcmp(data, has_data, size) != 0)
 			die("SHA1 COLLISION FOUND WITH %s !", sha1_to_hex(sha1));
+		free(has_data);
 	}
 }

@@ -839,6 +841,8 @@ int main(int argc, char **argv)
 		if (*arg == '-') {
 			if (!strcmp(arg, "--stdin")) {
 				from_stdin = 1;
+			} else if (!strcmp(arg, "--paranoid")) {
+				paranoid = 1;
 			} else if (!strcmp(arg, "--fix-thin")) {
 				fix_thin_pack = 1;
 			} else if (!strcmp(arg, "--keep")) {
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 35e036a..407c71e 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -262,7 +262,7 @@ test_expect_success \
     .git/objects/c8/2de19312b6c3695c0c18f70709a6c535682a67'

 test_expect_failure \
-    'make sure index-pack detects the SHA1 collision' \
-    'git-index-pack -o bad.idx test-3.pack'
+    'make sure index-pack detects the SHA1 collision when paranoid' \
+    'git-index-pack --paranoid -o bad.idx test-3.pack'

 test_done
* Re: git-index-pack really does suck..
From: Nicolas Pitre @ 2007-04-03 20:32 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Chris Lee, Junio C Hamano, Git Mailing List

On Tue, 3 Apr 2007, Linus Torvalds wrote:
>
> On Tue, 3 Apr 2007, Chris Lee wrote:
> >
> > There's another issue here.
> >
> > I'm running git-index-pack as part of a workflow like so:
> >
> > $ git-verify-pack -v .git/objects/pack/*.idx > /tmp/all-objects
> > $ grep 'blob' /tmp/all-objects > /tmp/blob-objects
> > $ cat /tmp/blob-objects | awk '{print $1;}' | git-pack-objects
> > --delta-base-offset --all-progress --stdout > blob.pack
> > $ git-index-pack -v blob.pack
> >
> > Now, when I run 'git-index-pack' on blob.pack in the current
> > directory, memory usage is pretty horrific (even with the applied
> > patch to not leak everything). Shawn tells me that index-pack
> > should only be decompressing the object twice - once from the repo and
> > once from blob.pack - iff I call git-index-pack with --stdin, which I
> > am not.
> >
> > If I move the blob.pack into /tmp, and run git-index-pack on it there,
> > it completes much faster and the memory usage never exceeds 200MB.
> > (Inside the repo, it takes up over 3G of RES according to top.)
>
> Yeah. What happens is that inside the repo, because we do all the
> duplicate object checks (verifying that there are no evil hash collisions)
> even after fixing the memory leak, we end up keeping *track* of all those
> objects.

What do you mean?

> And with a large repository, it's quite the expensive operation.
>
> That whole "verify no SHA1 hash collision" code is really pretty damn
> paranoid. Maybe we shouldn't have it enabled by default.

Maybe we shouldn't run index-pack on packs for which we _already_ have
an index, which is the most likely reason for the collision check to
trigger in the first place.

This is in the same category as trying to run unpack-objects on a pack
within a repository and wondering why it doesn't work.

> So how about this updated patch? We could certainly make "git pull" imply
> "--paranoid" if we want to, but even that is likely pretty unnecessary.

I'm of the opinion that this patch is unnecessary. It only helps in
bogus workflows to start with, and it makes the default behavior unsafe
(unsafe from a paranoid pov, but still). And in the _normal_ workflow
it should never trigger.

So I wouldn't merge it.

Nicolas
* Re: git-index-pack really does suck..
From: Junio C Hamano @ 2007-04-03 20:40 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Linus Torvalds, Chris Lee, Git Mailing List

[sorry, sent a message without finishing]

Nicolas Pitre <nico@cam.org> writes:

> Maybe we shouldn't run index-pack on packs for which we _already_ have
> an index, which is the most likely reason for the collision check to
> trigger in the first place.
>
> This is in the same category as trying to run unpack-objects on a pack
> within a repository and wondering why it doesn't work.
> ...
> I'm of the opinion that this patch is unnecessary. It only helps in
> bogus workflows to start with, and it makes the default behavior unsafe
> (unsafe from a paranoid pov, but still). And in the _normal_ workflow
> it should never trigger.

Hmmmm. You may have a point.

So maybe we should retitle this thread from "git-index-pack
really does suck.." to "I used git-index-pack in a stupid way"?
* Re: git-index-pack really does suck..
From: Linus Torvalds @ 2007-04-03 21:00 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Nicolas Pitre, Chris Lee, Git Mailing List

On Tue, 3 Apr 2007, Junio C Hamano wrote:
>
> So maybe we should retitle this thread from "git-index-pack
> really does suck.." to "I used git-index-pack in a stupid way"?

See my separate timing numbers, although I bet that Chris can give even
better ones..

Chris, try applying my patch, and then inside the KDE repo you have, do

	git index-pack --paranoid --stdin --fix-thin new.pack < ~/git/.git/objects/pack/pack-*.pack

(ie index the objects of the *git* repository, not the KDE one). That
should approximate doing a fair-sized "git pull" - getting new objects. Do
it with and without --paranoid, and time it.

I bet that what I see as a 7% slowdown will be much bigger for you, just
because the negative lookups will be all that much more expensive when you
have tons of objects.

		Linus
* Re: git-index-pack really does suck..
From: Nicolas Pitre @ 2007-04-03 21:28 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, Chris Lee, Git Mailing List

On Tue, 3 Apr 2007, Linus Torvalds wrote:

> See my separate timing numbers, although I bet that Chris can give even
> better ones..
>
> Chris, try applying my patch, and then inside the KDE repo you have, do
>
> 	git index-pack --paranoid --stdin --fix-thin new.pack < ~/git/.git/objects/pack/pack-*.pack
>
> (ie index the objects of the *git* repository, not the KDE one). That
> should approximate doing a fair-sized "git pull" - getting new objects. Do
> it with and without --paranoid, and time it.

Like I said, this is bogus, since index-pack is throttled by the network,
making this overhead a non-issue in real life. And like I said, there
should _not_ be such a memory usage difference, which is probably showing
potential problems *elsewhere*.

> I bet that what I see as a 7% slowdown will be much bigger for you, just
> because the negative lookups will be all that much more expensive when you
> have tons of objects.

And I bet your newton-raphson lookup idea would shine and bring that
overhead down considerably in that case.

Nicolas
* Re: git-index-pack really does suck..
From: Chris Lee @ 2007-04-03 22:49 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, Nicolas Pitre, Git Mailing List

On 4/3/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, 3 Apr 2007, Junio C Hamano wrote:
> > So maybe we should retitle this thread from "git-index-pack
> > really does suck.." to "I used git-index-pack in a stupid way"?

Well, I never claimed to be a genius. :)

> See my separate timing numbers, although I bet that Chris can give even
> better ones..
>
> Chris, try applying my patch, and then inside the KDE repo you have, do
>
> 	git index-pack --paranoid --stdin --fix-thin new.pack < ~/git/.git/objects/pack/pack-*.pack
>
> (ie index the objects of the *git* repository, not the KDE one). That
> should approximate doing a fair-sized "git pull" - getting new objects. Do
> it with and without --paranoid, and time it.

% time git-index-pack --paranoid --stdin --fix-thin paranoid.pack < /usr/local/src/git/.git/objects/pack/*pack
pack	bf8ba7897da9c84d1981ecdc92c0b1979506a4b9
git-index-pack --paranoid --stdin --fix-thin paranoid.pack <  5.28s user 0.24s system 98% cpu 5.592 total

% time git-index-pack --stdin --fix-thin trusting.pack < /usr/local/src/git/.git/objects/pack/*pack
pack	bf8ba7897da9c84d1981ecdc92c0b1979506a4b9
git-index-pack --stdin --fix-thin trusting.pack <  5.07s user 0.12s system 99% cpu 5.202 total

So, in my case, at least... not really much of a difference, which is
puzzling.

> I bet that what I see as a 7% slowdown will be much bigger for you, just
> because the negative lookups will be all that much more expensive when you
> have tons of objects.

I applied exactly the patch you sent, and it applied perfectly cleanly -
no failures.

I also mailed out the DVD with the repo on it to hpa today, so
hopefully by tomorrow he'll get it. (He's not even two cities over,
and I suspect I could have just driven it to his place, but that might
have been a little awkward since I've never met him.)

Anyway, so, hopefully once he gets it he can put it up somewhere that
you guys can grab it. For reference, the KDE repo is pretty big, but a
"real" conversion of the repo would be bigger; the one that I've been
playing with only has the KDE svn trunk, and only the first 409k
revisions - there are, as of right now, over 650k revisions in KDE's
svn repo. So, realistically speaking, a fully-converted KDE git repo
would probably take up at least 6GB, packed, if not more. Subproject
support would probably be *really* helpful to mitigate that.

-clee
* Re: git-index-pack really does suck..
From: Linus Torvalds @ 2007-04-03 23:12 UTC (permalink / raw)
To: Chris Lee; +Cc: Junio C Hamano, Nicolas Pitre, Git Mailing List

On Tue, 3 Apr 2007, Chris Lee wrote:
>
> git-index-pack --paranoid --stdin --fix-thin paranoid.pack <  5.28s user 0.24s system 98% cpu 5.592 total
>
> git-index-pack --stdin --fix-thin trusting.pack <  5.07s user 0.12s system 99% cpu 5.202 total

Ok, that's not a big enough of a difference to care.

> So, in my case, at least... not really much of a difference, which is
> puzzling.

It's entirely possible that the object lookup is good enough to not be a
problem even for huge packs, and it really only gets to be a problem when
you actually unpack all the objects.

In that case, the only real case to worry about is indeed the "alternates"
case (or if people actually use a shared git object directory, but I don't
think anybody really does - alternates just work well enough, and shared
object directories are painful enough that I doubt anybody *really* uses
it).

> I also mailed out the DVD with the repo on it to hpa today, so
> hopefully by tomorrow he'll get it. (He's not even two cities over,
> and I suspect I could have just driven it to his place, but that might
> have been a little awkward since I've never met him.)

Heh. Ok, good. I'll torrent it or something when it's up.

> Anyway, so, hopefully once he gets it he can put it up somewhere that
> you guys can grab it. For reference, the KDE repo is pretty big, but a
> "real" conversion of the repo would be bigger; the one that I've been
> playing with only has the KDE svn trunk, and only the first 409k
> revisions - there are, as of right now, over 650k revisions in KDE's
> svn repo. So, realistically speaking, a fully-converted KDE git repo
> would probably take up at least 6GB, packed, if not more. Subproject
> support would probably be *really* helpful to mitigate that.

Sure. I think subproject support is likely the big "missing feature" of
git right now. The rest is "details", even if they can be big and
involved details.

But even at only 409k revisions, it's still going to be an order of
magnitude bigger than what the kernel is, exactly *because* it's such a
disaster from a maintenance setup standpoint, and it's going to be a
useful real-world test-case. So whether that is a "good" git archive or
not, it's going to be useful.

Long ago we used to be able to look at the historic Linux archive as an
example of a "big" archive, but it's not actually all that much bigger
than the normal Linux archive any more, and we've pretty much fixed the
problems we used to have.

[ The historical pack-file is actually smaller, but that's because it was
  done with a much deeper delta-chain to make it small: the historical
  archive still has more objects in it than the current active git kernel
  tree - but it's only in the 20% range, not "20 *times* bigger" ]

The Eclipse tree was useful (and I think we already improved performance
for you thanks to working with it - I don't know how much faster the
delta-base cache made things for you, but I'd assume it was at *least* by
the factor-of-2.5 that we saw on Eclipse), but the KDE is bigger *and*
deeper (the eclipse tree is 1.7GB, and 136k revisions in the main branch,
so the KDE tree is more than twice the revisions).

		Linus
* Re: git-index-pack really does suck..
  2007-04-03 20:32 ` Nicolas Pitre
  2007-04-03 20:40   ` Junio C Hamano
@ 2007-04-03 20:56   ` Linus Torvalds
  2007-04-03 21:03     ` Shawn O. Pearce
  2007-04-03 21:21     ` Nicolas Pitre
  1 sibling, 2 replies; 58+ messages in thread
From: Linus Torvalds @ 2007-04-03 20:56 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Chris Lee, Junio C Hamano, Git Mailing List

On Tue, 3 Apr 2007, Nicolas Pitre wrote:
> > 
> > Yeah. What happens is that inside the repo, because we do all the
> > duplicate object checks (verifying that there are no evil hash
> > collisions) even after fixing the memory leak, we end up keeping
> > *track* of all those objects.
> 
> What do you mean?

Look at what we have to do to look up a SHA1 object.. We create all the
lookup infrastructure, we don't *just* read the object. The delta base
cache is the most obvious one.

> I'm of the opinion that this patch is unnecessary. It only helps in
> bogus workflows to start with, and it makes the default behavior unsafe
> (unsafe from a paranoid pov, but still). And in the _normal_ workflow
> it should never trigger.

Actually, even in the normal workflow it will do all the extra
unnecessary work, if only because of the lookup costs of *not* finding
the entry.

Lookie here:

 - git index-pack of the *git* pack-file in the v2.6/linux directory
   (zero overlap of objects)

   With --paranoid:

	2.75user 0.37system 0:03.13elapsed 99%CPU
	0major+5583minor pagefaults

   Without --paranoid:

	2.55user 0.12system 0:02.68elapsed 99%CPU
	0major+2957minor pagefaults

See? That's the *normal* workflow. Zero objects found. 7% CPU overhead
from just the unnecessary work, and almost twice as much memory used.
Just from the index file lookup etc for a decent-sized project.

Now, in the KDE situation, the *unnecessary* lookups will be about ten
times more expensive, both on memory and CPU, just because the
repository is about 20x the size. Even with no actual hits.

		Linus
* Re: git-index-pack really does suck..
  2007-04-03 20:56 ` Linus Torvalds
@ 2007-04-03 21:03   ` Shawn O. Pearce
  2007-04-03 21:13     ` Linus Torvalds
  2007-04-03 21:21   ` Nicolas Pitre
  1 sibling, 1 reply; 58+ messages in thread
From: Shawn O. Pearce @ 2007-04-03 21:03 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nicolas Pitre, Chris Lee, Junio C Hamano, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Actually, even in the normal workflow it will do all the extra
> unnecessary work, if only because of the lookup costs of *not* finding
> the entry.
> 
> Lookie here:
> 
>  - git index-pack of the *git* pack-file in the v2.6/linux directory
>    (zero overlap of objects)
> 
>    With --paranoid:
> 
> 	2.75user 0.37system 0:03.13elapsed 99%CPU
> 	0major+5583minor pagefaults
> 
>    Without --paranoid:
> 
> 	2.55user 0.12system 0:02.68elapsed 99%CPU
> 	0major+2957minor pagefaults
> 
> See? That's the *normal* workflow. Zero objects found. 7% CPU overhead
> from just the unnecessary work, and almost twice as much memory used.
> Just from the index file lookup etc for a decent-sized project.

OK, but what about that case with unpack-objects? Didn't we there do
all this work to also check for the object already existing? During
update-index, write-tree and commit-tree don't we also do a lot of work
(per object anyway) to check for a non-existing object?

So even with --paranoid (aka what we have now) index-pack still should
be faster than unpack-objects for any sizeable transfer, and is just as
"safe".

If it's the missing-object lookup that is expensive, maybe we should
try to optimize that. We do it enough already in other parts of the
code...

-- 
Shawn.
* Re: git-index-pack really does suck..
  2007-04-03 21:03 ` Shawn O. Pearce
@ 2007-04-03 21:13   ` Linus Torvalds
  2007-04-03 21:17     ` Shawn O. Pearce
  0 siblings, 1 reply; 58+ messages in thread
From: Linus Torvalds @ 2007-04-03 21:13 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Nicolas Pitre, Chris Lee, Junio C Hamano, Git Mailing List

On Tue, 3 Apr 2007, Shawn O. Pearce wrote:
> 
> So even with --paranoid (aka what we have now) index-pack still
> should be faster than unpack-objects for any sizeable transfer,
> and is just as "safe".

I agree 100% that index-pack is much much MUCH better than
unpack-objects ever was, so ..

> If it's the missing-object lookup that is expensive, maybe we should
> try to optimize that. We do it enough already in other parts of
> the code...

Well, for all other cases it's really the "object found" case that is
worth optimizing for, so I think optimizing for "no object" is actually
wrong, unless it also speeds up (or at least doesn't make it worse) the
"real" normal case.

		Linus
* Re: git-index-pack really does suck..
  2007-04-03 21:13 ` Linus Torvalds
@ 2007-04-03 21:17   ` Shawn O. Pearce
  2007-04-03 21:26     ` Linus Torvalds
  2007-04-03 21:34     ` git-index-pack really does suck.. Nicolas Pitre
  0 siblings, 2 replies; 58+ messages in thread
From: Shawn O. Pearce @ 2007-04-03 21:17 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nicolas Pitre, Chris Lee, Junio C Hamano, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, 3 Apr 2007, Shawn O. Pearce wrote:
> > If it's the missing-object lookup that is expensive, maybe we should
> > try to optimize that. We do it enough already in other parts of
> > the code...
> 
> Well, for all other cases it's really the "object found" case that is
> worth optimizing for, so I think optimizing for "no object" is actually
> wrong, unless it also speeds up (or at least doesn't make it worse) the
> "real" normal case.

Right. But maybe we shouldn't be scanning for packfiles every time we
don't find a loose object. Especially if the caller is in a context
where we actually *expect* to not find said object like half of the
time... say in git-add/update-index. ;-)

-- 
Shawn.
* Re: git-index-pack really does suck..
  2007-04-03 21:17 ` Shawn O. Pearce
@ 2007-04-03 21:26   ` Linus Torvalds
  2007-04-03 21:28     ` Linus Torvalds
  2007-04-03 21:34   ` git-index-pack really does suck.. Nicolas Pitre
  1 sibling, 1 reply; 58+ messages in thread
From: Linus Torvalds @ 2007-04-03 21:26 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Nicolas Pitre, Chris Lee, Junio C Hamano, Git Mailing List

On Tue, 3 Apr 2007, Shawn O. Pearce wrote:
> 
> Right. But maybe we shouldn't be scanning for packfiles every
> time we don't find a loose object. Especially if the caller is in
> a context where we actually *expect* to not find said object like
> half of the time... say in git-add/update-index. ;-)

Yes, we could definitely skip the re-lookup if we had a "don't really
care, I can recreate the object myself" flag (ie anybody who is going
to write that object).

		Linus
* Re: git-index-pack really does suck..
  2007-04-03 21:26 ` Linus Torvalds
@ 2007-04-03 21:28   ` Linus Torvalds
  2007-04-03 22:31     ` Junio C Hamano
  ` (2 more replies)
  0 siblings, 3 replies; 58+ messages in thread
From: Linus Torvalds @ 2007-04-03 21:28 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Nicolas Pitre, Chris Lee, Junio C Hamano, Git Mailing List

On Tue, 3 Apr 2007, Linus Torvalds wrote:
> 
> Yes, we could definitely skip the re-lookup if we had a "don't really
> care, I can recreate the object myself" flag (ie anybody who is going to
> write that object)

Side note: with "alternates" files, you might well *always* have the
objects. If you do

	git clone -l -s ...

to create various branches, and then pull between them, you'll actually
end up in the situation that you'll always find the objects and get
back to the really expensive case..

		Linus
* Re: git-index-pack really does suck..
  2007-04-03 21:28 ` Linus Torvalds
@ 2007-04-03 22:31   ` Junio C Hamano
  2007-04-03 22:38     ` Shawn O. Pearce
  2007-04-05 10:22   ` [PATCH 1/2] git-fetch--tool pick-rref Junio C Hamano
  2007-04-05 10:22   ` [PATCH 2/2] git-fetch: use fetch--tool pick-rref to avoid local fetch from alternate Junio C Hamano
  2 siblings, 1 reply; 58+ messages in thread
From: Junio C Hamano @ 2007-04-03 22:31 UTC (permalink / raw)
To: Linus Torvalds
Cc: Shawn O. Pearce, Nicolas Pitre, Chris Lee, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:
> Side note: with "alternates" files, you might well *always* have the
> objects. If you do
> 
> 	git clone -l -s ...
> 
> to create various branches, and then pull between them, you'll actually
> end up in the situation that you'll always find the objects and get back
> to the really expensive case..

Ah, that's true. If you "git clone -l -s A B", create new objects in A
and pull from B, the transfer would not exclude new objects as they are
not visible from B's refs.

In that scenario, the keep-pack behaviour is already worse than the
unpack-objects behaviour. The former creates a packfile that duplicates
objects that are in A while the latter, although expensive, ends up
doing nothing.

I wonder if we can have a backdoor to avoid any object transfer in such
a case to begin with...
* Re: git-index-pack really does suck..
  2007-04-03 22:31 ` Junio C Hamano
@ 2007-04-03 22:38   ` Shawn O. Pearce
  2007-04-03 22:41     ` Junio C Hamano
  0 siblings, 1 reply; 58+ messages in thread
From: Shawn O. Pearce @ 2007-04-03 22:38 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Linus Torvalds, Nicolas Pitre, Chris Lee, Git Mailing List

Junio C Hamano <junkio@cox.net> wrote:
> Ah, that's true. If you "git clone -l -s A B", create new
> objects in A and pull from B, the transfer would not exclude
> new objects as they are not visible from B's refs.
> 
> In that scenario, the keep-pack behaviour is already worse than
> the unpack-objects behaviour. The former creates a packfile
> that duplicates objects that are in A while the latter, although
> expensive, ends up doing nothing.
> 
> I wonder if we can have a backdoor to avoid any object transfer
> in such a case to begin with...

Yea, symlink to the corresponding refs directory of the alternate ODB.
Then the refs will be visible. ;-)

-- 
Shawn.
* Re: git-index-pack really does suck..
  2007-04-03 22:38 ` Shawn O. Pearce
@ 2007-04-03 22:41   ` Junio C Hamano
  0 siblings, 0 replies; 58+ messages in thread
From: Junio C Hamano @ 2007-04-03 22:41 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Linus Torvalds, Nicolas Pitre, Chris Lee, Git Mailing List

"Shawn O. Pearce" <spearce@spearce.org> writes:
>> I wonder if we can have a backdoor to avoid any object transfer
>> in such a case to begin with...
> 
> Yea, symlink to the corresponding refs directory of the alternate
> ODB. Then the refs will be visible. ;-)

I was thinking about going the other way. When git-fetch is asked to
fetch over a local transport, it could check if the source is one of
its alternate object stores.
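[ Editor's note: the check Junio describes above - "is the fetch source
one of our alternate object stores?" - can be sketched in a few lines.
This is a minimal illustration, not git's actual C implementation: it
assumes only the on-disk layout of `objects/info/alternates`, where each
non-comment line names an object directory, either absolute or relative
to our own `objects/` directory. ]

```python
import os

def is_alternate_of(repo_git_dir, source_git_dir):
    """Return True if source_git_dir's object store is listed in
    repo_git_dir's objects/info/alternates file, meaning a local
    fetch from it could skip the object transfer entirely."""
    source_objects = os.path.realpath(os.path.join(source_git_dir, "objects"))
    alternates = os.path.join(repo_git_dir, "objects", "info", "alternates")
    try:
        with open(alternates) as f:
            entries = [ln.strip() for ln in f
                       if ln.strip() and not ln.startswith("#")]
    except OSError:
        return False          # no alternates file: nothing is borrowed
    base = os.path.join(repo_git_dir, "objects")
    for entry in entries:
        # Relative entries are resolved against our objects/ directory.
        path = entry if os.path.isabs(entry) else os.path.join(base, entry)
        if os.path.realpath(path) == source_objects:
            return True
    return False
```

A repository created with "git clone -l -s A B" would have A's object
directory as B's single alternates entry, so `is_alternate_of(B, A)`
would hold while `is_alternate_of(A, B)` would not.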
* [PATCH 1/2] git-fetch--tool pick-rref
  2007-04-03 21:28 ` Linus Torvalds
  2007-04-03 22:31   ` Junio C Hamano
@ 2007-04-05 10:22   ` Junio C Hamano
  2007-04-05 10:22   ` [PATCH 2/2] git-fetch: use fetch--tool pick-rref to avoid local fetch from alternate Junio C Hamano
  2 siblings, 0 replies; 58+ messages in thread
From: Junio C Hamano @ 2007-04-05 10:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Shawn O. Pearce, Nicolas Pitre, Chris Lee, Git Mailing List

This script helper takes a list of fully qualified refnames and the
results from ls-remote, and grabs only the lines for the named refs
from the latter.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, 3 Apr 2007, Linus Torvalds wrote:
>> 
>> Yes, we could definitely skip the re-lookup if we had a "don't really
>> care, I can recreate the object myself" flag (ie anybody who is going to
>> write that object)
> 
> Side note: with "alternates" files, you might well *always* have the
> objects. If you do
> 
> 	git clone -l -s ...
> 
> to create various branches, and then pull between them, you'll actually
> end up in the situation that you'll always find the objects and get back
> to the really expensive case..

Ah, that's true. If you "git clone -l -s A B", create new objects in A
and pull from B, the transfer would not exclude new objects as they are
not visible from B's refs.

In that scenario, the keep-pack behaviour is already worse than the
unpack-objects behaviour. The former creates a packfile that duplicates
objects that are in A while the latter, although expensive, ends up
doing nothing.

I wonder if we can have a backdoor to avoid any object transfer in such
a case to begin with...

... and this two-patch series does exactly that.
 builtin-fetch--tool.c |   84 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 84 insertions(+), 0 deletions(-)

diff --git a/builtin-fetch--tool.c b/builtin-fetch--tool.c
index e9d6764..be341c1 100644
--- a/builtin-fetch--tool.c
+++ b/builtin-fetch--tool.c
@@ -436,10 +436,87 @@ static int expand_refs_wildcard(const char *ls_remote_result, int numrefs,
 	return 0;
 }
 
+static int pick_rref(int sha1_only, const char *rref, const char *ls_remote_result)
+{
+	int err = 0;
+	int lrr_count = lrr_count, i, pass;
+	const char *cp;
+	struct lrr {
+		const char *line;
+		const char *name;
+		int namelen;
+		int shown;
+	} *lrr_list = lrr_list;
+
+	for (pass = 0; pass < 2; pass++) {
+		/* pass 0 counts and allocates, pass 1 fills... */
+		cp = ls_remote_result;
+		i = 0;
+		while (1) {
+			const char *np;
+			while (*cp && isspace(*cp))
+				cp++;
+			if (!*cp)
+				break;
+			np = strchr(cp, '\n');
+			if (!np)
+				np = cp + strlen(cp);
+			if (pass) {
+				lrr_list[i].line = cp;
+				lrr_list[i].name = cp + 41;
+				lrr_list[i].namelen = np - (cp + 41);
+			}
+			i++;
+			cp = np;
+		}
+		if (!pass) {
+			lrr_count = i;
+			lrr_list = xcalloc(lrr_count, sizeof(*lrr_list));
+		}
+	}
+
+	while (1) {
+		const char *next;
+		int rreflen;
+		int i;
+
+		while (*rref && isspace(*rref))
+			rref++;
+		if (!*rref)
+			break;
+		next = strchr(rref, '\n');
+		if (!next)
+			next = rref + strlen(rref);
+		rreflen = next - rref;
+
+		for (i = 0; i < lrr_count; i++) {
+			struct lrr *lrr = &(lrr_list[i]);
+
+			if (rreflen == lrr->namelen &&
+			    !memcmp(lrr->name, rref, rreflen)) {
+				if (!lrr->shown)
+					printf("%.*s\n",
+					       sha1_only ?
+					       40 : lrr->namelen + 41,
+					       lrr->line);
+				lrr->shown = 1;
+				break;
+			}
+		}
+		if (lrr_count <= i) {
+			error("pick-rref: %.*s not found", rreflen, rref);
+			err = 1;
+		}
+		rref = next;
+	}
+	free(lrr_list);
+	return err;
+}
+
 int cmd_fetch__tool(int argc, const char **argv, const char *prefix)
 {
 	int verbose = 0;
 	int force = 0;
+	int sopt = 0;
 
 	while (1 < argc) {
 		const char *arg = argv[1];
@@ -447,6 +524,8 @@ int cmd_fetch__tool(int argc, const char **argv, const char *prefix)
 			verbose = 1;
 		else if (!strcmp("-f", arg))
 			force = 1;
+		else if (!strcmp("-s", arg))
+			sopt = 1;
 		else
 			break;
 		argc--;
@@ -491,6 +570,11 @@ int cmd_fetch__tool(int argc, const char **argv, const char *prefix)
 		reflist = get_stdin();
 		return parse_reflist(reflist);
 	}
+	if (!strcmp("pick-rref", argv[1])) {
+		if (argc != 4)
+			return error("pick-rref takes 2 args");
+		return pick_rref(sopt, argv[2], argv[3]);
+	}
 	if (!strcmp("expand-refs-wildcard", argv[1])) {
 		const char *reflist;
 		if (argc < 4)
-- 
1.5.1.45.g1ddb
* [PATCH 2/2] git-fetch: use fetch--tool pick-rref to avoid local fetch from alternate
  2007-04-03 21:28 ` Linus Torvalds
  2007-04-03 22:31   ` Junio C Hamano
  2007-04-05 10:22   ` [PATCH 1/2] git-fetch--tool pick-rref Junio C Hamano
@ 2007-04-05 10:22   ` Junio C Hamano
  2007-04-05 16:15     ` Shawn O. Pearce
  2 siblings, 1 reply; 58+ messages in thread
From: Junio C Hamano @ 2007-04-05 10:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Shawn O. Pearce, Nicolas Pitre, Chris Lee, Git Mailing List

When we are fetching from a repository that is on a local filesystem,
first check if we have all the objects that we are going to fetch
available locally, by not just checking the tips of what we are
fetching, but with a full reachability analysis to our existing refs.
In such a case, we do not have to run git-fetch-pack which would send
many needless objects.

This is especially true when the other repository is an alternate of
the current repository (e.g. perhaps the repository was created by
running "git clone -l -s" from there). The useless objects transferred
used to be discarded when they were expanded by git-unpack-objects
called from git-fetch-pack, but recent git-fetch-pack prefers to keep
the data it receives from the other end without exploding them into
loose objects, resulting in a pack full of duplicated data when
fetching from your own alternate.

This also uses fetch--tool pick-rref on the dumb transport side to
remove a shell loop that did the same.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

* Strictly speaking, there is no need to even check if $remote is a
  local directory for this to operate properly, as rev-list would barf
  and die as soon as it finds something unavailable, while limiting the
  traversal to stop immediately after it hits what are known to be
  reachable locally.
  On the other hand, if we really want to limit this to the case of a
  repository with an alternate pointing at its "clone -l -s" origin, we
  could add 'test -f "$GIT_OBJECT_DIRECTORY/info/alternates"', but I
  chose not to.

 git-fetch.sh |   41 ++++++++++++++++++++++++++++-------------
 1 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/git-fetch.sh b/git-fetch.sh
index fd70696..5dc3063 100755
--- a/git-fetch.sh
+++ b/git-fetch.sh
@@ -173,9 +173,32 @@ fetch_all_at_once () {
 	  git-bundle unbundle "$remote" $rref ||
 	  echo failed "$remote"
 	else
-	  git-fetch-pack --thin $exec $keep $shallow_depth $no_progress \
-	    "$remote" $rref ||
-	  echo failed "$remote"
+	  if test -d "$remote" &&
+
+		# The remote might be our alternate.  With
+		# this optimization we will bypass fetch-pack
+		# altogether, which means we cannot be doing
+		# the shallow stuff at all.
+		test ! -f "$GIT_DIR/shallow" &&
+		test -z "$shallow_depth" &&
+
+		# See if all of what we are going to fetch are
+		# connected to our repository's tips, in which
+		# case we do not have to do any fetch.
+		theirs=$(git-fetch--tool -s pick-rref \
+			"$rref" "$ls_remote_result") &&
+
+		# This will barf when $theirs reach an object that
+		# we do not have in our repository.  Otherwise,
+		# we already have everything the fetch would bring in.
+		git-rev-list --objects $theirs --not --all 2>/dev/null
+	  then
+		git-fetch--tool pick-rref "$rref" "$ls_remote_result"
+	  else
+		git-fetch-pack --thin $exec $keep $shallow_depth \
+			$no_progress "$remote" $rref ||
+		echo failed "$remote"
+	  fi
 	fi
 	) |
 	(
@@ -235,16 +258,8 @@ fetch_per_ref () {
 	fi
 
 	# Find $remote_name from ls-remote output.
-	head=$(
-		IFS='	'
-		echo "$ls_remote_result" |
-		while read sha1 name
-		do
-			test "z$name" = "z$remote_name" || continue
-			echo "$sha1"
-			break
-		done
-	)
+	head=$(git-fetch--tool -s pick-rref \
+		"$remote_name" "$ls_remote_result")
 	expr "z$head" : "z$_x40\$" >/dev/null ||
 		die "No such ref $remote_name at $remote"
 	echo >&2 "Fetching $remote_name from $remote using $proto"
-- 
1.5.1.45.g1ddb
* Re: [PATCH 2/2] git-fetch: use fetch--tool pick-rref to avoid local fetch from alternate
  2007-04-05 10:22 ` [PATCH 2/2] git-fetch: use fetch--tool pick-rref to avoid local fetch from alternate Junio C Hamano
@ 2007-04-05 16:15   ` Shawn O. Pearce
  2007-04-05 21:37     ` Junio C Hamano
  0 siblings, 1 reply; 58+ messages in thread
From: Shawn O. Pearce @ 2007-04-05 16:15 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Linus Torvalds, Nicolas Pitre, Chris Lee, Git Mailing List

Junio C Hamano <junkio@cox.net> wrote:
> +		# This will barf when $theirs reach an object that
> +		# we do not have in our repository.  Otherwise,
> +		# we already have everything the fetch would bring in.
> +		git-rev-list --objects $theirs --not --all 2>/dev/null

OK, I must be missing something here.

That rev-list is going to print out the SHA-1s for the objects we would
have copied, but didn't, isn't it? So fetch--tool native-store is going
to get a whole lot of SHA-1s it doesn't want to see, right?

Otherwise this is a nice trick. It doesn't assure us that after the
fetch those objects are still in the alternate. Meaning someone could
run prune in the alternate between the rev-list and the native-store,
and whack these objects. Given how small of a window it is, and the
improvements this brings to alternates, I say it's worth that small
downside. Just don't prune while fetching. ;-)

-- 
Shawn.
* Re: [PATCH 2/2] git-fetch: use fetch--tool pick-rref to avoid local fetch from alternate
  2007-04-05 16:15 ` Shawn O. Pearce
@ 2007-04-05 21:37   ` Junio C Hamano
  0 siblings, 0 replies; 58+ messages in thread
From: Junio C Hamano @ 2007-04-05 21:37 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Linus Torvalds, Nicolas Pitre, Chris Lee, Git Mailing List

"Shawn O. Pearce" <spearce@spearce.org> writes:
> Junio C Hamano <junkio@cox.net> wrote:
>> +		# This will barf when $theirs reach an object that
>> +		# we do not have in our repository.  Otherwise,
>> +		# we already have everything the fetch would bring in.
>> +		git-rev-list --objects $theirs --not --all 2>/dev/null
> 
> OK, I must be missing something here.
> 
> That rev-list is going to print out the SHA-1s for the objects we
> would have copied, but didn't, isn't it? So fetch--tool native-store
> is going to get a whole lot of SHA-1s it doesn't want to see, right?

True. We should send the standard output also to /dev/null.

> Otherwise this is a nice trick. It doesn't assure us that after the
> fetch those objects are still in the alternate. Meaning someone
> could run prune in the alternate between the rev-list and the
> native-store, and whack these objects. Given how small of a window
> it is, and the improvements this brings to alternates, I say it's
> worth that small downside. Just don't prune while fetching. ;-)

That is "don't prune after making an alternate that depends on you" in
general. Without this patch, and without the keep (i.e. when the fetch
is very small and the transferred pack is given to unpack-objects), you
would have depended on the alternates for those objects anyway.
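[ Editor's note: the connectivity test at the heart of the patch above,
`git rev-list --objects $theirs --not --all`, can be modeled as a plain
graph walk: starting from the remote tips, stop at anything already
reachable from our refs, and declare the fetch necessary only if the
walk reaches an object we do not have. The sketch below is a toy model
on an in-memory object graph, with hypothetical names (`have`,
`parents`), not git's actual implementation. ]

```python
def fetch_needed(have, parents, their_tips, our_refs):
    """Model of 'git rev-list --objects $theirs --not --all':
    the fetch is needed iff walking from their tips (excluding
    everything reachable from our refs) hits a missing object.
    'have' is the set of locally present object ids; 'parents'
    maps an id to the ids it references."""
    stop, stack = set(), list(our_refs)
    while stack:                      # everything reachable from our refs
        o = stack.pop()
        if o in stop or o not in have:
            continue
        stop.add(o)
        stack.extend(parents.get(o, []))
    seen, stack = set(), list(their_tips)
    while stack:                      # walk their side, minus 'stop'
        o = stack.pop()
        if o in seen or o in stop:
            continue
        if o not in have:
            return True               # rev-list would barf here
        seen.add(o)
        stack.extend(parents.get(o, []))
    return False
```

When the "remote" is our own alternate and has created nothing new, the
walk terminates immediately at our own refs and the transfer can be
skipped, which is exactly the case the patch optimizes.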
* Re: git-index-pack really does suck..
  2007-04-03 21:17 ` Shawn O. Pearce
  2007-04-03 21:26   ` Linus Torvalds
@ 2007-04-03 21:34   ` Nicolas Pitre
  2007-04-03 21:37     ` Shawn O. Pearce
  2007-04-03 22:40     ` Dana How
  1 sibling, 2 replies; 58+ messages in thread
From: Nicolas Pitre @ 2007-04-03 21:34 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Linus Torvalds, Chris Lee, Junio C Hamano, Git Mailing List

On Tue, 3 Apr 2007, Shawn O. Pearce wrote:
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > On Tue, 3 Apr 2007, Shawn O. Pearce wrote:
> > > If it's the missing-object lookup that is expensive, maybe we should
> > > try to optimize that. We do it enough already in other parts of
> > > the code...
> > 
> > Well, for all other cases it's really the "object found" case that is
> > worth optimizing for, so I think optimizing for "no object" is actually
> > wrong, unless it also speeds up (or at least doesn't make it worse) the
> > "real" normal case.
> 
> Right. But maybe we shouldn't be scanning for packfiles every
> time we don't find a loose object. Especially if the caller is in
> a context where we actually *expect* to not find said object like
> half of the time... say in git-add/update-index. ;-)

First, I truly believe we should have a 64-bit pack index and fewer
larger packs rather than many small packs.

Which leaves us with the actual pack index lookup. At that point the
cost of finding an existing object and finding that a given object
doesn't exist is about the same thing, isn't it?

Optimizing that lookup is going to benefit both cases.

Nicolas
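[ Editor's note: Nicolas's point that a hit and a miss cost about the
same follows from how a pack index lookup works: a 256-entry fanout
table narrows the search to the ids sharing the first byte, then a
binary search runs within that bucket either way. A miniature sketch of
that structure, with object ids as 20-byte strings and helper names
(`build_index`, `lookup`) that are illustrative rather than git's own: ]

```python
import bisect

def build_index(object_ids):
    """Build a pack-index-style structure in miniature: the 20-byte
    ids sorted, plus a 256-entry fanout table where fanout[b] counts
    the ids whose first byte is <= b."""
    ids = sorted(object_ids)
    fanout = [0] * 256
    for oid in ids:
        fanout[oid[0]] += 1
    for b in range(1, 256):
        fanout[b] += fanout[b - 1]
    return ids, fanout

def lookup(ids, fanout, oid):
    """Binary-search only the bucket for oid's first byte. A hit and
    a miss do the same O(log bucket) work, which is why speeding up
    this lookup benefits both cases."""
    lo = fanout[oid[0] - 1] if oid[0] else 0
    hi = fanout[oid[0]]
    i = bisect.bisect_left(ids, oid, lo, hi)
    return i if i < hi and ids[i] == oid else None
```

Whether the object exists or not, the cost is one table read plus one
bounded binary search; only the loose-object fallback and pack rescan
(discussed below in the thread) make the miss path more expensive.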
* Re: git-index-pack really does suck..
  2007-04-03 21:34 ` Nicolas Pitre
@ 2007-04-03 21:37   ` Shawn O. Pearce
  2007-04-03 21:44     ` Junio C Hamano
  1 sibling, 1 reply; 58+ messages in thread
From: Shawn O. Pearce @ 2007-04-03 21:37 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Linus Torvalds, Chris Lee, Junio C Hamano, Git Mailing List

Nicolas Pitre <nico@cam.org> wrote:
> First, I truly believe we should have a 64-bit pack index and fewer
> larger packs than many small packs.

I'll buy that. ;-)

> Which leaves us with the actual pack index lookup. At that point the
> cost of finding an existing object and finding that a given object
> doesn't exist is about the same thing, isn't it?

Here's the rub: in the missing object case we didn't find it in the
pack index, but it could be loose. That's one failed syscall per object
if the object isn't loose. If the object isn't loose, it could be that
it was *just* removed by a running prune-packed, and the packfile that
it was moved to was created after we scanned for packfiles, so time to
rescan...

-- 
Shawn.
* Re: git-index-pack really does suck..
  2007-04-03 21:37 ` Shawn O. Pearce
@ 2007-04-03 21:44   ` Junio C Hamano
  2007-04-03 21:53     ` Shawn O. Pearce
  0 siblings, 1 reply; 58+ messages in thread
From: Junio C Hamano @ 2007-04-03 21:44 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Nicolas Pitre, Linus Torvalds, Chris Lee, Git Mailing List

"Shawn O. Pearce" <spearce@spearce.org> writes:
> Nicolas Pitre <nico@cam.org> wrote:
>> First, I truly believe we should have a 64-bit pack index and fewer
>> larger packs than many small packs.
> 
> I'll buy that. ;-)
> 
>> Which leaves us with the actual pack index lookup. At that point the
>> cost of finding an existing object and finding that a given object
>> doesn't exist is about the same thing, isn't it?
> 
> Here's the rub: in the missing object case we didn't find it
> in the pack index, but it could be loose. That's one failed
> syscall per object if the object isn't loose. If the object
> isn't loose, it could be that it was *just* removed by a
> running prune-packed, and the packfile that it was moved
> to was created after we scanned for packfiles, so time to
> rescan...

If that is the only reason we have these reprepare_packed_git()
sprinkled all over in sha1_file.c (by 637cdd9d), perhaps we should
rethink that. Is there a cheap way to trigger that rescanning only when
a prune-packed is in progress, I wonder...
* Re: git-index-pack really does suck..
  2007-04-03 21:44 ` Junio C Hamano
@ 2007-04-03 21:53   ` Shawn O. Pearce
  2007-04-03 22:10     ` Jeff King
  0 siblings, 1 reply; 58+ messages in thread
From: Shawn O. Pearce @ 2007-04-03 21:53 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Nicolas Pitre, Linus Torvalds, Chris Lee, Git Mailing List

Junio C Hamano <junkio@cox.net> wrote:
> "Shawn O. Pearce" <spearce@spearce.org> writes:
> > Here's the rub: in the missing object case we didn't find it
> > in the pack index, but it could be loose. That's one failed
> > syscall per object if the object isn't loose. If the object
> > isn't loose, it could be that it was *just* removed by a
> > running prune-packed, and the packfile that it was moved
> > to was created after we scanned for packfiles, so time to
> > rescan...
> 
> If that is the only reason we have these reprepare_packed_git()
> sprinkled all over in sha1_file.c (by 637cdd9d), perhaps we
> should rethink that. Is there a cheap way to trigger these
> rescanning only when a prune-packed is in progress, I wonder...

Yea, it is the only reason. So... if we could have some magic to
trigger that, it would be good. I just don't know what magic that
would be.

-- 
Shawn.
* Re: git-index-pack really does suck..
  2007-04-03 21:53 ` Shawn O. Pearce
@ 2007-04-03 22:10   ` Jeff King
  0 siblings, 0 replies; 58+ messages in thread
From: Jeff King @ 2007-04-03 22:10 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Junio C Hamano, Nicolas Pitre, Linus Torvalds, Chris Lee, Git Mailing List

On Tue, Apr 03, 2007 at 05:53:42PM -0400, Shawn O. Pearce wrote:
> > If that is the only reason we have these reprepare_packed_git()
> > sprinkled all over in sha1_file.c (by 637cdd9d), perhaps we
> > should rethink that. Is there a cheap way to trigger these
> > rescanning only when a prune-packed is in progress, I wonder...
> 
> Yea, it is the only reason. So... if we could have some
> magic to trigger that, it would be good. I just don't
> know what magic that would be.

It doesn't have to be in progress; it might have started and completed
between pack-scanning and object lookup. Something like:

  1. start git-repack -a -d
  2. start git-rev-list, scan for packs
  3. repack moves finished pack into place, starts git-prune-packed
  4. git-prune-packed completes
  5. git-rev-list looks up object

So you would need to have some sort of incremented counter that says
"there's a new pack available." Perhaps instead of rescanning
unconditionally, it should simply stat() the pack directory and look
for a change? That should be much cheaper.

-Jeff
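[ Editor's note: Jeff's "stat() the pack directory" idea can be
sketched as a small cache that rescans only when the directory's stat
data changes. This is an illustration of the approach, not git's
implementation; the class name and fields are invented, and note the
real-world caveat that coarse filesystem mtime granularity can hide a
change made within the same timestamp tick. ]

```python
import os

class PackDirCache:
    """Rescan the pack directory only when stat() says it changed
    (a packfile was added or pruned), instead of rescanning
    unconditionally on every missed object lookup."""
    def __init__(self, pack_dir):
        self.pack_dir = pack_dir
        self.stamp = None
        self.packs = []
        self.refresh()

    def _stat(self):
        st = os.stat(self.pack_dir)
        # mtime + inode + size: enough to notice a replaced or
        # modified directory under the granularity caveat above.
        return (st.st_mtime_ns, st.st_ino, st.st_size)

    def refresh(self):
        stamp = self._stat()
        if stamp != self.stamp:       # directory changed: rescan
            self.stamp = stamp
            self.packs = sorted(
                f for f in os.listdir(self.pack_dir)
                if f.endswith(".pack"))
            return True
        return False                  # unchanged: skip the rescan
```

On a miss, the caller would invoke `refresh()` and retry the lookup
only if it returned True, turning the unconditional rescan into one
stat() call in the common case.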
* Re: git-index-pack really does suck..
  2007-04-03 21:34 ` Nicolas Pitre
  2007-04-03 21:37   ` Shawn O. Pearce
@ 2007-04-03 22:40   ` Dana How
  2007-04-03 22:52     ` Linus Torvalds
  2007-04-03 23:00     ` Nicolas Pitre
  1 sibling, 2 replies; 58+ messages in thread
From: Dana How @ 2007-04-03 22:40 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Shawn O. Pearce, Linus Torvalds, Chris Lee, Junio C Hamano, Git Mailing List

On 4/3/07, Nicolas Pitre <nico@cam.org> wrote:
> On Tue, 3 Apr 2007, Shawn O. Pearce wrote:
> > Right. But maybe we shouldn't be scanning for packfiles every
> > time we don't find a loose object. Especially if the caller is in
> > a context where we actually *expect* to not find said object like
> > half of the time... say in git-add/update-index. ;-)
> 
> First, I truly believe we should have a 64-bit pack index and fewer
> larger packs than many small packs.
> 
> Which leaves us with the actual pack index lookup. At that point the
> cost of finding an existing object and finding that a given object
> doesn't exist is about the same thing, isn't it?
> 
> Optimizing that lookup is going to benefit both cases.

Do you get what you want if you move to fewer larger INDEX files but
not pack files -- in the extreme, one large index file?

A "super index" could be built from multiple .idx files. This would be
a new file (format) unencumbered by the past, so it could be tried out
more quickly. Just like objects are pruned when packed, .idx files
could be pruned when the super index is built.

Perhaps a number of (<2GB) packfiles and a large index file could work
out. Larger and larger pack files make me nervous. They are expensive
to manipulate, and >2GB requires a file format change.

Thanks,
-- 
Dana L. How  danahow@gmail.com  +1 650 804 5991 cell
* Re: git-index-pack really does suck.. 2007-04-03 22:40 ` Dana How @ 2007-04-03 22:52 ` Linus Torvalds 2007-04-03 22:31 ` David Lang 2007-04-03 23:00 ` Nicolas Pitre 1 sibling, 1 reply; 58+ messages in thread From: Linus Torvalds @ 2007-04-03 22:52 UTC (permalink / raw) To: Dana How Cc: Nicolas Pitre, Shawn O. Pearce, Chris Lee, Junio C Hamano, Git Mailing List On Tue, 3 Apr 2007, Dana How wrote: > > Larger and larger pack files make me nervous. > They are expensive to manipulate, > and >2GB requires a file format change. It sometimes also requires a new filesystem. There are a lot of filesystems that can handle more than 4GB total, but not necessarily in a single file. The only really useful such filesystem is probably FAT, which is still quite useful for things like USB memory sticks. But that is probably already worth supporting. So I think we want to support 64-bit (or at least something like 40+ bit) pack-files, but yes, I think that even if/when we support it, we still want to support the "multiple smaller pack-files" scenario exactly because for some uses it's much *better* to have ten 2GB packfiles rather than one 20GB pack-file. Linus ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-03 22:52 ` Linus Torvalds @ 2007-04-03 22:31 ` David Lang 0 siblings, 0 replies; 58+ messages in thread From: David Lang @ 2007-04-03 22:31 UTC (permalink / raw) To: Linus Torvalds Cc: Dana How, Nicolas Pitre, Shawn O. Pearce, Chris Lee, Junio C Hamano, Git Mailing List On Tue, 3 Apr 2007, Linus Torvalds wrote: > On Tue, 3 Apr 2007, Dana How wrote: >> >> Larger and larger pack files make me nervous. >> They are expensive to manipulate, >> and >2GB requires a file format change. > > It sometimes also requires a new filesystem. There are a lot of > filesystems that can handle more than 4GB total, but not necessarily in a > single file. > > The only really useful such filesystem is probably FAT, which is still > quite useful for things like USB memory sticks. But that is probably > already worth supporting. > > So I think we want to support 64-bit (or at least something like 40+ bit) > pack-files, but yes, I think that even if/when we support it, we still > want to support the "multiple smaller pack-files" schenario exactly > because for some uses it's much *better* to have ten 2GB packfiles rather > than one 20GB pack-file. however, for historical archives you may end up with wanting to do a 2GB packfile of 'recent' stuff, and a 14GB packfile of 'ancient' stuff (with the large one built with all space-saving options turned up all the way, no matter how much time it takes to build the pack) David Lang ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-03 22:40 ` Dana How 2007-04-03 22:52 ` Linus Torvalds @ 2007-04-03 23:00 ` Nicolas Pitre 1 sibling, 0 replies; 58+ messages in thread From: Nicolas Pitre @ 2007-04-03 23:00 UTC (permalink / raw) To: Dana How Cc: Shawn O. Pearce, Linus Torvalds, Chris Lee, Junio C Hamano, Git Mailing List On Tue, 3 Apr 2007, Dana How wrote: > Do you get what you want if you move to fewer larger INDEX files > but not pack files -- in the extreme, one large index file? No that doesn't solve the clone/fetch of large amount of data cleanly. > A "super index" could be built from multiple .idx files. > This would be a new file (format) unencumbered by the past, > so it could be tried out more quickly. > Just like objects are pruned when packed, > .idx files could be pruned when the super index is built. This is an idea to consider independently of any other issues. > Perhaps a number of (<2GB) packfiles and a large index > file could work out. > > Larger and larger pack files make me nervous. > They are expensive to manipulate, > and >2GB requires a file format change. No. Larger pack files are not more expensive than the same set of objects spread into multiple packs. In fact I'd say it's quite the opposite. And larger pack files do not require a pack format change -- it's just the index that has to change and the index is a local matter. Nicolas ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-03 20:56 ` Linus Torvalds 2007-04-03 21:03 ` Shawn O. Pearce @ 2007-04-03 21:21 ` Nicolas Pitre 1 sibling, 0 replies; 58+ messages in thread From: Nicolas Pitre @ 2007-04-03 21:21 UTC (permalink / raw) To: Linus Torvalds; +Cc: Chris Lee, Junio C Hamano, Git Mailing List On Tue, 3 Apr 2007, Linus Torvalds wrote: > > > On Tue, 3 Apr 2007, Nicolas Pitre wrote: > > > > > > Yeah. What happens is that inside the repo, because we do all the > > > duplicate object checks (verifying that there are no evil hash collisions) > > > even after fixing the memory leak, we end up keeping *track* of all those > > > objects. > > > > What do you mean? > > Look at what we have to do to look up a SHA1 object.. We create all the > lookup infrastructure, we don't *just* read the object. The delta base > cache is the most obvious one. It is capped to 16MB, so we're far from the 200+ MB count. > > I'm of the opinion that this patch is unnecessary. It only helps in > > bogus workflows to start with, and it makes the default behavior unsafe > > (unsafe from a paranoid pov, but still). And in the _normal_ workflow > > it should never trigger. > > Actually, even in the normal workflow it will do all the extra unnecessary > work, if only because the lookup costs of *not* finding the entry. > > Lookie here: > > - git index-pack of the *git* pack-file in the v2.6/linux directory (zero > overlap of objects) > > With --paranoid: > > 2.75user 0.37system 0:03.13elapsed 99%CPU > 0major+5583minor pagefaults > > Without --paranoid: > > 2.55user 0.12system 0:02.68elapsed 99%CPU > 0major+2957minor pagefaults > > See? That's the *normal* workflow. Zero objects found. 7% CPU overhead > from just the unnecessary work, and almost twice as much memory used. Just > from the index file lookup etc for a decent-sized project. 
7% overhead over two and a half seconds of CPU which, _normally_, happens when cloning the whole thing over a network connection which, if you're lucky and have a 6mbps cable connection, will still be spread over 5 minutes of real time. And that is assuming that you're cloning a big project inside itself which wouldn't work anyway. Otherwise a big clone would run index-pack in an empty repo where the lookup of existing objects is zero. That leaves git-fetch, which should concern itself with much smaller packs, pushing this overhead into the noise. > Now, in the KDE situation, the *unnecessary* lookups will be about ten > times more expensive, both on memory and CPU, just because the repository > is about 20x the size. Even with no actual hits. So? When would you really perform such an operation in a meaningful way? The memory usage worries me. I still cannot explain nor justify it. But the CPU overhead is certainly not of any concern in _normal_ usage scenarios, is it? If anything that might be a good test case for the Newton-Raphson pack lookup idea. Nicolas ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-03 20:18 ` Linus Torvalds 2007-04-03 20:32 ` Nicolas Pitre @ 2007-04-03 20:33 ` Linus Torvalds 2007-04-03 21:05 ` Nicolas Pitre 2007-04-03 20:34 ` Junio C Hamano 2 siblings, 1 reply; 58+ messages in thread From: Linus Torvalds @ 2007-04-03 20:33 UTC (permalink / raw) To: Chris Lee, Nicolas Pitre, Junio C Hamano; +Cc: Git Mailing List On Tue, 3 Apr 2007, Linus Torvalds wrote: > > So how about this updated patch? We could certainly make "git pull" imply > "--paranoid" if we want to, but even that is likely pretty unnecessary. > It's not like anybody has ever shown a SHA1 collision, and if the *local* > repository is corrupt (and has an object with the wrong SHA1 - that's what > the testsuite checks for), then it's probably good to get the valid object > from the remote.. Some trivial timings for indexing just the kernel pack.. Without --paranoid: 24.61user 2.16system 0:27.04elapsed 99%CPU 0major+14120minor pagefaults With --paranoid: 42.74user 3.04system 0:46.36elapsed 98%CPU 0major+72768minor pagefaults so it's a noticeable CPU issue, but it's even more noticeable in memory usage (55MB vs 284MB - pagefaults give a good way to look at how much memory really got allocated for the process). All that extra memory is just for SHA1 commit ID information. Now, clearly the usage scenario here is a bit odd (ie the case where we have all the objects already), so in that sense this is very much a worst-case situation, and you simply shouldn't *do* something like this, but at the same time, I'm just not convinced a very theoretical SHA1 collision check is worth it. Btw, even if we don't have any of the objects, if you have tons and tons of objects and do a "git pull", just the *lookup* of the nonexistent objects will be expensive: first we won't find it in any pack, then we'll look at the loose objects, and then we'll look in the pack *again* due to the race avoidance. 
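The lookup sequence Linus describes (packs, then loose objects, then the packs again for race avoidance) can be modeled as below; a toy sketch with packs and loose objects as plain Python sets, just to show why a *miss* does the most work:

```python
def locate_object(sha, packs, loose, recheck_packs=True):
    """Model of the lookup order described above: packs first, then the
    loose-object directory, then the packs again (the second pack scan
    guards against a concurrent repack moving the object into a new
    pack between the two checks). Returns (where_found, probe_count);
    illustrative only, nothing here is git's actual code."""
    probes = {"n": 0}

    def in_packs():
        for pack in packs:
            probes["n"] += 1          # one index lookup per pack
            if sha in pack:
                return True
        return False

    if in_packs():
        return "pack", probes["n"]
    probes["n"] += 1                  # one stat() of the loose object path
    if sha in loose:
        return "loose", probes["n"]
    if recheck_packs and in_packs():  # race avoidance: rescan the packs
        return "pack", probes["n"]
    return None, probes["n"]
```

A hit in the first pack costs one probe; a nonexistent object pays for every pack twice plus the loose lookup, which is exactly the case index-pack's duplicate checking hits for each incoming object.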
So looking up nonexistent objects is actually pretty expensive. In fact, "--paranoid" takes one second more for me even totally outside of a git repository, just because we waste so much time trying to look up non-existent object files ;) Linus ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-03 20:33 ` Linus Torvalds @ 2007-04-03 21:05 ` Nicolas Pitre 2007-04-03 21:11 ` Shawn O. Pearce 2007-04-03 21:24 ` Linus Torvalds 0 siblings, 2 replies; 58+ messages in thread From: Nicolas Pitre @ 2007-04-03 21:05 UTC (permalink / raw) To: Linus Torvalds; +Cc: Chris Lee, Junio C Hamano, Git Mailing List On Tue, 3 Apr 2007, Linus Torvalds wrote: > All that extra memory is just for SHA1 commit ID information. I don't see where that might be. The only thing that the paranoia check triggers is: foo = read_sha1_file(blah); memcmp(foo with bar); free(foo); So where is that commit ID information actually stored when using read_sha1_file()? > Btw, even if we don't have any of the objects, if you have tons and tons > of objects and do a "git pull", just the *lookup* of the nonexistent > objects will be expensive: first we won't find it in any pack, then we'll > look at the loose objects, and then we'll look int he pack *again* due to > the race avoidance. So looking up nonexistent objects is actually pretty > expensive. Not if you consider that it is performed _while_ receiving (and waiting for) the pack data over the net in the normal case. Nicolas ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-03 21:05 ` Nicolas Pitre @ 2007-04-03 21:11 ` Shawn O. Pearce 2007-04-03 21:24 ` Linus Torvalds 1 sibling, 0 replies; 58+ messages in thread From: Shawn O. Pearce @ 2007-04-03 21:11 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Linus Torvalds, Chris Lee, Junio C Hamano, Git Mailing List Nicolas Pitre <nico@cam.org> wrote: > On Tue, 3 Apr 2007, Linus Torvalds wrote: > > > All that extra memory is just for SHA1 commit ID information. > > I don't see where that might be. The only thing that the paranoia check > triggers is: > > foo = read_sha1_file(blah); > memcmp(foo with bar); > free(foo); > > So where is that commit ID information actually stored when using > read_sha1_file()? That might just be the mmap code in sha1_file.c. We are mmaping the existing packfile, to fetch the object. That counts as virtual memory. ;-) -- Shawn. ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-03 21:05 ` Nicolas Pitre 2007-04-03 21:11 ` Shawn O. Pearce @ 2007-04-03 21:24 ` Linus Torvalds [not found] ` <alpine.LFD.0.98.0704031735470.28181@xanadu.home> 2007-04-03 21:42 ` Nicolas Pitre 1 sibling, 2 replies; 58+ messages in thread From: Linus Torvalds @ 2007-04-03 21:24 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Chris Lee, Junio C Hamano, Git Mailing List On Tue, 3 Apr 2007, Nicolas Pitre wrote: > > I don't see where that might be. The only thing that the paranoia check > triggers is: > > foo = read_sha1_file(blah); > memcmp(foo with bar); > free(foo); > > So where is that commit ID information actually stored when using > read_sha1_file()? I've got the numbers: it uses much more memory when doing even failing lookups, ie: With --paranoid: 5583 minor pagefaults (21MB) Without --paranoid: 2957 minor pagefaults (11MB) (remember, this was *just* the git pack, not the kernel pack) It could be as simple as just the index file itself. That's 11MB for the kernel. Imagine if the index file was 20 times bigger - 200MB of memory paged in with bad access patterns just for unnecessary lookups. Running valgrind shows no leak at all without --paranoid. With --paranoid, there's some really trivial stuff (the "packed_git" structure etc, so I think it's really just the index itself). > Not if you consider that it is performed _while_ receiving (and waiting > for) the pack data over the net in the normal case. ..which is why I think it makes sense for "pull" to be paranoid. I just don't think it makes sense to be paranoid *all* the time, since it's clearly expensive. Linus ^ permalink raw reply [flat|nested] 58+ messages in thread
[parent not found: <alpine.LFD.0.98.0704031735470.28181@xanadu.home>]
* Re: git-index-pack really does suck.. 2007-04-03 21:24 ` Linus Torvalds [not found] ` <alpine.LFD.0.98.0704031735470.28181@xanadu.home> @ 2007-04-03 21:42 ` Nicolas Pitre 2007-04-03 22:07 ` Junio C Hamano 2007-04-03 22:14 ` Linus Torvalds 1 sibling, 2 replies; 58+ messages in thread From: Nicolas Pitre @ 2007-04-03 21:42 UTC (permalink / raw) To: Linus Torvalds; +Cc: Chris Lee, Junio C Hamano, Git Mailing List On Tue, 3 Apr 2007, Linus Torvalds wrote: > > > On Tue, 3 Apr 2007, Nicolas Pitre wrote: > > > > I don't see where that might be. The only thing that the paranoia check > > triggers is: > > > > foo = read_sha1_file(blah); > > memcmp(foo with bar); > > free(foo); > > > > So where is that commit ID information actually stored when using > > read_sha1_file()? > > I've got the numbers: it uses much more memory when doing even failing > lookups, ie: > > With --paranoid: 5583 minor pagefaults (21MB) > Without --paranoid: 2957 minor pagefaults (11MB) > > (remember, this was *just* the git pack, not the kernel pack) > > It could be as simple as just the index file itself. That's 11MB for the > kernel. Imagine if the index file was 20 times bigger - 200MB of memory > paged in with bad access patterns just for unnecessary lookups. Again that presumes you have to page in the whole index, which should not happen when pulling (much smaller packs) and with a better lookup algorithm. > > Not if you consider that it is performed _while_ receiving (and waiting > > for) the pack data over the net in the normal case. > > ..which is why I think it makes sense for "pull" to be paranoid. I just > don't think it makes sense to be paranoid *all* the time, since it's > clearly expensive. Make it conditional on --stdin then. This covers all cases where we really want the secure thing to happen, and the --stdin case already performs the atomic rename-and-move thing when the pack is fully indexed. Nicolas ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-03 21:42 ` Nicolas Pitre @ 2007-04-03 22:07 ` Junio C Hamano 2007-04-03 22:11 ` Shawn O. Pearce 2007-04-03 22:34 ` Nicolas Pitre 2007-04-03 22:14 ` Linus Torvalds 1 sibling, 2 replies; 58+ messages in thread From: Junio C Hamano @ 2007-04-03 22:07 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Linus Torvalds, Chris Lee, Git Mailing List Nicolas Pitre <nico@cam.org> writes: > Make it conditionnal on --stdin then. This covers all cases where we > really want the secure thing to happen, and the --stdin case already > perform the atomic rename-and-move thing when the pack is fully indexed. Repacking objects in a repository uses pack-objects without using index-pack, as you suggested Chris. Is there a sane usage of index-pack that does not use --stdin? I do not think of any. If there isn't, the "conditional on --stdin" suggestion means we unconditionally do the secure thing for all the sane usage, and go unsecure for an insane usage that we do not really care about. If so, it seems to me that it would be the simplest not to touch the code at all, except that missing free(). Am I missing something? ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-03 22:07 ` Junio C Hamano @ 2007-04-03 22:11 ` Shawn O. Pearce 2007-04-03 22:34 ` Nicolas Pitre 1 sibling, 0 replies; 58+ messages in thread From: Shawn O. Pearce @ 2007-04-03 22:11 UTC (permalink / raw) To: Junio C Hamano; +Cc: Nicolas Pitre, Linus Torvalds, Chris Lee, Git Mailing List Junio C Hamano <junkio@cox.net> wrote: > Nicolas Pitre <nico@cam.org> writes: > > > Make it conditionnal on --stdin then. This covers all cases where we > > really want the secure thing to happen, and the --stdin case already > > perform the atomic rename-and-move thing when the pack is fully indexed. > > Repacking objects in a repository uses pack-objects without > using index-pack, as you suggested Chris. Is there a sane usage > of index-pack that does not use --stdin? I do not think of any. > > If there isn't, the "conditional on --stdin" suggestion means we > unconditionally do the secure thing for all the sane usage, and > go unsecure for an insane usage that we do not really care about. > > If so, it seems to me that it would be the simplest not to touch > the code at all, except that missing free(). > > Am I missing something? Nope. I agree with you completely. -- Shawn. ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-03 22:07 ` Junio C Hamano 2007-04-03 22:11 ` Shawn O. Pearce @ 2007-04-03 22:34 ` Nicolas Pitre 1 sibling, 0 replies; 58+ messages in thread From: Nicolas Pitre @ 2007-04-03 22:34 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, Chris Lee, Git Mailing List On Tue, 3 Apr 2007, Junio C Hamano wrote: > Nicolas Pitre <nico@cam.org> writes: > > > Make it conditionnal on --stdin then. This covers all cases where we > > really want the secure thing to happen, and the --stdin case already > > perform the atomic rename-and-move thing when the pack is fully indexed. > > Repacking objects in a repository uses pack-objects without > using index-pack, as you suggested Chris. Is there a sane usage > of index-pack that does not use --stdin? I do not think of any. > > If there isn't, the "conditional on --stdin" suggestion means we > unconditionally do the secure thing for all the sane usage, and > go unsecure for an insane usage that we do not really care about. > > If so, it seems to me that it would be the simplest not to touch > the code at all, except that missing free(). That's exactly what I think as well. > Am I missing something? Not from my point of view. Nicolas ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-03 21:42 ` Nicolas Pitre 2007-04-03 22:07 ` Junio C Hamano @ 2007-04-03 22:14 ` Linus Torvalds 2007-04-03 22:55 ` Nicolas Pitre 1 sibling, 1 reply; 58+ messages in thread From: Linus Torvalds @ 2007-04-03 22:14 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Chris Lee, Junio C Hamano, Git Mailing List On Tue, 3 Apr 2007, Nicolas Pitre wrote: > > > > It could be as simple as just the index file itself. That's 11MB for the > > kernel. Imagine if the index file was 20 times bigger - 200MB of memory > > paged in with bad access patterns just for unnecessary lookups. > > Again that presumes you have to page in the whole index, which should > not happen when pulling (much smaller packs) and with a better lookup > algorithm. Actually, since SHA1's are randomly distributed, together with the binary search, you probably *will* pull in the bulk of the index, even with a fairly small set of SHA1's that you check. > Make it conditionnal on --stdin then. This covers all cases where we > really want the secure thing to happen, and the --stdin case already > perform the atomic rename-and-move thing when the pack is fully indexed. I don't care *what* it is conditional on, but your arguments suck. You claim that it's not a normal case to already have the objects, when it *is* a normal case for alternates, etc. I don't understand why you argue against hard numbers. You have none of your own. Linus ^ permalink raw reply [flat|nested] 58+ messages in thread
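Linus's point about binary search over randomly distributed SHA1s can be checked with a small simulation (Python; the 24-byte record size mimics a v1 .idx entry of 4-byte offset plus 20-byte SHA-1, but the numbers here are purely illustrative): each lookup reads only ~log2(N) entries, yet because those entries are scattered across the file, a modest batch of lookups, even all misses, drags in most pages of the index.

```python
import random

def pages_touched(table, entry_size, page_size, keys):
    """Record which pages a binary search reads in a sorted array of
    fixed-size records, for each key in keys. A model of paging in a
    mmap'd pack index, not git code."""
    touched = set()
    for key in keys:
        lo, hi = 0, len(table)
        while lo < hi:
            mid = (lo + hi) // 2
            touched.add(mid * entry_size // page_size)  # page holding this record
            if table[mid] < key:
                lo = mid + 1
            else:
                hi = mid
    return touched
```

With 65,536 entries of 24 bytes (384 four-kilobyte pages), a single lookup touches at most 16 pages, but a thousand random lookups touch the large majority of all 384, which is why even failing lookups page in the bulk of a big index.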
* Re: git-index-pack really does suck.. 2007-04-03 22:14 ` Linus Torvalds @ 2007-04-03 22:55 ` Nicolas Pitre 2007-04-03 22:36 ` David Lang 2007-04-03 23:29 ` Linus Torvalds 0 siblings, 2 replies; 58+ messages in thread From: Nicolas Pitre @ 2007-04-03 22:55 UTC (permalink / raw) To: Linus Torvalds; +Cc: Chris Lee, Junio C Hamano, Git Mailing List On Tue, 3 Apr 2007, Linus Torvalds wrote: > I don't care *what* it is conditional on, but your arguments suck. You > claim that it's not a normal case to already have the objects, when it > *is* a normal case for alternates, etc. > > I don't understand why you argue against hard numbers. You have none of > your own. Are hard numbers like 7% overhead (because right now that's all we have) really worth it against bad _perceptions_? Sure, the SHA1 collision attack is paranoia. But it is becoming increasingly *possible*. And when we only had unpack-objects on the receiving end of a fetch, you yourself bragged about the implied security of GIT in the presence of a SHA1 collision attack. Because let's admit it: when a SHA1 collision does happen, it is far more likely to come on purpose than from pure accident. But as you said at the time, it is not a problem because GIT trusts local objects more than remote ones and incidentally unpack-objects doesn't overwrite existing objects. The keeping of fetched packs broke that presumption of trust towards local objects and it opened a real path for potential future attacks. Those attacks are still fairly theoretical of course. But for how _long_? Do we want GIT to be considered backdoor prone in a couple years from now just because we were obsessed by a 7% CPU overhead? I think we have much more to gain by playing it safe and being more secure and paranoid than trying to squeeze some CPU cycles out of an operation that is likely to always be bounded by network speed for most people. And we _know_ that the operation can be optimized further anyway. 
So IMHO in this case hard numbers alone aren't the end of it. Not as long as they're reasonably low. And especially not for a command which is 1) rather infrequent and 2) not really interactive like git-log might be. Nicolas ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-03 22:55 ` Nicolas Pitre @ 2007-04-03 22:36 ` David Lang 2007-04-04 9:51 ` Alex Riesen 1 sibling, 1 reply; 58+ messages in thread From: David Lang @ 2007-04-03 22:36 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Linus Torvalds, Chris Lee, Junio C Hamano, Git Mailing List On Tue, 3 Apr 2007, Nicolas Pitre wrote: > On Tue, 3 Apr 2007, Linus Torvalds wrote: > >> I don't care *what* it is conditional on, but your arguments suck. You >> claim that it's not a normal case to already have the objects, when it >> *is* a normal case for alternates, etc. >> >> I don't understand why you argue against hard numbers. You have none of >> your own. > > Are hard numbers like 7% overhead (because right now that's all we have) > really worth it against bad _perceptions_? plus 1s overhead on what's otherwise a noop command. > The keeping of fetched packs broke that presumption of trust towards > local objects and it opened a real path for potential future attacks. > Those attacks are still fairly theoretical of course. But for how > _long_? Do we want GIT to be considered backdoor prone in a couple > years from now just because we were obsessed by a 7% CPU overhead? > > I think we have much more to gain by playing it safe and being more > secure and paranoid than trying to squeeze some CPU cycles out of an > operation that is likely to ever be bounded by network speed for most > people. this is why --paranoid should be left on for network pulls, but having it on for local uses, where the cost isn't hidden by network limits, isn't good. David Lang ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-03 22:36 ` David Lang @ 2007-04-04 9:51 ` Alex Riesen [not found] ` <Pine.LNX.4.63.0704061455380.24050@qynat.qvtvafvgr.pbz> 2007-04-06 21:56 ` David Lang 0 siblings, 2 replies; 58+ messages in thread From: Alex Riesen @ 2007-04-04 9:51 UTC (permalink / raw) To: David Lang Cc: Nicolas Pitre, Linus Torvalds, Chris Lee, Junio C Hamano, Git Mailing List On 4/4/07, David Lang <david.lang@digitalinsight.com> wrote: > > > The keeping of fetched packs broke that presumption of trust towards > > local objects and it opened a real path for potential future attacks. > > Those attacks are still fairly theoretical of course. But for how > > _long_? Do we want GIT to be considered backdoor prone in a couple > > years from now just because we were obsessed by a 7% CPU overhead? > > > > I think we have much more to gain by playing it safe and being more > > secure and paranoid than trying to squeeze some CPU cycles out of an > > operation that is likely to ever be bounded by network speed for most > > people. > > this is why -paranoid should be left on for network pulls, but having it on for > the local uses means that the cost isn't hidden in the network limits isn't > good. You never know what pull is networked (or should I say: remote enough to cause a collision). ^ permalink raw reply [flat|nested] 58+ messages in thread
[parent not found: <Pine.LNX.4.63.0704061455380.24050@qynat.qvtvafvgr.pbz>]
* Re: git-index-pack really does suck.. 2007-04-04 9:51 ` Alex Riesen [not found] ` <Pine.LNX.4.63.0704061455380.24050@qynat.qvtvafvgr.pbz> @ 2007-04-06 21:56 ` David Lang 2007-04-06 22:47 ` Junio C Hamano 1 sibling, 1 reply; 58+ messages in thread From: David Lang @ 2007-04-06 21:56 UTC (permalink / raw) To: Alex Riesen Cc: Nicolas Pitre, Linus Torvalds, Chris Lee, Junio C Hamano, Git Mailing List On Wed, 4 Apr 2007, Alex Riesen wrote: > On 4/4/07, David Lang <david.lang@digitalinsight.com> wrote: >> >> > The keeping of fetched packs broke that presumption of trust towards >> > local objects and it opened a real path for potential future attacks. >> > Those attacks are still fairly theoretical of course. But for how >> > _long_? Do we want GIT to be considered backdoor prone in a couple >> > years from now just because we were obsessed by a 7% CPU overhead? >> > >> > I think we have much more to gain by playing it safe and being more >> > secure and paranoid than trying to squeeze some CPU cycles out of an >> > operation that is likely to ever be bounded by network speed for most >> > people. >> >> this is why -paranoid should be left on for network pulls, but having it on >> for >> the local uses means that the cost isn't hidden in the network limits isn't >> good. > > You never know what pull is networked (or should I say: remote enough > to cause a collision). so leave it on for all pulls, but for other commands don't turn it on. remember that the command that linus ran into at the start of the thread wasn't a pull. David Lang ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-06 21:56 ` David Lang @ 2007-04-06 22:47 ` Junio C Hamano 2007-04-06 22:49 ` Junio C Hamano 0 siblings, 1 reply; 58+ messages in thread From: Junio C Hamano @ 2007-04-06 22:47 UTC (permalink / raw) To: David Lang Cc: Alex Riesen, Nicolas Pitre, Linus Torvalds, Chris Lee, Git Mailing List David Lang <david.lang@digitalinsight.com> writes: > On Wed, 4 Apr 2007, Alex Riesen wrote: > ... >> You never know what pull is networked (or should I say: remote enough >> to cause a collision). > > so leave it on for all pulls, but for other commands don't turn it on. > > remember that the command that linus ran into at the start of the > thread wasn't a pull. Are you referring to this command $ git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack in this message? ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-06 22:47 ` Junio C Hamano @ 2007-04-06 22:49 ` Junio C Hamano 2007-04-06 22:22 ` David Lang 0 siblings, 1 reply; 58+ messages in thread From: Junio C Hamano @ 2007-04-06 22:49 UTC (permalink / raw) To: David Lang Cc: Alex Riesen, Nicolas Pitre, Linus Torvalds, Chris Lee, Git Mailing List Junio C Hamano <junkio@cox.net> writes: > David Lang <david.lang@digitalinsight.com> writes: > >> On Wed, 4 Apr 2007, Alex Riesen wrote: >> ... >>> You never know what pull is networked (or should I say: remote enough >>> to cause a collision). >> >> so leave it on for all pulls, but for other commands don't turn it on. >> >> remember that the command that linus ran into at the start of the >> thread wasn't a pull. > > Are you referring to this command > > $ git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack > > in this message? From: Linus Torvalds <torvalds@linux-foundation.org> Subject: git-index-pack really does suck.. Date: Tue, 3 Apr 2007 08:15:12 -0700 (PDT) Message-ID: <Pine.LNX.4.64.0704030754020.6730@woody.linux-foundation.org> (sorry, chomped the message). ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-06 22:49 ` Junio C Hamano @ 2007-04-06 22:22 ` David Lang 2007-04-06 22:55 ` Junio C Hamano 0 siblings, 1 reply; 58+ messages in thread From: David Lang @ 2007-04-06 22:22 UTC (permalink / raw) To: Junio C Hamano Cc: Alex Riesen, Nicolas Pitre, Linus Torvalds, Chris Lee, Git Mailing List On Fri, 6 Apr 2007, Junio C Hamano wrote: > Subject: Re: git-index-pack really does suck.. > > Junio C Hamano <junkio@cox.net> writes: > >> David Lang <david.lang@digitalinsight.com> writes: >> >>> On Wed, 4 Apr 2007, Alex Riesen wrote: >>> ... >>>> You never know what pull is networked (or should I say: remote enough >>>> to cause a collision). >>> >>> so leave it on for all pulls, but for other commands don't turn it on. >>> >>> remember that the command that linus ran into at the start of the >>> thread wasn't a pull. >> >> Are you referring to this command >> >> $ git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack >> >> in this message? > > From: Linus Torvalds <torvalds@linux-foundation.org> > Subject: git-index-pack really does suck.. > Date: Tue, 3 Apr 2007 08:15:12 -0700 (PDT) > Message-ID: <Pine.LNX.4.64.0704030754020.6730@woody.linux-foundation.org> > > (sorry, chomped the message). probably (I usually don't keep the mail after I read or reply to it) David Lang ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-06 22:22 ` David Lang @ 2007-04-06 22:55 ` Junio C Hamano 2007-04-06 22:28 ` David Lang 0 siblings, 1 reply; 58+ messages in thread From: Junio C Hamano @ 2007-04-06 22:55 UTC (permalink / raw) To: David Lang Cc: Alex Riesen, Nicolas Pitre, Linus Torvalds, Chris Lee, Git Mailing List David Lang <david.lang@digitalinsight.com> writes: > On Fri, 6 Apr 2007, Junio C Hamano wrote: > >> Subject: Re: git-index-pack really does suck.. >> >> Junio C Hamano <junkio@cox.net> writes: >> >>> David Lang <david.lang@digitalinsight.com> writes: >>> >>>> On Wed, 4 Apr 2007, Alex Riesen wrote: >>>> ... >>>>> You never know what pull is networked (or should I say: remote enough >>>>> to cause a collision). >>>> >>>> so leave it on for all pulls, but for other commands don't turn it on. >>>> >>>> remember that the command that linus ran into at the start of the >>>> thread wasn't a pull. >>> >>> Are you referring to this command >>> >>> $ git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack >>> >>> in this message? >> >> From: Linus Torvalds <torvalds@linux-foundation.org> >> Subject: git-index-pack really does suck.. >> Date: Tue, 3 Apr 2007 08:15:12 -0700 (PDT) >> Message-ID: <Pine.LNX.4.64.0704030754020.6730@woody.linux-foundation.org> >> >> (sorry, chomped the message). > > probably (I useually don't keep the mail after I read or reply to it) Well, then you should remember that the command linus ran into was pretty much about pull, nothing else. The quoted command was only to illustrate what 'git-pull' invokes internally. I do not think of any reason to use that command for cases other than 'git-pull'. What's the use case you have in mind to run that command outside of the context of git-pull? ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck..
  2007-04-06 22:55                 ` Junio C Hamano
@ 2007-04-06 22:28                   ` David Lang
  0 siblings, 0 replies; 58+ messages in thread
From: David Lang @ 2007-04-06 22:28 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Alex Riesen, Nicolas Pitre, Linus Torvalds, Chris Lee, Git Mailing List

On Fri, 6 Apr 2007, Junio C Hamano wrote:

> David Lang <david.lang@digitalinsight.com> writes:
>
>> On Fri, 6 Apr 2007, Junio C Hamano wrote:
>>
>>> Subject: Re: git-index-pack really does suck..
>>>
>>> Junio C Hamano <junkio@cox.net> writes:
>>>
>>>> David Lang <david.lang@digitalinsight.com> writes:
>>>>
>>>>> On Wed, 4 Apr 2007, Alex Riesen wrote:
>>>>> ...
>>>>>> You never know what pull is networked (or should I say: remote enough
>>>>>> to cause a collision).
>>>>>
>>>>> so leave it on for all pulls, but for other commands don't turn it on.
>>>>>
>>>>> remember that the command that linus ran into at the start of the
>>>>> thread wasn't a pull.
>>>>
>>>> Are you referring to this command
>>>>
>>>>   $ git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack
>>>>
>>>> in this message?
>>>
>>>     From: Linus Torvalds <torvalds@linux-foundation.org>
>>>     Subject: git-index-pack really does suck..
>>>     Date: Tue, 3 Apr 2007 08:15:12 -0700 (PDT)
>>>     Message-ID: <Pine.LNX.4.64.0704030754020.6730@woody.linux-foundation.org>
>>>
>>> (sorry, chomped the message).
>>
>> probably (I usually don't keep the mail after I read or reply to it)
>
> Well, then you should remember that the command Linus ran into
> was pretty much about pull, nothing else.
>
> The quoted command was only to illustrate what 'git-pull'
> invokes internally.  I do not think of any reason to use that
> command for cases other than 'git-pull'.  What's the use case
> you have in mind to run that command outside of the context of
> git-pull?

I guess I'm not remembering the thread accurately then (and/or am
mixing it up with a different thread).

I thought that Linus had identified other cases that were impacted
(something about proving that the object doesn't exist)

David Lang
* Re: git-index-pack really does suck..
  2007-04-03 22:55           ` Nicolas Pitre
  2007-04-03 22:36             ` David Lang
@ 2007-04-03 23:29             ` Linus Torvalds
  1 sibling, 0 replies; 58+ messages in thread
From: Linus Torvalds @ 2007-04-03 23:29 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Chris Lee, Junio C Hamano, Git Mailing List

On Tue, 3 Apr 2007, Nicolas Pitre wrote:
>
> Are hard numbers like 7% overhead (because right now that's all we have)
> really worth it against bad _perceptions_?

If it actually stays at just 7% even with large repos (and the numbers
from Chris seem to say that it doesn't get worse - in fact, it may be that
the lookup gets relatively more efficient for a large repo thanks to the
log(n) costs), I agree that 7% probably isn't worth worrying about when
weighed against "guaranteed no SHA1 collision".

Especially as long as you'd normally only hit it when your real
performance issue is going to be the network. So especially if we can make
sure that the *local* case is ok, where the network isn't going to be the
bottleneck, I think we can/should do the paranoia.

That's especially true as it is also the local case where the 7% has
already been shown to be just the best case, with the worst case being
many hundred percent (and memory use going up from 55M to 280M in one
example), thanks to us actually *finding* the objects.

		Linus
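[Editor's note: the "paranoia" being weighed above is index-pack checking, for every incoming object, whether an object with the same SHA-1 already exists locally and, if so, whether the stored bytes match. A minimal sketch of that idea follows; the names are hypothetical, git's real check lives in index-pack.c, and plain SHA-1 of the payload stands in for git's actual header-plus-content hashing.]

```python
import hashlib

def add_object(store: dict, payload: bytes) -> str:
    """Toy object store with the collision 'paranoia' check: if an
    object with the same id is already present, its bytes must be
    identical, or we have found a hash collision (or local corruption).
    """
    # Note: real git hashes a "<type> <size>" header plus the content;
    # that detail is simplified away here.
    oid = hashlib.sha1(payload).hexdigest()
    existing = store.get(oid)  # this extra lookup is the measured overhead
    if existing is not None and existing != payload:
        raise ValueError("SHA1 collision or corrupt local object: " + oid)
    store[oid] = payload
    return oid
```

[On a normal fetch almost no incoming object exists locally, so the cost is dominated by the existence lookup; the byte-for-byte compare only runs on a hit, which is the "actually *finding* the objects" worst case described above.]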
* Re: git-index-pack really does suck..
  2007-04-03 20:18       ` Linus Torvalds
  2007-04-03 20:32         ` Nicolas Pitre
  2007-04-03 20:33         ` Linus Torvalds
@ 2007-04-03 20:34         ` Junio C Hamano
  2007-04-03 20:53           ` Nicolas Pitre
  2 siblings, 1 reply; 58+ messages in thread
From: Junio C Hamano @ 2007-04-03 20:34 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Chris Lee, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> That whole "verify no SHA1 hash collision" code is really pretty damn
> paranoid. Maybe we shouldn't have it enabled by default.
>
> So how about this updated patch? We could certainly make "git pull" imply
> "--paranoid" if we want to, but even that is likely pretty unnecessary.
> It's not like anybody has ever shown a SHA1 collision, and if the *local*
> repository is corrupt (and has an object with the wrong SHA1 - that's what
> the testsuite checks for), then it's probably good to get the valid object
> from the remote..

I agree with that reasoning.  We did not do paranoid in git-pull
long after we introduced the .keep thing anyway, so I do not
think the following patch is even needed, but I am throwing it
out just for discussion.

diff --git a/fetch-pack.c b/fetch-pack.c
index 06f4aec..c687f9f 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -522,6 +522,7 @@ static int get_pack(int xd[2])
 
 	if (do_keep) {
 		*av++ = "index-pack";
+		*av++ = "--paranoid";
 		*av++ = "--stdin";
 		if (!quiet && !no_progress)
 			*av++ = "-v";
* Re: git-index-pack really does suck..
  2007-04-03 20:34         ` Junio C Hamano
@ 2007-04-03 20:53           ` Nicolas Pitre
  0 siblings, 0 replies; 58+ messages in thread
From: Nicolas Pitre @ 2007-04-03 20:53 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, Chris Lee, Git Mailing List

On Tue, 3 Apr 2007, Junio C Hamano wrote:

> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
> > That whole "verify no SHA1 hash collision" code is really pretty damn
> > paranoid. Maybe we shouldn't have it enabled by default.
> >
> > So how about this updated patch? We could certainly make "git pull" imply
> > "--paranoid" if we want to, but even that is likely pretty unnecessary.
> > It's not like anybody has ever shown a SHA1 collision, and if the *local*
> > repository is corrupt (and has an object with the wrong SHA1 - that's what
> > the testsuite checks for), then it's probably good to get the valid object
> > from the remote..
>
> I agree with that reasoning.

For the record, I don't agree.  I stated why in my other email.

> We did not do paranoid in git-pull long after we introduced the .keep
> thing anyway,

That doesn't make it more "correct".

> so I do not
> think the following patch is even needed, but I am throwing it
> out just for discussion.

1) None of the objects in a pack should exist in the local repo when
   fetching, meaning that the paranoia code should not be executed
   normally.

2) Running index-pack on a pack _inside_ a repository is a dubious
   thing to do with questionable usefulness already.

3) It is inefficient to run pack-objects with --stdout just to feed
   the result to index-pack afterwards while repack-objects can create
   the index itself, which is the source of this discussion.

4) I invite you to read the commit log for 8685da42561 where the
   _perception_ of GIT's security is discussed, which led to the
   paranoia check, and sometimes the perception is more valuable than
   the reality, especially when it is free.

Therefore Linus' patch and this one are working around the wrong issue
as described in (3) IMHO.

What could be done instead, if really really needed, is to have the
paranoia test be made conditional on index-pack --stdin instead.  But
please no bogus extra switches, pretty please.

Nicolas
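[Editor's note: point (3) above is that piping `git pack-objects --stdout` into `git index-pack` re-parses the whole stream just to learn offsets the writer already knew. A toy illustration of the "build the index while writing" alternative, using simplified formats that are not git's actual pack/idx layout:]

```python
import hashlib
import io

def write_pack(objects):
    """Toy pack writer: length-prefixed blobs, concatenated.  Because
    the writer knows each object's offset as it writes, the index
    (id -> offset) falls out for free -- no second pass over the data,
    which is what piping through a separate indexer would cost."""
    buf = io.BytesIO()
    index = {}
    for obj in objects:
        index[hashlib.sha1(obj).hexdigest()] = buf.tell()
        buf.write(len(obj).to_bytes(4, "big"))
        buf.write(obj)
    return buf.getvalue(), index

def read_object(pack: bytes, index: dict, oid: str) -> bytes:
    """Random access via the index: seek to the offset, read the
    4-byte length prefix, then read the body."""
    off = index[oid]
    size = int.from_bytes(pack[off:off + 4], "big")
    return pack[off + 4:off + 4 + size]
```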
end of thread, other threads: [~2007-04-06 22:59 UTC | newest]
Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-04-03 15:15 git-index-pack really does suck Linus Torvalds
[not found] ` <Pine.LNX.4.64.0704031413200.6730@woody.linux-foundation.org>
[not found] ` <alpine.LFD.0.98.0704031836350.28181@xanadu.home>
[not found] ` <db69205d0704031227q1009eabfhdd82aa3636f25bb6@mail.gmail.com>
[not found] ` <Pine.LNX.4.64.0704031304420.6730@woody.linux-foundation.org>
[not found] ` <Pine.LNX.4.64.0704031322490.6730@woody.linux-foundation.org>
2007-04-03 16:21 ` Linus Torvalds
2007-04-03 16:40 ` Nicolas Pitre
2007-04-03 16:33 ` Nicolas Pitre
2007-04-03 19:27 ` Chris Lee
2007-04-03 19:49 ` Nicolas Pitre
2007-04-03 19:54 ` Chris Lee
2007-04-03 20:18 ` Linus Torvalds
2007-04-03 20:32 ` Nicolas Pitre
2007-04-03 20:40 ` Junio C Hamano
2007-04-03 21:00 ` Linus Torvalds
2007-04-03 21:28 ` Nicolas Pitre
2007-04-03 22:49 ` Chris Lee
2007-04-03 23:12 ` Linus Torvalds
2007-04-03 20:56 ` Linus Torvalds
2007-04-03 21:03 ` Shawn O. Pearce
2007-04-03 21:13 ` Linus Torvalds
2007-04-03 21:17 ` Shawn O. Pearce
2007-04-03 21:26 ` Linus Torvalds
2007-04-03 21:28 ` Linus Torvalds
2007-04-03 22:31 ` Junio C Hamano
2007-04-03 22:38 ` Shawn O. Pearce
2007-04-03 22:41 ` Junio C Hamano
2007-04-05 10:22 ` [PATCH 1/2] git-fetch--tool pick-rref Junio C Hamano
2007-04-05 10:22 ` [PATCH 2/2] git-fetch: use fetch--tool pick-rref to avoid local fetch from alternate Junio C Hamano
2007-04-05 16:15 ` Shawn O. Pearce
2007-04-05 21:37 ` Junio C Hamano
2007-04-03 21:34 ` git-index-pack really does suck Nicolas Pitre
2007-04-03 21:37 ` Shawn O. Pearce
2007-04-03 21:44 ` Junio C Hamano
2007-04-03 21:53 ` Shawn O. Pearce
2007-04-03 22:10 ` Jeff King
2007-04-03 22:40 ` Dana How
2007-04-03 22:52 ` Linus Torvalds
2007-04-03 22:31 ` David Lang
2007-04-03 23:00 ` Nicolas Pitre
2007-04-03 21:21 ` Nicolas Pitre
2007-04-03 20:33 ` Linus Torvalds
2007-04-03 21:05 ` Nicolas Pitre
2007-04-03 21:11 ` Shawn O. Pearce
2007-04-03 21:24 ` Linus Torvalds
[not found] ` <alpine.LFD.0.98.0704031735470.28181@xanadu.home>
2007-04-03 21:42 ` Nicolas Pitre
2007-04-03 22:07 ` Junio C Hamano
2007-04-03 22:11 ` Shawn O. Pearce
2007-04-03 22:34 ` Nicolas Pitre
2007-04-03 22:14 ` Linus Torvalds
2007-04-03 22:55 ` Nicolas Pitre
2007-04-03 22:36 ` David Lang
2007-04-04 9:51 ` Alex Riesen
[not found] ` <Pine.LNX.4.63.0704061455380.24050@qynat.qvtvafvgr.pbz>
2007-04-06 21:56 ` David Lang
2007-04-06 22:47 ` Junio C Hamano
2007-04-06 22:49 ` Junio C Hamano
2007-04-06 22:22 ` David Lang
2007-04-06 22:55 ` Junio C Hamano
2007-04-06 22:28 ` David Lang
2007-04-03 23:29 ` Linus Torvalds
2007-04-03 20:34 ` Junio C Hamano
2007-04-03 20:53 ` Nicolas Pitre