* git-index-pack really does suck..
@ 2007-04-03 15:15 Linus Torvalds
0 siblings, 5 replies; 58+ messages in thread
From: Linus Torvalds @ 2007-04-03 15:15 UTC (permalink / raw)
To: Junio C Hamano, Nicolas Pitre, Git Mailing List
Junio, Nico,
I think we need to do something about it.
CLee was complaining about git-index-pack on #irc with the partial KDE
repo, and while I don't have the KDE repo, I decided to investigate a bit.
Even with just the kernel repo (with a single 170MB pack-file), I can do
git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack
and it uses 52s of CPU-time, and on my 4GB machine it actually started
doing IO and swapping, because git-index-pack grew to 4.8GB in size. So
while I initially thought I'd want a bigger test-case to see the problem,
I sure as heck don't.
The 52s of CPU time exploded into almost three minutes of actual
real-time:
47.33user 5.79system 2:41.65elapsed 32%CPU
2117major+1245763minor
And that's on a good system with a powerful CPU, "enough memory" for any
reasonable development, and good disks! Very much ungood-plus-plus.
I haven't looked into exactly why yet, but I bet it's just that we keep
every single object expanded in memory. We do need to keep the objects
around, so that we can resolve deltas, but we can certainly do it other
ways.
Two suggestions for other ways:
- simple one: don't keep unexploded objects around, just keep the deltas,
and spend tons of CPU-time just re-expanding them if required.
We *should* be able to do it with just keeping the original 170MB
pack-file in memory, not expanding it to 3.8GB!
Still, even this will be painful once you have a big pack-file, and the
CPU waste is nasty (although a delta-base cache like we do in
sha1_file.c would probably fix it 99% - at that point it's getting
less simple, and the "best" solution below looks more palatable)
- best one: when writing out the pack-file, we incrementally keep a
"struct packed_git" around, and update the index for it dynamically,
and totally get rid of all objects that we've written out, because we
can re-create them.
This means that we should have _zero_ memory footprint except for the
one object that we're working on right then and there, and any
unresolved deltas where we've not seen the base at all (and the latter
generally shouldn't happen any more with most pack-files)
The "best one" wouldn't seem to be *that* painful, but as mentioned, I
haven't even started looking at the code yet, I thought I'd try to rope
Nico into looking at this first ;)
Linus
* Re: git-index-pack really does suck..
From: Linus Torvalds @ 2007-04-03 16:21 UTC (permalink / raw)
To: Junio C Hamano, Nicolas Pitre, Git Mailing List

On Tue, 3 Apr 2007, Linus Torvalds wrote:
>
> and it uses 52s of CPU-time, and on my 4GB machine it actually started
> doing IO and swapping, because git-index-pack grew to 4.8GB in size.

Ahh. False alarm.

The problem is actually largely a really stupid memory leak in the SHA1
collision checking (which wouldn't trigger on a normal pull, but obviously
does trigger for every single object when testing!)

This trivial patch fixes most of it. git-index-pack still uses too much
memory, but it does a *lot* better.

Junio, please get this into 1.5.1 (I *think* the SHA1 checking is new, but
if it exists in 1.5.0 too, it obviously needs the same fix).

It still grows, but it grew to just 287M in size now for the 170M kernel
object:

	41.59user 1.39system 0:43.64elapsed
	0major+73552minor

which is quite a lot better.

Duh.

		Linus

---
 index-pack.c | 1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/index-pack.c b/index-pack.c
index 6284fe3..3c768fb 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -358,6 +358,7 @@ static void sha1_object(const void *data, unsigned long size,
 		if (size != has_size || type != has_type ||
 		    memcmp(data, has_data, size) != 0)
 			die("SHA1 COLLISION FOUND WITH %s !", sha1_to_hex(sha1));
+		free(has_data);
 	}
 }
* Re: git-index-pack really does suck..
From: Nicolas Pitre @ 2007-04-03 16:40 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List

On Tue, 3 Apr 2007, Linus Torvalds wrote:
>
> On Tue, 3 Apr 2007, Linus Torvalds wrote:
> >
> > and it uses 52s of CPU-time, and on my 4GB machine it actually started
> > doing IO and swapping, because git-index-pack grew to 4.8GB in size.
>
> Ahh. False alarm.
>
> The problem is actually largely a really stupid memory leak in the SHA1
> collision checking (which wouldn't trigger on a normal pull, but obviously
> does trigger for every single object when testing!)

Damn! Please don't report those things when I'm out for lunch so I could
have a chance to fix my own stupidities myself in time! :-)

> This trivial patch fixes most of it. git-index-pack still uses too much
> memory, but it does a *lot* better.
>
> Junio, please get this into 1.5.1 (I *think* the SHA1 checking is new, but
> if it exists in 1.5.0 too, it obviously needs the same fix).

No, it is new to 1.5.1.

> It still grows, but it grew to just 287M in size now for the 170M kernel
> object:
>
> 	41.59user 1.39system 0:43.64elapsed
> 	0major+73552minor
>
> which is quite a lot better.
>
> Duh.

Indeed.

Acked-by: Nicolas Pitre <nico@cam.org>

>  index-pack.c | 1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/index-pack.c b/index-pack.c
> index 6284fe3..3c768fb 100644
> --- a/index-pack.c
> +++ b/index-pack.c
> @@ -358,6 +358,7 @@ static void sha1_object(const void *data, unsigned long size,
>  		if (size != has_size || type != has_type ||
>  		    memcmp(data, has_data, size) != 0)
>  			die("SHA1 COLLISION FOUND WITH %s !", sha1_to_hex(sha1));
> +		free(has_data);
>  	}
>  }

Nicolas
* Re: git-index-pack really does suck..
From: Nicolas Pitre @ 2007-04-03 16:33 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List

On Tue, 3 Apr 2007, Linus Torvalds wrote:
>
> Junio, Nico,
>  I think we need to do something about it.

Sure. Mea culpa.

> CLee was complaining about git-index-pack on #irc with the partial KDE
> repo, and while I don't have the KDE repo, I decided to investigate a bit.
>
> Even with just the kernel repo (with a single 170MB pack-file), I can do
>
> 	git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack
>
> and it uses 52s of CPU-time, and on my 4GB machine it actually started
> doing IO and swapping, because git-index-pack grew to 4.8GB in size.

Right.

> Two suggestions for other ways:
>
>  - simple one: don't keep unexploded objects around, just keep the deltas,
>    and spend tons of CPU-time just re-expanding them if required.
>
>    We *should* be able to do it with just keeping the original 170MB
>    pack-file in memory, not expanding it to 3.8GB!
>
>    Still, even this will be painful once you have a big pack-file, and the
>    CPU waste is nasty (although a delta-base cache like we do in
>    sha1_file.c would probably fix it 99% - at that point it's getting
>    less simple, and the "best" solution below looks more palatable)
>
>  - best one: when writing out the pack-file, we incrementally keep a
>    "struct packed_git" around, and update the index for it dynamically,
>    and totally get rid of all objects that we've written out, because we
>    can re-create them.
>
>    This means that we should have _zero_ memory footprint except for the
>    one object that we're working on right then and there, and any
>    unresolved deltas where we've not seen the base at all (and the latter
>    generally shouldn't happen any more with most pack-files)

Even better:

 - Fix my own stupid mistake with a _single_ line of code:

diff --git a/index-pack.c b/index-pack.c
index 6284fe3..3c768fb 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -358,6 +358,7 @@ static void sha1_object(const void *data, unsigned long size,
 		if (size != has_size || type != has_type ||
 		    memcmp(data, has_data, size) != 0)
 			die("SHA1 COLLISION FOUND WITH %s !", sha1_to_hex(sha1));
+		free(has_data);
 	}
 }

The thing is, that code path is executed _only_ when index-pack
encounters an object already in the repository, in order to protect
against possible SHA1 collision attacks. See the commit 8685da42561d log
for the full story.

Normally this should not happen in normal usage scenarios, because the
objects you fetch are those that you don't already have. But if you
manually run index-pack inside an existing repository, then you already
have _all_ those objects, which explains the high CPU usage. That is no
excuse for not freeing the data, though.

Nicolas
* Re: git-index-pack really does suck..
From: Chris Lee @ 2007-04-03 19:27 UTC (permalink / raw)
To: Linus Torvalds, git

On 4/3/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Junio, Nico,
>  I think we need to do something about it.
>
> CLee was complaining about git-index-pack on #irc with the partial KDE
> repo, and while I don't have the KDE repo, I decided to investigate a bit.
>
> Even with just the kernel repo (with a single 170MB pack-file), I can do
>
> 	git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack
>
> and it uses 52s of CPU-time, and on my 4GB machine it actually started
> doing IO and swapping, because git-index-pack grew to 4.8GB in size. So
> while I initially thought I'd want a bigger test-case to see the problem,
> I sure as heck don't.

There's another issue here.

I'm running git-index-pack as part of a workflow like so:

$ git-verify-pack -v .git/objects/pack/*.idx > /tmp/all-objects
$ grep 'blob' /tmp/all-objects > /tmp/blob-objects
$ cat /tmp/blob-objects | awk '{print $1;}' | git-pack-objects
--delta-base-offset --all-progress --stdout > blob.pack
$ git-index-pack -v blob.pack

Now, when I run 'git-index-pack' on blob.pack in the current
directory, memory usage is pretty horrific (even with the applied
patch to not leak everything). Shawn tells me that index-pack
should only be decompressing the object twice - once from the repo and
once from blob.pack - iff I call git-index-pack with --stdin, which I
am not.

If I move the blob.pack into /tmp, and run git-index-pack on it there,
it completes much faster and the memory usage never exceeds 200MB.
(Inside the repo, it takes up over 3G of RES according to top.)

By "much faster", I mean: the entire pack was indexed and completed in
17:42.40, whereas I cancelled the inside-the-repo index because after
56 minutes it was only at 46%.

(And, as far as getting this huge repo published - I have it burned to
a DVD, and I'm going to drop it in the mail to hpa today on my lunch
break, which I should be taking soon.)

-clee
* Re: git-index-pack really does suck..
From: Nicolas Pitre @ 2007-04-03 19:49 UTC (permalink / raw)
To: Chris Lee; +Cc: Linus Torvalds, git

On Tue, 3 Apr 2007, Chris Lee wrote:

> There's another issue here.
>
> I'm running git-index-pack as part of a workflow like so:
>
> $ git-verify-pack -v .git/objects/pack/*.idx > /tmp/all-objects
> $ grep 'blob' /tmp/all-objects > /tmp/blob-objects
> $ cat /tmp/blob-objects | awk '{print $1;}' | git-pack-objects
> --delta-base-offset --all-progress --stdout > blob.pack
> $ git-index-pack -v blob.pack

Instead of using --stdout with git-pack-objects, you should provide it
with a suitable base name for the resulting pack, and the index will be
created automatically alongside the pack for you. No need to use
index-pack for that.

> Now, when I run 'git-index-pack' on blob.pack in the current
> directory, memory usage is pretty horrific (even with the applied
> patch to not leak everything). Shawn tells me that index-pack
> should only be decompressing the object twice - once from the repo and
> once from blob.pack - iff I call git-index-pack with --stdin, which I
> am not.
>
> If I move the blob.pack into /tmp, and run git-index-pack on it there,
> it completes much faster and the memory usage never exceeds 200MB.
> (Inside the repo, it takes up over 3G of RES according to top.)

The 3G should definitely be fixed with the added free(). The CPU usage
is explained by the fact that you're running index-pack on objects that
are all already found in your repo, so the collision check is triggered.
This is more or less the same issue as if you tried to run unpack-objects
on the same pack, where none of your objects will actually be unpacked.

Nicolas
* Re: git-index-pack really does suck..
From: Chris Lee @ 2007-04-03 19:54 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Linus Torvalds, git

On 4/3/07, Nicolas Pitre <nico@cam.org> wrote:
> On Tue, 3 Apr 2007, Chris Lee wrote:
>
> > There's another issue here.
> >
> > I'm running git-index-pack as part of a workflow like so:
> >
> > $ git-verify-pack -v .git/objects/pack/*.idx > /tmp/all-objects
> > $ grep 'blob' /tmp/all-objects > /tmp/blob-objects
> > $ cat /tmp/blob-objects | awk '{print $1;}' | git-pack-objects
> > --delta-base-offset --all-progress --stdout > blob.pack
> > $ git-index-pack -v blob.pack
>
> Instead of using --stdout with git-pack-objects, you should provide it
> with a suitable base name for the resulting pack, and the index will be
> created automatically alongside the pack for you. No need to use
> index-pack for that.

Right. But then I wouldn't have discovered how much git-index-pack
sucks. :)

> > Now, when I run 'git-index-pack' on blob.pack in the current
> > directory, memory usage is pretty horrific (even with the applied
> > patch to not leak everything). Shawn tells me that index-pack
> > should only be decompressing the object twice - once from the repo and
> > once from blob.pack - iff I call git-index-pack with --stdin, which I
> > am not.
> >
> > If I move the blob.pack into /tmp, and run git-index-pack on it there,
> > it completes much faster and the memory usage never exceeds 200MB.
> > (Inside the repo, it takes up over 3G of RES according to top.)
>
> The 3G should definitely be fixed with the added free().

Not really. This packfile is 2.6GB in size, and apparently it gets
mmap'd. (Yesterday, my machine ran out of memory trying to do
index-pack when the memleak still existed; I have 4G of RAM and,
normally, 4G of swap, but I upped it to 32G of swap and it still ran
out of memory.)

> The CPU usage is explained by the fact that you're running index-pack on
> objects that are all already found in your repo, so the collision check
> is triggered. This is more or less the same issue as if you tried to
> run unpack-objects on the same pack, where none of your objects will
> actually be unpacked.

Right, and if I was using --stdin, I would expect that. But I'm not.
And, according to Shawn anyway, the current behaviour is not what was
intended.

-clee
* Re: git-index-pack really does suck..
From: Linus Torvalds @ 2007-04-03 20:18 UTC (permalink / raw)
To: Chris Lee, Junio C Hamano; +Cc: Git Mailing List

On Tue, 3 Apr 2007, Chris Lee wrote:
>
> There's another issue here.
>
> I'm running git-index-pack as part of a workflow like so:
>
> $ git-verify-pack -v .git/objects/pack/*.idx > /tmp/all-objects
> $ grep 'blob' /tmp/all-objects > /tmp/blob-objects
> $ cat /tmp/blob-objects | awk '{print $1;}' | git-pack-objects
> --delta-base-offset --all-progress --stdout > blob.pack
> $ git-index-pack -v blob.pack
>
> Now, when I run 'git-index-pack' on blob.pack in the current
> directory, memory usage is pretty horrific (even with the applied
> patch to not leak everything). Shawn tells me that index-pack
> should only be decompressing the object twice - once from the repo and
> once from blob.pack - iff I call git-index-pack with --stdin, which I
> am not.
>
> If I move the blob.pack into /tmp, and run git-index-pack on it there,
> it completes much faster and the memory usage never exceeds 200MB.
> (Inside the repo, it takes up over 3G of RES according to top.)

Yeah. What happens is that inside the repo, because we do all the
duplicate object checks (verifying that there are no evil hash collisions)
even after fixing the memory leak, we end up keeping *track* of all those
objects. And with a large repository, it's quite the expensive operation.

That whole "verify no SHA1 hash collision" code is really pretty damn
paranoid. Maybe we shouldn't have it enabled by default.

So how about this updated patch? We could certainly make "git pull" imply
"--paranoid" if we want to, but even that is likely pretty unnecessary.
It's not like anybody has ever shown a SHA1 collision, and if the *local*
repository is corrupt (and has an object with the wrong SHA1 - that's what
the testsuite checks for), then it's probably good to get the valid object
from the remote..

This includes the previous one-liner, but also adds the "--paranoid" flag
and fixes up the Documentation and tests to match.

Junio, your choice, but regardless which one you choose:

	Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Thanks,

		Linus

---
 Documentation/git-index-pack.txt |    3 +--
 index-pack.c                     |    6 +++++-
 t/t5300-pack-object.sh           |    4 ++--
 3 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/Documentation/git-index-pack.txt b/Documentation/git-index-pack.txt
index 2229ee8..7d8d33b 100644
--- a/Documentation/git-index-pack.txt
+++ b/Documentation/git-index-pack.txt
@@ -8,8 +8,7 @@ git-index-pack - Build pack index file for an existing packed archive

 SYNOPSIS
 --------
-'git-index-pack' [-v] [-o <index-file>] <pack-file>
-'git-index-pack' --stdin [--fix-thin] [--keep] [-v] [-o <index-file>] [<pack-file>]
+'git-index-pack' [--stdin [--fix-thin] [--keep]] [-v] [--paranoid] [-o <index-file>] <pack-file>

 DESCRIPTION
diff --git a/index-pack.c b/index-pack.c
index 6284fe3..8a4c27a 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -45,6 +45,7 @@ static int nr_resolved_deltas;

 static int from_stdin;
 static int verbose;
+static int paranoid;

 static volatile sig_atomic_t progress_update;

@@ -348,7 +349,7 @@ static void sha1_object(const void *data, unsigned long size,
 			enum object_type type, unsigned char *sha1)
 {
 	hash_sha1_file(data, size, typename(type), sha1);
-	if (has_sha1_file(sha1)) {
+	if (paranoid && has_sha1_file(sha1)) {
 		void *has_data;
 		enum object_type has_type;
 		unsigned long has_size;
@@ -358,6 +359,7 @@ static void sha1_object(const void *data, unsigned long size,
 		if (size != has_size || type != has_type ||
 		    memcmp(data, has_data, size) != 0)
 			die("SHA1 COLLISION FOUND WITH %s !", sha1_to_hex(sha1));
+		free(has_data);
 	}
 }

@@ -839,6 +841,8 @@ int main(int argc, char **argv)
 		if (*arg == '-') {
 			if (!strcmp(arg, "--stdin")) {
 				from_stdin = 1;
+			} else if (!strcmp(arg, "--paranoid")) {
+				paranoid = 1;
 			} else if (!strcmp(arg, "--fix-thin")) {
 				fix_thin_pack = 1;
 			} else if (!strcmp(arg, "--keep")) {
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 35e036a..407c71e 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -262,7 +262,7 @@ test_expect_success \
     .git/objects/c8/2de19312b6c3695c0c18f70709a6c535682a67'

 test_expect_failure \
-    'make sure index-pack detects the SHA1 collision' \
-    'git-index-pack -o bad.idx test-3.pack'
+    'make sure index-pack detects the SHA1 collision when paranoid' \
+    'git-index-pack --paranoid -o bad.idx test-3.pack'

 test_done
* Re: git-index-pack really does suck..
From: Nicolas Pitre @ 2007-04-03 20:32 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Chris Lee, Junio C Hamano, Git Mailing List

On Tue, 3 Apr 2007, Linus Torvalds wrote:
>
> On Tue, 3 Apr 2007, Chris Lee wrote:
> >
> > There's another issue here.
> >
> > I'm running git-index-pack as part of a workflow like so:
> >
> > $ git-verify-pack -v .git/objects/pack/*.idx > /tmp/all-objects
> > $ grep 'blob' /tmp/all-objects > /tmp/blob-objects
> > $ cat /tmp/blob-objects | awk '{print $1;}' | git-pack-objects
> > --delta-base-offset --all-progress --stdout > blob.pack
> > $ git-index-pack -v blob.pack
> >
> > Now, when I run 'git-index-pack' on blob.pack in the current
> > directory, memory usage is pretty horrific (even with the applied
> > patch to not leak everything). Shawn tells me that index-pack
> > should only be decompressing the object twice - once from the repo and
> > once from blob.pack - iff I call git-index-pack with --stdin, which I
> > am not.
> >
> > If I move the blob.pack into /tmp, and run git-index-pack on it there,
> > it completes much faster and the memory usage never exceeds 200MB.
> > (Inside the repo, it takes up over 3G of RES according to top.)
>
> Yeah. What happens is that inside the repo, because we do all the
> duplicate object checks (verifying that there are no evil hash collisions)
> even after fixing the memory leak, we end up keeping *track* of all those
> objects.

What do you mean?

> And with a large repository, it's quite the expensive operation.
>
> That whole "verify no SHA1 hash collision" code is really pretty damn
> paranoid. Maybe we shouldn't have it enabled by default.

Maybe we shouldn't run index-pack on packs for which we _already_ have
an index, which is the most likely reason for the collision check to
trigger in the first place.

This is in the same category as trying to run unpack-objects on a pack
within a repository and wondering why it doesn't work.

> So how about this updated patch? We could certainly make "git pull" imply
> "--paranoid" if we want to, but even that is likely pretty unnecessary.

I'm of the opinion that this patch is unnecessary. It only helps in
bogus workflows to start with, and it makes the default behavior unsafe
(unsafe from a paranoid pov, but still). And in the _normal_ workflow
it should never trigger.

So I wouldn't merge it.

Nicolas
* Re: git-index-pack really does suck..
From: Junio C Hamano @ 2007-04-03 20:40 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Linus Torvalds, Chris Lee, Git Mailing List

[sorry, sent a message without finishing]

Nicolas Pitre <nico@cam.org> writes:

> Maybe we shouldn't run index-pack on packs for which we _already_ have
> an index, which is the most likely reason for the collision check to
> trigger in the first place.
>
> This is in the same category as trying to run unpack-objects on a pack
> within a repository and wondering why it doesn't work.
> ...
> I'm of the opinion that this patch is unnecessary. It only helps in
> bogus workflows to start with, and it makes the default behavior unsafe
> (unsafe from a paranoid pov, but still). And in the _normal_ workflow
> it should never trigger.

Hmmmm. You may have a point.

So maybe we should retitle this thread from "git-index-pack
really does suck.." to "I used git-index-pack in a stupid way"?
* Re: git-index-pack really does suck..
From: Linus Torvalds @ 2007-04-03 21:00 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Nicolas Pitre, Chris Lee, Git Mailing List

On Tue, 3 Apr 2007, Junio C Hamano wrote:
>
> So maybe we should retitle this thread from "git-index-pack
> really does suck.." to "I used git-index-pack in a stupid way"?

See my separate timing numbers, although I bet that Chris can give even
better ones..

Chris, try applying my patch, and then inside the KDE repo you have, do

	git index-pack --paranoid --stdin --fix-thin new.pack < ~/git/.git/objects/pack/pack-*.pack

(ie index the objects of the *git* repository, not the KDE one). That
should approximate doing a fair-sized "git pull" - getting new objects. Do
it with and without --paranoid, and time it.

I bet that what I see as a 7% slowdown will be much bigger for you, just
because the negative lookups will be all that much more expensive when you
have tons of objects.

		Linus
* Re: git-index-pack really does suck..
From: Nicolas Pitre @ 2007-04-03 21:28 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, Chris Lee, Git Mailing List

On Tue, 3 Apr 2007, Linus Torvalds wrote:

> See my separate timing numbers, although I bet that Chris can give even
> better ones..
>
> Chris, try applying my patch, and then inside the KDE repo you have, do
>
> 	git index-pack --paranoid --stdin --fix-thin new.pack < ~/git/.git/objects/pack/pack-*.pack
>
> (ie index the objects of the *git* repository, not the KDE one). That
> should approximate doing a fair-sized "git pull" - getting new objects. Do
> it with and without --paranoid, and time it.

Like I said, this is bogus, since index-pack is throttled by the network,
making this overhead a non-issue in real life. And like I said, there
should _not_ be such a memory usage difference, which is probably showing
potential problems *elsewhere*.

> I bet that what I see as a 7% slowdown will be much bigger for you, just
> because the negative lookups will be all that much more expensive when you
> have tons of objects.

And I bet your newton-raphson lookup idea would shine and bring that
overhead down considerably in that case.

Nicolas
* Re: git-index-pack really does suck..
From: Chris Lee @ 2007-04-03 22:49 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, Nicolas Pitre, Git Mailing List

On 4/3/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, 3 Apr 2007, Junio C Hamano wrote:
> > So maybe we should retitle this thread from "git-index-pack
> > really does suck.." to "I used git-index-pack in a stupid way"?

Well, I never claimed to be a genius. :)

> See my separate timing numbers, although I bet that Chris can give even
> better ones..
>
> Chris, try applying my patch, and then inside the KDE repo you have, do
>
> 	git index-pack --paranoid --stdin --fix-thin new.pack < ~/git/.git/objects/pack/pack-*.pack
>
> (ie index the objects of the *git* repository, not the KDE one). That
> should approximate doing a fair-sized "git pull" - getting new objects. Do
> it with and without --paranoid, and time it.

% time git-index-pack --paranoid --stdin --fix-thin paranoid.pack < /usr/local/src/git/.git/objects/pack/*pack
pack	bf8ba7897da9c84d1981ecdc92c0b1979506a4b9
git-index-pack --paranoid --stdin --fix-thin paranoid.pack <  5.28s user 0.24s system 98% cpu 5.592 total

% time git-index-pack --stdin --fix-thin trusting.pack < /usr/local/src/git/.git/objects/pack/*pack
pack	bf8ba7897da9c84d1981ecdc92c0b1979506a4b9
git-index-pack --stdin --fix-thin trusting.pack <  5.07s user 0.12s system 99% cpu 5.202 total

So, in my case, at least... not really much of a difference, which is
puzzling.

> I bet that what I see as a 7% slowdown will be much bigger for you, just
> because the negative lookups will be all that much more expensive when you
> have tons of objects.

I applied exactly the patch you sent, and it applied perfectly cleanly -
no failures.

I also mailed out the DVD with the repo on it to hpa today, so
hopefully by tomorrow he'll get it. (He's not even two cities over,
and I suspect I could have just driven it to his place, but that might
have been a little awkward since I've never met him.)

Anyway, so, hopefully once he gets it he can put it up somewhere that
you guys can grab it. For reference, the KDE repo is pretty big, but a
"real" conversion of the repo would be bigger; the one that I've been
playing with only has the KDE svn trunk, and only the first 409k
revisions - there are, as of right now, over 650k revisions in KDE's
svn repo. So, realistically speaking, a fully-converted KDE git repo
would probably take up at least 6GB, packed, if not more. Subproject
support would probably be *really* helpful to mitigate that.

-clee
* Re: git-index-pack really does suck..
From: Linus Torvalds @ 2007-04-03 23:12 UTC (permalink / raw)
To: Chris Lee; +Cc: Junio C Hamano, Nicolas Pitre, Git Mailing List

On Tue, 3 Apr 2007, Chris Lee wrote:
>
> git-index-pack --paranoid --stdin --fix-thin paranoid.pack <  5.28s user 0.24s system 98% cpu 5.592 total
>
> git-index-pack --stdin --fix-thin trusting.pack <  5.07s user 0.12s system 99% cpu 5.202 total

Ok, that's not a big enough of a difference to care.

> So, in my case, at least... not really much of a difference, which is
> puzzling.

It's entirely possible that the object lookup is good enough to not be a
problem even for huge packs, and it really only gets to be a problem when
you actually unpack all the objects.

In that case, the only real case to worry about is indeed the "alternates"
case (or if people actually use a shared git object directory, but I don't
think anybody really does - alternates just work well enough, and shared
object directories are painful enough that I doubt anybody *really* uses
it).

> I also mailed out the DVD with the repo on it to hpa today, so
> hopefully by tomorrow he'll get it. (He's not even two cities over,
> and I suspect I could have just driven it to his place, but that might
> have been a little awkward since I've never met him.)

Heh. Ok, good. I'll torrent it or something when it's up.

> Anyway, so, hopefully once he gets it he can put it up somewhere that
> you guys can grab it. For reference, the KDE repo is pretty big, but a
> "real" conversion of the repo would be bigger; the one that I've been
> playing with only has the KDE svn trunk, and only the first 409k
> revisions - there are, as of right now, over 650k revisions in KDE's
> svn repo. So, realistically speaking, a fully-converted KDE git repo
> would probably take up at least 6GB, packed, if not more. Subproject
> support would probably be *really* helpful to mitigate that.

Sure. I think subproject support is likely the big "missing feature" of
git right now. The rest is "details", even if they can be big and
involved details.

But even at only 409k revisions, it's still going to be an order of
magnitude bigger than what the kernel is, exactly *because* it's such a
disaster from a maintenance setup standpoint, and it's going to be a
useful real-world test-case. So whether that is a "good" git archive or
not, it's going to be useful.

Long ago we used to be able to look at the historic Linux archive as an
example of a "big" archive, but it's not actually all that much bigger
than the normal Linux archive any more, and we've pretty much fixed the
problems we used to have.

[ The historical pack-file is actually smaller, but that's because it was
  done with a much deeper delta-chain to make it small: the historical
  archive still has more objects in it than the current active git kernel
  tree - but it's only in the 20% range, not "20 *times* bigger" ]

The Eclipse tree was useful (and I think we already improved performance
for you thanks to working with it - I don't know how much faster the
delta-base cache made things for you, but I'd assume it was at *least* by
the factor-of-2.5 that we saw on Eclipse), but the KDE is bigger *and*
deeper (the eclipse tree is 1.7GB, and 136k revisions in the main branch,
so the KDE tree is more than twice the revisions).

		Linus
* Re: git-index-pack really does suck..
  2007-04-03 20:32 ` Nicolas Pitre
  2007-04-03 20:40   ` Junio C Hamano
@ 2007-04-03 20:56   ` Linus Torvalds
  2007-04-03 21:03     ` Shawn O. Pearce
  2007-04-03 21:21     ` Nicolas Pitre
  1 sibling, 2 replies; 58+ messages in thread
From: Linus Torvalds @ 2007-04-03 20:56 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Chris Lee, Junio C Hamano, Git Mailing List

On Tue, 3 Apr 2007, Nicolas Pitre wrote:
> > 
> > Yeah. What happens is that inside the repo, because we do all the
> > duplicate object checks (verifying that there are no evil hash
> > collisions) even after fixing the memory leak, we end up keeping
> > *track* of all those objects.
> 
> What do you mean?

Look at what we have to do to look up a SHA1 object.. We create all the
lookup infrastructure, we don't *just* read the object. The delta base
cache is the most obvious one.

> I'm of the opinion that this patch is unnecessary. It only helps in
> bogus workflows to start with, and it makes the default behavior unsafe
> (unsafe from a paranoid pov, but still). And in the _normal_ workflow
> it should never trigger.

Actually, even in the normal workflow it will do all the extra
unnecessary work, if only because of the lookup costs of *not* finding
the entry.

Lookie here:

 - git index-pack of the *git* pack-file in the v2.6/linux directory
   (zero overlap of objects)

   With --paranoid:

	2.75user 0.37system 0:03.13elapsed 99%CPU
	0major+5583minor pagefaults

   Without --paranoid:

	2.55user 0.12system 0:02.68elapsed 99%CPU
	0major+2957minor pagefaults

See? That's the *normal* workflow. Zero objects found. 7% CPU overhead
from just the unnecessary work, and almost twice as much memory used.
Just from the index file lookup etc for a decent-sized project.

Now, in the KDE situation, the *unnecessary* lookups will be about ten
times more expensive, both on memory and CPU, just because the
repository is about 20x the size. Even with no actual hits.

		Linus
* Re: git-index-pack really does suck..
  2007-04-03 20:56 ` Linus Torvalds
@ 2007-04-03 21:03   ` Shawn O. Pearce
  2007-04-03 21:13     ` Linus Torvalds
  2007-04-03 21:21   ` Nicolas Pitre
  1 sibling, 1 reply; 58+ messages in thread
From: Shawn O. Pearce @ 2007-04-03 21:03 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nicolas Pitre, Chris Lee, Junio C Hamano, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Actually, even in the normal workflow it will do all the extra
> unnecessary work, if only because of the lookup costs of *not* finding
> the entry.
> 
> Lookie here:
> 
>  - git index-pack of the *git* pack-file in the v2.6/linux directory
>    (zero overlap of objects)
> 
>    With --paranoid:
> 
> 	2.75user 0.37system 0:03.13elapsed 99%CPU
> 	0major+5583minor pagefaults
> 
>    Without --paranoid:
> 
> 	2.55user 0.12system 0:02.68elapsed 99%CPU
> 	0major+2957minor pagefaults
> 
> See? That's the *normal* workflow. Zero objects found. 7% CPU overhead
> from just the unnecessary work, and almost twice as much memory used.
> Just from the index file lookup etc for a decent-sized project.

OK, but what about that case with unpack-objects? Didn't we there do
all this work to also check for the object already existing? During
update-index, write-tree and commit-tree don't we also do a lot of work
(per object anyway) to check for a non-existing object?

So even with --paranoid (aka what we have now) index-pack still should
be faster than unpack-objects for any sizeable transfer, and is just as
"safe".

If it's the missing-object lookup that is expensive, maybe we should
try to optimize that. We do it enough already in other parts of the
code...

-- 
Shawn.
* Re: git-index-pack really does suck..
  2007-04-03 21:03 ` Shawn O. Pearce
@ 2007-04-03 21:13   ` Linus Torvalds
  2007-04-03 21:17     ` Shawn O. Pearce
  0 siblings, 1 reply; 58+ messages in thread
From: Linus Torvalds @ 2007-04-03 21:13 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Nicolas Pitre, Chris Lee, Junio C Hamano, Git Mailing List

On Tue, 3 Apr 2007, Shawn O. Pearce wrote:
> 
> So even with --paranoid (aka what we have now) index-pack still
> should be faster than unpack-objects for any sizeable transfer,
> and is just as "safe".

I agree 100% that index-pack is much much MUCH better than
unpack-objects ever was, so ..

> If it's the missing-object lookup that is expensive, maybe we should
> try to optimize that. We do it enough already in other parts of
> the code...

Well, for all other cases it's really the "object found" case that is
worth optimizing for, so I think optimizing for "no object" is actually
wrong, unless it also speeds up (or at least doesn't make it worse) the
"real" normal case.

		Linus
* Re: git-index-pack really does suck..
  2007-04-03 21:13 ` Linus Torvalds
@ 2007-04-03 21:17   ` Shawn O. Pearce
  2007-04-03 21:26     ` Linus Torvalds
  2007-04-03 21:34     ` git-index-pack really does suck.. Nicolas Pitre
  0 siblings, 2 replies; 58+ messages in thread
From: Shawn O. Pearce @ 2007-04-03 21:17 UTC (permalink / raw)
To: Linus Torvalds
Cc: Nicolas Pitre, Chris Lee, Junio C Hamano, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, 3 Apr 2007, Shawn O. Pearce wrote:
> > If it's the missing-object lookup that is expensive, maybe we should
> > try to optimize that. We do it enough already in other parts of
> > the code...
> 
> Well, for all other cases it's really the "object found" case that is
> worth optimizing for, so I think optimizing for "no object" is actually
> wrong, unless it also speeds up (or at least doesn't make it worse) the
> "real" normal case.

Right. But maybe we shouldn't be scanning for packfiles every time we
don't find a loose object. Especially if the caller is in a context
where we actually *expect* to not find said object like half of the
time... say in git-add/update-index. ;-)

-- 
Shawn.
* Re: git-index-pack really does suck..
  2007-04-03 21:17 ` Shawn O. Pearce
@ 2007-04-03 21:26   ` Linus Torvalds
  2007-04-03 21:28     ` Linus Torvalds
  2007-04-03 21:34   ` git-index-pack really does suck.. Nicolas Pitre
  1 sibling, 1 reply; 58+ messages in thread
From: Linus Torvalds @ 2007-04-03 21:26 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Nicolas Pitre, Chris Lee, Junio C Hamano, Git Mailing List

On Tue, 3 Apr 2007, Shawn O. Pearce wrote:
> 
> Right. But maybe we shouldn't be scanning for packfiles every
> time we don't find a loose object. Especially if the caller is in
> a context where we actually *expect* to not find said object like
> half of the time... say in git-add/update-index. ;-)

Yes, we could definitely skip the re-lookup if we had a "don't really
care, I can recreate the object myself" flag (ie anybody who is going
to write that object).

		Linus
* Re: git-index-pack really does suck..
  2007-04-03 21:26 ` Linus Torvalds
@ 2007-04-03 21:28   ` Linus Torvalds
  2007-04-03 22:31     ` Junio C Hamano
  ` (2 more replies)
  0 siblings, 3 replies; 58+ messages in thread
From: Linus Torvalds @ 2007-04-03 21:28 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Nicolas Pitre, Chris Lee, Junio C Hamano, Git Mailing List

On Tue, 3 Apr 2007, Linus Torvalds wrote:
> 
> Yes, we could definitely skip the re-lookup if we had a "don't really
> care, I can recreate the object myself" flag (ie anybody who is going to
> write that object)

Side note: with "alternates" files, you might well *always* have the
objects. If you do

	git clone -l -s ...

to create various branches, and then pull between them, you'll actually
end up in the situation that you'll always find the objects and get
back to the really expensive case..

		Linus
* Re: git-index-pack really does suck..
  2007-04-03 21:28 ` Linus Torvalds
@ 2007-04-03 22:31   ` Junio C Hamano
  2007-04-03 22:38     ` Shawn O. Pearce
  2007-04-05 10:22   ` [PATCH 1/2] git-fetch--tool pick-rref Junio C Hamano
  2007-04-05 10:22   ` [PATCH 2/2] git-fetch: use fetch--tool pick-rref to avoid local fetch from alternate Junio C Hamano
  2 siblings, 1 reply; 58+ messages in thread
From: Junio C Hamano @ 2007-04-03 22:31 UTC (permalink / raw)
To: Linus Torvalds
Cc: Shawn O. Pearce, Nicolas Pitre, Chris Lee, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:
> Side note: with "alternates" files, you might well *always* have the
> objects. If you do
> 
> 	git clone -l -s ...
> 
> to create various branches, and then pull between them, you'll actually
> end up in the situation that you'll always find the objects and get back
> to the really expensive case..

Ah, that's true. If you "git clone -l -s A B", create new objects in A
and pull from B, the transfer would not exclude new objects as they are
not visible from B's refs.

In that scenario, the keep-pack behaviour is already worse than the
unpack-objects behaviour. The former creates a packfile that duplicates
objects that are in A while the latter, although expensive, ends up
doing nothing.

I wonder if we can have a backdoor to avoid any object transfer in such
a case to begin with...
* Re: git-index-pack really does suck..
  2007-04-03 22:31 ` Junio C Hamano
@ 2007-04-03 22:38   ` Shawn O. Pearce
  2007-04-03 22:41     ` Junio C Hamano
  0 siblings, 1 reply; 58+ messages in thread
From: Shawn O. Pearce @ 2007-04-03 22:38 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Linus Torvalds, Nicolas Pitre, Chris Lee, Git Mailing List

Junio C Hamano <junkio@cox.net> wrote:
> Ah, that's true. If you "git clone -l -s A B", create new
> objects in A and pull from B, the transfer would not exclude
> new objects as they are not visible from B's refs.
> 
> In that scenario, the keep-pack behaviour is already worse than
> the unpack-objects behaviour. The former creates a packfile
> that duplicates objects that are in A while the latter, although
> expensive, ends up doing nothing.
> 
> I wonder if we can have a backdoor to avoid any object transfer
> in such a case to begin with...

Yea, symlink to the corresponding refs directory of the alternate ODB.
Then the refs will be visible. ;-)

-- 
Shawn.
* Re: git-index-pack really does suck..
  2007-04-03 22:38 ` Shawn O. Pearce
@ 2007-04-03 22:41   ` Junio C Hamano
  0 siblings, 0 replies; 58+ messages in thread
From: Junio C Hamano @ 2007-04-03 22:41 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Linus Torvalds, Nicolas Pitre, Chris Lee, Git Mailing List

"Shawn O. Pearce" <spearce@spearce.org> writes:
>> I wonder if we can have a backdoor to avoid any object transfer
>> in such a case to begin with...
> 
> Yea, symlink to the corresponding refs directory of the alternate
> ODB. Then the refs will be visible. ;-)

I was thinking about going the other way. When git-fetch is asked to
fetch over a local transport, it could check if the source is one of
its alternate object stores.
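[ Editor's note: the check Junio describes above - "is the fetch source
one of our alternate object stores?" - can be sketched in a few lines.
This is a minimal illustration, not git's actual C implementation: it
assumes only the on-disk layout of `objects/info/alternates`, where each
non-comment line names an object directory, either absolute or relative
to our own `objects/` directory. ]

```python
import os

def is_alternate_of(repo_git_dir, source_git_dir):
    """Return True if source_git_dir's object store is listed in
    repo_git_dir's objects/info/alternates file, meaning a local
    fetch from it could skip the object transfer entirely."""
    source_objects = os.path.realpath(os.path.join(source_git_dir, "objects"))
    alternates = os.path.join(repo_git_dir, "objects", "info", "alternates")
    try:
        with open(alternates) as f:
            entries = [ln.strip() for ln in f
                       if ln.strip() and not ln.startswith("#")]
    except OSError:
        return False          # no alternates file: nothing is borrowed
    base = os.path.join(repo_git_dir, "objects")
    for entry in entries:
        # Relative entries are resolved against our objects/ directory.
        path = entry if os.path.isabs(entry) else os.path.join(base, entry)
        if os.path.realpath(path) == source_objects:
            return True
    return False
```

A repository created with "git clone -l -s A B" would have A's object
directory as B's single alternates entry, so `is_alternate_of(B, A)`
would hold while `is_alternate_of(A, B)` would not.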
* [PATCH 1/2] git-fetch--tool pick-rref
  2007-04-03 21:28 ` Linus Torvalds
  2007-04-03 22:31   ` Junio C Hamano
@ 2007-04-05 10:22   ` Junio C Hamano
  2007-04-05 10:22   ` [PATCH 2/2] git-fetch: use fetch--tool pick-rref to avoid local fetch from alternate Junio C Hamano
  2 siblings, 0 replies; 58+ messages in thread
From: Junio C Hamano @ 2007-04-05 10:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Shawn O. Pearce, Nicolas Pitre, Chris Lee, Git Mailing List

This script helper takes a list of fully qualified refnames and the
results from ls-remote, and grabs only the lines for the named refs
from the latter.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

Linus Torvalds <torvalds@linux-foundation.org> writes:

> On Tue, 3 Apr 2007, Linus Torvalds wrote:
>> 
>> Yes, we could definitely skip the re-lookup if we had a "don't really
>> care, I can recreate the object myself" flag (ie anybody who is going to
>> write that object)
> 
> Side note: with "alternates" files, you might well *always* have the
> objects. If you do
> 
> 	git clone -l -s ...
> 
> to create various branches, and then pull between them, you'll actually
> end up in the situation that you'll always find the objects and get back
> to the really expensive case..

Ah, that's true. If you "git clone -l -s A B", create new objects in A
and pull from B, the transfer would not exclude new objects as they are
not visible from B's refs.

In that scenario, the keep-pack behaviour is already worse than the
unpack-objects behaviour. The former creates a packfile that duplicates
objects that are in A while the latter, although expensive, ends up
doing nothing.

I wonder if we can have a backdoor to avoid any object transfer in such
a case to begin with...

... and this two-patch series does exactly that.
 builtin-fetch--tool.c |   84 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 84 insertions(+), 0 deletions(-)

diff --git a/builtin-fetch--tool.c b/builtin-fetch--tool.c
index e9d6764..be341c1 100644
--- a/builtin-fetch--tool.c
+++ b/builtin-fetch--tool.c
@@ -436,10 +436,87 @@ static int expand_refs_wildcard(const char *ls_remote_result, int numrefs,
 	return 0;
 }
 
+static int pick_rref(int sha1_only, const char *rref, const char *ls_remote_result)
+{
+	int err = 0;
+	int lrr_count = lrr_count, i, pass;
+	const char *cp;
+	struct lrr {
+		const char *line;
+		const char *name;
+		int namelen;
+		int shown;
+	} *lrr_list = lrr_list;
+
+	for (pass = 0; pass < 2; pass++) {
+		/* pass 0 counts and allocates, pass 1 fills... */
+		cp = ls_remote_result;
+		i = 0;
+		while (1) {
+			const char *np;
+			while (*cp && isspace(*cp))
+				cp++;
+			if (!*cp)
+				break;
+			np = strchr(cp, '\n');
+			if (!np)
+				np = cp + strlen(cp);
+			if (pass) {
+				lrr_list[i].line = cp;
+				lrr_list[i].name = cp + 41;
+				lrr_list[i].namelen = np - (cp + 41);
+			}
+			i++;
+			cp = np;
+		}
+		if (!pass) {
+			lrr_count = i;
+			lrr_list = xcalloc(lrr_count, sizeof(*lrr_list));
+		}
+	}
+
+	while (1) {
+		const char *next;
+		int rreflen;
+		int i;
+
+		while (*rref && isspace(*rref))
+			rref++;
+		if (!*rref)
+			break;
+		next = strchr(rref, '\n');
+		if (!next)
+			next = rref + strlen(rref);
+		rreflen = next - rref;
+
+		for (i = 0; i < lrr_count; i++) {
+			struct lrr *lrr = &(lrr_list[i]);
+
+			if (rreflen == lrr->namelen &&
+			    !memcmp(lrr->name, rref, rreflen)) {
+				if (!lrr->shown)
+					printf("%.*s\n",
+					       sha1_only ?
+					       40 : lrr->namelen + 41,
+					       lrr->line);
+				lrr->shown = 1;
+				break;
+			}
+		}
+		if (lrr_count <= i) {
+			error("pick-rref: %.*s not found", rreflen, rref);
+			err = 1;
+		}
+		rref = next;
+	}
+	free(lrr_list);
+	return err;
+}
+
 int cmd_fetch__tool(int argc, const char **argv, const char *prefix)
 {
 	int verbose = 0;
 	int force = 0;
+	int sopt = 0;
 
 	while (1 < argc) {
 		const char *arg = argv[1];
@@ -447,6 +524,8 @@ int cmd_fetch__tool(int argc, const char **argv, const char *prefix)
 			verbose = 1;
 		else if (!strcmp("-f", arg))
 			force = 1;
+		else if (!strcmp("-s", arg))
+			sopt = 1;
 		else
 			break;
 		argc--;
@@ -491,6 +570,11 @@ int cmd_fetch__tool(int argc, const char **argv, const char *prefix)
 		reflist = get_stdin();
 		return parse_reflist(reflist);
 	}
+	if (!strcmp("pick-rref", argv[1])) {
+		if (argc != 4)
+			return error("pick-rref takes 2 args");
+		return pick_rref(sopt, argv[2], argv[3]);
+	}
 	if (!strcmp("expand-refs-wildcard", argv[1])) {
 		const char *reflist;
 		if (argc < 4)
-- 
1.5.1.45.g1ddb
* [PATCH 2/2] git-fetch: use fetch--tool pick-rref to avoid local fetch from alternate
  2007-04-03 21:28 ` Linus Torvalds
  2007-04-03 22:31   ` Junio C Hamano
  2007-04-05 10:22   ` [PATCH 1/2] git-fetch--tool pick-rref Junio C Hamano
@ 2007-04-05 10:22   ` Junio C Hamano
  2007-04-05 16:15     ` Shawn O. Pearce
  2 siblings, 1 reply; 58+ messages in thread
From: Junio C Hamano @ 2007-04-05 10:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Shawn O. Pearce, Nicolas Pitre, Chris Lee, Git Mailing List

When we are fetching from a repository that is on a local filesystem,
first check if we have all the objects that we are going to fetch
available locally, by not just checking the tips of what we are
fetching, but with a full reachability analysis to our existing refs.
In such a case, we do not have to run git-fetch-pack which would send
many needless objects.

This is especially true when the other repository is an alternate of
the current repository (e.g. perhaps the repository was created by
running "git clone -l -s" from there). The useless objects transferred
used to be discarded when they were expanded by git-unpack-objects
called from git-fetch-pack, but recent git-fetch-pack prefers to keep
the data it receives from the other end without exploding them into
loose objects, resulting in a pack full of duplicated data when
fetching from your own alternate.

This also uses fetch--tool pick-rref on the dumb transport side to
remove a shell loop that did the same.

Signed-off-by: Junio C Hamano <junkio@cox.net>
---

* Strictly speaking, there is no need to even check if $remote is a
  local directory for this to operate properly, as rev-list would barf
  and die as soon as it finds something unavailable, while limiting the
  traversal to stop immediately after it hits what are known to be
  reachable locally.
  On the other hand, if we really want to limit this to the case of a
  repository with an alternate pointing at its "clone -l -s" origin, we
  could add 'test -f "$GIT_OBJECT_DIRECTORY/info/alternates"', but I
  chose not to.

 git-fetch.sh |   41 ++++++++++++++++++++++++++++-------------
 1 files changed, 28 insertions(+), 13 deletions(-)

diff --git a/git-fetch.sh b/git-fetch.sh
index fd70696..5dc3063 100755
--- a/git-fetch.sh
+++ b/git-fetch.sh
@@ -173,9 +173,32 @@ fetch_all_at_once () {
 	  git-bundle unbundle "$remote" $rref ||
 	  echo failed "$remote"
 	else
-	  git-fetch-pack --thin $exec $keep $shallow_depth $no_progress \
-	    "$remote" $rref ||
-	  echo failed "$remote"
+	  if test -d "$remote" &&
+
+		# The remote might be our alternate.  With
+		# this optimization we will bypass fetch-pack
+		# altogether, which means we cannot be doing
+		# the shallow stuff at all.
+		test ! -f "$GIT_DIR/shallow" &&
+		test -z "$shallow_depth" &&
+
+		# See if all of what we are going to fetch are
+		# connected to our repository's tips, in which
+		# case we do not have to do any fetch.
+		theirs=$(git-fetch--tool -s pick-rref \
+			"$rref" "$ls_remote_result") &&
+
+		# This will barf when $theirs reach an object that
+		# we do not have in our repository.  Otherwise,
+		# we already have everything the fetch would bring in.
+		git-rev-list --objects $theirs --not --all 2>/dev/null
+	  then
+		git-fetch--tool pick-rref "$rref" "$ls_remote_result"
+	  else
+		git-fetch-pack --thin $exec $keep $shallow_depth \
+			$no_progress "$remote" $rref ||
+		echo failed "$remote"
+	  fi
 	fi
 	) |
 	(
@@ -235,16 +258,8 @@ fetch_per_ref () {
 	fi
 
 	# Find $remote_name from ls-remote output.
-	head=$(
-		IFS='	'
-		echo "$ls_remote_result" |
-		while read sha1 name
-		do
-			test "z$name" = "z$remote_name" || continue
-			echo "$sha1"
-			break
-		done
-	)
+	head=$(git-fetch--tool -s pick-rref \
+		"$remote_name" "$ls_remote_result")
 	expr "z$head" : "z$_x40\$" >/dev/null ||
 		die "No such ref $remote_name at $remote"
 	echo >&2 "Fetching $remote_name from $remote using $proto"
-- 
1.5.1.45.g1ddb
* Re: [PATCH 2/2] git-fetch: use fetch--tool pick-rref to avoid local fetch from alternate
  2007-04-05 10:22 ` [PATCH 2/2] git-fetch: use fetch--tool pick-rref to avoid local fetch from alternate Junio C Hamano
@ 2007-04-05 16:15   ` Shawn O. Pearce
  2007-04-05 21:37     ` Junio C Hamano
  0 siblings, 1 reply; 58+ messages in thread
From: Shawn O. Pearce @ 2007-04-05 16:15 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Linus Torvalds, Nicolas Pitre, Chris Lee, Git Mailing List

Junio C Hamano <junkio@cox.net> wrote:
> +		# This will barf when $theirs reach an object that
> +		# we do not have in our repository.  Otherwise,
> +		# we already have everything the fetch would bring in.
> +		git-rev-list --objects $theirs --not --all 2>/dev/null

OK, I must be missing something here.

That rev-list is going to print out the SHA-1s for the objects we would
have copied, but didn't, isn't it? So fetch--tool native-store is going
to get a whole lot of SHA-1s it doesn't want to see, right?

Otherwise this is a nice trick. It doesn't assure us that after the
fetch those objects are still in the alternate. Meaning someone could
run prune in the alternate between the rev-list and the native-store,
and whack these objects. Given how small of a window it is, and the
improvements this brings to alternates, I say it's worth that small
downside. Just don't prune while fetching. ;-)

-- 
Shawn.
* Re: [PATCH 2/2] git-fetch: use fetch--tool pick-rref to avoid local fetch from alternate
  2007-04-05 16:15 ` Shawn O. Pearce
@ 2007-04-05 21:37   ` Junio C Hamano
  0 siblings, 0 replies; 58+ messages in thread
From: Junio C Hamano @ 2007-04-05 21:37 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Linus Torvalds, Nicolas Pitre, Chris Lee, Git Mailing List

"Shawn O. Pearce" <spearce@spearce.org> writes:
> Junio C Hamano <junkio@cox.net> wrote:
>> +		# This will barf when $theirs reach an object that
>> +		# we do not have in our repository.  Otherwise,
>> +		# we already have everything the fetch would bring in.
>> +		git-rev-list --objects $theirs --not --all 2>/dev/null
> 
> OK, I must be missing something here.
> 
> That rev-list is going to print out the SHA-1s for the objects we
> would have copied, but didn't, isn't it? So fetch--tool native-store
> is going to get a whole lot of SHA-1s it doesn't want to see, right?

True. We should send the standard output also to /dev/null.

> Otherwise this is a nice trick. It doesn't assure us that after the
> fetch those objects are still in the alternate. Meaning someone
> could run prune in the alternate between the rev-list and the
> native-store, and whack these objects. Given how small of a window
> it is, and the improvements this brings to alternates, I say it's
> worth that small downside. Just don't prune while fetching. ;-)

That is "don't prune after making an alternate that depends on you" in
general. Without this patch, and without the keep (i.e. when the fetch
is very small and the transferred pack is given to unpack-objects), you
would have depended on the alternates for those objects anyway.
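[ Editor's note: the connectivity test at the heart of the patch above,
`git rev-list --objects $theirs --not --all`, can be modeled as a plain
graph walk: starting from the remote tips, stop at anything already
reachable from our refs, and declare the fetch necessary only if the
walk reaches an object we do not have. The sketch below is a toy model
on an in-memory object graph, with hypothetical names (`have`,
`parents`), not git's actual implementation. ]

```python
def fetch_needed(have, parents, their_tips, our_refs):
    """Model of 'git rev-list --objects $theirs --not --all':
    the fetch is needed iff walking from their tips (excluding
    everything reachable from our refs) hits a missing object.
    'have' is the set of locally present object ids; 'parents'
    maps an id to the ids it references."""
    stop, stack = set(), list(our_refs)
    while stack:                      # everything reachable from our refs
        o = stack.pop()
        if o in stop or o not in have:
            continue
        stop.add(o)
        stack.extend(parents.get(o, []))
    seen, stack = set(), list(their_tips)
    while stack:                      # walk their side, minus 'stop'
        o = stack.pop()
        if o in seen or o in stop:
            continue
        if o not in have:
            return True               # rev-list would barf here
        seen.add(o)
        stack.extend(parents.get(o, []))
    return False
```

When the "remote" is our own alternate and has created nothing new, the
walk terminates immediately at our own refs and the transfer can be
skipped, which is exactly the case the patch optimizes.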
* Re: git-index-pack really does suck..
  2007-04-03 21:17 ` Shawn O. Pearce
  2007-04-03 21:26   ` Linus Torvalds
@ 2007-04-03 21:34   ` Nicolas Pitre
  2007-04-03 21:37     ` Shawn O. Pearce
  2007-04-03 22:40     ` Dana How
  1 sibling, 2 replies; 58+ messages in thread
From: Nicolas Pitre @ 2007-04-03 21:34 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Linus Torvalds, Chris Lee, Junio C Hamano, Git Mailing List

On Tue, 3 Apr 2007, Shawn O. Pearce wrote:
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > On Tue, 3 Apr 2007, Shawn O. Pearce wrote:
> > > If it's the missing-object lookup that is expensive, maybe we should
> > > try to optimize that. We do it enough already in other parts of
> > > the code...
> > 
> > Well, for all other cases it's really the "object found" case that is
> > worth optimizing for, so I think optimizing for "no object" is actually
> > wrong, unless it also speeds up (or at least doesn't make it worse) the
> > "real" normal case.
> 
> Right. But maybe we shouldn't be scanning for packfiles every
> time we don't find a loose object. Especially if the caller is in
> a context where we actually *expect* to not find said object like
> half of the time... say in git-add/update-index. ;-)

First, I truly believe we should have a 64-bit pack index and fewer
larger packs rather than many small packs.

Which leaves us with the actual pack index lookup. At that point the
cost of finding an existing object and finding that a given object
doesn't exist is about the same thing, isn't it?

Optimizing that lookup is going to benefit both cases.

Nicolas
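[ Editor's note: Nicolas's point that a hit and a miss cost about the
same follows from how a pack index lookup works: a 256-entry fanout
table narrows the search to the ids sharing the first byte, then a
binary search runs within that bucket either way. A miniature sketch of
that structure, with object ids as 20-byte strings and helper names
(`build_index`, `lookup`) that are illustrative rather than git's own: ]

```python
import bisect

def build_index(object_ids):
    """Build a pack-index-style structure in miniature: the 20-byte
    ids sorted, plus a 256-entry fanout table where fanout[b] counts
    the ids whose first byte is <= b."""
    ids = sorted(object_ids)
    fanout = [0] * 256
    for oid in ids:
        fanout[oid[0]] += 1
    for b in range(1, 256):
        fanout[b] += fanout[b - 1]
    return ids, fanout

def lookup(ids, fanout, oid):
    """Binary-search only the bucket for oid's first byte. A hit and
    a miss do the same O(log bucket) work, which is why speeding up
    this lookup benefits both cases."""
    lo = fanout[oid[0] - 1] if oid[0] else 0
    hi = fanout[oid[0]]
    i = bisect.bisect_left(ids, oid, lo, hi)
    return i if i < hi and ids[i] == oid else None
```

Whether the object exists or not, the cost is one table read plus one
bounded binary search; only the loose-object fallback and pack rescan
(discussed below in the thread) make the miss path more expensive.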
* Re: git-index-pack really does suck..
  2007-04-03 21:34 ` Nicolas Pitre
@ 2007-04-03 21:37   ` Shawn O. Pearce
  2007-04-03 21:44     ` Junio C Hamano
  1 sibling, 1 reply; 58+ messages in thread
From: Shawn O. Pearce @ 2007-04-03 21:37 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Linus Torvalds, Chris Lee, Junio C Hamano, Git Mailing List

Nicolas Pitre <nico@cam.org> wrote:
> First, I truly believe we should have a 64-bit pack index and fewer
> larger packs than many small packs.

I'll buy that. ;-)

> Which leaves us with the actual pack index lookup. At that point the
> cost of finding an existing object and finding that a given object
> doesn't exist is about the same thing, isn't it?

Here's the rub: in the missing object case we didn't find it in the
pack index, but it could be loose. That's one failed syscall per object
if the object isn't loose. If the object isn't loose, it could be that
it was *just* removed by a running prune-packed, and the packfile that
it was moved to was created after we scanned for packfiles, so time to
rescan...

-- 
Shawn.
* Re: git-index-pack really does suck..
  2007-04-03 21:37 ` Shawn O. Pearce
@ 2007-04-03 21:44   ` Junio C Hamano
  2007-04-03 21:53     ` Shawn O. Pearce
  0 siblings, 1 reply; 58+ messages in thread
From: Junio C Hamano @ 2007-04-03 21:44 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Nicolas Pitre, Linus Torvalds, Chris Lee, Git Mailing List

"Shawn O. Pearce" <spearce@spearce.org> writes:
> Nicolas Pitre <nico@cam.org> wrote:
>> First, I truly believe we should have a 64-bit pack index and fewer
>> larger packs than many small packs.
> 
> I'll buy that. ;-)
> 
>> Which leaves us with the actual pack index lookup. At that point the
>> cost of finding an existing object and finding that a given object
>> doesn't exist is about the same thing, isn't it?
> 
> Here's the rub: in the missing object case we didn't find it
> in the pack index, but it could be loose. That's one failed
> syscall per object if the object isn't loose. If the object
> isn't loose, it could be that it was *just* removed by a
> running prune-packed, and the packfile that it was moved
> to was created after we scanned for packfiles, so time to
> rescan...

If that is the only reason we have these reprepare_packed_git()
sprinkled all over in sha1_file.c (by 637cdd9d), perhaps we should
rethink that. Is there a cheap way to trigger that rescanning only when
a prune-packed is in progress, I wonder...
* Re: git-index-pack really does suck..
  2007-04-03 21:44 ` Junio C Hamano
@ 2007-04-03 21:53   ` Shawn O. Pearce
  2007-04-03 22:10     ` Jeff King
  0 siblings, 1 reply; 58+ messages in thread
From: Shawn O. Pearce @ 2007-04-03 21:53 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Nicolas Pitre, Linus Torvalds, Chris Lee, Git Mailing List

Junio C Hamano <junkio@cox.net> wrote:
> "Shawn O. Pearce" <spearce@spearce.org> writes:
> > Here's the rub: in the missing object case we didn't find it
> > in the pack index, but it could be loose. That's one failed
> > syscall per object if the object isn't loose. If the object
> > isn't loose, it could be that it was *just* removed by a
> > running prune-packed, and the packfile that it was moved
> > to was created after we scanned for packfiles, so time to
> > rescan...
> 
> If that is the only reason we have these reprepare_packed_git()
> sprinkled all over in sha1_file.c (by 637cdd9d), perhaps we
> should rethink that. Is there a cheap way to trigger these
> rescanning only when a prune-packed is in progress, I wonder...

Yea, it is the only reason. So... if we could have some magic to
trigger that, it would be good. I just don't know what magic that
would be.

-- 
Shawn.
* Re: git-index-pack really does suck..
  2007-04-03 21:53 ` Shawn O. Pearce
@ 2007-04-03 22:10   ` Jeff King
  0 siblings, 0 replies; 58+ messages in thread
From: Jeff King @ 2007-04-03 22:10 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Junio C Hamano, Nicolas Pitre, Linus Torvalds, Chris Lee, Git Mailing List

On Tue, Apr 03, 2007 at 05:53:42PM -0400, Shawn O. Pearce wrote:
> > If that is the only reason we have these reprepare_packed_git()
> > sprinkled all over in sha1_file.c (by 637cdd9d), perhaps we
> > should rethink that. Is there a cheap way to trigger these
> > rescanning only when a prune-packed is in progress, I wonder...
> 
> Yea, it is the only reason. So... if we could have some
> magic to trigger that, it would be good. I just don't
> know what magic that would be.

It doesn't have to be in progress; it might have started and completed
between pack-scanning and object lookup. Something like:

  1. start git-repack -a -d
  2. start git-rev-list, scan for packs
  3. repack moves finished pack into place, starts git-prune-packed
  4. git-prune-packed completes
  5. git-rev-list looks up object

So you would need to have some sort of incremented counter that says
"there's a new pack available." Perhaps instead of rescanning
unconditionally, it should simply stat() the pack directory and look
for a change? That should be much cheaper.

-Jeff
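[ Editor's note: Jeff's "stat() the pack directory" idea can be
sketched as a small cache that rescans only when the directory's stat
data changes. This is an illustration of the approach, not git's
implementation; the class name and fields are invented, and note the
real-world caveat that coarse filesystem mtime granularity can hide a
change made within the same timestamp tick. ]

```python
import os

class PackDirCache:
    """Rescan the pack directory only when stat() says it changed
    (a packfile was added or pruned), instead of rescanning
    unconditionally on every missed object lookup."""
    def __init__(self, pack_dir):
        self.pack_dir = pack_dir
        self.stamp = None
        self.packs = []
        self.refresh()

    def _stat(self):
        st = os.stat(self.pack_dir)
        # mtime + inode + size: enough to notice a replaced or
        # modified directory under the granularity caveat above.
        return (st.st_mtime_ns, st.st_ino, st.st_size)

    def refresh(self):
        stamp = self._stat()
        if stamp != self.stamp:       # directory changed: rescan
            self.stamp = stamp
            self.packs = sorted(
                f for f in os.listdir(self.pack_dir)
                if f.endswith(".pack"))
            return True
        return False                  # unchanged: skip the rescan
```

On a miss, the caller would invoke `refresh()` and retry the lookup
only if it returned True, turning the unconditional rescan into one
stat() call in the common case.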
* Re: git-index-pack really does suck..
  2007-04-03 21:34 ` Nicolas Pitre
  2007-04-03 21:37   ` Shawn O. Pearce
@ 2007-04-03 22:40   ` Dana How
  2007-04-03 22:52     ` Linus Torvalds
  2007-04-03 23:00     ` Nicolas Pitre
  1 sibling, 2 replies; 58+ messages in thread
From: Dana How @ 2007-04-03 22:40 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Shawn O. Pearce, Linus Torvalds, Chris Lee, Junio C Hamano, Git Mailing List

On 4/3/07, Nicolas Pitre <nico@cam.org> wrote:
> On Tue, 3 Apr 2007, Shawn O. Pearce wrote:
> > Right. But maybe we shouldn't be scanning for packfiles every
> > time we don't find a loose object. Especially if the caller is in
> > a context where we actually *expect* to not find said object like
> > half of the time... say in git-add/update-index. ;-)
> 
> First, I truly believe we should have a 64-bit pack index and fewer
> larger packs than many small packs.
> 
> Which leaves us with the actual pack index lookup. At that point the
> cost of finding an existing object and finding that a given object
> doesn't exist is about the same thing, isn't it?
> 
> Optimizing that lookup is going to benefit both cases.

Do you get what you want if you move to fewer larger INDEX files but
not pack files -- in the extreme, one large index file?

A "super index" could be built from multiple .idx files. This would be
a new file (format) unencumbered by the past, so it could be tried out
more quickly. Just like objects are pruned when packed, .idx files
could be pruned when the super index is built.

Perhaps a number of (<2GB) packfiles and a large index file could work
out. Larger and larger pack files make me nervous. They are expensive
to manipulate, and >2GB requires a file format change.

Thanks,
-- 
Dana L. How  danahow@gmail.com  +1 650 804 5991 cell
* Re: git-index-pack really does suck.. 2007-04-03 22:40 ` Dana How @ 2007-04-03 22:52 ` Linus Torvalds 2007-04-03 22:31 ` David Lang 2007-04-03 23:00 ` Nicolas Pitre 1 sibling, 1 reply; 58+ messages in thread From: Linus Torvalds @ 2007-04-03 22:52 UTC (permalink / raw) To: Dana How Cc: Nicolas Pitre, Shawn O. Pearce, Chris Lee, Junio C Hamano, Git Mailing List On Tue, 3 Apr 2007, Dana How wrote: > > Larger and larger pack files make me nervous. > They are expensive to manipulate, > and >2GB requires a file format change. It sometimes also requires a new filesystem. There are a lot of filesystems that can handle more than 4GB total, but not necessarily in a single file. The only really useful such filesystem is probably FAT, which is still quite useful for things like USB memory sticks. But that is probably already worth supporting. So I think we want to support 64-bit (or at least something like 40+ bit) pack-files, but yes, I think that even if/when we support it, we still want to support the "multiple smaller pack-files" scenario exactly because for some uses it's much *better* to have ten 2GB packfiles rather than one 20GB pack-file. Linus ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-03 22:52 ` Linus Torvalds @ 2007-04-03 22:31 ` David Lang 0 siblings, 0 replies; 58+ messages in thread From: David Lang @ 2007-04-03 22:31 UTC (permalink / raw) To: Linus Torvalds Cc: Dana How, Nicolas Pitre, Shawn O. Pearce, Chris Lee, Junio C Hamano, Git Mailing List On Tue, 3 Apr 2007, Linus Torvalds wrote: > On Tue, 3 Apr 2007, Dana How wrote: >> >> Larger and larger pack files make me nervous. >> They are expensive to manipulate, >> and >2GB requires a file format change. > > It sometimes also requires a new filesystem. There are a lot of > filesystems that can handle more than 4GB total, but not necessarily in a > single file. > > The only really useful such filesystem is probably FAT, which is still > quite useful for things like USB memory sticks. But that is probably > already worth supporting. > > So I think we want to support 64-bit (or at least something like 40+ bit) > pack-files, but yes, I think that even if/when we support it, we still > want to support the "multiple smaller pack-files" schenario exactly > because for some uses it's much *better* to have ten 2GB packfiles rather > than one 20GB pack-file. however, for historical archives you may end up with wanting to do a 2GB packfile of 'recent' stuff, and a 14GB packfile of 'ancient' stuff (with the large one built with all space-saving options turned up all the way, no matter how much time it takes to build the pack) David Lang ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-03 22:40 ` Dana How 2007-04-03 22:52 ` Linus Torvalds @ 2007-04-03 23:00 ` Nicolas Pitre 1 sibling, 0 replies; 58+ messages in thread From: Nicolas Pitre @ 2007-04-03 23:00 UTC (permalink / raw) To: Dana How Cc: Shawn O. Pearce, Linus Torvalds, Chris Lee, Junio C Hamano, Git Mailing List On Tue, 3 Apr 2007, Dana How wrote: > Do you get what you want if you move to fewer larger INDEX files > but not pack files -- in the extreme, one large index file? No that doesn't solve the clone/fetch of large amount of data cleanly. > A "super index" could be built from multiple .idx files. > This would be a new file (format) unencumbered by the past, > so it could be tried out more quickly. > Just like objects are pruned when packed, > .idx files could be pruned when the super index is built. This is an idea to consider independently of any other issues. > Perhaps a number of (<2GB) packfiles and a large index > file could work out. > > Larger and larger pack files make me nervous. > They are expensive to manipulate, > and >2GB requires a file format change. No. Larger pack files are not more expensive than the same set of objects spread into multiple packs. In fact I'd say it's quite the opposite. And larger pack files do not require a pack format change -- it's just the index that has to change and the index is a local matter. Nicolas ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-03 20:56 ` Linus Torvalds 2007-04-03 21:03 ` Shawn O. Pearce @ 2007-04-03 21:21 ` Nicolas Pitre 1 sibling, 0 replies; 58+ messages in thread From: Nicolas Pitre @ 2007-04-03 21:21 UTC (permalink / raw) To: Linus Torvalds; +Cc: Chris Lee, Junio C Hamano, Git Mailing List On Tue, 3 Apr 2007, Linus Torvalds wrote: > > > On Tue, 3 Apr 2007, Nicolas Pitre wrote: > > > > > > Yeah. What happens is that inside the repo, because we do all the > > > duplicate object checks (verifying that there are no evil hash collisions) > > > even after fixing the memory leak, we end up keeping *track* of all those > > > objects. > > > > What do you mean? > > Look at what we have to do to look up a SHA1 object.. We create all the > lookup infrastructure, we don't *just* read the object. The delta base > cache is the most obvious one. It is capped to 16MB, so we're far from the 200+ MB count. > > I'm of the opinion that this patch is unnecessary. It only helps in > > bogus workflows to start with, and it makes the default behavior unsafe > > (unsafe from a paranoid pov, but still). And in the _normal_ workflow > > it should never trigger. > > Actually, even in the normal workflow it will do all the extra unnecessary > work, if only because the lookup costs of *not* finding the entry. > > Lookie here: > > - git index-pack of the *git* pack-file in the v2.6/linux directory (zero > overlap of objects) > > With --paranoid: > > 2.75user 0.37system 0:03.13elapsed 99%CPU > 0major+5583minor pagefaults > > Without --paranoid: > > 2.55user 0.12system 0:02.68elapsed 99%CPU > 0major+2957minor pagefaults > > See? That's the *normal* workflow. Zero objects found. 7% CPU overhead > from just the unnecessary work, and almost twice as much memory used. Just > from the index file lookup etc for a decent-sized project. 
7% overhead over two and a half seconds of CPU which, _normally_, happens when cloning the whole thing over a network connection which, if you're lucky and have a 6mbps cable connection, will still be spread over 5 minutes of real time. And that is assuming that you're cloning a big project inside itself which wouldn't work anyway. Otherwise a big clone would run index-pack in an empty repo where the lookup of existing objects is zero. That leaves git-fetch, which should concern itself with much smaller packs, pushing this overhead into the noise. > Now, in the KDE situation, the *unnecessary* lookups will be about ten > times more expensive, both on memory and CPU, just because the repository > is about 20x the size. Even with no actual hits. So? When would you really perform such an operation in a meaningful way? The memory usage worries me. I still cannot explain nor justify it. But the CPU overhead is certainly not of any concern in _normal_ usage scenarios, is it? If anything that might be a good test case for the Newton-Raphson pack lookup idea. Nicolas ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-03 20:18 ` Linus Torvalds 2007-04-03 20:32 ` Nicolas Pitre @ 2007-04-03 20:33 ` Linus Torvalds 2007-04-03 21:05 ` Nicolas Pitre 2007-04-03 20:34 ` Junio C Hamano 2 siblings, 1 reply; 58+ messages in thread From: Linus Torvalds @ 2007-04-03 20:33 UTC (permalink / raw) To: Chris Lee, Nicolas Pitre, Junio C Hamano; +Cc: Git Mailing List On Tue, 3 Apr 2007, Linus Torvalds wrote: > > So how about this updated patch? We could certainly make "git pull" imply > "--paranoid" if we want to, but even that is likely pretty unnecessary. > It's not like anybody has ever shown a SHA1 collision, and if the *local* > repository is corrupt (and has an object with the wrong SHA1 - that's what > the testsuite checks for), then it's probably good to get the valid object > from the remote.. Some trivial timings for indexing just the kernel pack.. Without --paranoid: 24.61user 2.16system 0:27.04elapsed 99%CPU 0major+14120minor pagefaults With --paranoid: 42.74user 3.04system 0:46.36elapsed 98%CPU 0major+72768minor pagefaults so it's a noticeable CPU issue, but it's even more noticeable in memory usage (55MB vs 284MB - pagefaults give a good way to look at how much memory really got allocated for the process). All that extra memory is just for SHA1 commit ID information. Now, clearly the usage scenario here is a bit odd (ie the case where we have all the objects already), so in that sense this is very much a worst-case situation, and you simply shouldn't *do* something like this, but at the same time, I'm just not convinced a very theoretical SHA1 collision check is worth it. Btw, even if we don't have any of the objects, if you have tons and tons of objects and do a "git pull", just the *lookup* of the nonexistent objects will be expensive: first we won't find it in any pack, then we'll look at the loose objects, and then we'll look in the pack *again* due to the race avoidance. 
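The lookup sequence Linus describes (packs, then loose objects, then the packs again for race avoidance) can be modeled as below; a toy sketch with packs and loose objects as plain Python sets, just to show why a *miss* does the most work:

```python
def locate_object(sha, packs, loose, recheck_packs=True):
    """Model of the lookup order described above: packs first, then the
    loose-object directory, then the packs again (the second pack scan
    guards against a concurrent repack moving the object into a new
    pack between the two checks). Returns (where_found, probe_count);
    illustrative only, nothing here is git's actual code."""
    probes = {"n": 0}

    def in_packs():
        for pack in packs:
            probes["n"] += 1          # one index lookup per pack
            if sha in pack:
                return True
        return False

    if in_packs():
        return "pack", probes["n"]
    probes["n"] += 1                  # one stat() of the loose object path
    if sha in loose:
        return "loose", probes["n"]
    if recheck_packs and in_packs():  # race avoidance: rescan the packs
        return "pack", probes["n"]
    return None, probes["n"]
```

A hit in the first pack costs one probe; a nonexistent object pays for every pack twice plus the loose lookup, which is exactly the case index-pack's duplicate checking hits for each incoming object.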
So looking up nonexistent objects is actually pretty expensive. In fact, "--paranoid" takes one second more for me even totally outside of a git repository, just because we waste so much time trying to look up non-existent object files ;) Linus ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-03 20:33 ` Linus Torvalds @ 2007-04-03 21:05 ` Nicolas Pitre 2007-04-03 21:11 ` Shawn O. Pearce 2007-04-03 21:24 ` Linus Torvalds 0 siblings, 2 replies; 58+ messages in thread From: Nicolas Pitre @ 2007-04-03 21:05 UTC (permalink / raw) To: Linus Torvalds; +Cc: Chris Lee, Junio C Hamano, Git Mailing List On Tue, 3 Apr 2007, Linus Torvalds wrote: > All that extra memory is just for SHA1 commit ID information. I don't see where that might be. The only thing that the paranoia check triggers is: foo = read_sha1_file(blah); memcmp(foo with bar); free(foo); So where is that commit ID information actually stored when using read_sha1_file()? > Btw, even if we don't have any of the objects, if you have tons and tons > of objects and do a "git pull", just the *lookup* of the nonexistent > objects will be expensive: first we won't find it in any pack, then we'll > look at the loose objects, and then we'll look int he pack *again* due to > the race avoidance. So looking up nonexistent objects is actually pretty > expensive. Not if you consider that it is performed _while_ receiving (and waiting for) the pack data over the net in the normal case. Nicolas ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-03 21:05 ` Nicolas Pitre @ 2007-04-03 21:11 ` Shawn O. Pearce 2007-04-03 21:24 ` Linus Torvalds 1 sibling, 0 replies; 58+ messages in thread From: Shawn O. Pearce @ 2007-04-03 21:11 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Linus Torvalds, Chris Lee, Junio C Hamano, Git Mailing List Nicolas Pitre <nico@cam.org> wrote: > On Tue, 3 Apr 2007, Linus Torvalds wrote: > > > All that extra memory is just for SHA1 commit ID information. > > I don't see where that might be. The only thing that the paranoia check > triggers is: > > foo = read_sha1_file(blah); > memcmp(foo with bar); > free(foo); > > So where is that commit ID information actually stored when using > read_sha1_file()? That might just be the mmap code in sha1_file.c. We are mmaping the existing packfile, to fetch the object. That counts as virtual memory. ;-) -- Shawn. ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-03 21:05 ` Nicolas Pitre 2007-04-03 21:11 ` Shawn O. Pearce @ 2007-04-03 21:24 ` Linus Torvalds [not found] ` <alpine.LFD.0.98.0704031735470.28181@xanadu.home> 2007-04-03 21:42 ` Nicolas Pitre 1 sibling, 2 replies; 58+ messages in thread From: Linus Torvalds @ 2007-04-03 21:24 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Chris Lee, Junio C Hamano, Git Mailing List On Tue, 3 Apr 2007, Nicolas Pitre wrote: > > I don't see where that might be. The only thing that the paranoia check > triggers is: > > foo = read_sha1_file(blah); > memcmp(foo with bar); > free(foo); > > So where is that commit ID information actually stored when using > read_sha1_file()? I've got the numbers: it uses much more memory when doing even failing lookups, ie: With --paranoid: 5583 minor pagefaults (21MB) Without --paranoid: 2957 minor pagefaults (11MB) (remember, this was *just* the git pack, not the kernel pack) It could be as simple as just the index file itself. That's 11MB for the kernel. Imagine if the index file was 20 times bigger - 200MB of memory paged in with bad access patterns just for unnecessary lookups. Running valgrind shows no leak at all without --paranoid. With --paranoid, there's some really trivial stuff (the "packed_git" structure etc, so I think it's really just the index itself). > Not if you consider that it is performed _while_ receiving (and waiting > for) the pack data over the net in the normal case. ..which is why I think it makes sense for "pull" to be paranoid. I just don't think it makes sense to be paranoid *all* the time, since it's clearly expensive. Linus ^ permalink raw reply [flat|nested] 58+ messages in thread
[parent not found: <alpine.LFD.0.98.0704031735470.28181@xanadu.home>]
* Re: git-index-pack really does suck.. 2007-04-03 21:24 ` Linus Torvalds [not found] ` <alpine.LFD.0.98.0704031735470.28181@xanadu.home> @ 2007-04-03 21:42 ` Nicolas Pitre 2007-04-03 22:07 ` Junio C Hamano 2007-04-03 22:14 ` Linus Torvalds 1 sibling, 2 replies; 58+ messages in thread From: Nicolas Pitre @ 2007-04-03 21:42 UTC (permalink / raw) To: Linus Torvalds; +Cc: Chris Lee, Junio C Hamano, Git Mailing List On Tue, 3 Apr 2007, Linus Torvalds wrote: > > > On Tue, 3 Apr 2007, Nicolas Pitre wrote: > > > > I don't see where that might be. The only thing that the paranoia check > > triggers is: > > > > foo = read_sha1_file(blah); > > memcmp(foo with bar); > > free(foo); > > > > So where is that commit ID information actually stored when using > > read_sha1_file()? > > I've got the numbers: it uses much more memory when doing even failing > lookups, ie: > > With --paranoid: 5583 minor pagefaults (21MB) > Without --paranoid: 2957 minor pagefaults (11MB) > > (remember, this was *just* the git pack, not the kernel pack) > > It could be as simple as just the index file itself. That's 11MB for the > kernel. Imagine if the index file was 20 times bigger - 200MB of memory > paged in with bad access patterns just for unnecessary lookups. Again that presumes you have to page in the whole index, which should not happen when pulling (much smaller packs) and with a better lookup algorithm. > > Not if you consider that it is performed _while_ receiving (and waiting > > for) the pack data over the net in the normal case. > > ..which is why I think it makes sense for "pull" to be paranoid. I just > don't think it makes sense to be paranoid *all* the time, since it's > clearly expensive. Make it conditional on --stdin then. This covers all cases where we really want the secure thing to happen, and the --stdin case already performs the atomic rename-and-move thing when the pack is fully indexed. Nicolas ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-03 21:42 ` Nicolas Pitre @ 2007-04-03 22:07 ` Junio C Hamano 2007-04-03 22:11 ` Shawn O. Pearce 2007-04-03 22:34 ` Nicolas Pitre 2007-04-03 22:14 ` Linus Torvalds 1 sibling, 2 replies; 58+ messages in thread From: Junio C Hamano @ 2007-04-03 22:07 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Linus Torvalds, Chris Lee, Git Mailing List Nicolas Pitre <nico@cam.org> writes: > Make it conditionnal on --stdin then. This covers all cases where we > really want the secure thing to happen, and the --stdin case already > perform the atomic rename-and-move thing when the pack is fully indexed. Repacking objects in a repository uses pack-objects without using index-pack, as you suggested Chris. Is there a sane usage of index-pack that does not use --stdin? I do not think of any. If there isn't, the "conditional on --stdin" suggestion means we unconditionally do the secure thing for all the sane usage, and go unsecure for an insane usage that we do not really care about. If so, it seems to me that it would be the simplest not to touch the code at all, except that missing free(). Am I missing something? ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-03 22:07 ` Junio C Hamano @ 2007-04-03 22:11 ` Shawn O. Pearce 2007-04-03 22:34 ` Nicolas Pitre 1 sibling, 0 replies; 58+ messages in thread From: Shawn O. Pearce @ 2007-04-03 22:11 UTC (permalink / raw) To: Junio C Hamano; +Cc: Nicolas Pitre, Linus Torvalds, Chris Lee, Git Mailing List Junio C Hamano <junkio@cox.net> wrote: > Nicolas Pitre <nico@cam.org> writes: > > > Make it conditionnal on --stdin then. This covers all cases where we > > really want the secure thing to happen, and the --stdin case already > > perform the atomic rename-and-move thing when the pack is fully indexed. > > Repacking objects in a repository uses pack-objects without > using index-pack, as you suggested Chris. Is there a sane usage > of index-pack that does not use --stdin? I do not think of any. > > If there isn't, the "conditional on --stdin" suggestion means we > unconditionally do the secure thing for all the sane usage, and > go unsecure for an insane usage that we do not really care about. > > If so, it seems to me that it would be the simplest not to touch > the code at all, except that missing free(). > > Am I missing something? Nope. I agree with you completely. -- Shawn. ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-03 22:07 ` Junio C Hamano 2007-04-03 22:11 ` Shawn O. Pearce @ 2007-04-03 22:34 ` Nicolas Pitre 1 sibling, 0 replies; 58+ messages in thread From: Nicolas Pitre @ 2007-04-03 22:34 UTC (permalink / raw) To: Junio C Hamano; +Cc: Linus Torvalds, Chris Lee, Git Mailing List On Tue, 3 Apr 2007, Junio C Hamano wrote: > Nicolas Pitre <nico@cam.org> writes: > > > Make it conditionnal on --stdin then. This covers all cases where we > > really want the secure thing to happen, and the --stdin case already > > perform the atomic rename-and-move thing when the pack is fully indexed. > > Repacking objects in a repository uses pack-objects without > using index-pack, as you suggested Chris. Is there a sane usage > of index-pack that does not use --stdin? I do not think of any. > > If there isn't, the "conditional on --stdin" suggestion means we > unconditionally do the secure thing for all the sane usage, and > go unsecure for an insane usage that we do not really care about. > > If so, it seems to me that it would be the simplest not to touch > the code at all, except that missing free(). That's exactly what I think as well. > Am I missing something? Not from my point of view. Nicolas ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-03 21:42 ` Nicolas Pitre 2007-04-03 22:07 ` Junio C Hamano @ 2007-04-03 22:14 ` Linus Torvalds 2007-04-03 22:55 ` Nicolas Pitre 1 sibling, 1 reply; 58+ messages in thread From: Linus Torvalds @ 2007-04-03 22:14 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Chris Lee, Junio C Hamano, Git Mailing List On Tue, 3 Apr 2007, Nicolas Pitre wrote: > > > > It could be as simple as just the index file itself. That's 11MB for the > > kernel. Imagine if the index file was 20 times bigger - 200MB of memory > > paged in with bad access patterns just for unnecessary lookups. > > Again that presumes you have to page in the whole index, which should > not happen when pulling (much smaller packs) and with a better lookup > algorithm. Actually, since SHA1's are randomly distributed, together with the binary search, you probably *will* pull in the bulk of the index, even with a fairly small set of SHA1's that you check. > Make it conditionnal on --stdin then. This covers all cases where we > really want the secure thing to happen, and the --stdin case already > perform the atomic rename-and-move thing when the pack is fully indexed. I don't care *what* it is conditional on, but your arguments suck. You claim that it's not a normal case to already have the objects, when it *is* a normal case for alternates, etc. I don't understand why you argue against hard numbers. You have none of your own. Linus ^ permalink raw reply [flat|nested] 58+ messages in thread
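Linus's point about binary search over randomly distributed SHA1s can be checked with a small simulation (Python; the 24-byte record size mimics a v1 .idx entry of 4-byte offset plus 20-byte SHA-1, but the numbers here are purely illustrative): each lookup reads only ~log2(N) entries, yet because those entries are scattered across the file, a modest batch of lookups, even all misses, drags in most pages of the index.

```python
import random

def pages_touched(table, entry_size, page_size, keys):
    """Record which pages a binary search reads in a sorted array of
    fixed-size records, for each key in keys. A model of paging in a
    mmap'd pack index, not git code."""
    touched = set()
    for key in keys:
        lo, hi = 0, len(table)
        while lo < hi:
            mid = (lo + hi) // 2
            touched.add(mid * entry_size // page_size)  # page holding this record
            if table[mid] < key:
                lo = mid + 1
            else:
                hi = mid
    return touched
```

With 65,536 entries of 24 bytes (384 four-kilobyte pages), a single lookup touches at most 16 pages, but a thousand random lookups touch the large majority of all 384, which is why even failing lookups page in the bulk of a big index.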
* Re: git-index-pack really does suck.. 2007-04-03 22:14 ` Linus Torvalds @ 2007-04-03 22:55 ` Nicolas Pitre 2007-04-03 22:36 ` David Lang 2007-04-03 23:29 ` Linus Torvalds 0 siblings, 2 replies; 58+ messages in thread From: Nicolas Pitre @ 2007-04-03 22:55 UTC (permalink / raw) To: Linus Torvalds; +Cc: Chris Lee, Junio C Hamano, Git Mailing List On Tue, 3 Apr 2007, Linus Torvalds wrote: > I don't care *what* it is conditional on, but your arguments suck. You > claim that it's not a normal case to already have the objects, when it > *is* a normal case for alternates, etc. > > I don't understand why you argue against hard numbers. You have none of > your own. Are hard numbers like 7% overhead (because right now that's all we have) really worth it against bad _perceptions_? Sure, the SHA1 collision attack is paranoia. But it is becoming increasingly *possible*. And when we only had unpack-objects on the receiving end of a fetch, you yourself bragged about the implied security of GIT in the presence of a SHA1 collision attack. Because let's admit it: when a SHA1 collision does happen, it is far more likely to come on purpose than from pure accident. But as you said at the time, it is not a problem because GIT trusts local objects more than remote ones and incidentally unpack-objects doesn't overwrite existing objects. The keeping of fetched packs broke that presumption of trust towards local objects and it opened a real path for potential future attacks. Those attacks are still fairly theoretical of course. But for how _long_? Do we want GIT to be considered backdoor prone in a couple years from now just because we were obsessed by a 7% CPU overhead? I think we have much more to gain by playing it safe and being more secure and paranoid than trying to squeeze some CPU cycles out of an operation that is likely to always be bounded by network speed for most people. And we _know_ that the operation can be optimized further anyway. 
So IMHO in this case hard numbers alone aren't the end of it. Not as long as they're reasonably low. And especially not for a command which is 1) rather infrequent and 2) not really interactive like git-log might be. Nicolas ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-03 22:55 ` Nicolas Pitre @ 2007-04-03 22:36 ` David Lang 2007-04-04 9:51 ` Alex Riesen 1 sibling, 1 reply; 58+ messages in thread From: David Lang @ 2007-04-03 22:36 UTC (permalink / raw) To: Nicolas Pitre; +Cc: Linus Torvalds, Chris Lee, Junio C Hamano, Git Mailing List On Tue, 3 Apr 2007, Nicolas Pitre wrote: > On Tue, 3 Apr 2007, Linus Torvalds wrote: > >> I don't care *what* it is conditional on, but your arguments suck. You >> claim that it's not a normal case to already have the objects, when it >> *is* a normal case for alternates, etc. >> >> I don't understand why you argue against hard numbers. You have none of >> your own. > > Are hard numbers like 7% overhead (because right now that's all we have) > really worth it against bad _perceptions_? plus 1s overhead on what's otherwise a noop command. > The keeping of fetched packs broke that presumption of trust towards > local objects and it opened a real path for potential future attacks. > Those attacks are still fairly theoretical of course. But for how > _long_? Do we want GIT to be considered backdoor prone in a couple > years from now just because we were obsessed by a 7% CPU overhead? > > I think we have much more to gain by playing it safe and being more > secure and paranoid than trying to squeeze some CPU cycles out of an > operation that is likely to ever be bounded by network speed for most > people. this is why --paranoid should be left on for network pulls, but having it on for local uses, where the cost isn't hidden by network limits, isn't good. David Lang ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-03 22:36 ` David Lang @ 2007-04-04 9:51 ` Alex Riesen [not found] ` <Pine.LNX.4.63.0704061455380.24050@qynat.qvtvafvgr.pbz> 2007-04-06 21:56 ` David Lang 0 siblings, 2 replies; 58+ messages in thread From: Alex Riesen @ 2007-04-04 9:51 UTC (permalink / raw) To: David Lang Cc: Nicolas Pitre, Linus Torvalds, Chris Lee, Junio C Hamano, Git Mailing List On 4/4/07, David Lang <david.lang@digitalinsight.com> wrote: > > > The keeping of fetched packs broke that presumption of trust towards > > local objects and it opened a real path for potential future attacks. > > Those attacks are still fairly theoretical of course. But for how > > _long_? Do we want GIT to be considered backdoor prone in a couple > > years from now just because we were obsessed by a 7% CPU overhead? > > > > I think we have much more to gain by playing it safe and being more > > secure and paranoid than trying to squeeze some CPU cycles out of an > > operation that is likely to ever be bounded by network speed for most > > people. > > this is why -paranoid should be left on for network pulls, but having it on for > the local uses means that the cost isn't hidden in the network limits isn't > good. You never know what pull is networked (or should I say: remote enough to cause a collision). ^ permalink raw reply [flat|nested] 58+ messages in thread
[parent not found: <Pine.LNX.4.63.0704061455380.24050@qynat.qvtvafvgr.pbz>]
* Re: git-index-pack really does suck.. 2007-04-04 9:51 ` Alex Riesen [not found] ` <Pine.LNX.4.63.0704061455380.24050@qynat.qvtvafvgr.pbz> @ 2007-04-06 21:56 ` David Lang 2007-04-06 22:47 ` Junio C Hamano 1 sibling, 1 reply; 58+ messages in thread From: David Lang @ 2007-04-06 21:56 UTC (permalink / raw) To: Alex Riesen Cc: Nicolas Pitre, Linus Torvalds, Chris Lee, Junio C Hamano, Git Mailing List On Wed, 4 Apr 2007, Alex Riesen wrote: > On 4/4/07, David Lang <david.lang@digitalinsight.com> wrote: >> >> > The keeping of fetched packs broke that presumption of trust towards >> > local objects and it opened a real path for potential future attacks. >> > Those attacks are still fairly theoretical of course. But for how >> > _long_? Do we want GIT to be considered backdoor prone in a couple >> > years from now just because we were obsessed by a 7% CPU overhead? >> > >> > I think we have much more to gain by playing it safe and being more >> > secure and paranoid than trying to squeeze some CPU cycles out of an >> > operation that is likely to ever be bounded by network speed for most >> > people. >> >> this is why -paranoid should be left on for network pulls, but having it on >> for >> the local uses means that the cost isn't hidden in the network limits isn't >> good. > > You never know what pull is networked (or should I say: remote enough > to cause a collision). so leave it on for all pulls, but for other commands don't turn it on. remember that the command that linus ran into at the start of the thread wasn't a pull. David Lang ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-06 21:56 ` David Lang @ 2007-04-06 22:47 ` Junio C Hamano 2007-04-06 22:49 ` Junio C Hamano 0 siblings, 1 reply; 58+ messages in thread From: Junio C Hamano @ 2007-04-06 22:47 UTC (permalink / raw) To: David Lang Cc: Alex Riesen, Nicolas Pitre, Linus Torvalds, Chris Lee, Git Mailing List David Lang <david.lang@digitalinsight.com> writes: > On Wed, 4 Apr 2007, Alex Riesen wrote: > ... >> You never know what pull is networked (or should I say: remote enough >> to cause a collision). > > so leave it on for all pulls, but for other commands don't turn it on. > > remember that the command that linus ran into at the start of the > thread wasn't a pull. Are you referring to this command $ git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack in this message? ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-06 22:47 ` Junio C Hamano @ 2007-04-06 22:49 ` Junio C Hamano 2007-04-06 22:22 ` David Lang 0 siblings, 1 reply; 58+ messages in thread From: Junio C Hamano @ 2007-04-06 22:49 UTC (permalink / raw) To: David Lang Cc: Alex Riesen, Nicolas Pitre, Linus Torvalds, Chris Lee, Git Mailing List Junio C Hamano <junkio@cox.net> writes: > David Lang <david.lang@digitalinsight.com> writes: > >> On Wed, 4 Apr 2007, Alex Riesen wrote: >> ... >>> You never know what pull is networked (or should I say: remote enough >>> to cause a collision). >> >> so leave it on for all pulls, but for other commands don't turn it on. >> >> remember that the command that linus ran into at the start of the >> thread wasn't a pull. > > Are you referring to this command > > $ git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack > > in this message? From: Linus Torvalds <torvalds@linux-foundation.org> Subject: git-index-pack really does suck.. Date: Tue, 3 Apr 2007 08:15:12 -0700 (PDT) Message-ID: <Pine.LNX.4.64.0704030754020.6730@woody.linux-foundation.org> (sorry, chomped the message). ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-06 22:49 ` Junio C Hamano @ 2007-04-06 22:22 ` David Lang 2007-04-06 22:55 ` Junio C Hamano 0 siblings, 1 reply; 58+ messages in thread From: David Lang @ 2007-04-06 22:22 UTC (permalink / raw) To: Junio C Hamano Cc: Alex Riesen, Nicolas Pitre, Linus Torvalds, Chris Lee, Git Mailing List On Fri, 6 Apr 2007, Junio C Hamano wrote: > Subject: Re: git-index-pack really does suck.. > > Junio C Hamano <junkio@cox.net> writes: > >> David Lang <david.lang@digitalinsight.com> writes: >> >>> On Wed, 4 Apr 2007, Alex Riesen wrote: >>> ... >>>> You never know what pull is networked (or should I say: remote enough >>>> to cause a collision). >>> >>> so leave it on for all pulls, but for other commands don't turn it on. >>> >>> remember that the command that linus ran into at the start of the >>> thread wasn't a pull. >> >> Are you referring to this command >> >> $ git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack >> >> in this message? > > From: Linus Torvalds <torvalds@linux-foundation.org> > Subject: git-index-pack really does suck.. > Date: Tue, 3 Apr 2007 08:15:12 -0700 (PDT) > Message-ID: <Pine.LNX.4.64.0704030754020.6730@woody.linux-foundation.org> > > (sorry, chomped the message). probably (I usually don't keep the mail after I read or reply to it) David Lang ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck.. 2007-04-06 22:22 ` David Lang @ 2007-04-06 22:55 ` Junio C Hamano 2007-04-06 22:28 ` David Lang 0 siblings, 1 reply; 58+ messages in thread From: Junio C Hamano @ 2007-04-06 22:55 UTC (permalink / raw) To: David Lang Cc: Alex Riesen, Nicolas Pitre, Linus Torvalds, Chris Lee, Git Mailing List David Lang <david.lang@digitalinsight.com> writes: > On Fri, 6 Apr 2007, Junio C Hamano wrote: > >> Subject: Re: git-index-pack really does suck.. >> >> Junio C Hamano <junkio@cox.net> writes: >> >>> David Lang <david.lang@digitalinsight.com> writes: >>> >>>> On Wed, 4 Apr 2007, Alex Riesen wrote: >>>> ... >>>>> You never know what pull is networked (or should I say: remote enough >>>>> to cause a collision). >>>> >>>> so leave it on for all pulls, but for other commands don't turn it on. >>>> >>>> remember that the command that linus ran into at the start of the >>>> thread wasn't a pull. >>> >>> Are you referring to this command >>> >>> $ git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack >>> >>> in this message? >> >> From: Linus Torvalds <torvalds@linux-foundation.org> >> Subject: git-index-pack really does suck.. >> Date: Tue, 3 Apr 2007 08:15:12 -0700 (PDT) >> Message-ID: <Pine.LNX.4.64.0704030754020.6730@woody.linux-foundation.org> >> >> (sorry, chomped the message). > > probably (I useually don't keep the mail after I read or reply to it) Well, then you should remember that the command linus ran into was pretty much about pull, nothing else. The quoted command was only to illustrate what 'git-pull' invokes internally. I do not think of any reason to use that command for cases other than 'git-pull'. What's the use case you have in mind to run that command outside of the context of git-pull? ^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck..
  2007-04-06 22:55                 ` Junio C Hamano
@ 2007-04-06 22:28                   ` David Lang
  0 siblings, 0 replies; 58+ messages in thread
From: David Lang @ 2007-04-06 22:28 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Alex Riesen, Nicolas Pitre, Linus Torvalds, Chris Lee, Git Mailing List

On Fri, 6 Apr 2007, Junio C Hamano wrote:

> David Lang <david.lang@digitalinsight.com> writes:
>
>> On Fri, 6 Apr 2007, Junio C Hamano wrote:
>>
>>> Subject: Re: git-index-pack really does suck..
>>>
>>> Junio C Hamano <junkio@cox.net> writes:
>>>
>>>> David Lang <david.lang@digitalinsight.com> writes:
>>>>
>>>>> On Wed, 4 Apr 2007, Alex Riesen wrote:
>>>>> ...
>>>>>> You never know what pull is networked (or should I say: remote enough
>>>>>> to cause a collision).
>>>>>
>>>>> so leave it on for all pulls, but for other commands don't turn it on.
>>>>>
>>>>> remember that the command that linus ran into at the start of the
>>>>> thread wasn't a pull.
>>>>
>>>> Are you referring to this command
>>>>
>>>>   $ git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack
>>>>
>>>> in this message?
>>>
>>>     From: Linus Torvalds <torvalds@linux-foundation.org>
>>>     Subject: git-index-pack really does suck..
>>>     Date: Tue, 3 Apr 2007 08:15:12 -0700 (PDT)
>>>     Message-ID: <Pine.LNX.4.64.0704030754020.6730@woody.linux-foundation.org>
>>>
>>> (sorry, chomped the message).
>>
>> probably (I usually don't keep the mail after I read or reply to it)
>
> Well, then you should remember that the command Linus ran into
> was pretty much about pull, nothing else.
>
> The quoted command was only to illustrate what 'git-pull'
> invokes internally.  I do not think of any reason to use that
> command for cases other than 'git-pull'.  What's the use case
> you have in mind to run that command outside of the context of
> git-pull?

I guess I'm not remembering the thread accurately then (and/or am
mixing it up with a different thread).

I thought that Linus had identified other cases that were impacted
(something about proving that the object doesn't exist)

David Lang
* Re: git-index-pack really does suck..
  2007-04-03 22:55           ` Nicolas Pitre
  2007-04-03 22:36             ` David Lang
@ 2007-04-03 23:29             ` Linus Torvalds
  1 sibling, 0 replies; 58+ messages in thread
From: Linus Torvalds @ 2007-04-03 23:29 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Chris Lee, Junio C Hamano, Git Mailing List

On Tue, 3 Apr 2007, Nicolas Pitre wrote:
>
> Are hard numbers like 7% overhead (because right now that's all we have)
> really worth it against bad _perceptions_?

If it actually stays at just 7% even with large repos (and the numbers
from Chris seem to say that it doesn't get worse - in fact, it may be that
the lookup gets relatively more efficient for a large repo thanks to the
log(n) costs), I agree that 7% probably isn't worth worrying about when
weighed against "guaranteed no SHA1 collision".

Especially as long as you'd normally only hit it when your real
performance issue is going to be the network. So especially if we can make
sure that the *local* case is ok, where the network isn't going to be the
bottleneck, I think we can/should do the paranoia.

That's especially true as it is also the local case where the 7% has
already been shown to be just the best case, with the worst case being
many hundred percent (and memory use going up from 55M to 280M in one
example), thanks to us actually *finding* the objects.

		Linus
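[Editor's note: the "paranoia" being weighed above is index-pack checking, for every incoming object, whether an object with the same SHA-1 already exists locally and, if so, whether the stored bytes match. A minimal sketch of that idea follows; the names are hypothetical, git's real check lives in index-pack.c, and plain SHA-1 of the payload stands in for git's actual header-plus-content hashing.]

```python
import hashlib

def add_object(store: dict, payload: bytes) -> str:
    """Toy object store with the collision 'paranoia' check: if an
    object with the same id is already present, its bytes must be
    identical, or we have found a hash collision (or local corruption).
    """
    # Note: real git hashes a "<type> <size>" header plus the content;
    # that detail is simplified away here.
    oid = hashlib.sha1(payload).hexdigest()
    existing = store.get(oid)  # this extra lookup is the measured overhead
    if existing is not None and existing != payload:
        raise ValueError("SHA1 collision or corrupt local object: " + oid)
    store[oid] = payload
    return oid
```

[On a normal fetch almost no incoming object exists locally, so the cost is dominated by the existence lookup; the byte-for-byte compare only runs on a hit, which is the "actually *finding* the objects" worst case described above.]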
* Re: git-index-pack really does suck..
  2007-04-03 20:18       ` Linus Torvalds
  2007-04-03 20:32         ` Nicolas Pitre
  2007-04-03 20:33         ` Linus Torvalds
@ 2007-04-03 20:34         ` Junio C Hamano
  2007-04-03 20:53           ` Nicolas Pitre
  2 siblings, 1 reply; 58+ messages in thread
From: Junio C Hamano @ 2007-04-03 20:34 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Chris Lee, Git Mailing List

Linus Torvalds <torvalds@linux-foundation.org> writes:

> That whole "verify no SHA1 hash collision" code is really pretty damn
> paranoid. Maybe we shouldn't have it enabled by default.
>
> So how about this updated patch? We could certainly make "git pull" imply
> "--paranoid" if we want to, but even that is likely pretty unnecessary.
> It's not like anybody has ever shown a SHA1 collision, and if the *local*
> repository is corrupt (and has an object with the wrong SHA1 - that's what
> the testsuite checks for), then it's probably good to get the valid object
> from the remote..

I agree with that reasoning.  We did not do paranoid in git-pull
long after we introduced the .keep thing anyway, so I do not
think the following patch is even needed, but I am throwing it
out just for discussion.

diff --git a/fetch-pack.c b/fetch-pack.c
index 06f4aec..c687f9f 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -522,6 +522,7 @@ static int get_pack(int xd[2])
 
 	if (do_keep) {
 		*av++ = "index-pack";
+		*av++ = "--paranoid";
 		*av++ = "--stdin";
 		if (!quiet && !no_progress)
 			*av++ = "-v";
* Re: git-index-pack really does suck..
  2007-04-03 20:34         ` Junio C Hamano
@ 2007-04-03 20:53           ` Nicolas Pitre
  0 siblings, 0 replies; 58+ messages in thread
From: Nicolas Pitre @ 2007-04-03 20:53 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, Chris Lee, Git Mailing List

On Tue, 3 Apr 2007, Junio C Hamano wrote:

> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
> > That whole "verify no SHA1 hash collision" code is really pretty damn
> > paranoid. Maybe we shouldn't have it enabled by default.
> >
> > So how about this updated patch? We could certainly make "git pull" imply
> > "--paranoid" if we want to, but even that is likely pretty unnecessary.
> > It's not like anybody has ever shown a SHA1 collision, and if the *local*
> > repository is corrupt (and has an object with the wrong SHA1 - that's what
> > the testsuite checks for), then it's probably good to get the valid object
> > from the remote..
>
> I agree with that reasoning.

For the record, I don't agree.  I stated why in my other email.

> We did not do paranoid in git-pull long after we introduced the .keep
> thing anyway,

That doesn't make it more "correct".

> so I do not
> think the following patch is even needed, but I am throwing it
> out just for discussion.

1) None of the objects in a pack should exist in the local repo when
   fetching, meaning that the paranoia code should not be executed
   normally.

2) Running index-pack on a pack _inside_ a repository is a dubious
   thing to do with questionable usefulness already.

3) It is inefficient to run pack-objects with --stdout just to feed
   the result to index-pack afterwards while repack-objects can create
   the index itself, which is the source of this discussion.

4) I invite you to read the commit log for 8685da42561 where the
   _perception_ of GIT's security is discussed, which led to the
   paranoia check, and sometimes the perception is more valuable than
   the reality, especially when it is free.

Therefore Linus' patch and this one are working around the wrong issue
as described in (3) IMHO.

What could be done instead, if really really needed, is to have the
paranoia test be made conditional on index-pack --stdin instead.  But
please no bogus extra switches, pretty please.

Nicolas
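[Editor's note: point (3) above is that piping `git pack-objects --stdout` into `git index-pack` re-parses the whole stream just to learn offsets the writer already knew. A toy illustration of the "build the index while writing" alternative, using simplified formats that are not git's actual pack/idx layout:]

```python
import hashlib
import io

def write_pack(objects):
    """Toy pack writer: length-prefixed blobs, concatenated.  Because
    the writer knows each object's offset as it writes, the index
    (id -> offset) falls out for free -- no second pass over the data,
    which is what piping through a separate indexer would cost."""
    buf = io.BytesIO()
    index = {}
    for obj in objects:
        index[hashlib.sha1(obj).hexdigest()] = buf.tell()
        buf.write(len(obj).to_bytes(4, "big"))
        buf.write(obj)
    return buf.getvalue(), index

def read_object(pack: bytes, index: dict, oid: str) -> bytes:
    """Random access via the index: seek to the offset, read the
    4-byte length prefix, then read the body."""
    off = index[oid]
    size = int.from_bytes(pack[off:off + 4], "big")
    return pack[off + 4:off + 4 + size]
```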
end of thread, other threads: [~2007-04-06 22:59 UTC | newest]
Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-04-03 15:15 git-index-pack really does suck Linus Torvalds
[not found] ` <Pine.LNX.4.64.0704031413200.6730@woody.linux-foundation.org>
[not found] ` <alpine.LFD.0.98.0704031836350.28181@xanadu.home>
[not found] ` <db69205d0704031227q1009eabfhdd82aa3636f25bb6@mail.gmail.com>
[not found] ` <Pine.LNX.4.64.0704031304420.6730@woody.linux-foundation.org>
[not found] ` <Pine.LNX.4.64.0704031322490.6730@woody.linux-foundation.org>
2007-04-03 16:21 ` Linus Torvalds
2007-04-03 16:40 ` Nicolas Pitre
2007-04-03 16:33 ` Nicolas Pitre
2007-04-03 19:27 ` Chris Lee
2007-04-03 19:49 ` Nicolas Pitre
2007-04-03 19:54 ` Chris Lee
2007-04-03 20:18 ` Linus Torvalds
2007-04-03 20:32 ` Nicolas Pitre
2007-04-03 20:40 ` Junio C Hamano
2007-04-03 21:00 ` Linus Torvalds
2007-04-03 21:28 ` Nicolas Pitre
2007-04-03 22:49 ` Chris Lee
2007-04-03 23:12 ` Linus Torvalds
2007-04-03 20:56 ` Linus Torvalds
2007-04-03 21:03 ` Shawn O. Pearce
2007-04-03 21:13 ` Linus Torvalds
2007-04-03 21:17 ` Shawn O. Pearce
2007-04-03 21:26 ` Linus Torvalds
2007-04-03 21:28 ` Linus Torvalds
2007-04-03 22:31 ` Junio C Hamano
2007-04-03 22:38 ` Shawn O. Pearce
2007-04-03 22:41 ` Junio C Hamano
2007-04-05 10:22 ` [PATCH 1/2] git-fetch--tool pick-rref Junio C Hamano
2007-04-05 10:22 ` [PATCH 2/2] git-fetch: use fetch--tool pick-rref to avoid local fetch from alternate Junio C Hamano
2007-04-05 16:15 ` Shawn O. Pearce
2007-04-05 21:37 ` Junio C Hamano
2007-04-03 21:34 ` git-index-pack really does suck Nicolas Pitre
2007-04-03 21:37 ` Shawn O. Pearce
2007-04-03 21:44 ` Junio C Hamano
2007-04-03 21:53 ` Shawn O. Pearce
2007-04-03 22:10 ` Jeff King
2007-04-03 22:40 ` Dana How
2007-04-03 22:52 ` Linus Torvalds
2007-04-03 22:31 ` David Lang
2007-04-03 23:00 ` Nicolas Pitre
2007-04-03 21:21 ` Nicolas Pitre
2007-04-03 20:33 ` Linus Torvalds
2007-04-03 21:05 ` Nicolas Pitre
2007-04-03 21:11 ` Shawn O. Pearce
2007-04-03 21:24 ` Linus Torvalds
[not found] ` <alpine.LFD.0.98.0704031735470.28181@xanadu.home>
2007-04-03 21:42 ` Nicolas Pitre
2007-04-03 22:07 ` Junio C Hamano
2007-04-03 22:11 ` Shawn O. Pearce
2007-04-03 22:34 ` Nicolas Pitre
2007-04-03 22:14 ` Linus Torvalds
2007-04-03 22:55 ` Nicolas Pitre
2007-04-03 22:36 ` David Lang
2007-04-04 9:51 ` Alex Riesen
[not found] ` <Pine.LNX.4.63.0704061455380.24050@qynat.qvtvafvgr.pbz>
2007-04-06 21:56 ` David Lang
2007-04-06 22:47 ` Junio C Hamano
2007-04-06 22:49 ` Junio C Hamano
2007-04-06 22:22 ` David Lang
2007-04-06 22:55 ` Junio C Hamano
2007-04-06 22:28 ` David Lang
2007-04-03 23:29 ` Linus Torvalds
2007-04-03 20:34 ` Junio C Hamano
2007-04-03 20:53 ` Nicolas Pitre