* git-index-pack really does suck..
@ 2007-04-03 15:15 Linus Torvalds
[not found] ` <Pine.LNX.4.64.0704031413200.6730@woody.linux-foundation.org>
` (4 more replies)
0 siblings, 5 replies; 58+ messages in thread
From: Linus Torvalds @ 2007-04-03 15:15 UTC (permalink / raw)
To: Junio C Hamano, Nicolas Pitre, Git Mailing List
Junio, Nico,
I think we need to do something about it.
CLee was complaining about git-index-pack on #irc with the partial KDE
repo, and while I don't have the KDE repo, I decided to investigate a bit.
Even with just the kernel repo (with a single 170MB pack-file), I can do
git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack
and it uses 52s of CPU-time, and on my 4GB machine it actually started
doing IO and swapping, because git-index-pack grew to 4.8GB in size. So
while I initially thought I'd want a bigger test-case to see the problem,
I sure as heck don't.
The 52s of CPU time exploded into almost three minutes of actual
real-time:
47.33user 5.79system 2:41.65elapsed 32%CPU
2117major+1245763minor
And that's on a good system with a powerful CPU, "enough memory" for any
reasonable development, and good disks! Very much ungood-plus-plus.
I haven't looked into exactly why yet, but I bet it's just that we keep
every single object expanded in memory. We do need to keep the objects
around, so that we can resolve deltas, but we can certainly do it in
other ways.
Two suggestions for other ways:
- simple one: don't keep unexploded objects around, just keep the deltas,
and spend tons of CPU-time just re-expanding them if required.
We *should* be able to do it with just keeping the original 170MB
pack-file in memory, not expanding it to 3.8GB!
Still, even this will be painful once you have a big pack-file, and the
CPU waste is nasty (although a delta-base cache like we do in
sha1_file.c would probably fix it 99% - at that point it's getting
less simple, and the "best" solution below looks more palatable)
- best one: when writing out the pack-file, we incrementally keep a
"struct packed_git" around, and update the index for it dynamically,
and totally get rid of all objects that we've written out, because we
can re-create them.
This means that we should have _zero_ memory footprint except for the
one object that we're working on right then and there, and any
unresolved deltas where we've not seen the base at all (and the latter
generally shouldn't happen any more with most pack-files)
The "best one" wouldn't seem to be *that* painful, but as mentioned, I
haven't even started looking at the code yet, I thought I'd try to rope
Nico into looking at this first ;)
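To make the delta-base cache mentioned under the "simple" option
concrete, here is a minimal sketch (not the actual sha1_file.c code; all
names are made up): a small table of recently expanded objects, keyed by
their offset in the pack, so a chain of deltas against the same base
only expands that base once.

#include <stdlib.h>

#define CACHE_SLOTS 256

struct cache_entry {
	unsigned long offset;	/* offset of the base object in the pack */
	void *data;		/* expanded object data */
	unsigned long size;
};

static struct cache_entry cache[CACHE_SLOTS];

/* Return the cached expansion of the object at 'offset', or NULL on a
 * miss, in which case the caller re-expands it from the pack data. */
static void *cache_get(unsigned long offset, unsigned long *size)
{
	struct cache_entry *e = &cache[offset % CACHE_SLOTS];
	if (e->data && e->offset == offset) {
		*size = e->size;
		return e->data;
	}
	return NULL;
}

/* Remember an expanded object, evicting whatever hashed to this slot,
 * which keeps the number of cached expansions bounded. */
static void cache_put(unsigned long offset, void *data, unsigned long size)
{
	struct cache_entry *e = &cache[offset % CACHE_SLOTS];
	free(e->data);
	e->offset = offset;
	e->data = data;
	e->size = size;
}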
Linus
* Re: git-index-pack really does suck..
2007-04-03 15:15 git-index-pack really does suck Linus Torvalds
[not found] ` <Pine.LNX.4.64.0704031413200.6730@woody.linux-foundation.org>
[not found] ` <db69205d0704031227q1009eabfhdd82aa3636f25bb6@mail.gmail.com>
@ 2007-04-03 16:21 ` Linus Torvalds
2007-04-03 16:40 ` Nicolas Pitre
2007-04-03 16:33 ` Nicolas Pitre
2007-04-03 19:27 ` Chris Lee
4 siblings, 1 reply; 58+ messages in thread
From: Linus Torvalds @ 2007-04-03 16:21 UTC (permalink / raw)
To: Junio C Hamano, Nicolas Pitre, Git Mailing List
On Tue, 3 Apr 2007, Linus Torvalds wrote:
>
> and it uses 52s of CPU-time, and on my 4GB machine it actually started
> doing IO and swapping, because git-index-pack grew to 4.8GB in size.
Ahh. False alarm.
The problem is actually largely a really stupid memory leak in the SHA1
collision checking (which wouldn't trigger on a normal pull, but obviously
does trigger for every single object when testing!)
This trivial patch fixes most of it. git-index-pack still uses too much
memory, but it does a *lot* better.
Junio, please get this into 1.5.1 (I *think* the SHA1 checking is new, but
if it exists in 1.5.0 too, it obviously needs the same fix).
It still grows, but it grew to just 287M in size now for the 170M kernel
object:
41.59user 1.39system 0:43.64elapsed
0major+73552minor
which is quite a lot better.
Duh.
Linus
---
index-pack.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/index-pack.c b/index-pack.c
index 6284fe3..3c768fb 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -358,6 +358,7 @@ static void sha1_object(const void *data, unsigned long size,
if (size != has_size || type != has_type ||
memcmp(data, has_data, size) != 0)
die("SHA1 COLLISION FOUND WITH %s !", sha1_to_hex(sha1));
+ free(has_data);
}
}
* Re: git-index-pack really does suck..
2007-04-03 15:15 git-index-pack really does suck Linus Torvalds
` (2 preceding siblings ...)
2007-04-03 16:21 ` Linus Torvalds
@ 2007-04-03 16:33 ` Nicolas Pitre
2007-04-03 19:27 ` Chris Lee
4 siblings, 0 replies; 58+ messages in thread
From: Nicolas Pitre @ 2007-04-03 16:33 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List
On Tue, 3 Apr 2007, Linus Torvalds wrote:
>
> Junio, Nico,
> I think we need to do something about it.
Sure. Mea culpa.
> CLee was complaining about git-index-pack on #irc with the partial KDE
> repo, and while I don't have the KDE repo, I decided to investigate a bit.
>
> Even with just the kernel repo (with a single 170MB pack-file), I can do
>
> git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack
>
> and it uses 52s of CPU-time, and on my 4GB machine it actually started
> doing IO and swapping, because git-index-pack grew to 4.8GB in size.
Right.
> Two suggestions for other ways:
>
> - simple one: don't keep unexploded objects around, just keep the deltas,
> and spend tons of CPU-time just re-expanding them if required.
>
> We *should* be able to do it with just keeping the original 170MB
> pack-file in memory, not expanding it to 3.8GB!
>
> Still, even this will be painful once you have a big pack-file, and the
> CPU waste is nasty (although a delta-base cache like we do in
> sha1_file.c would probably fix it 99% - at that point it's getting
> less simple, and the "best" solution below looks more palatable)
>
> - best one: when writing out the pack-file, we incrementally keep a
> "struct packed_git" around, and update the index for it dynamically,
> and totally get rid of all objects that we've written out, because we
> can re-create them.
>
> This means that we should have _zero_ memory footprint except for the
> one object that we're working on right then and there, and any
> unresolved deltas where we've not seen the base at all (and the latter
> generally shouldn't happen any more with most pack-files)
Even better:
- Fix my own stupid mistake with a _single_ line of code:
diff --git a/index-pack.c b/index-pack.c
index 6284fe3..3c768fb 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -358,6 +358,7 @@ static void sha1_object(const void *data, unsigned long size,
if (size != has_size || type != has_type ||
memcmp(data, has_data, size) != 0)
die("SHA1 COLLISION FOUND WITH %s !", sha1_to_hex(sha1));
+ free(has_data);
}
}
The thing is, that code path is executed _only_ when index-pack
encounters an object already in the repository, in order to protect
against possible SHA1 collision attacks. See the log of commit
8685da42561d for the full story.
Normally this should not happen in typical usage scenarios, because the
objects you fetch are those you don't already have. But if you manually
run index-pack inside an existing repository, then you already have
_all_ those objects, which explains the high CPU usage.
That is no excuse for not freeing the data, though.
Nicolas
* Re: git-index-pack really does suck..
2007-04-03 16:21 ` Linus Torvalds
@ 2007-04-03 16:40 ` Nicolas Pitre
0 siblings, 0 replies; 58+ messages in thread
From: Nicolas Pitre @ 2007-04-03 16:40 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List
On Tue, 3 Apr 2007, Linus Torvalds wrote:
>
>
> On Tue, 3 Apr 2007, Linus Torvalds wrote:
> >
> > and it uses 52s of CPU-time, and on my 4GB machine it actually started
> > doing IO and swapping, because git-index-pack grew to 4.8GB in size.
>
> Ahh. False alarm.
>
> The problem is actually largely a really stupid memory leak in the SHA1
> collision checking (which wouldn't trigger on a normal pull, but obviously
> does trigger for every single object when testing!)
Damn!
Please don't report those things while I'm out for lunch; give me a
chance to fix my own stupidities myself in time! :-)
> This trivial patch fixes most of it. git-index-pack still uses too much
> memory, but it does a *lot* better.
>
> Junio, please get this into 1.5.1 (I *think* the SHA1 checking is new, but
> if it exists in 1.5.0 too, it obviously needs the same fix).
No, it is new to 1.5.1.
> It still grows, but it grew to just 287M in size now for the 170M kernel
> object:
>
> 41.59user 1.39system 0:43.64elapsed
> 0major+73552minor
>
> which is quite a lot better.
>
> Duh.
Indeed.
Acked-by: Nicolas Pitre <nico@cam.org>
> index-pack.c | 1 +
> 1 files changed, 1 insertions(+), 0 deletions(-)
>
> diff --git a/index-pack.c b/index-pack.c
> index 6284fe3..3c768fb 100644
> --- a/index-pack.c
> +++ b/index-pack.c
> @@ -358,6 +358,7 @@ static void sha1_object(const void *data, unsigned long size,
> if (size != has_size || type != has_type ||
> memcmp(data, has_data, size) != 0)
> die("SHA1 COLLISION FOUND WITH %s !", sha1_to_hex(sha1));
> + free(has_data);
> }
> }
>
Nicolas
* Re: git-index-pack really does suck..
2007-04-03 15:15 git-index-pack really does suck Linus Torvalds
` (3 preceding siblings ...)
2007-04-03 16:33 ` Nicolas Pitre
@ 2007-04-03 19:27 ` Chris Lee
2007-04-03 19:49 ` Nicolas Pitre
2007-04-03 20:18 ` Linus Torvalds
4 siblings, 2 replies; 58+ messages in thread
From: Chris Lee @ 2007-04-03 19:27 UTC (permalink / raw)
To: Linus Torvalds, git
On 4/3/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>
> Junio, Nico,
> I think we need to do something about it.
>
> CLee was complaining about git-index-pack on #irc with the partial KDE
> repo, and while I don't have the KDE repo, I decided to investigate a bit.
>
> Even with just the kernel repo (with a single 170MB pack-file), I can do
>
> git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack
>
> and it uses 52s of CPU-time, and on my 4GB machine it actually started
> doing IO and swapping, because git-index-pack grew to 4.8GB in size. So
> while I initially thought I'd want a bigger test-case to see the problem,
> I sure as heck don't.
There's another issue here.
I'm running git-index-pack as part of a workflow like so:
$ git-verify-pack -v .git/objects/pack/*.idx > /tmp/all-objects
$ grep 'blob' /tmp/all-objects > /tmp/blob-objects
$ cat /tmp/blob-objects | awk '{print $1;}' | git-pack-objects
--delta-base-offset --all-progress --stdout > blob.pack
$ git-index-pack -v blob.pack
Now, when I run 'git-index-pack' on blob.pack in the current
directory, memory usage is pretty horrific (even with the applied
patch to not leak everything). Shawn tells me that index-pack
should only be decompressing the object twice - once from the repo and
once from blob.pack - iff I call git-index-pack with --stdin, which I
am not.
If I move the blob.pack into /tmp, and run git-index-pack on it there,
it completes much faster and the memory usage never exceeds 200MB.
(Inside the repo, it takes up over 3G of RES according to top.)
By "much faster", I mean: the entire pack was indexed and completed in
17:42.40, whereas I cancelled the inside-the-repo index because after
56 minutes it was only at 46%.
(And, as far as getting this huge repo published - I have it burned to
a DVD, and I'm going to drop it in the mail to hpa today on my lunch
break, which I should be taking soon.)
-clee
* Re: git-index-pack really does suck..
2007-04-03 19:27 ` Chris Lee
@ 2007-04-03 19:49 ` Nicolas Pitre
2007-04-03 19:54 ` Chris Lee
2007-04-03 20:18 ` Linus Torvalds
1 sibling, 1 reply; 58+ messages in thread
From: Nicolas Pitre @ 2007-04-03 19:49 UTC (permalink / raw)
To: Chris Lee; +Cc: Linus Torvalds, git
On Tue, 3 Apr 2007, Chris Lee wrote:
> There's another issue here.
>
> I'm running git-index-pack as part of a workflow like so:
>
> $ git-verify-pack -v .git/objects/pack/*.idx > /tmp/all-objects
> $ grep 'blob' /tmp/all-objects > /tmp/blob-objects
> $ cat /tmp/blob-objects | awk '{print $1;}' | git-pack-objects
> --delta-base-offset --all-progress --stdout > blob.pack
> $ git-index-pack -v blob.pack
Instead of using --stdout with git-pack-objects, you should provide it
with a suitable base name for the resulting pack, and the index will be
created automatically alongside the pack for you. No need to use
index-pack for that.
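For example, something like this (a sketch of the same workflow; "blob"
is just an illustrative base name) writes blob-<sha1>.pack and the
matching blob-<sha1>.idx in one step:

$ awk '{print $1}' /tmp/blob-objects |
	git-pack-objects --delta-base-offset blob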
> Now, when I run 'git-index-pack' on blob.pack in the current
> directory, memory usage is pretty horrific (even with the applied
> patch to not leak everything). Shawn tells me that index-pack
> should only be decompressing the object twice - once from the repo and
> once from blob.pack - iff I call git-index-pack with --stdin, which I
> am not.
>
> If I move the blob.pack into /tmp, and run git-index-pack on it there,
> it completes much faster and the memory usage never exceeds 200MB.
> (Inside the repo, it takes up over 3G of RES according to top.)
The 3G should definitely be fixed with the added free().
The CPU usage is explained by the fact that you're running index-pack on
objects that are all already found in your repo, so the collision check
is triggered. This is more or less the same issue as if you tried to
run unpack-objects on the same pack, where none of your objects will
actually be unpacked.
Nicolas
* Re: git-index-pack really does suck..
2007-04-03 19:49 ` Nicolas Pitre
@ 2007-04-03 19:54 ` Chris Lee
0 siblings, 0 replies; 58+ messages in thread
From: Chris Lee @ 2007-04-03 19:54 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Linus Torvalds, git
On 4/3/07, Nicolas Pitre <nico@cam.org> wrote:
> On Tue, 3 Apr 2007, Chris Lee wrote:
>
> > There's another issue here.
> >
> > I'm running git-index-pack as part of a workflow like so:
> >
> > $ git-verify-pack -v .git/objects/pack/*.idx > /tmp/all-objects
> > $ grep 'blob' /tmp/all-objects > /tmp/blob-objects
> > $ cat /tmp/blob-objects | awk '{print $1;}' | git-pack-objects
> > --delta-base-offset --all-progress --stdout > blob.pack
> > $ git-index-pack -v blob.pack
>
> Instead of using --stdout with git-pack-objects, you should provide it
> with a suitable base name for the resulting pack, and the index will be
> created automatically alongside the pack for you. No need to use
> index-pack for that.
Right. But then I wouldn't have discovered how much git-index-pack sucks. :)
> > Now, when I run 'git-index-pack' on blob.pack in the current
> > directory, memory usage is pretty horrific (even with the applied
> > patch to not leak everything). Shawn tells me that index-pack
> > should only be decompressing the object twice - once from the repo and
> > once from blob.pack - iff I call git-index-pack with --stdin, which I
> > am not.
> >
> > If I move the blob.pack into /tmp, and run git-index-pack on it there,
> > it completes much faster and the memory usage never exceeds 200MB.
> > (Inside the repo, it takes up over 3G of RES according to top.)
>
> The 3G should definitely be fixed with the added free().
Not really. This packfile is 2.6GB in size, and apparently it gets mmap'd.
(Yesterday, my machine ran out of memory trying to do index-pack when
the memleak still existed; I have 4G of RAM and, normally, 4G of swap,
but I upped it to 32G of swap and it still ran out of memory.)
> The CPU usage is explained by the fact that you're running index-pack on
> objects that are all already found in your repo, so the collision check
> is triggered. This is more or less the same issue as if you tried to
> run unpack-objects on the same pack, where none of your objects will
> actually be unpacked.
Right, and if I were using --stdin, I would expect that. But I'm not.
And, according to Shawn anyway, the current behaviour is not what was
intended.
-clee
* Re: git-index-pack really does suck..
2007-04-03 19:27 ` Chris Lee
2007-04-03 19:49 ` Nicolas Pitre
@ 2007-04-03 20:18 ` Linus Torvalds
2007-04-03 20:32 ` Nicolas Pitre
` (2 more replies)
1 sibling, 3 replies; 58+ messages in thread
From: Linus Torvalds @ 2007-04-03 20:18 UTC (permalink / raw)
To: Chris Lee, Junio C Hamano; +Cc: Git Mailing List
On Tue, 3 Apr 2007, Chris Lee wrote:
>
> There's another issue here.
>
> I'm running git-index-pack as part of a workflow like so:
>
> $ git-verify-pack -v .git/objects/pack/*.idx > /tmp/all-objects
> $ grep 'blob' /tmp/all-objects > /tmp/blob-objects
> $ cat /tmp/blob-objects | awk '{print $1;}' | git-pack-objects
> --delta-base-offset --all-progress --stdout > blob.pack
> $ git-index-pack -v blob.pack
>
> Now, when I run 'git-index-pack' on blob.pack in the current
> directory, memory usage is pretty horrific (even with the applied
> patch to not leak everything). Shawn tells me that index-pack
> should only be decompressing the object twice - once from the repo and
> once from blob.pack - iff I call git-index-pack with --stdin, which I
> am not.
>
> If I move the blob.pack into /tmp, and run git-index-pack on it there,
> it completes much faster and the memory usage never exceeds 200MB.
> (Inside the repo, it takes up over 3G of RES according to top.)
Yeah. What happens is that inside the repo, because we do all the
duplicate object checks (verifying that there are no evil hash collisions)
even after fixing the memory leak, we end up keeping *track* of all those
objects.
And with a large repository, it's quite the expensive operation.
That whole "verify no SHA1 hash collision" code is really pretty damn
paranoid. Maybe we shouldn't have it enabled by default.
So how about this updated patch? We could certainly make "git pull" imply
"--paranoid" if we want to, but even that is likely pretty unnecessary.
It's not like anybody has ever shown a SHA1 collision, and if the *local*
repository is corrupt (and has an object with the wrong SHA1 - that's what
the testsuite checks for), then it's probably good to get the valid object
from the remote..
This includes the previous one-liner, but also adds the "--paranoid" flag
and fixes up the Documentation and tests to match.
Junio, your choice, but regardless of which one you choose:
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Thanks,
Linus
---
Documentation/git-index-pack.txt | 3 +--
index-pack.c | 6 +++++-
t/t5300-pack-object.sh | 4 ++--
3 files changed, 8 insertions(+), 5 deletions(-)
diff --git a/Documentation/git-index-pack.txt b/Documentation/git-index-pack.txt
index 2229ee8..7d8d33b 100644
--- a/Documentation/git-index-pack.txt
+++ b/Documentation/git-index-pack.txt
@@ -8,8 +8,7 @@ git-index-pack - Build pack index file for an existing packed archive
SYNOPSIS
--------
-'git-index-pack' [-v] [-o <index-file>] <pack-file>
-'git-index-pack' --stdin [--fix-thin] [--keep] [-v] [-o <index-file>] [<pack-file>]
+'git-index-pack' [--stdin [--fix-thin] [--keep]] [-v] [--paranoid] [-o <index-file>] <pack-file>
DESCRIPTION
diff --git a/index-pack.c b/index-pack.c
index 6284fe3..8a4c27a 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -45,6 +45,7 @@ static int nr_resolved_deltas;
static int from_stdin;
static int verbose;
+static int paranoid;
static volatile sig_atomic_t progress_update;
@@ -348,7 +349,7 @@ static void sha1_object(const void *data, unsigned long size,
enum object_type type, unsigned char *sha1)
{
hash_sha1_file(data, size, typename(type), sha1);
- if (has_sha1_file(sha1)) {
+ if (paranoid && has_sha1_file(sha1)) {
void *has_data;
enum object_type has_type;
unsigned long has_size;
@@ -358,6 +359,7 @@ static void sha1_object(const void *data, unsigned long size,
if (size != has_size || type != has_type ||
memcmp(data, has_data, size) != 0)
die("SHA1 COLLISION FOUND WITH %s !", sha1_to_hex(sha1));
+ free(has_data);
}
}
@@ -839,6 +841,8 @@ int main(int argc, char **argv)
if (*arg == '-') {
if (!strcmp(arg, "--stdin")) {
from_stdin = 1;
+ } else if (!strcmp(arg, "--paranoid")) {
+ paranoid = 1;
} else if (!strcmp(arg, "--fix-thin")) {
fix_thin_pack = 1;
} else if (!strcmp(arg, "--keep")) {
diff --git a/t/t5300-pack-object.sh b/t/t5300-pack-object.sh
index 35e036a..407c71e 100755
--- a/t/t5300-pack-object.sh
+++ b/t/t5300-pack-object.sh
@@ -262,7 +262,7 @@ test_expect_success \
.git/objects/c8/2de19312b6c3695c0c18f70709a6c535682a67'
test_expect_failure \
- 'make sure index-pack detects the SHA1 collision' \
- 'git-index-pack -o bad.idx test-3.pack'
+ 'make sure index-pack detects the SHA1 collision when paranoid' \
+ 'git-index-pack --paranoid -o bad.idx test-3.pack'
test_done
* Re: git-index-pack really does suck..
2007-04-03 20:18 ` Linus Torvalds
@ 2007-04-03 20:32 ` Nicolas Pitre
2007-04-03 20:40 ` Junio C Hamano
2007-04-03 20:56 ` Linus Torvalds
2007-04-03 20:33 ` Linus Torvalds
2007-04-03 20:34 ` Junio C Hamano
2 siblings, 2 replies; 58+ messages in thread
From: Nicolas Pitre @ 2007-04-03 20:32 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Chris Lee, Junio C Hamano, Git Mailing List
On Tue, 3 Apr 2007, Linus Torvalds wrote:
>
>
> On Tue, 3 Apr 2007, Chris Lee wrote:
> >
> > There's another issue here.
> >
> > I'm running git-index-pack as part of a workflow like so:
> >
> > $ git-verify-pack -v .git/objects/pack/*.idx > /tmp/all-objects
> > $ grep 'blob' /tmp/all-objects > /tmp/blob-objects
> > $ cat /tmp/blob-objects | awk '{print $1;}' | git-pack-objects
> > --delta-base-offset --all-progress --stdout > blob.pack
> > $ git-index-pack -v blob.pack
> >
> > Now, when I run 'git-index-pack' on blob.pack in the current
> > directory, memory usage is pretty horrific (even with the applied
> > patch to not leak everything). Shawn tells me that index-pack
> > should only be decompressing the object twice - once from the repo and
> > once from blob.pack - iff I call git-index-pack with --stdin, which I
> > am not.
> >
> > If I move the blob.pack into /tmp, and run git-index-pack on it there,
> > it completes much faster and the memory usage never exceeds 200MB.
> > (Inside the repo, it takes up over 3G of RES according to top.)
>
> Yeah. What happens is that inside the repo, because we do all the
> duplicate object checks (verifying that there are no evil hash collisions)
> even after fixing the memory leak, we end up keeping *track* of all those
> objects.
What do you mean?
> And with a large repository, it's quite the expensive operation.
>
> That whole "verify no SHA1 hash collision" code is really pretty damn
> paranoid. Maybe we shouldn't have it enabled by default.
Maybe we shouldn't run index-pack on packs for which we _already_ have
an index, which is the most likely reason for the collision check to
trigger in the first place.
This is in the same category as trying to run unpack-objects on a pack
within a repository and wondering why it doesn't work.
> So how about this updated patch? We could certainly make "git pull" imply
> "--paranoid" if we want to, but even that is likely pretty unnecessary.
I'm of the opinion that this patch is unnecessary. It only helps in
bogus workflows to start with, and it makes the default behavior unsafe
(unsafe from a paranoid pov, but still). And in the _normal_ workflow
it should never trigger.
So I wouldn't merge it.
Nicolas
* Re: git-index-pack really does suck..
2007-04-03 20:18 ` Linus Torvalds
2007-04-03 20:32 ` Nicolas Pitre
@ 2007-04-03 20:33 ` Linus Torvalds
2007-04-03 21:05 ` Nicolas Pitre
2007-04-03 20:34 ` Junio C Hamano
2 siblings, 1 reply; 58+ messages in thread
From: Linus Torvalds @ 2007-04-03 20:33 UTC (permalink / raw)
To: Chris Lee, Nicolas Pitre, Junio C Hamano; +Cc: Git Mailing List
On Tue, 3 Apr 2007, Linus Torvalds wrote:
>
> So how about this updated patch? We could certainly make "git pull" imply
> "--paranoid" if we want to, but even that is likely pretty unnecessary.
> It's not like anybody has ever shown a SHA1 collision, and if the *local*
> repository is corrupt (and has an object with the wrong SHA1 - that's what
> the testsuite checks for), then it's probably good to get the valid object
> from the remote..
Some trivial timings for indexing just the kernel pack..
Without --paranoid:
24.61user 2.16system 0:27.04elapsed 99%CPU
0major+14120minor pagefaults
With --paranoid:
42.74user 3.04system 0:46.36elapsed 98%CPU
0major+72768minor pagefaults
so it's a noticeable CPU issue, but it's even more noticeable in memory
usage (55MB vs 284MB - pagefaults give a good way to look at how much
memory really got allocated for the process).
All that extra memory is just for SHA1 commit ID information.
Now, clearly the usage scenario here is a bit odd (ie the case where we
have all the objects already), so in that sense this is very much a
worst-case situation, and you simply shouldn't *do* something like this,
but at the same time, I'm just not convinced a very theoretical SHA1
collision check is worth it.
Btw, even if we don't have any of the objects, if you have tons and tons
of objects and do a "git pull", just the *lookup* of the nonexistent
objects will be expensive: first we won't find it in any pack, then we'll
look at the loose objects, and then we'll look in the pack *again* due to
the race avoidance. So looking up nonexistent objects is actually pretty
expensive.
In fact, "--paranoid" takes one second more for me even totally outside of
a git repository, just because we waste so much time trying to look up
non-existent object files ;)
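Schematically, every nonexistent object goes through something like this
(a rough sketch of the order only, with simplified signatures;
find_loose_object() is a made-up stand-in for the loose-object stat()):

static int object_exists(const unsigned char *sha1)
{
	if (find_pack_entry(sha1))	/* 1: search every pack index */
		return 1;
	if (find_loose_object(sha1))	/* 2: stat() the loose object path */
		return 1;
	/* 3: a concurrent repack may have just moved the object into a
	 * new pack, so rescan the pack directory and search again */
	reprepare_packed_git();
	return find_pack_entry(sha1);
}

A hit usually stops at step 1; a miss pays for all three.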
Linus
* Re: git-index-pack really does suck..
2007-04-03 20:18 ` Linus Torvalds
2007-04-03 20:32 ` Nicolas Pitre
2007-04-03 20:33 ` Linus Torvalds
@ 2007-04-03 20:34 ` Junio C Hamano
2007-04-03 20:53 ` Nicolas Pitre
2 siblings, 1 reply; 58+ messages in thread
From: Junio C Hamano @ 2007-04-03 20:34 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Chris Lee, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> That whole "verify no SHA1 hash collision" code is really pretty damn
> paranoid. Maybe we shouldn't have it enabled by default.
>
> So how about this updated patch? We could certainly make "git pull" imply
> "--paranoid" if we want to, but even that is likely pretty unnecessary.
> It's not like anybody has ever shown a SHA1 collision, and if the *local*
> repository is corrupt (and has an object with the wrong SHA1 - that's what
> the testsuite checks for), then it's probably good to get the valid object
> from the remote..
I agree with that reasoning. We did not do the paranoid check in
git-pull until long after we introduced the .keep thing anyway, so
I do not think the following patch is even needed, but I am throwing
it out just for discussion.
diff --git a/fetch-pack.c b/fetch-pack.c
index 06f4aec..c687f9f 100644
--- a/fetch-pack.c
+++ b/fetch-pack.c
@@ -522,6 +522,7 @@ static int get_pack(int xd[2])
if (do_keep) {
*av++ = "index-pack";
+ *av++ = "--paranoid";
*av++ = "--stdin";
if (!quiet && !no_progress)
*av++ = "-v";
* Re: git-index-pack really does suck..
2007-04-03 20:32 ` Nicolas Pitre
@ 2007-04-03 20:40 ` Junio C Hamano
2007-04-03 21:00 ` Linus Torvalds
2007-04-03 20:56 ` Linus Torvalds
1 sibling, 1 reply; 58+ messages in thread
From: Junio C Hamano @ 2007-04-03 20:40 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Linus Torvalds, Chris Lee, Git Mailing List
[sorry, sent a message without finishing]
Nicolas Pitre <nico@cam.org> writes:
> Maybe we shouldn't run index-pack on packs for which we _already_ have
> an index, which is the most likely reason for the collision check to
> trigger in the first place.
>
> This is in the same category as trying to run unpack-objects on a pack
> within a repository and wondering why it doesn't work.
> ...
> I'm of the opinion that this patch is unnecessary. It only helps in
> bogus workflows to start with, and it makes the default behavior unsafe
> (unsafe from a paranoid pov, but still). And in the _normal_ workflow
> it should never trigger.
Hmmmm. You may have a point.
So maybe we should retitle this thread from "git-index-pack
really does suck.." to "I used git-index-pack in a stupid way"?
* Re: git-index-pack really does suck..
2007-04-03 20:34 ` Junio C Hamano
@ 2007-04-03 20:53 ` Nicolas Pitre
0 siblings, 0 replies; 58+ messages in thread
From: Nicolas Pitre @ 2007-04-03 20:53 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Linus Torvalds, Chris Lee, Git Mailing List
On Tue, 3 Apr 2007, Junio C Hamano wrote:
> Linus Torvalds <torvalds@linux-foundation.org> writes:
>
> > That whole "verify no SHA1 hash collision" code is really pretty damn
> > paranoid. Maybe we shouldn't have it enabled by default.
> >
> > So how about this updated patch? We could certainly make "git pull" imply
> > "--paranoid" if we want to, but even that is likely pretty unnecessary.
> > It's not like anybody has ever shown a SHA1 collision, and if the *local*
> > repository is corrupt (and has an object with the wrong SHA1 - that's what
> > the testsuite checks for), then it's probably good to get the valid object
> > from the remote..
>
> I agree with that reasoning.
For the record, I don't agree. I stated why in my other email.
> We did not do the paranoid check in git-pull until long after we
> introduced the .keep thing anyway,
That doesn't make it more "correct".
> so I do not
> think the following patch is even needed, but I am throwing it
> out just for discussion.
1) None of the objects in a pack should exist in the local repo when
fetching, meaning that the paranoia code should not be executed
normally.
2) Running index-pack on a pack _inside_ a repository is a dubious thing
to do with questionable usefulness already.
3) It is inefficient to run pack-objects with --stdout just to feed the
result to index-pack afterwards, when pack-objects can create the
index itself; this is the source of this discussion.
4) I invite you to read the commit log for 8685da42561, where the
_perception_ of GIT's security that led to the paranoia check is
discussed; sometimes the perception is more valuable than the
reality, especially when it is free.
Therefore Linus' patch and this one are working around the wrong issue
as described in (3) IMHO.
What could be done instead, if really really needed, is to make the
paranoia test conditional on index-pack --stdin. But please, no bogus
extra switches, pretty please.
Nicolas
* Re: git-index-pack really does suck..
2007-04-03 20:32 ` Nicolas Pitre
2007-04-03 20:40 ` Junio C Hamano
@ 2007-04-03 20:56 ` Linus Torvalds
2007-04-03 21:03 ` Shawn O. Pearce
2007-04-03 21:21 ` Nicolas Pitre
1 sibling, 2 replies; 58+ messages in thread
From: Linus Torvalds @ 2007-04-03 20:56 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Chris Lee, Junio C Hamano, Git Mailing List
On Tue, 3 Apr 2007, Nicolas Pitre wrote:
> >
> > Yeah. What happens is that inside the repo, because we do all the
> > duplicate object checks (verifying that there are no evil hash collisions)
> > even after fixing the memory leak, we end up keeping *track* of all those
> > objects.
>
> What do you mean?
Look at what we have to do to look up a SHA1 object.. We create all the
lookup infrastructure, we don't *just* read the object. The delta base
cache is the most obvious one.
> I'm of the opinion that this patch is unnecessary. It only helps in
> bogus workflows to start with, and it makes the default behavior unsafe
> (unsafe from a paranoid pov, but still). And in the _normal_ workflow
> it should never trigger.
Actually, even in the normal workflow it will do all the extra unnecessary
work, if only because of the lookup cost of *not* finding the entry.
Lookie here:
- git index-pack of the *git* pack-file in the v2.6/linux directory (zero
overlap of objects)
With --paranoid:
2.75user 0.37system 0:03.13elapsed 99%CPU
0major+5583minor pagefaults
Without --paranoid:
2.55user 0.12system 0:02.68elapsed 99%CPU
0major+2957minor pagefaults
See? That's the *normal* workflow. Zero objects found. 7% CPU overhead
from just the unnecessary work, and almost twice as much memory used. Just
from the index file lookup etc for a decent-sized project.
Now, in the KDE situation, the *unnecessary* lookups will be about ten
times more expensive, both on memory and CPU, just because the repository
is about 20x the size. Even with no actual hits.
Linus
* Re: git-index-pack really does suck..
2007-04-03 20:40 ` Junio C Hamano
@ 2007-04-03 21:00 ` Linus Torvalds
2007-04-03 21:28 ` Nicolas Pitre
2007-04-03 22:49 ` Chris Lee
0 siblings, 2 replies; 58+ messages in thread
From: Linus Torvalds @ 2007-04-03 21:00 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Nicolas Pitre, Chris Lee, Git Mailing List
On Tue, 3 Apr 2007, Junio C Hamano wrote:
>
> So maybe we should retitle this thread from "git-index-pack
> really does suck.." to "I used git-index-pack in a stupid way"?
See my separate timing numbers, although I bet that Chris can give even
better ones..
Chris, try applying my patch, and then inside the KDE repo you have, do
git index-pack --paranoid --stdin --fix-thin new.pack < ~/git/.git/objects/pack/pack-*.pack
(ie index the objects of the *git* repository, not the KDE one). That
should approximate doing a fair-sized "git pull" - getting new objects. Do
it with and without --paranoid, and time it.
I bet that what I see as a 7% slowdown will be much bigger for you, just
because the negative lookups will be all that much more expensive when you
have tons of objects.
Linus
* Re: git-index-pack really does suck..
2007-04-03 20:56 ` Linus Torvalds
@ 2007-04-03 21:03 ` Shawn O. Pearce
2007-04-03 21:13 ` Linus Torvalds
2007-04-03 21:21 ` Nicolas Pitre
1 sibling, 1 reply; 58+ messages in thread
From: Shawn O. Pearce @ 2007-04-03 21:03 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Nicolas Pitre, Chris Lee, Junio C Hamano, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Actually, even in the normal workflow it will do all the extra unnecessary
> work, if only because of the lookup cost of *not* finding the entry.
>
> Lookie here:
>
> - git index-pack of the *git* pack-file in the v2.6/linux directory (zero
> overlap of objects)
>
> With --paranoid:
>
> 2.75user 0.37system 0:03.13elapsed 99%CPU
> 0major+5583minor pagefaults
>
> Without --paranoid:
>
> 2.55user 0.12system 0:02.68elapsed 99%CPU
> 0major+2957minor pagefaults
>
> See? That's the *normal* workflow. Zero objects found. 7% CPU overhead
> from just the unnecessary work, and almost twice as much memory used. Just
> from the index file lookup etc for a decent-sized project.
OK, but what about that case with unpack-objects? Didn't we do all
this work there to check for the object already existing, too?
During update-index, write-tree and commit-tree, don't we also do
a lot of work (per object, anyway) to check for a non-existing object?
So even with --paranoid (aka what we have now) index-pack still
should be faster than unpack-objects for any sizeable transfer,
and is just as "safe".
If it's the missing-object lookup that is expensive, maybe we should
try to optimize that. We do it enough already in other parts of
the code...
--
Shawn.
* Re: git-index-pack really does suck..
2007-04-03 20:33 ` Linus Torvalds
@ 2007-04-03 21:05 ` Nicolas Pitre
2007-04-03 21:11 ` Shawn O. Pearce
2007-04-03 21:24 ` Linus Torvalds
0 siblings, 2 replies; 58+ messages in thread
From: Nicolas Pitre @ 2007-04-03 21:05 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Chris Lee, Junio C Hamano, Git Mailing List
On Tue, 3 Apr 2007, Linus Torvalds wrote:
> All that extra memory is just for SHA1 commit ID information.
I don't see where that might be. The only thing that the paranoia check
triggers is:
foo = read_sha1_file(blah);
memcmp(foo with bar);
free(foo);
So where is that commit ID information actually stored when using
read_sha1_file()?
> Btw, even if we don't have any of the objects, if you have tons and tons
> of objects and do a "git pull", just the *lookup* of the nonexistent
> objects will be expensive: first we won't find it in any pack, then we'll
> look at the loose objects, and then we'll look int he pack *again* due to
> the race avoidance. So looking up nonexistent objects is actually pretty
> expensive.
Not if you consider that it is performed _while_ receiving (and waiting
for) the pack data over the net in the normal case.
Nicolas
* Re: git-index-pack really does suck..
2007-04-03 21:05 ` Nicolas Pitre
@ 2007-04-03 21:11 ` Shawn O. Pearce
2007-04-03 21:24 ` Linus Torvalds
1 sibling, 0 replies; 58+ messages in thread
From: Shawn O. Pearce @ 2007-04-03 21:11 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Linus Torvalds, Chris Lee, Junio C Hamano, Git Mailing List
Nicolas Pitre <nico@cam.org> wrote:
> On Tue, 3 Apr 2007, Linus Torvalds wrote:
>
> > All that extra memory is just for SHA1 commit ID information.
>
> I don't see where that might be. The only thing that the paranoia check
> triggers is:
>
> foo = read_sha1_file(blah);
> memcmp(foo with bar);
> free(foo);
>
> So where is that commit ID information actually stored when using
> read_sha1_file()?
That might just be the mmap code in sha1_file.c. We are mmapping
the existing packfile to fetch the object. That counts as virtual
memory. ;-)
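Concretely, it amounts to something like this (a sketch with a
simplified signature; the real code maps the file in sliding windows):

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a pack or .idx file read-only.  Every page that a lookup later
 * touches then shows up as a minor fault and in RES, even though
 * nothing is malloc'd and nothing leaks. */
static void *map_pack_file(const char *path, size_t *len)
{
	struct stat st;
	void *map;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return NULL;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return NULL;
	}
	*len = st.st_size;
	map = mmap(NULL, *len, PROT_READ, MAP_PRIVATE, fd, 0);
	close(fd);	/* the mapping stays valid after close() */
	return map == MAP_FAILED ? NULL : map;
}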
--
Shawn.
* Re: git-index-pack really does suck..
2007-04-03 21:03 ` Shawn O. Pearce
@ 2007-04-03 21:13 ` Linus Torvalds
2007-04-03 21:17 ` Shawn O. Pearce
0 siblings, 1 reply; 58+ messages in thread
From: Linus Torvalds @ 2007-04-03 21:13 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Nicolas Pitre, Chris Lee, Junio C Hamano, Git Mailing List
On Tue, 3 Apr 2007, Shawn O. Pearce wrote:
>
> So even with --paranoid (aka what we have now) index-pack still
> should be faster than unpack-objects for any sizeable transfer,
> and is just as "safe".
I agree 100% that index-pack is much much MUCH better than unpack-objects
ever was, so ..
> > If it's the missing-object lookup that is expensive, maybe we should
> try to optimize that. We do it enough already in other parts of
> the code...
Well, for all other cases it's really the "object found" case that is
worth optimizing for, so I think optimizing for "no object" is actually
wrong, unless it also speeds up (or at least doesn't make it worse) the
"real" normal case.
Linus
* Re: git-index-pack really does suck..
2007-04-03 21:13 ` Linus Torvalds
@ 2007-04-03 21:17 ` Shawn O. Pearce
2007-04-03 21:26 ` Linus Torvalds
2007-04-03 21:34 ` git-index-pack really does suck Nicolas Pitre
0 siblings, 2 replies; 58+ messages in thread
From: Shawn O. Pearce @ 2007-04-03 21:17 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Nicolas Pitre, Chris Lee, Junio C Hamano, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, 3 Apr 2007, Shawn O. Pearce wrote:
> > > If it's the missing-object lookup that is expensive, maybe we should
> > try to optimize that. We do it enough already in other parts of
> > the code...
>
> Well, for all other cases it's really the "object found" case that is
> worth optimizing for, so I think optimizing for "no object" is actually
> wrong, unless it also speeds up (or at least doesn't make it worse) the
> "real" normal case.
Right. But maybe we shouldn't be scanning for packfiles every
time we don't find a loose object. Especially if the caller is in
a context where we actually *expect* to not find said object like
half of the time... say in git-add/update-index. ;-)
--
Shawn.
* Re: git-index-pack really does suck..
2007-04-03 20:56 ` Linus Torvalds
2007-04-03 21:03 ` Shawn O. Pearce
@ 2007-04-03 21:21 ` Nicolas Pitre
1 sibling, 0 replies; 58+ messages in thread
From: Nicolas Pitre @ 2007-04-03 21:21 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Chris Lee, Junio C Hamano, Git Mailing List
On Tue, 3 Apr 2007, Linus Torvalds wrote:
>
>
> On Tue, 3 Apr 2007, Nicolas Pitre wrote:
> > >
> > > Yeah. What happens is that inside the repo, because we do all the
> > > duplicate object checks (verifying that there are no evil hash collisions)
> > > even after fixing the memory leak, we end up keeping *track* of all those
> > > objects.
> >
> > What do you mean?
>
> Look at what we have to do to look up a SHA1 object.. We create all the
> lookup infrastructure, we don't *just* read the object. The delta base
> cache is the most obvious one.
It is capped to 16MB, so we're far from the 200+ MB count.
> > I'm of the opinion that this patch is unnecessary. It only helps in
> > bogus workflows to start with, and it makes the default behavior unsafe
> > (unsafe from a paranoid pov, but still). And in the _normal_ workflow
> > it should never trigger.
>
> Actually, even in the normal workflow it will do all the extra unnecessary
> work, if only because of the lookup cost of *not* finding the entry.
>
> Lookie here:
>
> - git index-pack of the *git* pack-file in the v2.6/linux directory (zero
> overlap of objects)
>
> With --paranoid:
>
> 2.75user 0.37system 0:03.13elapsed 99%CPU
> 0major+5583minor pagefaults
>
> Without --paranoid:
>
> 2.55user 0.12system 0:02.68elapsed 99%CPU
> 0major+2957minor pagefaults
>
> See? That's the *normal* workflow. Zero objects found. 7% CPU overhead
> from just the unnecessary work, and almost twice as much memory used. Just
> from the index file lookup etc for a decent-sized project.
7% overhead over two and a half seconds of CPU which, _normally_, is
spent while cloning the whole thing over a network connection which, if
you're lucky and have a 6mbps cable connection, will still be spread
over 5 minutes of real time. And that is assuming you're cloning a big
project inside itself, which wouldn't work anyway. Otherwise a big
clone would run index-pack in an empty repo, where the cost of looking
up existing objects is zero. That leaves git-fetch, which should
concern itself with much smaller packs, pushing this overhead into the
noise.
> Now, in the KDE situation, the *unnecessary* lookups will be about ten
> times more expensive, both on memory and CPU, just because the repository
> is about 20x the size. Even with no actual hits.
So? When would you really perform such an operation in a meaningful
way?
The memory usage worries me. I still cannot explain nor justify it.
But the CPU overhead is certainly not of any concern in _normal_ usage
scenarios, is it?
If anything, that might be a good test case for the Newton-Raphson pack
lookup idea.
Nicolas
* Re: git-index-pack really does suck..
2007-04-03 21:05 ` Nicolas Pitre
2007-04-03 21:11 ` Shawn O. Pearce
@ 2007-04-03 21:24 ` Linus Torvalds
[not found] ` <alpine.LFD.0.98.0704031735470.28181@xanadu.home>
2007-04-03 21:42 ` Nicolas Pitre
1 sibling, 2 replies; 58+ messages in thread
From: Linus Torvalds @ 2007-04-03 21:24 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Chris Lee, Junio C Hamano, Git Mailing List
On Tue, 3 Apr 2007, Nicolas Pitre wrote:
>
> I don't see where that might be. The only thing that the paranoia check
> triggers is:
>
> foo = read_sha1_file(blah);
> memcmp(foo with bar);
> free(foo);
>
> So where is that commit ID information actually stored when using
> read_sha1_file()?
I've got the numbers: it uses much more memory when doing even failing
lookups, ie:
With --paranoid: 5583 minor pagefaults (21MB)
Without --paranoid: 2957 minor pagefaults (11MB)
(remember, this was *just* the git pack, not the kernel pack)
It could be as simple as just the index file itself. That's 11MB for the
kernel. Imagine if the index file was 20 times bigger - 200MB of memory
paged in with bad access patterns just for unnecessary lookups.
Running valgrind shows no leak at all without --paranoid. With --paranoid,
there's some really trivial stuff (the "packed_git" structure etc, so I
think it's really just the index itself).
> Not if you consider that it is performed _while_ receiving (and waiting
> for) the pack data over the net in the normal case.
..which is why I think it makes sense for "pull" to be paranoid. I just
don't think it makes sense to be paranoid *all* the time, since it's
clearly expensive.
Linus
* Re: git-index-pack really does suck..
2007-04-03 21:17 ` Shawn O. Pearce
@ 2007-04-03 21:26 ` Linus Torvalds
2007-04-03 21:28 ` Linus Torvalds
2007-04-03 21:34 ` git-index-pack really does suck Nicolas Pitre
1 sibling, 1 reply; 58+ messages in thread
From: Linus Torvalds @ 2007-04-03 21:26 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Nicolas Pitre, Chris Lee, Junio C Hamano, Git Mailing List
On Tue, 3 Apr 2007, Shawn O. Pearce wrote:
>
> Right. But maybe we shouldn't be scanning for packfiles every
> time we don't find a loose object. Especially if the caller is in
> a context where we actually *expect* to not find said object like
> half of the time... say in git-add/update-index. ;-)
Yes, we could definitely skip the re-lookup if we had a "don't really
care, I can recreate the object myself" flag (ie anybody who is going to
write that object)
Linus
* Re: git-index-pack really does suck..
2007-04-03 21:26 ` Linus Torvalds
@ 2007-04-03 21:28 ` Linus Torvalds
2007-04-03 22:31 ` Junio C Hamano
` (2 more replies)
0 siblings, 3 replies; 58+ messages in thread
From: Linus Torvalds @ 2007-04-03 21:28 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Nicolas Pitre, Chris Lee, Junio C Hamano, Git Mailing List
On Tue, 3 Apr 2007, Linus Torvalds wrote:
>
> Yes, we could definitely skip the re-lookup if we had a "don't really
> care, I can recreate the object myself" flag (ie anybody who is going to
> write that object)
Side note: with "alternates" files, you might well *always* have the
objects. If you do
git clone -l -s ...
to create various branches, and then pull between them, you'll actually
end up in the situation that you'll always find the objects and get back
to the really expensive case..
Linus
* Re: git-index-pack really does suck..
2007-04-03 21:00 ` Linus Torvalds
@ 2007-04-03 21:28 ` Nicolas Pitre
2007-04-03 22:49 ` Chris Lee
1 sibling, 0 replies; 58+ messages in thread
From: Nicolas Pitre @ 2007-04-03 21:28 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, Chris Lee, Git Mailing List
On Tue, 3 Apr 2007, Linus Torvalds wrote:
> See my separate timing numbers, although I bet that Chris can give even
> better ones..
>
> Chris, try applying my patch, and then inside the KDE repo you have, do
>
> git index-pack --paranoid --stdin --fix-thin new.pack < ~/git/.git/objects/pack/pack-*.pack
>
> (ie index the objects of the *git* repository, not the KDE one). That
> should approximate doing a fair-sized "git pull" - getting new objects. Do
> it with and without --paranoid, and time it.
Like I said, this is bogus, since index-pack is throttled by the
network, making this overhead a non-issue in real life.
And like I said, there should _not_ be such a memory usage difference,
which probably points at problems *elsewhere*.
> I bet that what I see as a 7% slowdown will be much bigger for you, just
> because the negative lookups will be all that much more expensive when you
> have tons of objects.
And I bet your Newton-Raphson lookup idea would shine and bring that
overhead down considerably in that case.
Nicolas
* Re: git-index-pack really does suck..
2007-04-03 21:17 ` Shawn O. Pearce
2007-04-03 21:26 ` Linus Torvalds
@ 2007-04-03 21:34 ` Nicolas Pitre
2007-04-03 21:37 ` Shawn O. Pearce
2007-04-03 22:40 ` Dana How
1 sibling, 2 replies; 58+ messages in thread
From: Nicolas Pitre @ 2007-04-03 21:34 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Linus Torvalds, Chris Lee, Junio C Hamano, Git Mailing List
On Tue, 3 Apr 2007, Shawn O. Pearce wrote:
> Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > On Tue, 3 Apr 2007, Shawn O. Pearce wrote:
> > > If it's the missing-object lookup that is expensive, maybe we should
> > > try to optimize that. We do it enough already in other parts of
> > > the code...
> >
> > Well, for all other cases it's really the "object found" case that is
> > worth optimizing for, so I think optimizing for "no object" is actually
> > wrong, unless it also speeds up (or at least doesn't make it worse) the
> > "real" normal case.
>
> Right. But maybe we shouldn't be scanning for packfiles every
> time we don't find a loose object. Especially if the caller is in
> a context where we actually *expect* to not find said object like
> half of the time... say in git-add/update-index. ;-)
First, I truly believe we should have a 64-bit pack index, and fewer,
larger packs rather than many small packs.
Which leaves us with the actual pack index lookup. At that point the
cost of finding an existing object and finding that a given object
doesn't exist is about the same thing, isn't it?
Optimizing that lookup is going to benefit both cases.
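For the v1 index format, the lookup boils down to something like this
sketch (illustrative, not the actual find_pack_entry_one() code): the
256-entry fan-out table narrows the range by the first hash byte, then
a binary search runs over the sorted 24-byte (offset + SHA-1) records.

#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

static uint32_t read_be32(const unsigned char *p)
{
	uint32_t v;
	memcpy(&v, p, 4);
	return ntohl(v);
}

/* 'idx' points at a mapped v1 .idx file: a 256-entry fan-out table of
 * cumulative counts, then sorted records of 4-byte pack offset plus
 * 20-byte SHA-1.  Returns the pack offset, or 0 for a miss (0 is never
 * a valid offset: the pack header alone is 12 bytes). */
static uint32_t find_object_offset(const unsigned char *idx,
				   const unsigned char *sha1)
{
	const unsigned char *records = idx + 256 * 4;
	uint32_t lo = sha1[0] ? read_be32(idx + 4 * (sha1[0] - 1)) : 0;
	uint32_t hi = read_be32(idx + 4 * sha1[0]);

	while (lo < hi) {
		uint32_t mi = lo + (hi - lo) / 2;
		const unsigned char *entry = records + 24 * mi;
		int cmp = memcmp(entry + 4, sha1, 20);
		if (!cmp)
			return read_be32(entry);
		if (cmp < 0)
			lo = mi + 1;
		else
			hi = mi;
	}
	return 0;
}

A hit and a miss both do the same O(log n) comparisons, so speeding
this up helps both cases equally.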
Nicolas
* Re: git-index-pack really does suck..
2007-04-03 21:34 ` git-index-pack really does suck Nicolas Pitre
@ 2007-04-03 21:37 ` Shawn O. Pearce
2007-04-03 21:44 ` Junio C Hamano
2007-04-03 22:40 ` Dana How
1 sibling, 1 reply; 58+ messages in thread
From: Shawn O. Pearce @ 2007-04-03 21:37 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Linus Torvalds, Chris Lee, Junio C Hamano, Git Mailing List
Nicolas Pitre <nico@cam.org> wrote:
> First, I truly believe we should have a 64-bit pack index, and fewer,
> larger packs rather than many small packs.
I'll buy that. ;-)
> Which leaves us with the actual pack index lookup. At that point the
> cost of finding an existing object and finding that a given object
> doesn't exist is about the same thing, isn't it?
Here's the rub: in the missing object case we didn't find it
in the pack index, but it could be loose. That's one failed
syscall per object if the object isn't loose. If the object
isn't loose, it could be that it was *just* removed by a
running prune-packed, and the packfile that it was moved
to was created after we scanned for packfiles, so time to
rescan...
--
Shawn.
* Re: git-index-pack really does suck..
2007-04-03 21:24 ` Linus Torvalds
[not found] ` <alpine.LFD.0.98.0704031735470.28181@xanadu.home>
@ 2007-04-03 21:42 ` Nicolas Pitre
2007-04-03 22:07 ` Junio C Hamano
2007-04-03 22:14 ` Linus Torvalds
1 sibling, 2 replies; 58+ messages in thread
From: Nicolas Pitre @ 2007-04-03 21:42 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Chris Lee, Junio C Hamano, Git Mailing List
On Tue, 3 Apr 2007, Linus Torvalds wrote:
>
>
> On Tue, 3 Apr 2007, Nicolas Pitre wrote:
> >
> > I don't see where that might be. The only thing that the paranoia check
> > triggers is:
> >
> > foo = read_sha1_file(blah);
> > memcmp(foo with bar);
> > free(foo);
> >
> > So where is that commit ID information actually stored when using
> > read_sha1_file()?
>
> I've got the numbers: it uses much more memory when doing even failing
> lookups, ie:
>
> With --paranoid: 5583 minor pagefaults (21MB)
> Without --paranoid: 2957 minor pagefaults (11MB)
>
> (remember, this was *just* the git pack, not the kernel pack)
>
> It could be as simple as just the index file itself. That's 11MB for the
> kernel. Imagine if the index file was 20 times bigger - 200MB of memory
> paged in with bad access patterns just for unnecessary lookups.
Again that presumes you have to page in the whole index, which should
not happen when pulling (much smaller packs) and with a better lookup
algorithm.
> > Not if you consider that it is performed _while_ receiving (and waiting
> > for) the pack data over the net in the normal case.
>
> ..which is why I think it makes sense for "pull" to be paranoid. I just
> don't think it makes sense to be paranoid *all* the time, since it's
> clearly expensive.
Make it conditional on --stdin, then. This covers all cases where we
really want the secure thing to happen, and the --stdin case already
performs the atomic rename-and-move thing when the pack is fully indexed.
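It could be as small as Linus's hunk with the existing from_stdin flag
in place of the new option (an untested sketch):

- if (has_sha1_file(sha1)) {
+ if (from_stdin && has_sha1_file(sha1)) {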
Nicolas
* Re: git-index-pack really does suck..
2007-04-03 21:37 ` Shawn O. Pearce
@ 2007-04-03 21:44 ` Junio C Hamano
2007-04-03 21:53 ` Shawn O. Pearce
0 siblings, 1 reply; 58+ messages in thread
From: Junio C Hamano @ 2007-04-03 21:44 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Nicolas Pitre, Linus Torvalds, Chris Lee, Git Mailing List
"Shawn O. Pearce" <spearce@spearce.org> writes:
> Nicolas Pitre <nico@cam.org> wrote:
>> First, I truly believe we should have a 64-bit pack index, and fewer,
>> larger packs rather than many small packs.
>
> I'll buy that. ;-)
>
>> Which leaves us with the actual pack index lookup. At that point the
>> cost of finding an existing object and finding that a given object
>> doesn't exist is about the same thing, isn't it?
>
> Here's the rub: in the missing object case we didn't find it
> in the pack index, but it could be loose. That's one failed
> syscall per object if the object isn't loose. If the object
> isn't loose, it could be that it was *just* removed by a
> running prune-packed, and the packfile that it was moved
> to was created after we scanned for packfiles, so time to
> rescan...
If that is the only reason we have these reprepare_packed_git()
calls sprinkled all over in sha1_file.c (by 637cdd9d), perhaps we
should rethink that. Is there a cheap way to trigger this
rescanning only when a prune-packed is in progress, I wonder...
* Re: git-index-pack really does suck..
2007-04-03 21:44 ` Junio C Hamano
@ 2007-04-03 21:53 ` Shawn O. Pearce
2007-04-03 22:10 ` Jeff King
0 siblings, 1 reply; 58+ messages in thread
From: Shawn O. Pearce @ 2007-04-03 21:53 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Nicolas Pitre, Linus Torvalds, Chris Lee, Git Mailing List
Junio C Hamano <junkio@cox.net> wrote:
> "Shawn O. Pearce" <spearce@spearce.org> writes:
> > Here's the rub: in the missing object case we didn't find it
> > in the pack index, but it could be loose. That's one failed
> > syscall per object if the object isn't loose. If the object
> > isn't loose, it could be that it was *just* removed by a
> > running prune-packed, and the packfile that it was moved
> > to was created after we scanned for packfiles, so time to
> > rescan...
>
> If that is the only reason we have these reprepare_packed_git()
> calls sprinkled all over in sha1_file.c (by 637cdd9d), perhaps we
> should rethink that. Is there a cheap way to trigger this
> rescanning only when a prune-packed is in progress, I wonder...
Yeah, it is the only reason. So... if we could have some
magic to trigger that, it would be good. I just don't
know what magic that would be.
--
Shawn.
* Re: git-index-pack really does suck..
2007-04-03 21:42 ` Nicolas Pitre
@ 2007-04-03 22:07 ` Junio C Hamano
2007-04-03 22:11 ` Shawn O. Pearce
2007-04-03 22:34 ` Nicolas Pitre
2007-04-03 22:14 ` Linus Torvalds
1 sibling, 2 replies; 58+ messages in thread
From: Junio C Hamano @ 2007-04-03 22:07 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Linus Torvalds, Chris Lee, Git Mailing List
Nicolas Pitre <nico@cam.org> writes:
> Make it conditional on --stdin, then. This covers all cases where we
> really want the secure thing to happen, and the --stdin case already
> performs the atomic rename-and-move thing when the pack is fully indexed.
Repacking objects in a repository uses pack-objects without
using index-pack, as you suggested to Chris. Is there a sane usage
of index-pack that does not use --stdin? I cannot think of any.
If there isn't, the "conditional on --stdin" suggestion means we
unconditionally do the secure thing for all the sane usage, and
go unsecure for an insane usage that we do not really care about.
If so, it seems to me that it would be the simplest not to touch
the code at all, except that missing free().
Am I missing something?
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck..
2007-04-03 21:53 ` Shawn O. Pearce
@ 2007-04-03 22:10 ` Jeff King
0 siblings, 0 replies; 58+ messages in thread
From: Jeff King @ 2007-04-03 22:10 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Junio C Hamano, Nicolas Pitre, Linus Torvalds, Chris Lee,
Git Mailing List
On Tue, Apr 03, 2007 at 05:53:42PM -0400, Shawn O. Pearce wrote:
> > If that is the only reason we have these reprepare_packed_git()
> > calls sprinkled all over sha1_file.c (by 637cdd9d), perhaps we
> > should rethink that. Is there a cheap way to trigger this
> > rescanning only when a prune-packed is in progress, I wonder...
>
> Yea, it is the only reason. So... if we could have some
> magic to trigger that, it would be good. I just don't
> know what magic that would be.
It doesn't have to be in progress; it might have started and completed
between pack-scanning and object lookup. Something like:
1. start git-repack -a -d
2. start git-rev-list, scan for packs
3. repack moves finished pack into place, starts git-prune-packed
4. git-prune-packed completes
5. git-rev-list looks up object
So you would need to have some sort of incremented counter that says
"there's a new pack available."
Perhaps instead of rescanning unconditionally, it should simply stat()
the pack directory and look for a change? That should be much cheaper.
-Jeff
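
A minimal sketch of that stat()-based short cut (hypothetical code,
not something git does; it assumes that creating or deleting a
packfile updates the directory's mtime, and it inherits the usual
one-second mtime granularity race):

#include <sys/stat.h>

static struct stat last_pack_dir_stat;
static int have_last_stat;

/*
 * Return 1 if objects/pack may have changed since we last scanned it,
 * 0 if the cached scan is still known to be valid.
 */
static int pack_dir_changed(const char *pack_dir)
{
	struct stat st;
	if (stat(pack_dir, &st))
		return 1;	/* cannot tell; play it safe and rescan */
	if (have_last_stat &&
	    st.st_mtime == last_pack_dir_stat.st_mtime &&
	    st.st_ino == last_pack_dir_stat.st_ino)
		return 0;	/* unchanged; skip the expensive rescan */
	last_pack_dir_stat = st;
	have_last_stat = 1;
	return 1;
}

With something like this, reprepare_packed_git() would pay its full
cost only when a repack or prune-packed had actually touched the
directory.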
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck..
2007-04-03 22:07 ` Junio C Hamano
@ 2007-04-03 22:11 ` Shawn O. Pearce
2007-04-03 22:34 ` Nicolas Pitre
1 sibling, 0 replies; 58+ messages in thread
From: Shawn O. Pearce @ 2007-04-03 22:11 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Nicolas Pitre, Linus Torvalds, Chris Lee, Git Mailing List
Junio C Hamano <junkio@cox.net> wrote:
> Nicolas Pitre <nico@cam.org> writes:
>
> > Make it conditional on --stdin then. This covers all cases where we
> > really want the secure thing to happen, and the --stdin case already
> > performs the atomic rename-and-move thing when the pack is fully indexed.
>
> Repacking objects in a repository uses pack-objects without
> using index-pack, as you suggested to Chris. Is there a sane usage
> of index-pack that does not use --stdin? I cannot think of any.
>
> If there isn't, the "conditional on --stdin" suggestion means we
> unconditionally do the secure thing for all the sane usage, and
> go insecure for an insane usage that we do not really care about.
>
> If so, it seems to me that it would be simplest not to touch
> the code at all, except for that missing free().
>
> Am I missing something?
Nope. I agree with you completely.
--
Shawn.
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck..
2007-04-03 21:42 ` Nicolas Pitre
2007-04-03 22:07 ` Junio C Hamano
@ 2007-04-03 22:14 ` Linus Torvalds
2007-04-03 22:55 ` Nicolas Pitre
1 sibling, 1 reply; 58+ messages in thread
From: Linus Torvalds @ 2007-04-03 22:14 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Chris Lee, Junio C Hamano, Git Mailing List
On Tue, 3 Apr 2007, Nicolas Pitre wrote:
> >
> > It could be as simple as just the index file itself. That's 11MB for the
> > kernel. Imagine if the index file was 20 times bigger - 200MB of memory
> > paged in with bad access patterns just for unnecessary lookups.
>
> Again that presumes you have to page in the whole index, which should
> not happen when pulling (much smaller packs) and with a better lookup
> algorithm.
Actually, since SHA1's are randomly distributed, together with the binary
search, you probably *will* pull in the bulk of the index, even with a
fairly small set of SHA1's that you check.
> Make it conditional on --stdin then. This covers all cases where we
> really want the secure thing to happen, and the --stdin case already
> performs the atomic rename-and-move thing when the pack is fully indexed.
I don't care *what* it is conditional on, but your arguments suck. You
claim that it's not a normal case to already have the objects, when it
*is* a normal case for alternates, etc.
I don't understand why you argue against hard numbers. You have none of
your own.
Linus
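
To make the paging argument above concrete, here is a rough sketch of
a pack index lookup, modelled loosely on the .idx layout of the time
(a 256-entry cumulative fanout table keyed on the first SHA-1 byte,
followed by a sorted array of entries). Even with the fanout
narrowing the range, the log2(n) binary-search probes land on pages
scattered across the whole table, which is why even a handful of
random lookups can page in much of a large index:

#include <stdint.h>
#include <string.h>

struct idx_entry {
	uint32_t offset;	/* offset of the object in the pack */
	unsigned char sha1[20];
};

/* fanout[b] = number of entries whose first SHA-1 byte is <= b */
static int find_in_index(const uint32_t fanout[256],
			 const struct idx_entry *entries,
			 const unsigned char *sha1)
{
	uint32_t lo = sha1[0] ? fanout[sha1[0] - 1] : 0;
	uint32_t hi = fanout[sha1[0]];
	while (lo < hi) {
		uint32_t mi = lo + (hi - lo) / 2;
		int cmp = memcmp(entries[mi].sha1, sha1, 20);
		if (!cmp)
			return (int)mi;	/* found */
		if (cmp < 0)
			lo = mi + 1;
		else
			hi = mi;
	}
	return -1;	/* provably absent from this pack */
}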
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck..
2007-04-03 21:28 ` Linus Torvalds
@ 2007-04-03 22:31 ` Junio C Hamano
2007-04-03 22:38 ` Shawn O. Pearce
2007-04-05 10:22 ` [PATCH 1/2] git-fetch--tool pick-rref Junio C Hamano
2007-04-05 10:22 ` [PATCH 2/2] git-fetch: use fetch--tool pick-rref to avoid local fetch from alternate Junio C Hamano
2 siblings, 1 reply; 58+ messages in thread
From: Junio C Hamano @ 2007-04-03 22:31 UTC (permalink / raw)
To: Linus Torvalds
Cc: Shawn O. Pearce, Nicolas Pitre, Chris Lee, Git Mailing List
Linus Torvalds <torvalds@linux-foundation.org> writes:
> Side note: with "alternates" files, you might well *always* have the
> objects. If you do
>
> git clone -l -s ...
>
> to create various branches, and then pull between them, you'll actually
> end up in the situation that you'll always find the objects and get back
> to the really expensive case..
Ah, that's true. If you "git clone -l -s A B", create new
objects in A and pull from B, the transfer would not exclude
new objects as they are not visible from B's refs.
In that scenario, the keep-pack behaviour is already worse than
the unpack-objects behaviour. The former creates a packfile
that duplicates objects that are in A while the latter, although
expensive, ends up doing nothing.
I wonder if we can have a backdoor to avoid any object transfer
in such a case to begin with...
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck..
2007-04-03 22:52 ` Linus Torvalds
@ 2007-04-03 22:31 ` David Lang
0 siblings, 0 replies; 58+ messages in thread
From: David Lang @ 2007-04-03 22:31 UTC (permalink / raw)
To: Linus Torvalds
Cc: Dana How, Nicolas Pitre, Shawn O. Pearce, Chris Lee,
Junio C Hamano, Git Mailing List
On Tue, 3 Apr 2007, Linus Torvalds wrote:
> On Tue, 3 Apr 2007, Dana How wrote:
>>
>> Larger and larger pack files make me nervous.
>> They are expensive to manipulate,
>> and >2GB requires a file format change.
>
> It sometimes also requires a new filesystem. There are a lot of
> filesystems that can handle more than 4GB total, but not necessarily in a
> single file.
>
> The only really useful such filesystem is probably FAT, which is still
> quite useful for things like USB memory sticks. But that is probably
> already worth supporting.
>
> So I think we want to support 64-bit (or at least something like 40+ bit)
> pack-files, but yes, I think that even if/when we support it, we still
> want to support the "multiple smaller pack-files" scenario exactly
> because for some uses it's much *better* to have ten 2GB packfiles rather
> than one 20GB pack-file.
however, for historical archives you may end up wanting a 2GB
packfile of 'recent' stuff, and a 14GB packfile of 'ancient' stuff (with the
large one built with all space-saving options turned up all the way, no matter
how much time it takes to build the pack)
David Lang
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck..
2007-04-03 22:07 ` Junio C Hamano
2007-04-03 22:11 ` Shawn O. Pearce
@ 2007-04-03 22:34 ` Nicolas Pitre
1 sibling, 0 replies; 58+ messages in thread
From: Nicolas Pitre @ 2007-04-03 22:34 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Linus Torvalds, Chris Lee, Git Mailing List
On Tue, 3 Apr 2007, Junio C Hamano wrote:
> Nicolas Pitre <nico@cam.org> writes:
>
> > Make it conditional on --stdin then. This covers all cases where we
> > really want the secure thing to happen, and the --stdin case already
> > performs the atomic rename-and-move thing when the pack is fully indexed.
>
> Repacking objects in a repository uses pack-objects without
> using index-pack, as you suggested to Chris. Is there a sane usage
> of index-pack that does not use --stdin? I cannot think of any.
>
> If there isn't, the "conditional on --stdin" suggestion means we
> unconditionally do the secure thing for all the sane usage, and
> go insecure for an insane usage that we do not really care about.
>
> If so, it seems to me that it would be simplest not to touch
> the code at all, except for that missing free().
That's exactly what I think as well.
> Am I missing something?
Not from my point of view.
Nicolas
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck..
2007-04-03 22:55 ` Nicolas Pitre
@ 2007-04-03 22:36 ` David Lang
2007-04-04 9:51 ` Alex Riesen
2007-04-03 23:29 ` Linus Torvalds
1 sibling, 1 reply; 58+ messages in thread
From: David Lang @ 2007-04-03 22:36 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Linus Torvalds, Chris Lee, Junio C Hamano, Git Mailing List
On Tue, 3 Apr 2007, Nicolas Pitre wrote:
> On Tue, 3 Apr 2007, Linus Torvalds wrote:
>
>> I don't care *what* it is conditional on, but your arguments suck. You
>> claim that it's not a normal case to already have the objects, when it
>> *is* a normal case for alternates, etc.
>>
>> I don't understand why you argue against hard numbers. You have none of
>> your own.
>
> Are hard numbers like 7% overhead (because right now that's all we have)
> really worth it against bad _perceptions_?
plus 1s overhead on what's otherwise a noop command.
> The keeping of fetched packs broke that presumption of trust towards
> local objects and it opened a real path for potential future attacks.
> Those attacks are still fairly theoretical of course. But for how
> _long_? Do we want GIT to be considered backdoor prone in a couple
> years from now just because we were obsessed by a 7% CPU overhead?
>
> I think we have much more to gain by playing it safe and being more
> secure and paranoid than trying to squeeze some CPU cycles out of an
> operation that is likely to always be bounded by network speed for most
> people.
this is why --paranoid should be left on for network pulls, but having it on
for local uses, where the cost isn't hidden by the network limits, isn't
good.
David Lang
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck..
2007-04-03 22:31 ` Junio C Hamano
@ 2007-04-03 22:38 ` Shawn O. Pearce
2007-04-03 22:41 ` Junio C Hamano
0 siblings, 1 reply; 58+ messages in thread
From: Shawn O. Pearce @ 2007-04-03 22:38 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Linus Torvalds, Nicolas Pitre, Chris Lee, Git Mailing List
Junio C Hamano <junkio@cox.net> wrote:
> Ah, that's true. If you "git clone -l -s A B", create new
> objects in A and pull from B, the transfer would not exclude
> new objects as they are not visible from B's refs.
>
> In that scenario, the keep-pack behaviour is already worse than
> the unpack-objects behaviour. The former creates a packfile
> that duplicates objects that are in A while the latter, although
> expensive, ends up doing nothing.
>
> I wonder if we can have a backdoor to avoid any object transfer
> in such a case to begin with...
Yea, symlink to the corresponding refs directory of the alternate
ODB. Then the refs will be visible. ;-)
--
Shawn.
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck..
2007-04-03 21:34 ` git-index-pack really does suck Nicolas Pitre
2007-04-03 21:37 ` Shawn O. Pearce
@ 2007-04-03 22:40 ` Dana How
2007-04-03 22:52 ` Linus Torvalds
2007-04-03 23:00 ` Nicolas Pitre
1 sibling, 2 replies; 58+ messages in thread
From: Dana How @ 2007-04-03 22:40 UTC (permalink / raw)
To: Nicolas Pitre
Cc: Shawn O. Pearce, Linus Torvalds, Chris Lee, Junio C Hamano,
Git Mailing List
On 4/3/07, Nicolas Pitre <nico@cam.org> wrote:
> On Tue, 3 Apr 2007, Shawn O. Pearce wrote:
> > Right. But maybe we shouldn't be scanning for packfiles every
> > time we don't find a loose object. Especially if the caller is in
> > a context where we actually *expect* to not find said object like
> > half of the time... say in git-add/update-index. ;-)
>
> First, I truly believe we should have a 64-bit pack index and fewer
> larger packs than many small packs.
>
> Which leaves us with the actual pack index lookup. At that point the
> cost of finding an existing object and finding that a given object
> doesn't exist is about the same thing, isn't it?
>
> Optimizing that lookup is going to benefit both cases.
Do you get what you want if you move to fewer larger INDEX files
but not pack files -- in the extreme, one large index file?
A "super index" could be built from multiple .idx files.
This would be a new file (format) unencumbered by the past,
so it could be tried out more quickly.
Just like objects are pruned when packed,
.idx files could be pruned when the super index is built.
Perhaps a number of (<2GB) packfiles and a large index
file could work out.
Larger and larger pack files make me nervous.
They are expensive to manipulate,
and >2GB requires a file format change.
Thanks,
--
Dana L. How danahow@gmail.com +1 650 804 5991 cell
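
One way to picture the super index idea above (a hypothetical format
for illustration, not anything git shipped): a single table sorted by
SHA-1 over the objects of all packs, where each entry also records
which pack holds the object, so one binary search replaces one search
per .idx file:

#include <stdint.h>

struct super_idx_entry {
	unsigned char sha1[20];	/* the table is sorted on this key */
	uint16_t pack_id;	/* index into a table of pack names */
	uint64_t offset;	/* 64-bit, so >2GB packs fit without
				 * touching the pack format itself */
};

Just as packing prunes loose objects, rebuilding this table on repack
could retire the per-pack .idx files it subsumes.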
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck..
2007-04-03 22:38 ` Shawn O. Pearce
@ 2007-04-03 22:41 ` Junio C Hamano
0 siblings, 0 replies; 58+ messages in thread
From: Junio C Hamano @ 2007-04-03 22:41 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Linus Torvalds, Nicolas Pitre, Chris Lee, Git Mailing List
"Shawn O. Pearce" <spearce@spearce.org> writes:
>> I wonder if we can have a backdoor to avoid any object transfer
>> in such a case to begin with...
>
> Yea, symlink to the corresponding refs directory of the alternate
> ODB. Then the refs will be visible. ;-)
I was thinking about going the other way. When git-fetch is
asked to fetch over a local transport, it could check if the
source is one of its alternate object stores.
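
A minimal sketch of that check (illustrative only, not the code
git-fetch later grew; for brevity it ignores the fact that alternates
entries may be relative to the objects directory):

#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Does src_objects appear in our_objects/info/alternates? */
static int source_is_our_alternate(const char *our_objects,
				   const char *src_objects)
{
	char path[PATH_MAX], line[PATH_MAX], want[PATH_MAX], real[PATH_MAX];
	FILE *fp;
	int found = 0;
	if (!realpath(src_objects, want))
		return 0;
	snprintf(path, sizeof(path), "%s/info/alternates", our_objects);
	fp = fopen(path, "r");
	if (!fp)
		return 0;	/* no alternates file at all */
	while (fgets(line, sizeof(line), fp)) {
		line[strcspn(line, "\n")] = '\0';
		if (*line && realpath(line, real) && !strcmp(real, want)) {
			found = 1;
			break;
		}
	}
	fclose(fp);
	return found;
}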
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck..
2007-04-03 21:00 ` Linus Torvalds
2007-04-03 21:28 ` Nicolas Pitre
@ 2007-04-03 22:49 ` Chris Lee
2007-04-03 23:12 ` Linus Torvalds
1 sibling, 1 reply; 58+ messages in thread
From: Chris Lee @ 2007-04-03 22:49 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Junio C Hamano, Nicolas Pitre, Git Mailing List
On 4/3/07, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Tue, 3 Apr 2007, Junio C Hamano wrote:
> > So maybe we should retitle this thread from "git-index-pack
> > really does suck.." to "I used git-index-pack in a stupid way"?
Well, I never claimed to be a genius. :)
> See my separate timing numbers, although I bet that Chris can give even
> better ones..
>
> Chris, try applying my patch, and then inside the KDE repo you have, do
>
> git index-pack --paranoid --stdin --fix-thin new.pack < ~/git/.git/objects/pack/pack-*.pack
>
> (ie index the objects of the *git* repository, not the KDE one). That
> should approximate doing a fair-sized "git pull" - getting new objects. Do
> it with and without --paranoid, and time it.
% time git-index-pack --paranoid --stdin --fix-thin paranoid.pack < /usr/local/src/git/.git/objects/pack/*pack
pack bf8ba7897da9c84d1981ecdc92c0b1979506a4b9
git-index-pack --paranoid --stdin --fix-thin paranoid.pack <  5.28s user 0.24s system 98% cpu 5.592 total
% time git-index-pack --stdin --fix-thin trusting.pack < /usr/local/src/git/.git/objects/pack/*pack
pack bf8ba7897da9c84d1981ecdc92c0b1979506a4b9
git-index-pack --stdin --fix-thin trusting.pack <  5.07s user 0.12s system 99% cpu 5.202 total
So, in my case, at least... not really much of a difference, which is puzzling.
> I bet that what I see as a 7% slowdown will be much bigger for you, just
> because the negative lookups will be all that much more expensive when you
> have tons of objects.
I applied exactly the patch you sent, and it applied perfectly cleanly
- no failures.
I also mailed out the DVD with the repo on it to hpa today, so
hopefully by tomorrow he'll get it. (He's not even two cities over,
and I suspect I could have just driven it to his place, but that might
have been a little awkward since I've never met him.)
Anyway, so, hopefully once he gets it he can put it up somewhere that
you guys can grab it. For reference, the KDE repo is pretty big, but a
"real" conversion of the repo would be bigger; the one that I've been
playing with only has the KDE svn trunk, and only the first 409k
revisions - there are, as of right now, over 650k revisions in KDE's
svn repo. So, realistically speaking, a fully-converted KDE git repo
would probably take up at least 6GB, packed, if not more. Subproject
support would probably be *really* helpful to mitigate that.
-clee
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck..
2007-04-03 22:40 ` Dana How
@ 2007-04-03 22:52 ` Linus Torvalds
2007-04-03 22:31 ` David Lang
2007-04-03 23:00 ` Nicolas Pitre
1 sibling, 1 reply; 58+ messages in thread
From: Linus Torvalds @ 2007-04-03 22:52 UTC (permalink / raw)
To: Dana How
Cc: Nicolas Pitre, Shawn O. Pearce, Chris Lee, Junio C Hamano,
Git Mailing List
On Tue, 3 Apr 2007, Dana How wrote:
>
> Larger and larger pack files make me nervous.
> They are expensive to manipulate,
> and >2GB requires a file format change.
It sometimes also requires a new filesystem. There are a lot of
filesystems that can handle more than 4GB total, but not necessarily in a
single file.
The only really useful such filesystem is probably FAT, which is still
quite useful for things like USB memory sticks. But that is probably
already worth supporting.
So I think we want to support 64-bit (or at least something like 40+ bit)
pack-files, but yes, I think that even if/when we support it, we still
want to support the "multiple smaller pack-files" scenario exactly
because for some uses it's much *better* to have ten 2GB packfiles rather
than one 20GB pack-file.
Linus
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck..
2007-04-03 22:14 ` Linus Torvalds
@ 2007-04-03 22:55 ` Nicolas Pitre
2007-04-03 22:36 ` David Lang
2007-04-03 23:29 ` Linus Torvalds
0 siblings, 2 replies; 58+ messages in thread
From: Nicolas Pitre @ 2007-04-03 22:55 UTC (permalink / raw)
To: Linus Torvalds; +Cc: Chris Lee, Junio C Hamano, Git Mailing List
On Tue, 3 Apr 2007, Linus Torvalds wrote:
> I don't care *what* it is conditional on, but your arguments suck. You
> claim that it's not a normal case to already have the objects, when it
> *is* a normal case for alternates, etc.
>
> I don't understand why you argue against hard numbers. You have none of
> your own.
Are hard numbers like 7% overhead (because right now that's all we have)
really worth it against bad _perceptions_?
Sure, the SHA1 collision attack is paranoia. But it is becoming
increasingly *possible*.
And when we only had unpack-objects on the receiving end of a fetch, you
yourself bragged about the implied security of GIT in the presence of a
SHA1 collision attack. Because let's admit it: when a SHA1 collision
happens, it is way more probable to come on purpose than from pure
accident. But as you said at the time, it is not a problem because GIT
trusts local objects more than remote ones and, incidentally,
unpack-objects doesn't overwrite existing objects.
The keeping of fetched packs broke that presumption of trust towards
local objects and it opened a real path for potential future attacks.
Those attacks are still fairly theoretical of course. But for how
_long_? Do we want GIT to be considered backdoor prone in a couple
years from now just because we were obsessed by a 7% CPU overhead?
I think we have much more to gain by playing it safe and being more
secure and paranoid than trying to squeeze some CPU cycles out of an
operation that is likely to always be bounded by network speed for most
people.
And we _know_ that the operation can be optimized further anyway.
So IMHO in this case hard numbers alone aren't the end of it. Not as
long as they're reasonably low. And especially not for a command which
is 1) rather infrequent and 2) not really interactive like git-log might
be.
Nicolas
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck..
2007-04-03 22:40 ` Dana How
2007-04-03 22:52 ` Linus Torvalds
@ 2007-04-03 23:00 ` Nicolas Pitre
1 sibling, 0 replies; 58+ messages in thread
From: Nicolas Pitre @ 2007-04-03 23:00 UTC (permalink / raw)
To: Dana How
Cc: Shawn O. Pearce, Linus Torvalds, Chris Lee, Junio C Hamano,
Git Mailing List
On Tue, 3 Apr 2007, Dana How wrote:
> Do you get what you want if you move to fewer larger INDEX files
> but not pack files -- in the extreme, one large index file?
No, that doesn't cleanly solve the clone/fetch of large amounts of data.
> A "super index" could be built from multiple .idx files.
> This would be a new file (format) unencumbered by the past,
> so it could be tried out more quickly.
> Just like objects are pruned when packed,
> .idx files could be pruned when the super index is built.
This is an idea to consider independently of any other issues.
> Perhaps a number of (<2GB) packfiles and a large index
> file could work out.
>
> Larger and larger pack files make me nervous.
> They are expensive to manipulate,
> and >2GB requires a file format change.
No. Larger pack files are not more expensive than the same set of
objects spread into multiple packs. In fact I'd say it's quite the
opposite. And larger pack files do not require a pack format change --
it's just the index that has to change and the index is a local matter.
Nicolas
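
A sketch of why this can be an index-only change (one possible
scheme, shown for illustration; git had no such index format at the
time): keep the common 32-bit offsets, and when an object lives past
the 2^31 boundary, set the high bit and store an index into a side
table of full 64-bit offsets. The pack file itself never changes:

#include <stdint.h>

/*
 * off32 comes from the per-object offset column of the index;
 * large_offsets is a separate table holding full 64-bit offsets for
 * the few objects that sit beyond the 2GB mark in the pack.
 */
static uint64_t entry_offset(uint32_t off32, const uint64_t *large_offsets)
{
	if (off32 & 0x80000000u)	/* high bit set: spilled entry */
		return large_offsets[off32 & 0x7fffffffu];
	return off32;			/* ordinary small offset */
}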
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck..
2007-04-03 22:49 ` Chris Lee
@ 2007-04-03 23:12 ` Linus Torvalds
0 siblings, 0 replies; 58+ messages in thread
From: Linus Torvalds @ 2007-04-03 23:12 UTC (permalink / raw)
To: Chris Lee; +Cc: Junio C Hamano, Nicolas Pitre, Git Mailing List
On Tue, 3 Apr 2007, Chris Lee wrote:
>
> git-index-pack --paranoid --stdin --fix-thin paranoid.pack <  5.28s user 0.24s system 98% cpu 5.592 total
>
> git-index-pack --stdin --fix-thin trusting.pack <  5.07s user 0.12s system 99% cpu 5.202 total
Ok, that's not a big enough difference to care.
> So, in my case, at least... not really much of a difference, which is
> puzzling.
It's entirely possible that the object lookup is good enough to not be a
problem even for huge packs, and it really only gets to be a problem when
you actually unpack all the objects.
In that case, the only real case to worry about is indeed the "alternates"
case (or if people actually use a shared git object directory, but I don't
think anybody really does - alternates just work well enough, and shared
object directories are painful enough that I doubt anybody *really* uses
them).
> I also mailed out the DVD with the repo on it to hpa today, so
> hopefully by tomorrow he'll get it. (He's not even two cities over,
> and I suspect I could have just driven it to his place, but that might
> have been a little awkward since I've never met him.)
Heh. Ok, good. I'll torrent it or something when it's up.
> Anyway, so, hopefully once he gets it he can put it up somewhere that
> you guys can grab it. For reference, the KDE repo is pretty big, but a
> "real" conversion of the repo would be bigger; the one that I've been
> playing with only has the KDE svn trunk, and only the first 409k
> revisions - there are, as of right now, over 650k revisions in KDE's
> svn repo. So, realistically speaking, a fully-converted KDE git repo
> would probably take up at least 6GB, packed, if not more. Subproject
> support would probably be *really* helpful to mitigate that.
Sure. I think subproject support is likely the big "missing feature" of
git right now. The rest is "details", even if they can be big and involved
details.
But even at only 409k revisions, it's still going to be an order of
magnitude bigger than what the kernel is, exactly *because* it's such a
disaster from a maintenance setup standpoint, and it's going to be a
useful real-world test-case. So whether that is a "good" git archive or
not, it's going to be useful.
Long ago we used to be able to look at the historic Linux archive as an
example of a "big" archive, but it's not actually all that much bigger
than the normal Linux archive any more, and we've pretty much fixed the
problems we used to have.
[ The historical pack-file is actually smaller, but that's because it was
done with a much deeper delta-chain to make it small: the historical
archive still has more objects in it than the current active git kernel
tree - but it's only in the 20% range, not "20 *times* bigger" ]
The Eclipse tree was useful (and I think we already improved performance
for you thanks to working with it - I don't know how much faster the
delta-base cache made things for you, but I'd assume it was at *least* by
the factor-of-2.5 that we saw on Eclipse), but the KDE tree is bigger *and*
deeper (the Eclipse tree is 1.7GB with 136k revisions in the main branch,
so the KDE tree has more than twice the revisions).
Linus
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck..
2007-04-03 22:55 ` Nicolas Pitre
2007-04-03 22:36 ` David Lang
@ 2007-04-03 23:29 ` Linus Torvalds
1 sibling, 0 replies; 58+ messages in thread
From: Linus Torvalds @ 2007-04-03 23:29 UTC (permalink / raw)
To: Nicolas Pitre; +Cc: Chris Lee, Junio C Hamano, Git Mailing List
On Tue, 3 Apr 2007, Nicolas Pitre wrote:
>
> Are hard numbers like 7% overhead (because right now that's all we have)
> really worth it against bad _perceptions_?
If it actually stays at just 7% even with large repos (and the numbers
from Chris seem to say that it doesn't get worse - in fact, it may be that
the lookup gets relatively more efficient for a large repo thanks to the
log(n) costs), I agree that 7% probably isn't worth worrying about when
weighed against "guaranteed no SHA1 collision". Especially as long as
you'd normally only hit it when your real performance issue is going to be
the network.
So especially if we can make sure that the *local* case is ok, where the
network isn't going to be the bottleneck, I think we can/should do the
paranoia.
That's especially true as it is also the local case where the 7% has
already been shown to be just the best case, with the worst case being
many hundred percent (and memory use going up from 55M to 280M in one
example), thanks to us actually *finding* the objects.
Linus
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck..
2007-04-03 22:36 ` David Lang
@ 2007-04-04 9:51 ` Alex Riesen
[not found] ` <Pine.LNX.4.63.0704061455380.24050@qynat.qvtvafvgr.pbz>
2007-04-06 21:56 ` David Lang
0 siblings, 2 replies; 58+ messages in thread
From: Alex Riesen @ 2007-04-04 9:51 UTC (permalink / raw)
To: David Lang
Cc: Nicolas Pitre, Linus Torvalds, Chris Lee, Junio C Hamano,
Git Mailing List
On 4/4/07, David Lang <david.lang@digitalinsight.com> wrote:
>
> > The keeping of fetched packs broke that presumption of trust towards
> > local objects and it opened a real path for potential future attacks.
> > Those attacks are still fairly theoretical of course. But for how
> > _long_? Do we want GIT to be considered backdoor prone in a couple
> > years from now just because we were obsessed by a 7% CPU overhead?
> >
> > I think we have much more to gain by playing it safe and being more
> > secure and paranoid than trying to squeeze some CPU cycles out of an
> > operation that is likely to always be bounded by network speed for most
> > people.
>
> this is why --paranoid should be left on for network pulls, but having it on
> for local uses, where the cost isn't hidden by the network limits, isn't
> good.
You never know which pull is networked (or should I say: remote enough
to cause a collision).
^ permalink raw reply [flat|nested] 58+ messages in thread
* [PATCH 1/2] git-fetch--tool pick-rref
2007-04-03 21:28 ` Linus Torvalds
2007-04-03 22:31 ` Junio C Hamano
@ 2007-04-05 10:22 ` Junio C Hamano
2007-04-05 10:22 ` [PATCH 2/2] git-fetch: use fetch--tool pick-rref to avoid local fetch from alternate Junio C Hamano
2 siblings, 0 replies; 58+ messages in thread
From: Junio C Hamano @ 2007-04-05 10:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Shawn O. Pearce, Nicolas Pitre, Chris Lee, Git Mailing List
This script helper takes a list of fully qualified refnames and the
results from ls-remote, and grabs only the lines for the named
refs from the latter.
Signed-off-by: Junio C Hamano <junkio@cox.net>
---
Linus Torvalds <torvalds@linux-foundation.org> writes:
> On Tue, 3 Apr 2007, Linus Torvalds wrote:
>>
>> Yes, we could definitely skip the re-lookup if we had a "don't really
>> care, I can recreate the object myself" flag (ie anybody who is going to
>> write that object)
>
> Side note: with "alternates" files, you might well *always* have the
> objects. If you do
>
> git clone -l -s ...
>
> to create various branches, and then pull between them, you'll actually
> end up in the situation that you'll always find the objects and get back
> to the really expensive case..
Ah, that's true. If you "git clone -l -s A B", create new
objects in A and pull from B, the transfer would not exclude
new objects as they are not visible from B's refs.
In that scenario, the keep-pack behaviour is already worse than
the unpack-objects behaviour. The former creates a packfile
that duplicates objects that are in A while the latter, although
expensive, ends up doing nothing.
I wonder if we can have a backdoor to avoid any object transfer
in such a case to begin with...
... and this two patch series does exactly that.
builtin-fetch--tool.c | 84 +++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 84 insertions(+), 0 deletions(-)
diff --git a/builtin-fetch--tool.c b/builtin-fetch--tool.c
index e9d6764..be341c1 100644
--- a/builtin-fetch--tool.c
+++ b/builtin-fetch--tool.c
@@ -436,10 +436,87 @@ static int expand_refs_wildcard(const char *ls_remote_result, int numrefs,
return 0;
}
+static int pick_rref(int sha1_only, const char *rref, const char *ls_remote_result)
+{
+ int err = 0;
+ int lrr_count = lrr_count, i, pass;
+ const char *cp;
+ struct lrr {
+ const char *line;
+ const char *name;
+ int namelen;
+ int shown;
+ } *lrr_list = lrr_list;
+
+ for (pass = 0; pass < 2; pass++) {
+ /* pass 0 counts and allocates, pass 1 fills... */
+ cp = ls_remote_result;
+ i = 0;
+ while (1) {
+ const char *np;
+ while (*cp && isspace(*cp))
+ cp++;
+ if (!*cp)
+ break;
+ np = strchr(cp, '\n');
+ if (!np)
+ np = cp + strlen(cp);
+ if (pass) {
+ lrr_list[i].line = cp;
+ lrr_list[i].name = cp + 41;
+ lrr_list[i].namelen = np - (cp + 41);
+ }
+ i++;
+ cp = np;
+ }
+ if (!pass) {
+ lrr_count = i;
+ lrr_list = xcalloc(lrr_count, sizeof(*lrr_list));
+ }
+ }
+
+ while (1) {
+ const char *next;
+ int rreflen;
+ int i;
+
+ while (*rref && isspace(*rref))
+ rref++;
+ if (!*rref)
+ break;
+ next = strchr(rref, '\n');
+ if (!next)
+ next = rref + strlen(rref);
+ rreflen = next - rref;
+
+ for (i = 0; i < lrr_count; i++) {
+ struct lrr *lrr = &(lrr_list[i]);
+
+ if (rreflen == lrr->namelen &&
+ !memcmp(lrr->name, rref, rreflen)) {
+ if (!lrr->shown)
+ printf("%.*s\n",
+ sha1_only ? 40 : lrr->namelen + 41,
+ lrr->line);
+ lrr->shown = 1;
+ break;
+ }
+ }
+ if (lrr_count <= i) {
+ error("pick-rref: %.*s not found", rreflen, rref);
+ err = 1;
+ }
+ rref = next;
+ }
+ free(lrr_list);
+ return err;
+}
+
int cmd_fetch__tool(int argc, const char **argv, const char *prefix)
{
int verbose = 0;
int force = 0;
+ int sopt = 0;
while (1 < argc) {
const char *arg = argv[1];
@@ -447,6 +524,8 @@ int cmd_fetch__tool(int argc, const char **argv, const char *prefix)
verbose = 1;
else if (!strcmp("-f", arg))
force = 1;
+ else if (!strcmp("-s", arg))
+ sopt = 1;
else
break;
argc--;
@@ -491,6 +570,11 @@ int cmd_fetch__tool(int argc, const char **argv, const char *prefix)
reflist = get_stdin();
return parse_reflist(reflist);
}
+ if (!strcmp("pick-rref", argv[1])) {
+ if (argc != 4)
+ return error("pick-rref takes 2 args");
+ return pick_rref(sopt, argv[2], argv[3]);
+ }
if (!strcmp("expand-refs-wildcard", argv[1])) {
const char *reflist;
if (argc < 4)
--
1.5.1.45.g1ddb
^ permalink raw reply related [flat|nested] 58+ messages in thread
* [PATCH 2/2] git-fetch: use fetch--tool pick-rref to avoid local fetch from alternate
2007-04-03 21:28 ` Linus Torvalds
2007-04-03 22:31 ` Junio C Hamano
2007-04-05 10:22 ` [PATCH 1/2] git-fetch--tool pick-rref Junio C Hamano
@ 2007-04-05 10:22 ` Junio C Hamano
2007-04-05 16:15 ` Shawn O. Pearce
2 siblings, 1 reply; 58+ messages in thread
From: Junio C Hamano @ 2007-04-05 10:22 UTC (permalink / raw)
To: Linus Torvalds
Cc: Shawn O. Pearce, Nicolas Pitre, Chris Lee, Git Mailing List
When we are fetching from a repository that is on a local
filesystem, first check if we have all the objects that we are
going to fetch available locally, by not just checking the tips
of what we are fetching, but with a full reachability analysis
to our existing refs. In such a case, we do not have to run
git-fetch-pack which would send many needless objects. This is
especially true when the other repository is an alternate of the
current repository (e.g. perhaps the repository was created by
running "git clone -l -s" from there).
The useless objects transferred used to be discarded when they
were expanded by git-unpack-objects called from git-fetch-pack,
but recent git-fetch-pack prefers to keep the data it receives
from the other end without exploding them into loose objects,
resulting in a pack full of duplicated data when fetching from
your own alternate.
This also uses fetch--tool pick-rref on dumb transport side to
remove a shell loop to do the same.
Signed-off-by: Junio C Hamano <junkio@cox.net>
---
* Strictly speaking, there is no need to even check if $remote
is a local directory for this to operate properly, as
rev-list would barf and die as soon as it finds something
unavailable, while limiting the traversal to stop immediately
after it hits what are known to be reachable locally. On the
other hand, if we really want to limit this to the case of a
repository with an alternate to "clone -l -s" origin, we
could add 'test -f "$GIT_OBJECT_DIRECTORY/info/alternates"',
but I chose not to.
git-fetch.sh | 41 ++++++++++++++++++++++++++++-------------
1 files changed, 28 insertions(+), 13 deletions(-)
diff --git a/git-fetch.sh b/git-fetch.sh
index fd70696..5dc3063 100755
--- a/git-fetch.sh
+++ b/git-fetch.sh
@@ -173,9 +173,32 @@ fetch_all_at_once () {
git-bundle unbundle "$remote" $rref ||
echo failed "$remote"
else
- git-fetch-pack --thin $exec $keep $shallow_depth $no_progress \
- "$remote" $rref ||
- echo failed "$remote"
+ if test -d "$remote" &&
+
+ # The remote might be our alternate. With
+ # this optimization we will bypass fetch-pack
+ # altogether, which means we cannot be doing
+ # the shallow stuff at all.
+ test ! -f "$GIT_DIR/shallow" &&
+ test -z "$shallow_depth" &&
+
+ # See if all of what we are going to fetch are
+ # connected to our repository's tips, in which
+ # case we do not have to do any fetch.
+ theirs=$(git-fetch--tool -s pick-rref \
+ "$rref" "$ls_remote_result") &&
+
+ # This will barf when $theirs reach an object that
+ # we do not have in our repository. Otherwise,
+ # we already have everything the fetch would bring in.
+ git-rev-list --objects $theirs --not --all 2>/dev/null
+ then
+ git-fetch--tool pick-rref "$rref" "$ls_remote_result"
+ else
+ git-fetch-pack --thin $exec $keep $shallow_depth \
+ $no_progress "$remote" $rref ||
+ echo failed "$remote"
+ fi
fi
) |
(
@@ -235,16 +258,8 @@ fetch_per_ref () {
fi
# Find $remote_name from ls-remote output.
- head=$(
- IFS=' '
- echo "$ls_remote_result" |
- while read sha1 name
- do
- test "z$name" = "z$remote_name" || continue
- echo "$sha1"
- break
- done
- )
+ head=$(git-fetch--tool -s pick-rref \
+ "$remote_name" "$ls_remote_result")
expr "z$head" : "z$_x40\$" >/dev/null ||
die "No such ref $remote_name at $remote"
echo >&2 "Fetching $remote_name from $remote using $proto"
--
1.5.1.45.g1ddb
^ permalink raw reply related [flat|nested] 58+ messages in thread
* Re: [PATCH 2/2] git-fetch: use fetch--tool pick-rref to avoid local fetch from alternate
2007-04-05 10:22 ` [PATCH 2/2] git-fetch: use fetch--tool pick-rref to avoid local fetch from alternate Junio C Hamano
@ 2007-04-05 16:15 ` Shawn O. Pearce
2007-04-05 21:37 ` Junio C Hamano
0 siblings, 1 reply; 58+ messages in thread
From: Shawn O. Pearce @ 2007-04-05 16:15 UTC (permalink / raw)
To: Junio C Hamano; +Cc: Linus Torvalds, Nicolas Pitre, Chris Lee, Git Mailing List
Junio C Hamano <junkio@cox.net> wrote:
> + # This will barf when $theirs reach an object that
> + # we do not have in our repository. Otherwise,
> + # we already have everything the fetch would bring in.
> + git-rev-list --objects $theirs --not --all 2>/dev/null
OK, I must be missing something here.
That rev-list is going to print out the SHA-1s for the objects we
would have copied, but didn't, isn't it? So fetch--tool native-store
is going to get a whole lot of SHA-1s it doesn't want to see, right?
Otherwise this is a nice trick. It doesn't assure us that after the
fetch those objects are still in the alternate. Meaning someone
could run prune in the alternate between the rev-list and the
native-store, and whack these objects. Given how small of a window
it is, and the improvements this brings to alternates, I say its
worth that small downside. Just don't prune while fetching. ;-)
--
Shawn.
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: [PATCH 2/2] git-fetch: use fetch--tool pick-rref to avoid local fetch from alternate
2007-04-05 16:15 ` Shawn O. Pearce
@ 2007-04-05 21:37 ` Junio C Hamano
0 siblings, 0 replies; 58+ messages in thread
From: Junio C Hamano @ 2007-04-05 21:37 UTC (permalink / raw)
To: Shawn O. Pearce
Cc: Linus Torvalds, Nicolas Pitre, Chris Lee, Git Mailing List
"Shawn O. Pearce" <spearce@spearce.org> writes:
> Junio C Hamano <junkio@cox.net> wrote:
>> + # This will barf when $theirs reach an object that
>> + # we do not have in our repository. Otherwise,
>> + # we already have everything the fetch would bring in.
>> + git-rev-list --objects $theirs --not --all 2>/dev/null
>
> OK, I must be missing something here.
>
> That rev-list is going to print out the SHA-1s for the objects we
> would have copied, but didn't, isn't it? So fetch--tool native-store
> is going to get a whole lot of SHA-1s it doesn't want to see, right?
True. We should send the standard output also to /dev/null.
> Otherwise this is a nice trick. It doesn't assure us that after the
> fetch those objects are still in the alternate. Meaning someone
> could run prune in the alternate between the rev-list and the
> native-store, and whack these objects. Given how small of a window
> it is, and the improvements this brings to alternates, I say its
> worth that small downside. Just don't prune while fetching. ;-)
That is "don't prune after making an alternate that depends on
you" in general. Without this patch, and without the keep
(i.e. when the fetch is very small and the transferred pack is
given to unpack-objects) you would have depended on the
alternates for those objects anyway.
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck..
2007-04-04 9:51 ` Alex Riesen
[not found] ` <Pine.LNX.4.63.0704061455380.24050@qynat.qvtvafvgr.pbz>
@ 2007-04-06 21:56 ` David Lang
2007-04-06 22:47 ` Junio C Hamano
1 sibling, 1 reply; 58+ messages in thread
From: David Lang @ 2007-04-06 21:56 UTC (permalink / raw)
To: Alex Riesen
Cc: Nicolas Pitre, Linus Torvalds, Chris Lee, Junio C Hamano,
Git Mailing List
On Wed, 4 Apr 2007, Alex Riesen wrote:
> On 4/4/07, David Lang <david.lang@digitalinsight.com> wrote:
>>
>> > The keeping of fetched packs broke that presumption of trust towards
>> > local objects and it opened a real path for potential future attacks.
>> > Those attacks are still fairly theoretical of course. But for how
>> > _long_? Do we want GIT to be considered backdoor prone in a couple
>> > years from now just because we were obsessed by a 7% CPU overhead?
>> >
>> > I think we have much more to gain by playing it safe and being more
>> > secure and paranoid than trying to squeeze some CPU cycles out of an
>> > operation that is likely to always be bounded by network speed for most
>> > people.
>>
>> this is why --paranoid should be left on for network pulls, but having it
>> on for local uses, where the cost isn't hidden by the network limits,
>> isn't good.
>
> You never know which pull is networked (or should I say: remote enough
> to cause a collision).
so leave it on for all pulls, but for other commands don't turn it on.
remember that the command that linus ran into at the start of the thread wasn't
a pull.
David Lang
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck..
2007-04-06 22:49 ` Junio C Hamano
@ 2007-04-06 22:22 ` David Lang
2007-04-06 22:55 ` Junio C Hamano
0 siblings, 1 reply; 58+ messages in thread
From: David Lang @ 2007-04-06 22:22 UTC (permalink / raw)
To: Junio C Hamano
Cc: Alex Riesen, Nicolas Pitre, Linus Torvalds, Chris Lee,
Git Mailing List
On Fri, 6 Apr 2007, Junio C Hamano wrote:
> Subject: Re: git-index-pack really does suck..
>
> Junio C Hamano <junkio@cox.net> writes:
>
>> David Lang <david.lang@digitalinsight.com> writes:
>>
>>> On Wed, 4 Apr 2007, Alex Riesen wrote:
>>> ...
>>>> You never know which pull is networked (or should I say: remote enough
>>>> to cause a collision).
>>>
>>> so leave it on for all pulls, but for other commands don't turn it on.
>>>
>>> remember that the command that linus ran into at the start of the
>>> thread wasn't a pull.
>>
>> Are you referring to this command
>>
>> $ git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack
>>
>> in this message?
>
> From: Linus Torvalds <torvalds@linux-foundation.org>
> Subject: git-index-pack really does suck..
> Date: Tue, 3 Apr 2007 08:15:12 -0700 (PDT)
> Message-ID: <Pine.LNX.4.64.0704030754020.6730@woody.linux-foundation.org>
>
> (sorry, chomped the message).
probably (I usually don't keep the mail after I read or reply to it)
David Lang
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck..
2007-04-06 22:55 ` Junio C Hamano
@ 2007-04-06 22:28 ` David Lang
0 siblings, 0 replies; 58+ messages in thread
From: David Lang @ 2007-04-06 22:28 UTC (permalink / raw)
To: Junio C Hamano
Cc: Alex Riesen, Nicolas Pitre, Linus Torvalds, Chris Lee,
Git Mailing List
On Fri, 6 Apr 2007, Junio C Hamano wrote:
> David Lang <david.lang@digitalinsight.com> writes:
>
>> On Fri, 6 Apr 2007, Junio C Hamano wrote:
>>
>>> Subject: Re: git-index-pack really does suck..
>>>
>>> Junio C Hamano <junkio@cox.net> writes:
>>>
>>>> David Lang <david.lang@digitalinsight.com> writes:
>>>>
>>>>> On Wed, 4 Apr 2007, Alex Riesen wrote:
>>>>> ...
>>>>>> You never know which pull is networked (or should I say: remote enough
>>>>>> to cause a collision).
>>>>>
>>>>> so leave it on for all pulls, but for other commands don't turn it on.
>>>>>
>>>>> remember that the command that linus ran into at the start of the
>>>>> thread wasn't a pull.
>>>>
>>>> Are you referring to this command
>>>>
>>>> $ git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack
>>>>
>>>> in this message?
>>>
>>> From: Linus Torvalds <torvalds@linux-foundation.org>
>>> Subject: git-index-pack really does suck..
>>> Date: Tue, 3 Apr 2007 08:15:12 -0700 (PDT)
>>> Message-ID: <Pine.LNX.4.64.0704030754020.6730@woody.linux-foundation.org>
>>>
>>> (sorry, chomped the message).
>>
>> probably (I usually don't keep the mail after I read or reply to it)
>
> Well, then you should remember that the command Linus ran into
> was pretty much about pull, nothing else.
>
> The quoted command was only to illustrate what 'git-pull'
> invokes internally. I cannot think of any reason to use that
> command for cases other than 'git-pull'. What's the use case
> you have in mind to run that command outside of the context of
> git-pull?
I guess I'm not remembering the thread accurately then (and/or am mixing it up
with a different thread). I thought that Linus had identified other cases that
were impacted (something about proving that the object doesn't exist).
David Lang
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck..
2007-04-06 21:56 ` David Lang
@ 2007-04-06 22:47 ` Junio C Hamano
2007-04-06 22:49 ` Junio C Hamano
0 siblings, 1 reply; 58+ messages in thread
From: Junio C Hamano @ 2007-04-06 22:47 UTC (permalink / raw)
To: David Lang
Cc: Alex Riesen, Nicolas Pitre, Linus Torvalds, Chris Lee,
Git Mailing List
David Lang <david.lang@digitalinsight.com> writes:
> On Wed, 4 Apr 2007, Alex Riesen wrote:
> ...
>> You never know which pull is networked (or should I say: remote enough
>> to cause a collision).
>
> so leave it on for all pulls, but for other commands don't turn it on.
>
> remember that the command that linus ran into at the start of the
> thread wasn't a pull.
Are you referring to this command
$ git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack
in this message?
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck..
2007-04-06 22:47 ` Junio C Hamano
@ 2007-04-06 22:49 ` Junio C Hamano
2007-04-06 22:22 ` David Lang
0 siblings, 1 reply; 58+ messages in thread
From: Junio C Hamano @ 2007-04-06 22:49 UTC (permalink / raw)
To: David Lang
Cc: Alex Riesen, Nicolas Pitre, Linus Torvalds, Chris Lee,
Git Mailing List
Junio C Hamano <junkio@cox.net> writes:
> David Lang <david.lang@digitalinsight.com> writes:
>
>> On Wed, 4 Apr 2007, Alex Riesen wrote:
>> ...
>>> You never know which pull is networked (or should I say: remote enough
>>> to cause a collision).
>>
>> so leave it on for all pulls, but for other commands don't turn it on.
>>
>> remember that the command that linus ran into at the start of the
>> thread wasn't a pull.
>
> Are you referring to this command
>
> $ git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack
>
> in this message?
From: Linus Torvalds <torvalds@linux-foundation.org>
Subject: git-index-pack really does suck..
Date: Tue, 3 Apr 2007 08:15:12 -0700 (PDT)
Message-ID: <Pine.LNX.4.64.0704030754020.6730@woody.linux-foundation.org>
(sorry, chomped the message).
^ permalink raw reply [flat|nested] 58+ messages in thread
* Re: git-index-pack really does suck..
2007-04-06 22:22 ` David Lang
@ 2007-04-06 22:55 ` Junio C Hamano
2007-04-06 22:28 ` David Lang
0 siblings, 1 reply; 58+ messages in thread
From: Junio C Hamano @ 2007-04-06 22:55 UTC (permalink / raw)
To: David Lang
Cc: Alex Riesen, Nicolas Pitre, Linus Torvalds, Chris Lee,
Git Mailing List
David Lang <david.lang@digitalinsight.com> writes:
> On Fri, 6 Apr 2007, Junio C Hamano wrote:
>
>> Subject: Re: git-index-pack really does suck..
>>
>> Junio C Hamano <junkio@cox.net> writes:
>>
>>> David Lang <david.lang@digitalinsight.com> writes:
>>>
>>>> On Wed, 4 Apr 2007, Alex Riesen wrote:
>>>> ...
>>>>> You never know which pull is networked (or should I say: remote enough
>>>>> to cause a collision).
>>>>
>>>> so leave it on for all pulls, but for other commands don't turn it on.
>>>>
>>>> remember that the command that linus ran into at the start of the
>>>> thread wasn't a pull.
>>>
>>> Are you referring to this command
>>>
>>> $ git index-pack --stdin --fix-thin new.pack < .git/objects/pack/pack-*.pack
>>>
>>> in this message?
>>
>> From: Linus Torvalds <torvalds@linux-foundation.org>
>> Subject: git-index-pack really does suck..
>> Date: Tue, 3 Apr 2007 08:15:12 -0700 (PDT)
>> Message-ID: <Pine.LNX.4.64.0704030754020.6730@woody.linux-foundation.org>
>>
>> (sorry, chomped the message).
>
> probably (I usually don't keep the mail after I read or reply to it)
Well, then you should remember that the command Linus ran into
was pretty much about pull, nothing else.
The quoted command was only to illustrate what 'git-pull'
invokes internally. I cannot think of any reason to use that
command for cases other than 'git-pull'. What's the use case
you have in mind to run that command outside of the context of
git-pull?
^ permalink raw reply [flat|nested] 58+ messages in thread
end of thread, other threads:[~2007-04-06 22:59 UTC | newest]
Thread overview: 58+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2007-04-03 15:15 git-index-pack really does suck Linus Torvalds
[not found] ` <Pine.LNX.4.64.0704031413200.6730@woody.linux-foundation.org>
[not found] ` <alpine.LFD.0.98.0704031836350.28181@xanadu.home>
[not found] ` <db69205d0704031227q1009eabfhdd82aa3636f25bb6@mail.gmail.com>
[not found] ` <Pine.LNX.4.64.0704031304420.6730@woody.linux-foundation.org>
[not found] ` <Pine.LNX.4.64.0704031322490.6730@woody.linux-foundation.org>
2007-04-03 16:21 ` Linus Torvalds
2007-04-03 16:40 ` Nicolas Pitre
2007-04-03 16:33 ` Nicolas Pitre
2007-04-03 19:27 ` Chris Lee
2007-04-03 19:49 ` Nicolas Pitre
2007-04-03 19:54 ` Chris Lee
2007-04-03 20:18 ` Linus Torvalds
2007-04-03 20:32 ` Nicolas Pitre
2007-04-03 20:40 ` Junio C Hamano
2007-04-03 21:00 ` Linus Torvalds
2007-04-03 21:28 ` Nicolas Pitre
2007-04-03 22:49 ` Chris Lee
2007-04-03 23:12 ` Linus Torvalds
2007-04-03 20:56 ` Linus Torvalds
2007-04-03 21:03 ` Shawn O. Pearce
2007-04-03 21:13 ` Linus Torvalds
2007-04-03 21:17 ` Shawn O. Pearce
2007-04-03 21:26 ` Linus Torvalds
2007-04-03 21:28 ` Linus Torvalds
2007-04-03 22:31 ` Junio C Hamano
2007-04-03 22:38 ` Shawn O. Pearce
2007-04-03 22:41 ` Junio C Hamano
2007-04-05 10:22 ` [PATCH 1/2] git-fetch--tool pick-rref Junio C Hamano
2007-04-05 10:22 ` [PATCH 2/2] git-fetch: use fetch--tool pick-rref to avoid local fetch from alternate Junio C Hamano
2007-04-05 16:15 ` Shawn O. Pearce
2007-04-05 21:37 ` Junio C Hamano
2007-04-03 21:34 ` git-index-pack really does suck Nicolas Pitre
2007-04-03 21:37 ` Shawn O. Pearce
2007-04-03 21:44 ` Junio C Hamano
2007-04-03 21:53 ` Shawn O. Pearce
2007-04-03 22:10 ` Jeff King
2007-04-03 22:40 ` Dana How
2007-04-03 22:52 ` Linus Torvalds
2007-04-03 22:31 ` David Lang
2007-04-03 23:00 ` Nicolas Pitre
2007-04-03 21:21 ` Nicolas Pitre
2007-04-03 20:33 ` Linus Torvalds
2007-04-03 21:05 ` Nicolas Pitre
2007-04-03 21:11 ` Shawn O. Pearce
2007-04-03 21:24 ` Linus Torvalds
[not found] ` <alpine.LFD.0.98.0704031735470.28181@xanadu.home>
2007-04-03 21:42 ` Nicolas Pitre
2007-04-03 22:07 ` Junio C Hamano
2007-04-03 22:11 ` Shawn O. Pearce
2007-04-03 22:34 ` Nicolas Pitre
2007-04-03 22:14 ` Linus Torvalds
2007-04-03 22:55 ` Nicolas Pitre
2007-04-03 22:36 ` David Lang
2007-04-04 9:51 ` Alex Riesen
[not found] ` <Pine.LNX.4.63.0704061455380.24050@qynat.qvtvafvgr.pbz>
2007-04-06 21:56 ` David Lang
2007-04-06 22:47 ` Junio C Hamano
2007-04-06 22:49 ` Junio C Hamano
2007-04-06 22:22 ` David Lang
2007-04-06 22:55 ` Junio C Hamano
2007-04-06 22:28 ` David Lang
2007-04-03 23:29 ` Linus Torvalds
2007-04-03 20:34 ` Junio C Hamano
2007-04-03 20:53 ` Nicolas Pitre
This is a public inbox; see mirroring instructions
for how to clone and mirror all data and code used for this inbox,
as well as URLs for NNTP newsgroup(s).