[PATCH/RFC] Allow writing loose objects that are corrupted in a pack file

git.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
@ 2008-12-09  8:36 Jan Krüger
  2008-12-09  9:02 ` R. Tyler Ballance
                   ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Jan Krüger @ 2008-12-09  8:36 UTC (permalink / raw)
  To: Git ML; +Cc: tyler

For fixing a corrupted repository by using backup copies of individual
files, allow write_sha1_file() to write loose files even if the object
already exists in a pack file, but only if the existing entry is marked
as corrupted.

Signed-off-by: Jan Krüger <jk@jk.gs>
---

On IRC I talked to rtyler who had a corrupted pack file and plenty of
object backups by way of cloned repositories. We decided to try
extracting the corrupted objects from the other object database and
injecting them into the broken repo as loose objects, but this failed
because sha1_write_file() refuses to write loose objects that are
already present in a pack file.

This patch expands the check to see if the pack entry has been marked
as corrupted and, if so, allows writing a loose object with the same
ID. Unfortunately, when Tyler tried a merge while using this patch,
something we didn't manage to track down happened and now git doesn't
consider the object corrupted anymore. I'm not sure enough that it
wasn't caused by the patch to submit this patch without hesitation.

Apart from that, I think the change is not all too great since it makes
write_sha1_file() walk the list of pack entries twice. That's a bit of
a waste.

So those are the reasons why I wanted a few opinions first. Another
reason is that there might be a way smarter method to fix this kind of
problem, in which case I'd love hearing about it for future reference.

 sha1_file.c |    9 +++++----
 1 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index 6c0e251..17085cc 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2373,14 +2373,17 @@ int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned cha
 	char hdr[32];
 	int hdrlen;

-	/* Normally if we have it in the pack then we do not bother writing
-	 * it out into .git/objects/??/?{38} file.
-	 */
 	write_sha1_file_prepare(buf, len, type, sha1, hdr, &hdrlen);
 	if (returnsha1)
 		hashcpy(returnsha1, sha1);
-	if (has_sha1_file(sha1))
-		return 0;
+	/* Normally if we have it in the pack then we do not bother writing
+	 * it out into .git/objects/??/?{38} file. We do, though, if there
+	 * is no chance that we have an uncorrupted version of the object.
+	 */
+	if (has_sha1_file(sha1)) {
+		if (has_loose_object(sha1) || !has_packed_and_bad(sha1))
+			return 0;
+	}
 	return write_loose_object(sha1, hdr, hdrlen, buf, len, 0);
 }

-- 
1.6.0.4.766.g6fc4a

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2008-12-09  8:36 [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file Jan Krüger
@ 2008-12-09  9:02 ` R. Tyler Ballance
  2008-12-09 16:24 ` Shawn O. Pearce
  2009-01-06 22:52 ` R. Tyler Ballance
  2 siblings, 0 replies; 49+ messages in thread
From: R. Tyler Ballance @ 2008-12-09  9:02 UTC (permalink / raw)
  To: Jan Krüger; +Cc: Git ML

[-- Attachment #1: Type: text/plain, Size: 2313 bytes --]

On Tue, 2008-12-09 at 09:36 +0100, Jan Krüger wrote:
> For fixing a corrupted repository by using backup copies of individual
> files, allow write_sha1_file() to write loose files even if the object
> already exists in a pack file, but only if the existing entry is marked
> as corrupted.
> 
> Signed-off-by: Jan Krüger <jk@jk.gs>
> ---
> 
> On IRC I talked to rtyler who had a corrupted pack file and plenty of
> object backups by way of cloned repositories. We decided to try
> extracting the corrupted objects from the other object database and
> injecting them into the broken repo as loose objects, but this failed
> because sha1_write_file() refuses to write loose objects that are
> already present in a pack file.

Figured I'd chime in here with some anecdotal evidence with the error
condition that I hit shortly after Jan sent the email.

        xdev3 (master-release)% git pull --no-ff . master
        From .
         * branch            master     -> FETCH_HEAD
        error: failed to read object
        befd9bc4d184b4383569909e4d245f3337c1f8ed at offset 1415784644
        from .git/objects/pack/pack-f7eb06e39f01b528c1d1a2c413ac51b31b8515aa.pack
        fatal: object befd9bc4d184b4383569909e4d245f3337c1f8ed is
        corrupted
        Merge with strategy recursive failed.
        xdev3 (master-release)%

I ran that command a couple of times to make sure it wasn't a fluke, I
repeated the error numerous times (without switching branches or pulling
from a remote). This pull was done with a slightly modified internal
version of v1.6.0.4
        xdev3 (master-release)% git --version
        git version 1.6.0.4-kb1
        xdev3 (master-release)
        

After consulting with Jan, I tried running the same command with a
modified version of v1.6.0.5 with Jan's patch
        xdev3 (master-release)% ~/basket/bin/git pull --no-ff . master
        From .
         * branch            master     -> FETCH_HEAD
        Merge made by recursive.
         ** TOP SECRET MERGES! ;) **
        
         13 files changed, 51 insertions(+), 21 deletions(-)
        xdev3 (master-release)%
        
        
Purely anecdotal as I'm not entirely clear what the hell is actually going on here :)


Cheers
-- 
-R. Tyler Ballance
Slide, Inc.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2008-12-09  8:36 [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file Jan Krüger
  2008-12-09  9:02 ` R. Tyler Ballance
@ 2008-12-09 16:24 ` Shawn O. Pearce
  2009-01-06 22:52 ` R. Tyler Ballance
  2 siblings, 0 replies; 49+ messages in thread
From: Shawn O. Pearce @ 2008-12-09 16:24 UTC (permalink / raw)
  To: Jan Krüger; +Cc: Git ML, tyler

Jan Krüger <jk@jk.gs> wrote:
> For fixing a corrupted repository by using backup copies of individual
> files, allow write_sha1_file() to write loose files even if the object
> already exists in a pack file, but only if the existing entry is marked
> as corrupted.

Huh.  So I'm digging around sha1_file.c and I'm not yet sure why
your patch makes a difference.

has_sha1_file() calls find_pack_entry() to determine which pack has
the object, and at what offset (if found).  It doesn't care about
the offset, but it does care about the successful match.

find_pack_entry() already considers the bad_object_sha1 for each
pack, before it even tries the binary search within the index.
So if the entry was known to be bad has_sha1_file() will return 0,
unless the object is loose.

Where this breaks down is if the object is being created,
its very likely we didn't attempt to read it in this process.
The bad_object_sha1 table is transient and populated only when
unpacking an object entry fails.  So for example during a merge
if a tree was stored in a pack and is corrupt and the merge
result produces that same tree object we won't write it out with
write_sha1_file() because it exists in a pack, but since we never
read it we also don't know the pack entry is corrupt.

Its horribly inefficient to read every object before we write it
back out.  The best thing to do when faced with corruption is to
stop and repack, overlaying the object database from a known good
copy of the repository so pack-objects can use the good copy when
a corrupt object is identified.

So I agree with you that changing this in write_sha1_file() is a
bad idea for the normal good cases, but I also don't see how this
patch changes anything at all... the code path you introduced is
already implemented.

> diff --git a/sha1_file.c b/sha1_file.c
> index 6c0e251..17085cc 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -2373,14 +2373,17 @@ int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned cha
>  	char hdr[32];
>  	int hdrlen;
>  
> -	/* Normally if we have it in the pack then we do not bother writing
> -	 * it out into .git/objects/??/?{38} file.
> -	 */
>  	write_sha1_file_prepare(buf, len, type, sha1, hdr, &hdrlen);
>  	if (returnsha1)
>  		hashcpy(returnsha1, sha1);
> -	if (has_sha1_file(sha1))
> -		return 0;
> +	/* Normally if we have it in the pack then we do not bother writing
> +	 * it out into .git/objects/??/?{38} file. We do, though, if there
> +	 * is no chance that we have an uncorrupted version of the object.
> +	 */
> +	if (has_sha1_file(sha1)) {
> +		if (has_loose_object(sha1) || !has_packed_and_bad(sha1))
> +			return 0;
> +	}
>  	return write_loose_object(sha1, hdr, hdrlen, buf, len, 0);
>  }

-- 
Shawn.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2008-12-09  8:36 [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file Jan Krüger
  2008-12-09  9:02 ` R. Tyler Ballance
  2008-12-09 16:24 ` Shawn O. Pearce
@ 2009-01-06 22:52 ` R. Tyler Ballance
  2009-01-07  1:25   ` Nicolas Pitre
  2 siblings, 1 reply; 49+ messages in thread
From: R. Tyler Ballance @ 2009-01-06 22:52 UTC (permalink / raw)
  To: Jan Krüger; +Cc: Git ML

[-- Attachment #1: Type: text/plain, Size: 3858 bytes --]

On Tue, 2008-12-09 at 09:36 +0100, Jan Krüger wrote:
> For fixing a corrupted repository by using backup copies of individual
> files, allow write_sha1_file() to write loose files even if the object
> already exists in a pack file, but only if the existing entry is marked
> as corrupted.

I figured I'd reply to this again, since the issue cropped up again.

We started experiencing *large* numbers of corruptions like the ones
that started the thread (one developer was receiving them once or twice
a day) with v1.6.0.4

We went ahead and upgraded to a custom build of v1.6.1 with Jan's patch
(below) and the issues /seem/ to have resolved themselves. I'm not
certain whether Jan's patch was really responsible, or if there was
another issue that caused this to correct itself in v1.6.1. 

As it stands, I think it's safe to assume that given the frequency of
the occurances that they were not tied to a memory or disk error (or
other levels of the machine's stack would be suffering as well). The
only thing I can think of is that /some/ developers who've experienced
the issue are using Samba mount points and changing files in Mac OS X,
but using Git on the mounted share (i.e. TextMate changes a file hosted
on Samba, changes are committed in an SSH session on that machine), but
that doesn't account for everything.

If there was something else included in the v1.6.1 release please let me
know so I can back Jan's patch out.


Cheers


> 
> Signed-off-by: Jan Krüger <jk@jk.gs>
> ---
> 
> On IRC I talked to rtyler who had a corrupted pack file and plenty of
> object backups by way of cloned repositories. We decided to try
> extracting the corrupted objects from the other object database and
> injecting them into the broken repo as loose objects, but this failed
> because sha1_write_file() refuses to write loose objects that are
> already present in a pack file.
> 
> This patch expands the check to see if the pack entry has been marked
> as corrupted and, if so, allows writing a loose object with the same
> ID. Unfortunately, when Tyler tried a merge while using this patch,
> something we didn't manage to track down happened and now git doesn't
> consider the object corrupted anymore. I'm not sure enough that it
> wasn't caused by the patch to submit this patch without hesitation.
> 
> Apart from that, I think the change is not all too great since it makes
> write_sha1_file() walk the list of pack entries twice. That's a bit of
> a waste.
> 
> So those are the reasons why I wanted a few opinions first. Another
> reason is that there might be a way smarter method to fix this kind of
> problem, in which case I'd love hearing about it for future reference.
> 
>  sha1_file.c |    9 +++++----
>  1 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/sha1_file.c b/sha1_file.c
> index 6c0e251..17085cc 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -2373,14 +2373,17 @@ int write_sha1_file(void *buf, unsigned long len, const char *type, unsigned cha
>  	char hdr[32];
>  	int hdrlen;
>  
> -	/* Normally if we have it in the pack then we do not bother writing
> -	 * it out into .git/objects/??/?{38} file.
> -	 */
>  	write_sha1_file_prepare(buf, len, type, sha1, hdr, &hdrlen);
>  	if (returnsha1)
>  		hashcpy(returnsha1, sha1);
> -	if (has_sha1_file(sha1))
> -		return 0;
> +	/* Normally if we have it in the pack then we do not bother writing
> +	 * it out into .git/objects/??/?{38} file. We do, though, if there
> +	 * is no chance that we have an uncorrupted version of the object.
> +	 */
> +	if (has_sha1_file(sha1)) {
> +		if (has_loose_object(sha1) || !has_packed_and_bad(sha1))
> +			return 0;
> +	}
>  	return write_loose_object(sha1, hdr, hdrlen, buf, len, 0);
>  }
>  
-- 
-R. Tyler Ballance
Slide, Inc.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-06 22:52 ` R. Tyler Ballance
@ 2009-01-07  1:25   ` Nicolas Pitre
  2009-01-07  1:39     ` R. Tyler Ballance
  0 siblings, 1 reply; 49+ messages in thread
From: Nicolas Pitre @ 2009-01-07  1:25 UTC (permalink / raw)
  To: R. Tyler Ballance; +Cc: Jan Krüger, Git ML

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2646 bytes --]

On Tue, 6 Jan 2009, R. Tyler Ballance wrote:

> On Tue, 2008-12-09 at 09:36 +0100, Jan Krüger wrote:
> > For fixing a corrupted repository by using backup copies of individual
> > files, allow write_sha1_file() to write loose files even if the object
> > already exists in a pack file, but only if the existing entry is marked
> > as corrupted.
> 
> I figured I'd reply to this again, since the issue cropped up again.
> 
> We started experiencing *large* numbers of corruptions like the ones
> that started the thread (one developer was receiving them once or twice
> a day) with v1.6.0.4
> 
> We went ahead and upgraded to a custom build of v1.6.1 with Jan's patch
> (below) and the issues /seem/ to have resolved themselves. I'm not
> certain whether Jan's patch was really responsible, or if there was
> another issue that caused this to correct itself in v1.6.1. 
> 
> As it stands, I think it's safe to assume that given the frequency of
> the occurances that they were not tied to a memory or disk error (or
> other levels of the machine's stack would be suffering as well). The
> only thing I can think of is that /some/ developers who've experienced
> the issue are using Samba mount points and changing files in Mac OS X,
> but using Git on the mounted share (i.e. TextMate changes a file hosted
> on Samba, changes are committed in an SSH session on that machine), but
> that doesn't account for everything.
> 
> If there was something else included in the v1.6.1 release please let me
> know so I can back Jan's patch out.

Please back it out.  As it stands, that patch is a no op because of the 
way git is used, and even if the patch was to work as intended, its 
purpose is not to magically fix corruptions without special action from 
your part.  If you have corruption problems coming back only because of 
the removal of this patch then something is really really fishy and I 
would really like to know about it.

There were indeed many changes between v1.6.0.4 and v1.6.1: the exact 
number is 1029.  A couple of them are especially addressing increased 
robustness against some kind of pack corruptions.  But in any case you 
still should see error messages appearing about them.

And don't underestimate the power of disk corruptions.  I started to 
work on git corruption resilience simply because I ended up with a 
corrupted pack at some point.  Then a while later I got another 
corrupted pack.  Then another while later I lost my filesystem entirely 
and had to reinstall my system (after buying a new disk).  Turns out 
that my old disk is silently corrupting data without signaling any 
errors to the host.

Nicolas

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-07  1:25   ` Nicolas Pitre
@ 2009-01-07  1:39     ` R. Tyler Ballance
  2009-01-07  2:09       ` Nicolas Pitre
  2009-01-07  4:54       ` Linus Torvalds
  0 siblings, 2 replies; 49+ messages in thread
From: R. Tyler Ballance @ 2009-01-07  1:39 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Jan Krüger, Git ML

[-- Attachment #1: Type: text/plain, Size: 3010 bytes --]

On Tue, 2009-01-06 at 20:25 -0500, Nicolas Pitre wrote:
> On Tue, 6 Jan 2009, R. Tyler Ballance wrote:
> 
> > On Tue, 2008-12-09 at 09:36 +0100, Jan Krüger wrote:
> > > For fixing a corrupted repository by using backup copies of individual
> > > files, allow write_sha1_file() to write loose files even if the object
> > > already exists in a pack file, but only if the existing entry is marked
> > > as corrupted.
> > 
> > I figured I'd reply to this again, since the issue cropped up again.
> > 
> > We started experiencing *large* numbers of corruptions like the ones
> > that started the thread (one developer was receiving them once or twice
> > a day) with v1.6.0.4
> > 
> > We went ahead and upgraded to a custom build of v1.6.1 with Jan's patch
> > (below) and the issues /seem/ to have resolved themselves. I'm not
> > certain whether Jan's patch was really responsible, or if there was
> > another issue that caused this to correct itself in v1.6.1. 

I'll back the patch out and redeploy, it's worth mentioning that a
coworker of mine just got the issue as well (on 1.6.1). He was able to
`git pull` and the error went away, but I doubt that it "magically fixed
itself"


> Please back it out.  As it stands, that patch is a no op because of the 
> way git is used, and even if the patch was to work as intended, its 
> purpose is not to magically fix corruptions without special action from 
> your part.  If you have corruption problems coming back only because of 
> the removal of this patch then something is really really fishy and I 
> would really like to know about it.
> 
> There were indeed many changes between v1.6.0.4 and v1.6.1: the exact 
> number is 1029.  A couple of them are especially addressing increased 
> robustness against some kind of pack corruptions.  But in any case you 
> still should see error messages appearing about them.
> 
> And don't underestimate the power of disk corruptions.  I started to 
> work on git corruption resilience simply because I ended up with a 
> corrupted pack at some point.  Then a while later I got another 
> corrupted pack.  Then another while later I lost my filesystem entirely 
> and had to reinstall my system (after buying a new disk).  Turns out 
> that my old disk is silently corrupting data without signaling any 
> errors to the host.

I highly doubt this, I've got the issue appearing on at least 7
different development boxes (not workstations, 2U quad-core ECC RAM, etc
machines), while that doesn't mean that they all don't have issues, the
probability of them *all* having disk issues, and it somehow only
manifesting itself with Git usage, is low ;)

I've tarred one of the repositories that had it in a reproducible state
so I can create a build and extract the tar and run against that to
verify any patches anybody might have, but unfortunately at 7GB of
company code and assets, I can't exactly share ;)


Cheers


-- 
-R. Tyler Ballance
Slide, Inc.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-07  1:39     ` R. Tyler Ballance
@ 2009-01-07  2:09       ` Nicolas Pitre
  2009-01-07  2:47         ` R. Tyler Ballance
  2009-01-07  4:54       ` Linus Torvalds
  1 sibling, 1 reply; 49+ messages in thread
From: Nicolas Pitre @ 2009-01-07  2:09 UTC (permalink / raw)
  To: R. Tyler Ballance; +Cc: Jan Krüger, Git ML

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2165 bytes --]

On Tue, 6 Jan 2009, R. Tyler Ballance wrote:

> On Tue, 2009-01-06 at 20:25 -0500, Nicolas Pitre wrote:
> > On Tue, 6 Jan 2009, R. Tyler Ballance wrote:
> > 
> > > On Tue, 2008-12-09 at 09:36 +0100, Jan Krüger wrote:
> > > > For fixing a corrupted repository by using backup copies of individual
> > > > files, allow write_sha1_file() to write loose files even if the object
> > > > already exists in a pack file, but only if the existing entry is marked
> > > > as corrupted.
> > > 
> > > I figured I'd reply to this again, since the issue cropped up again.
> > > 
> > > We started experiencing *large* numbers of corruptions like the ones
> > > that started the thread (one developer was receiving them once or twice
> > > a day) with v1.6.0.4
> > > 
> > > We went ahead and upgraded to a custom build of v1.6.1 with Jan's patch
> > > (below) and the issues /seem/ to have resolved themselves. I'm not
> > > certain whether Jan's patch was really responsible, or if there was
> > > another issue that caused this to correct itself in v1.6.1. 
> 
> I'll back the patch out and redeploy, it's worth mentioning that a
> coworker of mine just got the issue as well (on 1.6.1). He was able to
> `git pull` and the error went away, but I doubt that it "magically fixed
> itself"

Please describe the "issue", ideally with transcripts of error messages, 
etc.  Normally a simple pull operation should not provide any "fix" for 
corruptions.

> I highly doubt this, I've got the issue appearing on at least 7
> different development boxes (not workstations, 2U quad-core ECC RAM, etc
> machines), while that doesn't mean that they all don't have issues, the
> probability of them *all* having disk issues, and it somehow only
> manifesting itself with Git usage, is low ;)

Agreed.

> I've tarred one of the repositories that had it in a reproducible state

That is wonderful.

> so I can create a build and extract the tar and run against that to
> verify any patches anybody might have, but unfortunately at 7GB of
> company code and assets, I can't exactly share ;)

First step is to understand what is going on.  Only then could reliable 
patches be made.


Nicolas

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-07  2:09       ` Nicolas Pitre
@ 2009-01-07  2:47         ` R. Tyler Ballance
  2009-01-07  3:21           ` Nicolas Pitre
  0 siblings, 1 reply; 49+ messages in thread
From: R. Tyler Ballance @ 2009-01-07  2:47 UTC (permalink / raw)
  To: Nicolas Pitre; +Cc: Jan Krüger, Git ML

[-- Attachment #1: Type: text/plain, Size: 751 bytes --]

On Tue, 2009-01-06 at 21:09 -0500, Nicolas Pitre wrote:
> > I've tarred one of the repositories that had it in a reproducible
> state
> 
> That is wonderful.
> 
> > so I can create a build and extract the tar and run against that to
> > verify any patches anybody might have, but unfortunately at 7GB of
> > company code and assets, I can't exactly share ;)
> 
> First step is to understand what is going on.  Only then could reliable 
> patches be made.

If you want to point me in the right direction, I have a few hours to
kill this evening and fscking around with gdb(1) and printf() just might
be some of my favorite things</sarcasm> ;)

Looking forward to killing this issue


Cheers

-- 
-R. Tyler Ballance
Slide, Inc.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-07  2:47         ` R. Tyler Ballance
@ 2009-01-07  3:21           ` Nicolas Pitre
  0 siblings, 0 replies; 49+ messages in thread
From: Nicolas Pitre @ 2009-01-07  3:21 UTC (permalink / raw)
  To: R. Tyler Ballance; +Cc: Jan Krüger, Git ML

On Tue, 6 Jan 2009, R. Tyler Ballance wrote:

> On Tue, 2009-01-06 at 21:09 -0500, Nicolas Pitre wrote:
> > > I've tarred one of the repositories that had it in a reproducible
> > state
> > 
> > That is wonderful.
> > 
> > > so I can create a build and extract the tar and run against that to
> > > verify any patches anybody might have, but unfortunately at 7GB of
> > > company code and assets, I can't exactly share ;)
> > 
> > First step is to understand what is going on.  Only then could reliable 
> > patches be made.
> 
> If you want to point me in the right direction, I have a few hours to
> kill this evening and fscking around with gdb(1) and printf() just might
> be some of my favorite things</sarcasm> ;)

Heh.  ;-)

To start with, a simple log of what you need to do to reproduce the 
issue would be nice.  Just do

	script /tmp/foo

then reproduce the issue and exit, after which I'd be interrested in the 
content of /tmp/foo.


Nicolas

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-07  1:39     ` R. Tyler Ballance
  2009-01-07  2:09       ` Nicolas Pitre
@ 2009-01-07  4:54       ` Linus Torvalds
  2009-01-07  7:41         ` R. Tyler Ballance
  1 sibling, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2009-01-07  4:54 UTC (permalink / raw)
  To: R. Tyler Ballance; +Cc: Nicolas Pitre, Jan Krüger, Git ML

On Tue, 6 Jan 2009, R. Tyler Ballance wrote:
> 
> I'll back the patch out and redeploy, it's worth mentioning that a
> coworker of mine just got the issue as well (on 1.6.1). He was able to
> `git pull` and the error went away, but I doubt that it "magically fixed
> itself"

Quite frankly, that behaviour sounds like a disk _cache_ corruption issue. 
The fact that some corruption "comes and goes" and sometimes magically 
heals itself sounds very much like some disk cache problem, and then that 
particular part of the cache gets replaced and then when re-populated it 
is magically correct.

We had that in one case with a Linux NFS client, where a rename across 
directories caused problems.

This was a networked filesystem on OS X, right? File caching is much more 
"interesting" in networked filesystems than it is in normal private 
on-disk ones.

> I've tarred one of the repositories that had it in a reproducible state
> so I can create a build and extract the tar and run against that to
> verify any patches anybody might have, but unfortunately at 7GB of
> company code and assets, I can't exactly share ;)

The thing to do is

 - untar it on some trusted machine with a local disk and a known-good 
   filesystem.

   IOW, not that networked samba share.

 - verify that it really does happen on that machine, with that untarred 
   image. Because maybe it doesn't. 

   The hope is that you caught the corruption in the cache, and it 
   actually got written out to the tar-file. But if it _is_ a disk cache 
   (well, network cache) issue, maybe the IO required to tar everything up 
   was enough to flush it, and the tar-file actually _works_ because it 
   got repopulated correctly.

   So that's why you should double-check that it really ends up being 
   corrupt after being untarred again.

 - go back and test the original git repo on the network share, preferably 
   on another client. See if the error has gone away.

 - If so, try to compare that known-corrupt filesystem with the original 
   one:  and preferably do this on another machine over the network mount. 

   See if they differ. They obviously should *not* differ, since it's an 
   tar/untar of the same files, but ...

The fact that you seem to get a _lot_ of these errors really does make it 
sound like something in your environment. It's actually really hard to get 
git to corrupt anything. Especially objects that got packed. They've been 
quiescent for a long time, they got repacked in a very simple way, they 
are totally read-only.

But it is _not_ hard to corrupt network filesystems. It's downright 
trivial with some of them, especially with some hardware (eg there's no 
end-to-end checksumming except for the _extremely_ weak 16-bit IP csum, 
and even that has been known to be disabled, or screwed up by ethernet 
cards that do IP packet offloading and thus computing the csum not on the 
data that tee user actually wrote, but the data that the card received, 
which is not necessarily at all the same thing).

And while ethernet uses a stronger CRC, that one is not end-to-end, so 
corruption on the card or in a switch in between easily defeats that too. 

Just google for something like

	"OS X" SMB "file corruption"

and you'll find quite a bit of hits. Not all that unusual.

				Linus

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-07  4:54       ` Linus Torvalds
@ 2009-01-07  7:41         ` R. Tyler Ballance
  2009-01-07  8:16           ` Junio C Hamano
                             ` (3 more replies)
  0 siblings, 4 replies; 49+ messages in thread
From: R. Tyler Ballance @ 2009-01-07  7:41 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, Jan Krüger, Git ML

[-- Attachment #1: Type: text/plain, Size: 5231 bytes --]

On Tue, 2009-01-06 at 20:54 -0800, Linus Torvalds wrote:
> 
> On Tue, 6 Jan 2009, R. Tyler Ballance wrote:
> > 
> > I'll back the patch out and redeploy, it's worth mentioning that a
> > coworker of mine just got the issue as well (on 1.6.1). He was able to
> > `git pull` and the error went away, but I doubt that it "magically fixed
> > itself"
> 
> Quite frankly, that behaviour sounds like a disk _cache_ corruption issue. 
> The fact that some corruption "comes and goes" and sometimes magically 
> heals itself sounds very much like some disk cache problem, and then that 
> particular part of the cache gets replaced and then when re-populated it 
> is magically correct.
> 
> We had that in one case with a Linux NFS client, where a rename across 
> directories caused problems.
> 
> This was a networked filesystem on OS X, right? File caching is much more 
> "interesting" in networked filesystems than it is in normal private 
> on-disk ones.

Not quite, what I meant was that some users (not all) who've experienced
this issue are using Samba to copy files over directly into the Git
repository. I was mentioning this in case somewhere between Finder,
Samba, ext3 and Git, some file system change events were pissing Git off
and causing it. I don't think this is the case as the coworker that I
mentioned earlier doesn't use Samba and neither do I (we both experience
the issue today, mine disappeared by upgrading to 1.6.1, his by `git
pull`).


> 
> > I've tarred one of the repositories that had it in a reproducible state
> > so I can create a build and extract the tar and run against that to
> > verify any patches anybody might have, but unfortunately at 7GB of
> > company code and assets, I can't exactly share ;)
> 
> The thing to do is
> 
>  - untar it on some trusted machine with a local disk and a known-good 
>    filesystem.
> 
>    IOW, not that networked samba share.
> 
>  - verify that it really does happen on that machine, with that untarred 
>    image. Because maybe it doesn't. 

Unfortunately it doesn't, what I did notice was this when I did a `git
status` in the directory right after untarring:
        tyler@grapefruit:~/jburgess_main> git status
        #
        # ---impressive amount of file names fly by---
        # ----snip---
        #
        # Untracked files:
        #   (use "git add <file>..." to include in what will be
        committed)
        #
        #       artwork/
        #       bt/
        #       flash/
        tyler@grapefruit:~/jburgess_main>

Basically, somehow Git thinks that *every* file in the repository is
deleted at this point. I went ahead and performed a `git reset --hard`
to see if the issue would manifest itself thereafter, but it did not.

I did try to do a git-fsck(1), and this is what I got:
        tyler@grapefruit:~/jburgess_main> /usr/local/bin/git fsck --full
        [1]    19381 segmentation fault  /usr/local/bin/git fsck --full
        tyler@grapefruit:~/jburgess_main> 
> 
>    The hope is that you caught the corruption in the cache, and it 
>    actually got written out to the tar-file. But if it _is_ a disk cache 
>    (well, network cache) issue, maybe the IO required to tar everything up 
>    was enough to flush it, and the tar-file actually _works_ because it 
>    got repopulated correctly.

When I was working through this with Jan, one of the things that we did
was move the actual object file in .git/objects, they existed so maybe I
could look into those to check?

> 
>    So that's why you should double-check that it really ends up being 
>    corrupt after being untarred again.
> 
>  - go back and test the original git repo on the network share, preferably 
>    on another client. See if the error has gone away.

Unfortunately the repository is being used by the original developer I
tarred from with our 1.6.1 build, he hasn't reported any issues, but I
can't exactly steal it back (that's why I made the tar)

> The fact that you seem to get a _lot_ of these errors really does make
> it 
> sound like something in your environment. It's actually really hard to get 
> git to corrupt anything. Especially objects that got packed. They've been 
> quiescent for a long time, they got repacked in a very simple way, they 
> are totally read-only.

I checked with our operations team, and contrary to my suspicion (your
NFS comment piqued my curiosity), these disks that are actually on the
machines are not NFS mounts but rather local disk arrays.
        
        --> is it NFSd? or all local storage
        <== all local
        <== df -h
        <== mount
        <== /dev/sda5             705G  247G  423G  37% /nail
        --> hm, there goes that theory
        <== git corruption?
        --> yeah, looking into it
        <== sucks
        --> Linus had a theory about NFS/etc corruption of the disk
        cache
        <== when the company folds we can all blame you...  and your
        silly git games
        <== (think positive, joel)
        --> thanks 
        
;)


Any thing else I can do to help debug this? :-/

Cheers
-- 
-R. Tyler Ballance
Slide, Inc.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-07  7:41         ` R. Tyler Ballance
@ 2009-01-07  8:16           ` Junio C Hamano
  2009-01-07  8:32             ` R. Tyler Ballance
  2009-01-07  9:05           ` R. Tyler Ballance
                             ` (2 subsequent siblings)
  3 siblings, 1 reply; 49+ messages in thread
From: Junio C Hamano @ 2009-01-07  8:16 UTC (permalink / raw)
  To: R. Tyler Ballance; +Cc: Linus Torvalds, Nicolas Pitre, Jan Krüger, Git ML

"R. Tyler Ballance" <tyler@slide.com> writes:

> Unfortunately it doesn't, what I did notice was this when I did a `git
> status` in the directory right after untarring:
>         tyler@grapefruit:~/jburgess_main> git status
>         #
>         # ---impressive amount of file names fly by---
>         # ----snip---
> ...
> Basically, somehow Git thinks that *every* file in the repository is
> deleted at this point.

That makes me suspect that your .git/index file is corrupt.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-07  8:16           ` Junio C Hamano
@ 2009-01-07  8:32             ` R. Tyler Ballance
  2009-01-07  9:42               ` Junio C Hamano
  0 siblings, 1 reply; 49+ messages in thread
From: R. Tyler Ballance @ 2009-01-07  8:32 UTC (permalink / raw)
  To: Junio C Hamano; +Cc: Linus Torvalds, Nicolas Pitre, Jan Krüger, Git ML

[-- Attachment #1: Type: text/plain, Size: 823 bytes --]

On Wed, 2009-01-07 at 00:16 -0800, Junio C Hamano wrote:
> "R. Tyler Ballance" <tyler@slide.com> writes:
> 
> > Unfortunately it doesn't, what I did notice was this when I did a `git
> > status` in the directory right after untarring:
> >         tyler@grapefruit:~/jburgess_main> git status
> >         #
> >         # ---impressive amount of file names fly by---
> >         # ----snip---
> > ...
> > Basically, somehow Git thinks that *every* file in the repository is
> > deleted at this point.
> 
> That makes me suspect that your .git/index file is corrupt.

Would this be tied to the corrupted pack file issue, or separate.

Either way, how could I verify your assumptions? (i'll be lurking in
#git for a while if you want to interactively help ;))

Cheers
-- 
-R. Tyler Ballance
Slide, Inc.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-07  8:32             ` R. Tyler Ballance
@ 2009-01-07  9:42               ` Junio C Hamano
  0 siblings, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2009-01-07  9:42 UTC (permalink / raw)
  To: R. Tyler Ballance; +Cc: Linus Torvalds, Nicolas Pitre, Jan Krüger, Git ML

"R. Tyler Ballance" <tyler@slide.com> writes:

> On Wed, 2009-01-07 at 00:16 -0800, Junio C Hamano wrote:
>> "R. Tyler Ballance" <tyler@slide.com> writes:
>> 
>> > Unfortunately it doesn't, what I did notice was this when I did a `git
>> > status` in the directory right after untarring:
>> >         tyler@grapefruit:~/jburgess_main> git status
>> >         #
>> >         # ---impressive amount of file names fly by---
>> >         # ----snip---
>> > ...
>> > Basically, somehow Git thinks that *every* file in the repository is
>> > deleted at this point.
>> 
>> That makes me suspect that your .git/index file is corrupt.
>
> Would this be tied to the corrupted pack file issue, or separate.

If you have perfectly good set of packs, if your index is corrupt you may
see "everything deleted", so in that sense it is independent.

As Linus's earlier conjecture was that this is related to some sort of
disk/cache corruption, I wouldn't be surprised if such a failure hit packs
and the index file indiscriminatingly.  So in that sense they are
related.

I think "git ls-files" (before doing anything else, such as resetting, of
course) would report that the index is corrupt, if that is indeed the case.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-07  7:41         ` R. Tyler Ballance
  2009-01-07  8:16           ` Junio C Hamano
@ 2009-01-07  9:05           ` R. Tyler Ballance
  2009-01-07 15:31           ` Nicolas Pitre
  2009-01-07 16:07           ` Linus Torvalds
  3 siblings, 0 replies; 49+ messages in thread
From: R. Tyler Ballance @ 2009-01-07  9:05 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, Jan Krüger, Git ML

[-- Attachment #1: Type: text/plain, Size: 2154 bytes --]

On Tue, 2009-01-06 at 23:41 -0800, R. Tyler Ballance wrote:

> I did try to do a git-fsck(1), and this is what I got:
>         tyler@grapefruit:~/jburgess_main> /usr/local/bin/git fsck --full
>         [1]    19381 segmentation fault  /usr/local/bin/git fsck --full
>         tyler@grapefruit:~/jburgess_main> 
> > 

Disregard this comment, Jan reminded me in IRC of the issue I had with
this repository and the stack-size ulimit. A healthy 32M stack-size
allows for the command to complete:

        tyler@grapefruit:~/jburgess_main> /usr/local/bin/git fsck --full
        dangling blob 6e58a06cd0f027fce1fc8a923a8c81d6b55f1705
        dangling blob 5e87e049f69ee06af1f4f92a3d4ddcd912f8535e
        dangling blob acb8e0937cea3f4068c5c67f60d3b97952654fa1
        dangling blob 8bd540d657027ab12228e0522de86a97d3f8a7a9
        dangling blob 90d93096369df4b2151df3e289952297c5f390dd
        dangling blob aaa4a1bf9e6d39991503b7908ff71605d6632fef
        dangling blob b003e20f06de7d0a7e11a1047fbb39e2deb84899
        dangling blob bf207363786f08e888a725cfb8b2fe4bea11ceab
        dangling blob 642493a1bcf09f74551f0ce5b4c3ccc23acaf3c9
        dangling blob ff80b45e48795d4544d0a5b2f18714123ab21746
        dangling blob e3c0e67e23ec588acab733e2069fb09fa38a235d
        dangling blob 62efa65bcbb6d94f26be9b91dc98d7a7f6fbf602
        dangling blob a6a2c717c230fe26263b9ce002f3f82fdab6f727
        dangling blob d32988875aa90ef728dc1af6b71c130f2c8e8b94
        dangling blob c8aad9fc37a27872bae18af78b1623bfb8f9a9f7
        dangling blob 4cdbe97301e333d89037dc9f4a8440dab9e62049
        dangling blob 2819bbcda3b0efe828709f5b22624712cb9ebdae
        dangling blob b156cba3ee89e1506d02fbb845e8dc0889ff4090
        dangling blob ed658b8beca69ebf6e25ab1ed6217881e27d4e9a
        dangling blob d581cb2fb1cf2b4d15f8f9b90b08a3ebc619414f
        dangling blob a85c2e55d27c9fb1f5874f8bd81c65e597327b4f
        dangling blob 548f3e70c06f9306d71f6b8d11ed24ade581fa31
        dangling blob 7a1c8fa8bc59c4e8fb347426cc11fabf8b0d1639
        tyler@grapefruit:~/jburgess_main> 
        
Cheers

-- 
-R. Tyler Ballance
Slide, Inc.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-07  7:41         ` R. Tyler Ballance
  2009-01-07  8:16           ` Junio C Hamano
  2009-01-07  9:05           ` R. Tyler Ballance
@ 2009-01-07 15:31           ` Nicolas Pitre
  2009-01-07 16:07           ` Linus Torvalds
  3 siblings, 0 replies; 49+ messages in thread
From: Nicolas Pitre @ 2009-01-07 15:31 UTC (permalink / raw)
  To: R. Tyler Ballance; +Cc: Linus Torvalds, Jan Krüger, Git ML

On Tue, 6 Jan 2009, R. Tyler Ballance wrote:

> On Tue, 2009-01-06 at 20:54 -0800, Linus Torvalds wrote:
> > 
> > On Tue, 6 Jan 2009, R. Tyler Ballance wrote:
> > > 
> > > I'll back the patch out and redeploy, it's worth mentioning that a
> > > coworker of mine just got the issue as well (on 1.6.1). He was able to
> > > `git pull` and the error went away, but I doubt that it "magically fixed
> > > itself"
> > 
> > Quite frankly, that behaviour sounds like a disk _cache_ corruption issue. 
> > The fact that some corruption "comes and goes" and sometimes magically 
> > heals itself sounds very much like some disk cache problem, and then that 
> > particular part of the cache gets replaced and then when re-populated it 
> > is magically correct.
> > 
> > We had that in one case with a Linux NFS client, where a rename across 
> > directories caused problems.
> > 
> > This was a networked filesystem on OS X, right? File caching is much more 
> > "interesting" in networked filesystems than it is in normal private 
> > on-disk ones.
> 
> Not quite, what I meant was that some users (not all) who've experienced
> this issue are using Samba to copy files over directly into the Git
> repository. I was mentioning this in case somewhere between Finder,
> Samba, ext3 and Git, some file system change events were pissing Git off
> and causing it.

As long as those files are not within the .git directory that should be 
fine.

> I don't think this is the case as the coworker that I
> mentioned earlier doesn't use Samba and neither do I (we both experience
> the issue today, mine disappeared by upgrading to 1.6.1, his by `git
> pull`).

Problem is that none of that "makes sense".  If a real git corruption 
was there, it wouldn't disappear without explicit action from your part.  
What git v1.6.1 is able to do over earlier versions is to still function 
properly if some corrupted objects have a redundant copy in the same 
repository, but it wouldn't stop complaining about the existence of 
corrupted data.  Doing a 'git gc' or 'git repack -a' might get rid of 
the corruption.  And a 'git pull' might hide it if the received pack 
during the pull operation happens to contain another copy of the object 
that was corrupted before, but it wouldn't prevent 'git fsck --full' 
from seeing it.

The fact that you cannot reproduce the corruption issues after 
unarchiving a repository that was known to have problems right before it 
was archived is really really strange.  That does not rule out git bugs 
of course, but at least this shows that no actual corruption on disk was 
initially involved.

Again, I'd suggest you perform your git usage within a script session so 
to capture the exact operation performed and error messages produced 
when/if similar problems do occur again.  Otherwise we're only running 
after our tail.

Nicolas

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-07  7:41         ` R. Tyler Ballance
                             ` (2 preceding siblings ...)
  2009-01-07 15:31           ` Nicolas Pitre
@ 2009-01-07 16:07           ` Linus Torvalds
  2009-01-07 16:08             ` Linus Torvalds
  2009-01-07 22:55             ` R. Tyler Ballance
  3 siblings, 2 replies; 49+ messages in thread
From: Linus Torvalds @ 2009-01-07 16:07 UTC (permalink / raw)
  To: R. Tyler Ballance; +Cc: Nicolas Pitre, Jan Krüger, Git ML

On Tue, 6 Jan 2009, R. Tyler Ballance wrote:
> > 
> > The thing to do is
> > 
> >  - untar it on some trusted machine with a local disk and a known-good 
> >    filesystem.
> > 
> >    IOW, not that networked samba share.
> > 
> >  - verify that it really does happen on that machine, with that untarred 
> >    image. Because maybe it doesn't. 
> 
> Unfortunately it doesn't

Well, that's not necessarily "unfortunate". It does actually end up 
showing that the objects themselves were apparently never really corrupt.

So there is no fundamental data structure corrupttion - because when you 
copy the repository, it's all good agin!

In other words, that's not a worthless piece of information at all, and it 
does tell us a lot, namely that the corruption was never real long-term 
data corruption of the git object archive, but something local and 
temporary. Again, we're really back to either:

 - it could be some _temporary_ git corruption caused internally inside a 
   git process - ie a wild pointer, or perhaps a race condition (but we 
   don't really use threading in 1.6.0.4 unless you ask for it, and even 
   then just for pack-file generation)

 - or it's the disk cache corruption, and the tar/untar ended up flushing 
   it.

And quite frankly, since the corruption seems to be site-specific, I 
really do suspect the second case. Although it's possible, of course, that 
it could be some compiler issue that makes _your_ binaries have issues 
even when nobody else sees it.

> what I did notice was this when I did a `git
> status` in the directory right after untarring:
>         tyler@grapefruit:~/jburgess_main> git status
>         #
>         # ---impressive amount of file names fly by---
>         # ----snip---
>         #
>         # Untracked files:
>         #   (use "git add <file>..." to include in what will be
>         committed)
>         #
>         #       artwork/
>         #       bt/
>         #       flash/

Hmm. That's actually _normal_ under some circumstances. At least with 
older git versions, or if your .git/index file couldn't be rewritten for 
some reason - your existing index file contains all the old stat 
information, and if git cannot (or, in the case of older git version, just 
will not) refresh it automatically, it will show all the files as changed, 
even if it's just the inode number that really changed.

A _normal_ git install should have auto-refreshed the index, though. 
Unless the tar archive only contained the ".git" directory, and not the 
checkout?

>         tyler@grapefruit:~/jburgess_main>
> 
> Basically, somehow Git thinks that *every* file in the repository is
> deleted at this point. I went ahead and performed a `git reset --hard`
> to see if the issue would manifest itself thereafter, but it did not.

That would be what I'd expect if you had only tarred up .git, although 
then I wouldn't have expected those "Untracked files:". Hmm. Without being 
able to look at the archive, I'm just guessing randomly.

> I did try to do a git-fsck(1), and this is what I got:
>         tyler@grapefruit:~/jburgess_main> /usr/local/bin/git fsck --full
>         [1]    19381 segmentation fault  /usr/local/bin/git fsck --full
>         tyler@grapefruit:~/jburgess_main> 

.. and that's the unrelated fsck bug that got fixed later.

> >    The hope is that you caught the corruption in the cache, and it 
> >    actually got written out to the tar-file. But if it _is_ a disk cache 
> >    (well, network cache) issue, maybe the IO required to tar everything up 
> >    was enough to flush it, and the tar-file actually _works_ because it 
> >    got repopulated correctly.
> 
> When I was working through this with Jan, one of the things that we did
> was move the actual object file in .git/objects, they existed so maybe I
> could look into those to check?

Yes. If you have any bad loose objects, if you compare them to the good 
objects with the same name, that's going to be interesting information. 
The pattern of corruption can be very telling. For example, on Linux, a 
disk cache corruption would usually be at 4kB block boundaries, because 
that's the granularity of the cache. While a bit error would be obvious 
etc etc.

> I checked with our operations team, and contrary to my suspicion (your
> NFS comment piqued my curiosity), these disks that are actually on the
> machines are not NFS mounts but rather local disk arrays.

Ok, that generally makes caching much simpler. What filesystem?

Is there anything else that you do that is site-specific and/or slightly 
different? For example, a long time ago we had a bug related to CRLF 
conversion which would cause a use-after-free problem, and that would 
corrupt the data internally to git.

And dobody else saw it than this one person, and it was a total mystery to 
everybody until we realized that he used this one feature that nobody else 
was using. So as you're on OS X, I assume you don't have CRLF conversion, 
but maybe you use some other feature that we support but nobody really 
actually uses. Like keyword expansion or something?

Oh - that would also explain why you got all those entries in "git status" 
that went away when you did a "git reset --hard": if you had some keyword 
expansion (or CRLF) enabled in the original users "~/.gitconfig", that 
checkout would have had expansion/CRLF/whatever conversion, but then when 
you tarred/untarred it on another setup, the expansion would be seen as a 
difference because it wasn't enabled.

Hmm?

		Linus

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-07 16:07           ` Linus Torvalds
@ 2009-01-07 16:08             ` Linus Torvalds
  2009-01-07 22:55             ` R. Tyler Ballance
  1 sibling, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2009-01-07 16:08 UTC (permalink / raw)
  To: R. Tyler Ballance; +Cc: Nicolas Pitre, Jan Krüger, Git ML



On Wed, 7 Jan 2009, Linus Torvalds wrote:
> 
> And dobody else saw it than this one person, and it was a total mystery to 
> everybody until we realized that he used this one feature that nobody else 
> was using. So as you're on OS X, I assume you don't have CRLF conversion, 
> but maybe you use some other feature that we support but nobody really 
> actually uses. Like keyword expansion or something?
> 
> Oh - that would also explain why you got all those entries in "git status" 
> that went away when you did a "git reset --hard": if you had some keyword 
> expansion (or CRLF) enabled in the original users "~/.gitconfig", that 
> checkout would have had expansion/CRLF/whatever conversion, but then when 
> you tarred/untarred it on another setup, the expansion would be seen as a 
> difference because it wasn't enabled.

Btw, if you untar it again, and just do a "git diff", that should show any 
such effects. Rather than showing just that something changed, it should 
show _how_ it changed.

		Linus

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-07 16:07           ` Linus Torvalds
  2009-01-07 16:08             ` Linus Torvalds
@ 2009-01-07 22:55             ` R. Tyler Ballance
  2009-01-07 23:29               ` Linus Torvalds
  1 sibling, 1 reply; 49+ messages in thread
From: R. Tyler Ballance @ 2009-01-07 22:55 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, Jan Krüger, Git ML

[-- Attachment #1: Type: text/plain, Size: 4576 bytes --]

On Wed, 2009-01-07 at 08:07 -0800, Linus Torvalds wrote:
> Well, that's not necessarily "unfortunate". It does actually end up 
> showing that the objects themselves were apparently never really corrupt.
> 
> So there is no fundamental data structure corrupttion - because when you 
> copy the repository, it's all good agin!
>  - it could be some _temporary_ git corruption caused internally inside a 
>    git process - ie a wild pointer, or perhaps a race condition (but we 
>    don't really use threading in 1.6.0.4 unless you ask for it, and even 
>    then just for pack-file generation)

I have a feeling it's something like this, one of our operations guys
did some research while I was looking at code and he came across this:

        On Wed, 2009-01-07 at 14:17 -0800, Ken Brownfield wrote:
        git-merge is using too much RAM, and failing to malloc() but
        NOT  
        > reporting it.  This is all sorts of bad:
        > 
        >   A) using an unscalable amount of RAM
        >   B) failing to detect malloc() failure
        >   C) reporting file corruption instead
        > I was able to reproduce this.
        >
        > limit ~1.5GB -> corrupt file
        > limit ~3GB -> magically no longer corrupt.
        >
        > The false fail may be limited to git-merge, but git status also  
        > allocates the same amount of RAM.
        > 
        > To temporarily work around this problem, issue this once you
        log in to  
        > a dev box:
        > 
        > tcsh:
        >         limit vmemoryuse 3000000
        > bash:
        >         ulimit -v 3000000
        > 
        > Be gentle.
        

> And quite frankly, since the corruption seems to be site-specific, I 
> really do suspect the second case. Although it's possible, of course, that 
> it could be some compiler issue that makes _your_ binaries have issues 
> even when nobody else sees it.

I think you're correct insofar that our major site-specific alteration
has come up on the mailing list before (okay maybe two site-specific
things). 
	* Our Git repo is ~7.1GB
	* ulimit -v is set to ~1.5G


I think I know how this could be failing and corrupting things (assuming
it's malloc(2)) related.


What I'm thinking is that in xmalloc() or one of the other x*)_
functions, the malloc(size) is failing because of the ulimits, and then
the potentially somewhere it's silently failing or maybe even
accidentally returning one of those "malloc(1)" pointers?

I've got two new tarred repositories from two developers the issue
happened to today, so I'm flush full of sample repositories to try stuff
on :)


> 
> Hmm. That's actually _normal_ under some circumstances. At least with 
> older git versions, or if your .git/index file couldn't be rewritten for 
> some reason - your existing index file contains all the old stat 
> information, and if git cannot (or, in the case of older git version, just 
> will not) refresh it automatically, it will show all the files as changed, 
> even if it's just the inode number that really changed.
> 
> A _normal_ git install should have auto-refreshed the index, though. 
> Unless the tar archive only contained the ".git" directory, and not the 
> checkout?

I believe the issues I noticed when untarring the repo were a red
herring, I did the `git diff` after untarring and I noticed that only a
certain set of files where changed, I'm willing to go so far as to guess
that they were the files affected in the corrupted packs. Of the 32k
files in our repository, 98 were actually different after untarring
(according to git-diff(1))

> And dobody else saw it than this one person, and it was a total mystery to 
> everybody until we realized that he used this one feature that nobody else 
> was using. So as you're on OS X, I assume you don't have CRLF conversion, 
> but maybe you use some other feature that we support but nobody really 
> actually uses. Like keyword expansion or something?

The two new folks this happened to today had nothing "special" about
them other than the ulimit.


I've got the script(1) output of performing git-ls-files(1) and some
other commands that I tried, nothing they output was particular
informative or interesting, and I don't think it will help if this
really is a memory related issue, that said I'd be more than happy to
send it to a couple of you (Junio, Linus, Nico).


I'm *so* ready for this bug to die >=\


Cheers

-- 
-R. Tyler Ballance
Slide, Inc.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-07 22:55             ` R. Tyler Ballance
@ 2009-01-07 23:29               ` Linus Torvalds
  2009-01-08  0:28                 ` Public repro case! " R. Tyler Ballance
  2009-01-08  0:37                 ` [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file Linus Torvalds
  0 siblings, 2 replies; 49+ messages in thread
From: Linus Torvalds @ 2009-01-07 23:29 UTC (permalink / raw)
  To: R. Tyler Ballance; +Cc: Nicolas Pitre, Jan Krüger, Git ML

On Wed, 7 Jan 2009, R. Tyler Ballance wrote:
>
> >    git process - ie a wild pointer, or perhaps a race condition (but we 
> >    don't really use threading in 1.6.0.4 unless you ask for it, and even 
> >    then just for pack-file generation)
> 
> I have a feeling it's something like this, one of our operations guys
> did some research while I was looking at code and he came across this:
> 
>         On Wed, 2009-01-07 at 14:17 -0800, Ken Brownfield wrote:
>         git-merge is using too much RAM, and failing to malloc() but
>         NOT  
>         > reporting it.  This is all sorts of bad:
>         > 
>         >   A) using an unscalable amount of RAM
>         >   B) failing to detect malloc() failure
>         >   C) reporting file corruption instead

Well, I dont' think that's exactly it. git internally doesn't really use 
malloc at all, and uses xmalloc() instead which will die() if the malloc 
fails. So there's almost certainly no "failing to detect failures"

Yes, there's a few places that don't use the wrapper, but they should be 
safe (eg either they SIGSEGV, or they are like create_delta_index() and 
just create a sub-optimal pack with a warning).

HOWEVER:

>         > I was able to reproduce this.
>         >
>         > limit ~1.5GB -> corrupt file
>         > limit ~3GB -> magically no longer corrupt.

That is interesting, although I also worry that there might be other 
issues going on (ie since you've reported thigns magically fixing 
themselves, maybe the ulimit tests just _happened_ to show that, even if 
it wasn't the core reason).

BUT! This is definitely worth looking at.

For example, we do have some cases where we try to do "mmap()", and if it 
fails, we try to free some memory and try again. In particular, in 
xmmap(), if an mmap() fails - which may be due to running out of virtual 
address space - we'll actually try to release some pack-file memory and 
try again. Maybe there's a bug there - and it would be one that seldom 
triggers for others.

> I think you're correct insofar that our major site-specific alteration
> has come up on the mailing list before (okay maybe two site-specific
> things). 
> 	* Our Git repo is ~7.1GB
> 	* ulimit -v is set to ~1.5G

It is certainly possible. It's too bad that it's private, because it makes 
it _much_ harder to try to pinpoint this.

				Linus

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Public repro case! Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-07 23:29               ` Linus Torvalds
@ 2009-01-08  0:28                 ` R. Tyler Ballance
  2009-01-08  0:48                   ` Linus Torvalds
  2009-01-08  2:52                   ` Linus Torvalds
  2009-01-08  0:37                 ` [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file Linus Torvalds
  1 sibling, 2 replies; 49+ messages in thread
From: R. Tyler Ballance @ 2009-01-08  0:28 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, Jan Krüger, Git ML, kb

[-- Attachment #1: Type: text/plain, Size: 2370 bytes --]

On Wed, 2009-01-07 at 15:29 -0800, Linus Torvalds wrote:
> It is certainly possible. It's too bad that it's private, because it makes 
> it _much_ harder to try to pinpoint this.

My most esteemed colleague (Ken aka kb) who pointed out the memory issue
was on the right path (I think), and I have a reproduction case you can
try with your very own Linux kernel tree!

WOO!

I set ulimit -v really low (150M), and the operations I made got an
mmap(2) fatal error, but there is a sweet spot that I found, see the
transcript below. I basically chose an arbitrary revision from a couple
of weeks ago, and rolled the repository back to that point, then I tried
with iterations of ulimit -v 150, 250, 450, and then back down to 350.

        tyler@grapefruit:~/source/git/linux-2.6> limit
        cputime         unlimited
        filesize        unlimited
        datasize        unlimited
        stacksize       8MB
        coredumpsize    0kB
        memoryuse       2561MB
        maxproc         24564
        descriptors     1024
        memorylocked    64kB
        addressspace    unlimited
        maxfilelocks    unlimited
        sigpending      24564
        msgqueue        819200
        nice            0
        rt_priority     0
        tyler@grapefruit:~/source/git/linux-2.6> export
        START=56d18e9932ebf4e8eca42d2ce509450e6c9c1666
        tyler@grapefruit:~/source/git/linux-2.6> git reset --hard $START
        HEAD is now at 56d18e9 Merge branch 'upstream' of
        git://ftp.linux-mips.org/pub/scm/upstream-linus
        tyler@grapefruit:~/source/git/linux-2.6> ulimit -v `echo "350 *
        1024" | bc -l`
        tyler@grapefruit:~/source/git/linux-2.6> git pull
        error: failed to read object
        be1b87c70af69acfadb8a27a7a76dfb61de92643 at offset 1850923
        from .git/objects/pack/pack-dbe154052997a05499eb6b4fd90b924da68e799a.pack
        fatal: object be1b87c70af69acfadb8a27a7a76dfb61de92643 is
        corrupted
        tyler@grapefruit:~/source/git/linux-2.6>
        
I've tried this a couple of times, and it does seem to be reproducible,
let me know if you have any issues reproducing it locally and I'll try
to dig into it more with valgrind or something a bit more pin-pointing
than "ulimit -v && try, try again"


Cheers
-- 
-R. Tyler Ballance
Slide, Inc.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Public repro case! Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-08  0:28                 ` Public repro case! " R. Tyler Ballance
@ 2009-01-08  0:48                   ` Linus Torvalds
  2009-01-08  0:57                     ` R. Tyler Ballance
  2009-01-08  2:21                     ` James Pickens
  2009-01-08  2:52                   ` Linus Torvalds
  1 sibling, 2 replies; 49+ messages in thread
From: Linus Torvalds @ 2009-01-08  0:48 UTC (permalink / raw)
  To: R. Tyler Ballance; +Cc: Nicolas Pitre, Jan Krüger, Git ML, kb

On Wed, 7 Jan 2009, R. Tyler Ballance wrote:
>
> My most esteemed colleague (Ken aka kb) who pointed out the memory issue
> was on the right path (I think), and I have a reproduction case you can
> try with your very own Linux kernel tree!
> 
> WOO!
> 
> I set ulimit -v really low (150M), and the operations I made got an
> mmap(2) fatal error, but there is a sweet spot that I found, see the
> transcript below.

This is indeed the packfile mapping. The sweet spot you found depends on 
how big the biggest two pack-files are, I do believe.

And if you do that

	[core]
		packedgitwindowsize = 64M

I think you'll find that it works. Of course, with a _really_ low ulimit, 
you'd need to make it even smaller, but at some point you start hitting 
other problems than the pack-file limits, ie just the simple fact that git 
wants and expects you to have a certain amount of memory available ;)

Can you cnfirm that your "reproducible" case starts working with that 
addition to your ~/.gitconfig? If so, the solution is pretty simple: we 
should just lower the default pack windowsize.

		Linus

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Public repro case! Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-08  0:48                   ` Linus Torvalds
@ 2009-01-08  0:57                     ` R. Tyler Ballance
  2009-01-08  1:08                       ` Linus Torvalds
  2009-01-08  2:21                     ` James Pickens
  1 sibling, 1 reply; 49+ messages in thread
From: R. Tyler Ballance @ 2009-01-08  0:57 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, Jan Krüger, Git ML, kb

[-- Attachment #1: Type: text/plain, Size: 1584 bytes --]

On Wed, 2009-01-07 at 16:48 -0800, Linus Torvalds wrote:
> 
> On Wed, 7 Jan 2009, R. Tyler Ballance wrote:
> >
> > My most esteemed colleague (Ken aka kb) who pointed out the memory issue
> > was on the right path (I think), and I have a reproduction case you can
> > try with your very own Linux kernel tree!
> > 
> > WOO!
> > 
> > I set ulimit -v really low (150M), and the operations I made got an
> > mmap(2) fatal error, but there is a sweet spot that I found, see the
> > transcript below.
> 
> This is indeed the packfile mapping. The sweet spot you found depends on 
> how big the biggest two pack-files are, I do believe.
> 
> And if you do that
> 
> 	[core]
> 		packedgitwindowsize = 64M
> 
> I think you'll find that it works. Of course, with a _really_ low ulimit, 
> you'd need to make it even smaller, but at some point you start hitting 
> other problems than the pack-file limits, ie just the simple fact that git 
> wants and expects you to have a certain amount of memory available ;)
> 
> Can you cnfirm that your "reproducible" case starts working with that 
> addition to your ~/.gitconfig? If so, the solution is pretty simple: we 
> should just lower the default pack windowsize.

This certainly corrected the issue, is there some magic
packedgitwindowsize that i should be looking at my own repository (our
internal one) in order to prevent the issue from occurring? 

Looking into .git/objects/pack, I think the two biggest pack files are
3.5G and 177MBG respectively :-!


Cheers
-- 
-R. Tyler Ballance
Slide, Inc.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Public repro case! Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-08  0:57                     ` R. Tyler Ballance
@ 2009-01-08  1:08                       ` Linus Torvalds
  2009-01-08  1:29                         ` Linus Torvalds
  0 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2009-01-08  1:08 UTC (permalink / raw)
  To: R. Tyler Ballance; +Cc: Nicolas Pitre, Jan Krüger, Git ML, kb

On Wed, 7 Jan 2009, R. Tyler Ballance wrote:
> > 
> > Can you cnfirm that your "reproducible" case starts working with that 
> > addition to your ~/.gitconfig? If so, the solution is pretty simple: we 
> > should just lower the default pack windowsize.
> 
> This certainly corrected the issue, is there some magic
> packedgitwindowsize that i should be looking at my own repository (our
> internal one) in order to prevent the issue from occurring? 
> 
> Looking into .git/objects/pack, I think the two biggest pack files are
> 3.5G and 177MBG respectively :-!

So there's a few rules to packedgitwindowsize:

 - we need to be able to have at least two windows open at a time, in 
   addition to all the "normal" memory git needs just for objects, of 
   course. And quite frankly, you'd be better off with a few more windows, 
  even if that obviously implies smaller windows.

 - the window size really wants to be a round power-of-two number, and at 
   _least_ it wants to be a nice multiple of the 2*page size.

So if you have a virtual memory limit of 1.5GB, I'd hesitate to make the 
pack window size less than 512M, and 256M is probably better. That way, 
I'd expect you to be able to always have at least four windows open 
(assuming a reasonably generous half a gigabyte for "other stuff").

And quite frankly, there's not a huge downside to making them smaller. At 
"just" 32MB, you'll still fit plenty of data in one pack window, and while 
it will cost you a few mmap/unmap's to switch windows around, most 
operations simply will not likely ever notice. At least not under Linux, 
where mmap/munmap is pretty cheap.

		Linus

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Public repro case! Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-08  1:08                       ` Linus Torvalds
@ 2009-01-08  1:29                         ` Linus Torvalds
  2009-01-08  1:46                           ` Shawn O. Pearce
  0 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2009-01-08  1:29 UTC (permalink / raw)
  To: R. Tyler Ballance; +Cc: Nicolas Pitre, Jan Krüger, Git ML, kb



On Wed, 7 Jan 2009, Linus Torvalds wrote:
> 
> So there's a few rules to packedgitwindowsize:
> 
>  - we need to be able to have at least two windows open at a time, in 
>    addition to all the "normal" memory git needs just for objects, of 
>    course. And quite frankly, you'd be better off with a few more windows, 
>    even if that obviously implies smaller windows.

Btw, I'm not 100% certain of this. Somebody should double-check me. Maybe 
there are cases where we want more than two windows alive. And maybe there 
aren't even that, and we can always make do with just one.

So I will _not_ guarantee that "at least two pack windows" is necessarily 
the right answer. The windowing code was mostly other people doing it. I 
think Shawn and Nico.

			Linus

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Public repro case! Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-08  1:29                         ` Linus Torvalds
@ 2009-01-08  1:46                           ` Shawn O. Pearce
  0 siblings, 0 replies; 49+ messages in thread
From: Shawn O. Pearce @ 2009-01-08  1:46 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: R. Tyler Ballance, Nicolas Pitre, Jan Krüger, Git ML, kb

Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Wed, 7 Jan 2009, Linus Torvalds wrote:
> > 
> > So there's a few rules to packedgitwindowsize:
> > 
> >  - we need to be able to have at least two windows open at a time, in 
> >    addition to all the "normal" memory git needs just for objects, of 
> >    course. And quite frankly, you'd be better off with a few more windows, 
> >    even if that obviously implies smaller windows.
> 
> Btw, I'm not 100% certain of this. Somebody should double-check me. Maybe 
> there are cases where we want more than two windows alive. And maybe there 
> aren't even that, and we can always make do with just one.
> 
> So I will _not_ guarantee that "at least two pack windows" is necessarily 
> the right answer. The windowing code was mostly other people doing it. I 
> think Shawn and Nico.

I was fairly certain we needed at least two windows open at once,
but reviewing the code in sha1_file.c I don't see a reason for that
restriction anymore.

I think it used to have to do with the delta reconstruction; to
unpack a delta we would read the delta header from one window,
but we may need base data from another.  The delta unpack code
kept the delta window pinned in use, so we couldn't replace it
to access base data from elsewhere, hence we needed two windows.
Thinking about it now I don't recall how we handled the recusion
on a delta chain longer than 2.  ;-)

But looking at the code we have long since refactored it so this
isn't an issue anymore.  We release the window between reading
the delta header and reading the base, so the delta window can be
replaced if necessary.  I think the "2 window minimum" is just a
performance suggestion, not a requirement.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Public repro case! Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-08  0:48                   ` Linus Torvalds
  2009-01-08  0:57                     ` R. Tyler Ballance
@ 2009-01-08  2:21                     ` James Pickens
  2009-01-08  2:43                       ` Shawn O. Pearce
  2009-01-08  2:52                       ` Boyd Stephen Smith Jr.
  1 sibling, 2 replies; 49+ messages in thread
From: James Pickens @ 2009-01-08  2:21 UTC (permalink / raw)
  To: Git ML
  Cc: R. Tyler Ballance, Nicolas Pitre, Jan Krüger, kb,
	Linus Torvalds

On Wed, Jan 7, 2009, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Can you cnfirm that your "reproducible" case starts working with that
> addition to your ~/.gitconfig? If so, the solution is pretty simple: we
> should just lower the default pack windowsize.

Umm... isn't that more of a workaround than a solution?  I.e., if you lower
the default pack windowsize, couldn't the corruption still happen under the
right conditions?

James

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Public repro case! Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-08  2:21                     ` James Pickens
@ 2009-01-08  2:43                       ` Shawn O. Pearce
  2009-01-08  5:40                         ` Junio C Hamano
  2009-01-08  2:52                       ` Boyd Stephen Smith Jr.
  1 sibling, 1 reply; 49+ messages in thread
From: Shawn O. Pearce @ 2009-01-08  2:43 UTC (permalink / raw)
  To: James Pickens
  Cc: Git ML, R. Tyler Ballance, Nicolas Pitre, Jan Krüger, kb,
	Linus Torvalds

James Pickens <jepicken@gmail.com> wrote:
> On Wed, Jan 7, 2009, Linus Torvalds <torvalds@linux-foundation.org> wrote:
> > Can you cnfirm that your "reproducible" case starts working with that
> > addition to your ~/.gitconfig? If so, the solution is pretty simple: we
> > should just lower the default pack windowsize.
> 
> Umm... isn't that more of a workaround than a solution?  I.e., if you lower
> the default pack windowsize, couldn't the corruption still happen under the
> right conditions?

Uhm, yea.  So I managed to reproduce it on a Linux system here.
Different object ids than R. Tyler's case, but I'm going to try
to debug it and see why we are getting these.

For those following along at home, Linus' 2.6 tree:

$ ulimit -v `echo '150 * 1024'|bc -l`
$ git co 56d18e9932ebf4e8eca42d2ce509450e6c9c1666
HEAD is now at 56d18e9... Merge branch 'upstream' of git://ftp.linux-mips.org/pub/scm/upstream-linus
$ git merge 9e42d0cf5020aaf217433cad1a224745241d212a
Updating 56d18e9..9e42d0c
error: failed to read delta base object ef135b90084f3c54fccea4e273aeff029db2d873 at offset 48342508 from .git/objects/pack/pack-edb47354be787909e05c15bd1d9eb8b4684d2e4d.pack
error: failed to read delta base object c4e828b71d96622bb258938d69aab9cec53d5cae at offset 128427683 from .git/objects/pack/pack-edb47354be787909e05c15bd1d9eb8b4684d2e4d.pack
error: failed to read object 3cd5a6463cfd9306095bf6312a9b7ab09d4f2f5d at offset 128427777 from .git/objects/pack/pack-edb47354be787909e05c15bd1d9eb8b4684d2e4d.pack
fatal: object 3cd5a6463cfd9306095bf6312a9b7ab09d4f2f5d is corrupted

No, the repository is not corrupt.  We f'd up our memory management
somewhere.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Public repro case! Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-08  2:43                       ` Shawn O. Pearce
@ 2009-01-08  5:40                         ` Junio C Hamano
  2009-01-08  6:04                           ` Shawn O. Pearce
  0 siblings, 1 reply; 49+ messages in thread
From: Junio C Hamano @ 2009-01-08  5:40 UTC (permalink / raw)
  To: Shawn O. Pearce
  Cc: James Pickens, Git ML, R. Tyler Ballance, Nicolas Pitre,
	Jan Krüger, kb, Linus Torvalds

"Shawn O. Pearce" <spearce@spearce.org> writes:

> For those following along at home, Linus' 2.6 tree:
>
> $ ulimit -v `echo '150 * 1024'|bc -l`
> $ git co 56d18e9932ebf4e8eca42d2ce509450e6c9c1666

Hmm, without any "wrap zlib to die on error" patch, this step already
fails with:

    $ git checkout 56d18e9932ebf4e8eca42d2ce509450e6c9c1666
    fatal: Out of memory? mmap failed: Cannot allocate memory

I guess that is because our test repositories are packed differently.
I'll retry after repacking..

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Public repro case! Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-08  5:40                         ` Junio C Hamano
@ 2009-01-08  6:04                           ` Shawn O. Pearce
  0 siblings, 0 replies; 49+ messages in thread
From: Shawn O. Pearce @ 2009-01-08  6:04 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: James Pickens, Git ML, R. Tyler Ballance, Nicolas Pitre,
	Jan Krüger, kb, Linus Torvalds

Junio C Hamano <gitster@pobox.com> wrote:
> "Shawn O. Pearce" <spearce@spearce.org> writes:
> > For those following along at home, Linus' 2.6 tree:
> >
> > $ ulimit -v `echo '150 * 1024'|bc -l`
> > $ git co 56d18e9932ebf4e8eca42d2ce509450e6c9c1666
> 
> Hmm, without any "wrap zlib to die on error" patch, this step already
> fails with:
> 
>     $ git checkout 56d18e9932ebf4e8eca42d2ce509450e6c9c1666
>     fatal: Out of memory? mmap failed: Cannot allocate memory
> 
> I guess that is because our test repositories are packed differently.
> I'll retry after repacking..

Yup.  I actually did something more like this to get the test
repository:

  git clone git://android.git.kernel.org/kernel/common.git
  git fetch git://kernel.org/pub/.../torvalds/linux-2.6.git master

The android kernel repository I had handy on my local system was
quite a bit away from Linus' so I wound up with two different but
sizable packs.  I thought android was closer to upstream, but its
apparently not.  I started from there because it was local and I
thought it would be a quick way to get a test environment, but
sadly it didn't even have the base 56d18e we were talking about.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Public repro case! Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-08  2:21                     ` James Pickens
  2009-01-08  2:43                       ` Shawn O. Pearce
@ 2009-01-08  2:52                       ` Boyd Stephen Smith Jr.
  1 sibling, 0 replies; 49+ messages in thread
From: Boyd Stephen Smith Jr. @ 2009-01-08  2:52 UTC (permalink / raw)
  To: James Pickens
  Cc: Git ML, R. Tyler Ballance, Nicolas Pitre, Jan Krüger, kb,
	Linus Torvalds

[-- Attachment #1: Type: text/plain, Size: 945 bytes --]

On Wednesday 2009 January 07 20:21:18 James Pickens wrote:
>On Wed, Jan 7, 2009, Linus Torvalds <torvalds@linux-foundation.org> wrote:
>> Can you cnfirm that your "reproducible" case starts working with that
>> addition to your ~/.gitconfig? If so, the solution is pretty simple: we
>> should just lower the default pack windowsize.
>
>Umm... isn't that more of a workaround than a solution?  I.e., if you lower
>the default pack windowsize, couldn't the corruption still happen under the
>right conditions?

IMHO:
I agree, somewhat.  I'm fine with a "die()" message when there's not enough 
memory, but either corruption or just a spurious, but scary "<SHA> is 
corrupt" messages should be fixed.
-- 
Boyd Stephen Smith Jr.                     ,= ,-_-. =. 
bss@iguanasuicide.net                     ((_/)o o(\_))
ICQ: 514984 YM/AIM: DaTwinkDaddy           `-'(. .)`-' 
http://iguanasuicide.net/                      \_/     

[-- Attachment #2: This is a digitally signed message part. --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Public repro case! Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-08  0:28                 ` Public repro case! " R. Tyler Ballance
  2009-01-08  0:48                   ` Linus Torvalds
@ 2009-01-08  2:52                   ` Linus Torvalds
  2009-01-08  3:01                     ` Shawn O. Pearce
  1 sibling, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2009-01-08  2:52 UTC (permalink / raw)
  To: R. Tyler Ballance; +Cc: Nicolas Pitre, Jan Krüger, Git ML, kb

On Wed, 7 Jan 2009, R. Tyler Ballance wrote:
>
>         tyler@grapefruit:~/source/git/linux-2.6> git pull
>         error: failed to read object be1b87c70af69acfadb8a27a7a76dfb61de92643 at offset 1850923
>         from .git/objects/pack/pack-dbe154052997a05499eb6b4fd90b924da68e799a.pack
>         fatal: object be1b87c70af69acfadb8a27a7a76dfb61de92643 is corrupted

Btw, this is an interesting error message, mostly because of what is 
_not_ there.

In particular, it doesn't report any reason _why_ it failed to read the 
object, which as far as I can tell can happen for only one reason: 
unpack_compressed_entry() returns NULL, and that path is the only thing 
that can do so without a message.

And it only does it if zlib fails.

Now, zlib can fail because the unpacking fails, but it can fail for other 
cases too.

And the thing is, we don't check/report those kinds of failure cases very 
well. Which really bit us here, because if we had checked the return value 
of inflateInit(), we'd almost certainly would have gotten a nice big "you 
ran out of memory" thing, and we wouldn't have been chasing this down as a 
corruption issue.

We probably should wrap all the "inflateInit()" calls, and do something 
like

	static void xinflateInit(z_streamp strm)
	{
		const char *err;

		switch (inflateInit(strm)) {
		case Z_OK:
			return;
		case Z_MEM_ERROR:
			err = "out of memory";
			break;
		case Z_VERSION_ERROR:
			err = "wrong version";
			break;
		default:
			err = "error";
		}
		die("inflateInit: %s (%s)", err,
			strm->msg ? strm->msg : "no message");
	}

or similar. That way we'd get good error reports when we run out of 
memory, rather than consider it to be a corruption issue.

We could also try to make a few of these wrappers actually release some of 
the memory (the way xmmap() does), but there are likely diminishing 
returns. And the much more important issue is the proper error reporting: 
if we had reported "out of memory", we'd not have spent so much time 
chasing disk corruption etc.

			Linus

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Public repro case! Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-08  2:52                   ` Linus Torvalds
@ 2009-01-08  3:01                     ` Shawn O. Pearce
  2009-01-08  3:06                       ` Linus Torvalds
  0 siblings, 1 reply; 49+ messages in thread
From: Shawn O. Pearce @ 2009-01-08  3:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: R. Tyler Ballance, Nicolas Pitre, Jan Krüger, Git ML, kb

Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Wed, 7 Jan 2009, R. Tyler Ballance wrote:
> >
> >         tyler@grapefruit:~/source/git/linux-2.6> git pull
> >         error: failed to read object be1b87c70af69acfadb8a27a7a76dfb61de92643 at offset 1850923
> >         from .git/objects/pack/pack-dbe154052997a05499eb6b4fd90b924da68e799a.pack
> >         fatal: object be1b87c70af69acfadb8a27a7a76dfb61de92643 is corrupted
> 
> Btw, this is an interesting error message, mostly because of what is 
> _not_ there.
> 
> In particular, it doesn't report any reason _why_ it failed to read the 
> object, which as far as I can tell can happen for only one reason: 
> unpack_compressed_entry() returns NULL, and that path is the only thing 
> that can do so without a message.
> 
> And it only does it if zlib fails.

Ok, well, in this case I've been able to reproduce a zlib inflate
failure on the base object in a 2 deep delta chain.  We got back:

  #define Z_STREAM_ERROR (-2)

this causes the buffer to be freed and NULL to come back out of
unpack_compressed_entry(), and then everything is corrupt...

-- 
Shawn.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Public repro case! Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-08  3:01                     ` Shawn O. Pearce
@ 2009-01-08  3:06                       ` Linus Torvalds
  2009-01-08  3:13                         ` Shawn O. Pearce
  0 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2009-01-08  3:06 UTC (permalink / raw)
  To: Shawn O. Pearce
  Cc: R. Tyler Ballance, Nicolas Pitre, Jan Krüger, Git ML, kb



On Wed, 7 Jan 2009, Shawn O. Pearce wrote:
> 
> Ok, well, in this case I've been able to reproduce a zlib inflate
> failure on the base object in a 2 deep delta chain.  We got back:
> 
>   #define Z_STREAM_ERROR (-2)
> 
> this causes the buffer to be freed and NULL to come back out of
> unpack_compressed_entry(), and then everything is corrupt...

I bet you actually got an earlier error already from the inflateInit. 

The Z_STREAM_ERROR probably comes from inflate() itself - and could very 
easily be due to a allocation error in inflateInit leaving the stream data 
incomplete.

Let me try wrapping that dang thing and send a patch. 

		Linus

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: Public repro case! Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-08  3:06                       ` Linus Torvalds
@ 2009-01-08  3:13                         ` Shawn O. Pearce
  2009-01-08  3:16                           ` [PATCH] Wrap inflateInit to retry allocation after releasing pack memory Shawn O. Pearce
  0 siblings, 1 reply; 49+ messages in thread
From: Shawn O. Pearce @ 2009-01-08  3:13 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: R. Tyler Ballance, Nicolas Pitre, Jan Krüger, Git ML, kb

Linus Torvalds <torvalds@linux-foundation.org> wrote:
> The Z_STREAM_ERROR probably comes from inflate() itself - and could very 
> easily be due to a allocation error in inflateInit leaving the stream data 
> incomplete.
> 
> Let me try wrapping that dang thing and send a patch. 

Yup.  I'm actually doing the same thing...

-- 
Shawn.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH] Wrap inflateInit to retry allocation after releasing pack memory
  2009-01-08  3:13                         ` Shawn O. Pearce
@ 2009-01-08  3:16                           ` Shawn O. Pearce
  2009-01-08  3:54                             ` Linus Torvalds
  0 siblings, 1 reply; 49+ messages in thread
From: Shawn O. Pearce @ 2009-01-08  3:16 UTC (permalink / raw)
  To: Linus Torvalds, Junio C Hamano
  Cc: R. Tyler Ballance, Nicolas Pitre, Jan Krüger, Git ML, kb

If we are running low on virtual memory we should release pack
windows if zlib's inflateInit fails due to an out of memory error.
It may be that we are running under a low ulimit and are getting
tight on address space.  Shedding unused windows may get us
sufficient working space to continue.

Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Shawn O. Pearce <spearce@spearce.org>
---
  "Shawn O. Pearce" <spearce@spearce.org> wrote:
  > Linus Torvalds <torvalds@linux-foundation.org> wrote:
  > > The Z_STREAM_ERROR probably comes from inflate() itself - and could very 
  > > easily be due to a allocation error in inflateInit leaving the stream data 
  > > incomplete.
  > > 
  > > Let me try wrapping that dang thing and send a patch. 
  > 
  > Yup.  I'm actually doing the same thing...

 builtin-apply.c          |    2 +-
 builtin-pack-objects.c   |    2 +-
 builtin-unpack-objects.c |    2 +-
 cache.h                  |    1 +
 http-push.c              |    4 ++--
 http-walker.c            |    4 ++--
 index-pack.c             |    4 ++--
 sha1_file.c              |    8 ++++----
 wrapper.c                |   20 ++++++++++++++++++++
 9 files changed, 34 insertions(+), 13 deletions(-)

diff --git a/builtin-apply.c b/builtin-apply.c
index af25ee9..cb2663e 100644
--- a/builtin-apply.c
+++ b/builtin-apply.c
@@ -1258,7 +1258,7 @@ static char *inflate_it(const void *data, unsigned long size,
 	stream.avail_in = size;
 	stream.next_out = out = xmalloc(inflated_size);
 	stream.avail_out = inflated_size;
-	inflateInit(&stream);
+	xinflateInit(&stream);
 	st = inflate(&stream, Z_FINISH);
 	if ((st != Z_STREAM_END) || stream.total_out != inflated_size) {
 		free(out);
diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index e851534..09576c6 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -195,7 +195,7 @@ static int check_pack_inflate(struct packed_git *p,
 	int st;
 
 	memset(&stream, 0, sizeof(stream));
-	inflateInit(&stream);
+	xinflateInit(&stream);
 	do {
 		in = use_pack(p, w_curs, offset, &stream.avail_in);
 		stream.next_in = in;
diff --git a/builtin-unpack-objects.c b/builtin-unpack-objects.c
index 47ed610..cb9edac 100644
--- a/builtin-unpack-objects.c
+++ b/builtin-unpack-objects.c
@@ -99,7 +99,7 @@ static void *get_data(unsigned long size)
 	stream.avail_out = size;
 	stream.next_in = fill(1);
 	stream.avail_in = len;
-	inflateInit(&stream);
+	xinflateInit(&stream);
 
 	for (;;) {
 		int ret = inflate(&stream, 0);
diff --git a/cache.h b/cache.h
index 231c06d..7d5c38d 100644
--- a/cache.h
+++ b/cache.h
@@ -17,6 +17,7 @@
 #if defined(NO_DEFLATE_BOUND) || ZLIB_VERNUM < 0x1200
 #define deflateBound(c,s)  ((s) + (((s) + 7) >> 3) + (((s) + 63) >> 6) + 11)
 #endif
+extern void xinflateInit(z_stream *stream);
 
 #if defined(DT_UNKNOWN) && !defined(NO_D_TYPE_IN_DIRENT)
 #define DTYPE(de)	((de)->d_type)
diff --git a/http-push.c b/http-push.c
index a4b7d08..906ca48 100644
--- a/http-push.c
+++ b/http-push.c
@@ -269,7 +269,7 @@ static void start_fetch_loose(struct transfer_request *request)
 
 	memset(&request->stream, 0, sizeof(request->stream));
 
-	inflateInit(&request->stream);
+	xinflateInit(&request->stream);
 
 	git_SHA1_Init(&request->c);
 
@@ -310,7 +310,7 @@ static void start_fetch_loose(struct transfer_request *request)
 	   file; also rewind to the beginning of the local file. */
 	if (prev_read == -1) {
 		memset(&request->stream, 0, sizeof(request->stream));
-		inflateInit(&request->stream);
+		xinflateInit(&request->stream);
 		git_SHA1_Init(&request->c);
 		if (prev_posn>0) {
 			prev_posn = 0;
diff --git a/http-walker.c b/http-walker.c
index 7271c7d..6aa8486 100644
--- a/http-walker.c
+++ b/http-walker.c
@@ -142,7 +142,7 @@ static void start_object_request(struct walker *walker,
 
 	memset(&obj_req->stream, 0, sizeof(obj_req->stream));
 
-	inflateInit(&obj_req->stream);
+	xinflateInit(&obj_req->stream);
 
 	git_SHA1_Init(&obj_req->c);
 
@@ -183,7 +183,7 @@ static void start_object_request(struct walker *walker,
 	   file; also rewind to the beginning of the local file. */
 	if (prev_read == -1) {
 		memset(&obj_req->stream, 0, sizeof(obj_req->stream));
-		inflateInit(&obj_req->stream);
+		xinflateInit(&obj_req->stream);
 		git_SHA1_Init(&obj_req->c);
 		if (prev_posn>0) {
 			prev_posn = 0;
diff --git a/index-pack.c b/index-pack.c
index 2931511..c6bfc12 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -275,7 +275,7 @@ static void *unpack_entry_data(unsigned long offset, unsigned long size)
 	stream.avail_out = size;
 	stream.next_in = fill(1);
 	stream.avail_in = input_len;
-	inflateInit(&stream);
+	xinflateInit(&stream);
 
 	for (;;) {
 		int ret = inflate(&stream, 0);
@@ -382,7 +382,7 @@ static void *get_data_from_pack(struct object_entry *obj)
 	stream.avail_out = obj->size;
 	stream.next_in = src;
 	stream.avail_in = len;
-	inflateInit(&stream);
+	xinflateInit(&stream);
 	while ((st = inflate(&stream, Z_FINISH)) == Z_OK);
 	inflateEnd(&stream);
 	if (st != Z_STREAM_END || stream.total_out != obj->size)
diff --git a/sha1_file.c b/sha1_file.c
index 52d1ead..9aabae2 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1196,7 +1196,7 @@ static int unpack_sha1_header(z_stream *stream, unsigned char *map, unsigned lon
 	stream->avail_out = bufsiz;
 
 	if (legacy_loose_object(map)) {
-		inflateInit(stream);
+		xinflateInit(stream);
 		return inflate(stream, 0);
 	}
 
@@ -1217,7 +1217,7 @@ static int unpack_sha1_header(z_stream *stream, unsigned char *map, unsigned lon
 	/* Set up the stream for the rest.. */
 	stream->next_in = map;
 	stream->avail_in = mapsize;
-	inflateInit(stream);
+	xinflateInit(stream);
 
 	/* And generate the fake traditional header */
 	stream->total_out = 1 + snprintf(buffer, bufsiz, "%s %lu",
@@ -1348,7 +1348,7 @@ unsigned long get_size_from_delta(struct packed_git *p,
 	stream.next_out = delta_head;
 	stream.avail_out = sizeof(delta_head);
 
-	inflateInit(&stream);
+	xinflateInit(&stream);
 	do {
 		in = use_pack(p, w_curs, curpos, &stream.avail_in);
 		stream.next_in = in;
@@ -1585,7 +1585,7 @@ static void *unpack_compressed_entry(struct packed_git *p,
 	stream.next_out = buffer;
 	stream.avail_out = size;
 
-	inflateInit(&stream);
+	xinflateInit(&stream);
 	do {
 		in = use_pack(p, w_curs, curpos, &stream.avail_in);
 		stream.next_in = in;
diff --git a/wrapper.c b/wrapper.c
index 93562f0..f255eef 100644
--- a/wrapper.c
+++ b/wrapper.c
@@ -196,3 +196,23 @@ int xmkstemp(char *template)
 		die("Unable to create temporary file: %s", strerror(errno));
 	return fd;
 }
+
+void xinflateInit(z_stream *stream)
+{
+	switch (inflateInit(stream)) {
+	case Z_OK:
+		return;
+
+	case Z_MEM_ERROR:
+		release_pack_memory(128 * 1024, -1);
+		if (inflateInit(stream) == Z_OK)
+			return;
+		die("Out of memory? inflateInit failed");
+
+	case Z_VERSION_ERROR:
+		die("Wrong zlib version? inflateInit failed");
+
+	default:
+		die("Unknown inflateInit failure");
+	}
+}
-- 
1.6.1.141.gfe98e

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH] Wrap inflateInit to retry allocation after releasing pack memory
  2009-01-08  3:16                           ` [PATCH] Wrap inflateInit to retry allocation after releasing pack memory Shawn O. Pearce
@ 2009-01-08  3:54                             ` Linus Torvalds
  2009-01-08  5:23                               ` Junio C Hamano
                                                 ` (2 more replies)
  0 siblings, 3 replies; 49+ messages in thread
From: Linus Torvalds @ 2009-01-08  3:54 UTC (permalink / raw)
  To: Shawn O. Pearce
  Cc: Junio C Hamano, R. Tyler Ballance, Nicolas Pitre, Jan Krüger,
	Git ML, kb



On Wed, 7 Jan 2009, Shawn O. Pearce wrote:
>
> If we are running low on virtual memory we should release pack
> windows if zlib's inflateInit fails due to an out of memory error.
> It may be that we are running under a low ulimit and are getting
> tight on address space.  Shedding unused windows may get us
> sufficient working space to continue.

Let's do this (more complete) wrapping instead, ok?

This one _just_ wraps things, btw - it doesn't do the "retry on low memory 
error" part, at least not yet. I think that's an independent issue from 
the reporting.

Hmm? 

Tyler - does this make the corruption errors go away, and be replaced by 
hard failures with "out of memory" reporting?

This patch is potentially pretty noisy, on purpose. I didn't remove the 
reporting from places that already do so - some of them have stricter 
errors than this.

For example: Z_BUF_ERROR can be valid depending on circumstance, so the 
wrapper doesn't complain about it, but the caller may not accept it. 

		Linus

---
 builtin-apply.c          |    5 ++-
 builtin-pack-objects.c   |    6 ++--
 builtin-unpack-objects.c |    6 ++--
 cache.h                  |    4 +++
 http-push.c              |    8 +++---
 http-walker.c            |    8 +++---
 index-pack.c             |   12 ++++----
 sha1_file.c              |   24 +++++++++---------
 wrapper.c                |   60 ++++++++++++++++++++++++++++++++++++++++++++++
 9 files changed, 99 insertions(+), 34 deletions(-)

diff --git a/builtin-apply.c b/builtin-apply.c
index 07244b0..ed02b6d 100644
--- a/builtin-apply.c
+++ b/builtin-apply.c
@@ -1253,8 +1253,9 @@ static char *inflate_it(const void *data, unsigned long size,
 	stream.avail_in = size;
 	stream.next_out = out = xmalloc(inflated_size);
 	stream.avail_out = inflated_size;
-	inflateInit(&stream);
-	st = inflate(&stream, Z_FINISH);
+	git_inflate_init(&stream);
+	st = git_inflate(&stream, Z_FINISH);
+	git_inflate_end(&stream);
 	if ((st != Z_STREAM_END) || stream.total_out != inflated_size) {
 		free(out);
 		return NULL;
diff --git a/builtin-pack-objects.c b/builtin-pack-objects.c
index e851534..cb51916 100644
--- a/builtin-pack-objects.c
+++ b/builtin-pack-objects.c
@@ -195,16 +195,16 @@ static int check_pack_inflate(struct packed_git *p,
 	int st;
 
 	memset(&stream, 0, sizeof(stream));
-	inflateInit(&stream);
+	git_inflate_init(&stream);
 	do {
 		in = use_pack(p, w_curs, offset, &stream.avail_in);
 		stream.next_in = in;
 		stream.next_out = fakebuf;
 		stream.avail_out = sizeof(fakebuf);
-		st = inflate(&stream, Z_FINISH);
+		st = git_inflate(&stream, Z_FINISH);
 		offset += stream.next_in - in;
 	} while (st == Z_OK || st == Z_BUF_ERROR);
-	inflateEnd(&stream);
+	git_inflate_end(&stream);
 	return (st == Z_STREAM_END &&
 		stream.total_out == expect &&
 		stream.total_in == len) ? 0 : -1;
diff --git a/builtin-unpack-objects.c b/builtin-unpack-objects.c
index 47ed610..9a77323 100644
--- a/builtin-unpack-objects.c
+++ b/builtin-unpack-objects.c
@@ -99,10 +99,10 @@ static void *get_data(unsigned long size)
 	stream.avail_out = size;
 	stream.next_in = fill(1);
 	stream.avail_in = len;
-	inflateInit(&stream);
+	git_inflate_init(&stream);
 
 	for (;;) {
-		int ret = inflate(&stream, 0);
+		int ret = git_inflate(&stream, 0);
 		use(len - stream.avail_in);
 		if (stream.total_out == size && ret == Z_STREAM_END)
 			break;
@@ -118,7 +118,7 @@ static void *get_data(unsigned long size)
 		stream.next_in = fill(1);
 		stream.avail_in = len;
 	}
-	inflateEnd(&stream);
+	git_inflate_end(&stream);
 	return buf;
 }
 
diff --git a/cache.h b/cache.h
index 231c06d..49e54fb 100644
--- a/cache.h
+++ b/cache.h
@@ -18,6 +18,10 @@
 #define deflateBound(c,s)  ((s) + (((s) + 7) >> 3) + (((s) + 63) >> 6) + 11)
 #endif
 
+void git_inflate_init(z_streamp strm);
+void git_inflate_end(z_streamp strm);
+int git_inflate(z_streamp strm, int flush);
+
 #if defined(DT_UNKNOWN) && !defined(NO_D_TYPE_IN_DIRENT)
 #define DTYPE(de)	((de)->d_type)
 #else
diff --git a/http-push.c b/http-push.c
index 7c64609..809002b 100644
--- a/http-push.c
+++ b/http-push.c
@@ -208,7 +208,7 @@ static size_t fwrite_sha1_file(void *ptr, size_t eltsize, size_t nmemb,
 	do {
 		request->stream.next_out = expn;
 		request->stream.avail_out = sizeof(expn);
-		request->zret = inflate(&request->stream, Z_SYNC_FLUSH);
+		request->zret = git_inflate(&request->stream, Z_SYNC_FLUSH);
 		git_SHA1_Update(&request->c, expn,
 			    sizeof(expn) - request->stream.avail_out);
 	} while (request->stream.avail_in && request->zret == Z_OK);
@@ -268,7 +268,7 @@ static void start_fetch_loose(struct transfer_request *request)
 
 	memset(&request->stream, 0, sizeof(request->stream));
 
-	inflateInit(&request->stream);
+	git_inflate_init(&request->stream);
 
 	git_SHA1_Init(&request->c);
 
@@ -309,7 +309,7 @@ static void start_fetch_loose(struct transfer_request *request)
 	   file; also rewind to the beginning of the local file. */
 	if (prev_read == -1) {
 		memset(&request->stream, 0, sizeof(request->stream));
-		inflateInit(&request->stream);
+		git_inflate_init(&request->stream);
 		git_SHA1_Init(&request->c);
 		if (prev_posn>0) {
 			prev_posn = 0;
@@ -741,7 +741,7 @@ static void finish_request(struct transfer_request *request)
 			if (request->http_code == 416)
 				fprintf(stderr, "Warning: requested range invalid; we may already have all the data.\n");
 
-			inflateEnd(&request->stream);
+			git_inflate_end(&request->stream);
 			git_SHA1_Final(request->real_sha1, &request->c);
 			if (request->zret != Z_STREAM_END) {
 				unlink(request->tmpfile);
diff --git a/http-walker.c b/http-walker.c
index 7271c7d..0dbad3c 100644
--- a/http-walker.c
+++ b/http-walker.c
@@ -82,7 +82,7 @@ static size_t fwrite_sha1_file(void *ptr, size_t eltsize, size_t nmemb,
 	do {
 		obj_req->stream.next_out = expn;
 		obj_req->stream.avail_out = sizeof(expn);
-		obj_req->zret = inflate(&obj_req->stream, Z_SYNC_FLUSH);
+		obj_req->zret = git_inflate(&obj_req->stream, Z_SYNC_FLUSH);
 		git_SHA1_Update(&obj_req->c, expn,
 			    sizeof(expn) - obj_req->stream.avail_out);
 	} while (obj_req->stream.avail_in && obj_req->zret == Z_OK);
@@ -142,7 +142,7 @@ static void start_object_request(struct walker *walker,
 
 	memset(&obj_req->stream, 0, sizeof(obj_req->stream));
 
-	inflateInit(&obj_req->stream);
+	git_inflate_init(&obj_req->stream);
 
 	git_SHA1_Init(&obj_req->c);
 
@@ -183,7 +183,7 @@ static void start_object_request(struct walker *walker,
 	   file; also rewind to the beginning of the local file. */
 	if (prev_read == -1) {
 		memset(&obj_req->stream, 0, sizeof(obj_req->stream));
-		inflateInit(&obj_req->stream);
+		git_inflate_init(&obj_req->stream);
 		git_SHA1_Init(&obj_req->c);
 		if (prev_posn>0) {
 			prev_posn = 0;
@@ -243,7 +243,7 @@ static void finish_object_request(struct object_request *obj_req)
 		return;
 	}
 
-	inflateEnd(&obj_req->stream);
+	git_inflate_end(&obj_req->stream);
 	git_SHA1_Final(obj_req->real_sha1, &obj_req->c);
 	if (obj_req->zret != Z_STREAM_END) {
 		unlink(obj_req->tmpfile);
diff --git a/index-pack.c b/index-pack.c
index 60ed41a..c0a3d97 100644
--- a/index-pack.c
+++ b/index-pack.c
@@ -275,10 +275,10 @@ static void *unpack_entry_data(unsigned long offset, unsigned long size)
 	stream.avail_out = size;
 	stream.next_in = fill(1);
 	stream.avail_in = input_len;
-	inflateInit(&stream);
+	git_inflate_init(&stream);
 
 	for (;;) {
-		int ret = inflate(&stream, 0);
+		int ret = git_inflate(&stream, 0);
 		use(input_len - stream.avail_in);
 		if (stream.total_out == size && ret == Z_STREAM_END)
 			break;
@@ -287,7 +287,7 @@ static void *unpack_entry_data(unsigned long offset, unsigned long size)
 		stream.next_in = fill(1);
 		stream.avail_in = input_len;
 	}
-	inflateEnd(&stream);
+	git_inflate_end(&stream);
 	return buf;
 }
 
@@ -382,9 +382,9 @@ static void *get_data_from_pack(struct object_entry *obj)
 	stream.avail_out = obj->size;
 	stream.next_in = src;
 	stream.avail_in = len;
-	inflateInit(&stream);
-	while ((st = inflate(&stream, Z_FINISH)) == Z_OK);
-	inflateEnd(&stream);
+	git_inflate_init(&stream);
+	while ((st = git_inflate(&stream, Z_FINISH)) == Z_OK);
+	git_inflate_end(&stream);
 	if (st != Z_STREAM_END || stream.total_out != obj->size)
 		die("serious inflate inconsistency");
 	free(src);
diff --git a/sha1_file.c b/sha1_file.c
index 52d1ead..8600b04 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -1196,8 +1196,8 @@ static int unpack_sha1_header(z_stream *stream, unsigned char *map, unsigned lon
 	stream->avail_out = bufsiz;
 
 	if (legacy_loose_object(map)) {
-		inflateInit(stream);
-		return inflate(stream, 0);
+		git_inflate_init(stream);
+		return git_inflate(stream, 0);
 	}
 
 
@@ -1217,7 +1217,7 @@ static int unpack_sha1_header(z_stream *stream, unsigned char *map, unsigned lon
 	/* Set up the stream for the rest.. */
 	stream->next_in = map;
 	stream->avail_in = mapsize;
-	inflateInit(stream);
+	git_inflate_init(stream);
 
 	/* And generate the fake traditional header */
 	stream->total_out = 1 + snprintf(buffer, bufsiz, "%s %lu",
@@ -1254,11 +1254,11 @@ static void *unpack_sha1_rest(z_stream *stream, void *buffer, unsigned long size
 		stream->next_out = buf + bytes;
 		stream->avail_out = size - bytes;
 		while (status == Z_OK)
-			status = inflate(stream, Z_FINISH);
+			status = git_inflate(stream, Z_FINISH);
 	}
 	buf[size] = 0;
 	if (status == Z_STREAM_END && !stream->avail_in) {
-		inflateEnd(stream);
+		git_inflate_end(stream);
 		return buf;
 	}
 
@@ -1348,15 +1348,15 @@ unsigned long get_size_from_delta(struct packed_git *p,
 	stream.next_out = delta_head;
 	stream.avail_out = sizeof(delta_head);
 
-	inflateInit(&stream);
+	git_inflate_init(&stream);
 	do {
 		in = use_pack(p, w_curs, curpos, &stream.avail_in);
 		stream.next_in = in;
-		st = inflate(&stream, Z_FINISH);
+		st = git_inflate(&stream, Z_FINISH);
 		curpos += stream.next_in - in;
 	} while ((st == Z_OK || st == Z_BUF_ERROR) &&
 		 stream.total_out < sizeof(delta_head));
-	inflateEnd(&stream);
+	git_inflate_end(&stream);
 	if ((st != Z_STREAM_END) && stream.total_out != sizeof(delta_head)) {
 		error("delta data unpack-initial failed");
 		return 0;
@@ -1585,14 +1585,14 @@ static void *unpack_compressed_entry(struct packed_git *p,
 	stream.next_out = buffer;
 	stream.avail_out = size;
 
-	inflateInit(&stream);
+	git_inflate_init(&stream);
 	do {
 		in = use_pack(p, w_curs, curpos, &stream.avail_in);
 		stream.next_in = in;
-		st = inflate(&stream, Z_FINISH);
+		st = git_inflate(&stream, Z_FINISH);
 		curpos += stream.next_in - in;
 	} while (st == Z_OK || st == Z_BUF_ERROR);
-	inflateEnd(&stream);
+	git_inflate_end(&stream);
 	if ((st != Z_STREAM_END) || stream.total_out != size) {
 		free(buffer);
 		return NULL;
@@ -2017,7 +2017,7 @@ static int sha1_loose_object_info(const unsigned char *sha1, unsigned long *size
 		status = error("unable to parse %s header", sha1_to_hex(sha1));
 	else if (sizep)
 		*sizep = size;
-	inflateEnd(&stream);
+	git_inflate_end(&stream);
 	munmap(map, mapsize);
 	return status;
 }
diff --git a/wrapper.c b/wrapper.c
index 93562f0..29afa96 100644
--- a/wrapper.c
+++ b/wrapper.c
@@ -196,3 +196,63 @@ int xmkstemp(char *template)
 		die("Unable to create temporary file: %s", strerror(errno));
 	return fd;
 }
+
+/*
+ * zlib wrappers to make sure we don't silently miss errors
+ * at init time.
+ */
+void git_inflate_init(z_streamp strm)
+{
+	const char *err;
+
+	switch (inflateInit(strm)) {
+	case Z_OK:
+		return;
+
+	case Z_MEM_ERROR:
+		err = "out of memory";
+		break;
+	case Z_VERSION_ERROR:
+		err = "wrong version";
+		break;
+	default:
+		err = "error";
+	}
+	die("inflateInit: %s (%s)", err, strm->msg ? strm->msg : "no message");
+}
+
+void git_inflate_end(z_streamp strm)
+{
+	if (inflateEnd(strm) != Z_OK)
+		error("inflateEnd: %s", strm->msg ? strm->msg : "failed");
+}
+
+int git_inflate(z_streamp strm, int flush)
+{
+	int ret = inflate(strm, flush);
+	const char *err;
+
+	switch (ret) {
+	/* Out of memory is fatal. */
+	case Z_MEM_ERROR:
+		die("inflate: out of memory");
+
+	/* Data corruption errors: we may want to recover from them (fsck) */
+	case Z_NEED_DICT:
+		err = "needs dictionary"; break;
+	case Z_DATA_ERROR:
+		err = "data stream error"; break;
+	case Z_STREAM_ERROR:
+		err = "stream consistency error"; break;
+	default:
+		err = "unknown error"; break;
+
+	/* Z_BUF_ERROR: normal, needs a buffer output buffer */
+	case Z_BUF_ERROR:
+	case Z_OK:
+	case Z_STREAM_END:
+		return ret;
+	}
+	error("inflate: %s (%s)", err, strm->msg ? strm->msg : "no message");
+	return ret;
+}

^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH] Wrap inflateInit to retry allocation after releasing pack memory
  2009-01-08  3:54                             ` Linus Torvalds
@ 2009-01-08  5:23                               ` Junio C Hamano
  2009-01-08 15:35                                 ` Linus Torvalds
  2009-01-08 15:34                               ` Shawn O. Pearce
  2009-01-08 18:15                               ` R. Tyler Ballance
  2 siblings, 1 reply; 49+ messages in thread
From: Junio C Hamano @ 2009-01-08  5:23 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Shawn O. Pearce, R. Tyler Ballance, Nicolas Pitre,
	Jan Krüger, Git ML, kb

Linus Torvalds <torvalds@linux-foundation.org> writes:

> +int git_inflate(z_streamp strm, int flush)
> +{
> +...
> +	/* Z_BUF_ERROR: normal, needs a buffer output buffer */
> +	case Z_BUF_ERROR:

Thanks, but "needs a buffer output buffer" made me scratch my head
somewhat.

  ... Z_BUF_ERROR if no progress is possible or if there was not enough
  room in the output buffer when Z_FINISH is used. Note that Z_BUF_ERROR
  is not fatal, and inflate() can be called again with more input and more
  output space to continue decompressing.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] Wrap inflateInit to retry allocation after releasing pack memory
  2009-01-08  5:23                               ` Junio C Hamano
@ 2009-01-08 15:35                                 ` Linus Torvalds
  0 siblings, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2009-01-08 15:35 UTC (permalink / raw)
  To: Junio C Hamano
  Cc: Shawn O. Pearce, R. Tyler Ballance, Nicolas Pitre,
	Jan Krüger, Git ML, kb

On Wed, 7 Jan 2009, Junio C Hamano wrote:
> 
> Thanks, but "needs a buffer output buffer" made me scratch my head
> somewhat.

That was just me editing it.

It was originally "needs an output buffer" and then I was supposed to edit 
it to "needs a buffer" (because it can be either output or input, and in 
the case of git it's usually actually the input that was partial).

And then I messed up, and it became that.

		Linus

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] Wrap inflateInit to retry allocation after releasing pack memory
  2009-01-08  3:54                             ` Linus Torvalds
  2009-01-08  5:23                               ` Junio C Hamano
@ 2009-01-08 15:34                               ` Shawn O. Pearce
  2009-01-08 16:14                                 ` Linus Torvalds
  2009-01-08 18:15                               ` R. Tyler Ballance
  2 siblings, 1 reply; 49+ messages in thread
From: Shawn O. Pearce @ 2009-01-08 15:34 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Junio C Hamano, R. Tyler Ballance, Nicolas Pitre, Jan Krüger,
	Git ML, kb

Linus Torvalds <torvalds@linux-foundation.org> wrote:
> Let's do this (more complete) wrapping instead, ok?

Ack.

> This one _just_ wraps things, btw - it doesn't do the "retry on low memory 
> error" part, at least not yet. I think that's an independent issue from 
> the reporting.

I still think we should try to reduce pack memory usage when we get
oom from zlib and retry the current operation once.  We do it almost
everywhere else and it works relatively well.

We may also want to consider dropping (e.g. halving) the window
size and/or limit when we run out of memory.  We'll run slower but
if the OS has denied us further resources it may be a ulimit thing
on a shared system, and we should try harder to work with what we
have available to us.

-- 
Shawn.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] Wrap inflateInit to retry allocation after releasing pack memory
  2009-01-08 15:34                               ` Shawn O. Pearce
@ 2009-01-08 16:14                                 ` Linus Torvalds
  0 siblings, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2009-01-08 16:14 UTC (permalink / raw)
  To: Shawn O. Pearce
  Cc: Junio C Hamano, R. Tyler Ballance, Nicolas Pitre, Jan Krüger,
	Git ML, kb



On Thu, 8 Jan 2009, Shawn O. Pearce wrote:
> 
> I still think we should try to reduce pack memory usage when we get
> oom from zlib and retry the current operation once.  We do it almost
> everywhere else and it works relatively well.

Oh, I agree.

It's just that I wanted to verify that people who see this problem 
actually see the message, and that we confirm that it is due to this and 
nothing else.

			Linus

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] Wrap inflateInit to retry allocation after releasing pack memory
  2009-01-08  3:54                             ` Linus Torvalds
  2009-01-08  5:23                               ` Junio C Hamano
  2009-01-08 15:34                               ` Shawn O. Pearce
@ 2009-01-08 18:15                               ` R. Tyler Ballance
  2009-01-08 20:22                                 ` Linus Torvalds
  2 siblings, 1 reply; 49+ messages in thread
From: R. Tyler Ballance @ 2009-01-08 18:15 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Shawn O. Pearce, Junio C Hamano, Nicolas Pitre, Jan Krüger,
	Git ML, kb

[-- Attachment #1: Type: text/plain, Size: 2597 bytes --]

On Wed, 2009-01-07 at 19:54 -0800, Linus Torvalds wrote:
> 
> On Wed, 7 Jan 2009, Shawn O. Pearce wrote:
> >
> > If we are running low on virtual memory we should release pack
> > windows if zlib's inflateInit fails due to an out of memory error.
> > It may be that we are running under a low ulimit and are getting
> > tight on address space.  Shedding unused windows may get us
> > sufficient working space to continue.
> 
> Let's do this (more complete) wrapping instead, ok?
> 
> This one _just_ wraps things, btw - it doesn't do the "retry on low memory 
> error" part, at least not yet. I think that's an independent issue from 
> the reporting.
> 
> Hmm? 
> 
> Tyler - does this make the corruption errors go away, and be replaced by 
> hard failures with "out of memory" reporting?

Yeah, looks like it:


        tyler@grapefruit:~/source/git/linux-2.6> export
        START=56d18e9932ebf4e8eca42d2ce509450e6c9c1666
        tyler@grapefruit:~/source/git/linux-2.6> git reset --hard
        HEAD is now at 9e42d0c Merge
        git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-2.6
        tyler@grapefruit:~/source/git/linux-2.6> git reset --hard $START
        HEAD is now at 56d18e9 Merge branch 'upstream' of
        git://ftp.linux-mips.org/pub/scm/upstream-linus
        tyler@grapefruit:~/source/git/linux-2.6> ulimit -v `echo "350 *
        1024" | bc -l`
        tyler@grapefruit:~/source/git/linux-2.6> limit
        cputime         unlimited
        filesize        unlimited
        datasize        unlimited
        stacksize       8MB
        coredumpsize    0kB
        memoryuse       2561MB
        maxproc         24564
        descriptors     1024
        memorylocked    64kB
        addressspace    350MB
        maxfilelocks    unlimited
        sigpending      24564
        msgqueue        819200
        nice            0
        rt_priority     0
        tyler@grapefruit:~/source/git/linux-2.6> git pull
        Updating 56d18e9..9e42d0c
        fatal: Out of memory? inflateInit failed
        tyler@grapefruit:~/source/git/linux-2.6> which git
        /home/tyler/bin/git
        tyler@grapefruit:~/source/git/linux-2.6> 



> 
> This patch is potentially pretty noisy, on purpose. I didn't remove the 
> reporting from places that already do so - some of them have stricter 
> errors than this.

I'm assuming this patch is going to be reworked, if so, I'll back it out
of our internal 1.6.1 build and anxiously await The Real Deal(tm)


Cheers
-- 
-R. Tyler Ballance
Slide, Inc.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] Wrap inflateInit to retry allocation after releasing pack memory
  2009-01-08 18:15                               ` R. Tyler Ballance
@ 2009-01-08 20:22                                 ` Linus Torvalds
  2009-01-08 20:37                                   ` R. Tyler Ballance
  2009-01-09  1:43                                   ` Junio C Hamano
  0 siblings, 2 replies; 49+ messages in thread
From: Linus Torvalds @ 2009-01-08 20:22 UTC (permalink / raw)
  To: R. Tyler Ballance
  Cc: Shawn O. Pearce, Junio C Hamano, Nicolas Pitre, Jan Krüger,
	Git ML, kb

On Thu, 8 Jan 2009, R. Tyler Ballance wrote:
> > 
> > Tyler - does this make the corruption errors go away, and be replaced by 
> > hard failures with "out of memory" reporting?
> 
> Yeah, looks like it:

Well, I was hoping that you'd have a confirmation from your own huge repo, 
but I do suspect it's all the same thing, so I guess this counts as 
confirmation too.

> > This patch is potentially pretty noisy, on purpose. I didn't remove the 
> > reporting from places that already do so - some of them have stricter 
> > errors than this.
> 
> I'm assuming this patch is going to be reworked, if so, I'll back it out
> of our internal 1.6.1 build and anxiously await The Real Deal(tm)

Oh, it shouldn't be any noisier under _normal_ load - it's more that 
certain real corruption cases will now report the error twice. That said, 
the new errors should actually be more informative than the old ones, so 
even that isn't necessarily all bad.

Junio - I think we should apply this, and likely to the stable branch too. 
Add the re-trying the inflateInit() after shrinking pack windows on top of 
it.

			Linus

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] Wrap inflateInit to retry allocation after releasing pack memory
  2009-01-08 20:22                                 ` Linus Torvalds
@ 2009-01-08 20:37                                   ` R. Tyler Ballance
  2009-01-09  1:43                                   ` Junio C Hamano
  1 sibling, 0 replies; 49+ messages in thread
From: R. Tyler Ballance @ 2009-01-08 20:37 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Shawn O. Pearce, Junio C Hamano, Nicolas Pitre, Jan Krüger,
	Git ML, kb

[-- Attachment #1: Type: text/plain, Size: 2325 bytes --]

On Thu, 2009-01-08 at 12:22 -0800, Linus Torvalds wrote:
> 
> On Thu, 8 Jan 2009, R. Tyler Ballance wrote:
> > > 
> > > Tyler - does this make the corruption errors go away, and be replaced by 
> > > hard failures with "out of memory" reporting?
> > 
> > Yeah, looks like it:
> 
> Well, I was hoping that you'd have a confirmation from your own huge repo, 
> but I do suspect it's all the same thing, so I guess this counts as 
> confirmation too.

I never got a real solid "consistent" reproduction case with our
repository, just a lot of users that experienced the issue. I think the
Linux repro case is a far better example, and yeah, it's sorta
confirmation (waiting for operations here to deploy the patched 1.6.1 to
dev machines).

> 
> > > This patch is potentially pretty noisy, on purpose. I didn't remove the 
> > > reporting from places that already do so - some of them have stricter 
> > > errors than this.
> > 
> > I'm assuming this patch is going to be reworked, if so, I'll back it out
> > of our internal 1.6.1 build and anxiously await The Real Deal(tm)
> 
> Oh, it shouldn't be any noisier under _normal_ load - it's more that 
> certain real corruption cases will now report the error twice. That said, 
> the new errors should actually be more informative than the old ones, so 
> even that isn't necessarily all bad.
> 
> Junio - I think we should apply this, and likely to the stable branch too. 
> Add the re-trying the inflateInit() after shrinking pack windows on top of 
> it.

I really appreciate this guys, this is one of the longer threads I've
participated (spanning over a month) and I'm glad you guys were finally
able to track the issue down.

From now moving forward, I'll try to get a reproduction case with the
kernel tree or something equally big since I know it's frustrating to
play the game of telephone with a proprietary code base ("try this? what
does that do? okay, then this?").

Linus, I'll have a chance to look at your comments on my "variable
packed git window size" patch this weekend, and I'll follow-up in the
appropriate thread.


I'm relatively certain that after this witch hunt, I can get Slide to
cover a round of beers at LinuxWorld or the nearest GitTogether ;)


Cheers
-- 
-R. Tyler Ballance
Slide, Inc.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH] Wrap inflateInit to retry allocation after releasing pack memory
  2009-01-08 20:22                                 ` Linus Torvalds
  2009-01-08 20:37                                   ` R. Tyler Ballance
@ 2009-01-09  1:43                                   ` Junio C Hamano
  1 sibling, 0 replies; 49+ messages in thread
From: Junio C Hamano @ 2009-01-09  1:43 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: R. Tyler Ballance, Shawn O. Pearce, Nicolas Pitre,
	Jan Krüger, Git ML, kb

Linus Torvalds <torvalds@linux-foundation.org> writes:

> Junio - I think we should apply this, and likely to the stable branch too. 

Yeah, I didn't lose the patch.

> Add the re-trying the inflateInit() after shrinking pack windows on top of 
> it.

That too.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-07 23:29               ` Linus Torvalds
  2009-01-08  0:28                 ` Public repro case! " R. Tyler Ballance
@ 2009-01-08  0:37                 ` Linus Torvalds
  2009-01-08  0:49                   ` R. Tyler Ballance
  1 sibling, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2009-01-08  0:37 UTC (permalink / raw)
  To: R. Tyler Ballance; +Cc: Nicolas Pitre, Jan Krüger, Git ML

On Wed, 7 Jan 2009, Linus Torvalds wrote:
>
> >         > limit ~1.5GB -> corrupt file
> >         > limit ~3GB -> magically no longer corrupt.
> 
> That is interesting, although I also worry that there might be other 
> issues going on (ie since you've reported thigns magically fixing 
> themselves, maybe the ulimit tests just _happened_ to show that, even if 
> it wasn't the core reason).
> 
> BUT! This is definitely worth looking at.
> 
> For example, we do have some cases where we try to do "mmap()", and if it 
> fails, we try to free some memory and try again. In particular, in 
> xmmap(), if an mmap() fails - which may be due to running out of virtual 
> address space - we'll actually try to release some pack-file memory and 
> try again. Maybe there's a bug there - and it would be one that seldom 
> triggers for others.

Ho humm. We really do have some interesting things there. 

Is this a 64-bit machine? I didn't think OS X did that, but if there is 
some limited 64-bit support there, maybe "sizeof(void *)" is 8, then we 
default the default git pack-window to a pretty healthy 1GB.

I could easily see that if you have a virtual memory size limit of 1.5GB, 
and the pack window size is 1GB, we might have trouble. Because we could 
only keep one such pack window in memory at a time.

I have _not_ looked at the code, though. I'd have expected a SIGSEGV if we 
really had issues with the window handling.

Anyway, _if_ your system has 64-bit pointers, then _maybe_ something the 
default 1GB pack window causes problem.

If so, then adding a

	[core]
		packedgitwindowsize = 64M

might make a difference. It would certainly be very interesting to hear if 
there's any impact.

		Linus

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-08  0:37                 ` [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file Linus Torvalds
@ 2009-01-08  0:49                   ` R. Tyler Ballance
  2009-01-08  1:01                     ` Linus Torvalds
  0 siblings, 1 reply; 49+ messages in thread
From: R. Tyler Ballance @ 2009-01-08  0:49 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, Jan Krüger, Git ML

[-- Attachment #1: Type: text/plain, Size: 2976 bytes --]

On Wed, 2009-01-07 at 16:37 -0800, Linus Torvalds wrote:
> 
> On Wed, 7 Jan 2009, Linus Torvalds wrote:
> >
> > >         > limit ~1.5GB -> corrupt file
> > >         > limit ~3GB -> magically no longer corrupt.
> > 
> > That is interesting, although I also worry that there might be other 
> > issues going on (ie since you've reported thigns magically fixing 
> > themselves, maybe the ulimit tests just _happened_ to show that, even if 
> > it wasn't the core reason).
> > 
> > BUT! This is definitely worth looking at.
> > 
> > For example, we do have some cases where we try to do "mmap()", and if it 
> > fails, we try to free some memory and try again. In particular, in 
> > xmmap(), if an mmap() fails - which may be due to running out of virtual 
> > address space - we'll actually try to release some pack-file memory and 
> > try again. Maybe there's a bug there - and it would be one that seldom 
> > triggers for others.
> 
> Ho humm. We really do have some interesting things there. 

Always enjoyable when these mail threads get this deep ;)

> 
> Is this a 64-bit machine? I didn't think OS X did that, but if there is 
> some limited 64-bit support there, maybe "sizeof(void *)" is 8, then we 
> default the default git pack-window to a pretty healthy 1GB.

I was only mentioning OS X with regards to the Samba/NFS red herring,
the rest of our operations are on 64-bit Linux machines.

The machine I reproduced this on ("Public repo case!") is the following:
        tyler@grapefruit:~> uname -a
        Linux grapefruit.corp.slide.com 2.6.27.7-9-default #1 SMP
        2008-12-04 18:10:04 +0100 x86_64 x86_64 x86_64 GNU/Linux
        tyler@grapefruit:~> cat /etc/issue
        Welcome to openSUSE 11.1   - Kernel \r (\l).
        
The machines we're experiencing this issue on "in the wild" are:
        xdev3 (master)% uname -a 
        Linux xdev3 2.6.24-22-server #1 SMP Mon Nov 24 20:06:28 UTC 2008
        x86_64 GNU/Linux
        xdev3 (master)% cat /etc/issue
        Ubuntu 8.04.1 \n \l
> 
> I could easily see that if you have a virtual memory size limit of 1.5GB, 
> and the pack window size is 1GB, we might have trouble. Because we could 
> only keep one such pack window in memory at a time.

The DEFAULT_PACKED_GIT_WINDOW_SIZE in our local builds is 256M, FWIW

> 
> I have _not_ looked at the code, though. I'd have expected a SIGSEGV if we 
> really had issues with the window handling.
> 
> Anyway, _if_ your system has 64-bit pointers, then _maybe_ something the 
> default 1GB pack window causes problem.
> 
> If so, then adding a
> 
> 	[core]
> 		packedgitwindowsize = 64M
> 
> might make a difference. It would certainly be very interesting to hear if 
> there's any impact.

I can try this still if you'd like, but it doesn't seem like that'd be
the issue since we're already lowering the window size system-wide



Cheers
-- 
-R. Tyler Ballance
Slide, Inc.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-08  0:49                   ` R. Tyler Ballance
@ 2009-01-08  1:01                     ` Linus Torvalds
  2009-01-08  1:06                       ` R. Tyler Ballance
  0 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2009-01-08  1:01 UTC (permalink / raw)
  To: R. Tyler Ballance; +Cc: Nicolas Pitre, Jan Krüger, Git ML



On Wed, 7 Jan 2009, R. Tyler Ballance wrote:
> 
> I was only mentioning OS X with regards to the Samba/NFS red herring,
> the rest of our operations are on 64-bit Linux machines.

Ahh, ok. Good. 

> > I could easily see that if you have a virtual memory size limit of 1.5GB, 
> > and the pack window size is 1GB, we might have trouble. Because we could 
> > only keep one such pack window in memory at a time.
> 
> The DEFAULT_PACKED_GIT_WINDOW_SIZE in our local builds is 256M, FWIW

Interesting. So you already had to lower it. However, now that you mention 
it, and now that I search for your emails about it on the mailing list (I 
don't normally read the mailing list except very occasionally), I see your 
patch that does

	#define DYNAMIC_WINDOW_SIZE_PERCENTAGE 0.85
	...
	packed_git_window_size = (unsigned int)(as->rlim_cur * DYNAMIC_WINDOW_SIZE_PERCENTAGE);

which is actually very bad.

It's bad for several reasons:

 - 85% of the virtual address space is actually pessimal.

   You need space for AT LEAST two full-sized windows, so you need less 
   than 50%.

 - the way that variable is used, it _has_ to be a multiple of the page 
   size. In fact, it needs to be a multiple of _twice_ the page size. So 
   just doing a random fraction of the rlimit is not correct.

Setting it in the .gitconfig does it right, though.

> > If so, then adding a
> > 
> > 	[core]
> > 		packedgitwindowsize = 64M
> > 
> > might make a difference. It would certainly be very interesting to hear if 
> > there's any impact.
> 
> I can try this still if you'd like, but it doesn't seem like that'd be
> the issue since we're already lowering the window size system-wide

Please do try, at least if your local git changes still match that patch I 
found, because that patch generates problems.

		Linus

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file
  2009-01-08  1:01                     ` Linus Torvalds
@ 2009-01-08  1:06                       ` R. Tyler Ballance
  0 siblings, 0 replies; 49+ messages in thread
From: R. Tyler Ballance @ 2009-01-08  1:06 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Nicolas Pitre, Jan Krüger, Git ML

[-- Attachment #1: Type: text/plain, Size: 2255 bytes --]

On Wed, 2009-01-07 at 17:01 -0800, Linus Torvalds wrote:
> 
> On Wed, 7 Jan 2009, R. Tyler Ballance wrote:
> > 
> > I was only mentioning OS X with regards to the Samba/NFS red herring,
> > the rest of our operations are on 64-bit Linux machines.
> 
> Ahh, ok. Good. 
> 
> > > I could easily see that if you have a virtual memory size limit of 1.5GB, 
> > > and the pack window size is 1GB, we might have trouble. Because we could 
> > > only keep one such pack window in memory at a time.
> > 
> > The DEFAULT_PACKED_GIT_WINDOW_SIZE in our local builds is 256M, FWIW
> 
> Interesting. So you already had to lower it. However, now that you mention 
> it, and now that I search for your emails about it on the mailing list (I 
> don't normally read the mailing list except very occasionally), I see your 
> patch that does
> 
> 	#define DYNAMIC_WINDOW_SIZE_PERCENTAGE 0.85
> 	...
> 	packed_git_window_size = (unsigned int)(as->rlim_cur * DYNAMIC_WINDOW_SIZE_PERCENTAGE);
> 
> which is actually very bad.
> 
> It's bad for several reasons:
> 
>  - 85% of the virtual address space is actually pessimal.
> 
>    You need space for AT LEAST two full-sized windows, so you need less 
>    than 50%.
> 
>  - the way that variable is used, it _has_ to be a multiple of the page 
>    size. In fact, it needs to be a multiple of _twice_ the page size. So 
>    just doing a random fraction of the rlimit is not correct.

This patch never made it into any of our Git builds because my flight
landed and it wasn't stable enough (and as you pointed out, it sucks ;))



> 
> Setting it in the .gitconfig does it right, though.
> 
> > > If so, then adding a
> > > 
> > > 	[core]
> > > 		packedgitwindowsize = 64M
> > > 
> > > might make a difference. It would certainly be very interesting to hear if 
> > > there's any impact.
> > 
> > I can try this still if you'd like, but it doesn't seem like that'd be
> > the issue since we're already lowering the window size system-wide
> 
> Please do try, at least if your local git changes still match that patch I 
> found, because that patch generates problems.

See my prior reply in "Public repo case!" sent at 4:57PST

-- 
-R. Tyler Ballance
Slide, Inc.

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2009-01-09  1:45 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-12-09  8:36 [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file Jan Krüger
2008-12-09  9:02 ` R. Tyler Ballance
2008-12-09 16:24 ` Shawn O. Pearce
2009-01-06 22:52 ` R. Tyler Ballance
2009-01-07  1:25   ` Nicolas Pitre
2009-01-07  1:39     ` R. Tyler Ballance
2009-01-07  2:09       ` Nicolas Pitre
2009-01-07  2:47         ` R. Tyler Ballance
2009-01-07  3:21           ` Nicolas Pitre
2009-01-07  4:54       ` Linus Torvalds
2009-01-07  7:41         ` R. Tyler Ballance
2009-01-07  8:16           ` Junio C Hamano
2009-01-07  8:32             ` R. Tyler Ballance
2009-01-07  9:42               ` Junio C Hamano
2009-01-07  9:05           ` R. Tyler Ballance
2009-01-07 15:31           ` Nicolas Pitre
2009-01-07 16:07           ` Linus Torvalds
2009-01-07 16:08             ` Linus Torvalds
2009-01-07 22:55             ` R. Tyler Ballance
2009-01-07 23:29               ` Linus Torvalds
2009-01-08  0:28                 ` Public repro case! " R. Tyler Ballance
2009-01-08  0:48                   ` Linus Torvalds
2009-01-08  0:57                     ` R. Tyler Ballance
2009-01-08  1:08                       ` Linus Torvalds
2009-01-08  1:29                         ` Linus Torvalds
2009-01-08  1:46                           ` Shawn O. Pearce
2009-01-08  2:21                     ` James Pickens
2009-01-08  2:43                       ` Shawn O. Pearce
2009-01-08  5:40                         ` Junio C Hamano
2009-01-08  6:04                           ` Shawn O. Pearce
2009-01-08  2:52                       ` Boyd Stephen Smith Jr.
2009-01-08  2:52                   ` Linus Torvalds
2009-01-08  3:01                     ` Shawn O. Pearce
2009-01-08  3:06                       ` Linus Torvalds
2009-01-08  3:13                         ` Shawn O. Pearce
2009-01-08  3:16                           ` [PATCH] Wrap inflateInit to retry allocation after releasing pack memory Shawn O. Pearce
2009-01-08  3:54                             ` Linus Torvalds
2009-01-08  5:23                               ` Junio C Hamano
2009-01-08 15:35                                 ` Linus Torvalds
2009-01-08 15:34                               ` Shawn O. Pearce
2009-01-08 16:14                                 ` Linus Torvalds
2009-01-08 18:15                               ` R. Tyler Ballance
2009-01-08 20:22                                 ` Linus Torvalds
2009-01-08 20:37                                   ` R. Tyler Ballance
2009-01-09  1:43                                   ` Junio C Hamano
2009-01-08  0:37                 ` [PATCH/RFC] Allow writing loose objects that are corrupted in a pack file Linus Torvalds
2009-01-08  0:49                   ` R. Tyler Ballance
2009-01-08  1:01                     ` Linus Torvalds
2009-01-08  1:06                       ` R. Tyler Ballance

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).