git.vger.kernel.org archive mirror
* Consolidate SHA1 object file close
@ 2008-06-11  1:47 Linus Torvalds
  2008-06-11  7:42 ` Andreas Ericsson
  2008-06-11  7:43 ` Pierre Habouzit
  0 siblings, 2 replies; 11+ messages in thread
From: Linus Torvalds @ 2008-06-11  1:47 UTC (permalink / raw)
  To: Junio C Hamano, Git Mailing List; +Cc: Denis Bueno


This consolidates the common operations for closing the new temporary file 
that we have written, before we move it into place with the final name. 

There's some common code there (make it read-only and check for errors on 
close), but more importantly, this also gives a single place to add an 
fsync_or_die() call if we want to add a safe mode.

This was triggered by Denis Bueno apparently twice being able to
corrupt his git repository on OS X due to an unlucky combination of kernel 
crashes and a not-very-robust filesystem.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

Junio, this is just the scaffolding, without the actual fsync_or_die call. 

I think it's a worthy place-holder regardless of whether we really want to 
do the fsync (whether conditionally with a config option or not, and 
whether there are more clever options like aio_fsync()).
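
For concreteness, a minimal sketch of what such a helper could look like
(purely illustrative; the exact name, message and error handling here are
assumptions, not part of this patch):

	/* Hypothetical sketch of the safe-mode helper; not in this patch. */
	static void fsync_or_die(int fd, const char *msg)
	{
		if (fsync(fd) < 0)
			die("fsync error on %s: %s", msg, strerror(errno));
	}

With something like that in place, close_sha1_file() could call
fsync_or_die(fd, "sha1 file") right where the comment in the patch says.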

 sha1_file.c |   17 +++++++++++------
 1 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index adcf37c..f311c79 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2105,6 +2105,15 @@ int hash_sha1_file(const void *buf, unsigned long len, const char *type,
 	return 0;
 }
 
+/* Finalize a file on disk, and close it. */
+static void close_sha1_file(int fd)
+{
+	/* For safe-mode, we could fsync_or_die(fd, "sha1 file") here */
+	fchmod(fd, 0444);
+	if (close(fd) != 0)
+		die("unable to write sha1 file");
+}
+
 static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
 			      void *buf, unsigned long len, time_t mtime)
 {
@@ -2170,9 +2179,7 @@ static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
 
 	if (write_buffer(fd, compressed, size) < 0)
 		die("unable to write sha1 file");
-	fchmod(fd, 0444);
-	if (close(fd))
-		die("unable to write sha1 file");
+	close_sha1_file(fd);
 	free(compressed);
 
 	if (mtime) {
@@ -2350,9 +2357,7 @@ int write_sha1_from_fd(const unsigned char *sha1, int fd, char *buffer,
 	} while (1);
 	inflateEnd(&stream);
 
-	fchmod(local, 0444);
-	if (close(local) != 0)
-		die("unable to write sha1 file");
+	close_sha1_file(local);
 	SHA1_Final(real_sha1, &c);
 	if (ret != Z_STREAM_END) {
 		unlink(tmpfile);


* Re: Consolidate SHA1 object file close
  2008-06-11  1:47 Consolidate SHA1 object file close Linus Torvalds
@ 2008-06-11  7:42 ` Andreas Ericsson
  2008-06-11  7:43 ` Pierre Habouzit
  1 sibling, 0 replies; 11+ messages in thread
From: Andreas Ericsson @ 2008-06-11  7:42 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List, Denis Bueno

Linus Torvalds wrote:
> This consolidates the common operations for closing the new temporary file 
> that we have written, before we move it into place with the final name. 
> 
> There's some common code there (make it read-only and check for errors on 
> close), but more importantly, this also gives a single place to add an 
> fsync_or_die() call if we want to add a safe mode.
> 
> This was triggered by Denis Bueno apparently twice being able to
> corrupt his git repository on OS X due to an unlucky combination of kernel 
> crashes and a not-very-robust filesystem.
> 
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> ---
> 
> Junio, this is just the scaffolding, without the actual fsync_or_die call. 
> 
> I think it's a worthy place-holder regardless of whether we really want to 
> do the fsync (whether conditionally with a config option or not, and 
> whether there are more clever options like aio_fsync()).
> 
>  sha1_file.c |   17 +++++++++++------
>  1 files changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/sha1_file.c b/sha1_file.c
> index adcf37c..f311c79 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -2105,6 +2105,15 @@ int hash_sha1_file(const void *buf, unsigned long len, const char *type,
>  	return 0;
>  }
>  
> +/* Finalize a file on disk, and close it. */
> +static void close_sha1_file(int fd)


Why close_sha1_file() when it operates on any old file?
I'd name it crash_safe_close() or perhaps close_and_fsync() or
something instead, as it's got nothing to do with sha1's and
everything to do with plain old files.

Other than that, I'm all for it.

-- 
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231


* Re: Consolidate SHA1 object file close
  2008-06-11  1:47 Consolidate SHA1 object file close Linus Torvalds
  2008-06-11  7:42 ` Andreas Ericsson
@ 2008-06-11  7:43 ` Pierre Habouzit
  2008-06-11 15:17   ` Linus Torvalds
  1 sibling, 1 reply; 11+ messages in thread
From: Pierre Habouzit @ 2008-06-11  7:43 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List, Denis Bueno


On Wed, Jun 11, 2008 at 01:47:18AM +0000, Linus Torvalds wrote:
> 
> This consolidates the common operations for closing the new temporary file 
> that we have written, before we move it into place with the final name. 
> 
> There's some common code there (make it read-only and check for errors on 
> close), but more importantly, this also gives a single place to add an 
> fsync_or_die() call if we want to add a safe mode.
> 
> This was triggered by Denis Bueno apparently twice being able to
> corrupt his git repository on OS X due to an unlucky combination of kernel 
> crashes and a not-very-robust filesystem.

  Could this be the source of a problem we often run into at work? Let me
try to describe it.

  We work with our git repositories (storages, I should say) on NFS
homes, with workdirs on a local directory (NFS homes are backed up daily,
hence everything committed gets backed up, and developers have shorter
compilation times thanks to the local FS). I don't think the workdir use
is relevant, because I use it almost the same way without NFS and don't
have any issues, but I mention it just in case.

  Quite often, when people commit, they end up with corrupt repositories. The
symptom is a `cannot read <sha1>` error message (or several at a time). The
usual way to "fix" it is to run git fsck and then git reset (because after the
fsck the index is totally screwed and all local files are marked new),
and usually everything is fine then.

  This is not really hard corruption, and it's really hard to
reproduce. I don't know why it happens, and I wonder if this patch could
help, or if it's unrelated. I can only offer speculation, as it's so
hard to reproduce and seems to depend on the load of the NFS server :/

-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org



* Re: Consolidate SHA1 object file close
  2008-06-11  7:43 ` Pierre Habouzit
@ 2008-06-11 15:17   ` Linus Torvalds
  2008-06-11 15:40     ` Pierre Habouzit
  0 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2008-06-11 15:17 UTC (permalink / raw)
  To: Pierre Habouzit; +Cc: Junio C Hamano, Git Mailing List, Denis Bueno



On Wed, 11 Jun 2008, Pierre Habouzit wrote:
> 
>   Could this be the source of a problem we often run into at work? Let me
> try to describe it.

The fsync() *should* make no difference unless you actually crash. So my 
initial reaction is no, but non-coherent client-side write caching over 
NFS may actually make a difference.

>   We work with our git repositories (storages, I should say) on NFS
> homes, with workdirs on a local directory (NFS homes are backed up daily,
> hence everything committed gets backed up, and developers have shorter
> compilation times thanks to the local FS).

Ok, so your actual git object directory is on NFS?

>   Quite often, when people commit, they end up with corrupt repositories. The
> symptom is a `cannot read <sha1>` error message (or several at a time). The
> usual way to "fix" it is to run git fsck and then git reset (because after the
> fsck the index is totally screwed and all local files are marked new),
> and usually everything is fine then.

Hmm. Very interesting. That definitely sounds like a cache coherency 
issue (ie the "fsck" probably doesn't really _do_ anything, it just 
delays things and possibly causes memory pressure to throw some stuff out 
of the cache).

What clients, what server?

NFS clients (I assume v2, which is not coherent) _should_ be providing what
is called open-close consistency, which means that while clients can cache
data locally, they should aim to be consistent between two clients over
an open-close pair (ie if two clients have the same file open at the same
time, there are no consistency guarantees, but if you close on one client
and then open on another, the data should be consistent).

If open-close consistency doesn't work, then various parallel
load-distribution setups (clusters with an NFS filesystem doing parallel
makes, etc) don't tend to work all that well either (ie an object file is 
written on one client, and then used for linking on another).

And that is what git does: even without the fsync(), git will "close()" 
the file before it actually does the link + unlink to move it to the new 
position. So it all _should_ be perfectly consistent even in the absence
of explicit syncs.
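
In code terms, the sequence is roughly the following (a simplified sketch
of the write_loose_object() path; the rename() fallback and all error
handling are elided):

	/* The object data is fully written and closed before it becomes
	 * visible under its final name, so a client honoring open-close
	 * consistency should never see a half-written object. */
	if (write_buffer(fd, compressed, size) < 0)
		die("unable to write sha1 file");
	close_sha1_file(fd);		/* fchmod + close (+ any future fsync) */
	if (!link(tmpfile, filename))	/* publish under the final name */
		unlink(tmpfile);	/* drop the temporary name */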

That said, if there is some problem with that whole thing, then yes, the 
fsync() may well hide it. So yes, adding the fsync() is certainly worth 
testing.

>   This is not really hard corruption, and it's really hard to
> reproduce. I don't know why it happens, and I wonder if this patch could
> help, or if it's unrelated. I can only offer speculation, as it's so
> hard to reproduce and seems to depend on the load of the NFS server :/

Yes, that sounds very much like a cache coherency issue. The "corruption" 
goes away when the cache gets flushed and the clients see the real state 
again. But as mentioned, git should already do things in a way that this 
should all work, but hey, that's using certain assumptions that perhaps 
aren't true in your environment.

			Linus


* Re: Consolidate SHA1 object file close
  2008-06-11 15:17   ` Linus Torvalds
@ 2008-06-11 15:40     ` Pierre Habouzit
  2008-06-11 17:25       ` Linus Torvalds
  0 siblings, 1 reply; 11+ messages in thread
From: Pierre Habouzit @ 2008-06-11 15:40 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List, Denis Bueno


On Wed, Jun 11, 2008 at 03:17:04PM +0000, Linus Torvalds wrote:
> 
> 
> On Wed, 11 Jun 2008, Pierre Habouzit wrote:
> > 
> >   Could this be the source of a problem we often run into at work? Let me
> > try to describe it.
> 
> The fsync() *should* make no difference unless you actually crash. So my 
> initial reaction is no, but non-coherent client-side write caching over 
> NFS may actually make a difference.

  That's what I thought as well but … one never knows ;)

> >   We work with our git repositories (storages, I should say) on NFS
> > homes, with workdirs on a local directory (NFS homes are backed up daily,
> > hence everything committed gets backed up, and developers have shorter
> > compilation times thanks to the local FS).
> 
> Ok, so your actual git object directory is on NFS?

  Yes.

> >   Quite often, when people commit, they end up with corrupt repositories. The
> > symptom is a `cannot read <sha1>` error message (or several at a time). The
> > usual way to "fix" it is to run git fsck and then git reset (because after the
> > fsck the index is totally screwed and all local files are marked new),
> > and usually everything is fine then.
> 
> Hmm. Very interesting. That definitely sounds like a cache coherency 
> issue (ie the "fsck" probably doesn't really _do_ anything, it just 
> delays things and possibly causes memory pressure to throw some stuff out 
> of the cache).
> 
> What clients, what server?

  The server uses the NFSv3 kernel server from Debian's 2.6.18 etch (up to
date).  Clients are various Ubuntu/Debian boxes with at least 2.6.18
kernels, some .22, .24 and .25.  It's a really simple setup; no clusters
are involved. The server exports an ext3 over dm-crypt partition, though
I would be surprised if that mattered.

> That said, if there is some problem with that whole thing, then yes, the 
> fsync() may well hide it. So yes, adding the fsync() is certainly worth 
> testing.

Okay, I'll try to get my colleagues to use that, to see if they still have
the issues. I work on a laptop and not on NFS, so I'm not the one having
the issues, only the one having to fix them on others' machines ;P

> >   This is not really hard corruption, and it's really hard to
> > reproduce. I don't know why it happens, and I wonder if this patch could
> > help, or if it's unrelated. I can only offer speculation, as it's so
> > hard to reproduce and seems to depend on the load of the NFS server :/
> 
> Yes, that sounds very much like a cache coherency issue. The "corruption" 
> goes away when the cache gets flushed and the clients see the real state 
> again. But as mentioned, git should already do things in a way that this 
> should all work, but hey, that's using certain assumptions that perhaps 
> aren't true in your environment.

  Well, we've had the issue for quite a long time actually, and given that
it's hard to reproduce, I'm never in a position to give more
useful information :/ We'll see if the fsync() helps or not…

-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org



* Re: Consolidate SHA1 object file close
  2008-06-11 15:40     ` Pierre Habouzit
@ 2008-06-11 17:25       ` Linus Torvalds
  2008-06-11 17:46         ` Linus Torvalds
       [not found]         ` <alpine.LFD.1.10.0806111030580.3101@woody.linux-foundation.org>
  0 siblings, 2 replies; 11+ messages in thread
From: Linus Torvalds @ 2008-06-11 17:25 UTC (permalink / raw)
  To: Pierre Habouzit; +Cc: Junio C Hamano, Git Mailing List, Denis Bueno



On Wed, 11 Jun 2008, Pierre Habouzit wrote:
> > 
> > Hmm. Very interesting. That definitely sounds like a cache coherency 
> > issue (ie the "fsck" probably doesn't really _do_ anything, it just 
> > delays things and possibly causes memory pressure to throw some stuff out 
> > of the cache).
> > 
> > What clients, what server?
> 
>   The server uses the NFSv3 kernel server from Debian's 2.6.18 etch (up to
> date).

Ok, then it's almost impossible for this to be a server-side issue - I could
imagine that if you had some fancy cluster server or something, but in
that kind of straightforward situation the only thing that is going to
matter is the client-side caching.

I stopped using NFS so long ago that the only case I ever worried about 
and knew anything about was the traditional v2 issues. But iirc, v3 does 
nothing much to change the caching rules (v4, on the other hand, does add 
delegations and explicit caching support).

> Clients are various Ubuntu/Debian boxes with at least 2.6.18 kernels, some
> .22, .24 and .25.

I'll ask Trond if he has any comments on this from the NFS client side. We 
_did_ hit some other NFS client bug with git long ago, I forget what it 
was all about (pread/pwrite?).

What is quite odd, though, is that exactly because of how git works, I 
would normally expect each client to not even *try* to look up objects 
that are written by other clients until long long after they have been 
written.

IOW, access to new objects is not something a git client will do just 
because the object suddenly appears in a directory - after the file is 
written and closed, it will not just be moved to the right position, but 
there have to be *other* files modified (ie the refs) to tell other clients
about the changes too!

And that matters because even though there can be things like local 
directory caches (and Linux does support negative caching - ie the caching 
of the fact that a file was *not* there), those caches should be empty 
simply because other clients that didn't create the file shouldn't even 
have tried to look up non-existent objects!

If it's a directory content caching issue, then adding an fsync() won't 
matter. In fact, the fsync() should matter only if the client who wrote 
the object didn't write it out to the server at close() time, and that 
really sounds very unlikely indeed. So my personal guess is that the 
fsync() won't make any difference at all.

Do you have people using special flags for your NFS mounts? And do you 
know if there is some pattern to the client kernel versions when the 
problem happens?

		Linus


* Re: Consolidate SHA1 object file close
  2008-06-11 17:25       ` Linus Torvalds
@ 2008-06-11 17:46         ` Linus Torvalds
       [not found]         ` <alpine.LFD.1.10.0806111030580.3101@woody.linux-foundation.org>
  1 sibling, 0 replies; 11+ messages in thread
From: Linus Torvalds @ 2008-06-11 17:46 UTC (permalink / raw)
  To: Pierre Habouzit; +Cc: Junio C Hamano, Git Mailing List, Denis Bueno



On Wed, 11 Jun 2008, Linus Torvalds wrote:
> 
> Do you have people using special flags for your NFS mounts? And do you 
> know if there is some pattern to the client kernel versions when the 
> problem happens?

Oh, before I even go there - let's get the _really_ obvious case out of 
the way first.

If you are using a shared git object repository (why are you doing that, 
btw?), are people perhaps doing things like "git gc --auto" etc at the 
same time? Perhaps even unknowingly, thanks to autogc?

That's an absolute no-no. It works on a real POSIX filesystem, because 
even if you unlink a file that is in use by another process, the other 
process still has access to the data and the file won't be *really* 
removed until all users have gone away.

That's also true within a _single_ NFS client thanks to so-called 
"silly-renaming", but it is *not* true across multiple clients.

So if one client is doing some kind of gc and creates a new pack-file and 
then removes old loose objects, and another client has already looked up 
and opened that loose object (but not finished reading it), then when the 
file gets removed, you will literally lose the data on the other client 
and get a short read!
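
In code terms, the cross-client race looks roughly like this (a
hypothetical illustration; the object path is made up):

	char buf[4096];
	/* Client A has already looked up the loose object and opens it: */
	int fd = open(".git/objects/ab/cdef0123456789...", O_RDONLY);
	/* ... client B's gc now packs the object and unlink()s the file ... */
	ssize_t n = read(fd, buf, sizeof(buf));
	/* On a local POSIX fs the data stays readable until close(fd);
	 * across NFS clients the read can come up short or fail (ESTALE). */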

And nothing we can do can ever fix this very fundamental issue of NFS. NFS 
simply isn't an even remotely POSIX filesystem, even though it's set up to 
mostly _look_ like one when accessed from a single client.

In general, I would discourage people ever sharing object directories 
among multiple users except in a server kind of environment (eg 
kernel.org). 

			Linus


* Re: Consolidate SHA1 object file close
       [not found]         ` <alpine.LFD.1.10.0806111030580.3101@woody.linux-foundation.org>
@ 2008-06-11 22:25           ` Pierre Habouzit
  2008-06-11 23:03             ` Pierre Habouzit
  2008-06-12 15:33             ` Linus Torvalds
  0 siblings, 2 replies; 11+ messages in thread
From: Pierre Habouzit @ 2008-06-11 22:25 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List, Denis Bueno


On Wed, Jun 11, 2008 at 05:34:53PM +0000, Linus Torvalds wrote:
> 
> 
> > > >   Quite often, when people commit, they end up with corrupt repositories. The
> > > > symptom is a `cannot read <sha1>` error message (or several at a time).
> 
> Btw, do you have exact error messages?

  Well, like I said, I haven't had the problem myself, and it's hard to
reproduce; people hit it about once a week on average, for people who
perform around 50 to 100 commits a week.

> That plain "cannot read <sha1>" sounds unlikely. It exists in archive-tar, 
> archive-zip and builtin-tag (and in two of the tests), but not in any 
> commit path as far as I can tell.

  Well, that was what I recalled. I'll pass the word that I want a copy
of the message so that people don't trash their history when it happens.

On Wed, Jun 11, 2008 at 05:25:28PM +0000, Linus Torvalds wrote:
> Do you have people using special flags for your NFS mounts? And do you 
> know if there is some pattern to the client kernel versions when the 
> problem happens?

  Not that I'm aware of. One machine with issues has this:
192.168.2.2:/home on /home type nfs (rw,noatime,bg,intr,hard,tcp,rsize=65536,wsize=65536,addr=192.168.2.2)

  IOW nothing fancy that I can see.

On Wed, Jun 11, 2008 at 05:46:00PM +0000, Linus Torvalds wrote:
> On Wed, 11 Jun 2008, Linus Torvalds wrote:
> > 
> > Do you have people using special flags for your NFS mounts? And do you 
> > know if there is some pattern to the client kernel versions when the 
> > problem happens?
> 
> Oh, before I even go there - let's get the _really_ obvious case out of 
> the way first.
> 
> If you are using a shared git object repository (why are you doing that, 
> btw?), are people perhaps doing things like "git gc --auto" etc at the 
> same time? Perhaps even unknowingly, thanks to autogc?

  No, we're not using a shared git object repository; each developer
has a git checkout in his /home (on NFS) but works for real in a workdir
that lives on his local hard drive (to get faster compilation times,
because NFS really sucks at speed for compilation). Though people
working on plain NFS have had the same problems.

  I also had the same reaction as you, and I've seen such problems
occur when the operation we were doing was just "git commit -as" or
something very similar. No, nothing in progress in any other terminal at
the same time.

> So if one client is doing some kind of gc and creates a new pack-file and 
> then removes old loose objects, and another client has already looked up 
> and opened that loose object (but not finished reading it), then when the 
> file gets removed, you will literally lose the data on the other client 
> and get a short read!

  I don't think that is what happens.

> And nothing we can do can ever fix this very fundamental issue of NFS. NFS 
> simply isn't an even remotely POSIX filesystem, even though it's set up to 
> mostly _look_ like one when accessed from a single client.
> 
> In general, I would discourage people ever sharing object directories 
> among multiple users except in a server kind of environment (eg 
> kernel.org). 

  They're not shared repositories, really; it's just for convenience, so
that the object store is on NFS (and backed up like everything on the
NFS server) but the checkouts are on local hard drives, so that all
operations on the checkout are local. And when people commit, it goes
into the NFS and all is perfectly fine. Those repositories are *ONLY*
used by one physical user exclusively.


  I'll try to do some commits on NFS repeatedly tonight to trigger the
issue; I'll keep you posted.


-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org



* Re: Consolidate SHA1 object file close
  2008-06-11 22:25           ` Pierre Habouzit
@ 2008-06-11 23:03             ` Pierre Habouzit
  2008-06-12 15:33             ` Linus Torvalds
  1 sibling, 0 replies; 11+ messages in thread
From: Pierre Habouzit @ 2008-06-11 23:03 UTC (permalink / raw)
  To: Linus Torvalds, Junio C Hamano, Git Mailing List, Denis Bueno



On Wed, Jun 11, 2008 at 10:25:34PM +0000, Pierre Habouzit wrote:
> On Wed, Jun 11, 2008 at 05:34:53PM +0000, Linus Torvalds wrote:
> > 
> > 
> > > > >   Quite often, when people commit, they end up with corrupt repositories. The
> > > > > symptom is a `cannot read <sha1>` error message (or several at a time).
> > 
> > Btw, do you have exact error messages?
> 
>   Well, like I said, I haven't had the problem myself, and it's hard to
> reproduce; people hit it about once a week on average, for people who
> perform around 50 to 100 commits a week.

  I should say that the problem happens more often when people use the script
I've inlined at the end (this script roughly does what git pull --rebase
nowadays does, with a dirty workdir; I'm not its author, and the goal isn't
to discuss its lack of beauty or whatever, just to point out some
possible things that lead to the NFS issue).

  Next time something breaks, I'll be able to give you a log.

-------------------------------------
#!/bin/bash

OPTIONS_SPEC=
SUBDIRECTORY_OK=t
. git-sh-setup
require_work_tree

case $# in
    0)
        branch=origin/$(git symbolic-ref HEAD|sed -e 's!refs/heads/!!')
        ;;
    1)
        branch="$1"
        ;;
    *)
        echo 1>&2 "$(basename $0) [<branch>]"
        exit 1
        ;;
esac

if git rev-parse --verify HEAD > /dev/null &&
    git update-index --refresh &&
    git diff-files --quiet &&
    git diff-index --cached --quiet HEAD --
then
    NEED_TEMPO=
else
    NEED_TEMPO=t
fi

git fetch
test -z "$NEED_TEMPO" || git commit -a -s -m'tempo'
git rebase "$branch"

if test -n "$NEED_TEMPO"; then
    if test  -d "$(dirname "$(git rev-parse --git-dir)")/.dotest"; then
        echo ""
echo "run 'git reset HEAD~1' when rebase is finished"
    else
        git reset HEAD~1
    fi
fi
-------------------------------------

  Also note that people now use an enhanced version of that script (the
one attached). It still breaks, but less often. I'm not sure it's
valuable information at all, but one never knows.
-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org

[-- Attachment #1.2: git-up --]
[-- Type: text/plain, Size: 2367 bytes --]

#!/bin/sh

OPTIONS_SPEC="\
$(basename $0) [options] [<remote> [<branch>]]
--
k,gitk      visualize unmerged differences
r,rebase    perform a rebase
m,merge     perform a merge
"
SUBDIRECTORY_OK=t
. git-sh-setup
require_work_tree

lbranch=$(git symbolic-ref HEAD | sed -e s~refs/heads/~~)
remote=$(git config --get "branch.$lbranch.remote" || echo origin)
branch=$(git config --get "branch.$lbranch.merge" || echo "refs/heads/$lbranch")

case "$(git config --bool --get madcoder.up-gitk)" in
    true) gitk=gitk;;
    *)    gitk=:
esac
case "$(git config --bool --get "branch.$lbranch.rebase")" in
    true) action=rebase;;
    *)    action=;;
esac

while test $# != 0; do
    case "$1" in
        -k|--gitk)
            shift; gitk=gitk;;
        --no-gitk)
            shift; gitk=:;;
        -r|--rebase)
            shift; action=rebase;;
        --no-rebase)
            shift; action=${action#rebase};;    # clear the rebase action
        -m|--merge)
            shift; action=merge;;
        --no-merge)
            shift; action=${action#merge};;     # clear the merge action
        --)
            shift; break;;
        *)
            usage;;
    esac
done

case $# in
    0) ;;
    1) remote="$1";;
    2) remote="$1"; branch="$2";;
    *) usage;;
esac

git fetch "${remote}"
if test `git rev-list .."${remote}/${branch#refs/heads/}" -- | wc -l` = 0; then
    echo "Current branch $lbranch is up to date."
    exit 0
fi

$gitk .."${remote}/${branch#refs/heads/}" --
if test -z "$action"; then
    echo -n "(r)ebase/(m)erge/(q)uit ? "
    read ans
    case "$ans" in
        r*) action=rebase;;
        m*) action=merge;;
        *)  exit 0;;
    esac
fi

unclean=
git rev-parse --verify HEAD > /dev/null && \
    git update-index --refresh && \
    git diff-files --quiet && \
    git diff-index --cached --quiet HEAD -- || unclean=t

case "$action" in
    rebase)
        test -z "$unclean" || git stash save "git-up stash"
        git rebase "${remote}/${branch#refs/heads/}"
        ;;
    merge)
        test -z "$unclean" || git stash save "git-up stash"
        git merge "${remote}/${branch#refs/heads/}"
        ;;
    *)
        echo 1>&2 "no action specified"
        exit 1
        ;;
esac

if test -n "$unclean"; then
    if test  -d "$(git rev-parse --git-dir)/../.dotest"; then
        echo ""
        echo "run 'git stash apply' when rebase is finished"
    else
        git stash apply
    fi
fi



* Re: Consolidate SHA1 object file close
  2008-06-11 22:25           ` Pierre Habouzit
  2008-06-11 23:03             ` Pierre Habouzit
@ 2008-06-12 15:33             ` Linus Torvalds
  2008-06-12 16:00               ` Pierre Habouzit
  1 sibling, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2008-06-12 15:33 UTC (permalink / raw)
  To: Pierre Habouzit; +Cc: Junio C Hamano, Git Mailing List, Denis Bueno



On Thu, 12 Jun 2008, Pierre Habouzit wrote:
> 
>   No, we're not using a shared git object repository; each developer
> has a git checkout in his /home (on NFS) but works for real in a workdir
> that lives on his local hard drive (to get faster compilation times,
> because NFS really sucks at speed for compilation). Though people
> working on plain NFS have had the same problems.

Ahhh..

In that case it's not going to be a client caching issue - at least not in 
the sense that two different clients are out-of-sync with each other wrt 
caches. It sounds as if you only ever have one client that reads and
writes to the same git repository at a time.

So scratch all the previous theory.

Quite frankly, in that case, it sounds more like simply some NFS problem. 
And we _have_ had NFS problems before. See the threads

 - bug: git-repack -a -d produces broken pack on NFS

   Turned out to apparently be ethernet packet corruption that was not 
   detected by the hardware and was due to a badly seated ethernet card!

 - git 1.5.3.5 error over NFS

   Some unexplained corruption due to problems with pread() on NFS not
   returning data that was previously written.

for example.

Basically, NFS has many serious failure cases that can go undetected, and 
it _could_ be that you actually have flaky NFS but never noticed it before 
because most tools don't care as deeply as git does (ie if a bit is 
flipped in some random data, a lot of tools will never notice). There are 
supposed to be checksums etc on the network packets that NFS uses, but:

 - the ethernet checksum (which is a fairly strong CRC) is sadly often not 
   even checked by some switches and/or cards, and especially if it's a 
   store-and-forward switch that doesn't check the CRC properly, it can 
   end up re-sending a corrupt packet with a recomputed ethernet CRC that 
   now matches the _corrupt_ data. Oops.

 - Perhaps worse, the ethernet checksum is purely a physical layer one, 
   not an end-to-end checksum, which not only explains how a switch can 
   re-generate a broken one, but also means that even if the ethernet card 
   checks it properly, it doesn't actually account for any corruption that 
   happens _afterwards_. So if there is corruption going from the card to 
   memory (which was apparently the problem in the first git thread 
   above), the CRC got checked earlier and the new corruption isn't found.

 - there _is_ a TCP/IP-level packet check, with a checksum of the IP
   header, and a separate checksum of UDP and TCP data. HOWEVER. All these 
   checksums are very very weak, and to make things worse, the UDP 
   checksum can be entirely disabled, and quite often "better" ethernet 
   cards will do checksumming for you in hardware, which again means that 
   it's not an end-to-end checksum, and you have the exact same failure 
   case as with the ethernet CRC.

IOW, there are safety nets in place, but they tend to be fairly easily 
broken under certain circumstances.

Add to the above the possibility of just a kernel NFS bug (or an NFSd one),
and it would really be very interesting to hear:

 - do the errors seem to happen more at certain clients than others?

   If it's a client-side problem, it really should happen more for certain 
   kernel versions or certain hardware.

 - have you had any other anecdotal evidence of problems with non-git 
   usage? Unexplained SIGSEGV's if you have binaries over NFS, for 
   example? Strange syntax errors when compiling over NFS?

I'm not discounting a git bug, but quite frankly, it really is worth 
checking that your network/NFS setup is solid.

			Linus


* Re: Consolidate SHA1 object file close
  2008-06-12 15:33             ` Linus Torvalds
@ 2008-06-12 16:00               ` Pierre Habouzit
  0 siblings, 0 replies; 11+ messages in thread
From: Pierre Habouzit @ 2008-06-12 16:00 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List, Denis Bueno


On Thu, Jun 12, 2008 at 03:33:53PM +0000, Linus Torvalds wrote:
> IOW, there are safety nets in place, but they tend to be fairly easily 
> broken under certain circumstances.

  Right, though this is a LAN with very few switches between the
clients and the server, and the issue happens on any client with that
setup.

> Add to the above the possibility of just a kernel NFS bug (or a NFSd one), 
> and it would really be very interesting to hear:
> 
>  - do the errors seem to happen more at certain clients than others?

  Not really; it happens more often with the script I inlined in one of
my mails than with the one I attached to it. And people working on
_pure_ NFS have the issue a bit less often than the ones using their
workdir on a separate local device. That's all I can tell for now.

  One of the developers told me that this pattern of use triggers the
problem more for him: he has a 'master' branch checked out in his NFS
home, and a 'local' branch in his local hard drive workdir[0]. The issue
happens more when he's working in the two workdirs at the same time
(for a value of "at the same time" that means within the same minute, not
at the same nanosecond of course; he never commits in both workdirs at
the same time).  When he only works in the NFS 'master' or the local
'local' branch, it happens much less often.

>    If it's a client-side problem, it really should happen more for certain 
>    kernel versions or certain hardware.

  It doesn't, afaict. Clients are heterogeneous in kernel versions (.18,
.22, .24, .25 for what I've seen), and in hardware (all machines are
Dell computers, but from really different years, hence different mobos
and NICs. Some even have non-Dell gigabit NICs in them).


>  - have you had any other anecdotal evidence of problems with non-git 
>    usage? Unexplained SIGSEGV's if you have binaries over NFS, for 
>    example? Strange syntax errors when compiling over NFS?

  Not really, no. Our NFS server is remarkably stable with everything else
(it's also remarkably slow compared to local drives, but that's not
really relevant ;p).

> I'm not discounting a git bug, but quite frankly, it really is worth 
> checking that your network/NFS setup is solid.

  Well, to date I'd say it's quite solid. We are a software company and
have even tested our own software (which makes heavy use of mmap, pread,
pwrite, and other things that NFS often doesn't deal with very well)
on that very server, without a glitch that could have been attributed to
NFS (I mean, we've had tons of bugs, but it was always our software in
the end ;p). Though we really rarely pread what we've just pwrite-d or
things like that, so maybe we never triggered a possible kernel bug
either :)


  [0] the reason for that setup is that when we work on topic branches,
      we sometimes spot big bugs that are small to fix, and we use the
      "NFS" master branch to push those bugfixes as soon as we find
      them, whereas 'local' is pushed only when the feature is ready.

-- 
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org


