* Consolidate SHA1 object file close

From: Linus Torvalds @ 2008-06-11  1:47 UTC
To: Junio C Hamano, Git Mailing List; +Cc: Denis Bueno

This consolidates the common operations for closing the new temporary file
that we have written, before we move it into place with the final name.

There's some common code there (make it read-only and check for errors on
close), but more importantly, this also gives a single place to add an
fsync_or_die() call if we want to add a safe mode.

This was triggered by Denis Bueno apparently twice being able to corrupt
his git repository on OS X due to an unlucky combination of kernel crashes
and a not-very-robust filesystem.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
---

Junio, this is just the scaffolding, without the actual fsync_or_die call.

I think it's a worthy place-holder regardless of whether we really want to
do the fsync (whether conditionally with a config option or not, and
whether there are more clever options like aio_fsync()).

 sha1_file.c | 17 +++++++++++------
 1 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/sha1_file.c b/sha1_file.c
index adcf37c..f311c79 100644
--- a/sha1_file.c
+++ b/sha1_file.c
@@ -2105,6 +2105,15 @@ int hash_sha1_file(const void *buf, unsigned long len, const char *type,
 	return 0;
 }
 
+/* Finalize a file on disk, and close it. */
+static void close_sha1_file(int fd)
+{
+	/* For safe-mode, we could fsync_or_die(fd, "sha1 file") here */
+	fchmod(fd, 0444);
+	if (close(fd) != 0)
+		die("unable to write sha1 file");
+}
+
 static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
 			      void *buf, unsigned long len, time_t mtime)
 {
@@ -2170,9 +2179,7 @@ static int write_loose_object(const unsigned char *sha1, char *hdr, int hdrlen,
 	if (write_buffer(fd, compressed, size) < 0)
 		die("unable to write sha1 file");
-	fchmod(fd, 0444);
-	if (close(fd))
-		die("unable to write sha1 file");
+	close_sha1_file(fd);
 	free(compressed);
 
 	if (mtime) {
@@ -2350,9 +2357,7 @@ int write_sha1_from_fd(const unsigned char *sha1, int fd, char *buffer,
 	} while (1);
 	inflateEnd(&stream);
 
-	fchmod(local, 0444);
-	if (close(local) != 0)
-		die("unable to write sha1 file");
+	close_sha1_file(local);
 	SHA1_Final(real_sha1, &c);
 	if (ret != Z_STREAM_END) {
 		unlink(tmpfile);
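
For reference, a minimal sketch of what the safe-mode variant could look
like. The fsync_object_files flag and the fsync_or_die() helper are
assumptions, not part of the patch above (a knob along these lines did
later land in git as core.fsyncObjectFiles); the sketch also assumes
git's usual includes for die(), strerror() and errno:

    /* Hypothetical safe-mode variant of close_sha1_file(). The
     * fsync_object_files flag and fsync_or_die() helper are assumed,
     * not part of the patch above. */
    static int fsync_object_files;	/* would be set from a config option */

    static void fsync_or_die(int fd, const char *msg)
    {
    	if (fsync(fd) < 0)
    		die("%s: fsync error (%s)", msg, strerror(errno));
    }

    static void close_sha1_file(int fd)
    {
    	if (fsync_object_files)
    		fsync_or_die(fd, "sha1 file");
    	fchmod(fd, 0444);
    	if (close(fd) != 0)
    		die("unable to write sha1 file");
    }

Guarding the fsync() behind a flag keeps the common case cheap: an
fsync() on every loose object write can be very expensive, which is why
the cover letter hedges on whether to do it unconditionally.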
* Re: Consolidate SHA1 object file close

From: Andreas Ericsson @ 2008-06-11  7:42 UTC
To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List, Denis Bueno

Linus Torvalds wrote:
> This consolidates the common operations for closing the new temporary file
> that we have written, before we move it into place with the final name.
>
> There's some common code there (make it read-only and check for errors on
> close), but more importantly, this also gives a single place to add an
> fsync_or_die() call if we want to add a safe mode.
>
> This was triggered by Denis Bueno apparently twice being able to corrupt
> his git repository on OS X due to an unlucky combination of kernel crashes
> and a not-very-robust filesystem.
>
> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
> ---
>
> Junio, this is just the scaffolding, without the actual fsync_or_die call.
>
> I think it's a worthy place-holder regardless of whether we really want to
> do the fsync (whether conditionally with a config option or not, and
> whether there are more clever options like aio_fsync()).
>
>  sha1_file.c | 17 +++++++++++------
>  1 files changed, 11 insertions(+), 6 deletions(-)
>
> diff --git a/sha1_file.c b/sha1_file.c
> index adcf37c..f311c79 100644
> --- a/sha1_file.c
> +++ b/sha1_file.c
> @@ -2105,6 +2105,15 @@ int hash_sha1_file(const void *buf, unsigned long len, const char *type,
>  	return 0;
>  }
>
> +/* Finalize a file on disk, and close it. */
> +static void close_sha1_file(int fd)

Why close_sha1_file() when it operates on any old file?

I'd name it crash_safe_close() or perhaps close_and_fsync() or something
instead, as it's got nothing to do with SHA-1s and everything to do with
plain old files.

Other than that, I'm all for it.

--
Andreas Ericsson                   andreas.ericsson@op5.se
OP5 AB                             www.op5.se
Tel: +46 8-230225                  Fax: +46 8-230231
* Re: Consolidate SHA1 object file close

From: Pierre Habouzit @ 2008-06-11  7:43 UTC
To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List, Denis Bueno

On Wed, Jun 11, 2008 at 01:47:18AM +0000, Linus Torvalds wrote:
>
> This consolidates the common operations for closing the new temporary file
> that we have written, before we move it into place with the final name.
>
> There's some common code there (make it read-only and check for errors on
> close), but more importantly, this also gives a single place to add an
> fsync_or_die() call if we want to add a safe mode.
>
> This was triggered by Denis Bueno apparently twice being able to corrupt
> his git repository on OS X due to an unlucky combination of kernel crashes
> and a not-very-robust filesystem.

Could this be the source of a problem we often run into at work? Let me
try to describe it.

We work with our git repositories (storages, I should say) on NFS homes,
with workdirs in a local directory (NFS homes are backed up daily, hence
everything committed gets backed up, and developers get shorter
compilation times thanks to the local FS). I don't think the workdir use
is relevant, because I use workdirs almost the same way without NFS and
have no issues, but I mention it just in case.

Quite often, when people commit, they end up with corrupt repositories.
The symptom is a `cannot read <sha1>` error message (or many at times).
The usual way to "fix" it is to run git fsck and then git reset (because
after the fsck the index is totally screwed and all local files are
marked new), and usually everything is fine after that.

This is not really a hard corruption, and it's really hard to reproduce.
I don't know why it happens, and I wonder if this patch could help, or if
it's unrelated. I can only offer speculation, as it's really hard to
reproduce, and it quite depends on the load of the NFS server :/

--
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org
* Re: Consolidate SHA1 object file close

From: Linus Torvalds @ 2008-06-11 15:17 UTC
To: Pierre Habouzit; +Cc: Junio C Hamano, Git Mailing List, Denis Bueno

On Wed, 11 Jun 2008, Pierre Habouzit wrote:
>
> Could this be the source of a problem we often run into at work? Let me
> try to describe it.

The fsync() *should* make no difference unless you actually crash. So my
initial reaction is no, but non-coherent client-side write caching over
NFS may actually make a difference.

> We work with our git repositories (storages, I should say) on NFS homes,
> with workdirs in a local directory (NFS homes are backed up daily, hence
> everything committed gets backed up, and developers get shorter
> compilation times thanks to the local FS).

Ok, so your actual git object directory is on NFS?

> Quite often, when people commit, they end up with corrupt repositories.
> The symptom is a `cannot read <sha1>` error message (or many at times).
> The usual way to "fix" it is to run git fsck and then git reset (because
> after the fsck the index is totally screwed and all local files are
> marked new), and usually everything is fine after that.

Hmm. Very interesting. That definitely sounds like a cache coherency
issue (ie the "fsck" probably doesn't really _do_ anything, it just
delays things and possibly causes memory pressure to throw some stuff out
of the cache).

What clients, what server?

NFS clients (I assume v2, which is not coherent) _should_ be doing what
is called open-close consistency, which means that while clients can
cache data locally, they should aim to be consistent between two clients
over an open-close pair (ie if two clients have the same file open at the
same time, there are no consistency guarantees, but if you close on one
client and then open on another, the data should be consistent).

If open-close consistency doesn't work, then things like various parallel
load distribution setups (clusters with an NFS filesystem doing parallel
makes, etc) don't tend to work all that well either (ie an object file is
written on one client, and then used for linking on another).

And that is what git does: even without the fsync(), git will close() the
file before it actually does the link + unlink to move it to the new
position. So it all _should_ be perfectly consistent even in the absence
of explicit syncs.

That said, if there is some problem with that whole thing, then yes, the
fsync() may well hide it. So yes, adding the fsync() is certainly worth
testing.

> This is not really a hard corruption, and it's really hard to reproduce.
> I don't know why it happens, and I wonder if this patch could help, or if
> it's unrelated. I can only offer speculation, as it's really hard to
> reproduce, and it quite depends on the load of the NFS server :/

Yes, that sounds very much like a cache coherency issue. The "corruption"
goes away when the cache gets flushed and the clients see the real state
again. But as mentioned, git should already do things in a way that this
should all work, but hey, that's using certain assumptions that perhaps
aren't true in your environment.

		Linus
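
To make the ordering concrete, here is a simplified sketch of the
write-then-move pattern described above. The function name and error
handling are illustrative assumptions, not git's actual code:

    /* Illustrative sketch of git's write-then-move pattern: the close()
     * happens before link()/unlink(), so under open-close consistency
     * any client that later opens the final name sees complete data.
     * Not git's real code; error handling is abbreviated. */
    #include <errno.h>
    #include <fcntl.h>
    #include <unistd.h>

    static int publish_file(const char *tmppath, const char *finalpath,
                            const void *data, size_t len)
    {
    	int fd = open(tmppath, O_CREAT | O_EXCL | O_WRONLY, 0444);
    	if (fd < 0)
    		return -1;
    	if (write(fd, data, len) != (ssize_t)len)
    		return -1;
    	if (close(fd) != 0)		/* flushes dirty data to the server */
    		return -1;
    	if (link(tmppath, finalpath) < 0 && errno != EEXIST)
    		return -1;		/* object may already exist: fine */
    	return unlink(tmppath);		/* drop the temporary name */
    }

The key property is that no other client can learn the final name until
after the writer's close() has completed, which is exactly the window
that open-close consistency promises to keep coherent.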
* Re: Consolidate SHA1 object file close

From: Pierre Habouzit @ 2008-06-11 15:40 UTC
To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List, Denis Bueno

On Wed, Jun 11, 2008 at 03:17:04PM +0000, Linus Torvalds wrote:
> On Wed, 11 Jun 2008, Pierre Habouzit wrote:
> >
> > Could this be the source of a problem we often run into at work? Let
> > me try to describe it.
>
> The fsync() *should* make no difference unless you actually crash. So my
> initial reaction is no, but non-coherent client-side write caching over
> NFS may actually make a difference.

That's what I thought as well, but … one never knows ;)

> > We work with our git repositories (storages, I should say) on NFS
> > homes, with workdirs in a local directory (NFS homes are backed up
> > daily, hence everything committed gets backed up, and developers get
> > shorter compilation times thanks to the local FS).
>
> Ok, so your actual git object directory is on NFS?

Yes.

> > Quite often, when people commit, they end up with corrupt
> > repositories. The symptom is a `cannot read <sha1>` error message (or
> > many at times). The usual way to "fix" it is to run git fsck and then
> > git reset (because after the fsck the index is totally screwed and all
> > local files are marked new), and usually everything is fine after that.
>
> Hmm. Very interesting. That definitely sounds like a cache coherency
> issue (ie the "fsck" probably doesn't really _do_ anything, it just
> delays things and possibly causes memory pressure to throw some stuff out
> of the cache).
>
> What clients, what server?

The server runs the NFSv3 kernel server from Debian etch's 2.6.18 (up to
date). Clients are various Ubuntu/Debian machines with at least 2.6.18
kernels; some run .22, .24 and .25. It's a really simple setup, no
clusters are involved. The server exports an ext3-over-dm-crypt
partition, though I would be surprised if that mattered.

> That said, if there is some problem with that whole thing, then yes, the
> fsync() may well hide it. So yes, adding the fsync() is certainly worth
> testing.

Okay, I'll try to make my colleagues use that to see if they still have
the issues. I work on a laptop and not on NFS, so I'm not the one having
the issues, only the one having to fix them on others' machines ;P

> > This is not really a hard corruption, and it's really hard to
> > reproduce. I don't know why it happens, and I wonder if this patch
> > could help, or if it's unrelated. I can only offer speculation, as
> > it's really hard to reproduce, and it quite depends on the load of
> > the NFS server :/
>
> Yes, that sounds very much like a cache coherency issue. The "corruption"
> goes away when the cache gets flushed and the clients see the real state
> again. But as mentioned, git should already do things in a way that this
> should all work, but hey, that's using certain assumptions that perhaps
> aren't true in your environment.

Well, we have had the issue for quite a long time actually, and given
that it's hard to reproduce, I'm never in a position to give more useful
information :/ We'll see if the fsync() helps or not…

--
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org
* Re: Consolidate SHA1 object file close

From: Linus Torvalds @ 2008-06-11 17:25 UTC
To: Pierre Habouzit; +Cc: Junio C Hamano, Git Mailing List, Denis Bueno

On Wed, 11 Jun 2008, Pierre Habouzit wrote:
> >
> > Hmm. Very interesting. That definitely sounds like a cache coherency
> > issue (ie the "fsck" probably doesn't really _do_ anything, it just
> > delays things and possibly causes memory pressure to throw some stuff
> > out of the cache).
> >
> > What clients, what server?
>
> The server runs the NFSv3 kernel server from Debian etch's 2.6.18 (up
> to date).

Ok, it's almost impossible for it to be a server-side issue then - I
could imagine one if you had some fancy cluster server or something, but
in that kind of straightforward situation the only thing that is going to
matter is the client-side caching.

I stopped using NFS so long ago that the only case I ever worried about
and knew anything about was the traditional v2 issues. But iirc, v3 does
nothing much to change the caching rules (v4, on the other hand, does add
delegations and explicit caching support).

> Clients are various Ubuntu/Debian machines with at least 2.6.18
> kernels; some run .22, .24 and .25.

I'll ask Trond if he has any comments on this from the NFS client side.
We _did_ hit some other NFS client bug with git long ago; I forget what
it was all about (pread/pwrite?).

What is quite odd, though, is that exactly because of how git works, I
would normally expect each client to not even *try* to look up objects
that are written by other clients until long, long after they have been
written.

IOW, access to new objects is not something a git client will do just
because the object suddenly appears in a directory - after the file is
written and closed, it will not just be moved to the right position, but
there have to be *other* files modified (ie the refs) to tell other
clients about the changes too!

And that matters because even though there can be things like local
directory caches (and Linux does support negative caching - ie caching
the fact that a file was *not* there), those caches should be empty
simply because other clients that didn't create the file shouldn't even
have tried to look up non-existent objects!

If it's a directory content caching issue, then adding an fsync() won't
matter. In fact, the fsync() should matter only if the client who wrote
the object didn't write it out to the server at close() time, and that
really sounds very unlikely indeed.

So my personal guess is that the fsync() won't make any difference at
all.

Do you have people using special flags for your NFS mounts? And do you
know if there is some pattern to the client kernel versions when the
problem happens?

		Linus
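
A small sketch of the lookup pattern that negative caching affects; the
paths are made up, and the single-process demo can only show the local
behavior (the NFS failure mode needs a second client doing the first,
failing lookup):

    /* Demonstrates the lookup sequence negative dentry caching affects:
     * stat() of a missing name, then creation, then stat() again. On a
     * local fs the second stat() always succeeds; over NFS, a negative
     * entry cached by *another* client on the first lookup can make its
     * later lookups fail for a while even after the file exists. */
    #include <assert.h>
    #include <fcntl.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void)
    {
    	struct stat st;

    	assert(stat("objects/aa/bb", &st) < 0);	/* miss: may be cached */
    	/* ... the object gets created (possibly by another client) ... */
    	mkdir("objects", 0755);
    	mkdir("objects/aa", 0755);
    	assert(close(open("objects/aa/bb", O_CREAT | O_WRONLY, 0444)) == 0);
    	assert(stat("objects/aa/bb", &st) == 0);	/* local fs: hits */
    	return 0;
    }

Linus's point is that git's access pattern should never even execute the
first stat(): a client only learns an object's name from a ref, which is
written strictly after the object itself.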
* Re: Consolidate SHA1 object file close

From: Linus Torvalds @ 2008-06-11 17:46 UTC
To: Pierre Habouzit; +Cc: Junio C Hamano, Git Mailing List, Denis Bueno

On Wed, 11 Jun 2008, Linus Torvalds wrote:
>
> Do you have people using special flags for your NFS mounts? And do you
> know if there is some pattern to the client kernel versions when the
> problem happens?

Oh, before I even go there - let's get the _really_ obvious case out of
the way first.

If you are using a shared git object repository (why are you doing that,
btw?), are people perhaps doing things like "git gc --auto" etc at the
same time? Perhaps even unknowingly, thanks to autogc?

That's an absolute no-no. It works on a real POSIX filesystem, because
even if you unlink a file that is in use by another process, the other
process still has access to the data, and the file won't be *really*
removed until all users have gone away. That's also true within a
_single_ NFS client thanks to so-called "silly renaming", but it is *not*
true across multiple clients.

So if one client is doing some kind of gc and creates a new pack-file and
then removes old loose objects, and another client has already looked up
and opened that loose object (but not finished reading it), then when the
file gets removed, you will literally lose the data on the other client
and get a short read!

And nothing we can do can ever fix this very fundamental issue of NFS.
NFS simply isn't even a remotely POSIX filesystem, even though it's set
up to mostly _look_ like one when accessed from a single client.

In general, I would discourage people from ever sharing object
directories among multiple users except in a server kind of environment
(eg kernel.org).

		Linus
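
The POSIX guarantee in question is easy to demonstrate with a small
self-contained sketch (single machine; across two NFS clients the read
below can come up short, which is exactly the failure described):

    /* Demonstrates the POSIX unlink-while-open guarantee: data stays
     * readable through an already-open fd after the name is removed.
     * Across two NFS clients there is no such guarantee. */
    #include <assert.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
    	char buf[4];
    	int fd = open("demo.tmp", O_CREAT | O_EXCL | O_RDWR, 0644);

    	assert(fd >= 0);
    	assert(write(fd, "data", 4) == 4);
    	assert(unlink("demo.tmp") == 0);	/* the name is gone ...   */
    	assert(lseek(fd, 0, SEEK_SET) == 0);
    	assert(read(fd, buf, 4) == 4);		/* ... the data is not    */
    	assert(memcmp(buf, "data", 4) == 0);
    	return close(fd);
    }

On a single NFS client the kernel fakes this by silly-renaming the file
to a .nfsXXXX name instead of removing it; an unlink issued by a second
client bypasses that, which is why concurrent gc against a shared object
store over NFS can produce short reads.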
[parent not found: <alpine.LFD.1.10.0806111030580.3101@woody.linux-foundation.org>]
* Re: Consolidate SHA1 object file close

From: Pierre Habouzit @ 2008-06-11 22:25 UTC
To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List, Denis Bueno

On Wed, Jun 11, 2008 at 05:34:53PM +0000, Linus Torvalds wrote:
> > > > Quite often, when people commit, they end up with corrupt
> > > > repositories. The symptom is a `cannot read <sha1>` error message
> > > > (or many at times).
>
> Btw, do you have exact error messages?

Well, as I said, I haven't had the problem myself, and it's hard to
reproduce: people hit it about once a week on average, among people who
perform around 50 to 100 commits a week.

> That plain "cannot read <sha1>" sounds unlikely. It exists in archive-tar,
> archive-zip and builtin-tag (and in two of the tests), but not in any
> commit paths that I can tell.

Well, that was from memory; I'll pass the word that I want a copy of the
message so that people don't trash their history when it happens.

On Wed, Jun 11, 2008 at 05:25:28PM +0000, Linus Torvalds wrote:
> Do you have people using special flags for your NFS mounts? And do you
> know if there is some pattern to the client kernel versions when the
> problem happens?

Not that I'm aware of. One machine with issues has this:

  192.168.2.2:/home on /home type nfs (rw,noatime,bg,intr,hard,tcp,rsize=65536,wsize=65536,addr=192.168.2.2)

IOW, nothing fancy that I can see.

On Wed, Jun 11, 2008 at 05:46:00PM +0000, Linus Torvalds wrote:
> If you are using a shared git object repository (why are you doing that,
> btw?), are people perhaps doing things like "git gc --auto" etc at the
> same time? Perhaps even unknowingly, thanks to autogc?

No, we're not using a shared git object repository: each developer has a
git checkout in his /home (on NFS) but works for real in a workdir that
lives on his local hard drive (to get faster compilation times, because
NFS is really slow for compilation). That said, people working on plain
NFS have had the same problems.

I had the same reaction as you did, but I've seen these problems occur
when the operation being performed was just "git commit -as" or something
very similar, with nothing in progress in any other terminal at the same
time.

> So if one client is doing some kind of gc and creates a new pack-file and
> then removes old loose objects, and another client has already looked up
> and opened that loose object (but not finished reading it), then when the
> file gets removed, you will literally lose the data on the other client
> and get a short read!

I don't think that is what happens.

> And nothing we can do can ever fix this very fundamental issue of NFS.
> NFS simply isn't even a remotely POSIX filesystem, even though it's set
> up to mostly _look_ like one when accessed from a single client.
>
> In general, I would discourage people from ever sharing object
> directories among multiple users except in a server kind of environment
> (eg kernel.org).

It's not shared repositories, really; it's just for convenience that the
object store is on NFS (and backed up like everything on the NFS server)
while the checkouts are on local hard drives, so that all operations on
the checkout are local. When people commit, it goes onto the NFS share
and all is perfectly fine. Those repositories are *only* used by one
physical user exclusively.

I'll try to do some commits on NFS repeatedly tonight to trigger the
issue; I'll keep you posted.

--
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org
* Re: Consolidate SHA1 object file close

From: Pierre Habouzit @ 2008-06-11 23:03 UTC
To: Linus Torvalds, Junio C Hamano, Git Mailing List, Denis Bueno

On Wed, Jun 11, 2008 at 10:25:34PM +0000, Pierre Habouzit wrote:
> On Wed, Jun 11, 2008 at 05:34:53PM +0000, Linus Torvalds wrote:
> > Btw, do you have exact error messages?
>
> Well, as I said, I haven't had the problem myself, and it's hard to
> reproduce: people hit it about once a week on average, among people who
> perform around 50 to 100 commits a week.

I should add that the problem happens more often when people use the
script inlined below (this script does, with a dirty workdir, roughly
what git pull --rebase does nowadays; I'm not its author, and the goal
isn't to discuss its lack of beauty, just to point out some possible
things that lead to the NFS issue). Next time something breaks, I'll be
able to give you a log.

-------------------------------------
#!/bin/bash

OPTIONS_SPEC=
SUBDIRECTORY_OK=t
. git-sh-setup
require_work_tree

case $# in
0)
	branch=origin/$(git symbolic-ref HEAD | sed -e 's!refs/heads/!!')
	;;
1)
	branch="$1"
	;;
*)
	echo 1>&2 "$(basename $0) [<branch>]"
	exit 1
	;;
esac

if git rev-parse --verify HEAD > /dev/null &&
   git update-index --refresh &&
   git diff-files --quiet &&
   git diff-index --cached --quiet HEAD --
then
	NEED_TEMPO=
else
	NEED_TEMPO=t
fi

git fetch
test -z "$NEED_TEMPO" || git commit -a -s -m'tempo'
git rebase "$branch"

if test -n "$NEED_TEMPO"; then
	if test -d "$(dirname "$(git rev-parse --git-dir)")/.dotest"; then
		echo ""
		echo "run 'git reset HEAD~1' when rebase is finished"
	else
		git reset HEAD~1
	fi
fi
-------------------------------------

Also note that people now use an enhanced version of that script (the
git-up script below). It still breaks, but less often. I'm not sure this
is valuable information at all, but one never knows.

[-- Attachment: git-up --]

#!/bin/sh

OPTIONS_SPEC="\
$(basename $0) [options] [<remote> [<branch>]]
--
k,gitk    visualize unmerged differences
r,rebase  perform a rebase
m,merge   perform a merge
"
SUBDIRECTORY_OK=t
. git-sh-setup
require_work_tree

lbranch=$(git symbolic-ref HEAD | sed -e s~refs/heads/~~)
remote=$(git config --get "branch.$lbranch.remote" || echo origin)
branch=$(git config --get "branch.$lbranch.merge" || echo "refs/heads/$lbranch")

case "$(git config --bool --get madcoder.up-gitk)" in
true) gitk=gitk;;
*)    gitk=:;;
esac

case "$(git config --bool --get "branch.$lbranch.rebase")" in
true) action=rebase;;
*)    action=;;
esac

while test $# != 0; do
	case "$1" in
	-k|--gitk)   shift; gitk=gitk;;
	--no-gitk)   shift; gitk=:;;
	-r|--rebase) shift; action=rebase;;
	--no-rebase) shift; rebase=${rebase#rebase};;
	-m|--merge)  shift; action=merge;;
	--no-merge)  shift; rebase=${rebase#merge};;
	--) shift; break;;
	*)  usage;;
	esac
done

case $# in
0) ;;
1) remote="$1";;
2) remote="$1"; branch="$2";;
*) usage;;
esac

git fetch "${remote}"
if test `git rev-list .."${remote}/${branch#refs/heads/}" -- | wc -l` = 0; then
	echo "Current branch $lbranch is up to date."
	exit 0
fi

$gitk .."${remote}/${branch#refs/heads/}" --

if test -z "$action"; then
	echo -n "(r)ebase/(m)erge/(q)uit ? "
	read ans
	case "$ans" in
	r*) action=rebase;;
	m*) action=merge;;
	*)  exit 0;;
	esac
fi

unclean=
git rev-parse --verify HEAD > /dev/null && \
	git update-index --refresh && \
	git diff-files --quiet && \
	git diff-index --cached --quiet HEAD -- || unclean=t

case "$action" in
rebase)
	test -z "$unclean" || git stash save "git-up stash"
	git rebase "${remote}/${branch#refs/heads/}"
	;;
merge)
	test -z "$unclean" || git stash save "git-up stash"
	git merge "${remote}/${branch#refs/heads/}"
	;;
*)
	echo 1>&2 "no action specified"
	exit 1
	;;
esac

if test -n "$unclean"; then
	if test -d "$(git rev-parse --git-dir)/../.dotest"; then
		echo ""
		echo "run 'git stash apply' when rebase is finished"
	else
		git stash apply
	fi
fi

--
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org
* Re: Consolidate SHA1 object file close

From: Linus Torvalds @ 2008-06-12 15:33 UTC
To: Pierre Habouzit; +Cc: Junio C Hamano, Git Mailing List, Denis Bueno

On Thu, 12 Jun 2008, Pierre Habouzit wrote:
>
> No, we're not using a shared git object repository: each developer has a
> git checkout in his /home (on NFS) but works for real in a workdir that
> lives on his local hard drive (to get faster compilation times, because
> NFS is really slow for compilation). That said, people working on plain
> NFS have had the same problems.

Ahhh.

In that case it's not going to be a client caching issue - at least not
in the sense that two different clients are out of sync with each other
wrt caches. It sounds as if you only ever have one client that reads and
writes to the same git repository at a time. So scratch all the previous
theory.

Quite frankly, in that case it sounds more like simply some NFS problem.
And we _have_ had NFS problems before. See the threads

 - "bug: git-repack -a -d produces broken pack on NFS"

   Turned out to apparently be ethernet packet corruption that was not
   detected by the hardware and was due to a badly seated ethernet card!

 - "git 1.5.3.5 error over NFS"

   Some unexplained corruption due to problems with pread() on NFS not
   returning data that was previously written.

for example. Basically, NFS has many serious failure cases that can go
undetected, and it _could_ be that you actually have flaky NFS but never
noticed it before, because most tools don't care as deeply as git does
(ie if a bit is flipped in some random data, a lot of tools will never
notice).

There are supposed to be checksums etc on the network packets that NFS
uses, but:

 - The ethernet checksum (which is a fairly strong CRC) is sadly often
   not even checked by some switches and/or cards, and especially if
   it's a store-and-forward switch that doesn't check the CRC properly,
   it can end up re-sending a corrupt packet with a recomputed ethernet
   CRC that now matches the _corrupt_ data. Oops.

 - Perhaps worse, the ethernet checksum is purely a physical-layer one,
   not an end-to-end checksum, which not only explains how a switch can
   re-generate a broken one, but also means that even if the ethernet
   card checks it properly, it doesn't account for any corruption that
   happens _afterwards_. So if there is corruption going from the card
   to memory (which was apparently the problem in the first git thread
   above), the CRC got checked earlier and the new corruption isn't
   found.

 - There _is_ a TCP/IP-level packet check, with a checksum of the IP
   header and a separate checksum of UDP and TCP data. HOWEVER. All
   these checksums are very, very weak, and to make things worse, the
   UDP checksum can be entirely disabled. Quite often "better" ethernet
   cards will also do the checksumming for you in hardware, which again
   means that it's not an end-to-end checksum, and you have the exact
   same failure case as with the ethernet CRC.

IOW, there are safety nets in place, but they tend to be fairly easily
broken under certain circumstances.

Add to the above the possibility of just a kernel NFS bug (or an NFSd
one), and it would really be very interesting to hear:

 - Do the errors seem to happen more at certain clients than others?

   If it's a client-side problem, it really should happen more for
   certain kernel versions or certain hardware.

 - Have you had any other anecdotal evidence of problems with non-git
   usage? Unexplained SIGSEGVs if you have binaries over NFS, for
   example? Strange syntax errors when compiling over NFS?

I'm not discounting a git bug, but quite frankly, it really is worth
checking that your network/NFS setup is solid.

		Linus
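
To make the "very, very weak" point concrete: the TCP/UDP checksum is
just the 16-bit ones'-complement sum of RFC 1071. A minimal sketch of
that algorithm (for illustration only, not code from this thread):

    /* RFC 1071 Internet checksum, the 16-bit ones'-complement sum used
     * by IP/TCP/UDP. Note how limited it is: swapping two aligned
     * 16-bit words, or two compensating bit flips, goes undetected. */
    #include <stddef.h>
    #include <stdint.h>

    uint16_t internet_checksum(const uint8_t *data, size_t len)
    {
    	uint32_t sum = 0;

    	while (len > 1) {
    		sum += (uint32_t)((data[0] << 8) | data[1]);
    		data += 2;
    		len -= 2;
    	}
    	if (len)			/* pad a trailing odd byte */
    		sum += (uint32_t)(data[0] << 8);
    	while (sum >> 16)		/* fold carries back into 16 bits */
    		sum = (sum & 0xffff) + (sum >> 16);
    	return (uint16_t)~sum;
    }

Compare that with git's SHA-1 over every object, which is why git tends
to notice corruption that other tools silently swallow.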
* Re: Consolidate SHA1 object file close

From: Pierre Habouzit @ 2008-06-12 16:00 UTC
To: Linus Torvalds; +Cc: Junio C Hamano, Git Mailing List, Denis Bueno

On Thu, Jun 12, 2008 at 03:33:53PM +0000, Linus Torvalds wrote:
> IOW, there are safety nets in place, but they tend to be fairly easily
> broken under certain circumstances.

Right, though this is a LAN with really few switches between the clients
and the server, and the issue happens with any client in that setup.

> Add to the above the possibility of just a kernel NFS bug (or an NFSd
> one), and it would really be very interesting to hear:
>
>  - Do the errors seem to happen more at certain clients than others?

Not really. It happens more often with the script I inlined in one of my
mails than with the one I attached to it. And people working on _pure_
NFS have the issue a bit less than the ones using a workdir on a separate
local device. That's all I can tell for now.

One of the developers told me that this pattern of use triggers the
problem more for him: he has a 'master' branch checked out in his NFS
home, and a 'local' branch in his local hard drive workdir [0]. The issue
happens more when he's working in the two workdirs at the same time (for
a value of "at the same time" that means within the same minute, not the
same nanosecond; he never commits in both workdirs simultaneously, of
course). When he works only in his NFS 'master' or only in the local
'local' branch, it happens much less often.

> If it's a client-side problem, it really should happen more for certain
> kernel versions or certain hardware.

It doesn't, afaict. Clients are heterogeneous in kernel versions (.18,
.22, .24, .25 from what I've seen) and in hardware (all machines are Dell
computers, but from really different years, hence different mobos and
NICs; some even have non-Dell gigabit NICs in them).

>  - Have you had any other anecdotal evidence of problems with non-git
>    usage? Unexplained SIGSEGVs if you have binaries over NFS, for
>    example? Strange syntax errors when compiling over NFS?

Not really, no. Our NFS server is remarkably stable with everything else
(it's also remarkably slow compared to local drives, but that's not
really relevant ;p).

> I'm not discounting a git bug, but quite frankly, it really is worth
> checking that your network/NFS setup is solid.

Well, to date I'd say it's quite solid. We are a software company and
have even tested our own software (which makes heavy use of mmap, pread,
pwrite and other things that NFS often doesn't deal with very well) on
that very server, without a glitch that could be attributed to NFS (I
mean, we've had tons of bugs, but it was always our software in the end
;p). Though we really rarely pread what we just pwrite-d or things like
that, so maybe we never triggered a possible kernel bug either :)

[0] The reason for that setup is that when we work on topic branches, we
sometimes spot big bugs that are small to fix, and we use the NFS
'master' branch to push those bugfixes as soon as we find them, whereas
'local' is pushed only when the feature is ready.

--
·O·  Pierre Habouzit
··O                                                madcoder@debian.org
OOO                                                http://www.madism.org