* Collective wisdom about repos on NFS accessed by concurrent clients (== corruption!?)
@ 2013-03-28 10:22 Kenneth Ölwing
2013-04-05 12:35 ` Kenneth Ölwing
0 siblings, 1 reply; 7+ messages in thread
From: Kenneth Ölwing @ 2013-03-28 10:22 UTC (permalink / raw)
To: Git List
Hi,
I'm hoping to hear some wisdom on the subject so I can decide if I'm
chasing a pipe dream or if it should be expected to work and I just need
to work out the kinks.
Finding things like this makes it sound possible:
http://permalink.gmane.org/gmane.comp.version-control.git/122670
but then again, in threads like this:
http://kerneltrap.org/mailarchive/git/2010/11/14/44799
opinions are that it's just not reliable enough to trust.
My experience so far is that I eventually get repo corruption when I
stress it with concurrent read/write access from multiple hosts (beyond
any sort of likely levels, but still). Maybe I'm doing something wrong,
missing a configuration setting somewhere, put an unfair stress on the
system, there's a bona fide bug - or, given the inherent difficulty in
achieving perfect coherency between machines on what's visible on the
mount, it's just impossible (?) to truly get it working under all
situations.
My eventual use case is to set up a bunch of (gitolite) hosts that are all
effectively identical and all see the same storage, so that clients can
then be directed to any of them to get served. For the purpose of this
query I assume it can be boiled down to the same thing as n users working
on their own workstations and accessing a central repo on an NFS share
they all have mounted, using regular file paths. Correct?
I have a number of load-generating test scripts (which from humble
beginnings have grown into beasts), but basically they fork a number of
code pieces that proceed to hammer the repo with concurrent reading and
writing. If necessary, the scripts can be started on different hosts,
too. It's all about the central repo, so clients will retry in various
modes whenever they get an error, up to re-cloning and starting over.
All in all, they can yield quite a load.
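To give a feel for it, here is a heavily stripped-down sketch of what each
forked worker essentially does (the central repo path is made up here, and
the real scripts do a lot more logging and retrying):

  #!/usr/bin/perl
  # Stripped-down sketch of one hammer worker - not the real script, just the shape of it.
  use strict;
  use warnings;

  my $central = '/mnt/nfs/repos/central.git';    # illustrative path
  my $work    = "/tmp/hammer-worker.$$";

  sub run { return system(@_) == 0 }             # run a command, true on success

  run('git', 'clone', $central, $work) or die "initial clone failed\n";
  chdir $work or die "chdir $work: $!\n";

  for my $i (1 .. 500) {
      open my $fh, '>>', 'churn.txt' or die "open: $!\n";
      print {$fh} "worker $$ iteration $i\n";
      close $fh;
      run('git', 'add', 'churn.txt');
      next unless run('git', 'commit', '-m', "worker $$ iteration $i");
      # push, retrying with a pull when someone else got there first
      for my $attempt (1 .. 5) {
          last if run('git', 'push', 'origin', 'master');
          run('git', 'pull', '--rebase', 'origin', 'master');
      }
  }

Multiply that by a dozen or so workers per host, on several hosts, and you
get the idea.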
On a local filesystem everything seems to be holding up fine, which is
expected. In the scenario with concurrency on top of shared NFS storage,
however, the scripts eventually fail with various problems (when the
timing finally finds a hole, I guess) - the repo is possible to clone but
checkouts fail, there are errors about refs pointing to the wrong place,
and errors where the original repo is actually corrupted and can't be
cloned from at all. Depending on test strategy, I've also had everything
go fine (all ops look fine in my logs), but afterwards the repo has lost
a branch or two.
I've experimented with various NFS settings (e.g. turning off the
attribute cache), but haven't reached a conclusion. I do suspect that
coherency problems under this kind of high concurrency are just a fact of
life with a remote filesystem, but I'd be happy to be proven wrong
- I'm not an expert in either NFS or git.
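For reference, this is the sort of mount line I've been toying with
(server name, export path and mount point are all made up here):

  mount -t nfs -o hard,vers=3,noac,lookupcache=none nfsserver:/export/gitrepos /mnt/nfs/repos

where noac turns off the attribute cache and lookupcache=none the
directory entry cache - at a hefty performance cost, of course.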
So, any opinions either way would be valuable, e.g. 'it should work' or
'it can never work 100%' is fine, as well as any suggestions. Obviously,
the level of concurrency needed to make it probable to hit this is so
unlikely in practice that maybe I just shouldn't worry...
I'd be happy to share scripts and whatever if someone would like to try
it out themselves.
Thanks for your time,
ken1
* Re: Collective wisdom about repos on NFS accessed by concurrent clients (== corruption!?)
2013-03-28 10:22 Collective wisdom about repos on NFS accessed by concurrent clients (== corruption!?) Kenneth Ölwing
@ 2013-04-05 12:35 ` Kenneth Ölwing
2013-04-05 13:42 ` Thomas Rast
0 siblings, 1 reply; 7+ messages in thread
From: Kenneth Ölwing @ 2013-04-05 12:35 UTC (permalink / raw)
To: Git List
Hi
Basically, I'm at a place where I'm considering giving up getting this
to work reliably. In general, my setup works really fine, except for the
itty-bitty detail that when I put pressure on things I tend to get into
various kinds of trouble with the central repo being corrupted.
Can anyone authoritatively state anything either way?
TIA,
ken1
* Re: Collective wisdom about repos on NFS accessed by concurrent clients (== corruption!?)
2013-04-05 12:35 ` Kenneth Ölwing
@ 2013-04-05 13:42 ` Thomas Rast
2013-04-05 14:45 ` Kenneth Ölwing
0 siblings, 1 reply; 7+ messages in thread
From: Thomas Rast @ 2013-04-05 13:42 UTC (permalink / raw)
To: Kenneth Ölwing; +Cc: Git List
Kenneth Ölwing <kenneth@olwing.se> writes:
> Basically, I'm at a place where I'm considering giving up getting this
> to work reliably. In general, my setup works really fine, except for
> the itty-bitty detail that when I put pressure on things I tend to get
> into various kinds of trouble with the central repo being corrupted.
>
> Can anyone authoritatively state anything either way?
My non-authoritative impression was that it's supposed to work
concurrently. Obviously something breaks:
>> My experience so far is that I eventually get repo corruption when I
>> stress it with concurrent read/write access from multiple hosts
>> (beyond any sort of likely levels, but still). Maybe I'm doing
>> something wrong, missing a configuration setting somewhere, put an
>> unfair stress on the system, there's a bona fide bug - or, given the
>> inherent difficulty in achieving perfect coherency between machines
>> on what's visible on the mount, it's just impossible (?) to truly
>> get it working under all situations.
Can you run the same tests under strace or similar, and gather the
relevant outputs? Otherwise it's probably very hard to say what is
going wrong.
In particular, we've had some reports on Lustre that boiled down to
"impossible" returns from libc functions, not git issues. It's hard to
say without some evidence.
--
Thomas Rast
trast@{inf,student}.ethz.ch
* Re: Collective wisdom about repos on NFS accessed by concurrent clients (== corruption!?)
2013-04-05 13:42 ` Thomas Rast
@ 2013-04-05 14:45 ` Kenneth Ölwing
2013-04-06 8:11 ` Thomas Rast
0 siblings, 1 reply; 7+ messages in thread
From: Kenneth Ölwing @ 2013-04-05 14:45 UTC (permalink / raw)
To: Git List; +Cc: Thomas Rast
On 2013-04-05 15:42, Thomas Rast wrote:
> Can you run the same tests under strace or similar, and gather the
> relevant outputs? Otherwise it's probably very hard to say what is
> going wrong. In particular we've had some reports on lustre that
> boiled down to "impossible" returns from libc functions, not git
> issues. It's hard to say without some evidence.
Thomas, thanks for your reply.
I'm assuming I should strace the git commands as they're issued? I'm
already collecting regular stdout/err output in a log as I go. Are there
any debugging things I can turn on to make the calls issue internal
tracing of some sort?
The main issue I see is that I suspect it will generate so much data
that it'll overflow my disk ;-). Consider that my hammer consists of a
Perl script that forks a number of tasks (e.g. 15), each of which loops
doing clone/commit/push/pull, with retrying on a few levels as errors
occur (usually expected ones due to the concurrency, i.e. someone else
pushed so a pull is necessary first, but occasionally the central repo is
broken enough that it can't be cloned from, or at least master can't be
checked out from it...sometimes with printed errors that still give me a
zero exit code...). That is then also run on several machines against the
same repo to hopefully cause a breakage by sheer pounding...I expect it's
going to generate huge collections of strace output...
I have some variations of this (e.g. all tasks working on different
branches, which improves concurrency in some respects, but the effect
there has been that at the end I was missing a branch or so...). The
likelihood of problems seems to increase when I actually use ssh in my
ultimate setup, where a load balancer round-robins each call to any of
several hosts. In that case I must admit I don't know how to get in on
the action, since I guess I would need to strace the
git-upload/receive-pack processes on the server side...?
Lastly, I don't know how much this will impact timings, load etc. To get
a broken result I have sometimes needed to run for many hours, other
times it has happened fairly quickly.
Well...I will try, it'll probably be a blast :-)
BTW, this is mostly done on CentOS 6.3 and 6.4, with a locally built git 1.8.2.
ken1
* Re: Collective wisdom about repos on NFS accessed by concurrent clients (== corruption!?)
2013-04-05 14:45 ` Kenneth Ölwing
@ 2013-04-06 8:11 ` Thomas Rast
2013-04-06 11:49 ` Jason Pyeron
0 siblings, 1 reply; 7+ messages in thread
From: Thomas Rast @ 2013-04-06 8:11 UTC (permalink / raw)
To: Kenneth Ölwing; +Cc: Git List
Kenneth Ölwing <kenneth@olwing.se> writes:
> On 2013-04-05 15:42, Thomas Rast wrote:
>> Can you run the same tests under strace or similar, and gather the
>> relevant outputs? Otherwise it's probably very hard to say what is
>> going wrong. In particular we've had some reports on lustre that
>> boiled down to "impossible" returns from libc functions, not git
>> issues. It's hard to say without some evidence.
> Thomas, thanks for your reply.
>
> I'm assuming I should strace the git commands as they're issued? I'm
> already collecting regular stdout/err output in a log as I go. Are
> there any debugging things I can turn on to make the calls issue
> internal tracing of some sort?
I don't think there's any internal debugging that helps at this point.
Usually errors pointing to corruption are caused by a chain of syscalls
failing in some way, and the final error shows only the last one, so
strace() output is very interesting.
> The main issue I see is that I suspect it will generate so much data
> that it'll overflow my disk ;-).
Well, assuming you have some automated way of detecting when it fails,
you can just overwrite the same strace output file repeatedly; we're
only interested in the last one (or all the last ones if several gits
fail concurrently).
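For instance, a rough and untested sketch (adapt the path, and $WORKER_ID
is just an assumption for however your workers identify themselves):

  #!/bin/sh
  # run git under strace; each invocation clobbers that worker's previous trace
  exec strace -f -tt -o "/tmp/strace.worker-$WORKER_ID.out" git "$@"

so that when a worker finally gives up, only the trace of its last
(failing) invocation is left around.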
Fiddling with strace will unfortunately change the timings somewhat
(causing a bunch of extra context switches per syscall), but I hope that
you can still get it to reproduce.
--
Thomas Rast
trast@{inf,student}.ethz.ch
* RE: Collective wisdom about repos on NFS accessed by concurrent clients (== corruption!?)
2013-04-06 8:11 ` Thomas Rast
@ 2013-04-06 11:49 ` Jason Pyeron
2013-04-07 18:56 ` Kenneth Ölwing
0 siblings, 1 reply; 7+ messages in thread
From: Jason Pyeron @ 2013-04-06 11:49 UTC (permalink / raw)
To: 'Thomas Rast', 'Kenneth Ölwing'; +Cc: 'Git List'
> -----Original Message-----
> From: Thomas Rast
> Sent: Saturday, April 06, 2013 4:12
>
> Kenneth Ölwing <kenneth@olwing.se> writes:
>
> > On 2013-04-05 15:42, Thomas Rast wrote:
> >> Can you run the same tests under strace or similar, and gather the
> >> relevant outputs? Otherwise it's probably very hard to say what is
> >> going wrong. In particular we've had some reports on lustre that
> >> boiled down to "impossible" returns from libc functions, not git
> >> issues. It's hard to say without some evidence.
> > Thomas, thanks for your reply.
> >
> > I'm assuming I should strace the git commands as they're issued? I'm
> > already collecting regular stdout/err output in a log as I go. Are
> > there any debugging things I can turn on to make the calls issue
> > internal tracing of some sort?
>
> I don't think there's any internal debugging that helps at this point.
> Usually errors pointing to corruption are caused by a chain
> of syscalls failing in some way, and the final error shows
> only the last one, so
> strace() output is very interesting.
>
> > The main issue I see is that I suspect it will generate so
> > much data
> > that it'll overflow my disk ;-).
>
> Well, assuming you have some automated way of detecting when
> it fails, you can just overwrite the same strace output file
> repeatedly; we're only interested in the last one (or all the
> last ones if several gits fail concurrently).
We use tmpwatch for this type of issue, especially with Oracle traces. Set up a
directory and tell tmpwatch to delete files older than X. This will keep the
files at bay, and when you detect a problem, stop the cleanup script.
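For example, a crontab entry along these lines (the directory name is just
an example):

  */15 * * * * tmpwatch --mtime 2 /var/tmp/straces

which keeps removing trace files untouched for more than 2 hours, until
you stop it once a failure has been detected.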
>
> Fiddling with strace will unfortunately change the timings
> somewhat (causing a bunch of extra context switches per
> syscall), but I hope that you can still get it to reproduce.
--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
- -
- Jason Pyeron PD Inc. http://www.pdinc.us -
- Principal Consultant 10 West 24th Street #100 -
- +1 (443) 269-1555 x333 Baltimore, Maryland 21218 -
- -
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-
This message is copyright PD Inc, subject to license 20080407P00.
* Re: Collective wisdom about repos on NFS accessed by concurrent clients (== corruption!?)
2013-04-06 11:49 ` Jason Pyeron
@ 2013-04-07 18:56 ` Kenneth Ölwing
0 siblings, 0 replies; 7+ messages in thread
From: Kenneth Ölwing @ 2013-04-07 18:56 UTC (permalink / raw)
To: Jason Pyeron; +Cc: 'Thomas Rast', 'Git List'
Thanks for the suggestions,
>> I don't think there's any internal debugging that helps at this
>> point. Usually errors pointing to corruption are caused by a chain of
>> syscalls failing in some way, and the final error shows only the last
>> one, so strace() output is very interesting.
Right - a problem could be, in my understanding, that it will be quite
hard to figure out which of the traces are actually interesting. First,
just because of the intense concurrency, there will be a lot of false
errors along the way, and as far as I can tell many of those are
effectively indistinguishable from a real error; i.e. a clone can report
wording to the effect of 'possibly the remote repo is broken' when it's
just in transition by another process. So a lot of retries will
eventually succeed - except when the repo actually is broken, in which
case the retries are simply repeated until they're exhausted. I do keep
all such logs anyway, and adding strace to the output should be fine -
it's just a lot to go through. Which is the second thing - I noticed that
I can get strace to put timestamps in its output, which will likely be
necessary to try to find where two or more processes interfere.
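Something like this per git invocation is roughly what I have in mind
(the output directory is made up):

  strace -f -tt -T -o /var/tmp/straces/git.$$.trace git push origin master

with -tt giving microsecond timestamps and -T the time spent in each
syscall.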
Oh, BTW - I'm also uncertain whether it is the actual regular ops (e.g.
push) or perhaps the auto-gc's that sometimes kick in that cause the
problems. While I can set gc.auto=0 to take those out of the equation,
it's obviously not a solution in the long run. Hm, maybe I should go the
other way, doing gc --aggressive very often while doing pushes, and see
if that more quickly provokes an error.
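Something along these lines, say, running against the central repo (path
made up) in parallel with the push workers:

  while true; do git --git-dir=/mnt/nfs/repos/central.git gc --aggressive --prune=now; done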
Even Linus in my first link suggests avoiding concurrent 'gc --prune'
(I know, not the same as aggressive), which is understandable, but since,
again as I understand it, git will occasionally decide to do it on its
own, frankly I would expect this to work. Not optimally from any
viewpoint of course, but still - I simply shouldn't be able to break a
repo as long as I use regular git commands. Or is that an unreasonable
expectation? Given that I'm probably way beyond any reasonable normal
use, I guess it could be considered chasing ghosts...but then again, if
there's even a tiny hole, it would be nice to close it.
Well, I'll just have to battle on with it. Are there any docs anywhere
that describe the locking behavior git uses to handle concurrency, and/or
some (preferably single) points in the source code that I could look at?
>>> The main issue I see is that I suspect it will generate so
>>> much data
>>> that it'll overflow my disk ;-).
>> Well, assuming you have some automated way of detecting when
>> it fails, you can just overwrite the same strace output file
>> repeatedly; we're only interested in the last one (or all the
>> last ones if several gits fail concurrently).
> We use tmpwatch for this type of issue, especially with oracle traces. Set up a
> directory and tell tmpwatch to delete files older than X. This will keep the
> files at bay and when you detect a problem stop the clean up script.
>
Thanks - as described above I do keep track of all tasks that eventually
run out of steam and die from exhaustion (logs and the final copy of
their clone), and if I can get the strace output in there, things should
be fine. It'll still be a lot of data since, as described, I haven't yet
figured out how to accurately detect the point where the real error
actually occurs. I'll look into whether I can have some checkpoints in my
tasks where they all calm down so I can test the repo for correctness,
limiting the data and more quickly discovering the fail state.
ken1
end of thread
Thread overview: 7+ messages
2013-03-28 10:22 Collective wisdom about repos on NFS accessed by concurrent clients (== corruption!?) Kenneth Ölwing
2013-04-05 12:35 ` Kenneth Ölwing
2013-04-05 13:42 ` Thomas Rast
2013-04-05 14:45 ` Kenneth Ölwing
2013-04-06 8:11 ` Thomas Rast
2013-04-06 11:49 ` Jason Pyeron
2013-04-07 18:56 ` Kenneth Ölwing