From: "Kenneth Ölwing" <kenneth@olwing.se>
To: Jason Pyeron <jpyeron@pdinc.us>
Cc: 'Thomas Rast' <trast@inf.ethz.ch>, 'Git List' <git@vger.kernel.org>
Subject: Re: Collective wisdom about repos on NFS accessed by concurrent clients (== corruption!?)
Date: Sun, 07 Apr 2013 20:56:36 +0200
Message-ID: <5161C164.7020502@olwing.se>
In-Reply-To: <CB4C1FB3EB914D079EE0534228DE372D@black>
Thanks for the suggestions,
>> I don't think there's any internal debugging that helps at this
>> point. Usually errors pointing to corruption are caused by a chain of
>> syscalls failing in some way, and the final error shows only the last
>> one, so strace() output is very interesting.
Right - a problem, as I understand it, is that it will be quite hard to
figure out which of the traces are actually interesting. First, simply
because of the intense concurrency there will be a lot of false errors
along the way, and as far as I can tell many of them are effectively
indistinguishable from a real error; e.g. a clone can report wording to
the effect of 'possibly the remote repo is broken' when the repo is
merely in transition because another process is working on it. So most
of the time the retries eventually succeed - except when the repo really
is broken, in which case the retries just run until they are exhausted.
I do keep all such logs anyway, and adding the strace output to them
should be fine; it's just a lot to go through.
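To be concrete, what I mean by 'retried until exhausted' is roughly the
following (a simplified sketch of the kind of loop my tasks run; the
repo path, retry count and file names are just placeholders):

    # Retry the clone until it succeeds or we give up; the output of
    # every failed attempt is kept so it can be inspected afterwards.
    for attempt in $(seq 1 20); do
        if git clone /nfs/shared/repo.git "work.$attempt" \
               > "clone.$attempt.log" 2>&1; then
            break
        fi
        sleep 1    # brief backoff before the next attempt
    done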
That brings me to the second thing - I noticed that I can get strace to
put timestamps in its output, which will likely be necessary to find
where two or more processes interfere.
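For what it's worth, I was thinking of wrapping the git invocations
along these lines (the output file name is just an example):

    # -f follows forked child processes, -tt adds wall-clock timestamps
    # with microsecond resolution, -o writes the trace to a file.
    strace -f -tt -o "trace.$$.log" git push /nfs/shared/repo.git master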
Oh, BTW - I'm also uncertain whether it is the regular operations
themselves (e.g. push) or the auto-gc runs that sometimes kick in that
cause the problems. I can set gc.auto=0 to take auto-gc out of the
equation, but that's obviously not a solution in the long run. Hm, maybe
I should go the other way instead and run gc --aggressive very
frequently while pushes are in flight, to see whether that provokes an
error more quickly (a sketch of what I mean follows below). Even Linus,
in my first link, suggests avoiding concurrent 'gc --prune' (I know, not
the same as aggressive), which is understandable; but since, again as I
understand it, git will occasionally decide to run gc on its own,
frankly I would expect this to work. Not optimally from any viewpoint,
of course, but still: I simply shouldn't be able to break a repo as long
as I only use regular git commands. Or is that an unreasonable
expectation? Given that I'm probably way beyond any reasonable normal
use, maybe this counts as chasing ghosts... but then again, if there's
even a tiny hole, it would be nice to close it.
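The 'other way' would look roughly like this (just a sketch; the repo
path and the interval are placeholders):

    # Run inside the shared repository on NFS while other clients are
    # cloning and pushing against it.
    cd /nfs/shared/repo.git || exit 1

    # Repeatedly repack and prune as aggressively as possible, hoping
    # to provoke the failure faster than waiting for an occasional
    # auto-gc to trigger.
    while true; do
        git gc --aggressive --prune=now
        sleep 5
    done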
Well, I'll just have to battle on with it. Is there any documentation
anywhere that describes the locking behavior git uses to cope with
concurrency, and/or a (preferably small) set of places in the source
code that I could look at?
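My current - quite possibly wrong - understanding is that git protects
individual files with a 'create <file>.lock exclusively, write the new
content, then rename it over the original' pattern. In shell terms that
would be something like the following (the ref name and the variable
are made up for illustration):

    # Create the lock file with exclusive-create semantics via the
    # shell's noclobber option; if it already exists, someone else
    # holds the lock. $new_sha1 is whatever the ref should now be.
    lock=refs/heads/master.lock
    if ( set -o noclobber; echo "$new_sha1" > "$lock" ) 2>/dev/null; then
        mv "$lock" refs/heads/master   # atomically publish the update
    else
        echo "lock is held by another process" >&2
    fi

I'd like to verify that against the real code, though, not least
because I don't know how well the exclusive-create step holds up on NFS.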
>>> The main issue I see is that I suspect it will generate so
>>> much data that it'll overflow my disk ;-).
>> Well, assuming you have some automated way of detecting when
>> it fails, you can just overwrite the same strace output file
>> repeatedly; we're only interested in the last one (or all the
>> last ones if several gits fail concurrently).
> We use tmpwatch for this type of issue, especially with oracle traces. Set up a
> directory and tell tmpwatch to delete files older than X. This will keep the
> files at bay and when you detect a problem stop the clean up script.
>
Thanks - as described above, I do keep everything from the tasks that
eventually run out of steam and die from retry exhaustion (their logs
and the final copy of their clone), and if I can get the strace output
in there too, things should be fine. It will still be a lot of data,
since, as described, I haven't yet figured out how to accurately detect
the point where the real error actually occurs. I'll look into adding
some checkpoints to my tasks where they all calm down, so that I can
test the repo for correctness there; that should both limit the amount
of data and let me discover the failed state more quickly.
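At such a checkpoint I imagine doing something along these lines,
combined with the tmpwatch idea to keep the trace directory from
growing without bound (paths and the age threshold are just examples):

    # While all workers are paused, verify the shared repository; stop
    # pruning traces (and stop the run) as soon as corruption shows up.
    if ! git --git-dir=/nfs/shared/repo.git fsck --full; then
        echo "repository is corrupt - keeping all traces" >&2
        exit 1
    fi

    # Otherwise prune strace output older than 6 hours, as suggested.
    tmpwatch --mtime 6 /nfs/traces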
ken1