* clone hang prevention / timeout?
From: Jason Vas Dias @ 2016-04-11 21:49 UTC
To: git

It appears GIT has no way of specifying a timeout for a clone operation -
if the server decides not to complete a get request, the clone can
hang forever -
is this correct ?

This appears to be what I am seeing, in a script that is attempting to do many
successive clone operations, eg. of
git://anongit.freedesktop.org/xorg/* , the script
occasionally hangs in a clone - I can see with netstat + strace that the TCP
connection is open and GIT is trying to read .

Is there any option I can specify to get the clone to timeout, or do I manually
have to strace the git process and send it a signal after a hang is detected?

* Re: clone hang prevention / timeout?
From: Eric Wong @ 2016-04-12 8:01 UTC
To: Jason Vas Dias; +Cc: git

Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
> It appears GIT has no way of specifying a timeout for a clone operation -
> if the server decides not to complete a get request, the clone can
> hang forever -
> is this correct ?

git uses SO_KEEPALIVE for all connections it makes, so the effective
timeout is whatever your kernel TCP keepalive knobs are set to. By
default, it's very long (around 2 hours), but you can change it using
the tcp_keepalive_* knobs in /proc/sys/net/ipv4/ under Linux.

I suppose we can do shorter timeouts (at least under Linux) via
setsockopt(.. TCP_KEEP*) knobs, or we can call poll() ourselves
to time out connections.

However, git packing operations on the server can take a long time,
so it might be bad to time out manually unless we know the connection
is really dead.

> This appears to be what I am seeing, in a script that is attempting to do many
> successive clone operations, eg. of
> git://anongit.freedesktop.org/xorg/* , the script
> occasionally hangs in a clone - I can see with netstat + strace that the TCP
> connection is open and GIT is trying to read .
> Is there any option I can specify to get the clone to timeout, or do I manually
> have to strace the git process and send it a signal after a hang is detected?

I added git:// support for SO_KEEPALIVE in commit e47a8583a202
("enable SO_KEEPALIVE for connected TCP sockets") back in 2011 (v1.7.10),
and http:// support later in 2013 (v1.8.5) with commit a15d069a1986
("http: enable keepalive on TCP sockets").

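To put a number on the "around 2 hours" figure: on Linux, an unresponsive peer is declared dead after roughly tcp_keepalive_time + tcp_keepalive_intvl * tcp_keepalive_probes seconds of idleness. The sketch below assumes a Linux /proc layout; the helper name keepalive_detect_seconds is made up for illustration and is not part of git.

```shell
#!/bin/sh
# Hypothetical helper (not part of git): worst-case seconds before the
# kernel gives up on an idle TCP peer that no longer answers probes.
# Arguments: idle time, probe interval, probe count.
keepalive_detect_seconds() {
    echo $(( $1 + $2 * $3 ))
}

# Report the current setting if this system exposes the Linux knobs.
knobs=/proc/sys/net/ipv4
if [ -r "$knobs/tcp_keepalive_time" ]; then
    t=$(cat "$knobs/tcp_keepalive_time")
    i=$(cat "$knobs/tcp_keepalive_intvl")
    p=$(cat "$knobs/tcp_keepalive_probes")
    echo "dead peer detected after ~$(keepalive_detect_seconds "$t" "$i" "$p")s"
fi
```

With the usual Linux defaults (7200, 75, 9) this works out to 7875 seconds, a little over two hours. The knobs are system-wide, so lowering them (as root, e.g. via sysctl) affects every socket that has SO_KEEPALIVE enabled. Keepalive probes also only catch a peer that has gone away; they would not be expected to help if the server is reachable but simply stops sending data.
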
* Re: clone hang prevention / timeout?
From: Jeff King @ 2016-04-13 22:24 UTC
To: Jason Vas Dias; +Cc: git

On Mon, Apr 11, 2016 at 10:49:19PM +0100, Jason Vas Dias wrote:

> It appears GIT has no way of specifying a timeout for a clone operation -
> if the server decides not to complete a get request, the clone can
> hang forever -
> is this correct ?

Yes. Git's protocol has no timeouts, though each side is generally
either writing or reading at any moment, and so an interrupted
connection should cause either EPIPE or EOF, ending the process.

The exceptions I have seen are:

  - protocol / implementation bugs that cause a true deadlock. At this
    point we've fixed all known cases, but that doesn't mean there
    aren't bugs lurking.

  - the network drops out in such a way that the OS doesn't realize the
    connection is gone, and the reading side is left waiting for input
    forever

I think the TCP keepalive stuff that Eric mentioned should address the
latter, though I don't know how well it works in practice. We used to
sometimes see processes hung for days on GitHub, but it's been a long
time. I don't recall if it was pre-v1.8.5 (which introduced
SO_KEEPALIVE), or if we made some other change (we have a load-balancing
layer in front that has more aggressive timeouts).

> This appears to be what I am seeing, in a script that is attempting to do many
> successive clone operations, eg. of
> git://anongit.freedesktop.org/xorg/* , the script
> occasionally hangs in a clone - I can see with netstat + strace that the TCP
> connection is open and GIT is trying to read .
> Is there any option I can specify to get the clone to timeout, or do I manually
> have to strace the git process and send it a signal after a hang is detected?

There are periods where a git client may have to wait for a while in
read() while the other side is quiet (e.g., when the other side is badly
packed and needs to do a lot of up-front CPU work to prepare the
packfile). Since v1.8.4.2, the server side of a clone should generate
application-level keepalive packets, so that the client never sees
silence for more than ~5 seconds.

The freedesktop servers appear to be on v2.1.4, so a long read() like
the one you're seeing probably is a real hang.

Note that pushing has a similar problem (the client may wait a long time
while the server chews on the uploaded packfile before reporting
status). There are no keepalives in that direction, though I have a
series there that I need to polish and submit.

-Peff

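The ~5 second server-side keepalive mentioned above appears to correspond to the uploadpack.keepAlive setting described in the git-config documentation (value in seconds, 0 disables the packets). A minimal sketch of inspecting and setting it follows; the scratch bare repository is an assumption that only keeps the example self-contained, and on a real server the setting would go in the served repository or in system-wide config.

```shell
#!/bin/sh
# Sketch only: a throwaway bare repository stands in for a served one.
repo=$(mktemp -d)
git init --quiet --bare "$repo"

# When the key is unset, git falls back to its built-in default.
current=$(git -C "$repo" config --get uploadpack.keepAlive || echo "unset")
echo "before: $current"

# Ask upload-pack to send a keepalive packet every 5 seconds of silence.
git -C "$repo" config uploadpack.keepAlive 5
echo "after: $(git -C "$repo" config --get uploadpack.keepAlive)"
```

This is a server-side knob, so it does nothing for a client talking to a server it does not control; it only matters to whoever administers the serving side.
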
* Re: clone hang prevention / timeout?
From: Jeff King @ 2016-04-13 22:29 UTC
To: Jason Vas Dias; +Cc: git

On Mon, Apr 11, 2016 at 10:49:19PM +0100, Jason Vas Dias wrote:

> Is there any option I can specify to get the clone to timeout, or do I manually
> have to strace the git process and send it a signal after a hang is detected?

Oh, one other thing you might consider is something like "timeout" from
GNU coreutils, which puts a hard cap on the length of time a process can
run.

It's totally unaware of the state of the process, though, so if you
really do have a clone which takes an hour, it might very well kill it
at 99% complete. It has no mechanism for "gee, this process looks like
it hasn't done anything for 5 minutes".

I don't know offhand of a general tool for that.

-Peff

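A minimal sketch of the hard-cap approach, assuming GNU coreutils "timeout" is installed. The wrapper name run_capped is hypothetical; the only behaviour relied on is that timeout exits with status 124 when it had to terminate the command.

```shell
#!/bin/sh
# Hypothetical wrapper: run_capped SECONDS COMMAND [ARGS...]
# Runs COMMAND under a hard wall-clock cap and reports when the cap hit.
run_capped() {
    cap=$1
    shift
    timeout "$cap" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        # 124 is what GNU timeout returns when the time limit was reached.
        echo "command exceeded ${cap}s and was terminated" >&2
    fi
    return "$status"
}
```

A clone loop could then call something like run_capped 3600 git clone "$url" (with $url standing in for whichever repository is being cloned) and queue the repository for a retry whenever the status is 124. As noted above, the cap is blind to progress, so it needs to be generous enough for the slowest legitimate clone.
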
* Re: clone hang prevention / timeout?
From: Jason Vas Dias @ 2016-04-14 18:32 UTC
To: Jeff King, Eric Wong; +Cc: git

Thanks very much Eric & Jeff for your reply .

Personally, I would recommend setting the SO_RCVTIMEO for GIT server
sockets to a fixed default (eg. 5mins) , settable by a
'--receive-timeout' argument or configuration parameter .

The problem I was trying to overcome was cloning all the repositories under
https://anongit.freedesktop.org/xorg/* .

About 4 git clones would succeed in succession, but then typically the 5th
would hang in read() forever - I left one such hung 'git clone' for nearly an
hour and it had not progressed or timed out . I tried inserting a delay of
up to 30 seconds between clones, but this did not help.

Maybe freedesktop.org's GIT server is too overloaded and they have
to resort to disabling 1 out of 5 GIT successive clone operations from
same connection or something.

Here is my solution, in case anyone else needs it :

<quote><pre>
eips=()     # last sampled instruction pointer, indexed by child PID
counts=()   # consecutive samples with an unchanged pointer, by child PID
declare -i failed=0

# Start the clone in the background. $! is the PID of git itself, which
# avoids racing on a stale /tmp/git.pid left behind by an earlier clone.
GIT_TRACE=2 git clone "${proto}://${user}anongit.freedesktop.org/${repo}$name" &
git_pid=$!

while [ -d "/proc/$git_pid" ]; do
    # One "PID EIP" line per direct child of the main git process.
    mapfile -t kids < <(ps --ppid "$git_pid" -o 'pid=,eip=')
    for entry in "${kids[@]}"; do
        read -r kid eip <<<"$entry"
        if [[ ! -v "eips[$kid]" ]]; then
            eips[$kid]=$eip          # first time this child is seen
        elif [ "${eips[$kid]}" = "$eip" ]; then
            # Pointer unchanged since the last sample: count it.
            counts[$kid]=$(( ${counts[$kid]:-0} + 1 ))
            if (( counts[$kid] >= 30 )); then
                echo "child process $kid of git main process $git_pid appears to be stuck - killing it."
                kill -TERM "$kid"
                failed=1
            fi
        else
            eips[$kid]=$eip          # it moved: reset the counter
            counts[$kid]=''
        fi
    done
    sleep 1
done
wait
</pre></quote>

This is part of a script that reads a list of the Xorg projects,
sets $repo to top level subdirectory, and $name to the project name,
and initiates the GIT clone . It deems any GIT _CHILD_ process
(eg. git-index-pack) that has not changed its instruction pointer
register (EIP) for 30 seconds to be "hung" . There is logic at the end
to retry all the failed clones.

It does work, but is far from pretty .
It sure would be nice if GIT had a timeout mechanism !

Thanks & Regards,
Jason

On 13/04/2016, Jeff King <peff@peff.net> wrote:
> On Mon, Apr 11, 2016 at 10:49:19PM +0100, Jason Vas Dias wrote:
>
>> Is there any option I can specify to get the clone to timeout, or do I
>> manually
>> have to strace the git process and send it a signal after a hang is
>> detected?
>
> Oh, one other thing you might consider is something like "timeout" from
> GNU coreutils, which puts a hard cap on the length of time a process can
> run.
>
> It's totally unaware of the state of the process, though, so if you
> really do have a clone which takes an hour, it might very well kill it
> at 99% complete. It has no mechanism for "gee, this process looks like
> it hasn't done anything for 5 minutes".
>
> I don't know offhand of a general tool for that.
>
> -Peff
>

* Re: clone hang prevention / timeout?
From: Eric Wong @ 2016-04-30 9:04 UTC
To: Jason Vas Dias; +Cc: Jeff King, git

Jason Vas Dias <jason.vas.dias@gmail.com> wrote:
> Thanks very much Eric & Jeff for your reply .
>
> Personally, I would recommend setting the SO_RCVTIMEO for GIT server
> sockets to a fixed default (eg. 5mins) , settable by a
> '--receive-timeout' argument or configuration parameter .

(apologies for the delay, I thought I replied earlier :x)

SO_RCVTIMEO only triggers EAGAIN, and AFAIK the git read/write wrappers
transparently retry on EAGAIN, so it's not as simple as doing a single
setsockopt.

> The problem I was trying to overcome was cloning all the repositories under
> https://anongit.freedesktop.org/xorg/* .
>
> About 4 git clones would succeed in succession, but then typically the 5th
> would hang in read() forever - I left one such hung 'git clone' for nearly an
> hour and it had not progressed or timed out . I tried inserting a delay of
> up to 30 seconds between clones, but this did not help.

Are you in contact with any of the admins of that server to help?
Is the problematic repo any larger or in any way stranger than
the others?

> Maybe freedesktop.org's GIT server is too overloaded and they have
> to resort to disabling 1 out of 5 GIT successive clone operations from
> same connection or something.

Anyways I've been thinking about overloaded git servers, lately.
Pack generation on big repos is painful, and having lots of slow clients
can tie up server memory.

So maybe an HTTP server which can switch between dumb and smart
operation depending on load could be useful for the
resource-constrained.

> Here is my solution, in case anyone else needs it :

It'd be nice to get an strace to know where in the clone process
it hangs to help the admin figure out how far things got.

And please don't top-post, it's a waste of resources.
