* Timeouts and congestion windows
@ 2003-09-19  8:33 Olaf Kirch
From: Olaf Kirch @ 2003-09-19  8:33 UTC
  To: nfs

Hi,

I recently had to debug a problem where NFS installs of a new box
would stall after copying a couple of RPMs, with lots of "NFS
server not responding/server OK" messages in the syslog.

The kernel I was looking at was 2.4.21 plus the NFS patches from 2.4.22.
The NFS server is a reasonably sized machine, and serves the entire R&D
department as install server.

It turned out the problem was in the RTT estimating code. As long as
everything was fine, the READs were being served promptly, with round-trip
times of 1 ms or less. So the RTT estimator for READ would return a timeout
value of .101 seconds (the client kernel used HZ=1000).

Then a spike in the server and/or network load would send the round-trip
times to much higher values (.5 seconds and more) and I would see
things like

	transmit request
	retransmit request
	retransmit request
	retransmit request
	retransmit request
	retransmit request
	receive response
	receive response
	receive response
	receive response
	receive response

In several cases, the response was even ignored because it was received
in between the xprt_timer timing out the request and rpciod waking
up and retransmitting it (could this be fixed by making xprt_complete_rqst
clear task->tk_status? I didn't really look into this part of the problem).

Obviously, this increases the network load even more. But the really bad
thing is that because the request was retransmitted, the RTT estimate
wasn't updated, so it kept predicting these .101 sec timeouts, and it
didn't get out of that trap until the server load went down.

I fixed this rather crudely by always updating the RTT estimate with
nretrans * rtt. A better approach would be to use the time delta between
the first transmit and the receipt of a response.

Then I noticed a second problem, which was that the congestion window
would oscillate quite wildly. It would go up to say 2000 or 3000 within
10-20 seconds, and then collapse down to 256 again. The problem
here, I believe, is two-fold. One is that if you have a spike in network
load, and lose say 5 packets, you get 5 timeouts and the cwnd is shrunk
by a factor of 2^5. The second is that the window is going up too quickly
(but that's more of a gut feeling).

Printing the cwnd would give sort of a saw-tooth curve, and, what's
worse, doing that on two clients would show these saw-tooths were almost
synchronized.

I addressed that by making sure that once you update the cwnd, you're
not allowed to touch it for a certain time (0.5 sec when increasing the
cwnd, 1 sec when decreasing it). This smoothed out the curve quite a bit.
It also reacts fairly well to spikes in the network load - rather than
dropping all the way down to 256, it now goes back to half the value,
and stays there.

A third modification I made was to not print "server not responding"
messages for async RPC tasks :)

Olaf
-- 
Olaf Kirch     |  Anyone who has had to work with X.509 has probably
okir@suse.de   |  experienced what can best be described as
---------------+  ISO water torture. -- Peter Gutmann

