Here at LLNL we have a rather challenging network environment on our clusters. We basically have thousands of gigE links attached to an oversubscribed federated network. Most of the time this network is idle, but the expected workload is regular spikes of extremely heavy activity lasting a few minutes. All end-points, in a highly coordinated manner, typically after exiting an MPI barrier, start pushing as much data as possible through the oversubscribed core. The result is a wave of TCP back-offs where all the TCP streams back off in lock step. The network oscillates between brief moments of heavy congestion and being largely idle.

Given enough time TCP will settle down into something mostly reasonable, but even then it causes us a few problems:

1) It takes a long time for the network to settle into a steady state, and while it does, network utilization is very poor.

2) Many of the sockets will rapidly back off to the maximum value. This can lead to application-level timeouts being triggered, because we initially calibrated the timeouts with the notion that 2 minute back-offs would be the exception and not the norm.

3) Once we reach steady state there's no guarantee of fairness between TCP streams. For our workload this is particularly undesirable, since the parallel job which kicked off all this activity must wait until the slowest transaction completes. This translates into thousands of nodes sitting idle.

Because we know this is the expected workload on this dedicated network, we can safely make the back-offs more aggressive to mitigate most of these issues. We also played around with using a random seed when selecting the back-off interval to avoid all the sockets backing off in lock step. That worked reasonably well but was more invasive than simply adding a few more tunables.

Because these are general utility clusters we run many different programs, so fixing this problem in the applications is not practical: there are literally hundreds if not thousands of them.
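To illustrate the two ideas above (a lower ceiling on the back-off, and a randomized interval so that sockets do not retry in lock step), here is a minimal user-space sketch. It is not the patch itself and does not use kernel identifiers: the function names, the xorshift generator, and the 200 ms base value are assumptions made for illustration. Only the 2 minute ceiling comes from the description above.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative sketch only. All names here are hypothetical and are
 * not the identifiers used by the kernel or by the patch.
 */

/* Plain exponential back-off: base_ms doubled `attempt` times,
 * clamped to max_ms. Every socket with the same inputs gets the
 * same value, which is what produces the lock-step behavior. */
uint32_t backoff_ms(uint32_t base_ms, uint32_t max_ms, unsigned attempt)
{
    uint64_t t = base_ms;

    assert(base_ms > 0 && max_ms >= base_ms);
    while (attempt-- > 0 && t < max_ms)
        t *= 2;
    return (uint32_t)(t < max_ms ? t : max_ms);
}

/* Tiny xorshift32 generator so the sketch has no dependencies.
 * The state should be seeded with a nonzero per-socket value. */
uint32_t xorshift32(uint32_t *state)
{
    uint32_t x = *state;

    x ^= x << 13;
    x ^= x >> 17;
    x ^= x << 5;
    *state = x;
    return x;
}

/* Randomized back-off: choose a value in [t/2, t], where t is the
 * deterministic back-off, so that sockets which started together
 * spread their retries out instead of colliding again. */
uint32_t backoff_jitter_ms(uint32_t base_ms, uint32_t max_ms,
                           unsigned attempt, uint32_t *seed)
{
    uint32_t t = backoff_ms(base_ms, max_ms, attempt);
    uint32_t half = t / 2;

    return half + xorshift32(seed) % (t - half + 1);
}
```

With an assumed 200 ms base and a 120000 ms ceiling, backoff_ms() reaches the ceiling after ten doublings. Passing a smaller max_ms models the tunable approach, while backoff_jitter_ms() models the random-seed experiment.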
We're more than willing to consider other approaches to handling this particular workload better. We've even considered that TCP isn't the right protocol at all, but this affects several protocols including NFS, and the benefits of running NFS over TCP are too great.

The original patch was prepared by Brian Behlendorf. He asked me to adapt it for current kernels, keep it up to date, and send it upstream.

This may also help people like Andrew Athan, who reported a similar problem a couple of days ago on the linux-net mailing list:
http://www.uwsg.iu.edu/hypermail/linux/net/0609.3/0005.html

I suspect that this case is more common than is widely recognized.

Signed-off-by: Ben Woodard
Signed-off-by: Brian Behlendorf