Here at LLNL we have a rather challenging network environment on our 
clusters. We basically have 1000's of gigE links attached to an 
oversubscribed federated network. Most of the time this network is idle 
but the expected workload is regular spikes of extremely heavy activity 
lasting a few minutes. All end-points, in a highly coordinated manner, 
typically after exiting an MPI barrier, start pushing as much data as 
possible through the oversubscribed core.  The result is a wave of TCP 
back-offs where all the TCP streams back-off in lock step. The network 
oscillates from highly congested for brief moments to largely idle. 
Given enough time TCP will settle down into something mostly reasonable 
but even then it causes us a few problems:

1) It takes a long time for the network to settle into a steady state, 
and until it does network utilization is very poor.

2) Many of the sockets will rapidly back off to the maximum value.  This 
can lead to application-level timeouts being triggered, because we 
initially calibrated those timeouts with the notion that 2 minute 
back-offs would be the exception and not the norm.

3) Once we reach steady state there's no guarantee of fairness between 
TCP streams.  For our workload this is particularly undesirable since 
the parallel job which kicked off all this activity must wait until the 
slowest transaction completes.  This translates into 1000's of nodes 
sitting idle.

Because we know this is the expected workload on this dedicated network,
we can safely make the back-offs more aggressive to mitigate most of
these issues.
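
To illustrate the effect of lowering the back-off ceiling, here is a toy
model of exponential back-off, not kernel code. The function and parameter
names are hypothetical, the 0.2 second starting timeout is an assumption
chosen for illustration, and the 120 second ceiling corresponds to the
2 minute back-offs mentioned above.

```python
def backoff_schedule(initial_rto, max_rto, retries):
    """Return the timeout (seconds) used before each successive retry.

    The timeout doubles after every retry and is clamped to max_rto.
    All names and values here are illustrative assumptions.
    """
    schedule = []
    rto = initial_rto
    for _ in range(retries):
        schedule.append(min(rto, max_rto))
        rto *= 2
    return schedule


# Ceiling of 120 s versus a much lower, hypothetical ceiling of 2 s.
default = backoff_schedule(0.2, 120.0, 12)
tuned = backoff_schedule(0.2, 2.0, 12)

print("default:", default)
print("tuned:  ", tuned)
print("total wait, default: %.1f s" % sum(default))
print("total wait, tuned:   %.1f s" % sum(tuned))
```

With the lower ceiling a stalled stream retries every couple of seconds
instead of sitting idle for minutes, which is the behavior we want on a
dedicated network where congestion is known to be brief.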

We also played around with using a random seed when selecting the 
back-off interval to avoid all the sockets backing-off in lock step. 
That worked reasonably well but was more invasive than simply adding a 
few more tunables.
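
The idea behind the randomized back-off can be sketched as follows. This
is only a simulation under assumed values: scaling each interval by a
uniform random factor is one possible way to randomize it and is not
necessarily what our experimental patch did, and all names are
hypothetical.

```python
import random


def retry_times(initial_rto, max_rto, retries, rng=None, spread=0.5):
    """Return the cumulative times (seconds) at which one sender retries.

    Without rng every sender produces the same times (lock step). With
    rng each interval is scaled by a random factor drawn uniformly from
    [1 - spread, 1 + spread]. Illustrative assumption only.
    """
    now = 0.0
    rto = initial_rto
    times = []
    for _ in range(retries):
        interval = min(rto, max_rto)
        if rng is not None:
            interval *= rng.uniform(1.0 - spread, 1.0 + spread)
        now += interval
        times.append(now)
        rto *= 2
    return times


def distinct_third_retries(senders, rng=None):
    """Count distinct instants at which the senders make their 3rd retry."""
    instants = set()
    for _ in range(senders):
        instants.add(round(retry_times(0.2, 120.0, 3, rng)[-1], 6))
    return len(instants)


lockstep = distinct_third_retries(1000)
jittered = distinct_third_retries(1000, random.Random(1))

print("distinct retry instants without randomization:", lockstep)
print("distinct retry instants with randomization:   ", jittered)
```

Without randomization all 1000 simulated senders retry at the same
instant, which is the lock step behavior described above. With it the
retries are spread over an interval instead of arriving as one burst.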

Because these are general utility clusters we run many different 
programs, so trying to fix this problem in the applications is not 
possible since there are literally hundreds if not thousands of them.

We're more than willing to consider other approaches to handling this
particular workload better.  We've even considered that TCP isn't at all 
the right protocol but this affects several protocols including NFS and 
the benefits of running NFS over TCP are too great.

The original patch was prepared by Brian Behlendorf. He asked me to 
adapt it for current kernels, keep it up to date, and send it upstream.

This may also help people like Andrew Athan, who reported a similar 
problem a couple of days ago on the linux-net mailing list: 
http://www.uwsg.iu.edu/hypermail/linux/net/0609.3/0005.html I suspect 
that this case is more common than is widely recognized.

Signed-off-by: Ben Woodard <woodard@redhat.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>