netdev.vger.kernel.org archive mirror
* [RFC] Failover-friendly TCP retransmission
@ 2007-06-04 12:55 noboru.obata.ar
  2007-06-04 14:55 ` Andi Kleen
  0 siblings, 1 reply; 5+ messages in thread
From: noboru.obata.ar @ 2007-06-04 12:55 UTC (permalink / raw)
  To: netdev

Hi all,

I would like to hear comments on how TCP retransmission can be
done better on failover-capable network devices, such as an
active-backup bonding device.


Premise
=======

Please note first that I want to address physical failures using
failover-capable network devices, which are becoming increasingly
important as Xen-based VM systems grow popular.  Eliminating a
single point of failure (a physical device) is vital on such VM
systems.

The failover discussed here is not meant to address overloaded or
congested networks, which should be solved separately.


Background
==========

When designing a TCP/IP based network system on failover-capable
network devices, people want to set timeouts hierarchically in
three layers, network device layer, TCP layer, and application
layer (bottom-up order), such that:

1. Network device layer detects a failure first and switches to
   a backup device (say, in 20sec).

2. TCP layer timeout & retransmission comes next, _hopefully_
   before the application layer timeout.

3. Application layer detects a network failure last (by, say,
   30sec timeout) and may trigger a system-level failover.

It should be noted that the timeouts for #1 and #2 are handled
independently and there is no relationship between them.

Also note that the actual timeout settings (20sec or 30sec in
this example) are often determined by system requirements, so
setting them to certain "safe values" (if any exist) is usually
not possible.


Problem
=======

If no TCP retransmission falls in the time frame between events
#1 and #3 in Background above (between 20 and 30sec after the
network failure), the failure causes a system-level failover
where the network-device-level failover should have been enough.

The problem in this hierarchical timeout scheme is that the TCP
layer does not guarantee that the next retransmission occurs
within a certain period of time.  In the above example, people
expect TCP to retransmit a packet between 20 and 30sec after the
network failure, but it may not happen.

Starting from RTO=0.5sec for example, retransmissions will occur
at times 0.5, 1.5, 3.5, 7.5, 15.5, and 31.5 as indicated by 'o'
in the following diagram, so none of them falls in the time
frame between time 20 and 30.

       time: 0         10        20        30sec
             |         |         |         |
  App. layer |---------+---------+---------X  ==> system failover
   TCP layer oo-o---o--+----o----+---------+o <== expects retrans. b/w 20~30
Netdev layer |---------+---------X            ==> network failover
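
The schedule in the diagram can be reproduced with a few lines of
Python.  This is only a sketch of the arithmetic, assuming the RTO
simply doubles after each timeout and that no upper bound takes
effect within the first 35 seconds; the function name
retransmit_times is made up for illustration and is not a kernel
interface.

```python
def retransmit_times(initial_rto, horizon):
    """Times (sec) at which retransmissions fire up to `horizon`,
    assuming the RTO doubles after every timeout (no cap applied)."""
    times = []
    now, rto = 0.0, initial_rto
    while now + rto <= horizon:
        now += rto          # the timer expires, the segment is resent
        times.append(now)
        rto *= 2            # exponential backoff
    return times

times = retransmit_times(0.5, 35.0)
# Retransmissions that land after the network failover (20 sec)
# but before the application timeout (30 sec).
in_window = [t for t in times if 20.0 <= t <= 30.0]

print(times)      # [0.5, 1.5, 3.5, 7.5, 15.5, 31.5]
print(in_window)  # []
```

The empty second list is the problem case: the retransmission
following the one at 15.5 does not fire until 31.5, after the
application has already timed out.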


Solution
========

It seems reasonable to me to solve this problem by capping
TCP_RTO_MAX, i.e., making TCP_RTO_MAX a sysctl variable and
setting it to a small number.

In this example, setting it to at most (10 - RTT)[sec] should
work, because a retransmission will then take place between time
20 and 30.
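
To illustrate the effect of such a cap, here is a small Python
sketch of the same backoff schedule with an upper bound on the
RTO.  It is not kernel code: the function name, the 100 msec RTT,
and the 40 sec horizon are all assumptions chosen for the example,
and times are kept in integer milliseconds to avoid rounding.

```python
def retransmit_times_ms(initial_rto_ms, horizon_ms, rto_max_ms):
    """Retransmission times (msec) up to `horizon_ms`, assuming the
    RTO doubles after every timeout but never exceeds `rto_max_ms`."""
    times = []
    now, rto = 0, min(initial_rto_ms, rto_max_ms)
    while now + rto <= horizon_ms:
        now += rto
        times.append(now)
        rto = min(rto * 2, rto_max_ms)   # capped exponential backoff
    return times

RTT_MS = 100                   # assumed round-trip time
RTO_MAX_MS = 10_000 - RTT_MS   # the (10 - RTT) sec cap from the text

capped = retransmit_times_ms(500, 40_000, RTO_MAX_MS)
# Retransmissions between network failover (20 sec) and the
# application timeout (30 sec).
hits = [t for t in capped if 20_000 <= t <= 30_000]

print(capped)  # [500, 1500, 3500, 7500, 15500, 25400, 35300]
print(hits)    # [25400]
```

With the cap in place, consecutive retransmissions are never more
than 9.9 sec apart in this sketch, so one of them lands inside the
10 sec window.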

My rationale follows.

* This solution is simple and so less error-prone.

* The solution does not violate RFC 2988 with respect to the
  maximum RTO value, because RFC 2988's requirement on the
  maximum RTO value, "at least 60 seconds (in (2.5))," is
  OPTIONAL.

* The solution adds a system-wide setting, which is preferable
  to a per-socket setting (by, say, setsockopt or something),
  because all applications benefit from the solution.


Before posting patches, I would like to hear comments here.
Any comments or suggestions, to make TCP retransmission work
better on failover-capable network devices, are welcome.

Regards,

-- 
OBATA Noboru (noboru.obata.ar@hitachi.com)


end of thread, other threads:[~2007-06-07  4:27 UTC | newest]

Thread overview: 5+ messages
2007-06-04 12:55 [RFC] Failover-friendly TCP retransmission noboru.obata.ar
2007-06-04 14:55 ` Andi Kleen
2007-06-05 13:05   ` noboru.obata.ar
2007-06-05 19:23     ` Andi Kleen
2007-06-07  4:27       ` noboru.obata.ar
