* [RFC] Failover-friendly TCP retransmission
@ 2007-06-04 12:55 noboru.obata.ar
2007-06-04 14:55 ` Andi Kleen
0 siblings, 1 reply; 5+ messages in thread
From: noboru.obata.ar @ 2007-06-04 12:55 UTC (permalink / raw)
To: netdev
Hi all,
I would like to hear comments on how TCP retransmission can be
done better on failover-capable network devices, such as an
active-backup bonding device.
Premise
=======
Please note first that I want to address physical failures by
the failover-capable network devices, which are increasingly
becoming important as Xen-based VM systems are getting popular.
Reducing a single-point-of-failure (physical device) is vital on
such VM systems.
Note also that the failover discussed here is not meant to address
overloaded or congested networks, which should be handled separately.
Background
==========
When designing a TCP/IP based network system on failover-capable
network devices, people want to set timeouts hierarchically in
three layers, network device layer, TCP layer, and application
layer (bottom-up order), such that:
1. Network device layer detects a failure first and switch to a
backup device (say, in 20sec).
2. TCP layer timeout & retransmission comes next, _hopefully_
before the application layer timeout.
3. Application layer detects a network failure last (by, say,
30sec timeout) and may trigger a system-level failover.
It should be noted that the timeouts for #1 and #2 are handled
independently and there is no relationship between them.
Also note that the actual timeout settings (20sec or 30sec in
this example) are often determined by system requirements, so
setting them to certain "safe values" (if any) is usually not
possible.
Problem
=======
If TCP retransmission misses the time frame between event #1 and
#3 in Background above (between 20 and 30sec since network
failure), a failure causes the system-level failover where the
network-device-level failover should be enough.
The problem in this hierarchical timeout scheme is that the TCP
layer does not guarantee that the next retransmission occurs
within a certain period of time. In the above example, people
expect TCP to retransmit a packet between 20 and 30sec after the
network failure, but that may not happen.
Starting from RTO=0.5sec for example, retransmission will occur
at time 0.5, 1.5, 3.5, 7.5, 15.5, and 31.5 as indicated by 'o'
in the following diagram, but miss the time frame between time
20 and 30.
time:        0         10        20        30sec
             |         |         |         |
App. layer   |---------+---------+---------X   ==> system failover
TCP layer    oo-o---o--+----o----+---------+o  <== expects retrans. b/w 20~30
Netdev layer |---------+---------X             ==> network failover
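This schedule can be checked with a small back-of-the-envelope script
(plain user-space Python, modeling only pure exponential doubling per
RFC 2988, not the kernel's actual RTO state machine):

```python
# Exponential backoff: the RTO doubles after each timeout (RFC 2988).
# Starting from RTO = 0.5s, list the retransmission times and check
# whether any of them falls inside the 20-30s failover window.

def retrans_times(initial_rto, limit, rto_max=120.0):
    times, t, rto = [], 0.0, initial_rto
    while t + rto <= limit:
        t += rto                 # next retransmission fires RTO after the last
        times.append(t)
        rto = min(rto * 2, rto_max)
    return times

times = retrans_times(0.5, 40.0)
print(times)                                  # [0.5, 1.5, 3.5, 7.5, 15.5, 31.5]
print(any(20.0 <= t <= 30.0 for t in times))  # False: the 20-30s window is missed
```

The jump from 15.5 to 31.5 is exactly the gap the diagram shows: once the
RTO has grown past 10s, a whole 10s window can pass with no transmission
attempt at all.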
Solution
========
It seems reasonable to me to solve this problem by capping
TCP_RTO_MAX, i.e., by making TCP_RTO_MAX a sysctl variable and
setting it to a small value.
In this example, setting it to at most (10 - RTT)[sec] should
work, because a retransmission will then take place between time
20 and 30.
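As a sanity check of this arithmetic, the same kind of user-space sketch
(plain Python, illustrating only the timer schedule, not actual kernel
behavior) shows that capping the RTO at 10[s] places a retransmission
inside the 20-30sec window:

```python
def retrans_times(initial_rto, limit, rto_max):
    # Exponential backoff with the RTO capped at rto_max.
    times, t, rto = [], 0.0, initial_rto
    while t + rto <= limit:
        t += rto
        times.append(t)
        rto = min(rto * 2, rto_max)
    return times

capped = retrans_times(0.5, 40.0, rto_max=10.0)
print(capped)                                    # [0.5, 1.5, 3.5, 7.5, 15.5, 25.5, 35.5]
print([t for t in capped if 20.0 <= t <= 30.0])  # [25.5] -- the window is hit
```

With the cap, consecutive retransmissions are never more than 10s apart,
so any 10s-wide window after the cap takes effect must contain at least
one attempt.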
My rationale follows.
* This solution is simple and therefore less error-prone.
* The solution does not violate RFC 2988 regarding the maximum
RTO value, because RFC 2988's requirement that a maximum RTO be
"at least 60 seconds" (in (2.5)) is a MAY, i.e., optional.
* The solution adds a system-wide setting, which is preferable
to a per-socket setting (via, say, setsockopt or something),
because all applications benefit from the solution.
Before posting patches, I would like to hear comments here.
Any comments or suggestions on making TCP retransmission work
better on failover-capable network devices are welcome.
Regards,
--
OBATA Noboru (noboru.obata.ar@hitachi.com)
^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC] Failover-friendly TCP retransmission
2007-06-04 12:55 [RFC] Failover-friendly TCP retransmission noboru.obata.ar
@ 2007-06-04 14:55 ` Andi Kleen
2007-06-05 13:05 ` noboru.obata.ar
From: Andi Kleen @ 2007-06-04 14:55 UTC (permalink / raw)
To: noboru.obata.ar; +Cc: netdev
noboru.obata.ar@hitachi.com writes:
> Please note first that I want to address physical failures by
> the failover-capable network devices, which are increasingly
> becoming important as Xen-based VM systems are getting popular.
> Reducing a single-point-of-failure (physical device) is vital on
> such VM systems.
Just note that you typically still have lots of other single points of
failure in a single system, some of them rather less reliable than your
typical NIC. But at least it gives impressive demos when pulling ethernet cables @)
> 1. Network device layer detects a failure first and switch to a
> backup device (say, in 20sec).
>
> 2. TCP layer timeout & retransmission comes next, _hopefully_
> before the application layer timeout.
>
> 3. Application layer detects a network failure last (by, say,
> 30sec timeout) and may trigger a system-level failover.
>
> It should be noted that the timeouts for #1 and #2 are handled
> independently and there is no relationship between them.
> If TCP retransmission misses the time frame between event #1 and
> #3 in Background above (between 20 and 30sec since network
> failure), a failure causes the system-level failover where the
> network-device-level failover should be enough.
You should probably make sure that the device ends up returning the
right NET_XMIT_* code for such drops to TCP, in particular
NET_XMIT_DROP. This might require slight driver interface
changes. Also right now it only affects the congestion window, I think,
it might be reasonable to let it affect the timer backoff too.
-Andi
* Re: [RFC] Failover-friendly TCP retransmission
2007-06-04 14:55 ` Andi Kleen
@ 2007-06-05 13:05 ` noboru.obata.ar
2007-06-05 19:23 ` Andi Kleen
From: noboru.obata.ar @ 2007-06-05 13:05 UTC (permalink / raw)
To: andi; +Cc: netdev
Hi Andi,
Andi Kleen <andi@firstfloor.org> writes:
> > Please note first that I want to address physical failures by
> > the failover-capable network devices, which are increasingly
> > becoming important as Xen-based VM systems are getting popular.
> > Reducing a single-point-of-failure (physical device) is vital on
> > such VM systems.
>
> Just you typically still have lots of other single points of failures in
> a single system, some of them quite less reliable than your typical
> NIC. But at least it gives impressive demos when pulling ethernet cables @)
Indeed :-)
> > If TCP retransmission misses the time frame between event #1 and
> > #3 in Background above (between 20 and 30sec since network
> > failure), a failure causes the system-level failover where the
> > network-device-level failover should be enough.
>
> You should probably make sure that the device ends up returning the
> right NET_XMIT_* code for such drops to TCP, in particular
> NET_XMIT_DROP. This might require slight driver interface
> changes. Also right now it only affects the congestion window, I think,
> it might be reasonable to let it affect the timer backoff too.
Well, I don't think that would help.
Your suggestion, to utilize NET_XMIT_* code returned from an
underlying layer, is done in tcp_transmit_skb.
But my problem is that tcp_transmit_skb is not called during a
certain period of time. So I'm suggesting to cap RTO value so
that tcp_transmit_skb gets called more frequently.
Does it make sense, Andi?
Regards,
--
OBATA Noboru (noboru.obata.ar@hitachi.com)
* Re: [RFC] Failover-friendly TCP retransmission
2007-06-05 13:05 ` noboru.obata.ar
@ 2007-06-05 19:23 ` Andi Kleen
2007-06-07 4:27 ` noboru.obata.ar
From: Andi Kleen @ 2007-06-05 19:23 UTC (permalink / raw)
To: noboru.obata.ar; +Cc: andi, netdev
> Your suggestion, to utilize NET_XMIT_* code returned from an
> underlying layer, is done in tcp_transmit_skb.
>
> But my problem is that tcp_transmit_skb is not called during a
> certain period of time. So I'm suggesting to cap RTO value so
> that tcp_transmit_skb gets called more frequently.
The transmit code controls the transmission timeout. Or at least
it could change it if it really wanted.
What I wanted to say is: if the loss still happens under control
of the sending end device and TCP knows this then it could change
the retransmit timer to fire earlier or even just wait for an
event from the device that tells it to retransmit early.
I admit I have not thought through all the implications of this,
but it would seem to me a better approach than capping RTO or
doing other intrusive TCP changes.
The problem with capping RTO is that when there is a loss
in the network for some other reasons (and there is no reason
bonding can't be used when talking to the internet) you
might be too aggressive or not aggressive enough anymore
to get the data through.
But if you only change behaviour when you detect a local
loss this cannot happen.
Just using a very short timeout of one jiffy on local loss might work
(the stack already does this sometimes). Upcalls would be more
complicated and might have some bad side effects (like not
interacting well with qdiscs or possibly being unfair if there
are a lot of sockets). But that might be solvable too.
In virtualized environments it might also be necessary
to pass NET_XMIT_* through the paravirtual driver interface.
-Andi
* Re: [RFC] Failover-friendly TCP retransmission
2007-06-05 19:23 ` Andi Kleen
@ 2007-06-07 4:27 ` noboru.obata.ar
From: noboru.obata.ar @ 2007-06-07 4:27 UTC (permalink / raw)
To: andi; +Cc: netdev
Hi Andi,
Thank you for your comments.
Andi Kleen <andi@firstfloor.org> writes:
> > Your suggestion, to utilize NET_XMIT_* code returned from an
> > underlying layer, is done in tcp_transmit_skb.
> >
> > But my problem is that tcp_transmit_skb is not called during a
> > certain period of time. So I'm suggesting to cap RTO value so
> > that tcp_transmit_skb gets called more frequently.
>
> The transmit code controls the transmission timeout. Or at least
> it could change it if it really wanted.
>
> What I wanted to say is: if the loss still happens under control
> of the sending end device and TCP knows this then it could change
> the retransmit timer to fire earlier or even just wait for an
> event from the device that tells it to retransmit early.
I examined your suggestion to introduce some interface so that
TCP can know, or be notified, that it should retransmit early.
There are then two options: pulling from TCP, or notifying TCP.
The first option, pulling from TCP, must be done at the
expiration of the retransmission timer because there is no other
context in which to do it. But if the RTO is already large, this
can easily miss events or status changes in the underlying
layer, such as symptoms of failure, failover, etc. So I give up
on pulling from TCP.
The second option, notifying TCP, seems a bit more promising.
Upon such a notification, TCP could look into a timer structure
to find retransmission events, update their timers so that they
expire earlier, and possibly reset their RTO values. Perhaps
this would have to be done for all TCP packets, because TCP doesn't
know which packet will be sent to the device of interest at that time.
But I don't quite see that this solves my problem better. Such
upcalls would be more complicated than capping the RTO, and thus
may be error-prone and harder to maintain. The problems might be
solvable, but I'd prefer a simpler solution.
> The problem with capping RTO is that when there is a loss
> in the network for some other reasons (and there is no reason
> bonding can't be used when talking to the internet) you
> might be too aggressive or not aggressive enough anymore
> to get the data through.
I think capping the RTO is robust and better than upcalls. The
effects of capping the RTO are small enough, as follows.
* It just makes retransmission more frequent. Since TCP already
has fast retransmission, retransmitting earlier does not in
itself break TCP. (I'm going to examine every occurrence of
TCP_RTO_MAX, though.)
* In the worst case, it does not increase the total number of
retransmitted packets, which is bounded by tcp_retries2.
The final retransmission timeout thus comes earlier with the
same tcp_retries2. If this is the case, one will have to raise
tcp_retries2.
* In the average case, over a certain period of time (say
60[s]), it may slightly increase the number of retransmitted
packets. Starting from RTO = 0.2[s], the number of
retransmitted packets in the first 60[s] is 8 with
TCP_RTO_MAX = 120[s], and 15 with TCP_RTO_MAX = 5[s].
I think that an increase of several packets per minute per
socket is acceptable.
Therefore the side effects of capping the RTO, even when talking
to the Internet, seem to be small enough.
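The packet counts above can be reproduced with a small sketch (plain
user-space Python, assuming pure doubling from RTO = 0.2[s] with no
RTT-driven updates, which is only an approximation of the kernel's
actual behavior):

```python
def n_retrans(initial_rto, window, rto_max):
    # Count retransmissions in the first `window` seconds, doubling
    # the RTO after each attempt and capping it at rto_max.
    n, t, rto = 0, 0.0, initial_rto
    while t + rto <= window:
        t += rto
        n += 1
        rto = min(rto * 2, rto_max)
    return n

print(n_retrans(0.2, 60.0, rto_max=120.0))  # 8
print(n_retrans(0.2, 60.0, rto_max=5.0))    # 15
```

With the 5[s] cap the schedule settles into one attempt every 5 seconds
after the first few doublings, which is where the extra 7 packets per
minute come from.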
--
OBATA Noboru (noboru.obata.ar@hitachi.com)