* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
[not found] ` <20050901.154300.118239765.davem@davemloft.net>
@ 2005-09-01 22:53 ` Ion Badulescu
2005-09-01 23:37 ` Jesper Juhl
` (2 more replies)
0 siblings, 3 replies; 30+ messages in thread
From: Ion Badulescu @ 2005-09-01 22:53 UTC (permalink / raw)
To: David S. Miller; +Cc: linux-kernel, linux-net, netdev
Hi David,
On Thu, 1 Sep 2005, David S. Miller wrote:
> Thanks for the empty posting. Please provide the content you
> intended to post, and furthermore please post it to the network
> developer mailing list, netdev@vger.kernel.org
First of all, thanks for the reply (even to an empty posting :).
The posting wasn't actually empty, it was probably too long (94K according
to my sent-mail folder) and majordomo truncated it to zero. It has some
tcpdump snippets, that's what made it so long... unfortunately, they're
all necessary to understand the nature of the bug. I wasn't sure about
netdev, that's why I posted it only to linux-kernel and linux-net.
I can provide the full tcpdump out-of-band to interested people, since I
don't think I can get it past majordomo.
Here is the text of the message without the tcpdump inserts:
---------------------------------------------------------------------------
Hello,
I've been tracking down this bug for some time, and I'm fairly convinced
at this point that it's a kernel bug.
Under certain conditions, the TCP stack starts shrinking the TCP window
down to some ridiculously low values (hundreds of bytes, as low as 181)
and never recovers. The conditions that trigger this are not well
understood at this point, but they include a long-lived connection with
very one-sided, fluctuating traffic flowing through it.
So far I've been able to reproduce it on plain-vanilla 2.6.9, 2.6.11.9,
and 2.6.12.2, as well as on the RHEL3 kernels 2.4.21-20 and 2.4.21-31. The
hardware is dual Opteron 250, running both 32- and 64-bit SMP kernels
(seems to make no difference). I've also seen the bug occur on a single
Athlon XP running 2.6.11.9 UP.
The bug occurs with all sysctl settings at their default values. I've
tried enabling and disabling pretty much all the tcp-related sysctl's in
/proc/sys/net/ipv4, to no visible improvement.
Here are a few tcpdump snippets of a TCP connection exhibiting the bug
(the complete tcpdump is available upon request, but it's very large).
10.2.20.246 is the data receiver and is the box exhibiting the bug (I'm
not sure what 10.2.224.182 is running, I don't have access to it). The
data being sent through is real-time financial data; the session begins by
catching up (at line speed) to present time, then continues to receive
real-time data as it is being generated. For what it's worth, we've never
seen the bug occur while the session is still catching up (and
receiving a few large packets at a time); it always seems to happen while
receiving real-time data (many small packets, variably interspaced).
[I apologize for the amount of tcpdump data, but it's the only way to show
the bug in action.]
[tcpdump output removed]
The connection is established and the receiver's TCP window quickly ramps
up to 8192.
[tcpdump output removed]
Shortly thereafter the TCP window increases further to 16534. It remains
around 16534 for the next 5 minutes or so.
[tcpdump output removed]
A few minutes later it has finally caught up to present time and it starts
receiving smaller packets containing real-time data. The TCP window is
still 16534 at this point.
[tcpdump output removed]
This is where things start going bad. The window starts shrinking from
15340 all the way down to 2355 over the course of 0.3 seconds. Notice the
many duplicate acks that serve no purpose (there are no lost packets and
the tcpdump is taken on the receiver so there are no packets/acks crossed
in flight).
[tcpdump output removed]
Five minutes later the TCP window is still at 2355, having never
recovered. The window is so small that the available bandwidth for this
connection is too small to keep up with the real-time data so it is
falling behind, hence large packets are again being used. The application
processing the data (Java-based) is mostly idle at this point, and netstat
shows its recv queue to be empty. There is no apparent reason why the
kernel shouldn't enlarge the window.
In fact, if I let it continue, it eventually shrinks the window even
further (by 18:19:29, the time I'm writing this email, it's gone all the
way down to 1373). As I mentioned earlier, I've seen it go as low as 181.
We are kind of stumped at this point, and it's proving to be a
show-stopping bug for our purposes, especially over WAN links that have
higher latency (for obvious reasons). Any kind of assistance would be
greatly appreciated.
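
For a sense of scale on the WAN remark: a sender can have at most one window of data in flight per round trip, so throughput is bounded by window/RTT. The following is a rough Python sketch of that bound. The window sizes are the ones quoted in this report; the 50 ms round-trip time is an assumed figure chosen only for illustration.

```python
def max_throughput_bps(window_bytes: int, rtt_seconds: float) -> float:
    """Upper bound on TCP throughput: at most one window per round trip."""
    return window_bytes * 8 / rtt_seconds


# Window sizes taken from the report; the 50 ms RTT is a hypothetical WAN value.
ASSUMED_RTT = 0.050

for window in (16534, 2355, 181):
    kbps = max_throughput_bps(window, ASSUMED_RTT) / 1000
    print(f"window {window:5d} bytes -> at most {kbps:8.1f} kbit/s")
```

On a LAN with sub-millisecond round trips the same small window hurts far less, which is why the problem is most visible on higher-latency links.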
Thanks,
-Ion
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
2005-09-01 22:53 ` Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels Ion Badulescu
@ 2005-09-01 23:37 ` Jesper Juhl
2005-09-02 2:51 ` John Heffner
2005-09-02 18:36 ` Alexey Kuznetsov
2 siblings, 0 replies; 30+ messages in thread
From: Jesper Juhl @ 2005-09-01 23:37 UTC (permalink / raw)
To: Ion Badulescu; +Cc: David S. Miller, linux-kernel, linux-net, netdev
On 9/2/05, Ion Badulescu <lists@limebrokerage.com> wrote:
> Hi David,
>
> On Thu, 1 Sep 2005, David S. Miller wrote:
>
> > Thanks for the empty posting. Please provide the content you
> > intended to post, and furthermore please post it to the network
> > developer mailing list, netdev@vger.kernel.org
>
> First of all, thanks for the reply (even to an empty posting :).
>
> The posting wasn't actually empty, it was probably too long (94K according
Two solutions are commonly applied to that problem:
- put the big file(s) online somewhere and include a URL in the email
- compress the file(s) and attach the compressed files to the email
--
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
2005-09-01 22:53 ` Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels Ion Badulescu
2005-09-01 23:37 ` Jesper Juhl
@ 2005-09-02 2:51 ` John Heffner
2005-09-02 6:28 ` David S. Miller
` (2 more replies)
2005-09-02 18:36 ` Alexey Kuznetsov
2 siblings, 3 replies; 30+ messages in thread
From: John Heffner @ 2005-09-02 2:51 UTC (permalink / raw)
To: Ion Badulescu; +Cc: David S. Miller, linux-net, linux-kernel, netdev
On Sep 1, 2005, at 6:53 PM, Ion Badulescu wrote:
>
> A few minutes later it has finally caught up to present time and it
> starts receiving smaller packets containing real-time data. The TCP
> window is still 16534 at this point.
>
> [tcpdump output removed]
>
> This is where things start going bad. The window starts shrinking from
> 15340 all the way down to 2355 over the course of 0.3 seconds. Notice
> the many duplicate acks that serve no purpose (there are no lost
> packets and the tcpdump is taken on the receiver so there are no
> packets/acks crossed in flight).
I have an idea why this is going on. Packets are pre-allocated by the
driver to be a max packet size, so when you send small packets, it
wastes a lot of memory. Currently Linux uses the packets at the
beginning of a connection to make a guess at how best to advertise its
window so as not to overflow the socket's memory bounds. Since you
start out with big segments then go to small ones, this is defeating
that mechanism. It's actually documented in the comments in
tcp_input.c. :)
* The scheme does not work when sender sends good segments opening
* window and then starts to feed us spagetti. But it should work
* in common situations. Otherwise, we have to rely on queue collapsing.
If you overflow the socket's memory bound, it ends up calling
tcp_clamp_window(). (I'm not sure this is really the right thing to do
here before trying to collapse the queue.) If the receiving
application doesn't fall too far behind, it might help you to set a
much larger receiver buffer.
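
A back-of-the-envelope version of this point, written as a Python sketch rather than kernel code. It assumes every queued segment is charged a fixed allocation size against the socket buffer regardless of how little data it carries. The 2048-byte per-packet charge and the 100-byte payload are made-up illustrative figures, and 87380 is used only as a commonly seen default receive buffer size.

```python
def payload_capacity(rcvbuf: int, payload: int, truesize: int) -> int:
    """Bytes of user data that fit in rcvbuf when each queued segment
    is charged `truesize` bytes, whatever its payload."""
    segments = rcvbuf // truesize
    return segments * payload


RCVBUF = 87380    # assumed receive buffer size in bytes
TRUESIZE = 2048   # hypothetical per-packet memory charge

for payload in (1448, 100):
    held = payload_capacity(RCVBUF, payload, TRUESIZE)
    print(f"{payload:4d}-byte segments: buffer holds about {held} bytes of data")
```

Under these assumptions, a window sized while segments were large promises far more data than the buffer can hold once the segments become small.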
-John
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
2005-09-02 2:51 ` John Heffner
@ 2005-09-02 6:28 ` David S. Miller
2005-09-02 14:05 ` lists
2005-09-02 13:02 ` Guillaume Autran
2005-09-02 13:48 ` Alexey Kuznetsov
2 siblings, 1 reply; 30+ messages in thread
From: David S. Miller @ 2005-09-02 6:28 UTC (permalink / raw)
To: jheffner; +Cc: lists, linux-net, linux-kernel, netdev
From: John Heffner <jheffner@psc.edu>
Date: Thu, 1 Sep 2005 22:51:48 -0400
> I have an idea why this is going on. Packets are pre-allocated by the
> driver to be a max packet size, so when you send small packets, it
> wastes a lot of memory. Currently Linux uses the packets at the
> beginning of a connection to make a guess at how best to advertise its
> window so as not to overflow the socket's memory bounds. Since you
> start out with big segments then go to small ones, this is defeating
> that mechanism. It's actually documented in the comments in
> tcp_input.c. :)
>
> * The scheme does not work when sender sends good segments opening
> * window and then starts to feed us spagetti. But it should work
> * in common situations. Otherwise, we have to rely on queue collapsing.
That's a strong possibility, good catch John.
Although, I'm still not ruling out some box in the middle
even though I consider it less likely than your theory.
So you're suggesting that tcp_prune_queue() should do the:
	if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
		tcp_clamp_window(sk, tp);
check after attempting to collapse the queue.
But, that window clamping should fix the problem, as we recalculate
the window to advertise.
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
2005-09-02 2:51 ` John Heffner
2005-09-02 6:28 ` David S. Miller
@ 2005-09-02 13:02 ` Guillaume Autran
2005-09-02 13:48 ` Ion Badulescu
2005-09-02 13:52 ` Alexey Kuznetsov
2005-09-02 13:48 ` Alexey Kuznetsov
2 siblings, 2 replies; 30+ messages in thread
From: Guillaume Autran @ 2005-09-02 13:02 UTC (permalink / raw)
To: John Heffner
Cc: Ion Badulescu, David S. Miller, linux-net, linux-kernel, netdev
I experienced the very same problem, but with the window size going all the
way down to just a few bytes (14 bytes). Dump files are available upon
request :)
Ion, how were you able to reproduce the issue? Can the same type of
traffic always reproduce it, or is it more intermittent?
Best regards,
Guillaume.
John Heffner wrote:
> On Sep 1, 2005, at 6:53 PM, Ion Badulescu wrote:
>
>>
>> A few minutes later it has finally caught up to present time and it
>> starts receiving smaller packets containing real-time data. The TCP
>> window is still 16534 at this point.
>>
>> [tcpdump output removed]
>>
>> This is where things start going bad. The window starts shrinking
>> from 15340 all the way down to 2355 over the course of 0.3 seconds.
>> Notice the many duplicate acks that serve no purpose (there are no
>> lost packets and the tcpdump is taken on the receiver so there are no
>> packets/acks crossed in flight).
>
>
> I have an idea why this is going on. Packets are pre-allocated by the
> driver to be a max packet size, so when you send small packets, it
> wastes a lot of memory. Currently Linux uses the packets at the
> beginning of a connection to make a guess at how best to advertise its
> window so as not to overflow the socket's memory bounds. Since you
> start out with big segments then go to small ones, this is defeating
> that mechanism. It's actually documented in the comments in
> tcp_input.c. :)
>
> * The scheme does not work when sender sends good segments opening
> * window and then starts to feed us spagetti. But it should work
> * in common situations. Otherwise, we have to rely on queue collapsing.
>
> If you overflow the socket's memory bound, it ends up calling
> tcp_clamp_window(). (I'm not sure this is really the right thing to
> do here before trying to collapse the queue.) If the receiving
> application doesn't fall too far behind, it might help you to set a
> much larger receiver buffer.
>
> -John
--
=======================================
Guillaume Autran
Senior Software Engineer
MRV Communications, Inc.
Tel: (978) 952-4932 office
E-mail: gautran@mrv.com
=======================================
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
2005-09-02 2:51 ` John Heffner
2005-09-02 6:28 ` David S. Miller
2005-09-02 13:02 ` Guillaume Autran
@ 2005-09-02 13:48 ` Alexey Kuznetsov
2005-09-02 14:16 ` John Heffner
2 siblings, 1 reply; 30+ messages in thread
From: Alexey Kuznetsov @ 2005-09-02 13:48 UTC (permalink / raw)
To: John Heffner
Cc: Ion Badulescu, David S. Miller, linux-net, linux-kernel, netdev
Hello!
> If you overflow the socket's memory bound, it ends up calling
> tcp_clamp_window(). (I'm not sure this is really the right thing to do
> here before trying to collapse the queue.)
Collapsing is too expensive a procedure; it is rather an emergency measure.
So TCP collapses the queue when it is necessary, but it must reduce the
window as well.
Alexey
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
2005-09-02 13:02 ` Guillaume Autran
@ 2005-09-02 13:48 ` Ion Badulescu
2005-09-02 13:52 ` Alexey Kuznetsov
1 sibling, 0 replies; 30+ messages in thread
From: Ion Badulescu @ 2005-09-02 13:48 UTC (permalink / raw)
To: Guillaume Autran
Cc: John Heffner, Ion Badulescu, David S. Miller, linux-net,
linux-kernel, netdev
On Fri, 2 Sep 2005, Guillaume Autran wrote:
> I experienced the very same problem but with window size going all the way
> down to just a few bytes (14 bytes). dump files available upon requests :)
> Ion, how were you able to reproduce the issue ? Can the same type of traffic
> always reproduce the issue or is it more intermittent ?
I have no problem whatsoever reproducing it, at least with the kind of
traffic I described. I had 4 flows like that running yesterday, and all 4
had TCP window sizes smaller than 500 bytes on the receiver by mid-day.
-Ion
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
2005-09-02 13:02 ` Guillaume Autran
2005-09-02 13:48 ` Ion Badulescu
@ 2005-09-02 13:52 ` Alexey Kuznetsov
2005-09-02 14:11 ` John Heffner
[not found] ` <43185E81.2070300@mrv.com>
1 sibling, 2 replies; 30+ messages in thread
From: Alexey Kuznetsov @ 2005-09-02 13:52 UTC (permalink / raw)
To: Guillaume Autran
Cc: John Heffner, Ion Badulescu, David S. Miller, linux-net,
linux-kernel, netdev
Hello!
> I experienced the very same problem but with window size going all the
> way down to just a few bytes (14 bytes). dump files available upon
> requests :)
I do request.
TCP is not allowed to reduce the window to a value less than 2*MSS, no matter
how hard the network device or the peer tries to confuse it. :-)
Alexey
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
2005-09-02 6:28 ` David S. Miller
@ 2005-09-02 14:05 ` lists
2005-09-02 14:10 ` John Heffner
0 siblings, 1 reply; 30+ messages in thread
From: lists @ 2005-09-02 14:05 UTC (permalink / raw)
To: David S. Miller; +Cc: jheffner, linux-net, linux-kernel, netdev
Hi David,
On Thu, 1 Sep 2005, David S. Miller wrote:
> From: John Heffner <jheffner@psc.edu>
> Date: Thu, 1 Sep 2005 22:51:48 -0400
>
>> I have an idea why this is going on. Packets are pre-allocated by the
>> driver to be a max packet size, so when you send small packets, it
>> wastes a lot of memory. Currently Linux uses the packets at the
>> beginning of a connection to make a guess at how best to advertise its
>> window so as not to overflow the socket's memory bounds. Since you
>> start out with big segments then go to small ones, this is defeating
>> that mechanism. It's actually documented in the comments in
>> tcp_input.c. :)
>>
>> * The scheme does not work when sender sends good segments opening
>> * window and then starts to feed us spagetti. But it should work
>> * in common situations. Otherwise, we have to rely on queue collapsing.
>
> That's a strong possibility, good catch John.
That's possible, but see below.
> Although, I'm still not ruling out some box in the middle
> even though I consider it less likely than your theory.
There is no funky box in the middle, that much I can guarantee you.
I said yesterday that I don't have access to the sender. While that's true
for the flow I had captured in those dumps, I saw the same phenomenon
occur between two boxes I control fully. The sender is running Windows
2000, and is separated from the receiver by a single Catalyst 6500
switch/router (they are on different VLAN's) which doesn't do anything
fancy (I control the switch as well).
This particular Win2k sender sends _only_ real-time data, it's not capable
of rewinding. So it's always sending small packets, from start to finish,
yet the problem still occurs.
Note that even real-time data can end up generating a stream of full-size
packets occasionally. It's just very unlikely they would occur at the
start of the flow, as market data is very thin in the pre-market open hours.
> So you're suggesting that tcp_prune_queue() should do the:
>
> 	if (atomic_read(&sk->sk_rmem_alloc) >= sk->sk_rcvbuf)
> 		tcp_clamp_window(sk, tp);
>
> check after attempting to collapse the queue.
>
> But, that window clamping should fix the problem, as we recalculate
> the window to advertise.
Patches for testing are very much welcome...
Thanks,
-Ion
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
2005-09-02 14:05 ` lists
@ 2005-09-02 14:10 ` John Heffner
2005-09-02 14:33 ` lists
0 siblings, 1 reply; 30+ messages in thread
From: John Heffner @ 2005-09-02 14:10 UTC (permalink / raw)
To: lists; +Cc: David S. Miller, linux-net, linux-kernel, netdev
On Sep 2, 2005, at 10:05 AM, lists@limebrokerage.com wrote:
> This particular Win2k sender sends _only_ real-time data, it's not
> capable of rewinding. So it's always sending small packets, from start
> to finish, yet the problem still occurs.
>
> Note that even real-time data can end up generating a stream of
> full-size packets occasionally. It's just very unlikely they would
> occur at the start of the flow, as market data is very thin in the
> pre-market open hours.
The rcv_ssthresh growth can actually take place anywhere in the flow,
not just at the beginning.
>> But, that window clamping should fix the problem, as we recalculate
>> the window to advertise.
>
> Patches for testing are very much welcome...
Have you tried increasing the size of the receive buffer yet?
-John
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
2005-09-02 13:52 ` Alexey Kuznetsov
@ 2005-09-02 14:11 ` John Heffner
[not found] ` <43185E81.2070300@mrv.com>
1 sibling, 0 replies; 30+ messages in thread
From: John Heffner @ 2005-09-02 14:11 UTC (permalink / raw)
To: Alexey Kuznetsov
Cc: Ion Badulescu, Guillaume Autran, linux-kernel, David S. Miller,
linux-net, netdev
On Sep 2, 2005, at 9:52 AM, Alexey Kuznetsov wrote:
> Hello!
>
>> I experienced the very same problem but with window size going all the
>> way down to just a few bytes (14 bytes). dump files available upon
>> requests :)
>
> I do request.
>
> TCP is not allowed to reduce window to a value less than 2*MSS no
> matter
> how hard network device or peer try to confuse it. :-)
You're right, that doesn't make sense...
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
2005-09-02 13:48 ` Alexey Kuznetsov
@ 2005-09-02 14:16 ` John Heffner
2005-09-02 15:11 ` Alexey Kuznetsov
0 siblings, 1 reply; 30+ messages in thread
From: John Heffner @ 2005-09-02 14:16 UTC (permalink / raw)
To: Alexey Kuznetsov
Cc: Ion Badulescu, linux-kernel, David S. Miller, linux-net, netdev
On Sep 2, 2005, at 9:48 AM, Alexey Kuznetsov wrote:
> Hello!
>
>> If you overflow the socket's memory bound, it ends up calling
>> tcp_clamp_window(). (I'm not sure this is really the right thing to
>> do
>> here before trying to collapse the queue.)
>
> Collapsing is too expensive procedure, it is rather an emergency
> measure.
> So, tcp collapses queue, when it is necessary, but it must reduce
> window
> as well.
Right.
I wonder if clamping the window though is too harsh. Maybe just
setting the rcv_ssthresh down is better? Why the distinction between
in-order and out-of-order data? Because you expect in-order data to be
a persistent case?
-John
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
2005-09-02 14:10 ` John Heffner
@ 2005-09-02 14:33 ` lists
2005-09-02 14:48 ` John Heffner
0 siblings, 1 reply; 30+ messages in thread
From: lists @ 2005-09-02 14:33 UTC (permalink / raw)
To: John Heffner; +Cc: David S. Miller, linux-net, linux-kernel, netdev
On Fri, 2 Sep 2005, John Heffner wrote:
> Have you tried increasing the size of the receive buffer yet?
Actually, I just did. I changed rmem_max and rmem_default to 4MB and
tcp_rmem to "64k 4MB 4MB". It did seem to help, but I'm wondering if
that's simply because it has a _lot_ of memory now to leak before it
starts eating up into the window size.
I also ran into some very strange packet loss problems that weren't
occurring yesterday; they only started occurring after I increased the
buffer size. Most strange. If they happen again, I'll make sure I capture
the flow to analyze it.
Anyway, that was just to see if I can do anything at all to mitigate the
problem. I'll try again with smaller buffers (4k 128k 256k) and see what
happens.
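
Besides the system-wide sysctls, the receive buffer can also be requested per socket with SO_RCVBUF. Below is a minimal Python sketch (the application in question is Java-based, so this is only an illustration of the socket option, not its code). The helper name `request_rcvbuf` is made up for this example. On Linux the request is capped by net.core.rmem_max, the value reported back is typically double the accepted request, and the option should be set before connect() or listen() for it to influence the negotiated window scale.

```python
import socket


def request_rcvbuf(sock: socket.socket, nbytes: int) -> int:
    """Ask for a receive buffer of nbytes and return what the kernel reports."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, nbytes)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)


# Set the option on a fresh, unconnected TCP socket.
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    wanted = 512 * 1024
    granted = request_rcvbuf(s, wanted)
    print(f"requested {wanted} bytes, kernel reports {granted}")
```

Comparing the requested and reported values is a quick way to see whether rmem_max is silently limiting the buffer.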
Thanks,
-Ion
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
2005-09-02 14:33 ` lists
@ 2005-09-02 14:48 ` John Heffner
2005-09-02 15:43 ` Ion Badulescu
0 siblings, 1 reply; 30+ messages in thread
From: John Heffner @ 2005-09-02 14:48 UTC (permalink / raw)
To: lists; +Cc: David S. Miller, linux-net, linux-kernel, netdev
On Sep 2, 2005, at 10:33 AM, lists@limebrokerage.com wrote:
> On Fri, 2 Sep 2005, John Heffner wrote:
>
>> Have you tried increasing the size of the receive buffer yet?
>
> Actually, I just did. I changed rmem_max and rmem_default to 4MB and
> tcp_rmem to "64k 4MB 4MB". It did seem to help, but I'm wondering if
> that's simply because it has a _lot_ of memory now to leak before it
> starts eating up into the window size.
If it is window clamping, then you should be asymptotically approaching
a ratio between receive buffer and window that corresponds (with a
fudge factor) to the ratio between TCP segment data size and allocated
packet size. If you make the receive buffer large enough, then the
clamped window should still end up big enough. Also, since you have
"real time" data, a larger receive buffer should probably be adequate
to eliminate this problem, since it only occurs when the receiving
application falls behind for a while, and a bigger receive buffer
allows it to fall behind more without triggering the window clamping.
-John
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
2005-09-02 14:16 ` John Heffner
@ 2005-09-02 15:11 ` Alexey Kuznetsov
0 siblings, 0 replies; 30+ messages in thread
From: Alexey Kuznetsov @ 2005-09-02 15:11 UTC (permalink / raw)
To: John Heffner
Cc: Alexey Kuznetsov, Ion Badulescu, linux-kernel, David S. Miller,
linux-net, netdev
Hello!
> I wonder if clamping the window though is too harsh. Maybe just
> setting the rcv_ssthresh down is better?
It is too harsh. This was invented before we learned how to collapse
received data; at that time tiny segments were fatal and clamping was
the last weapon against misbehaving connections.
It can be removed.
Actually, the right solution would be to calculate the window/rcvbuf ratio
dynamically. It looked quite tricky, so it was not done.
Instead it is controlled with the static sysctl sysctl_tcp_adv_win_scale.
That does not work sometimes, e.g. when a device has larger link-level overhead.
I think this should be reconsidered.
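
To illustrate what a static window/rcvbuf split looks like, here is a small Python model of how tcp_adv_win_scale turns buffer space into advertisable window. It follows my reading of the kernel's tcp_win_from_space() helper in kernels of that era and should be treated as an assumption, not a copy of the source; the default scale of 2 and the 87380-byte buffer are likewise assumed typical values.

```python
def win_from_space(space: int, adv_win_scale: int = 2) -> int:
    """Model of the fixed split between advertised window and overhead.

    A positive scale reserves space / 2**scale for overhead; a zero or
    negative scale advertises space / 2**(-scale).
    """
    if adv_win_scale <= 0:
        return space >> (-adv_win_scale)
    return space - (space >> adv_win_scale)


for scale in (2, 1, -1):
    print(f"adv_win_scale={scale:2d}: 87380 bytes of buffer ->",
          win_from_space(87380, scale), "bytes of window")
```

With a scale of 2 the model reserves a quarter of the buffer for overhead. That is adequate for full-size segments but far too little when each small segment occupies a full-size packet buffer.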
> Why the distinction between
> in-order and out-of-order data? Because you expect in-order data to be
> a persistent case?
Overflow in in-order data is hard: we cannot drop data. Also, it means
that the receiving application cannot keep up with the receive rate, so we
can shrink the window.
Out-of-order data are different: we can drop the segments if we are
in serious trouble, and overflow there can be cured by expansion of window
to allow fast retransmit.
Alexey
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
2005-09-02 14:48 ` John Heffner
@ 2005-09-02 15:43 ` Ion Badulescu
0 siblings, 0 replies; 30+ messages in thread
From: Ion Badulescu @ 2005-09-02 15:43 UTC (permalink / raw)
To: John Heffner; +Cc: David S. Miller, linux-net, linux-kernel, netdev
On Fri, 2 Sep 2005, John Heffner wrote:
> If it is window clamping, then you should be asymptotically approaching a
> ratio between receive buffer and window that corresponds (with a fudge
> factor) to the ratio between TCP segment data size and allocated packet size.
> If you make the receive buffer large enough, then the clamped window should
> still end up big enough.
For what it's worth, running with a 512k receive buffer still caused the
clamping to occur, though it took longer than with the normal buffer size.
The window went down from a maximum of 12291 (times 2^4 due to window
scaling) to 3190 currently. That's still enough for our purposes, but I'll
keep monitoring it to see if it shrinks any further. It could be a viable
work-around for the time being.
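
For readers converting the figures above: with window scaling, the 16-bit window field seen in tcpdump is shifted left by the negotiated scale to get the real window in bytes. A short Python sketch using the two field values and the shift of 4 mentioned above:

```python
def scaled_window(field: int, wscale: int) -> int:
    """Real receive window in bytes for a given header field and scale shift."""
    return field << wscale


for field in (12291, 3190):
    print(f"field {field:5d} with shift 4 -> {scaled_window(field, 4)} bytes")
```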
Is this a bug, though, or a feature? :)
> Also, since you have "real time" data, a larger
> receive buffer should probably be adequate to eliminate this problem, since
> it only occurs when the receiving application falls behind for a while, and a
> bigger receive buffer allows it to fall behind more without triggering the
> window clamping.
Correct. I noticed too while experimenting that the clamping never occurs
if the application is fast enough to keep the socket buffer empty. It's
when data is allowed to accumulate in the buffer that the window shrinks,
and then it never grows back, as if a portion of the buffer got lost
permanently.
-Ion
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
[not found] ` <20050902154424.GA15060@yakov.inr.ac.ru>
@ 2005-09-02 16:18 ` Guillaume Autran
[not found] ` <431877EE.6010101@mrv.com>
1 sibling, 0 replies; 30+ messages in thread
From: Guillaume Autran @ 2005-09-02 16:18 UTC (permalink / raw)
To: linux-net, netdev
Hi Alexey,
Do you think this will also fix Ion's issue with small window size never
going back up ?
Thanks
Guillaume.
Alexey Kuznetsov wrote:
>Hello!
>
>
>
>>Here are the dumps...
>>
>>
>
>I see. It has nothing to do with clamping. Indeed this is a very old bug.
>Frankly speaking I even was aware about this at some moment
>(so that if the sender was Linux it would not happen. It was fixed there. :-))
>
>With such small window sender is forced to send a tiny segment
>via SWS override timer and receiver is confused in believing this
>is sender's mss, so the result is broken SWS avoidance.
>
>I think you can cure it deleting the following lines:
>
> /* If PSH is not set, packet should be
> * full sized, provided peer TCP is not badly broken.
> * This observation (if it is correct 8)) allows
> * to handle super-low mtu links fairly.
> */
> (len >= TCP_MIN_MSS + sizeof(struct tcphdr) &&
> !(tcp_flag_word(skb->h.th)&TCP_REMNANT))) {
>
>
>in tcp_input.c:tcp_measure_rcv_mss() at receiver side.
>
>
>Alexey
>
>
>
--
=======================================
Guillaume Autran
Senior Software Engineer
MRV Communications, Inc.
Tel: (978) 952-4932 office
E-mail: gautran@mrv.com
=======================================
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
[not found] ` <431877EE.6010101@mrv.com>
@ 2005-09-02 17:32 ` Alexey Kuznetsov
2005-09-02 18:56 ` Guillaume Autran
0 siblings, 1 reply; 30+ messages in thread
From: Alexey Kuznetsov @ 2005-09-02 17:32 UTC (permalink / raw)
To: Guillaume Autran
Cc: Alexey Kuznetsov, John Heffner, Ion Badulescu, David S. Miller,
linux-net, netdev
Hello!
> Do you think this will also fix Ion's issue with small window size never
> going back up ?
I was wrong even about this one. That bad case, which I remembered,
is not triggered here. And even if packet lengths and windows were modified
to trigger it, the effect would not be so pathological.
12:23:24.474506 IP 10.10.10.3.3560 > 10.10.10.2.3200: P 13323:14703(1380) ack 1 win 6144 <nop,nop,timestamp 3256371 268597947>
12:23:24.508950 IP 10.10.10.2.3200 > 10.10.10.3.3560: . ack 14703 win 14 <nop,nop,timestamp 268598804 3256371>
This value for the window is OK: we advertised 1394, so we have to reply with 14.
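
The arithmetic behind that statement, as a Python sketch: a receiver that has freed no buffer space keeps the right edge of its window (ack + win) where it was, so each newly acknowledged byte comes off the advertised window. The earlier ACK is not shown in the excerpt, so the previous ack value of 13323 (the start of the 1380-byte segment) is an assumption.

```python
def next_window(prev_ack: int, prev_win: int, new_ack: int) -> int:
    """Window to advertise if the right edge (ack + win) stays put."""
    right_edge = prev_ack + prev_win
    return right_edge - new_ack


# Assumed previous ACK: ack 13323, win 1394. The segment 13323:14703 then arrives.
win = next_window(prev_ack=13323, prev_win=1394, new_ack=14703)
print("window after the 1380-byte segment:", win)
print("right edge:", 14703 + win)
```

The resulting right edge matches the end of the 14-byte segment 14703:14717 that the sender emits five seconds later.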
But where is the ACK opening full window after receiver application
reads data from buffer? It is the puzzle. It looks like rcvbuf is still
full.
Now sender cannot send anything due to SWS avoidance.
12:23:29.362161 IP 10.10.10.3.3560 > 10.10.10.2.3200: . 14703:14717(14) ack 1 win 6144 <nop,nop,timestamp 3256380 268597947>
I interpret this as SWS avoidance override timer.
12:23:29.362791 IP 10.10.10.2.3200 > 10.10.10.3.3560: . ack 14717 win 14 <nop,nop,timestamp 268599289 3256380>
This is impossible. :-) Well, it is possible, if rcv_mss is 14. It is what
I thought, but it is impossible. :-)
Honestly, I still cannot think of any way this could happen.
Can you say what setsockopt()s were made on the receiver socket? It looks
like just fiddling with SO_RCVBUF is not enough.
Alexey
^ permalink raw reply [flat|nested] 30+ messages in thread
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
2005-09-01 22:53 ` Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels Ion Badulescu
2005-09-01 23:37 ` Jesper Juhl
2005-09-02 2:51 ` John Heffner
@ 2005-09-02 18:36 ` Alexey Kuznetsov
2005-09-02 20:57 ` Ion Badulescu
2005-09-28 16:31 ` Ion Badulescu
2 siblings, 2 replies; 30+ messages in thread
From: Alexey Kuznetsov @ 2005-09-02 18:36 UTC (permalink / raw)
To: Ion Badulescu; +Cc: David S. Miller, linux-kernel, linux-net, netdev
Hello!
> This is where things start going bad. The window starts shrinking from
> 15340 all the way down to 2355 over the course of 0.3 seconds. Notice the
> many duplicate acks that serve no purpose
These are not duplicates: the TCP_NODELAY sender just starts flooding
tiny segments, and those are normal ACKs acking those segments. Note that
the ACK field is not the same.
> Five minutes later the TCP window is still at 2355,
....
> We are kind of stumped at this point, and it's proving to be a
> show-stopping bug for our purposes, especially over WAN links that have
> higher latency (for obvious reasons). Any kind of assistance would be
> greatly appreciated.
I still do not know how the value of 184 is possible in your case,
I would expect 730 as an absolute possible minimum. I see 9420 (2355*4).
Anyway, ignoring this puzzle, the following patch for 2.4 should help.
--- net/ipv4/tcp_input.c.orig 2003-02-20 20:38:39.000000000 +0300
+++ net/ipv4/tcp_input.c 2005-09-02 22:28:00.845952888 +0400
@@ -343,8 +343,6 @@
app_win -= tp->ack.rcv_mss;
app_win = max(app_win, 2U*tp->advmss);
- if (!ofo_win)
- tp->window_clamp = min(tp->window_clamp, app_win);
tp->rcv_ssthresh = min(tp->window_clamp, 2U*tp->advmss);
}
}
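A minimal userspace sketch of the branch this patch touches, modelled on
the 2.6 version of tcp_clamp_window() quoted later in this thread. The
struct, the helper names and the "patched" flag are hypothetical stand-ins
for illustration, not kernel API. It shows why window_clamp could only
ever shrink on this path, and that the patched branch leaves it alone.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the few tcp_sock fields involved. */
struct win_model {
    unsigned int window_clamp;  /* upper bound on the advertised window */
    unsigned int rcv_ssthresh;  /* current window growth threshold */
    unsigned int advmss;        /* MSS we advertised to the peer */
    unsigned int rcv_mss;       /* estimate of the peer's segment size */
};

static unsigned int min_u(unsigned int a, unsigned int b) { return a < b ? a : b; }
static unsigned int max_u(unsigned int a, unsigned int b) { return a > b ? a : b; }

/*
 * Model of the overcommit branch of tcp_clamp_window().
 *   unread  bytes received but not yet read by the application
 *   ofo     bytes sitting in the out-of-order queue
 *   rmem    memory charged to the receive queue
 *   rcvbuf  receive buffer limit
 *   patched true = behave as if the two removed lines are gone
 */
static void clamp_window_model(struct win_model *tp, unsigned int unread,
                               unsigned int ofo, unsigned int rmem,
                               unsigned int rcvbuf, bool patched)
{
    unsigned int app_win = unread;

    if (rmem <= rcvbuf)
        return;                     /* no overcommit, nothing to do */

    app_win += ofo;
    if (rmem >= 2 * rcvbuf)
        app_win >>= 1;
    if (app_win > tp->rcv_mss)
        app_win -= tp->rcv_mss;
    app_win = max_u(app_win, 2U * tp->advmss);

    /* Unpatched: min() means window_clamp can shrink here but never grow. */
    if (!patched && !ofo)
        tp->window_clamp = min_u(tp->window_clamp, app_win);
    tp->rcv_ssthresh = min_u(tp->window_clamp, 2U * tp->advmss);
}
```

With advmss = 1460 and 4000 unread bytes, a single overcommit event in the
unpatched model drops window_clamp from 65535 to 2920 (2*advmss). Nothing
on this path raises it again. In the patched model only rcv_ssthresh is
reset, so the window can grow back later.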
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
2005-09-02 17:32 ` Alexey Kuznetsov
@ 2005-09-02 18:56 ` Guillaume Autran
2005-09-02 21:08 ` Alexey Kuznetsov
0 siblings, 1 reply; 30+ messages in thread
From: Guillaume Autran @ 2005-09-02 18:56 UTC (permalink / raw)
To: Alexey Kuznetsov
Cc: John Heffner, Ion Badulescu, David S. Miller, linux-net, netdev
The server socket options are all default, except for
TCP_WINDOW_CLAMP, which is set to 1400 (application specific).
Guillaume.
Alexey Kuznetsov wrote:
>Hello!
>
>
>
>>Do you think this will also fix Ion's issue with small window size never
>>going back up ?
>>
>>
>
>I was wrong even about this one. That bad case, which I remembered,
>is not triggered here. And even if packet lengths and windows were modified
>to trigger it, the effect would not be so pathological.
>
>
>12:23:24.474506 IP 10.10.10.3.3560 > 10.10.10.2.3200: P 13323:14703(1380) ack 1 win 6144 <nop,nop,timestamp 3256371 268597947>
>12:23:24.508950 IP 10.10.10.2.3200 > 10.10.10.3.3560: . ack 14703 win 14 <nop,nop,timestamp 268598804 3256371>
>
>This value for window is OK, we advertised 1394, so we have to reply with 14.
>
>But where is the ACK opening full window after receiver application
>reads data from buffer? It is the puzzle. It looks like rcvbuf is still
>full.
>
>Now sender cannot send anything due to SWS avoidance.
>
>12:23:29.362161 IP 10.10.10.3.3560 > 10.10.10.2.3200: . 14703:14717(14) ack 1 win 6144 <nop,nop,timestamp 3256380 268597947>
>
>I interpret this as SWS avoidance override timer.
>
>12:23:29.362791 IP 10.10.10.2.3200 > 10.10.10.3.3560: . ack 14717 win 14 <nop,nop,timestamp 268599289 3256380>
>
>This is impossible. :-) Well, it is possible, if rcv_mss is 14. It is what
>I thought, but it is impossible. :-)
>
>Honestly, I still cannot invent any way how this could happen.
>
>Can you say what setsockopt()s were made on the receiver socket? It looks
>like just fiddling with SO_RCVBUF is not enough.
>
>Alexey
>
>
>
--
=======================================
Guillaume Autran
Senior Software Engineer
MRV Communications, Inc.
Tel: (978) 952-4932 office
E-mail: gautran@mrv.com
=======================================
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
2005-09-02 18:36 ` Alexey Kuznetsov
@ 2005-09-02 20:57 ` Ion Badulescu
2005-09-02 21:18 ` Alexey Kuznetsov
2005-09-28 16:31 ` Ion Badulescu
1 sibling, 1 reply; 30+ messages in thread
From: Ion Badulescu @ 2005-09-02 20:57 UTC (permalink / raw)
To: Alexey Kuznetsov
Cc: Ion Badulescu, David S. Miller, linux-kernel, linux-net, netdev
Hi Alexey,
On Fri, 2 Sep 2005, Alexey Kuznetsov wrote:
>> This is where things start going bad. The window starts shrinking from
>> 15340 all the way down to 2355 over the course of 0.3 seconds. Notice the
>> many duplicate acks that serve no purpose
>
> These are not duplicate, TCP_NODELAY sender just starts flooding
> tiny segments, and those are normal ACKs acking those segments, note
> ACK field is not the same.
Well, take a look at the double acks for 84439343, 84440447 and 84441059,
they seem pretty much identical to me.
> I still do not know how the value of 184 is possible in your case,
> I would expect 730 as an absolute possible minimum. I see 9420 (2355*4).
The numbers I mentioned are straight from the tcpdump and are not scaled,
so they need to be multiplied by 4. But even 9420, combined with a RTT of
20ms, results in a total usable bandwidth of about 3.75 Mbps, not enough
for this real-time stream at peak times.
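The arithmetic behind that figure, as a small sketch. It assumes a
window-limited flow, where at most one window of data is in flight per
round trip, and a window scale shift of 2 as guessed in this thread. The
function names are made up for illustration.

```c
/* Bytes represented by a raw 16-bit window field under window scaling. */
static unsigned int scaled_window_bytes(unsigned int raw_win, unsigned int wscale)
{
    return raw_win << wscale;
}

/* Upper bound on throughput when the receive window is the bottleneck:
 * one window per round trip, converted to megabits per second. */
static double window_limited_mbps(unsigned int window_bytes, double rtt_seconds)
{
    return (double)window_bytes * 8.0 / rtt_seconds / 1e6;
}
```

scaled_window_bytes(2355, 2) gives 9420 bytes, and
window_limited_mbps(9420, 0.020) comes out just under 3.77 Mbps, which
matches the "about 3.75 Mbps" estimate above.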
Besides, it often gets even worse than 2355, all it takes is a few
application slowdowns.
> Anyway, ignoring this puzzle, the following patch for 2.4 should help.
>
>
> --- net/ipv4/tcp_input.c.orig 2003-02-20 20:38:39.000000000 +0300
> +++ net/ipv4/tcp_input.c 2005-09-02 22:28:00.845952888 +0400
> @@ -343,8 +343,6 @@
> app_win -= tp->ack.rcv_mss;
> app_win = max(app_win, 2U*tp->advmss);
>
> - if (!ofo_win)
> - tp->window_clamp = min(tp->window_clamp, app_win);
> tp->rcv_ssthresh = min(tp->window_clamp, 2U*tp->advmss);
> }
> }
That makes perfect sense...
I'll test it out on Tuesday, when I can connect again to the real-time
streams that we use.
Thanks a lot!
-Ion
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
2005-09-02 18:56 ` Guillaume Autran
@ 2005-09-02 21:08 ` Alexey Kuznetsov
0 siblings, 0 replies; 30+ messages in thread
From: Alexey Kuznetsov @ 2005-09-02 21:08 UTC (permalink / raw)
To: Guillaume Autran
Cc: Alexey Kuznetsov, John Heffner, Ion Badulescu, David S. Miller,
linux-net, netdev
Hello!
> The server socket options are all default, except for
> TCP_WINDOW_CLAMP, which is set to 1400 (application specific).
It is definitely not all. If you do not fiddle with SO_RCVBUF also,
you will always have receiver advertising window of 1400.
11:15:00.922119 IP 10.10.10.3.1150 > 10.10.10.2.3200: S 2246605788:2246605788(0) win 6144 <mss 1460,nop,wscale 0,nop,nop,timestamp 3248699 0>
11:15:00.922791 IP 10.10.10.2.3200 > 10.10.10.3.1150: S 3863556410:3863556410(0) ack 2246605789 win 1400 <mss 1460,nop,nop,timestamp 268188460 3248699,nop,wscale 0>
11:15:00.923118 IP 10.10.10.3.1150 > 10.10.10.2.3200: . ack 1 win 6144 <nop,nop,timestamp 3248699 268188460>
11:15:00.923486 IP 10.10.10.3.1150 > 10.10.10.2.3200: P 1:7(6) ack 1 win 6144 <nop,nop,timestamp 3248699 268188460>
11:15:00.924143 IP 10.10.10.2.3200 > 10.10.10.3.1150: . ack 7 win 1394 <nop,nop,timestamp 268188460 3248699>
cannot happen. SO_RCVBUF is not default.
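For reference, a sketch of how an application would set the two options
under discussion on Linux. The helper names are hypothetical. The 1400
clamp value is the one reported in this thread, but the SO_RCVBUF value
the application used is not known, so it is left as a parameter.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Cap the window this socket will ever advertise.
 * Returns 0 on success, -1 on error (errno set by setsockopt). */
static int set_window_clamp(int fd, int clamp_bytes)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_WINDOW_CLAMP,
                      &clamp_bytes, sizeof(clamp_bytes));
}

/* Request a receive buffer size. Setting it explicitly also locks the
 * buffer (SOCK_RCVBUF_LOCK), so the kernel will not grow it later. */
static int set_rcvbuf(int fd, int rcvbuf_bytes)
{
    return setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
                      &rcvbuf_bytes, sizeof(rcvbuf_bytes));
}

/* Read back what the kernel actually stored; Linux doubles the
 * requested value to leave room for bookkeeping overhead. */
static int get_rcvbuf(int fd, int *out)
{
    socklen_t len = sizeof(*out);
    return getsockopt(fd, SOL_SOCKET, SO_RCVBUF, out, &len);
}
```

Calling set_window_clamp(fd, 1400) alone should leave the advertised
window at 1400, as argued above. A window that drops as data arrives
(1400, then 1394) suggests that set_rcvbuf() or an equivalent was also
called with a small value.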
Alexey
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
2005-09-02 20:57 ` Ion Badulescu
@ 2005-09-02 21:18 ` Alexey Kuznetsov
2005-09-02 23:09 ` Ion Badulescu
0 siblings, 1 reply; 30+ messages in thread
From: Alexey Kuznetsov @ 2005-09-02 21:18 UTC (permalink / raw)
To: Ion Badulescu
Cc: Alexey Kuznetsov, Ion Badulescu, David S. Miller, linux-kernel,
linux-net, netdev
Hello!
> Well, take a look at the double acks for 84439343, 84440447 and 84441059,
> they seem pretty much identical to me.
It is just a little tcpdump glitch.
19:34:54.532271 < 10.2.20.246.33060 > 65.171.224.182.8700: . 44:44(0) ack 84439343 win 24544 <nop,nop,timestamp 226080638 99717832> (DF) (ttl 64, id 60946)
19:34:54.532432 < 10.2.20.246.33060 > 65.171.224.182.8700: . 44:44(0) ack 84439343 win 24544 <nop,nop,timestamp 226080638 99717832> (DF) (ttl 64, id 60946)
It is one ACK (look at IP ID), shown twice. This happens sometimes
with our packet socket.
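The check described here can be sketched as a comparison of a few header
fields. The struct and function names are made up for illustration. It
assumes, as in this trace, that the sender gives each packet it really
transmits its own IP ID, so two capture records with the same source and
IP ID are one packet shown twice rather than a genuine duplicate ACK.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical record holding the fields tcpdump printed for one line. */
struct captured_pkt {
    uint32_t src_ip;   /* source address, host byte order */
    uint16_t ip_id;    /* IP identification field */
    uint32_t tcp_ack;  /* acknowledgment number */
    uint16_t tcp_win;  /* raw advertised window */
};

/* True when two capture records describe the same packet on the wire. */
static bool same_packet_seen_twice(const struct captured_pkt *a,
                                   const struct captured_pkt *b)
{
    return a->src_ip == b->src_ip &&
           a->ip_id == b->ip_id &&
           a->tcp_ack == b->tcp_ack &&
           a->tcp_win == b->tcp_win;
}
```

For the two lines above (same source, id 60946, ack 84439343, win 24544)
this returns true. A real duplicate ACK would normally carry a different
IP ID.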
> >I still do not know how the value of 184 is possible in your case,
> >I would expect 730 as an absolute possible minimum. I see 9420 (2355*4).
>
> The numbers I mentioned are straight from the tcpdump and are not scaled,
I understood. I expected 184*4 when you said 184. But the minimum is
still 730 (unscaled 1460*2). If you really saw values lower than 730
(unscaled 1460*2), there is another, more severe problem and the suggested
patch will not solve it.
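The floor being described, as a sketch. It assumes advmss = 1460 and a
window scale shift of 2, both taken from the discussion rather than
confirmed by a trace of the handshake. The function names are
hypothetical.

```c
/* Smallest raw window field expected if the clamp never goes below
 * 2*advmss bytes: the byte floor shifted down by the window scale. */
static unsigned int min_raw_window(unsigned int advmss, unsigned int wscale)
{
    return (2U * advmss) >> wscale;
}

/* Nonzero when an observed raw window is below that floor. */
static int below_floor(unsigned int raw_win, unsigned int advmss,
                       unsigned int wscale)
{
    return raw_win < min_raw_window(advmss, wscale);
}
```

min_raw_window(1460, 2) is 730. A raw value of 2355 is above it, while
184 (or the 181 from the original report) is below, which is why such
values point to a different problem.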
Alexey
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
2005-09-02 21:18 ` Alexey Kuznetsov
@ 2005-09-02 23:09 ` Ion Badulescu
0 siblings, 0 replies; 30+ messages in thread
From: Ion Badulescu @ 2005-09-02 23:09 UTC (permalink / raw)
To: Alexey Kuznetsov
Cc: Ion Badulescu, David S. Miller, linux-kernel, linux-net, netdev
Hi Alexey,
On Sat, 3 Sep 2005, Alexey Kuznetsov wrote:
>> Well, take a look at the double acks for 84439343, 84440447 and 84441059,
>> they seem pretty much identical to me.
>
> It is just a little tcpdump glitch.
>
> 19:34:54.532271 < 10.2.20.246.33060 > 65.171.224.182.8700: . 44:44(0) ack 84439343 win 24544 <nop,nop,timestamp 226080638 99717832> (DF) (ttl 64, id 60946)
> 19:34:54.532432 < 10.2.20.246.33060 > 65.171.224.182.8700: . 44:44(0) ack 84439343 win 24544 <nop,nop,timestamp 226080638 99717832> (DF) (ttl 64, id 60946)
>
> It is one ACK (look at IP ID), shown twice. This happens sometimes
> with our packet socket.
Ahh... ack. :) That explains it.
> I understood. I expected 184*4 when you said 184. But the minimum is
> still 730 (unscaled 1460*2). If you really saw values lower than 730
> (unscaled 1460*2), there is another, more severe problem and the suggested
> patch will not solve it.
I really did see very small values. This one is plucked from one of
today's streams, after a full day's worth of data had passed through it:
19:03:19.659454 10.1.12.11.8001 > 10.2.10.212.56690: P 3:6(3) ack 1 win 65529 <nop,nop,timestamp 27146219 3617561665> (DF)
19:03:19.659462 10.2.10.212.56690 > 10.1.12.11.8001: . ack 6 win 181 <nop,nop,timestamp 3617562713 27146219> (DF)
19:03:20.690719 10.1.12.11.8001 > 10.2.10.212.56690: P 6:9(3) ack 1 win 65529 <nop,nop,timestamp 27146230 3617562713> (DF)
19:03:20.690727 10.2.10.212.56690 > 10.1.12.11.8001: . ack 9 win 181 <nop,nop,timestamp 3617563744 27146230> (DF)
10.1.12.11 is the Win2k box, 10.2.10.212 is the Linux box. The socket
buffer sizes are the defaults, so the scaling is most likely 2^2. The
packets being exchanged at this point are just heartbeats.
On Tuesday I can try to capture a full session from the very beginning, if
you think it would help.
Thanks,
-Ion
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
2005-09-02 18:36 ` Alexey Kuznetsov
2005-09-02 20:57 ` Ion Badulescu
@ 2005-09-28 16:31 ` Ion Badulescu
2005-09-29 15:17 ` Alexey Kuznetsov
1 sibling, 1 reply; 30+ messages in thread
From: Ion Badulescu @ 2005-09-28 16:31 UTC (permalink / raw)
To: Alexey Kuznetsov; +Cc: David S. Miller, linux-kernel, linux-net, netdev
Hi Alexey,
On Fri, 2 Sep 2005, Alexey Kuznetsov wrote:
> Anyway, ignoring this puzzle, the following patch for 2.4 should help.
>
>
> --- net/ipv4/tcp_input.c.orig 2003-02-20 20:38:39.000000000 +0300
> +++ net/ipv4/tcp_input.c 2005-09-02 22:28:00.845952888 +0400
> @@ -343,8 +343,6 @@
> app_win -= tp->ack.rcv_mss;
> app_win = max(app_win, 2U*tp->advmss);
>
> - if (!ofo_win)
> - tp->window_clamp = min(tp->window_clamp, app_win);
> tp->rcv_ssthresh = min(tp->window_clamp, 2U*tp->advmss);
> }
> }
I'm very happy to report that the above patch, applied to 2.6.12.6, seems
to have cured the TCP window problem we were experiencing. I've been
testing it extensively over the last 4 weeks, and I have yet to see any
repeats of the previous behavior.
The TCP window now usually settles to around 4k (unscaled, 16k scaled),
which is smallish but good enough for our purposes.
Thanks,
-Ion
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
2005-09-28 16:31 ` Ion Badulescu
@ 2005-09-29 15:17 ` Alexey Kuznetsov
2005-09-29 15:34 ` Guillaume Autran
` (2 more replies)
0 siblings, 3 replies; 30+ messages in thread
From: Alexey Kuznetsov @ 2005-09-29 15:17 UTC (permalink / raw)
To: Ion Badulescu; +Cc: David S. Miller, linux-kernel, linux-net, netdev, gautran
Hello!
> >Anyway, ignoring this puzzle, the following patch for 2.4 should help.
> >
> >
> >--- net/ipv4/tcp_input.c.orig 2003-02-20 20:38:39.000000000 +0300
> >+++ net/ipv4/tcp_input.c 2005-09-02 22:28:00.845952888 +0400
> >@@ -343,8 +343,6 @@
> > app_win -= tp->ack.rcv_mss;
> > app_win = max(app_win, 2U*tp->advmss);
> >
> >- if (!ofo_win)
> >- tp->window_clamp = min(tp->window_clamp, app_win);
> > tp->rcv_ssthresh = min(tp->window_clamp, 2U*tp->advmss);
> > }
> >}
>
> I'm very happy to report that the above patch, applied to 2.6.12.6, seems
> to have cured the TCP window problem we were experiencing.
Good. I think the patch is to be applied to all mainstream kernels.
The only obstacle is the second report by Guillaume Autran <gautran@mrv.com>,
which has some allied characteristics, but after analysis it is something
impossible, read: a cryptic and severe bug. :-( I did not get a response
to the last query, so the investigation stalled.
Alexey
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
2005-09-29 15:17 ` Alexey Kuznetsov
@ 2005-09-29 15:34 ` Guillaume Autran
2005-09-29 16:04 ` John Heffner
2005-09-30 0:29 ` David S. Miller
2 siblings, 0 replies; 30+ messages in thread
From: Guillaume Autran @ 2005-09-29 15:34 UTC (permalink / raw)
To: Alexey Kuznetsov
Cc: Ion Badulescu, David S. Miller, linux-kernel, linux-net, netdev
Hi!
Sorry Alexey for keeping it quiet but I got pulled away to some other duties for
the past 3 weeks.
Anyway, the similar problem I was reporting has not been seen since that last
incident a month ago. We did change, on our application side, some of the
parameters (aka SO_RCVBUF) that did not need to be set in the first place (bug
on our side).
This plus your patch seems to have resolved the issues we were having. So, it's
all good!
Thanks again..
Guillaume.
Alexey Kuznetsov wrote:
> Hello!
>
>
>>>Anyway, ignoring this puzzle, the following patch for 2.4 should help.
>>>
>>>
>>>--- net/ipv4/tcp_input.c.orig 2003-02-20 20:38:39.000000000 +0300
>>>+++ net/ipv4/tcp_input.c 2005-09-02 22:28:00.845952888 +0400
>>>@@ -343,8 +343,6 @@
>>> app_win -= tp->ack.rcv_mss;
>>> app_win = max(app_win, 2U*tp->advmss);
>>>
>>>- if (!ofo_win)
>>>- tp->window_clamp = min(tp->window_clamp, app_win);
>>> tp->rcv_ssthresh = min(tp->window_clamp, 2U*tp->advmss);
>>> }
>>>}
>>
>>I'm very happy to report that the above patch, applied to 2.6.12.6, seems
>>to have cured the TCP window problem we were experiencing.
>
>
> Good. I think the patch is to be applied to all mainstream kernels.
>
> The only obstacle is the second report by Guillaume Autran <gautran@mrv.com>,
> which has some allied characteristics, but after analysis it is something
> impossible, read: a cryptic and severe bug. :-( I did not get a response
> to the last query, so the investigation stalled.
>
> Alexey
>
--
=======================================
Guillaume Autran
Senior Software Engineer
MRV Communications, Inc.
Tel: (978) 952-4932 office
=======================================
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
2005-09-29 15:17 ` Alexey Kuznetsov
2005-09-29 15:34 ` Guillaume Autran
@ 2005-09-29 16:04 ` John Heffner
2005-09-29 18:16 ` David S. Miller
2005-09-30 0:29 ` David S. Miller
2 siblings, 1 reply; 30+ messages in thread
From: John Heffner @ 2005-09-29 16:04 UTC (permalink / raw)
To: Alexey Kuznetsov
Cc: Ion Badulescu, David S. Miller, linux-kernel, linux-net, netdev,
gautran
[-- Attachment #1: Type: text/plain, Size: 2317 bytes --]
On Thursday 29 September 2005 11:17 am, Alexey Kuznetsov wrote:
> Hello!
>
> > >Anyway, ignoring this puzzle, the following patch for 2.4 should help.
> > >
> > >
> > >--- net/ipv4/tcp_input.c.orig 2003-02-20 20:38:39.000000000 +0300
> > >+++ net/ipv4/tcp_input.c 2005-09-02 22:28:00.845952888 +0400
> > >@@ -343,8 +343,6 @@
> > > app_win -= tp->ack.rcv_mss;
> > > app_win = max(app_win, 2U*tp->advmss);
> > >
> > >- if (!ofo_win)
> > >- tp->window_clamp = min(tp->window_clamp, app_win);
> > > tp->rcv_ssthresh = min(tp->window_clamp, 2U*tp->advmss);
> > > }
> > >}
> >
> > I'm very happy to report that the above patch, applied to 2.6.12.6, seems
> > to have cured the TCP window problem we were experiencing.
>
> Good. I think the patch is to be applied to all mainstream kernels.
Has anyone looked at the patch I sent out on Sept 9? It goes a few steps
further, addressing some additional problems. Original message below.
Thanks,
-John
-----
This is a patch for discussion addressing some receive buffer growing issues.
This is partially related to the thread "Possible BUG in IPv4 TCP window
handling..." last week.
Specifically it addresses the problem of an interaction between rcvbuf
moderation (receiver autotuning) and rcv_ssthresh. The problem occurs when
sending small packets to a receiver with a larger MTU. (A very common case I
have is a host with a 1500 byte MTU sending to a host with a 9k MTU.) In
such a case, the rcv_ssthresh code is targeting a window size corresponding
to filling up the current rcvbuf, not taking into account that the new rcvbuf
moderation may increase the rcvbuf size.
One hunk makes rcv_ssthresh use tcp_rmem[2] as the size target rather than
rcvbuf. The other changes the behavior when it overflows its memory bounds
with in-order data so that it tries to grow rcvbuf (the same as with
out-of-order data).
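A userspace sketch of the first hunk's effect. Only the first lines of
__tcp_grow_window() appear in the attached diff; the rest of the loop
here is reconstructed from memory of 2.6-era source and should be read
as an assumption. The same goes for tcp_adv_win_scale = 2, and all
function and parameter names are hypothetical.

```c
/* Model of tcp_win_from_space() assuming tcp_adv_win_scale = 2:
 * a quarter of the buffer space is set aside for overhead. */
static int win_from_space(int space)
{
    return space - (space >> 2);
}

/*
 * Model of __tcp_grow_window(). Returns the amount by which rcv_ssthresh
 * may grow for this skb, or 0 if the skb wastes too much memory relative
 * to the room left below the target.
 *   target_space: sk_rcvbuf before the patch, tcp_rmem[2] after it.
 */
static int grow_window_model(int rcv_ssthresh, int skb_truesize, int skb_len,
                             int rcv_mss, int target_space)
{
    int truesize = win_from_space(skb_truesize) / 2;
    int window = win_from_space(target_space) / 2;

    while (rcv_ssthresh <= window) {
        if (truesize <= skb_len)
            return 2 * rcv_mss;
        /* Tolerate less overhead the closer we are to the target. */
        truesize >>= 1;
        window >>= 1;
    }
    return 0;
}
```

Take a 1448-byte segment held in a buffer with a truesize of 10000 (an
illustrative figure for a 9k-MTU receiver) and rcv_ssthresh at 40000.
With a target of rcvbuf = 87380 the model returns 0, so the window stops
growing. With a target of tcp_rmem[2] = 4194304 (also illustrative) it
returns 2896, so growth continues.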
These changes should help my problem of mixed MTUs, and should also help the
case from last week's thread I think. (In both cases though you still need
tcp_rmem[2] to be set much larger than the TCP window.) One question is if
this is too aggressive at trying to increase rcvbuf if it's under memory
stress.
-John
Signed-off-by: John Heffner <jheffner@psc.edu>
[-- Attachment #2: rcv_ssthresh.diff --]
[-- Type: text/x-diff, Size: 2005 bytes --]
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -233,7 +233,7 @@ static int __tcp_grow_window(const struc
{
/* Optimize this! */
int truesize = tcp_win_from_space(skb->truesize)/2;
- int window = tcp_full_space(sk)/2;
+ int window = tcp_win_from_space(sysctl_tcp_rmem[2])/2;
while (tp->rcv_ssthresh <= window) {
if (truesize <= skb->len)
@@ -326,39 +326,18 @@ static void tcp_init_buffer_space(struct
static void tcp_clamp_window(struct sock *sk, struct tcp_sock *tp)
{
struct inet_connection_sock *icsk = inet_csk(sk);
- struct sk_buff *skb;
- unsigned int app_win = tp->rcv_nxt - tp->copied_seq;
- int ofo_win = 0;
icsk->icsk_ack.quick = 0;
- skb_queue_walk(&tp->out_of_order_queue, skb) {
- ofo_win += skb->len;
+ if (sk->sk_rcvbuf < sysctl_tcp_rmem[2] &&
+ !(sk->sk_userlocks & SOCK_RCVBUF_LOCK) &&
+ !tcp_memory_pressure &&
+ atomic_read(&tcp_memory_allocated) < sysctl_tcp_mem[0]) {
+ sk->sk_rcvbuf = min(atomic_read(&sk->sk_rmem_alloc),
+ sysctl_tcp_rmem[2]);
}
-
- /* If overcommit is due to out of order segments,
- * do not clamp window. Try to expand rcvbuf instead.
- */
- if (ofo_win) {
- if (sk->sk_rcvbuf < sysctl_tcp_rmem[2] &&
- !(sk->sk_userlocks & SOCK_RCVBUF_LOCK) &&
- !tcp_memory_pressure &&
- atomic_read(&tcp_memory_allocated) < sysctl_tcp_mem[0])
- sk->sk_rcvbuf = min(atomic_read(&sk->sk_rmem_alloc),
- sysctl_tcp_rmem[2]);
- }
- if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf) {
- app_win += ofo_win;
- if (atomic_read(&sk->sk_rmem_alloc) >= 2 * sk->sk_rcvbuf)
- app_win >>= 1;
- if (app_win > icsk->icsk_ack.rcv_mss)
- app_win -= icsk->icsk_ack.rcv_mss;
- app_win = max(app_win, 2U*tp->advmss);
-
- if (!ofo_win)
- tp->window_clamp = min(tp->window_clamp, app_win);
+ if (atomic_read(&sk->sk_rmem_alloc) > sk->sk_rcvbuf)
tp->rcv_ssthresh = min(tp->window_clamp, 2U*tp->advmss);
- }
}
/* Receiver "autotuning" code.
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
2005-09-29 16:04 ` John Heffner
@ 2005-09-29 18:16 ` David S. Miller
0 siblings, 0 replies; 30+ messages in thread
From: David S. Miller @ 2005-09-29 18:16 UTC (permalink / raw)
To: jheffner; +Cc: kuznet, lists, linux-kernel, linux-net, netdev, gautran
From: John Heffner <jheffner@psc.edu>
Date: Thu, 29 Sep 2005 12:04:28 -0400
> Has anyone looked at the patch I sent out on Sept 9? It goes a few steps
> further, addressing some additional problems. Original message below.
It's in my inbox pending review, so it's not forgotten :-)
My gut instinct right now is that we should put Alexey's
2-liner in for 2.6.14 et al. then be thinking about your
scheme for 2.6.15
* Re: Possible BUG in IPv4 TCP window handling, all recent 2.4.x/2.6.x kernels
2005-09-29 15:17 ` Alexey Kuznetsov
2005-09-29 15:34 ` Guillaume Autran
2005-09-29 16:04 ` John Heffner
@ 2005-09-30 0:29 ` David S. Miller
2 siblings, 0 replies; 30+ messages in thread
From: David S. Miller @ 2005-09-30 0:29 UTC (permalink / raw)
To: kuznet; +Cc: lists, linux-kernel, linux-net, netdev, gautran
From: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Date: Thu, 29 Sep 2005 19:17:29 +0400
> Good. I think the patch is to be applied to all mainstream kernels.
Done, thanks everyone.