Re: Problem with tcp (2.6.31) as first, http://bugzilla.kernel.org/show

netdev.vger.kernel.org archive mirror

* Re: Problem with tcp (2.6.31) as first, http://bugzilla.kernel.org/show_bug.cgi?id=14580
From: Krzysztof Olędzki @ 2009-11-27 11:43 UTC
  To: Ilpo Järvinen; +Cc: Eric Dumazet, David Miller, Herbert Xu, netdev

On 2009-11-27 12:04, Ilpo Järvinen wrote:
> On Fri, 27 Nov 2009, Krzysztof Oledzki wrote:
> 
>>
>> On Thu, 26 Nov 2009, Eric Dumazet wrote:
>>
>>> Ilpo Järvinen a écrit :
>>>> On Thu, 26 Nov 2009, Eric Dumazet wrote:
>>>>
>>>>> Eric Dumazet a écrit :
>>>>>> Krzysztof Olędzki a écrit :
>>>>>>> On 2009-11-26 21:47, Eric Dumazet wrote:
>>>>>>>
>>>>>>>> About wscale being not sent, I suppose last kernel is fixed
>>>>>>> Thanks, I'll check it. But it is quite strange that, while reusing the
>>>>>>> old connection, Linux uses wscale==0.
>>>>>>>
>>>>>> It only 'reuses' the sequence of the previous connection to compute its ISN.
>>>>>>
>>>>>> It's a new socket, a new connection, with possibly different RCVBUF
>>>>>> settings -> different window.
>>>>>>
>>>>>> In my tests on net-next-2.6, I always have wscale set, but I am using
>>>>>> a program of my own,
>>>>>> not full NFS setup.
>>>>>>
>>>>>>
>>>>> Well, it seems NFS reuses its socket, so maybe we miss some cleaning
>>>>> as spotted in this old patch :
>>>> ...Nice, so we have this reuse of socket after all. ...It seems that our
>>>> other bugs might have just been solved (wq purge can then cause stale
>>>> hints if this reusing is indeed true).
>>>>
>>> Indeed, and we can do this in user space too :)
>>>
>>>
>>>        sockaddr.sin_family = AF_INET;
>>>        sockaddr.sin_port = htons(PORT);
>>>        sockaddr.sin_addr.s_addr = inet_addr("192.168.20.112");
>>>        res = connect(fd, (struct sockaddr *)&sockaddr, sizeof(sockaddr));
>>> ...
>>>
>>> /*
>>> * following code calls tcp_disconnect()
>>> */
>>>        memset(&sockaddr, 0, sizeof(sockaddr));
>>>        sockaddr.sin_family = AF_UNSPEC;
>>>        connect(fd, (struct sockaddr *)&sockaddr, sizeof(sockaddr));
>>>
>>> /* reuse socket and reconnect on same target */
>>>        sockaddr.sin_family = AF_INET;
>>>        sockaddr.sin_port = htons(PORT);
>>>        sockaddr.sin_addr.s_addr = inet_addr("192.168.20.112");
>>>        res = connect(fd, (struct sockaddr *)&sockaddr, sizeof(sockaddr));
>>>
>>>
>>> I reproduced the problem (of too small window, wscale = 0 instead of 6)
>>>
>>>
>>>
>>> 23:14:53.608106 IP client.3434 > 192.168.20.112.333: S
>>> 392872616:392872616(0) win 5840 <mss 1460,nop,nop,timestamp 82516578
>>> 0,nop,wscale 6>
>>> 23:14:53.608199 IP 192.168.20.112.333 > client.3434: S
>>> 2753948468:2753948468(0) ack 392872617 win 5792 <mss 1460,nop,nop,timestamp
>>> 1660900486 82516578,nop,wscale 6>
>>> 23:14:53.608218 IP client.3434 > 192.168.20.112.333: . ack 1 win 92
>>> <nop,nop,timestamp 82516578 1660900486>
>>> 23:14:53.608232 IP client.3434 > 192.168.20.112.333: P 1:7(6) ack 1 win 92
>>> <nop,nop,timestamp 82516578 1660900486>
>>> 23:14:53.608320 IP 192.168.20.112.333 > client.3434: . ack 7 win 91
>>> <nop,nop,timestamp 1660900486 82516578>
>>> 23:14:53.608328 IP 192.168.20.112.333 > client.3434: P 1:7(6) ack 7 win 91
>>> <nop,nop,timestamp 1660900486 82516578>
>>> 23:14:53.608331 IP 192.168.20.112.333 > client.3434: F 7:7(0) ack 7 win 91
>>> <nop,nop,timestamp 1660900486 82516578>
>>> 23:14:53.608341 IP client.3434 > 192.168.20.112.333: . ack 7 win 92
>>> <nop,nop,timestamp 82516578 1660900486>
>>> 23:14:53.647202 IP client.3434 > 192.168.20.112.333: . ack 8 win 92
>>> <nop,nop,timestamp 82516618 1660900486>
>>> 23:14:56.614341 IP client.3434 > 192.168.20.112.333: F 7:7(0) ack 8 win 92
>>> <nop,nop,timestamp 82519584 1660900486>
>>> 23:14:56.614439 IP 192.168.20.112.333 > client.3434: . ack 8 win 91
>>> <nop,nop,timestamp 1660903493 82519584>
>>> 23:14:56.614461 IP client.3434 > 192.168.20.112.333: R
>>> 392872624:392872624(0) win 0
>>>
>>> <<HERE : win = 5840 wscale = 0>>
>>> 23:14:56.616260 IP client.3434 > 192.168.20.112.333: S
>>> 392878450:392878450(0) win 5840 <mss 1460,nop,nop,timestamp 82519586
>>> 0,nop,wscale 0>
>>>
>>> 23:14:56.616352 IP 192.168.20.112.333 > client.3434: S
>>> 2800950724:2800950724(0) ack 392878451 win 5792 <mss 1460,nop,nop,timestamp
>>> 1660903494 82519586,nop,wscale 6>
>>>
>>>
>>>
>>> Following patch solves this problem, but maybe we need a flag
>>> (a la sk->sk_userlocks |= SOCK_WINCLAMP_LOCK;)
>>> in case user set window_clamp.
>>> Or just document the clearing after a tcp disconnect ?
>>>
>>> [PATCH] tcp: tcp_disconnect() should clear window_clamp
>>>
>>> Or reuse of socket possibly selects a small window, wscale = 0 for next
>>> connection.
>>>
>>> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
>>> index 524f976..7d4648f 100644
>>> --- a/net/ipv4/tcp.c
>>> +++ b/net/ipv4/tcp.c
>>> @@ -2059,6 +2059,7 @@ int tcp_disconnect(struct sock *sk, int flags)
>>> 	tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
>>> 	tp->snd_cwnd_cnt = 0;
>>> 	tp->bytes_acked = 0;
>>> +	tp->window_clamp = 0;
>>> 	tcp_set_ca_state(sk, TCP_CA_Open);
>>> 	tcp_clear_retrans(tp);
>>> 	inet_csk_delack_init(sk);
>>>
>> Thanks!
>>
>> 10:07:31.429627 IP 192.168.152.205.44678 > 192.168.152.20.2049: Flags [S], seq
>> 2012792102, win 5840, options [mss 1460,sackOK,TS val 4294877898 ecr
>> 0,nop,wscale 7], length 0
>> 10:07:31.429736 IP 192.168.152.20.2049 > 192.168.152.205.44678: Flags [S.],
>> seq 1548680033, ack 2012792103, win 5792, options [mss 1460,sackOK,TS val
>> 68439846 ecr 4294877898,nop,wscale 7], length 0
>> (switching servers)
>> 10:08:05.186989 IP 192.168.152.20.2049 > 192.168.152.205.44678: Flags [R], seq
>> 1548680550, win 0, length 0
>> 10:08:11.187117 IP 192.168.152.205.44678 > 192.168.152.20.2049: Flags [S], seq
>> 2012804321, win 5840, options [mss 1460,sackOK,TS val 4294917656 ecr
>> 0,nop,wscale 7], length 0
>> 10:08:11.187276 IP 192.168.152.20.2049 > 192.168.152.205.44678: Flags [S.],
>> seq 2176044714, ack 2012804322, win 5792, options [mss 1460,sackOK,TS val
>> 68482560 ecr 4294917656,nop,wscale 7], length 0
>>
>> This indeed fixes the problem with the missing/zero wscale, but the original
>> problem (tcp loop flood) still remains. I wonder why the client is not able to
>> handle it, especially since the seq numbers received from the two servers
>> differ by much, much more than the current window size: 627364681 is much
>> larger than 5840 << 7 (747520).
> 
> What would you expect to happen? If out-of-window stuff arrives we send 
> dupacks. If we would send resets, that would introduce blind rst attacks.
> In theory we might be able to quench the loop by using pingpong thing but 
> that needs very careful thought in order to not introduce other problems,
> and even then your connections will not be re-usable until either end 
> times out so the gain is rather limited. We simply cannot rst the 
> connection, that's not an option.

Right, the idea of sending RST is indeed stupid. But at the risk of being
silly, why do we need to send anything in response to out-of-window
packets? Especially as we are doing it without rate limiting, even for a
packet that contains no data, only a pure ACK. The current situation can
easily be abused for a hard-to-trace DoS: just send a lot of spoofed
packets to a port from an established connection and the server will
respond at the same rate, flooding the client.
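
To make the out-of-window case concrete, here is a minimal sketch in C of a
receive-window test using modulo-2^32 sequence arithmetic. It is a
simplification for illustration only, not the kernel's actual acceptance
check; the function names seq_distance() and seq_in_window() are hypothetical,
and the constants are the two server sequence numbers and the 5840 << 7
window figure quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values taken from the tcpdump output quoted earlier in this mail. */
#define OLD_SERVER_ISN 1548680033u  /* SYN-ACK seq from the first server  */
#define NEW_SERVER_ISN 2176044714u  /* SYN-ACK seq from the second server */
#define CLIENT_WINDOW  (5840u << 7) /* 747520, the figure used in the mail */

/* Forward distance from 'from' to 'to' in 32-bit sequence space. */
static uint32_t seq_distance(uint32_t from, uint32_t to)
{
    return to - from; /* unsigned subtraction wraps modulo 2^32 */
}

/*
 * Hypothetical helper: true if 'seq' lies in [rcv_nxt, rcv_nxt + rcv_wnd).
 * A segment failing this test is "out of window"; per the discussion it is
 * answered with a duplicate ACK rather than a RST.
 */
static bool seq_in_window(uint32_t seq, uint32_t rcv_nxt, uint32_t rcv_wnd)
{
    return seq_distance(rcv_nxt, seq) < rcv_wnd;
}
```

With these numbers, seq_distance(OLD_SERVER_ISN, NEW_SERVER_ISN) is 627364681,
far beyond the 747520-byte window, so seq_in_window() reports the second
server's segments as out of window.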

> I find this problem simply stem from the introduced loss of end-to-end 
> connectivity. Would you really "lose" that server so that its TCP state is 
> not maintained, you'd get resets etc (crash, scheduled reboot or 
> whatever).

In a split-brain situation, when your network is temporarily segmented, two
redundant servers may take the same VIP simultaneously. When such a
network restores full functionality, one of the servers loses the IP.
That is how I found this problem.

> Only real solution would be a kill switch for TCP connection 
> when you break e-2-e connectivity (ie., switch servers so that the same IP 
> is reacquired by somebody else). In theory you can "simulate" the kill 
> switch by setting tcp_retries sysctls to small values to make the 
> connections to timeout much faster, but still that might not be enough for 
> you (and has other implications you might not like). 

Now I wonder - maybe we can simply kill ESTABLISHED connections
that use an address being removed?

>> But there is one more thing that still bugs me, please look at one of my
>> previous dump:
>>
>> 17:39:48.339503 IP (tos 0x0, ttl 64, id 31305, offset 0, flags [DF], proto TCP
>> (6), length 56)
>>      192.168.152.205.55329 > 192.168.152.20.2049: Flags [S], cksum 0x7d35
>> (correct), seq 3093379972, win 5840, options [mss 1460,sackOK,TS val 16845 ecr
>> 0], length 0
>>
>> 17:39:48.339588 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP
>> (6), length 56)
>>      192.168.152.20.2049 > 192.168.152.205.55329: Flags [S.], cksum 0x7930
>> (correct), seq 4250661905, ack 3093379973, win 5792, options [mss
>> 1460,sackOK,TS val 9179690 ecr 16845], length 0
>>
>> OK, now we know that the client is buggy and sends small windows, but why the
>> response from the server also contains so small window?
> 
> Perhaps I don't fully understand what you find here to be a problem... 
> Anyway, initially we start with small window and enlarge it while we keep 
> going (receiver window auto-tuning).

Yes, Eric already explained it to me. I was just wondering why we
start with such a small window here. Normally, with wscale enabled the
window is much larger, so without wscale it should be larger too,
within the 16-bit limit of course. But as windows can grow later,
it is not a problem.

Best regards,

				Krzysztof Olędzki


* Re: Problem with tcp (2.6.31) as first, http://bugzilla.kernel.org/show_bug.cgi?id=14580
From: Eric Dumazet @ 2009-11-27 11:48 UTC
  To: Ilpo Järvinen
  Cc: Krzysztof Oledzki, David Miller, Herbert Xu, Linux Netdev List

Ilpo Järvinen a écrit :

> What would you expect to happen? If out-of-window stuff arrives we send 
> dupacks. If we would send resets, that would introduce blind rst attacks.
> In theory we might be able to quench the loop by using pingpong thing but 
> that needs very careful thought in order to not introduce other problems,
> and even then your connections will not be re-usable until either end 
> times out so the gain is rather limited. We simply cannot rst the 
> connection, that's not an option.
> 
> I find this problem simply stem from the introduced loss of end-to-end 
> connectivity. Would you really "lose" that server so that its TCP state is 
> not maintained, you'd get resets etc (crash, scheduled reboot or 
> whatever). Only real solution would be a kill switch for TCP connection 
> when you break e-2-e connectivity (ie., switch servers so that the same IP 
> is reacquired by somebody else). In theory you can "simulate" the kill 
> switch by setting tcp_retries sysctls to small values to make the 
> connections to timeout much faster, but still that might not be enough for 
> you (and has other implications you might not like). 
>

RST is not an option, sure, but ACK storms are hardly a good thing either.


Couldn't we do something smart in the presence of TCP timestamps?

11:23:27.669910 IP 192.168.20.110.3434 > 192.168.200.200.333: . ack 2457299512 win 92 <nop,nop,timestamp 42408589 1506086404>
11:23:27.669991 IP 192.168.200.200.333 > 192.168.20.110.3434: . ack 11687 win 91 <nop,nop,timestamp 1704614538 42406583>
11:23:27.670000 IP 192.168.20.110.3434 > 192.168.200.200.333: . ack 2457299512 win 92 <nop,nop,timestamp 42408589 1506086404>
11:23:27.670093 IP 192.168.200.200.333 > 192.168.20.110.3434: . ack 11687 win 91 <nop,nop,timestamp 1704614538 42406583>
11:23:27.670099 IP 192.168.20.110.3434 > 192.168.200.200.333: . ack 2457299512 win 92 <nop,nop,timestamp 42408589 1506086404>
11:23:27.670175 IP 192.168.200.200.333 > 192.168.20.110.3434: . ack 11687 win 91 <nop,nop,timestamp 1704614538 42406583>
11:23:27.670183 IP 192.168.20.110.3434 > 192.168.200.200.333: . ack 2457299512 win 92 <nop,nop,timestamp 42408589 1506086404>
11:23:27.670268 IP 192.168.200.200.333 > 192.168.20.110.3434: . ack 11687 win 91 <nop,nop,timestamp 1704614538 42406583>
11:23:27.670276 IP 192.168.20.110.3434 > 192.168.200.200.333: . ack 2457299512 win 92 <nop,nop,timestamp 42408589 1506086404>
11:23:27.670359 IP 192.168.200.200.333 > 192.168.20.110.3434: . ack 11687 win 91 <nop,nop,timestamp 1704614538 42406583>
11:23:27.670368 IP 192.168.20.110.3434 > 192.168.200.200.333: . ack 2457299512 win 92 <nop,nop,timestamp 42408589 1506086404>


Or we could count the number N of strange/bad ACKs received from the peer:

- On the first one, send our ACK immediately.

- On the following ones, delay our ACK answer by N*100 ms, to reduce the flood
  (or, if we have data in flight, rely only on the retransmit timer and send no ACKs).
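
The counting scheme above could be sketched in user-space C roughly as
follows. This only illustrates the delay computation, not kernel code: the
names bad_ack_state, bad_ack_delay_ms() and bad_ack_reset() are hypothetical,
N is assumed to include the ACK currently being handled, and the 1000 ms cap
is an added assumption that is not part of the proposal.

```c
#include <stdint.h>

#define BAD_ACK_DELAY_STEP_MS 100u  /* the N*100 ms step from the proposal    */
#define BAD_ACK_DELAY_MAX_MS  1000u /* hypothetical cap, not in the proposal  */

/* Per-connection counter of strange/bad ACKs received from the peer. */
struct bad_ack_state {
    uint32_t count;
};

/*
 * Record one more bad ACK and return how long to wait, in milliseconds,
 * before answering it: 0 for the first one, N*100 ms for the Nth one,
 * never more than BAD_ACK_DELAY_MAX_MS.
 */
static uint32_t bad_ack_delay_ms(struct bad_ack_state *st)
{
    if (st->count < UINT32_MAX)
        st->count++;
    if (st->count == 1)
        return 0; /* first bad ACK: answer immediately */
    if (st->count >= BAD_ACK_DELAY_MAX_MS / BAD_ACK_DELAY_STEP_MS)
        return BAD_ACK_DELAY_MAX_MS;
    return st->count * BAD_ACK_DELAY_STEP_MS;
}

/* Assumed policy: forget the history once an acceptable ACK arrives. */
static void bad_ack_reset(struct bad_ack_state *st)
{
    st->count = 0;
}
```

Starting from a zeroed bad_ack_state, successive calls return 0, 200, 300 and
so on up to the cap, so two peers stuck in an ACK loop would slow down
quickly instead of exchanging ACKs at line rate.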


* Re: Problem with tcp (2.6.31) as first
From: Eric Dumazet @ 2009-11-30  9:12 UTC
  To: David Miller; +Cc: ilpo.jarvinen, ole, herbert, Linux Netdev List

David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 26 Nov 2009 23:42:46 +0100
> 
>> Following patch solves this problem, but maybe we need a flag
>> (a la sk->sk_userlocks |= SOCK_WINCLAMP_LOCK;)
>> in case user set window_clamp.
>> Or just document the clearing after a tcp disconnect ?
>>
>> [PATCH] tcp: tcp_disconnect() should clear window_clamp
>>
>> Or reuse of socket possibly selects a small window, wscale = 0 for next connection.
> 
> Eric, can you post this with proper signoff to netdev?
> 
> Thanks.

Sure, here it is.

Thanks

[PATCH] tcp: tcp_disconnect() should clear window_clamp

NFS can reuse its TCP socket after calling tcp_disconnect().
We noticed window scaling was not negotiated in SYN packet of next connection request.

Fix is to clear tp->window_clamp in tcp_disconnect().

Reported-by: Krzysztof Oledzki <ole@ans.pl>
Tested-by: Krzysztof Oledzki <ole@ans.pl>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---

diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index f1813bc..d7a884c 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2059,6 +2059,7 @@ int tcp_disconnect(struct sock *sk, int flags)
 	tp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
 	tp->snd_cwnd_cnt = 0;
 	tp->bytes_acked = 0;
+	tp->window_clamp = 0;
 	tcp_set_ca_state(sk, TCP_CA_Open);
 	tcp_clear_retrans(tp);
 	inet_csk_delack_init(sk);


* Re: Problem with tcp (2.6.31) as first
From: David Miller @ 2009-11-30 20:56 UTC
  To: eric.dumazet; +Cc: ilpo.jarvinen, ole, herbert, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Mon, 30 Nov 2009 10:12:19 +0100

> [PATCH] tcp: tcp_disconnect() should clear window_clamp
> 
> NFS can reuse its TCP socket after calling tcp_disconnect().
> We noticed window scaling was not negotiated in SYN packet of next connection request.
> 
> Fix is to clear tp->window_clamp in tcp_disconnect().
> 
> Reported-by: Krzysztof Oledzki <ole@ans.pl>
> Tested-by: Krzysztof Oledzki <ole@ans.pl>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>

Applied, thanks Eric.

You voiced some concerns about window clamp route and socket settings
and whatnot.  But I think the route side should be OK.  When a new
connect() occurs, TCP will reinitialize the window clamp setting based
upon the route metric and other values.

The TCP_WINDOW_CLAMP socket option case is tedious however.  If we
want to preserve such settings across disconnect operations we'll need
to store two pieces of state I believe.
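
For anyone wanting to observe the TCP_WINDOW_CLAMP case from user space, a
minimal Linux-only sketch in C might look like the one below. It sets the
option on a fresh socket, triggers the disconnect path with
connect(AF_UNSPEC) as in Eric's earlier test program, and reads the option
back before and after. The helper name clamp_across_disconnect() is
hypothetical, and the sketch assumes the kernel accepts an AF_UNSPEC connect
on a socket that was never connected; what the second read returns depends on
the kernel in use.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/*
 * Hypothetical helper: set TCP_WINDOW_CLAMP to 'clamp', disconnect the
 * socket via connect(AF_UNSPEC), and report the option value seen before
 * and after the disconnect. Returns 0 on success, -1 if a call fails.
 */
static int clamp_across_disconnect(int clamp, int *before, int *after)
{
    struct sockaddr_in unspec;
    socklen_t len;
    int rc = -1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;

    /* User-requested clamp, as an application might set it. */
    if (setsockopt(fd, IPPROTO_TCP, TCP_WINDOW_CLAMP, &clamp, sizeof(clamp)) < 0)
        goto out;

    len = sizeof(*before);
    if (getsockopt(fd, IPPROTO_TCP, TCP_WINDOW_CLAMP, before, &len) < 0)
        goto out;

    /* Same trick as in the test program: AF_UNSPEC requests a disconnect. */
    memset(&unspec, 0, sizeof(unspec));
    unspec.sin_family = AF_UNSPEC;
    if (connect(fd, (struct sockaddr *)&unspec, sizeof(unspec)) < 0)
        goto out;

    len = sizeof(*after);
    if (getsockopt(fd, IPPROTO_TCP, TCP_WINDOW_CLAMP, after, &len) < 0)
        goto out;

    rc = 0;
out:
    close(fd);
    return rc;
}
```

On a kernel carrying the patch, one would expect the value read after the
disconnect to be cleared even though the application set it explicitly,
which is the situation that would need the extra state mentioned above.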
