NAT stops forwarding ACKs after PMTU discovery

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* NAT stops forwarding ACKs after PMTU discovery
@ 2013-08-18  5:55 Corey Hickey
  2013-08-18 15:24 ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Corey Hickey @ 2013-08-18  5:55 UTC (permalink / raw)
  To: Linux Netdev List

Hi,

If there is a better user-oriented list I should ask this question on,
please tell me; I've asked a couple similar questions on netdev before
and gotten some great help.

I'm having a problem wherein some NATted connections stop forwarding
packets. I've troubleshot the problem far enough to tell that it happens
when path MTU discovery happens--the ACK to the retransmitted packet
never gets forwarded back to my local client (and neither do the
retransmitted ACKs).

This is my setup:
----------------------------------------------------------------------
       MTU 9100                  MTU 1355  MTU 1500
client      --> linux router          -->  vpn  --> work host
198.18.0.3      198.18.0.1    (eth0)                10.15.24.13
                192.168.61.54 (tun0)
----------------------------------------------------------------------

I'm running openconnect on a Linux router to connect to a cisco VPN at
work. The Linux router does SNAT (MASQUERADE) over the virtual tun0
interface created by openconnect.

This is the best test I have so far:

$ sudo ip route flush cache ; ssh workhost.example.com exit
Read from socket failed: Connection reset by peer

Most of the time, this hangs for a few minutes before giving up;
sometimes it just works fine. tcpdumps indicate that when it works, all
packets transmitted are below the VPN's MTU; when it fails, MTU
discovery has happened:

1. client sends large packet, for example 1832 bytes
2. router sends ICMP fragmentation needed
3. client retransmits with a smaller packet, for example 1303 bytes
4. router forwards packet over VPN
5. work host ACKs packet
6. router receives ACK but does not forward it
7. both endpoints retransmit, but ACKs never get forwarded

The NAT table maintains an entry throughout (and continues to SNAT
retransmits from the client), but the ACKs from the server never get
forwarded.

fire:~# netstat-nat -n -d 10.15.24.13
Proto NATed Address           Destination Address            State
tcp   198.18.0.3:51076        10.15.24.13:22                 ESTABLISHED

I put a LOG target on the INPUT chain, though, and apparently these
packets are bypassing the NAT; the kernel does not think they should be
forwarded (else they would be on the FORWARD chain).

[23335.509084] IN=tun0 OUT= MAC= SRC=10.15.24.13 DST=192.168.61.54
LEN=64 TOS=0x00 PREC=0x00 TTL=62 ID=28270 DF PROTO=TCP SPT=22 DPT=51076
WINDOW=134 RES=0x00 ACK FIN URGP=0

The kernel on the router is 3.10-2-amd64 (from Debian testing). I had
the same problem with 3.2.0-4-amd64 (from Debian stable), before I tried
to see if I could upgrade the problem away.

Any help would be much appreciated. I can send full tcpdumps if needed.

Thanks,
Corey

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: NAT stops forwarding ACKs after PMTU discovery
  2013-08-18  5:55 NAT stops forwarding ACKs after PMTU discovery Corey Hickey
@ 2013-08-18 15:24 ` Eric Dumazet
  2013-08-18 16:59   ` Corey Hickey
  2013-08-18 21:23   ` Jozsef Kadlecsik
  0 siblings, 2 replies; 32+ messages in thread
From: Eric Dumazet @ 2013-08-18 15:24 UTC (permalink / raw)
  To: Corey Hickey; +Cc: Linux Netdev List, netfilter-devel

On Sat, 2013-08-17 at 22:55 -0700, Corey Hickey wrote:
> Hi,
> 
> If there is a better user-oriented list I should ask this question on,
> please tell me; I've asked a couple similar questions on netdev before
> and gotten some great help.
> 
> 
> I'm having a problem wherein some NATted connections stop forwarding
> packets. I've troubleshot the problem far enough to tell that it happens
> when path MTU discovery happens--the ACK to the retransmitted packet
> never gets forwarded back to my local client (and neither do the
> retransmitted ACKs).
> 
> This is my setup:
> ----------------------------------------------------------------------
>        MTU 9100                  MTU 1355  MTU 1500
> client      --> linux router          -->  vpn  --> work host
> 198.18.0.3      198.18.0.1    (eth0)                10.15.24.13
>                 192.168.61.54 (tun0)
> ----------------------------------------------------------------------
> 
> I'm running openconnect on a Linux router to connect to a cisco VPN at
> work. The Linux router does SNAT (MASQUERADE) over the virtual tun0
> interface created by openconnect.
> 
> This is the best test I have so far:
> 
> $ sudo ip route flush cache ; ssh workhost.example.com exit
> Read from socket failed: Connection reset by peer
> 
> Most of the time, this hangs for a few minutes before giving up;
> sometimes it just works fine. tcpdumps indicate that when it works, all
> packets transmitted are below the VPN's MTU; when it fails, MTU
> discovery has happened:
> 
> 1. client sends large packet, for example 1832 bytes
> 2. router sends ICMP fragmentation needed
> 3. client retransmits with a smaller packet, for example 1303 bytes
> 4. router forwards packet over VPN
> 5. work host ACKs packet
> 6. router receives ACK but does not forward it
> 7. both endpoints retransmit, but ACKs never get forwarded
> 
> The NAT table maintains an entry throughout (and continues to SNAT
> retransmits from the client), but the ACKs from the server never get
> forwarded.
> 
> fire:~# netstat-nat -n -d 10.15.24.13
> Proto NATed Address           Destination Address            State
> tcp   198.18.0.3:51076        10.15.24.13:22                 ESTABLISHED
> 
> 
> I put a LOG target on the INPUT chain, though, and apparently these
> packets are bypassing the NAT; the kernel does not think they should be
> forwarded (else they would be on the FORWARD chain).
> 
> [23335.509084] IN=tun0 OUT= MAC= SRC=10.15.24.13 DST=192.168.61.54
> LEN=64 TOS=0x00 PREC=0x00 TTL=62 ID=28270 DF PROTO=TCP SPT=22 DPT=51076
> WINDOW=134 RES=0x00 ACK FIN URGP=0
> 
> 
> The kernel on the router is 3.10-2-amd64 (from Debian testing). I had
> the same problem with 3.2.0-4-amd64 (from Debian stable), before I tried
> to see if I could upgrade the problem away.
> 
> Any help would be much appreciated. I can send full tcpdumps if needed.

This looks like the bug we had to fix recently : 

http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=356d7d88e088687b6578ca64601b0a2c9d145296

Could you try latest net tree ?




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: NAT stops forwarding ACKs after PMTU discovery
  2013-08-18 15:24 ` Eric Dumazet
@ 2013-08-18 16:59   ` Corey Hickey
  2013-08-18 21:23   ` Jozsef Kadlecsik
  1 sibling, 0 replies; 32+ messages in thread
From: Corey Hickey @ 2013-08-18 16:59 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Netdev List, netfilter-devel

On 2013-08-18 08:24, Eric Dumazet wrote:
> This looks like the bug we had to fix recently : 
> 
> http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=356d7d88e088687b6578ca64601b0a2c9d145296
> 
> Could you try latest net tree ?

Thanks, I will do so, hopefully this evening (busy today).

-Corey

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: NAT stops forwarding ACKs after PMTU discovery
  2013-08-18 15:24 ` Eric Dumazet
  2013-08-18 16:59   ` Corey Hickey
@ 2013-08-18 21:23   ` Jozsef Kadlecsik
  2013-08-19  0:00     ` Eric Dumazet
  1 sibling, 1 reply; 32+ messages in thread
From: Jozsef Kadlecsik @ 2013-08-18 21:23 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Corey Hickey, Linux Netdev List, netfilter-devel

On Sun, 18 Aug 2013, Eric Dumazet wrote:

> On Sat, 2013-08-17 at 22:55 -0700, Corey Hickey wrote:
> > Hi,
> > 
> > If there is a better user-oriented list I should ask this question on,
> > please tell me; I've asked a couple similar questions on netdev before
> > and gotten some great help.
> > 
> > 
> > I'm having a problem wherein some NATted connections stop forwarding
> > packets. I've troubleshot the problem far enough to tell that it happens
> > when path MTU discovery happens--the ACK to the retransmitted packet
> > never gets forwarded back to my local client (and neither do the
> > retransmitted ACKs).
> > 
> > This is my setup:
> > ----------------------------------------------------------------------
> >        MTU 9100                  MTU 1355  MTU 1500
> > client      --> linux router          -->  vpn  --> work host
> > 198.18.0.3      198.18.0.1    (eth0)                10.15.24.13
> >                 192.168.61.54 (tun0)
> > ----------------------------------------------------------------------
> > 
> > I'm running openconnect on a Linux router to connect to a cisco VPN at
> > work. The Linux router does SNAT (MASQUERADE) over the virtual tun0
> > interface created by openconnect.
> > 
> > This is the best test I have so far:
> > 
> > $ sudo ip route flush cache ; ssh workhost.example.com exit
> > Read from socket failed: Connection reset by peer
> > 
> > Most of the time, this hangs for a few minutes before giving up;
> > sometimes it just works fine. tcpdumps indicate that when it works, all
> > packets transmitted are below the VPN's MTU; when it fails, MTU
> > discovery has happened:
> > 
> > 1. client sends large packet, for example 1832 bytes
> > 2. router sends ICMP fragmentation needed
> > 3. client retransmits with a smaller packet, for example 1303 bytes
> > 4. router forwards packet over VPN
> > 5. work host ACKs packet
> > 6. router receives ACK but does not forward it
> > 7. both endpoints retransmit, but ACKs never get forwarded
> > 
> > The NAT table maintains an entry throughout (and continues to SNAT
> > retransmits from the client), but the ACKs from the server never get
> > forwarded.
> > 
> > fire:~# netstat-nat -n -d 10.15.24.13
> > Proto NATed Address           Destination Address            State
> > tcp   198.18.0.3:51076        10.15.24.13:22                 ESTABLISHED
> > 
> > 
> > I put a LOG target on the INPUT chain, though, and apparently these
> > packets are bypassing the NAT; the kernel does not think they should be
> > forwarded (else they would be on the FORWARD chain).
> > 
> > [23335.509084] IN=tun0 OUT= MAC= SRC=10.15.24.13 DST=192.168.61.54
> > LEN=64 TOS=0x00 PREC=0x00 TTL=62 ID=28270 DF PROTO=TCP SPT=22 DPT=51076
> > WINDOW=134 RES=0x00 ACK FIN URGP=0
> > 
> > 
> > The kernel on the router is 3.10-2-amd64 (from Debian testing). I had
> > the same problem with 3.2.0-4-amd64 (from Debian stable), before I tried
> > to see if I could upgrade the problem away.
> > 
> > Any help would be much appreciated. I can send full tcpdumps if needed.
> 
> This looks like the bug we had to fix recently : 
> 
> http://git.kernel.org/cgit/linux/kernel/git/davem/net.git/commit/?id=356d7d88e088687b6578ca64601b0a2c9d145296
> 
> Could you try latest net tree ?

It's a PMTU related problem, so I'd suggest to check first the next ones:

Aren't the ICMP error messages (i.e. fragmentation-needed) filtered out?
If not, does adding an explicit TCPMSS rule to the mangle table help?
Isn't there SACK options involved? Maybe there's a "smart" sequence 
number "anonymizer" device in the path which forgets about the SEQ numbers 
in the SACK options and thus such packets are marked as INVALID.

So tcpdumps captured at both sides could really help to tell what goes wrong.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: NAT stops forwarding ACKs after PMTU discovery
  2013-08-18 21:23   ` Jozsef Kadlecsik
@ 2013-08-19  0:00     ` Eric Dumazet
  2013-08-19  0:03       ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2013-08-19  0:00 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Corey Hickey, Linux Netdev List, netfilter-devel

On Sun, 2013-08-18 at 23:23 +0200, Jozsef Kadlecsik wrote:

> It's a PMTU related problem, so I'd suggest to check first the next ones:
> 
> Aren't the ICMP error messages (i.e. fragmentation-needed) filtered out?
> If not, does adding an explicit TCPMSS rule to the mangle table help?

Thats totally irrelevant. tcp conntrack should not care of ICMP messages
anyway.

> Isn't there SACK options involved? Maybe there's a "smart" sequence 
> number "anonymizer" device in the path which forgets about the SEQ numbers 
> in the SACK options and thus such packets are marked as INVALID.
> 

As I understood the report, they were no SACK at all, but plain ACK
messages, and conntrack believes these ACK are outside the window
because it remembers the @end sequence of the probe, and the @maxwin of
receiver.

Lets say the probe is 2000 bytes, seq 1:2001

conntrack records end = 2001

Of course this packet is lost.

The retransmit sends a 1000 bytes packet, seq 1:1001

conntrack pass the packet.

Receiver acks it using ack 1001

conntrack drops this ack because of some confusion about the prior 2001
end, and the window (might be as small as 1000)

Code like this seems very suspect to me :

before(sack, receiver->td_end + 1)

It basically says that conntrack doesn't deal with reorders on the
network.

> So tcpdumps captured at both sides could really help to tell what goes wrong.

We probably can reproduce this bug using packetdrill.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: NAT stops forwarding ACKs after PMTU discovery
  2013-08-19  0:00     ` Eric Dumazet
@ 2013-08-19  0:03       ` Eric Dumazet
  2013-08-19  8:43         ` Corey Hickey
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2013-08-19  0:03 UTC (permalink / raw)
  To: Jozsef Kadlecsik; +Cc: Corey Hickey, Linux Netdev List, netfilter-devel

On Sun, 2013-08-18 at 17:00 -0700, Eric Dumazet wrote:

> Code like this seems very suspect to me :
> 
> before(sack, receiver->td_end + 1)
> 

My suggestion would be to try :

diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index 2f80107..1862902 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -656,12 +656,12 @@ static bool tcp_in_window(const struct nf_conn *ct,
 	pr_debug("tcp_in_window: I=%i II=%i III=%i IV=%i\n",
 		 before(seq, sender->td_maxend + 1),
 		 (in_recv_win ? 1 : 0),
-		 before(sack, receiver->td_end + 1),
+		 before(sack, receiver->td_end + MAXACKWINDOW(sender) + 1),
 		 after(sack, receiver->td_end - MAXACKWINDOW(sender) - 1));
 
 	if (before(seq, sender->td_maxend + 1) &&
 	    in_recv_win &&
-	    before(sack, receiver->td_end + 1) &&
+	    before(sack, receiver->td_end + MAXACKWINDOW(sender) + 1) &&
 	    after(sack, receiver->td_end - MAXACKWINDOW(sender) - 1)) {
 		/*
 		 * Take into account window scaling (RFC 1323).



^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: NAT stops forwarding ACKs after PMTU discovery
  2013-08-19  0:03       ` Eric Dumazet
@ 2013-08-19  8:43         ` Corey Hickey
  2013-08-19 12:33           ` Christoph Paasch
  0 siblings, 1 reply; 32+ messages in thread
From: Corey Hickey @ 2013-08-19  8:43 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Jozsef Kadlecsik, Linux Netdev List, netfilter-devel

On 2013-08-18 17:03, Eric Dumazet wrote:
> On Sun, 2013-08-18 at 17:00 -0700, Eric Dumazet wrote:
> 
>> Code like this seems very suspect to me :
>>
>> before(sack, receiver->td_end + 1)
>>
> 
> My suggestion would be to try :
> 
> diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c

[...]

Thanks for all your suggestions--I really wasn't expecting so much on a
weekend. Here's all the data I have for tonight.

I tried the linux-next kernel, then linux-next with your patch applied.
Neither of them fix the problem, unfortunately. I have taken tcpdumps
for a working SSH and a failing SSH.

http://fatooh.org/files/tmp/linux-next-patch1.tar.bz2

[localhost]
sudo tcpdump -ni br0 -s 0 -w /tmp/local.pcap 'host 10.15.24.13 or icmp'

[router eth0]
tcpdump -ni eth0 -s 0 -w /tmp/eth0.pcap \
    'host 10.15.24.13 or (icmp and host not 69.78.33.132)'
* the exclusion here is just to remove some unrelated clutter

[router tun0]
tcpdump -ni tun0 -s 0 -w /tmp/tun0.pcap -s 0 'host 10.15.24.13 or icmp'

[remote]
tcpdump -ni eth0 -s 0 -w remote.pcap 'host 192.168.61.56'

Some notes:

1. I tested the new kernels only on the Linux router, assuming that is
where it was intended.

2. I take back what I wrote earlier about every connection that involves
PMTU discovery failed; I may have been observing this wrong. For now,
the situation is that some connections stop forwarding packets from the
remote host immediately after the retransmit, while other work fine.

3. From local.pcap, you can see that my localhost doesn't actually
transmit a large packet, yet the router's eth0 sees a large packet come
in. I think this is due to TSO, but I'm not completely sure.

4. For some reason, I cannot reproduce this when SSHing to a host at
work that is running Debian sid with 3.10-1-amd64, but I can reproduce
it when SSHing to hosts running Centos 6.4 with
2.6.32-358.6.1.el6.x86_64 (which surely has a ton of patches applied,
for whatever that's worth).

5. I have only a vague understanding of SACK; I will be reading up on
this soon. I will also look into packetdrill for reproducing the
problem, if the SSH results aren't good enough.

6. If I reduce the MTU on localhost to match the path MTU, the problem
does go away.

Thanks again for all the help,
Corey

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: NAT stops forwarding ACKs after PMTU discovery
  2013-08-19  8:43         ` Corey Hickey
@ 2013-08-19 12:33           ` Christoph Paasch
  2013-08-19 13:24             ` Eric Dumazet
  2013-08-19 18:22             ` Corey Hickey
  0 siblings, 2 replies; 32+ messages in thread
From: Christoph Paasch @ 2013-08-19 12:33 UTC (permalink / raw)
  To: Corey Hickey
  Cc: Eric Dumazet, Jozsef Kadlecsik, Linux Netdev List,
	netfilter-devel

Hello,

I would say, the problem is due to a sequence-number rewriting
middlebox, who does not correctly handle the SACK-blocks.

In remote.pcap, you see:
Packet#10: A Dup-ACK: ACK 1005503816, SACK: 1005505184-1005505648
The SACK actually covers Packet#9

In tun0.pcap, you have:
Packet#9: The one that is SACK'ed: SEQ: 3514869772
Packet#11: The Dup-ACK: ACK: 3514868404, SACK: 3570452498-3570452962

You can see that the SACK-block is not really aligned with the ACK-numbers.

Netfilter is probably dropping the Dup-ACK, because the SACK-block is
acknowledging unseen data.


There are middleboxes out there that randomize the sequence numbers, due to
an old bug in Windows, where the initial sequence number was not
sufficiently randomized. There are some of these middleboxes, who simply do
not support SACK, thus modify only the sequence numbers of the TCP-header
(https://supportforums.cisco.com/docs/DOC-12668#TCP_Sequence_Number_Randomization_and_SACK).

This results in a TCP retransmission timeout on the sender, because of
the invalid SACK-blocks, the duplicate ACKs are not accounted. This
completly kills the performance, as you can see in our presentation given at
IETF87: http://tools.ietf.org/agenda/87/slides/slides-87-tcpm-11.pdf

We have a patch that accounts DUP-ACKs with invalid SACK-blocks effectively
as duplicate acknowledgments. I can send the patches, if the
netdev-community is interested in accepting these upstream.


Cheers,
Christoph



On 19/08/13 - 01:43:18, Corey Hickey wrote:
> On 2013-08-18 17:03, Eric Dumazet wrote:
> > On Sun, 2013-08-18 at 17:00 -0700, Eric Dumazet wrote:
> > 
> >> Code like this seems very suspect to me :
> >>
> >> before(sack, receiver->td_end + 1)
> >>
> > 
> > My suggestion would be to try :
> > 
> > diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
> 
> [...]
> 
> Thanks for all your suggestions--I really wasn't expecting so much on a
> weekend. Here's all the data I have for tonight.
> 
> 
> I tried the linux-next kernel, then linux-next with your patch applied.
> Neither of them fix the problem, unfortunately. I have taken tcpdumps
> for a working SSH and a failing SSH.
> 
> http://fatooh.org/files/tmp/linux-next-patch1.tar.bz2
> 
> [localhost]
> sudo tcpdump -ni br0 -s 0 -w /tmp/local.pcap 'host 10.15.24.13 or icmp'
> 
> [router eth0]
> tcpdump -ni eth0 -s 0 -w /tmp/eth0.pcap \
>     'host 10.15.24.13 or (icmp and host not 69.78.33.132)'
> * the exclusion here is just to remove some unrelated clutter
> 
> [router tun0]
> tcpdump -ni tun0 -s 0 -w /tmp/tun0.pcap -s 0 'host 10.15.24.13 or icmp'
> 
> [remote]
> tcpdump -ni eth0 -s 0 -w remote.pcap 'host 192.168.61.56'
> 
> 
> Some notes:
> 
> 1. I tested the new kernels only on the Linux router, assuming that is
> where it was intended.
> 
> 2. I take back what I wrote earlier about every connection that involves
> PMTU discovery failed; I may have been observing this wrong. For now,
> the situation is that some connections stop forwarding packets from the
> remote host immediately after the retransmit, while other work fine.
> 
> 3. From local.pcap, you can see that my localhost doesn't actually
> transmit a large packet, yet the router's eth0 sees a large packet come
> in. I think this is due to TSO, but I'm not completely sure.
> 
> 4. For some reason, I cannot reproduce this when SSHing to a host at
> work that is running Debian sid with 3.10-1-amd64, but I can reproduce
> it when SSHing to hosts running Centos 6.4 with
> 2.6.32-358.6.1.el6.x86_64 (which surely has a ton of patches applied,
> for whatever that's worth).
> 
> 5. I have only a vague understanding of SACK; I will be reading up on
> this soon. I will also look into packetdrill for reproducing the
> problem, if the SSH results aren't good enough.
> 
> 6. If I reduce the MTU on localhost to match the path MTU, the problem
> does go away.
> 
> Thanks again for all the help,
> Corey
> --
> To unsubscribe from this list: send the line "unsubscribe netfilter-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: NAT stops forwarding ACKs after PMTU discovery
  2013-08-19 12:33           ` Christoph Paasch
@ 2013-08-19 13:24             ` Eric Dumazet
  2013-08-19 13:49               ` Christoph Paasch
  2013-08-19 18:22             ` Corey Hickey
  1 sibling, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2013-08-19 13:24 UTC (permalink / raw)
  To: Christoph Paasch
  Cc: Corey Hickey, Jozsef Kadlecsik, Linux Netdev List,
	netfilter-devel

On Mon, 2013-08-19 at 14:33 +0200, Christoph Paasch wrote:
> Hello,
> 
> I would say, the problem is due to a sequence-number rewriting
> middlebox, who does not correctly handle the SACK-blocks.
> 
> In remote.pcap, you see:
> Packet#10: A Dup-ACK: ACK 1005503816, SACK: 1005505184-1005505648
> The SACK actually covers Packet#9
> 
> In tun0.pcap, you have:
> Packet#9: The one that is SACK'ed: SEQ: 3514869772
> Packet#11: The Dup-ACK: ACK: 3514868404, SACK: 3570452498-3570452962
> 
> You can see that the SACK-block is not really aligned with the ACK-numbers.
> 
> Netfilter is probably dropping the Dup-ACK, because the SACK-block is
> acknowledging unseen data.
> 
> 
> There are middleboxes out there that randomize the sequence numbers, due to
> an old bug in Windows, where the initial sequence number was not
> sufficiently randomized. There are some of these middleboxes, who simply do
> not support SACK, thus modify only the sequence numbers of the TCP-header
> (https://supportforums.cisco.com/docs/DOC-12668#TCP_Sequence_Number_Randomization_and_SACK).
> 
> This results in a TCP retransmission timeout on the sender, because of
> the invalid SACK-blocks, the duplicate ACKs are not accounted. This
> completly kills the performance, as you can see in our presentation given at
> IETF87: http://tools.ietf.org/agenda/87/slides/slides-87-tcpm-11.pdf
> 
> We have a patch that accounts DUP-ACKs with invalid SACK-blocks effectively
> as duplicate acknowledgments. I can send the patches, if the
> netdev-community is interested in accepting these upstream.
> 

You mean a netfilter patch, a tcp patch, or both ?

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: NAT stops forwarding ACKs after PMTU discovery
  2013-08-19 13:24             ` Eric Dumazet
@ 2013-08-19 13:49               ` Christoph Paasch
  2013-08-19 13:58                 ` Eric Dumazet
  0 siblings, 1 reply; 32+ messages in thread
From: Christoph Paasch @ 2013-08-19 13:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Corey Hickey, Jozsef Kadlecsik, Linux Netdev List,
	netfilter-devel

On 19/08/13 - 06:24:17, Eric Dumazet wrote:
> On Mon, 2013-08-19 at 14:33 +0200, Christoph Paasch wrote:
> > Hello,
> > 
> > I would say, the problem is due to a sequence-number rewriting
> > middlebox, who does not correctly handle the SACK-blocks.
> > 
> > In remote.pcap, you see:
> > Packet#10: A Dup-ACK: ACK 1005503816, SACK: 1005505184-1005505648
> > The SACK actually covers Packet#9
> > 
> > In tun0.pcap, you have:
> > Packet#9: The one that is SACK'ed: SEQ: 3514869772
> > Packet#11: The Dup-ACK: ACK: 3514868404, SACK: 3570452498-3570452962
> > 
> > You can see that the SACK-block is not really aligned with the ACK-numbers.
> > 
> > Netfilter is probably dropping the Dup-ACK, because the SACK-block is
> > acknowledging unseen data.
> > 
> > 
> > There are middleboxes out there that randomize the sequence numbers, due to
> > an old bug in Windows, where the initial sequence number was not
> > sufficiently randomized. There are some of these middleboxes, who simply do
> > not support SACK, thus modify only the sequence numbers of the TCP-header
> > (https://supportforums.cisco.com/docs/DOC-12668#TCP_Sequence_Number_Randomization_and_SACK).
> > 
> > This results in a TCP retransmission timeout on the sender, because of
> > the invalid SACK-blocks, the duplicate ACKs are not accounted. This
> > completly kills the performance, as you can see in our presentation given at
> > IETF87: http://tools.ietf.org/agenda/87/slides/slides-87-tcpm-11.pdf
> > 
> > We have a patch that accounts DUP-ACKs with invalid SACK-blocks effectively
> > as duplicate acknowledgments. I can send the patches, if the
> > netdev-community is interested in accepting these upstream.
> > 
> 
> You mean a netfilter patch, a tcp patch, or both ?

It's a TCP-patch, that interprets duplicate-acks with invalid SACK-blocks as
duplicate acks in tcp_sock->sacked_out.


Christoph


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: NAT stops forwarding ACKs after PMTU discovery
  2013-08-19 13:49               ` Christoph Paasch
@ 2013-08-19 13:58                 ` Eric Dumazet
  2013-08-19 14:35                   ` Phil Oester
                                     ` (2 more replies)
  0 siblings, 3 replies; 32+ messages in thread
From: Eric Dumazet @ 2013-08-19 13:58 UTC (permalink / raw)
  To: Christoph Paasch
  Cc: Corey Hickey, Jozsef Kadlecsik, Linux Netdev List,
	netfilter-devel

On Mon, 2013-08-19 at 15:49 +0200, Christoph Paasch wrote:

> 
> It's a TCP-patch, that interprets duplicate-acks with invalid SACK-blocks as
> duplicate acks in tcp_sock->sacked_out.

Yeah, but here, this is conntrack who is blocking the thing.

TCP receiver has no chance to 'fix' it.

See conntrack is one of those buggy middle box as well.

So if you want to properly handle this mess, you'll also have to fix
conntrack.




^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: NAT stops forwarding ACKs after PMTU discovery
  2013-08-19 13:58                 ` Eric Dumazet
@ 2013-08-19 14:35                   ` Phil Oester
  2013-08-19 15:32                     ` Eric Dumazet
  2013-08-19 15:33                     ` Christoph Paasch
  2013-08-19 14:43                   ` NAT stops forwarding ACKs after PMTU discovery Christoph Paasch
  2013-08-19 20:13                   ` Jozsef Kadlecsik
  2 siblings, 2 replies; 32+ messages in thread
From: Phil Oester @ 2013-08-19 14:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Paasch, Corey Hickey, Jozsef Kadlecsik,
	Linux Netdev List, netfilter-devel

On Mon, Aug 19, 2013 at 06:58:05AM -0700, Eric Dumazet wrote:
> Yeah, but here, this is conntrack who is blocking the thing.
> 
> TCP receiver has no chance to 'fix' it.
> 
> See conntrack is one of those buggy middle box as well.
> 
> So if you want to properly handle this mess, you'll also have to fix
> conntrack.

Better to fix the box which is randomizing the acks and ignoring
the SACKs.  Usually these are older Cisco PIX/ASA devices which just
need a code upgrade.  

Phil

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: NAT stops forwarding ACKs after PMTU discovery
  2013-08-19 14:35                   ` Phil Oester
@ 2013-08-19 15:32                     ` Eric Dumazet
  2013-08-19 15:33                     ` Christoph Paasch
  1 sibling, 0 replies; 32+ messages in thread
From: Eric Dumazet @ 2013-08-19 15:32 UTC (permalink / raw)
  To: Phil Oester
  Cc: Christoph Paasch, Corey Hickey, Jozsef Kadlecsik,
	Linux Netdev List, netfilter-devel

On Mon, 2013-08-19 at 07:35 -0700, Phil Oester wrote:
> On Mon, Aug 19, 2013 at 06:58:05AM -0700, Eric Dumazet wrote:
> > Yeah, but here, this is conntrack who is blocking the thing.
> > 
> > TCP receiver has no chance to 'fix' it.
> > 
> > See conntrack is one of those buggy middle box as well.
> > 
> > So if you want to properly handle this mess, you'll also have to fix
> > conntrack.
> 
> Better to fix the box which is randomizing the acks and ignoring
> the SACKs.  Usually these are older Cisco PIX/ASA devices which just
> need a code upgrade.  


Having a patch to relax things could be evaluated, if not adding new
problems ( http://en.wikipedia.org/wiki/Robustness_principle ) 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: NAT stops forwarding ACKs after PMTU discovery
  2013-08-19 14:35                   ` Phil Oester
  2013-08-19 15:32                     ` Eric Dumazet
@ 2013-08-19 15:33                     ` Christoph Paasch
  2013-08-19 16:00                       ` Eric Dumazet
  1 sibling, 1 reply; 32+ messages in thread
From: Christoph Paasch @ 2013-08-19 15:33 UTC (permalink / raw)
  To: Phil Oester
  Cc: Eric Dumazet, Corey Hickey, Jozsef Kadlecsik, Linux Netdev List,
	netfilter-devel

On 19/08/13 - 07:35:09, Phil Oester wrote:
> On Mon, Aug 19, 2013 at 06:58:05AM -0700, Eric Dumazet wrote:
> > Yeah, but here, this is conntrack who is blocking the thing.
> > 
> > TCP receiver has no chance to 'fix' it.
> > 
> > See conntrack is one of those buggy middle box as well.
> > 
> > So if you want to properly handle this mess, you'll also have to fix
> > conntrack.
> 
> Better to fix the box which is randomizing the acks and ignoring
> the SACKs.  Usually these are older Cisco PIX/ASA devices which just
> need a code upgrade.  

I agree, the problem are these horrid middleboxes...

Unfortunately, they will hardly go away in the near futur. Rather the
opposite is the case.

If you have a public server running, I would be interested in the count of
invalid SACK-blocks received (netstat -s | grep TCPSACKDiscard). This is an
indication for such kind of middlebox between your server and the client,
implying that these connections cannot benefit from TCP-FastRetransmission
and each packet-loss will require an RTO to recover.

Cheers,
Christoph

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: NAT stops forwarding ACKs after PMTU discovery
  2013-08-19 15:33                     ` Christoph Paasch
@ 2013-08-19 16:00                       ` Eric Dumazet
  2013-08-19 17:15                         ` Christoph Paasch
  0 siblings, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2013-08-19 16:00 UTC (permalink / raw)
  To: Christoph Paasch
  Cc: Phil Oester, Corey Hickey, Jozsef Kadlecsik, Linux Netdev List,
	netfilter-devel

On Mon, 2013-08-19 at 17:33 +0200, Christoph Paasch wrote:

> 
> Unfortunately, they will hardly go away in the near futur. Rather the
> opposite is the case.
> 
> 
> If you have a public server running, I would be interested in the count of
> invalid SACK-blocks received (netstat -s | grep TCPSACKDiscard). This is an
> indication for such kind of middlebox between your server and the client,
> implying that these connections cannot benefit from TCP-FastRetransmission
> and each packet-loss will require an RTO to recover.
> 

If the (random) sequence offset is small rather than completely out of
window, it's going to be hard to detect all problems.

Show us your patch ;)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: NAT stops forwarding ACKs after PMTU discovery
  2013-08-19 16:00                       ` Eric Dumazet
@ 2013-08-19 17:15                         ` Christoph Paasch
  2013-08-19 18:00                           ` Phil Oester
  2013-08-19 18:10                           ` Eric Dumazet
  0 siblings, 2 replies; 32+ messages in thread
From: Christoph Paasch @ 2013-08-19 17:15 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Phil Oester, Corey Hickey, Jozsef Kadlecsik, Linux Netdev List,
	netfilter-devel

On 19/08/13 - 09:00:31, Eric Dumazet wrote:
> On Mon, 2013-08-19 at 17:33 +0200, Christoph Paasch wrote:
> > Unfortunately, they will hardly go away in the near futur. Rather the
> > opposite is the case.
> > 
> > 
> > If you have a public server running, I would be interested in the count of
> > invalid SACK-blocks received (netstat -s | grep TCPSACKDiscard). This is an
> > indication for such kind of middlebox between your server and the client,
> > implying that these connections cannot benefit from TCP-FastRetransmission
> > and each packet-loss will require an RTO to recover.
> > 
> 
> If the (random) sequence offset is small rather than completely out of
> window, it's going to be hard to detect all problems.

Yes, it is not possible to make it 100% perfect. But considering the size of
the seq-space, the probability is rather low that the SACK-block falls
in-window.

> Show us your patch ;)

Will send it soon. Have to rebase on net-next,... :)
(it's several months ago that I did this)


Christoph

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: NAT stops forwarding ACKs after PMTU discovery
  2013-08-19 17:15                         ` Christoph Paasch
@ 2013-08-19 18:00                           ` Phil Oester
  2013-08-19 18:10                           ` Eric Dumazet
  1 sibling, 0 replies; 32+ messages in thread
From: Phil Oester @ 2013-08-19 18:00 UTC (permalink / raw)
  To: Christoph Paasch
  Cc: Eric Dumazet, Corey Hickey, Jozsef Kadlecsik, Linux Netdev List,
	netfilter-devel

On Mon, Aug 19, 2013 at 07:15:19PM +0200, Christoph Paasch wrote:
> On 19/08/13 - 09:00:31, Eric Dumazet wrote:
> > On Mon, 2013-08-19 at 17:33 +0200, Christoph Paasch wrote:
> > > Unfortunately, they will hardly go away in the near futur. Rather the
> > > opposite is the case.
> > > 
> > > 
> > > If you have a public server running, I would be interested in the count of
> > > invalid SACK-blocks received (netstat -s | grep TCPSACKDiscard). This is an
> > > indication for such kind of middlebox between your server and the client,
> > > implying that these connections cannot benefit from TCP-FastRetransmission
> > > and each packet-loss will require an RTO to recover.
> > > 
> > 
> > If the (random) sequence offset is small rather than completely out of
> > window, it's going to be hard to detect all problems.

It is not small unfortunately.  The bug is that with ISN randomization enabled
the PIX leaves the SACK sequence numbers untouched.  For reference, the cisco
bug is CSCse14419.

> Yes, it is not possible to make it 100% perfect. But considering the size of
> the seq-space, the probability is rather low that the SACK-block falls
> in-window.

s/rather low/nil/

> > Show us your patch ;)

Or just disable SACK on the box behind the problematic PIX.

Phil

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: NAT stops forwarding ACKs after PMTU discovery
  2013-08-19 17:15                         ` Christoph Paasch
  2013-08-19 18:00                           ` Phil Oester
@ 2013-08-19 18:10                           ` Eric Dumazet
  2013-08-19 19:29                             ` [RFC 0/2] Account for duplicate ACKs with invalid SACK-blocks Christoph Paasch
  1 sibling, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2013-08-19 18:10 UTC (permalink / raw)
  To: Christoph Paasch
  Cc: Phil Oester, Corey Hickey, Jozsef Kadlecsik, Linux Netdev List,
	netfilter-devel

On Mon, 2013-08-19 at 19:15 +0200, Christoph Paasch wrote:

> 
> Will send it soon. Have to rebase on net-next,... :)
> (it's several months ago that I did this)

No hurry.

This is a very unlikely problem anyway.

Trace taken on a random Google server :

TcpExtTCPSACKDiscard 44666
TcpExtTCPSACKReneging 173593
TcpExtTCPSACKReorder 1331306
TcpExtTCPSackFailures 791292
TcpExtTCPSackMerged 261702691
TcpExtTCPSackRecovery 57267494
TcpExtTCPSackRecoveryFail 2008511
TcpExtTCPSackShiftFallback 841017416
TcpExtTCPSackShifted 619337144
...
TcpOutSegs 42215528084
TcpRetransSegs 441868477

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC 0/2] Account for duplicate ACKs with invalid SACK-blocks
  2013-08-19 18:10                           ` Eric Dumazet
@ 2013-08-19 19:29                             ` Christoph Paasch
  2013-08-19 19:29                               ` [RFC 1/2] Use acked_out for reno-style ack acounting instead of sacked_out Christoph Paasch
                                                 ` (3 more replies)
  0 siblings, 4 replies; 32+ messages in thread
From: Christoph Paasch @ 2013-08-19 19:29 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Phil Oester, Corey Hickey, Benjamin Hesmans

There exist sequence-number rewriting middleboxes, who do not modify the 
sequence-number in the SACK-blocks.
Duplicate acknowledgments with these (invalid) SACK-blocks will not be
accounted in sacked_out, and thus no fast-retransmit will trigger.
So, the only way to recover from a packet-loss is through an RTO,
effectively killing the performance of TCP in this case.

Performance-results can be seen here:
http://tools.ietf.org/agenda/87/slides/slides-87-tcpm-11.pdf

Another solution might be to simply disable SACK, as soon as an invalid
SACK-block has been received (as suggested by Phil Oester). This however
might be too aggressive.

Christoph Paasch (2):
  Use acked_out for reno-style ack acounting instead of sacked_out
  Account acked_out in sack, if the sack is invalid

 include/linux/tcp.h      |   1 +
 include/net/tcp.h        |  17 +++++---
 net/ipv4/tcp_input.c     | 103 ++++++++++++++++++++++++++++++++---------------
 net/ipv4/tcp_minisocks.c |   1 +
 net/ipv4/tcp_output.c    |   6 +--
 net/ipv4/tcp_timer.c     |   2 +-
 6 files changed, 89 insertions(+), 41 deletions(-)

-- 
1.8.1.2

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC 1/2] Use acked_out for reno-style ack acounting instead of sacked_out
  2013-08-19 19:29                             ` [RFC 0/2] Account for duplicate ACKs with invalid SACK-blocks Christoph Paasch
@ 2013-08-19 19:29                               ` Christoph Paasch
  2013-08-19 19:29                               ` [RFC 2/2] Account acked_out in sack, if the sack is invalid Christoph Paasch
                                                 ` (2 subsequent siblings)
  3 siblings, 0 replies; 32+ messages in thread
From: Christoph Paasch @ 2013-08-19 19:29 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Phil Oester, Corey Hickey, Benjamin Hesmans

This patch adds acked_out to tcp_sock, so that all acounting of acked
packets in case of non-sack connections is done within this field.

tcp_acked_out() sums up sacked_out and acked_out.

Tested-by: Benjamin Hesmans <benjamin.hesmans@uclouvain.be>
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
---
 include/linux/tcp.h      |  1 +
 include/net/tcp.h        | 17 ++++++++++++-----
 net/ipv4/tcp_input.c     | 44 +++++++++++++++++++++++++++-----------------
 net/ipv4/tcp_minisocks.c |  1 +
 net/ipv4/tcp_output.c    |  2 +-
 net/ipv4/tcp_timer.c     |  2 +-
 6 files changed, 43 insertions(+), 24 deletions(-)

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index d686334..6622595 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -241,6 +241,7 @@ struct tcp_sock {
 	u32	pushed_seq;	/* Last pushed seq, required to talk to windows */
 	u32	lost_out;	/* Lost packets			*/
 	u32	sacked_out;	/* SACK'd packets			*/
+	u32	acked_out;	/* DupACK'd packets			*/
 	u32	fackets_out;	/* FACK'd packets			*/
 	u32	tso_deferred;
 
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 09cb5c1..8210057 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -871,9 +871,14 @@ static inline void tcp_disable_early_retrans(struct tcp_sock *tp)
 	tp->do_early_retrans = 0;
 }
 
+static inline unsigned int tcp_acked_out(const struct tcp_sock *tp)
+{
+	return tp->sacked_out + tp->acked_out;
+}
+
 static inline unsigned int tcp_left_out(const struct tcp_sock *tp)
 {
-	return tp->sacked_out + tp->lost_out;
+	return tcp_acked_out(tp) + tp->lost_out;
 }
 
 /* This determines how many packets are "in the network" to the best
@@ -1446,12 +1451,12 @@ static inline void tcp_push_pending_frames(struct sock *sk)
 }
 
 /* Start sequence of the skb just after the highest skb with SACKed
- * bit, valid only if sacked_out > 0 or when the caller has ensured
+ * bit, valid only if sacked_out or acked_out > 0 or when the caller has ensured
  * validity by itself.
  */
 static inline u32 tcp_highest_sack_seq(struct tcp_sock *tp)
 {
-	if (!tp->sacked_out)
+	if (!tp->sacked_out && !tp->acked_out)
 		return tp->snd_una;
 
 	if (tp->highest_sack == NULL)
@@ -1481,8 +1486,10 @@ static inline void tcp_highest_sack_combine(struct sock *sk,
 					    struct sk_buff *old,
 					    struct sk_buff *new)
 {
-	if (tcp_sk(sk)->sacked_out && (old == tcp_sk(sk)->highest_sack))
-		tcp_sk(sk)->highest_sack = new;
+	struct tcp_sock *tp = tcp_sk(sk);
+
+	if ((tp->sacked_out || tp->acked_out) && (old == tp->highest_sack))
+		tp->highest_sack = new;
 }
 
 /* Determines whether this is a thin stream (which may suffer from
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e965cc7..af764d0 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -1748,8 +1748,8 @@ out:
 	return state.flag;
 }
 
-/* Limits sacked_out so that sum with lost_out isn't ever larger than
- * packets_out. Returns false if sacked_out adjustement wasn't necessary.
+/* Limits acked_out so that sum with lost_out isn't ever larger than
+ * packets_out. Returns false if acked_out adjustement wasn't necessary.
  */
 static bool tcp_limit_reno_sacked(struct tcp_sock *tp)
 {
@@ -1758,8 +1758,8 @@ static bool tcp_limit_reno_sacked(struct tcp_sock *tp)
 	holes = max(tp->lost_out, 1U);
 	holes = min(holes, tp->packets_out);
 
-	if ((tp->sacked_out + holes) > tp->packets_out) {
-		tp->sacked_out = tp->packets_out - holes;
+	if ((tp->acked_out + holes) > tp->packets_out) {
+		tp->acked_out = tp->packets_out - holes;
 		return true;
 	}
 	return false;
@@ -1781,7 +1781,7 @@ static void tcp_check_reno_reordering(struct sock *sk, const int addend)
 static void tcp_add_reno_sack(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	tp->sacked_out++;
+	tp->acked_out++;
 	tcp_check_reno_reordering(sk, 0);
 	tcp_verify_left_out(tp);
 }
@@ -1794,10 +1794,10 @@ static void tcp_remove_reno_sacks(struct sock *sk, int acked)
 
 	if (acked > 0) {
 		/* One ACK acked hole. The rest eat duplicate ACKs. */
-		if (acked - 1 >= tp->sacked_out)
-			tp->sacked_out = 0;
+		if (acked - 1 >= tp->acked_out)
+			tp->acked_out = 0;
 		else
-			tp->sacked_out -= acked - 1;
+			tp->acked_out -= acked - 1;
 	}
 	tcp_check_reno_reordering(sk, acked);
 	tcp_verify_left_out(tp);
@@ -1805,7 +1805,7 @@ static void tcp_remove_reno_sacks(struct sock *sk, int acked)
 
 static inline void tcp_reset_reno_sack(struct tcp_sock *tp)
 {
-	tp->sacked_out = 0;
+	tp->acked_out = 0;
 }
 
 static void tcp_clear_retrans_partial(struct tcp_sock *tp)
@@ -1823,6 +1823,7 @@ void tcp_clear_retrans(struct tcp_sock *tp)
 
 	tp->fackets_out = 0;
 	tp->sacked_out = 0;
+	tp->acked_out = 0;
 }
 
 /* Enter Loss state. If "how" is not zero, forget all SACK information
@@ -1857,6 +1858,7 @@ void tcp_enter_loss(struct sock *sk, int how)
 	tp->undo_marker = tp->snd_una;
 	if (how) {
 		tp->sacked_out = 0;
+		tp->acked_out = 0;
 		tp->fackets_out = 0;
 	}
 	tcp_clear_all_retrans_hints(tp);
@@ -1921,7 +1923,7 @@ static bool tcp_check_sack_reneging(struct sock *sk, int flag)
 
 static inline int tcp_fackets_out(const struct tcp_sock *tp)
 {
-	return tcp_is_reno(tp) ? tp->sacked_out + 1 : tp->fackets_out;
+	return tcp_is_reno(tp) ? tp->acked_out + 1 : tp->fackets_out;
 }
 
 /* Heurestics to calculate number of duplicate ACKs. There's no dupACKs
@@ -1941,7 +1943,7 @@ static inline int tcp_fackets_out(const struct tcp_sock *tp)
  */
 static inline int tcp_dupack_heuristics(const struct tcp_sock *tp)
 {
-	return tcp_is_fack(tp) ? tp->fackets_out : tp->sacked_out + 1;
+	return tcp_is_fack(tp) ? tp->fackets_out : tcp_acked_out(tp) + 1;
 }
 
 static bool tcp_pause_early_retransmit(struct sock *sk, int flag)
@@ -2077,7 +2079,7 @@ static bool tcp_time_to_recover(struct sock *sk, int flag)
 	 */
 	packets_out = tp->packets_out;
 	if (packets_out <= tp->reordering &&
-	    tp->sacked_out >= max_t(__u32, packets_out/2, sysctl_tcp_reordering) &&
+	    tcp_acked_out(tp) >= max_t(__u32, packets_out/2, sysctl_tcp_reordering) &&
 	    !tcp_may_send_now(sk)) {
 		/* We have nothing to send. This connection is limited
 		 * either by receiver window or by application.
@@ -2100,8 +2102,8 @@ static bool tcp_time_to_recover(struct sock *sk, int flag)
 	 * Mitigation A.3 in the RFC and delay the retransmission for a short
 	 * interval if appropriate.
 	 */
-	if (tp->do_early_retrans && !tp->retrans_out && tp->sacked_out &&
-	    (tp->packets_out >= (tp->sacked_out + 1) && tp->packets_out < 4) &&
+	if (tp->do_early_retrans && !tp->retrans_out && (tp->sacked_out || tp->acked_out) &&
+	    (tp->packets_out >= (tcp_acked_out(tp) + 1) && tp->packets_out < 4) &&
 	    !tcp_may_send_now(sk))
 		return !tcp_pause_early_retransmit(sk, flag);
 
@@ -2701,9 +2703,11 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
 				    (tcp_fackets_out(tp) > tp->reordering));
 	int fast_rexmit = 0;
 
-	if (WARN_ON(!tp->packets_out && tp->sacked_out))
+	if (WARN_ON(!tp->packets_out && (tp->sacked_out || tp->acked_out))) {
 		tp->sacked_out = 0;
-	if (WARN_ON(!tp->sacked_out && tp->fackets_out))
+		tp->acked_out = 0;
+	}
+	if (WARN_ON(!tp->sacked_out && !tp->acked_out && tp->fackets_out))
 		tp->fackets_out = 0;
 
 	/* Now state machine starts.
@@ -3078,6 +3082,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 
 #if FASTRETRANS_DEBUG > 0
 	WARN_ON((int)tp->sacked_out < 0);
+	WARN_ON((int)tp->acked_out < 0);
 	WARN_ON((int)tp->lost_out < 0);
 	WARN_ON((int)tp->retrans_out < 0);
 	if (!tp->packets_out && tcp_is_sack(tp)) {
@@ -3092,6 +3097,11 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 				 tp->sacked_out, icsk->icsk_ca_state);
 			tp->sacked_out = 0;
 		}
+		if (tp->acked_out) {
+			pr_debug("Leak a=%u %d\n",
+				 tp->acked_out, icsk->icsk_ca_state);
+			tp->acked_out = 0;
+		}
 		if (tp->retrans_out) {
 			pr_debug("Leak r=%u %d\n",
 				 tp->retrans_out, icsk->icsk_ca_state);
@@ -3270,7 +3280,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	u32 prior_in_flight;
 	u32 prior_fackets;
 	int prior_packets = tp->packets_out;
-	const int prior_unsacked = tp->packets_out - tp->sacked_out;
+	const int prior_unsacked = tp->packets_out - tcp_acked_out(tp);
 	int acked = 0; /* Number of packets newly acked */
 	s32 sack_rtt = -1;
 
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 58a3e69..a2f5279 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -407,6 +407,7 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
 		newtp->packets_out = 0;
 		newtp->retrans_out = 0;
 		newtp->sacked_out = 0;
+		newtp->acked_out = 0;
 		newtp->fackets_out = 0;
 		newtp->snd_ssthresh = TCP_INFINITE_SSTHRESH;
 		tcp_enable_early_retrans(newtp);
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 884efff..17ebbff 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1033,7 +1033,7 @@ static void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int de
 
 	/* Reno case is special. Sigh... */
 	if (tcp_is_reno(tp) && decr > 0)
-		tp->sacked_out -= min_t(u32, tp->sacked_out, decr);
+		tp->acked_out -= min_t(u32, tp->acked_out, decr);
 
 	tcp_adjust_fackets_out(sk, skb, decr);
 
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 4b85e6f..82acaea 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -405,7 +405,7 @@ void tcp_retransmit_timer(struct sock *sk)
 		} else if (icsk->icsk_ca_state == TCP_CA_Loss) {
 			mib_idx = LINUX_MIB_TCPLOSSFAILURES;
 		} else if ((icsk->icsk_ca_state == TCP_CA_Disorder) ||
-			   tp->sacked_out) {
+			   tp->sacked_out || tp->acked_out) {
 			if (tcp_is_sack(tp))
 				mib_idx = LINUX_MIB_TCPSACKFAILURES;
 			else
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC 2/2] Account acked_out in sack, if the sack is invalid
  2013-08-19 19:29                             ` [RFC 0/2] Account for duplicate ACKs with invalid SACK-blocks Christoph Paasch
  2013-08-19 19:29                               ` [RFC 1/2] Use acked_out for reno-style ack acounting instead of sacked_out Christoph Paasch
@ 2013-08-19 19:29                               ` Christoph Paasch
  2013-08-19 21:22                               ` [RFC 0/2] Account for duplicate ACKs with invalid SACK-blocks Corey Hickey
  2013-08-22  3:32                               ` David Miller
  3 siblings, 0 replies; 32+ messages in thread
From: Christoph Paasch @ 2013-08-19 19:29 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Phil Oester, Corey Hickey, Benjamin Hesmans

In case of a sequence-number rewriting firewall (e.g., a Cisco Firewall:
https://supportforums.cisco.com/docs/DOC-12668#TCP_Sequence_Number_Randomization_and_SACK)
the SACK-blocks are not rewritten.
As we receive these ACK-packets, the sackblock is marked as invalid in
tcp_is_sackblock_valid(), and sacked_out will not be incremented.
Thus, fast-retransmit will not trigger and the stack will timeout after an
RTO. This completly kills the performance across this kind of firewalls.

This patch allows acked_out to increase, in case a sackblog is marked as
invalid. Further, it considers acked_out in the accounting when SACK has
been enabled on the connection.

Tested-by: Benjamin Hesmans <benjamin.hesmans@uclouvain.be>
Signed-off-by: Christoph Paasch <christoph.paasch@uclouvain.be>
---
 net/ipv4/tcp_input.c  | 59 ++++++++++++++++++++++++++++++++++++++-------------
 net/ipv4/tcp_output.c |  4 ++--
 2 files changed, 46 insertions(+), 17 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index af764d0..ea99017 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -75,6 +75,8 @@
 #include <asm/unaligned.h>
 #include <net/netdma.h>
 
+static void tcp_add_reno_sack(struct sock *sk);
+
 int sysctl_tcp_timestamps __read_mostly = 1;
 int sysctl_tcp_window_scaling __read_mostly = 1;
 int sysctl_tcp_sack __read_mostly = 1;
@@ -1542,7 +1544,7 @@ static int tcp_sack_cache_ok(const struct tcp_sock *tp, const struct tcp_sack_bl
 
 static int
 tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
-			u32 prior_snd_una, s32 *sack_rtt)
+			u32 prior_snd_una, s32 *sack_rtt, int flag)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 	const unsigned char *ptr = (skb_transport_header(ack_skb) +
@@ -1554,7 +1556,7 @@ tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
 	struct sk_buff *skb;
 	int num_sacks = min(TCP_NUM_SACKS, (ptr[1] - TCPOLEN_SACK_BASE) >> 3);
 	int used_sacks;
-	bool found_dup_sack = false;
+	bool found_dup_sack = false, first = true;
 	int i, j;
 	int first_sack_index;
 
@@ -1562,7 +1564,7 @@ tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
 	state.reord = tp->packets_out;
 	state.rtt = -1;
 
-	if (!tp->sacked_out) {
+	if (!tp->sacked_out && !tp->acked_out) {
 		if (WARN_ON(tp->fackets_out))
 			tp->fackets_out = 0;
 		tcp_highest_sack_reset(sk);
@@ -1612,6 +1614,29 @@ tcp_sacktag_write_queue(struct sock *sk, const struct sk_buff *ack_skb,
 			NET_INC_STATS_BH(sock_net(sk), mib_idx);
 			if (i == 0)
 				first_sack_index = -1;
+
+			/* 1. Only account the dup-ack once.
+			 *
+			 * 2. If we are coming from tcp_ack, with an old_ack, we
+			 * should not increase acked_out, as the ack is old :)
+			 *
+			 * 3. Only increase acked_out, if this is really a
+			 * duplicate ack. Check is similar to the one in
+			 * tcp_ack, when setting is_dupack.
+			 *
+			 * 4. We have to reproduce the first check of
+			 * tcp_is_sackblock_valid, because we don't want to
+			 * account dup-acks if the sack-blog is corrupted.
+			 */
+			if (first &&
+			    !before(TCP_SKB_CB(ack_skb)->ack_seq, prior_snd_una) &&
+			    !(flag & (FLAG_SND_UNA_ADVANCED | FLAG_NOT_DUP)) &&
+			    before(sp[used_sacks].start_seq, sp[used_sacks].end_seq)) {
+				first = false;
+				/* Disable FACK, as we can't trust SACKs anymore */
+				tcp_disable_fack(tp);
+				tcp_add_reno_sack(sk);
+			}
 			continue;
 		}
 
@@ -1740,6 +1765,7 @@ out:
 
 #if FASTRETRANS_DEBUG > 0
 	WARN_ON((int)tp->sacked_out < 0);
+	WARN_ON((int)tp->acked_out < 0);
 	WARN_ON((int)tp->lost_out < 0);
 	WARN_ON((int)tp->retrans_out < 0);
 	WARN_ON((int)tcp_packets_in_flight(tp) < 0);
@@ -1852,7 +1878,7 @@ void tcp_enter_loss(struct sock *sk, int how)
 
 	tcp_clear_retrans_partial(tp);
 
-	if (tcp_is_reno(tp))
+	if (tcp_is_reno(tp) || tp->acked_out)
 		tcp_reset_reno_sack(tp);
 
 	tp->undo_marker = tp->snd_una;
@@ -2189,7 +2215,7 @@ static void tcp_update_scoreboard(struct sock *sk, int fast_rexmit)
 			lost = 1;
 		tcp_mark_head_lost(sk, lost, 0);
 	} else {
-		int sacked_upto = tp->sacked_out - tp->reordering;
+		int sacked_upto = tcp_acked_out(tp) - tp->reordering;
 		if (sacked_upto >= 0)
 			tcp_mark_head_lost(sk, sacked_upto, 0);
 		else if (fast_rexmit)
@@ -2551,7 +2577,7 @@ void tcp_simple_retransmit(struct sock *sk)
 	if (prior_lost == tp->lost_out)
 		return;
 
-	if (tcp_is_reno(tp))
+	if (tcp_is_reno(tp) || tp->acked_out)
 		tcp_limit_reno_sacked(tp);
 
 	tcp_verify_left_out(tp);
@@ -2634,11 +2660,11 @@ static void tcp_process_loss(struct sock *sk, int flag, bool is_dupack)
 	}
 	if (flag & FLAG_DATA_ACKED)
 		icsk->icsk_retransmits = 0;
-	if (tcp_is_reno(tp)) {
+	if (tcp_is_reno(tp) || tp->acked_out) {
 		/* A Reno DUPACK means new data in F-RTO step 2.b above are
 		 * delivered. Lower inflight to clock out (re)tranmissions.
 		 */
-		if (after(tp->snd_nxt, tp->high_seq) && is_dupack)
+		if (tcp_is_reno(tp) && after(tp->snd_nxt, tp->high_seq) && is_dupack)
 			tcp_add_reno_sack(sk);
 		else if (flag & FLAG_SND_UNA_ADVANCED)
 			tcp_reset_reno_sack(tp);
@@ -2739,7 +2765,7 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
 			break;
 
 		case TCP_CA_Recovery:
-			if (tcp_is_reno(tp))
+			if (tcp_is_reno(tp) || tp->acked_out)
 				tcp_reset_reno_sack(tp);
 			if (tcp_try_undo_recovery(sk))
 				return;
@@ -2772,10 +2798,10 @@ static void tcp_fastretrans_alert(struct sock *sk, const int acked,
 			return;
 		/* Fall through to processing in Open state. */
 	default:
-		if (tcp_is_reno(tp)) {
+		if (tcp_is_reno(tp) || tp->acked_out) {
 			if (flag & FLAG_SND_UNA_ADVANCED)
 				tcp_reset_reno_sack(tp);
-			if (is_dupack)
+			if (is_dupack && tcp_is_reno(tp))
 				tcp_add_reno_sack(sk);
 		}
 
@@ -2952,7 +2978,7 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 	int flag = 0;
 	u32 pkts_acked = 0;
 	u32 reord = tp->packets_out;
-	u32 prior_sacked = tp->sacked_out;
+	u32 prior_sacked = tcp_acked_out(tp);
 	s32 seq_rtt = -1;
 	s32 ca_seq_rtt = -1;
 	ktime_t last_ackt = net_invalid_timestamp();
@@ -3050,12 +3076,15 @@ static int tcp_clean_rtx_queue(struct sock *sk, int prior_fackets,
 		} else {
 			int delta;
 
+			if (tp->acked_out)
+				tcp_remove_reno_sacks(sk, pkts_acked);
+
 			/* Non-retransmitted hole got filled? That's reordering */
 			if (reord < prior_fackets)
 				tcp_update_reordering(sk, tp->fackets_out - reord, 0);
 
 			delta = tcp_is_fack(tp) ? pkts_acked :
-						  prior_sacked - tp->sacked_out;
+						  prior_sacked - tcp_acked_out(tp);
 			tp->lost_cnt_hint -= min(tp->lost_cnt_hint, delta);
 		}
 
@@ -3340,7 +3369,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 
 		if (TCP_SKB_CB(skb)->sacked)
 			flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una,
-							&sack_rtt);
+							&sack_rtt, flag);
 
 		if (TCP_ECN_rcv_ecn_echo(tp, tcp_hdr(skb)))
 			flag |= FLAG_ECE;
@@ -3413,7 +3442,7 @@ old_ack:
 	 */
 	if (TCP_SKB_CB(skb)->sacked) {
 		flag |= tcp_sacktag_write_queue(sk, skb, prior_snd_una,
-						&sack_rtt);
+						&sack_rtt, flag);
 		tcp_fastretrans_alert(sk, acked, prior_unsacked,
 				      is_dupack, flag);
 	}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 17ebbff..2ecb943 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1008,7 +1008,7 @@ static void tcp_adjust_fackets_out(struct sock *sk, const struct sk_buff *skb,
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
-	if (!tp->sacked_out || tcp_is_reno(tp))
+	if ((!tp->sacked_out && !tp->acked_out) || tcp_is_reno(tp))
 		return;
 
 	if (after(tcp_highest_sack_seq(tp), TCP_SKB_CB(skb)->seq))
@@ -1032,7 +1032,7 @@ static void tcp_adjust_pcount(struct sock *sk, const struct sk_buff *skb, int de
 		tp->lost_out -= decr;
 
 	/* Reno case is special. Sigh... */
-	if (tcp_is_reno(tp) && decr > 0)
+	if ((tcp_is_reno(tp) || tp->acked_out) && decr > 0)
 		tp->acked_out -= min_t(u32, tp->acked_out, decr);
 
 	tcp_adjust_fackets_out(sk, skb, decr);
-- 
1.8.1.2

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [RFC 0/2] Account for duplicate ACKs with invalid SACK-blocks
  2013-08-19 19:29                             ` [RFC 0/2] Account for duplicate ACKs with invalid SACK-blocks Christoph Paasch
  2013-08-19 19:29                               ` [RFC 1/2] Use acked_out for reno-style ack acounting instead of sacked_out Christoph Paasch
  2013-08-19 19:29                               ` [RFC 2/2] Account acked_out in sack, if the sack is invalid Christoph Paasch
@ 2013-08-19 21:22                               ` Corey Hickey
  2013-08-20  7:36                                 ` Christoph Paasch
  2013-08-22  3:32                               ` David Miller
  3 siblings, 1 reply; 32+ messages in thread
From: Corey Hickey @ 2013-08-19 21:22 UTC (permalink / raw)
  To: Christoph Paasch; +Cc: Eric Dumazet, netdev, Phil Oester, Benjamin Hesmans

On 2013-08-19 12:29, Christoph Paasch wrote:
> There exist sequence-number rewriting middleboxes, who do not modify the 
> sequence-number in the SACK-blocks.
> Duplicate acknowledgments with these (invalid) SACK-blocks will not be
> accounted in sacked_out, and thus no fast-retransmit will trigger.
> So, the only way to recover from a packet-loss is through an RTO,
> effectively killing the performance of TCP in this case.
> 
> Performance-results can be seen here:
> http://tools.ietf.org/agenda/87/slides/slides-87-tcpm-11.pdf
> 
> Another solution might be to simply disable SACK, as soon as an invalid
> SACK-block has been received (as suggested by Phil Oester). This however
> might be too aggressive.
> 
> Christoph Paasch (2):
>   Use acked_out for reno-style ack acounting instead of sacked_out
>   Account acked_out in sack, if the sack is invalid
> 
>  include/linux/tcp.h      |   1 +
>  include/net/tcp.h        |  17 +++++---
>  net/ipv4/tcp_input.c     | 103 ++++++++++++++++++++++++++++++++---------------
>  net/ipv4/tcp_minisocks.c |   1 +
>  net/ipv4/tcp_output.c    |   6 +--
>  net/ipv4/tcp_timer.c     |   2 +-
>  6 files changed, 89 insertions(+), 41 deletions(-)

I tested this patchset, and, unfortunately, the behavior seems to be the
same.

I'm still planning on working with the Cisco device, when I can get some
of the network admin's time, but that won't happen immediately.

Thanks,
Corey

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/2] Account for duplicate ACKs with invalid SACK-blocks
  2013-08-19 21:22                               ` [RFC 0/2] Account for duplicate ACKs with invalid SACK-blocks Corey Hickey
@ 2013-08-20  7:36                                 ` Christoph Paasch
  0 siblings, 0 replies; 32+ messages in thread
From: Christoph Paasch @ 2013-08-20  7:36 UTC (permalink / raw)
  To: Corey Hickey; +Cc: Eric Dumazet, netdev, Phil Oester, Benjamin Hesmans

On 19/08/13 - 14:22:46, Corey Hickey wrote:
> On 2013-08-19 12:29, Christoph Paasch wrote:
> > There exist sequence-number rewriting middleboxes, who do not modify the 
> > sequence-number in the SACK-blocks.
> > Duplicate acknowledgments with these (invalid) SACK-blocks will not be
> > accounted in sacked_out, and thus no fast-retransmit will trigger.
> > So, the only way to recover from a packet-loss is through an RTO,
> > effectively killing the performance of TCP in this case.
> > 
> > Performance-results can be seen here:
> > http://tools.ietf.org/agenda/87/slides/slides-87-tcpm-11.pdf
> > 
> > Another solution might be to simply disable SACK, as soon as an invalid
> > SACK-block has been received (as suggested by Phil Oester). This however
> > might be too aggressive.
> > 
> > Christoph Paasch (2):
> >   Use acked_out for reno-style ack acounting instead of sacked_out
> >   Account acked_out in sack, if the sack is invalid
> > 
> >  include/linux/tcp.h      |   1 +
> >  include/net/tcp.h        |  17 +++++---
> >  net/ipv4/tcp_input.c     | 103 ++++++++++++++++++++++++++++++++---------------
> >  net/ipv4/tcp_minisocks.c |   1 +
> >  net/ipv4/tcp_output.c    |   6 +--
> >  net/ipv4/tcp_timer.c     |   2 +-
> >  6 files changed, 89 insertions(+), 41 deletions(-)
> 
> I tested this patchset, and, unfortunately, the behavior seems to be the
> same.

Yes, because the problem is still on conntrack's side too.

As soon as conntrack will let pass the invalid SACK-blocks, you will
experience bad performance because every packet-loss can only be recovered
by an RTO.
My patchset tries to fix this last issue.


Cheers,
Christoph

> 
> I'm still planning on working with the Cisco device, when I can get some
> of the network admin's time, but that won't happen immediately.
> 
> Thanks,
> Corey

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/2] Account for duplicate ACKs with invalid SACK-blocks
  2013-08-19 19:29                             ` [RFC 0/2] Account for duplicate ACKs with invalid SACK-blocks Christoph Paasch
                                                 ` (2 preceding siblings ...)
  2013-08-19 21:22                               ` [RFC 0/2] Account for duplicate ACKs with invalid SACK-blocks Corey Hickey
@ 2013-08-22  3:32                               ` David Miller
  2013-08-22  4:15                                 ` Corey Hickey
  3 siblings, 1 reply; 32+ messages in thread
From: David Miller @ 2013-08-22  3:32 UTC (permalink / raw)
  To: christoph.paasch
  Cc: eric.dumazet, netdev, kernel, bugfood-ml, benjamin.hesmans

From: Christoph Paasch <christoph.paasch@uclouvain.be>
Date: Mon, 19 Aug 2013 21:29:26 +0200

> There exist sequence-number rewriting middleboxes, who do not modify the 
> sequence-number in the SACK-blocks.

These are bugs that the vendor's should fix, not something we should
cater to at all.

I'm not applying patches like these, and I've rejected similar
workarounds in the past 18 years, so this position is strong and
consistent.

Sorry.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/2] Account for duplicate ACKs with invalid SACK-blocks
  2013-08-22  3:32                               ` David Miller
@ 2013-08-22  4:15                                 ` Corey Hickey
  0 siblings, 0 replies; 32+ messages in thread
From: Corey Hickey @ 2013-08-22  4:15 UTC (permalink / raw)
  To: David Miller
  Cc: christoph.paasch, eric.dumazet, netdev, kernel, benjamin.hesmans

On 2013-08-21 20:32, David Miller wrote:
> From: Christoph Paasch <christoph.paasch@uclouvain.be>
> Date: Mon, 19 Aug 2013 21:29:26 +0200
> 
>> There exist sequence-number rewriting middleboxes, who do not modify the 
>> sequence-number in the SACK-blocks.
> 
> These are bugs that the vendor's should fix, not something we should
> cater to at all.
> 
> I'm not applying patches like these, and I've rejected similar
> workarounds in the past 18 years, so this position is strong and
> consistent.
> 
> Sorry.

That's not terrible for me; we'll most likely be turning off SEQ
randomization on the FWSM, when we have time to make sure it doesn't
break anything (which it shouldn't).

-Corey

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: NAT stops forwarding ACKs after PMTU discovery
  2013-08-19 13:58                 ` Eric Dumazet
  2013-08-19 14:35                   ` Phil Oester
@ 2013-08-19 14:43                   ` Christoph Paasch
  2013-08-19 20:13                   ` Jozsef Kadlecsik
  2 siblings, 0 replies; 32+ messages in thread
From: Christoph Paasch @ 2013-08-19 14:43 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Corey Hickey, Jozsef Kadlecsik, Linux Netdev List,
	netfilter-devel

On 19/08/13 - 06:58:05, Eric Dumazet wrote:
> On Mon, 2013-08-19 at 15:49 +0200, Christoph Paasch wrote:
> > It's a TCP-patch, that interprets duplicate-acks with invalid SACK-blocks as
> > duplicate acks in tcp_sock->sacked_out.
> 
> Yeah, but here, this is conntrack who is blocking the thing.

Yes, of course. I know that here conntrack is blocking the dup-ACKs.

Sorry for having added TCP-stack specific things in a conntrack-thread.
I just wanted to highlight that the TCP-stack has also problems with
sequence-number rewriting middleboxes.

> TCP receiver has no chance to 'fix' it.
> 
> See conntrack is one of those buggy middle box as well.
> 
> So if you want to properly handle this mess, you'll also have to fix
> conntrack.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: NAT stops forwarding ACKs after PMTU discovery
  2013-08-19 13:58                 ` Eric Dumazet
  2013-08-19 14:35                   ` Phil Oester
  2013-08-19 14:43                   ` NAT stops forwarding ACKs after PMTU discovery Christoph Paasch
@ 2013-08-19 20:13                   ` Jozsef Kadlecsik
  2013-08-19 20:43                     ` Christoph Paasch
  2013-08-19 21:08                     ` Eric Dumazet
  2 siblings, 2 replies; 32+ messages in thread
From: Jozsef Kadlecsik @ 2013-08-19 20:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Paasch, Corey Hickey, Linux Netdev List,
	netfilter-devel

On Mon, 19 Aug 2013, Eric Dumazet wrote:

> On Mon, 2013-08-19 at 15:49 +0200, Christoph Paasch wrote:
>
> > It's a TCP-patch, that interprets duplicate-acks with invalid SACK-blocks as
> > duplicate acks in tcp_sock->sacked_out.
> 
> Yeah, but here, this is conntrack who is blocking the thing.
> 
> TCP receiver has no chance to 'fix' it.
> 
> See conntrack is one of those buggy middle box as well.
> 
> So if you want to properly handle this mess, you'll also have to fix
> conntrack.

I beg you pardon: why conntrack should be relaxed, when it is expected
to do more strict TCP checkings (RFC5961, Section 5.).

Also, it's clearly a broken middle box. Don't shoot the messenger.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: NAT stops forwarding ACKs after PMTU discovery
  2013-08-19 20:13                   ` Jozsef Kadlecsik
@ 2013-08-19 20:43                     ` Christoph Paasch
  2013-08-19 21:08                     ` Eric Dumazet
  1 sibling, 0 replies; 32+ messages in thread
From: Christoph Paasch @ 2013-08-19 20:43 UTC (permalink / raw)
  To: Jozsef Kadlecsik
  Cc: Eric Dumazet, Corey Hickey, Linux Netdev List, netfilter-devel

On 19/08/13 - 22:13:59, Jozsef Kadlecsik wrote:
> On Mon, 19 Aug 2013, Eric Dumazet wrote:
> 
> > On Mon, 2013-08-19 at 15:49 +0200, Christoph Paasch wrote:
> >
> > > It's a TCP-patch, that interprets duplicate-acks with invalid SACK-blocks as
> > > duplicate acks in tcp_sock->sacked_out.
> > 
> > Yeah, but here, this is conntrack who is blocking the thing.
> > 
> > TCP receiver has no chance to 'fix' it.
> > 
> > See conntrack is one of those buggy middle box as well.
> > 
> > So if you want to properly handle this mess, you'll also have to fix
> > conntrack.
> 
> I beg you pardon: why conntrack should be relaxed, when it is expected
> to do more strict TCP checkings (RFC5961, Section 5.).

There is no mention of SACK in this RFC.
The duplicate ACKs with invalid SACK-blocks are valid with respect to
RFC5961, Section 5.

Actually, no RFC says that dup-ACKs with invalid SACK-blocks should
be discarded.


Cheers,
Christoph


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: NAT stops forwarding ACKs after PMTU discovery
  2013-08-19 20:13                   ` Jozsef Kadlecsik
  2013-08-19 20:43                     ` Christoph Paasch
@ 2013-08-19 21:08                     ` Eric Dumazet
  2013-08-19 22:07                       ` Jozsef Kadlecsik
  1 sibling, 1 reply; 32+ messages in thread
From: Eric Dumazet @ 2013-08-19 21:08 UTC (permalink / raw)
  To: Jozsef Kadlecsik
  Cc: Christoph Paasch, Corey Hickey, Linux Netdev List,
	netfilter-devel

On Mon, 2013-08-19 at 22:13 +0200, Jozsef Kadlecsik wrote:
> On Mon, 19 Aug 2013, Eric Dumazet wrote:
> 
> > On Mon, 2013-08-19 at 15:49 +0200, Christoph Paasch wrote:
> >
> > > It's a TCP-patch, that interprets duplicate-acks with invalid SACK-blocks as
> > > duplicate acks in tcp_sock->sacked_out.
> > 
> > Yeah, but here, this is conntrack who is blocking the thing.
> > 
> > TCP receiver has no chance to 'fix' it.
> > 
> > See conntrack is one of those buggy middle box as well.
> > 
> > So if you want to properly handle this mess, you'll also have to fix
> > conntrack.
> 
> I beg you pardon: why conntrack should be relaxed, when it is expected
> to do more strict TCP checkings (RFC5961, Section 5.).
> 
> Also, it's clearly a broken middle box. Don't shoot the messenger.

Frames are dropped by conntrack, before TCP receiver can even have a
choice.

So Christoph patch would be of no use for Corey.

I do not think I shot anyone, only stated the truth.

We have workarounds in our stack to 'fix' bugs from others, there
is no shame in this.

Glad to see you are interested in RFC 5961 support, as conntrack is
known to break the ACK challenges in response to RST messages (section
3)

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: NAT stops forwarding ACKs after PMTU discovery
  2013-08-19 21:08                     ` Eric Dumazet
@ 2013-08-19 22:07                       ` Jozsef Kadlecsik
  2013-08-20  4:18                         ` Corey Hickey
  0 siblings, 1 reply; 32+ messages in thread
From: Jozsef Kadlecsik @ 2013-08-19 22:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Christoph Paasch, Corey Hickey, Linux Netdev List,
	netfilter-devel

On Mon, 19 Aug 2013, Eric Dumazet wrote:

> On Mon, 2013-08-19 at 22:13 +0200, Jozsef Kadlecsik wrote:
> > On Mon, 19 Aug 2013, Eric Dumazet wrote:
> > 
> > > On Mon, 2013-08-19 at 15:49 +0200, Christoph Paasch wrote:
> > >
> > > > It's a TCP-patch, that interprets duplicate-acks with invalid 
> > > > SACK-blocks as duplicate acks in tcp_sock->sacked_out.
> > > 
> > > Yeah, but here, this is conntrack who is blocking the thing.
> > > 
> > > TCP receiver has no chance to 'fix' it.
> > > 
> > > See conntrack is one of those buggy middle box as well.
> > > 
> > > So if you want to properly handle this mess, you'll also have to fix
> > > conntrack.
> > 
> > I beg you pardon: why conntrack should be relaxed, when it is expected
> > to do more strict TCP checkings (RFC5961, Section 5.).
> > 
> > Also, it's clearly a broken middle box. Don't shoot the messenger.
> 
> Frames are dropped by conntrack, before TCP receiver can even have a
> choice.
> 
> So Christoph patch would be of no use for Corey.

Yes, exactly.
 
> I do not think I shot anyone, only stated the truth.

There's a middlebox in the path wich breaks SACK completely and conntrack 
drops (technically marks as INVALID) the packets with bogus SACK options.

It can be fixed by fixing the middlebox, or disabling SACK by the 
TCPOPTSTRIP target, or by relaxing conntrack. For the latter, the next 
untested patch may be sufficient:

diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index 7dcc376..8b5d783 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -649,6 +649,11 @@ static bool tcp_in_window(const struct nf_conn *ct,
 		 receiver->td_end, receiver->td_maxend, receiver->td_maxwin,
 		 receiver->td_scale);
 
+	/* Fall back to ACK when SACK is bogus */
+	if (!(before(sack, receiver->td_end + 1) &&
+	      after(sack, receiver->td_end - MAXACKWINDOW(sender) - 1)))
+		sack = ack;
+	      
 	pr_debug("tcp_in_window: I=%i II=%i III=%i IV=%i\n",
 		 before(seq, sender->td_maxend + 1),
 		 after(end, sender->td_end - receiver->td_maxwin - 1),

However it is good to cover the issue thus?
 
> We have workarounds in our stack to 'fix' bugs from others, there
> is no shame in this.
> 
> Glad to see you are interested in RFC 5961 support, as conntrack is
> known to break the ACK challenges in response to RST messages (section
> 3)

*Any* netfilter configuration where the non-allowed TCP packets are 
dropped and not rejected breaks the ACK challenges. That is why I consider 
only section 5 useful from RFC 5961.

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: NAT stops forwarding ACKs after PMTU discovery
  2013-08-19 22:07                       ` Jozsef Kadlecsik
@ 2013-08-20  4:18                         ` Corey Hickey
  0 siblings, 0 replies; 32+ messages in thread
From: Corey Hickey @ 2013-08-20  4:18 UTC (permalink / raw)
  To: Jozsef Kadlecsik
  Cc: Eric Dumazet, Christoph Paasch, Linux Netdev List,
	netfilter-devel

On 2013-08-19 15:07, Jozsef Kadlecsik wrote:
> It can be fixed by fixing the middlebox, or disabling SACK by the 
> TCPOPTSTRIP target, or by relaxing conntrack. For the latter, the next 
> untested patch may be sufficient:
> 
> diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
> index 7dcc376..8b5d783 100644
> --- a/net/netfilter/nf_conntrack_proto_tcp.c
> +++ b/net/netfilter/nf_conntrack_proto_tcp.c
> @@ -649,6 +649,11 @@ static bool tcp_in_window(const struct nf_conn *ct,
>  		 receiver->td_end, receiver->td_maxend, receiver->td_maxwin,
>  		 receiver->td_scale);
>  
> +	/* Fall back to ACK when SACK is bogus */
> +	if (!(before(sack, receiver->td_end + 1) &&
> +	      after(sack, receiver->td_end - MAXACKWINDOW(sender) - 1)))
> +		sack = ack;
> +	      
>  	pr_debug("tcp_in_window: I=%i II=%i III=%i IV=%i\n",
>  		 before(seq, sender->td_maxend + 1),
>  		 after(end, sender->td_end - receiver->td_maxwin - 1),
> 
> However it is good to cover the issue thus?

This didn't quite apply to my kernel tree, but I let patch apply it with
fuzz=2 and got:

diff --git a/net/netfilter/nf_conntrack_proto_tcp.c b/net/netfilter/nf_conntrack_proto_tcp.c
index 2f80107..94b326b 100644
--- a/net/netfilter/nf_conntrack_proto_tcp.c
+++ b/net/netfilter/nf_conntrack_proto_tcp.c
@@ -653,6 +653,11 @@ static bool tcp_in_window(const struct nf_conn *ct,
        in_recv_win = !receiver->td_maxwin ||
                      after(end, sender->td_end - receiver->td_maxwin - 1);
 
+       /* Fall back to ACK when SACK is bogus */
+       if (!(before(sack, receiver->td_end + 1) &&
+             after(sack, receiver->td_end - MAXACKWINDOW(sender) - 1)))
+               sack = ack;
+
        pr_debug("tcp_in_window: I=%i II=%i III=%i IV=%i\n",
                 before(seq, sender->td_maxend + 1),
                 (in_recv_win ? 1 : 0),



I can confirm, that does indeed work! Thank you. I will continue watching
this thread for other things to test, if need be.

-Corey

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: NAT stops forwarding ACKs after PMTU discovery
  2013-08-19 12:33           ` Christoph Paasch
  2013-08-19 13:24             ` Eric Dumazet
@ 2013-08-19 18:22             ` Corey Hickey
  1 sibling, 0 replies; 32+ messages in thread
From: Corey Hickey @ 2013-08-19 18:22 UTC (permalink / raw)
  To: Christoph Paasch
  Cc: Eric Dumazet, Jozsef Kadlecsik, Linux Netdev List,
	netfilter-devel

On 2013-08-19 05:33, Christoph Paasch wrote:
> Hello,
> 
> I would say, the problem is due to a sequence-number rewriting
> middlebox, who does not correctly handle the SACK-blocks.
> 
> In remote.pcap, you see:
> Packet#10: A Dup-ACK: ACK 1005503816, SACK: 1005505184-1005505648
> The SACK actually covers Packet#9
> 
> In tun0.pcap, you have:
> Packet#9: The one that is SACK'ed: SEQ: 3514869772
> Packet#11: The Dup-ACK: ACK: 3514868404, SACK: 3570452498-3570452962
> 
> You can see that the SACK-block is not really aligned with the ACK-numbers.
> 
> Netfilter is probably dropping the Dup-ACK, because the SACK-block is
> acknowledging unseen data.
> 
> 
> There are middleboxes out there that randomize the sequence numbers, due to
> an old bug in Windows, where the initial sequence number was not
> sufficiently randomized. There are some of these middleboxes, who simply do
> not support SACK, thus modify only the sequence numbers of the TCP-header
> (https://supportforums.cisco.com/docs/DOC-12668#TCP_Sequence_Number_Randomization_and_SACK).
> 
> This results in a TCP retransmission timeout on the sender, because of
> the invalid SACK-blocks, the duplicate ACKs are not accounted. This
> completly kills the performance, as you can see in our presentation given at
> IETF87: http://tools.ietf.org/agenda/87/slides/slides-87-tcpm-11.pdf

This makes good sense. I normally look at these in wireshark with
relative sequence numbers turned on, and I see in bad/tun0.pcap that the
SACK values are very high, but are normal in bad/remote.pcap.

I see the same thing in good/tun0.pcap, though, so I don't fully
understand what is making it work sometimes and not others.

I will show this thread and the Cisco docs to the network engineer at
work, and maybe we can get the SEQ randomization turned off (or at least
test it).

> We have a patch that accounts DUP-ACKs with invalid SACK-blocks effectively
> as duplicate acknowledgments. I can send the patches, if the
> netdev-community is interested in accepting these upstream.

I'll keep my eye out and test them if they come up.


Thanks,
Corey

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2013-08-22  4:15 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-08-18  5:55 NAT stops forwarding ACKs after PMTU discovery Corey Hickey
2013-08-18 15:24 ` Eric Dumazet
2013-08-18 16:59   ` Corey Hickey
2013-08-18 21:23   ` Jozsef Kadlecsik
2013-08-19  0:00     ` Eric Dumazet
2013-08-19  0:03       ` Eric Dumazet
2013-08-19  8:43         ` Corey Hickey
2013-08-19 12:33           ` Christoph Paasch
2013-08-19 13:24             ` Eric Dumazet
2013-08-19 13:49               ` Christoph Paasch
2013-08-19 13:58                 ` Eric Dumazet
2013-08-19 14:35                   ` Phil Oester
2013-08-19 15:32                     ` Eric Dumazet
2013-08-19 15:33                     ` Christoph Paasch
2013-08-19 16:00                       ` Eric Dumazet
2013-08-19 17:15                         ` Christoph Paasch
2013-08-19 18:00                           ` Phil Oester
2013-08-19 18:10                           ` Eric Dumazet
2013-08-19 19:29                             ` [RFC 0/2] Account for duplicate ACKs with invalid SACK-blocks Christoph Paasch
2013-08-19 19:29                               ` [RFC 1/2] Use acked_out for reno-style ack acounting instead of sacked_out Christoph Paasch
2013-08-19 19:29                               ` [RFC 2/2] Account acked_out in sack, if the sack is invalid Christoph Paasch
2013-08-19 21:22                               ` [RFC 0/2] Account for duplicate ACKs with invalid SACK-blocks Corey Hickey
2013-08-20  7:36                                 ` Christoph Paasch
2013-08-22  3:32                               ` David Miller
2013-08-22  4:15                                 ` Corey Hickey
2013-08-19 14:43                   ` NAT stops forwarding ACKs after PMTU discovery Christoph Paasch
2013-08-19 20:13                   ` Jozsef Kadlecsik
2013-08-19 20:43                     ` Christoph Paasch
2013-08-19 21:08                     ` Eric Dumazet
2013-08-19 22:07                       ` Jozsef Kadlecsik
2013-08-20  4:18                         ` Corey Hickey
2013-08-19 18:22             ` Corey Hickey

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).