* [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.12+
@ 2014-11-29 11:44 Thomas Jarosch
2014-12-01 10:25 ` Herbert Xu
2014-12-01 13:17 ` [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.12+ Wolfgang Walter
0 siblings, 2 replies; 40+ messages in thread
From: Thomas Jarosch @ 2014-11-29 11:44 UTC (permalink / raw)
To: netdev; +Cc: Eric Dumazet
[-- Attachment #1: Type: text/plain, Size: 5432 bytes --]
Hello,
we're in the process of updating production level machines
from kernel 3.4.101 to kernel 3.14.25. On one mail server
we noticed that emails destined for an IPSec tunnel sometimes
get stuck in the mail queue with TCP timeouts.
To make a long story short: When the VPN connection is initially
set up or re-newed, the path MTU for the xfrm tunnel is undetermined.
As soon as a TCP client starts to send large packets,
it triggers path MTU detection. Some middlebox on the
way to the final server has a lower MTU and sends back
an "ICMP fragmentation needed" packet as normal.
With the old kernel, the packet size for the TCP connection inside
the xfrm tunnel gets adjusted and all is fine. With kernel v3.12+,
the connection stalls completely. Same thing with kernel v3.18-rc6.
We wrote a small tool to mimic postfix's TCP behavior (see attached file).
In the end it's a normal TCP client sending large packets.
The server side is just "socat - tcp4-listen:667".
If you run "socket_client" a second time, the path MTU
for the xfrm tunnel is already known and packets flow normal, too.
The "evil" commit in question is this one:
---------------------------------------------------------------------
commit 8f26fb1c1ed81c33f5d87c5936f4d9d1b4118918
Author: Eric Dumazet <edumazet@google.com>
Date: Tue Oct 15 12:24:54 2013 -0700
tcp: remove the sk_can_gso() check from tcp_set_skb_tso_segs()
sk_can_gso() should only be used as a hint in tcp_sendmsg() to build GSO
packets in the first place. (As a performance hint)
Once we have GSO packets in write queue, we can not decide they are no
longer GSO only because flow now uses a route which doesn't handle
TSO/GSO.
Core networking stack handles the case very well for us, all we need
is keeping track of packet counts in MSS terms, regardless of
segmentation done later (in GSO or hardware)
Right now, if tcp_fragment() splits a GSO packet in two parts,
@left and @right, and route changed through a non GSO device,
both @left and @right have pcount set to 1, which is wrong,
and leads to incorrect packet_count tracking.
This problem was added in commit d5ac99a648 ("[TCP]: skb pcount with MTU
discovery")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 8fad1c1..d46f214 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -989,8 +989,7 @@ static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
/* Make sure we own this skb before messing gso_size/gso_segs */
WARN_ON_ONCE(skb_cloned(skb));
- if (skb->len <= mss_now || !sk_can_gso(sk) ||
- skb->ip_summed == CHECKSUM_NONE) {
+ if (skb->len <= mss_now || skb->ip_summed == CHECKSUM_NONE) {
/* Avoid the costly divide in the normal
* non-TSO case.
*/
---------------------------------------------------------------------
When I revert it, even kernel v3.18-rc6 starts working.
But I doubt this is the root problem, may be just hiding another issue.
--- Sample output of socket_client using vanilla v3.12 kernel ---
[1417258063 SEND result: 4096, strerror: Success]
tcp max seg: res: 0, max_seg: 1370
[1417258063 SEND result: 4096, strerror: Success]
tcp max seg: res: 0, max_seg: 1370
[1417258063 SEND result: 4096, strerror: Success]
tcp max seg: res: 0, max_seg: 1370
[1417258063 SEND result: 4096, strerror: Success]
tcp max seg: res: 0, max_seg: 1370
[1417258063 SEND result: 4096, strerror: Success]
tcp max seg: res: 0, max_seg: 1338
[1417258063 SEND result: 4096, strerror: Success]
tcp max seg: res: 0, max_seg: 1338
*STUCK*
--------------------------------------------------------
The "machine" is running on KVM and using "virtio_net" as NIC driver.
I've played with the ethtool offload settings:
*** eth1 defaults ***
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp-segmentation-offload: on
udp-fragmentation-offload: on
generic-segmentation-offload: on
generic-receive-offload: on
large-receive-offload: off
*** eth1 working (no stalls) using vanilla kernel ***
Offload parameters for eth1:
rx-checksumming: on
tx-checksumming: off <-- the magic switch
scatter-gather: off
tcp-segmentation-offload: off
udp-fragmentation-offload: off
generic-segmentation-offload: off
generic-receive-offload: off
large-receive-offload: off
When I turn "tx-checksumming" back on, it fails again.
Though that is probably also just a side effect.
I can provide tcpdumps if needed but they are no real help
since you can just see the kernel stops sending TCP packets.
(and the outgoing TCP packets are encrypted in ESP packets)
Any vague idea what might be the root cause?
I also tried reverting commit 4d53eff48b5f03ce67f4f301d6acca1d2145cb7a
("xfrm: Don't queue retransmitted packets if the original is still on the host")
but that didn't change the situation. In fact it wasn't even triggered.
Please CC: comments. Thanks.
Best regards,
Thomas
[-- Attachment #2: socket_client.c --]
[-- Type: text/x-csrc, Size: 1327 bytes --]
#include <sys/types.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/un.h>
#include <unistd.h>
#include <errno.h>
#include <time.h>
/*
Remote server: socat - tcp4-listen:667
*/
int main()
{
int sockfd = socket(AF_INET, SOCK_STREAM, 0);
struct sockaddr_in servaddr;
bzero(&servaddr,sizeof(servaddr));
servaddr.sin_family = AF_INET;
servaddr.sin_addr.s_addr=inet_addr("192.168.12.254");
servaddr.sin_port=htons(667);
int result = connect(sockfd, (struct sockaddr *)&servaddr, sizeof(servaddr));
if(result != 0)
{
perror("failed to connect");
exit(1);
}
char sendbuf[4096];
memset(sendbuf, 0, sizeof(sendbuf));
strcpy(sendbuf, "NOOP\n");
int max_seg = 0, max_seg_len = sizeof(max_seg), get_res = 0;
for (int i = 0; i < 10; ++i)
{
errno = 0;
int send_res = send(sockfd, sendbuf, sizeof(sendbuf), 0);
printf("[%d SEND result: %d, strerror: %s]\n", time(NULL), send_res, strerror(errno));
get_res = getsockopt(sockfd, SOL_TCP, TCP_MAXSEG, &max_seg, &max_seg_len);
printf("tcp max seg: res: %d, max_seg: %d\n", get_res, max_seg);
}
printf("All sent.\n");
close(sockfd);
exit(0);
}
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.12+
2014-11-29 11:44 [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.12+ Thomas Jarosch
@ 2014-12-01 10:25 ` Herbert Xu
2014-12-01 11:20 ` Thomas Jarosch
2014-12-31 13:39 ` tcp: Do not apply TSO segment limit to non-TSO packets Herbert Xu
2014-12-01 13:17 ` [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.12+ Wolfgang Walter
1 sibling, 2 replies; 40+ messages in thread
From: Herbert Xu @ 2014-12-01 10:25 UTC (permalink / raw)
To: Thomas Jarosch; +Cc: netdev, edumazet, Steffen Klassert
Thomas Jarosch <thomas.jarosch@intra2net.com> wrote:
>
> When I revert it, even kernel v3.18-rc6 starts working.
> But I doubt this is the root problem, may be just hiding another issue.
Can you do a tcpdump with this patch reverted? I would like to
see the size of the packets that are sent out vs. the ICMP message
that came back.
My guess is that the IPsec GSO path is buggy since as you rightly
pointed out it couldn't have been heavily tested prior to this
patch.
Though I am surprised that it only breaks when you have a PMTU event
so it might be something else after all.
Thanks,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Re: [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.12+
2014-12-01 10:25 ` Herbert Xu
@ 2014-12-01 11:20 ` Thomas Jarosch
2014-12-31 13:39 ` tcp: Do not apply TSO segment limit to non-TSO packets Herbert Xu
1 sibling, 0 replies; 40+ messages in thread
From: Thomas Jarosch @ 2014-12-01 11:20 UTC (permalink / raw)
To: Herbert Xu; +Cc: netdev, edumazet, Steffen Klassert
On Monday, 1. December 2014 18:25:22 Herbert Xu wrote:
> Thomas Jarosch <thomas.jarosch@intra2net.com> wrote:
> > When I revert it, even kernel v3.18-rc6 starts working.
> > But I doubt this is the root problem, may be just hiding another issue.
>
> Can you do a tcpdump with this patch reverted? I would like to
> see the size of the packets that are sent out vs. the ICMP message
> that came back.
>
> My guess is that the IPsec GSO path is buggy since as you rightly
> pointed out it couldn't have been heavily tested prior to this
> patch.
>
> Though I am surprised that it only breaks when you have a PMTU event
> so it might be something else after all.
I've sent you two pcap files off list. One with the reverted patch
and one I created on Saturday showing the stalling TCP connection.
(though the mail currently seems greylisted by the receiving mail server)
One thing I can say from looking at the tcpdump with the reverted patch
is that the ESP packet size drops from 1510 to 1478. The announced MTU
of the next hop in the ICMP message is 1464. It also contains seven
duplicated ACK packets later on.
Without the reverted patch, it sends four ESP packets:
1. 1498 bytes
2. 1498 bytes
3. 1498 bytes
4. 234 bytes
That triggers the ICMP "fragmentation needed" message. I can only spot one
ESP packet of size 1466 afterwards that's sent from time to time. Only one
duplicated ACK packet can be seen. The other packets are just not resent.
May be it's stuck in some in-kernel queue because it's too big to send
and stays there until it expires?
Today I also tried changing the NIC driver on the virtual machine
from virtio_net to e1000. Luckily still the same behavior,
so it's probably not related to the virtio_net driver.
Cheers,
Thomas
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.12+
2014-11-29 11:44 [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.12+ Thomas Jarosch
2014-12-01 10:25 ` Herbert Xu
@ 2014-12-01 13:17 ` Wolfgang Walter
2014-12-01 16:41 ` Wolfgang Walter
1 sibling, 1 reply; 40+ messages in thread
From: Wolfgang Walter @ 2014-12-01 13:17 UTC (permalink / raw)
To: Thomas Jarosch; +Cc: netdev, Eric Dumazet
Am Samstag, 29. November 2014, 12:44:07 schrieb Thomas Jarosch:
> Hello,
>
> we're in the process of updating production level machines
> from kernel 3.4.101 to kernel 3.14.25. On one mail server
> we noticed that emails destined for an IPSec tunnel sometimes
> get stuck in the mail queue with TCP timeouts.
>
> To make a long story short: When the VPN connection is initially
> set up or re-newed, the path MTU for the xfrm tunnel is undetermined.
>
> As soon as a TCP client starts to send large packets,
> it triggers path MTU detection. Some middlebox on the
> way to the final server has a lower MTU and sends back
> an "ICMP fragmentation needed" packet as normal.
>
> With the old kernel, the packet size for the TCP connection inside
> the xfrm tunnel gets adjusted and all is fine. With kernel v3.12+,
> the connection stalls completely. Same thing with kernel v3.18-rc6.
We see something similar with real nic (RTL8139). In our case only the first
tcp-connection which triggers PMTU stalls. Later tcp-connections then work
fine.
I will revert that patch and see if that fixes the problem.
>
> We wrote a small tool to mimic postfix's TCP behavior (see attached file).
> In the end it's a normal TCP client sending large packets.
> The server side is just "socat - tcp4-listen:667".
>
> If you run "socket_client" a second time, the path MTU
> for the xfrm tunnel is already known and packets flow normal, too.
>
>
> The "evil" commit in question is this one:
> ---------------------------------------------------------------------
> commit 8f26fb1c1ed81c33f5d87c5936f4d9d1b4118918
> Author: Eric Dumazet <edumazet@google.com>
> Date: Tue Oct 15 12:24:54 2013 -0700
>
> tcp: remove the sk_can_gso() check from tcp_set_skb_tso_segs()
>
> sk_can_gso() should only be used as a hint in tcp_sendmsg() to build GSO
> packets in the first place. (As a performance hint)
>
> Once we have GSO packets in write queue, we can not decide they are no
> longer GSO only because flow now uses a route which doesn't handle
> TSO/GSO.
>
> Core networking stack handles the case very well for us, all we need
> is keeping track of packet counts in MSS terms, regardless of
> segmentation done later (in GSO or hardware)
>
> Right now, if tcp_fragment() splits a GSO packet in two parts,
> @left and @right, and route changed through a non GSO device,
> both @left and @right have pcount set to 1, which is wrong,
> and leads to incorrect packet_count tracking.
>
> This problem was added in commit d5ac99a648 ("[TCP]: skb pcount with MTU
> discovery")
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Signed-off-by: Neal Cardwell <ncardwell@google.com>
> Signed-off-by: Yuchung Cheng <ycheng@google.com>
> Reported-by: Maciej Żenczykowski <maze@google.com>
> Signed-off-by: David S. Miller <davem@davemloft.net>
>
> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> index 8fad1c1..d46f214 100644
> --- a/net/ipv4/tcp_output.c
> +++ b/net/ipv4/tcp_output.c
> @@ -989,8 +989,7 @@ static void tcp_set_skb_tso_segs(const struct sock *sk,
> struct sk_buff *skb, /* Make sure we own this skb before messing
> gso_size/gso_segs */ WARN_ON_ONCE(skb_cloned(skb));
>
> - if (skb->len <= mss_now || !sk_can_gso(sk) ||
> - skb->ip_summed == CHECKSUM_NONE) {
> + if (skb->len <= mss_now || skb->ip_summed == CHECKSUM_NONE) {
> /* Avoid the costly divide in the normal
> * non-TSO case.
> */
> ---------------------------------------------------------------------
>
> When I revert it, even kernel v3.18-rc6 starts working.
> But I doubt this is the root problem, may be just hiding another issue.
>
> --- Sample output of socket_client using vanilla v3.12 kernel ---
> [1417258063 SEND result: 4096, strerror: Success]
> tcp max seg: res: 0, max_seg: 1370
> [1417258063 SEND result: 4096, strerror: Success]
> tcp max seg: res: 0, max_seg: 1370
> [1417258063 SEND result: 4096, strerror: Success]
> tcp max seg: res: 0, max_seg: 1370
> [1417258063 SEND result: 4096, strerror: Success]
> tcp max seg: res: 0, max_seg: 1370
> [1417258063 SEND result: 4096, strerror: Success]
> tcp max seg: res: 0, max_seg: 1338
> [1417258063 SEND result: 4096, strerror: Success]
> tcp max seg: res: 0, max_seg: 1338
> *STUCK*
> --------------------------------------------------------
>
> The "machine" is running on KVM and using "virtio_net" as NIC driver.
> I've played with the ethtool offload settings:
>
> *** eth1 defaults ***
> Offload parameters for eth1:
> rx-checksumming: on
> tx-checksumming: on
> scatter-gather: on
> tcp-segmentation-offload: on
> udp-fragmentation-offload: on
> generic-segmentation-offload: on
> generic-receive-offload: on
> large-receive-offload: off
>
> *** eth1 working (no stalls) using vanilla kernel ***
> Offload parameters for eth1:
> rx-checksumming: on
> tx-checksumming: off <-- the magic switch
> scatter-gather: off
> tcp-segmentation-offload: off
> udp-fragmentation-offload: off
> generic-segmentation-offload: off
> generic-receive-offload: off
> large-receive-offload: off
>
> When I turn "tx-checksumming" back on, it fails again.
> Though that is probably also just a side effect.
>
> I can provide tcpdumps if needed but they are no real help
> since you can just see the kernel stops sending TCP packets.
> (and the outgoing TCP packets are encrypted in ESP packets)
>
>
> Any vague idea what might be the root cause?
>
> I also tried reverting commit 4d53eff48b5f03ce67f4f301d6acca1d2145cb7a
> ("xfrm: Don't queue retransmitted packets if the original is still on the
> host") but that didn't change the situation. In fact it wasn't even
> triggered.
>
> Please CC: comments. Thanks.
>
> Best regards,
> Thomas
Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.12+
2014-12-01 13:17 ` [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.12+ Wolfgang Walter
@ 2014-12-01 16:41 ` Wolfgang Walter
2014-12-05 12:09 ` Wolfgang Walter
0 siblings, 1 reply; 40+ messages in thread
From: Wolfgang Walter @ 2014-12-01 16:41 UTC (permalink / raw)
To: Thomas Jarosch; +Cc: netdev, Eric Dumazet
Am Montag, 1. Dezember 2014, 14:17:28 schrieb Wolfgang Walter:
> Am Samstag, 29. November 2014, 12:44:07 schrieb Thomas Jarosch:
> > Hello,
> >
> > we're in the process of updating production level machines
> > from kernel 3.4.101 to kernel 3.14.25. On one mail server
> > we noticed that emails destined for an IPSec tunnel sometimes
> > get stuck in the mail queue with TCP timeouts.
> >
> > To make a long story short: When the VPN connection is initially
> > set up or re-newed, the path MTU for the xfrm tunnel is undetermined.
> >
> > As soon as a TCP client starts to send large packets,
> > it triggers path MTU detection. Some middlebox on the
> > way to the final server has a lower MTU and sends back
> > an "ICMP fragmentation needed" packet as normal.
> >
> > With the old kernel, the packet size for the TCP connection inside
> > the xfrm tunnel gets adjusted and all is fine. With kernel v3.12+,
> > the connection stalls completely. Same thing with kernel v3.18-rc6.
>
> We see something similar with real nic (RTL8139). In our case only the first
> tcp-connection which triggers PMTU stalls. Later tcp-connections then work
> fine.
>
> I will revert that patch and see if that fixes the problem.
Reverting the commit fixes the problem here, too.
>
> > We wrote a small tool to mimic postfix's TCP behavior (see attached file).
> > In the end it's a normal TCP client sending large packets.
> > The server side is just "socat - tcp4-listen:667".
> >
> > If you run "socket_client" a second time, the path MTU
> > for the xfrm tunnel is already known and packets flow normal, too.
> >
> >
> > The "evil" commit in question is this one:
> > ---------------------------------------------------------------------
> > commit 8f26fb1c1ed81c33f5d87c5936f4d9d1b4118918
> > Author: Eric Dumazet <edumazet@google.com>
> > Date: Tue Oct 15 12:24:54 2013 -0700
> >
> > tcp: remove the sk_can_gso() check from tcp_set_skb_tso_segs()
> >
> > sk_can_gso() should only be used as a hint in tcp_sendmsg() to build
> > GSO
> >
> > packets in the first place. (As a performance hint)
> >
> > Once we have GSO packets in write queue, we can not decide they are no
> > longer GSO only because flow now uses a route which doesn't handle
> > TSO/GSO.
> >
> > Core networking stack handles the case very well for us, all we need
> > is keeping track of packet counts in MSS terms, regardless of
> > segmentation done later (in GSO or hardware)
> >
> > Right now, if tcp_fragment() splits a GSO packet in two parts,
> > @left and @right, and route changed through a non GSO device,
> > both @left and @right have pcount set to 1, which is wrong,
> > and leads to incorrect packet_count tracking.
> >
> > This problem was added in commit d5ac99a648 ("[TCP]: skb pcount with
> > MTU
> >
> > discovery")
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Signed-off-by: Neal Cardwell <ncardwell@google.com>
> > Signed-off-by: Yuchung Cheng <ycheng@google.com>
> > Reported-by: Maciej Żenczykowski <maze@google.com>
> > Signed-off-by: David S. Miller <davem@davemloft.net>
> >
> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > index 8fad1c1..d46f214 100644
> > --- a/net/ipv4/tcp_output.c
> > +++ b/net/ipv4/tcp_output.c
> > @@ -989,8 +989,7 @@ static void tcp_set_skb_tso_segs(const struct sock
> > *sk,
> > struct sk_buff *skb, /* Make sure we own this skb before messing
> > gso_size/gso_segs */ WARN_ON_ONCE(skb_cloned(skb));
> >
> > - if (skb->len <= mss_now || !sk_can_gso(sk) ||
> > - skb->ip_summed == CHECKSUM_NONE) {
> > + if (skb->len <= mss_now || skb->ip_summed == CHECKSUM_NONE) {
> >
> > /* Avoid the costly divide in the normal
> >
> > * non-TSO case.
> > */
> >
> > ---------------------------------------------------------------------
> >
> > When I revert it, even kernel v3.18-rc6 starts working.
> > But I doubt this is the root problem, may be just hiding another issue.
> >
> > --- Sample output of socket_client using vanilla v3.12 kernel ---
> > [1417258063 SEND result: 4096, strerror: Success]
> > tcp max seg: res: 0, max_seg: 1370
> > [1417258063 SEND result: 4096, strerror: Success]
> > tcp max seg: res: 0, max_seg: 1370
> > [1417258063 SEND result: 4096, strerror: Success]
> > tcp max seg: res: 0, max_seg: 1370
> > [1417258063 SEND result: 4096, strerror: Success]
> > tcp max seg: res: 0, max_seg: 1370
> > [1417258063 SEND result: 4096, strerror: Success]
> > tcp max seg: res: 0, max_seg: 1338
> > [1417258063 SEND result: 4096, strerror: Success]
> > tcp max seg: res: 0, max_seg: 1338
> > *STUCK*
> > --------------------------------------------------------
> >
> > The "machine" is running on KVM and using "virtio_net" as NIC driver.
> > I've played with the ethtool offload settings:
> >
> > *** eth1 defaults ***
> > Offload parameters for eth1:
> > rx-checksumming: on
> > tx-checksumming: on
> > scatter-gather: on
> > tcp-segmentation-offload: on
> > udp-fragmentation-offload: on
> > generic-segmentation-offload: on
> > generic-receive-offload: on
> > large-receive-offload: off
> >
> > *** eth1 working (no stalls) using vanilla kernel ***
> > Offload parameters for eth1:
> > rx-checksumming: on
> > tx-checksumming: off <-- the magic switch
> > scatter-gather: off
> > tcp-segmentation-offload: off
> > udp-fragmentation-offload: off
> > generic-segmentation-offload: off
> > generic-receive-offload: off
> > large-receive-offload: off
> >
> > When I turn "tx-checksumming" back on, it fails again.
> > Though that is probably also just a side effect.
> >
> > I can provide tcpdumps if needed but they are no real help
> > since you can just see the kernel stops sending TCP packets.
> > (and the outgoing TCP packets are encrypted in ESP packets)
> >
> >
> > Any vague idea what might be the root cause?
> >
> > I also tried reverting commit 4d53eff48b5f03ce67f4f301d6acca1d2145cb7a
> > ("xfrm: Don't queue retransmitted packets if the original is still on the
> > host") but that didn't change the situation. In fact it wasn't even
> > triggered.
> >
> > Please CC: comments. Thanks.
> >
> > Best regards,
> > Thomas
>
> Regards,
Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.12+
2014-12-01 16:41 ` Wolfgang Walter
@ 2014-12-05 12:09 ` Wolfgang Walter
2014-12-05 13:26 ` Eric Dumazet
0 siblings, 1 reply; 40+ messages in thread
From: Wolfgang Walter @ 2014-12-05 12:09 UTC (permalink / raw)
To: netdev; +Cc: Thomas Jarosch, Eric Dumazet, Herbert Xu, Steffen Klassert
Hello,
as reverting this patch fixes this rather annoying problem: is it dangerous to
revert it as a workaround until the root cause is found?
Am Montag, 1. Dezember 2014, 17:41:23 schrieb Wolfgang Walter:
> Am Montag, 1. Dezember 2014, 14:17:28 schrieb Wolfgang Walter:
> > Am Samstag, 29. November 2014, 12:44:07 schrieb Thomas Jarosch:
> > > Hello,
> > >
> > > we're in the process of updating production level machines
> > > from kernel 3.4.101 to kernel 3.14.25. On one mail server
> > > we noticed that emails destined for an IPSec tunnel sometimes
> > > get stuck in the mail queue with TCP timeouts.
> > >
> > > To make a long story short: When the VPN connection is initially
> > > set up or re-newed, the path MTU for the xfrm tunnel is undetermined.
> > >
> > > As soon as a TCP client starts to send large packets,
> > > it triggers path MTU detection. Some middlebox on the
> > > way to the final server has a lower MTU and sends back
> > > an "ICMP fragmentation needed" packet as normal.
> > >
> > > With the old kernel, the packet size for the TCP connection inside
> > > the xfrm tunnel gets adjusted and all is fine. With kernel v3.12+,
> > > the connection stalls completely. Same thing with kernel v3.18-rc6.
> >
> > We see something similar with real nic (RTL8139). In our case only the
> > first tcp-connection which triggers PMTU stalls. Later tcp-connections
> > then work fine.
> >
> > I will revert that patch and see if that fixes the problem.
>
> Reverting the commit fixes the problem here, too.
>
> > > We wrote a small tool to mimic postfix's TCP behavior (see attached
> > > file).
> > > In the end it's a normal TCP client sending large packets.
> > > The server side is just "socat - tcp4-listen:667".
> > >
> > > If you run "socket_client" a second time, the path MTU
> > > for the xfrm tunnel is already known and packets flow normal, too.
> > >
> > >
> > > The "evil" commit in question is this one:
> > > ---------------------------------------------------------------------
> > > commit 8f26fb1c1ed81c33f5d87c5936f4d9d1b4118918
> > > Author: Eric Dumazet <edumazet@google.com>
> > > Date: Tue Oct 15 12:24:54 2013 -0700
> > >
> > > tcp: remove the sk_can_gso() check from tcp_set_skb_tso_segs()
> > >
> > > sk_can_gso() should only be used as a hint in tcp_sendmsg() to build
> > > GSO
> > >
> > > packets in the first place. (As a performance hint)
> > >
> > > Once we have GSO packets in write queue, we can not decide they are
> > > no
> > > longer GSO only because flow now uses a route which doesn't handle
> > > TSO/GSO.
> > >
> > > Core networking stack handles the case very well for us, all we need
> > > is keeping track of packet counts in MSS terms, regardless of
> > > segmentation done later (in GSO or hardware)
> > >
> > > Right now, if tcp_fragment() splits a GSO packet in two parts,
> > > @left and @right, and route changed through a non GSO device,
> > > both @left and @right have pcount set to 1, which is wrong,
> > > and leads to incorrect packet_count tracking.
> > >
> > > This problem was added in commit d5ac99a648 ("[TCP]: skb pcount with
> > > MTU
> > >
> > > discovery")
> > >
> > > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > > Signed-off-by: Neal Cardwell <ncardwell@google.com>
> > > Signed-off-by: Yuchung Cheng <ycheng@google.com>
> > > Reported-by: Maciej Żenczykowski <maze@google.com>
> > > Signed-off-by: David S. Miller <davem@davemloft.net>
> > >
> > > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
> > > index 8fad1c1..d46f214 100644
> > > --- a/net/ipv4/tcp_output.c
> > > +++ b/net/ipv4/tcp_output.c
> > > @@ -989,8 +989,7 @@ static void tcp_set_skb_tso_segs(const struct sock
> > > *sk,
> > > struct sk_buff *skb, /* Make sure we own this skb before messing
> > > gso_size/gso_segs */ WARN_ON_ONCE(skb_cloned(skb));
> > >
> > > - if (skb->len <= mss_now || !sk_can_gso(sk) ||
> > > - skb->ip_summed == CHECKSUM_NONE) {
> > > + if (skb->len <= mss_now || skb->ip_summed == CHECKSUM_NONE) {
> > >
> > > /* Avoid the costly divide in the normal
> > >
> > > * non-TSO case.
> > > */
> > >
> > > ---------------------------------------------------------------------
> > >
> > > When I revert it, even kernel v3.18-rc6 starts working.
> > > But I doubt this is the root problem, may be just hiding another issue.
> > >
> > > --- Sample output of socket_client using vanilla v3.12 kernel ---
> > > [1417258063 SEND result: 4096, strerror: Success]
> > > tcp max seg: res: 0, max_seg: 1370
> > > [1417258063 SEND result: 4096, strerror: Success]
> > > tcp max seg: res: 0, max_seg: 1370
> > > [1417258063 SEND result: 4096, strerror: Success]
> > > tcp max seg: res: 0, max_seg: 1370
> > > [1417258063 SEND result: 4096, strerror: Success]
> > > tcp max seg: res: 0, max_seg: 1370
> > > [1417258063 SEND result: 4096, strerror: Success]
> > > tcp max seg: res: 0, max_seg: 1338
> > > [1417258063 SEND result: 4096, strerror: Success]
> > > tcp max seg: res: 0, max_seg: 1338
> > > *STUCK*
> > > --------------------------------------------------------
> > >
> > > The "machine" is running on KVM and using "virtio_net" as NIC driver.
> > > I've played with the ethtool offload settings:
> > >
> > > *** eth1 defaults ***
> > > Offload parameters for eth1:
> > > rx-checksumming: on
> > > tx-checksumming: on
> > > scatter-gather: on
> > > tcp-segmentation-offload: on
> > > udp-fragmentation-offload: on
> > > generic-segmentation-offload: on
> > > generic-receive-offload: on
> > > large-receive-offload: off
> > >
> > > *** eth1 working (no stalls) using vanilla kernel ***
> > > Offload parameters for eth1:
> > > rx-checksumming: on
> > > tx-checksumming: off <-- the magic switch
> > > scatter-gather: off
> > > tcp-segmentation-offload: off
> > > udp-fragmentation-offload: off
> > > generic-segmentation-offload: off
> > > generic-receive-offload: off
> > > large-receive-offload: off
> > >
> > > When I turn "tx-checksumming" back on, it fails again.
> > > Though that is probably also just a side effect.
> > >
> > > I can provide tcpdumps if needed but they are no real help
> > > since you can just see the kernel stops sending TCP packets.
> > > (and the outgoing TCP packets are encrypted in ESP packets)
> > >
> > >
> > > Any vague idea what might be the root cause?
> > >
> > > I also tried reverting commit 4d53eff48b5f03ce67f4f301d6acca1d2145cb7a
> > > ("xfrm: Don't queue retransmitted packets if the original is still on
> > > the
> > > host") but that didn't change the situation. In fact it wasn't even
> > > triggered.
> > >
> > > Please CC: comments. Thanks.
> > >
> > > Best regards,
> > > Thomas
> >
> > Regards,
>
> Regards,
Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.12+
2014-12-05 12:09 ` Wolfgang Walter
@ 2014-12-05 13:26 ` Eric Dumazet
2014-12-08 22:20 ` Wolfgang Walter
0 siblings, 1 reply; 40+ messages in thread
From: Eric Dumazet @ 2014-12-05 13:26 UTC (permalink / raw)
To: Wolfgang Walter
Cc: netdev, Thomas Jarosch, Eric Dumazet, Herbert Xu,
Steffen Klassert
On Fri, 2014-12-05 at 13:09 +0100, Wolfgang Walter wrote:
> Hello,
>
> as reverting this patch fixes this rather annoying problem: is it dangerous to
> revert it as a workaround until the root cause is found?
>
Unfortunately no, this patch fixes a serious issue.
We need to find the root cause of your problem instead of trying to work
around it.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.12+
2014-12-05 13:26 ` Eric Dumazet
@ 2014-12-08 22:20 ` Wolfgang Walter
2014-12-09 8:54 ` Thomas Jarosch
0 siblings, 1 reply; 40+ messages in thread
From: Wolfgang Walter @ 2014-12-08 22:20 UTC (permalink / raw)
To: Eric Dumazet
Cc: netdev, Thomas Jarosch, Eric Dumazet, Herbert Xu,
Steffen Klassert
Am Freitag, 5. Dezember 2014, 05:26:25 schrieb Eric Dumazet:
> On Fri, 2014-12-05 at 13:09 +0100, Wolfgang Walter wrote:
> > Hello,
> >
> > as reverting this patch fixes this rather annoying problem: is it
> > dangerous to revert it as a workaround until the root cause is found?
>
> Unfortunately no, this patch fixes a serious issue.
>
> We need to find the root cause of your problem instead of trying to work
> around it.
>
I only wanted to use it as local workaround here.
I looked a bit at at code. I'm not familiar with the network code, though :-).
When exactly should the gso handling with esp tunnel mode happen?
esp_output_done() calls xfrm_output_resume() and no gso handling is done.
Would be to late anyway.
esp_output() does not handle gso.
xfrm4_mode_tunnel_output() does not do any gso handling as far as I can see.
xfrm_output() (from net/xfrm/xfrm_output) does but is it called with esp
tunnel mode?
Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.12+
2014-12-08 22:20 ` Wolfgang Walter
@ 2014-12-09 8:54 ` Thomas Jarosch
2014-12-09 14:26 ` [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3 Eric Dumazet
0 siblings, 1 reply; 40+ messages in thread
From: Thomas Jarosch @ 2014-12-09 8:54 UTC (permalink / raw)
To: Wolfgang Walter
Cc: Eric Dumazet, netdev, Eric Dumazet, Herbert Xu, Steffen Klassert
On Monday, 8. December 2014 23:20:42 Wolfgang Walter wrote:
> Am Freitag, 5. Dezember 2014, 05:26:25 schrieb Eric Dumazet:
> > On Fri, 2014-12-05 at 13:09 +0100, Wolfgang Walter wrote:
> > > Hello,
> > >
> > > as reverting this patch fixes this rather annoying problem: is it
> > > dangerous to revert it as a workaround until the root cause is found?
> >
> > Unfortunately no, this patch fixes a serious issue.
> >
> > We need to find the root cause of your problem instead of trying to work
> > around it.
>
> I only wanted to use it as local workaround here.
>
>
> I looked a bit at at code. I'm not familiar with the network code, though
> :-).
If it helps, I'm running the reverted patch on five production boxes hitherto
without a hiccup. As far as I understood the original commit message,
some packet counters might me wrong without it.
@Eric: What could possibly go wrong(tm)? :)
Of course a real fix is preferred over this band-aid.
Cheers,
Thomas
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.
2014-12-09 8:54 ` Thomas Jarosch
@ 2014-12-09 14:26 ` Eric Dumazet
2014-12-09 14:49 ` Thomas Jarosch
2014-12-09 20:36 ` Wolfgang Walter
0 siblings, 2 replies; 40+ messages in thread
From: Eric Dumazet @ 2014-12-09 14:26 UTC (permalink / raw)
To: Thomas Jarosch
Cc: Wolfgang Walter, netdev, Eric Dumazet, Herbert Xu,
Steffen Klassert
On Tue, 2014-12-09 at 09:54 +0100, Thomas Jarosch wrote:
> On Monday, 8. December 2014 23:20:42 Wolfgang Walter wrote:
> > Am Freitag, 5. Dezember 2014, 05:26:25 schrieb Eric Dumazet:
> > > On Fri, 2014-12-05 at 13:09 +0100, Wolfgang Walter wrote:
> > > > Hello,
> > > >
> > > > as reverting this patch fixes this rather annoying problem: is it
> > > > dangerous to revert it as a workaround until the root cause is found?
> > >
> > > Unfortunately no, this patch fixes a serious issue.
> > >
> > > We need to find the root cause of your problem instead of trying to work
> > > around it.
> >
> > I only wanted to use it as local workaround here.
> >
> >
> > I looked a bit at at code. I'm not familiar with the network code, though
> > :-).
>
> If it helps, I'm running the reverted patch on five production boxes hitherto
> without a hiccup. As far as I understood the original commit message,
> some packet counters might me wrong without it.
>
> @Eric: What could possibly go wrong(tm)? :)
Crashes in TCP stack, because of packet count mismatches.
The sk_can_gso() status is already tested in tcp_sendmsg() as a hint,
since path behavior can dynamically be changed on existing flow :
<start a TCP flow>
ethtool -K eth0 tso off gso off
In this case, core networking stack detects this and segments the
packets _after_ TCP or IP stack, before they reach eth0.
TCP stack does not have to know that something is changed right before
giving a GSO packet to core networking stack, this would be racy by
nature, as TCP does not know or control full path. Hopefully we do not
take RTNL for every packet we send in TCP !
It seems XFRM triggers in a slow path something which is not correctly
handled.
It is not correct to add a racy kludge in TCP fast path for this very
unlikely case.
I would disable TSO/GSO on xfrm, and problem should disappear.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: Re: [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.
2014-12-09 14:26 ` [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3 Eric Dumazet
@ 2014-12-09 14:49 ` Thomas Jarosch
2014-12-09 20:36 ` Wolfgang Walter
1 sibling, 0 replies; 40+ messages in thread
From: Thomas Jarosch @ 2014-12-09 14:49 UTC (permalink / raw)
To: Eric Dumazet
Cc: Wolfgang Walter, netdev, Eric Dumazet, Herbert Xu,
Steffen Klassert
On Tuesday, 9. December 2014 06:26:49 Eric Dumazet wrote:
> > If it helps, I'm running the reverted patch on five production boxes
> > hitherto without a hiccup. As far as I understood the original commit
> > message, some packet counters might me wrong without it.
> >
> > @Eric: What could possibly go wrong(tm)? :)
>
> Crashes in TCP stack, because of packet count mismatches.
alright, that sounds like a pretty good argument.
> ...
> I would disable TSO/GSO on xfrm, and problem should disappear.
I guess you can't explicitly disable this with the "ip xfrm" command?
Or do you mean this should be disabled on the ethX device
serving the xfrm connection?
We are about to push out this code to ten more machines,
so the best time (for me) to do any changes that increases
stability would be now :o)
Cheers,
Thomas
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.
2014-12-09 14:26 ` [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3 Eric Dumazet
2014-12-09 14:49 ` Thomas Jarosch
@ 2014-12-09 20:36 ` Wolfgang Walter
2014-12-09 21:40 ` Eric Dumazet
1 sibling, 1 reply; 40+ messages in thread
From: Wolfgang Walter @ 2014-12-09 20:36 UTC (permalink / raw)
To: Eric Dumazet
Cc: Thomas Jarosch, netdev, Eric Dumazet, Herbert Xu,
Steffen Klassert
Am Dienstag, 9. Dezember 2014, 06:26:49 schrieb Eric Dumazet:
> On Tue, 2014-12-09 at 09:54 +0100, Thomas Jarosch wrote:
> > On Monday, 8. December 2014 23:20:42 Wolfgang Walter wrote:
> > > Am Freitag, 5. Dezember 2014, 05:26:25 schrieb Eric Dumazet:
> > > > On Fri, 2014-12-05 at 13:09 +0100, Wolfgang Walter wrote:
> > > > > Hello,
> > > > >
> > > > > as reverting this patch fixes this rather annoying problem: is it
> > > > > dangerous to revert it as a workaround until the root cause is
> > > > > found?
> > > >
> > > > Unfortunately no, this patch fixes a serious issue.
> > > >
> > > > We need to find the root cause of your problem instead of trying to
> > > > work
> > > > around it.
> > >
> > > I only wanted to use it as local workaround here.
> > >
> > >
> > > I looked a bit at at code. I'm not familiar with the network code,
> > > though
> > >
> > > :-).
> >
> > If it helps, I'm running the reverted patch on five production boxes
> > hitherto without a hiccup. As far as I understood the original commit
> > message, some packet counters might me wrong without it.
> >
> > @Eric: What could possibly go wrong(tm)? :)
>
> Crashes in TCP stack, because of packet count mismatches.
>
> The sk_can_gso() status is already tested in tcp_sendmsg() as a hint,
> since path behavior can dynamically be changed on existing flow :
>
> <start a TCP flow>
> ethtool -K eth0 tso off gso off
>
> In this case, core networking stack detects this and segments the
> packets _after_ TCP or IP stack, before they reach eth0.
>
> TCP stack does not have to know that something is changed right before
> giving a GSO packet to core networking stack, this would be racy by
> nature, as TCP does not know or control full path. Hopefully we do not
> take RTNL for every packet we send in TCP !
>
> It seems XFRM triggers in a slow path something which is not correctly
> handled.
>
> It is not correct to add a racy kludge in TCP fast path for this very
> unlikely case.
>
> I would disable TSO/GSO on xfrm, and problem should disappear.
How would that be done? I found no way to disable it especially for xfrm. I
disabled gso for the interface which serves the ipsec traffic but this does
not help. tcp still uses gso for the esp tunnel.
I put a view printk's in net/xfrm/xfrm_output.c and net/ipv4/tcp_output.c. (I
try to understand where in the xfrm transformation gso is handeled).
What I can say yet is:
xfrm_output() is used with ipsec (esp) tunnel mode but at I never see gso
packets here. xfrm_output_gso() is never called.
Everytime tcp_set_skb_tso_segs() is called for a tcp connection over the esp-
tunnel and it is a gso case then the tcp connection hangs. Those packets
always have skb->len 1398 and mss_now is 1374. I see a call of xfrm_output()
afterwards but for a packet of skb->len 52 (maybe ACK from other direction?).
As long as the tcp-connection over the ipsec-tunnel works and if I send bulk
traffic xfrm_output() is called 3 times with packet skb->len 1426 and then one
time with 78 (maybe other direction?), don't know if that is of any interest.
With non-ipsec-traffic gso works fine: in this case the skb->len() varies a
lot and mss_now is always 1288.
Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.
2014-12-09 20:36 ` Wolfgang Walter
@ 2014-12-09 21:40 ` Eric Dumazet
2014-12-10 18:34 ` Wolfgang Walter
0 siblings, 1 reply; 40+ messages in thread
From: Eric Dumazet @ 2014-12-09 21:40 UTC (permalink / raw)
To: Wolfgang Walter
Cc: Thomas Jarosch, netdev, Eric Dumazet, Herbert Xu,
Steffen Klassert
On Tue, 2014-12-09 at 21:36 +0100, Wolfgang Walter wrote:
> How would that be done? I found no way to disable it especially for xfrm. I
> disabled gso for the interface which serves the ipsec traffic but this does
> not help. tcp still uses gso for the esp tunnel.
>
> I put a view printk's in net/xfrm/xfrm_output.c and net/ipv4/tcp_output.c. (I
> try to understand where in the xfrm transformation gso is handeled).
>
> What I can say yet is:
>
> xfrm_output() is used with ipsec (esp) tunnel mode but at I never see gso
> packets here. xfrm_output_gso() is never called.
>
> Everytime tcp_set_skb_tso_segs() is called for a tcp connection over the esp-
> tunnel and it is a gso case then the tcp connection hangs. Those packets
> always have skb->len 1398 and mss_now is 1374. I see a call of xfrm_output()
> afterwards but for a packet of skb->len 52 (maybe ACK from other direction?).
>
> As long as the tcp-connection over the ipsec-tunnel works and if I send bulk
> traffic xfrm_output() is called 3 times with packet skb->len 1426 and then one
> time with 78 (maybe other direction?), don't know if that is of any interest.
>
>
> With non-ipsec-traffic gso works fine: in this case the skb->len() varies a
> lot and mss_now is always 1288.
>
Presumably something happens so that sk_can_gso() returns false.
But apparently, this happens _after_ tcp_sendmsg(), ie a bit too late.
Note that after https://patchwork.ozlabs.org/patch/418506/ is applied,
fix would simply be :
diff --git a/net/core/sock.c b/net/core/sock.c
index 9a56b2000c3f374fb95aedada3327447816a9512..678ef8393680dc781445c5f121719ea8ea7bb7c1 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1585,6 +1585,8 @@ void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
sk->sk_gso_max_size = dst->dev->gso_max_size;
sk->sk_gso_max_segs = dst->dev->gso_max_segs;
}
+ } else {
+ sk->sk_gso_max_segs = 1;
}
}
EXPORT_SYMBOL_GPL(sk_setup_caps);
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.
2014-12-09 21:40 ` Eric Dumazet
@ 2014-12-10 18:34 ` Wolfgang Walter
2014-12-10 19:10 ` Eric Dumazet
0 siblings, 1 reply; 40+ messages in thread
From: Wolfgang Walter @ 2014-12-10 18:34 UTC (permalink / raw)
To: Eric Dumazet
Cc: Thomas Jarosch, netdev, Eric Dumazet, Herbert Xu,
Steffen Klassert
Am Dienstag, 9. Dezember 2014, 13:40:19 schrieb Eric Dumazet:
> On Tue, 2014-12-09 at 21:36 +0100, Wolfgang Walter wrote:
> > How would that be done? I found no way to disable it especially for xfrm.
> > I
> > disabled gso for the interface which serves the ipsec traffic but this
> > does
> > not help. tcp still uses gso for the esp tunnel.
> >
> > I put a view printk's in net/xfrm/xfrm_output.c and net/ipv4/tcp_output.c.
> > (I try to understand where in the xfrm transformation gso is handeled).
> >
> > What I can say yet is:
> >
> > xfrm_output() is used with ipsec (esp) tunnel mode but at I never see gso
> > packets here. xfrm_output_gso() is never called.
> >
> > Everytime tcp_set_skb_tso_segs() is called for a tcp connection over the
> > esp- tunnel and it is a gso case then the tcp connection hangs. Those
> > packets always have skb->len 1398 and mss_now is 1374. I see a call of
> > xfrm_output() afterwards but for a packet of skb->len 52 (maybe ACK from
> > other direction?).
> >
> > As long as the tcp-connection over the ipsec-tunnel works and if I send
> > bulk traffic xfrm_output() is called 3 times with packet skb->len 1426
> > and then one time with 78 (maybe other direction?), don't know if that is
> > of any interest.
> >
> >
> > With non-ipsec-traffic gso works fine: in this case the skb->len() varies
> > a
> > lot and mss_now is always 1288.
>
> Presumably something happens so that sk_can_gso() returns false.
>
> But apparently, this happens _after_ tcp_sendmsg(), ie a bit too late.
>
> Note that after https://patchwork.ozlabs.org/patch/418506/ is applied,
> fix would simply be :
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index
> 9a56b2000c3f374fb95aedada3327447816a9512..678ef8393680dc781445c5f121719ea8e
> a7bb7c1 100644 --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1585,6 +1585,8 @@ void sk_setup_caps(struct sock *sk, struct dst_entry
> *dst) sk->sk_gso_max_size = dst->dev->gso_max_size;
> sk->sk_gso_max_segs = dst->dev->gso_max_segs;
> }
> + } else {
> + sk->sk_gso_max_segs = 1;
> }
> }
> EXPORT_SYMBOL_GPL(sk_setup_caps);
Ok. I tried that. I backported https://patchwork.ozlabs.org/patch/418506/ to
3.14.26 and then applied this patch. Still hangs. At that stage sk_can_gso(sk)
is true. The reason is that the interface of *dst in sk_setup_caps() is the
interface which serves the ipsec traffic.
There is a difference, though. If I disable gso for the interface which serves
the ipsec traffic then the hang disappears now. This seems to be because of
the patch above: sk_can_gso(sk) in sk_setup_caps then returns false and the ne
else branch is taken. So there may be a hidden bug in the non-gso case.
Back to xfrm case with hang:
I see: a call of sk_setup_caps() which takes the path
sk->sk_route_caps |= NETIF_F_SG | NETIF_F_HW_CSUM;
...
So gso is on. When the hang happens sk_setup_caps is called from
inet_sk_rebuild_header(). Now the path
sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
is taken as dst->header_len is now non zero.
This is the reason why later calls of sk_can_gso() return false.
I'll try to change the above patch to
@@ -1585,6 +1585,8 @@ void sk_setup_caps(struct sock *sk, struct dst_entry
*dst) sk->sk_gso_max_size = dst->dev->gso_max_size;
sk->sk_gso_max_segs = dst->dev->gso_max_segs;
}
}
+ if (sk_can_gso(sk)) {
+ sk->sk_gso_max_segs = 1;
}
}
EXPORT_SYMBOL_GPL(sk_setup_caps);
so that the case that GSO is disabled because of dst->header_len != 0 sets
sk_gso_max_segs, too.
Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.
2014-12-10 18:34 ` Wolfgang Walter
@ 2014-12-10 19:10 ` Eric Dumazet
2014-12-11 0:36 ` Wolfgang Walter
2014-12-12 16:58 ` Wolfgang Walter
0 siblings, 2 replies; 40+ messages in thread
From: Eric Dumazet @ 2014-12-10 19:10 UTC (permalink / raw)
To: Wolfgang Walter
Cc: Thomas Jarosch, netdev, Eric Dumazet, Herbert Xu,
Steffen Klassert
On Wed, 2014-12-10 at 19:34 +0100, Wolfgang Walter wrote:
> So gso is on. When the hang happens sk_setup_caps is called from
> inet_sk_rebuild_header(). Now the path
>
> sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
>
> is taken as dst->header_len is now non zero.
>
> This is the reason why later calls of sk_can_gso() return false.
>
> I'll try to change the above patch to
>
> @@ -1585,6 +1585,8 @@ void sk_setup_caps(struct sock *sk, struct dst_entry
> *dst) sk->sk_gso_max_size = dst->dev->gso_max_size;
> sk->sk_gso_max_segs = dst->dev->gso_max_segs;
> }
> }
> + if (sk_can_gso(sk)) {
> + sk->sk_gso_max_segs = 1;
> }
> }
> EXPORT_SYMBOL_GPL(sk_setup_caps);
>
> so that the case that GSO is disabled because of dst->header_len != 0 sets
> sk_gso_max_segs, too.
Sounds good, or maybe simply :
diff --git a/net/core/sock.c b/net/core/sock.c
index 9a56b2000c3f374fb95aedada3327447816a9512..edca31319dfee57cabe5b376505324bea07f767a 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1577,6 +1577,7 @@ void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
if (sk->sk_route_caps & NETIF_F_GSO)
sk->sk_route_caps |= NETIF_F_GSO_SOFTWARE;
sk->sk_route_caps &= ~sk->sk_route_nocaps;
+ sk->sk_gso_max_segs = 1;
if (sk_can_gso(sk)) {
if (dst->header_len) {
sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.
2014-12-10 19:10 ` Eric Dumazet
@ 2014-12-11 0:36 ` Wolfgang Walter
2014-12-12 16:58 ` Wolfgang Walter
1 sibling, 0 replies; 40+ messages in thread
From: Wolfgang Walter @ 2014-12-11 0:36 UTC (permalink / raw)
To: Eric Dumazet
Cc: Thomas Jarosch, netdev, Eric Dumazet, Herbert Xu,
Steffen Klassert
Am Mittwoch, 10. Dezember 2014, 11:10:03 schrieb Eric Dumazet:
> On Wed, 2014-12-10 at 19:34 +0100, Wolfgang Walter wrote:
> > So gso is on. When the hang happens sk_setup_caps is called from
> > inet_sk_rebuild_header(). Now the path
> >
> > sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
> >
> > is taken as dst->header_len is now non zero.
> >
> > This is the reason why later calls of sk_can_gso() return false.
> >
> > I'll try to change the above patch to
> >
> > @@ -1585,6 +1585,8 @@ void sk_setup_caps(struct sock *sk, struct dst_entry
> > *dst) sk->sk_gso_max_size = dst->dev->gso_max_size;
> >
> > sk->sk_gso_max_segs = dst->dev->gso_max_segs;
> >
> > }
> >
> > }
> >
> > + if (sk_can_gso(sk)) {
> > + sk->sk_gso_max_segs = 1;
> >
> > }
> >
> > }
> > EXPORT_SYMBOL_GPL(sk_setup_caps);
> >
> > so that the case that GSO is disabled because of dst->header_len != 0 sets
> > sk_gso_max_segs, too.
>
> Sounds good, or maybe simply :
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index
> 9a56b2000c3f374fb95aedada3327447816a9512..edca31319dfee57cabe5b376505324bea
> 07f767a 100644 --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1577,6 +1577,7 @@ void sk_setup_caps(struct sock *sk, struct dst_entry
> *dst) if (sk->sk_route_caps & NETIF_F_GSO)
> sk->sk_route_caps |= NETIF_F_GSO_SOFTWARE;
> sk->sk_route_caps &= ~sk->sk_route_nocaps;
> + sk->sk_gso_max_segs = 1;
> if (sk_can_gso(sk)) {
> if (dst->header_len) {
> sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
Yes, true.
This fixes the issue. That means no hangs any more even if gso is enabled on
the device serving the ipsec traffic.
As said I could not apply https://patchwork.ozlabs.org/patch/418506/
unmodified to 3.14. This is because the following patch is missing in v.3.14:
commit e114a710aa5058c0ba4aa1dfb105132aefeb5e04
Author: Eric Dumazet <edumazet@google.com>
Date: Wed Apr 30 11:58:13 2014 -0700
tcp: fix cwnd limited checking to improve congestion control
(and the follow-up 249015515fe3fc9818d86cb5c83bbc92505ad7dc).
I'll next try to apply these patches on v3.14.26, see if the other patch then
applies cleanly, add that one-liner - and hopefully it will work.
Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.
2014-12-10 19:10 ` Eric Dumazet
2014-12-11 0:36 ` Wolfgang Walter
@ 2014-12-12 16:58 ` Wolfgang Walter
2014-12-12 17:27 ` Eric Dumazet
1 sibling, 1 reply; 40+ messages in thread
From: Wolfgang Walter @ 2014-12-12 16:58 UTC (permalink / raw)
To: Eric Dumazet
Cc: Thomas Jarosch, netdev, Eric Dumazet, Herbert Xu,
Steffen Klassert
[-- Attachment #1: Type: text/plain, Size: 3614 bytes --]
Hello Eric,
my plan to use e114a710aa5058c0ba4aa1dfb105132aefeb5e04 to make it easier to apply tcp-refine-TSO-autosizing was a bad idea as that needs further patches :-). Backporting
tcp-refine-TSO-autosizing
https://patchwork.ozlabs.org/patch/418506/
seems to be the easier solution.
I tested if 3.18 and 3.17.6 are affected and they are. I applied tcp-refine-TSO-autosizing and the one-liner below and then both work fine.
Attached is
1) my backport of tcp-refine-TSO-autosizing for 3.14.26
2) my backport of tcp-refine-TSO-autosizing for 3.17.6
For 3.18 tcp-refine-TSO-autosizing just applies fine.
I tested 3.16.26 + patches, 3.17.6 + patches and 3.18.2 + patches.
And here again the one-liner:
====================
Subject: tcp: ensure that sk_gso_max_segs = 1 if GSO is disabled
sk_setup_caps() may disable GSO even if the network device
supports it. This may happens after the creation time of the
socket (e.g. triggered by xfrm, PMTU or routing changes).
This fixes hangs of local tcp connections over ipsec tunnels where
pmtu is lower than the mtu of the interface.
---
net/core/sock.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/net/core/sock.c b/net/core/sock.c
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1559,6 +1559,7 @@ void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
if (sk->sk_route_caps & NETIF_F_GSO)
sk->sk_route_caps |= NETIF_F_GSO_SOFTWARE;
sk->sk_route_caps &= ~sk->sk_route_nocaps;
+ sk->sk_gso_max_segs = 1;
if (sk_can_gso(sk)) {
if (dst->header_len) {
sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
--
1.7.10.4
=======================
Do you think the backports are ok? Then Thomas my test them, too.
What should be done now?
First the one line should go into 3.19 as soon as tcp-refine-TSO-autosizing
arrived there?
If both are in Torvald's who should I ask for inclusion into stable?
Am Mittwoch, 10. Dezember 2014, 11:10:03 schrieb Eric Dumazet:
> On Wed, 2014-12-10 at 19:34 +0100, Wolfgang Walter wrote:
> > So gso is on. When the hang happens sk_setup_caps is called from
> > inet_sk_rebuild_header(). Now the path
> >
> > sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
> >
> > is taken as dst->header_len is now non zero.
> >
> > This is the reason why later calls of sk_can_gso() return false.
> >
> > I'll try to change the above patch to
> >
> > @@ -1585,6 +1585,8 @@ void sk_setup_caps(struct sock *sk, struct dst_entry
> > *dst) sk->sk_gso_max_size = dst->dev->gso_max_size;
> >
> > sk->sk_gso_max_segs = dst->dev->gso_max_segs;
> >
> > }
> >
> > }
> >
> > + if (sk_can_gso(sk)) {
> > + sk->sk_gso_max_segs = 1;
> >
> > }
> >
> > }
> > EXPORT_SYMBOL_GPL(sk_setup_caps);
> >
> > so that the case that GSO is disabled because of dst->header_len != 0 sets
> > sk_gso_max_segs, too.
>
> Sounds good, or maybe simply :
>
> diff --git a/net/core/sock.c b/net/core/sock.c
> index
> 9a56b2000c3f374fb95aedada3327447816a9512..edca31319dfee57cabe5b376505324bea
> 07f767a 100644 --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -1577,6 +1577,7 @@ void sk_setup_caps(struct sock *sk, struct dst_entry
> *dst) if (sk->sk_route_caps & NETIF_F_GSO)
> sk->sk_route_caps |= NETIF_F_GSO_SOFTWARE;
> sk->sk_route_caps &= ~sk->sk_route_nocaps;
> + sk->sk_gso_max_segs = 1;
> if (sk_can_gso(sk)) {
> if (dst->header_len) {
> sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
Regards and thank you very much for your help,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
[-- Attachment #2: 0002-tcp-refine-TSO-autosizing.patch --]
[-- Type: text/x-patch, Size: 7394 bytes --]
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -134,7 +134,7 @@ struct tcp_sock {
/* inet_connection_sock has to be the first member of tcp_sock */
struct inet_connection_sock inet_conn;
u16 tcp_header_len; /* Bytes of tcp header to send */
- u16 xmit_size_goal_segs; /* Goal for segmenting output packets */
+ u16 gso_segs; /* Max number of segs per GSO packet */
/*
* Header prediction flags
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -825,47 +825,29 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
int large_allowed)
{
struct tcp_sock *tp = tcp_sk(sk);
- u32 xmit_size_goal, old_size_goal;
-
- xmit_size_goal = mss_now;
-
- if (large_allowed && sk_can_gso(sk)) {
- u32 gso_size, hlen;
-
- /* Maybe we should/could use sk->sk_prot->max_header here ? */
- hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
- inet_csk(sk)->icsk_ext_hdr_len +
- tp->tcp_header_len;
-
- /* Goal is to send at least one packet per ms,
- * not one big TSO packet every 100 ms.
- * This preserves ACK clocking and is consistent
- * with tcp_tso_should_defer() heuristic.
- */
- gso_size = sk->sk_pacing_rate / (2 * MSEC_PER_SEC);
- gso_size = max_t(u32, gso_size,
- sysctl_tcp_min_tso_segs * mss_now);
-
- xmit_size_goal = min_t(u32, gso_size,
- sk->sk_gso_max_size - 1 - hlen);
-
- xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
-
- /* We try hard to avoid divides here */
- old_size_goal = tp->xmit_size_goal_segs * mss_now;
-
- if (likely(old_size_goal <= xmit_size_goal &&
- old_size_goal + mss_now > xmit_size_goal)) {
- xmit_size_goal = old_size_goal;
- } else {
- tp->xmit_size_goal_segs =
- min_t(u16, xmit_size_goal / mss_now,
- sk->sk_gso_max_segs);
- xmit_size_goal = tp->xmit_size_goal_segs * mss_now;
- }
+ u32 new_size_goal, size_goal, hlen;
+
+ if (!large_allowed || !sk_can_gso(sk))
+ return mss_now;
+
+ /* Maybe we should/could use sk->sk_prot->max_header here ? */
+ hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
+ inet_csk(sk)->icsk_ext_hdr_len +
+ tp->tcp_header_len;
+
+ new_size_goal = sk->sk_gso_max_size - 1 - hlen;
+ new_size_goal = tcp_bound_to_half_wnd(tp, new_size_goal);
+
+ /* We try hard to avoid divides here */
+ size_goal = tp->gso_segs * mss_now;
+ if (unlikely(new_size_goal < size_goal ||
+ new_size_goal >= size_goal + mss_now)) {
+ tp->gso_segs = min_t(u16, new_size_goal / mss_now,
+ sk->sk_gso_max_segs);
+ size_goal = tp->gso_segs * mss_now;
}
- return max(xmit_size_goal, mss_now);
+ return max(size_goal, mss_now);
}
static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -290,7 +290,7 @@ bool tcp_is_cwnd_limited(const struct sock *sk, u32 in_flight)
left = tp->snd_cwnd - in_flight;
if (sk_can_gso(sk) &&
left * sysctl_tcp_tso_win_divisor < tp->snd_cwnd &&
- left < tp->xmit_size_goal_segs)
+ left < min_t(u32, tp->gso_segs, sk->sk_gso_max_segs))
return true;
return left <= tcp_max_tso_deferred_mss(tp);
}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1432,6 +1432,27 @@ static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
((nonagle & TCP_NAGLE_CORK) ||
(!nonagle && tp->packets_out && tcp_minshall_check(tp)));
}
+
+/* Return how many segs we'd like on a TSO packet,
+ * to send one TSO packet per ms
+ */
+static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now)
+{
+ u32 bytes, segs;
+
+ bytes = min(sk->sk_pacing_rate >> 10,
+ sk->sk_gso_max_size - 1 - MAX_TCP_HEADER);
+
+ /* Goal is to send at least one packet per ms,
+ * not one big TSO packet every 100 ms.
+ * This preserves ACK clocking and is consistent
+ * with tcp_tso_should_defer() heuristic.
+ */
+ segs = max_t(u32, bytes / mss_now, sysctl_tcp_min_tso_segs);
+
+ return min_t(u32, segs, sk->sk_gso_max_segs);
+}
+
/* Returns the portion of skb which can be sent right away */
static unsigned int tcp_mss_split_point(const struct sock *sk,
const struct sk_buff *skb,
@@ -1633,7 +1654,7 @@ static int tso_fragment(struct sock *sk, struct sk_buff *skb, unsigned int len,
*
* This algorithm is from John Heffner.
*/
-static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
+static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb, u32 max_segs)
{
struct tcp_sock *tp = tcp_sk(sk);
const struct inet_connection_sock *icsk = inet_csk(sk);
@@ -1663,8 +1684,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb)
limit = min(send_win, cong_win);
/* If a full-sized TSO skb can be sent, do it. */
- if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
- tp->xmit_size_goal_segs * tp->mss_cache))
+ if (limit >= max_segs * tp->mss_cache)
goto send_now;
/* Middle in queue won't get any more data, full sendable already? */
@@ -1857,6 +1877,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
unsigned int tso_segs, sent_pkts;
int cwnd_quota;
int result;
+ u32 max_segs;
sent_pkts = 0;
@@ -1870,6 +1891,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
}
}
+ max_segs = tcp_tso_autosize(sk, mss_now);
while ((skb = tcp_send_head(sk))) {
unsigned int limit;
@@ -1900,10 +1922,22 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
nonagle : TCP_NAGLE_PUSH))))
break;
} else {
- if (!push_one && tcp_tso_should_defer(sk, skb))
+ if (!push_one && tcp_tso_should_defer(sk, skb, max_segs))
break;
}
+ limit = mss_now;
+ if (tso_segs > 1 && !tcp_urg_mode(tp))
+ limit = tcp_mss_split_point(sk, skb, mss_now,
+ min_t(unsigned int,
+ cwnd_quota,
+ max_segs),
+ nonagle);
+
+ if (skb->len > limit &&
+ unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
+ break;
+
/* TCP Small Queues :
* Control number of packets in qdisc/devices to two packets / or ~1 ms.
* This allows for :
@@ -1914,8 +1948,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
* of queued bytes to ensure line rate.
* One example is wifi aggregation (802.11 AMPDU)
*/
- limit = max_t(unsigned int, sysctl_tcp_limit_output_bytes,
- sk->sk_pacing_rate >> 10);
+ limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+ limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
if (atomic_read(&sk->sk_wmem_alloc) > limit) {
set_bit(TSQ_THROTTLED, &tp->tsq_flags);
@@ -1930,18 +1964,6 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
break;
}
- limit = mss_now;
- if (tso_segs > 1 && !tcp_urg_mode(tp))
- limit = tcp_mss_split_point(sk, skb, mss_now,
- min_t(unsigned int,
- cwnd_quota,
- sk->sk_gso_max_segs),
- nonagle);
-
- if (skb->len > limit &&
- unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
- break;
-
TCP_SKB_CB(skb)->when = tcp_time_stamp;
if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
--
1.7.10.4
[-- Attachment #3: 0001-tcp-refine-TSO-autosizing.patch --]
[-- Type: text/x-patch, Size: 6898 bytes --]
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -131,7 +131,7 @@ struct tcp_sock {
/* inet_connection_sock has to be the first member of tcp_sock */
struct inet_connection_sock inet_conn;
u16 tcp_header_len; /* Bytes of tcp header to send */
- u16 xmit_size_goal_segs; /* Goal for segmenting output packets */
+ u16 gso_segs; /* Max number of segs per GSO packet */
/*
* Header prediction flags
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -836,47 +836,29 @@ static unsigned int tcp_xmit_size_goal(struct sock *sk, u32 mss_now,
int large_allowed)
{
struct tcp_sock *tp = tcp_sk(sk);
- u32 xmit_size_goal, old_size_goal;
-
- xmit_size_goal = mss_now;
-
- if (large_allowed && sk_can_gso(sk)) {
- u32 gso_size, hlen;
-
- /* Maybe we should/could use sk->sk_prot->max_header here ? */
- hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
- inet_csk(sk)->icsk_ext_hdr_len +
- tp->tcp_header_len;
-
- /* Goal is to send at least one packet per ms,
- * not one big TSO packet every 100 ms.
- * This preserves ACK clocking and is consistent
- * with tcp_tso_should_defer() heuristic.
- */
- gso_size = sk->sk_pacing_rate / (2 * MSEC_PER_SEC);
- gso_size = max_t(u32, gso_size,
- sysctl_tcp_min_tso_segs * mss_now);
-
- xmit_size_goal = min_t(u32, gso_size,
- sk->sk_gso_max_size - 1 - hlen);
-
- xmit_size_goal = tcp_bound_to_half_wnd(tp, xmit_size_goal);
-
- /* We try hard to avoid divides here */
- old_size_goal = tp->xmit_size_goal_segs * mss_now;
-
- if (likely(old_size_goal <= xmit_size_goal &&
- old_size_goal + mss_now > xmit_size_goal)) {
- xmit_size_goal = old_size_goal;
- } else {
- tp->xmit_size_goal_segs =
- min_t(u16, xmit_size_goal / mss_now,
- sk->sk_gso_max_segs);
- xmit_size_goal = tp->xmit_size_goal_segs * mss_now;
- }
+ u32 new_size_goal, size_goal, hlen;
+
+ if (!large_allowed || !sk_can_gso(sk))
+ return mss_now;
+
+ /* Maybe we should/could use sk->sk_prot->max_header here ? */
+ hlen = inet_csk(sk)->icsk_af_ops->net_header_len +
+ inet_csk(sk)->icsk_ext_hdr_len +
+ tp->tcp_header_len;
+
+ new_size_goal = sk->sk_gso_max_size - 1 - hlen;
+ new_size_goal = tcp_bound_to_half_wnd(tp, new_size_goal);
+
+ /* We try hard to avoid divides here */
+ size_goal = tp->gso_segs * mss_now;
+ if (unlikely(new_size_goal < size_goal ||
+ new_size_goal >= size_goal + mss_now)) {
+ tp->gso_segs = min_t(u16, new_size_goal / mss_now,
+ sk->sk_gso_max_segs);
+ size_goal = tp->gso_segs * mss_now;
}
- return max(xmit_size_goal, mss_now);
+ return max(size_goal, mss_now);
}
static int tcp_send_mss(struct sock *sk, int *size_goal, int flags)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1484,6 +1484,27 @@ static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
((nonagle & TCP_NAGLE_CORK) ||
(!nonagle && tp->packets_out && tcp_minshall_check(tp)));
}
+
+/* Return how many segs we'd like on a TSO packet,
+ * to send one TSO packet per ms
+ */
+static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now)
+{
+ u32 bytes, segs;
+
+ bytes = min(sk->sk_pacing_rate >> 10,
+ sk->sk_gso_max_size - 1 - MAX_TCP_HEADER);
+
+ /* Goal is to send at least one packet per ms,
+ * not one big TSO packet every 100 ms.
+ * This preserves ACK clocking and is consistent
+ * with tcp_tso_should_defer() heuristic.
+ */
+ segs = max_t(u32, bytes / mss_now, sysctl_tcp_min_tso_segs);
+
+ return min_t(u32, segs, sk->sk_gso_max_segs);
+}
+
/* Returns the portion of skb which can be sent right away */
static unsigned int tcp_mss_split_point(const struct sock *sk,
const struct sk_buff *skb,
@@ -1687,7 +1708,7 @@ static int tso_fragment(struct sock *sk, struct sk_buff *skb, unsigned int len,
* This algorithm is from John Heffner.
*/
static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
- bool *is_cwnd_limited)
+ bool *is_cwnd_limited, u32 max_segs)
{
struct tcp_sock *tp = tcp_sk(sk);
const struct inet_connection_sock *icsk = inet_csk(sk);
@@ -1717,8 +1738,7 @@ static bool tcp_tso_should_defer(struct sock *sk, struct sk_buff *skb,
limit = min(send_win, cong_win);
/* If a full-sized TSO skb can be sent, do it. */
- if (limit >= min_t(unsigned int, sk->sk_gso_max_size,
- tp->xmit_size_goal_segs * tp->mss_cache))
+ if (limit >= max_segs * tp->mss_cache)
goto send_now;
/* Middle in queue won't get any more data, full sendable already? */
@@ -1915,6 +1935,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
int cwnd_quota;
int result;
bool is_cwnd_limited = false;
+ u32 max_segs;
sent_pkts = 0;
@@ -1928,6 +1949,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
}
}
+ max_segs = tcp_tso_autosize(sk, mss_now);
while ((skb = tcp_send_head(sk))) {
unsigned int limit;
@@ -1960,10 +1982,23 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
break;
} else {
if (!push_one &&
- tcp_tso_should_defer(sk, skb, &is_cwnd_limited))
+ tcp_tso_should_defer(sk, skb, &is_cwnd_limited,
+ max_segs))
break;
}
+ limit = mss_now;
+ if (tso_segs > 1 && !tcp_urg_mode(tp))
+ limit = tcp_mss_split_point(sk, skb, mss_now,
+ min_t(unsigned int,
+ cwnd_quota,
+ max_segs),
+ nonagle);
+
+ if (skb->len > limit &&
+ unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
+ break;
+
/* TCP Small Queues :
* Control number of packets in qdisc/devices to two packets / or ~1 ms.
* This allows for :
@@ -1974,8 +2009,8 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
* of queued bytes to ensure line rate.
* One example is wifi aggregation (802.11 AMPDU)
*/
- limit = max_t(unsigned int, sysctl_tcp_limit_output_bytes,
- sk->sk_pacing_rate >> 10);
+ limit = max(2 * skb->truesize, sk->sk_pacing_rate >> 10);
+ limit = min_t(u32, limit, sysctl_tcp_limit_output_bytes);
if (atomic_read(&sk->sk_wmem_alloc) > limit) {
set_bit(TSQ_THROTTLED, &tp->tsq_flags);
@@ -1988,18 +2023,6 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
break;
}
- limit = mss_now;
- if (tso_segs > 1 && !tcp_urg_mode(tp))
- limit = tcp_mss_split_point(sk, skb, mss_now,
- min_t(unsigned int,
- cwnd_quota,
- sk->sk_gso_max_segs),
- nonagle);
-
- if (skb->len > limit &&
- unlikely(tso_fragment(sk, skb, limit, mss_now, gfp)))
- break;
-
TCP_SKB_CB(skb)->when = tcp_time_stamp;
if (unlikely(tcp_transmit_skb(sk, skb, 1, gfp)))
--
1.7.10.4
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.
2014-12-12 16:58 ` Wolfgang Walter
@ 2014-12-12 17:27 ` Eric Dumazet
2014-12-12 20:31 ` Wolfgang Walter
0 siblings, 1 reply; 40+ messages in thread
From: Eric Dumazet @ 2014-12-12 17:27 UTC (permalink / raw)
To: Wolfgang Walter
Cc: Thomas Jarosch, netdev, Eric Dumazet, Herbert Xu,
Steffen Klassert
On Fri, 2014-12-12 at 17:58 +0100, Wolfgang Walter wrote:
> This fixes hangs of local tcp connections over ipsec tunnels where
> pmtu is lower than the mtu of the interface.
Again this is a work around, the one liner patch is not a clean patch.
Something is wrong, we should fix GSO path instead.
Why ? Because we can queue a GSO packet in a qdisc, then later call
sk_setup_caps() (too late)
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.
2014-12-12 17:27 ` Eric Dumazet
@ 2014-12-12 20:31 ` Wolfgang Walter
2014-12-12 21:30 ` Thomas Jarosch
0 siblings, 1 reply; 40+ messages in thread
From: Wolfgang Walter @ 2014-12-12 20:31 UTC (permalink / raw)
To: Eric Dumazet
Cc: Thomas Jarosch, netdev, Eric Dumazet, Herbert Xu,
Steffen Klassert
Am Freitag, 12. Dezember 2014, 09:27:05 schrieb Eric Dumazet:
> On Fri, 2014-12-12 at 17:58 +0100, Wolfgang Walter wrote:
> > This fixes hangs of local tcp connections over ipsec tunnels where
> > pmtu is lower than the mtu of the interface.
>
> Again this is a work around, the one liner patch is not a clean patch.
>
> Something is wrong, we should fix GSO path instead.
>
> Why ? Because we can queue a GSO packet in a qdisc, then later call
> sk_setup_caps() (too late)
Hmm. Do you think that sk_setup_caps should not disable GSO later at all? Or
only that it should not influence already queued packets?
In the case of an ipsec tunnel GSO gets disabled later (for tcp connections)
because dst->header_len() is zero at first and non-zero later. tcp connections
initiated not to long afterwards start with GSO disabled from the beginning.
dst->header_len() being non zero for ipsec tunnels seems to be the case with
or without an adjustment via PMTU. It seems that routing only does not know it
from the beginning on (for incomimg connections) :-).
Why does a non zero dst->header_len() disable GSO? Is it just an optimization
or does GSO not work in that case? If dst->header_len() being non zero signals
that ipsec tunnel mode can't handle GSO at all then the problem would not lay
in the gso path.
There is one thing which indicates that there is also a problen in the GSO
path, though. As my tests demonstrated disabling GSO on the physical interface
serving the ipsec tunnel does not prevent the hangs. It does, though, if sk-
>sk_gso_max_segs = 1 in sk_setup_caps() if sk_can_gso(sk) returns false. This
was the first variation of the patch where sk->sk_gso_max_segs was not set if
dst->header_len() is non zero.
Concering qdisc: an ipsec tunnel doesn't have a "tunnel-device" so it does not
have qdisc? The ipsec tunnel transforms packets and the results are (in case
of ipsec esp tunnel) esp packets which cannot be resgmented. Only these
transformed packets are finally queued to a qdisc of a network device then.
Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.
2014-12-12 20:31 ` Wolfgang Walter
@ 2014-12-12 21:30 ` Thomas Jarosch
2014-12-12 22:31 ` Eric Dumazet
0 siblings, 1 reply; 40+ messages in thread
From: Thomas Jarosch @ 2014-12-12 21:30 UTC (permalink / raw)
To: Wolfgang Walter, Eric Dumazet
Cc: netdev, Eric Dumazet, Herbert Xu, Steffen Klassert
Wolfgang,
On 12/12/2014 09:31 PM, Wolfgang Walter wrote:
> There is one thing which indicates that there is also a problen in the GSO
> path, though. As my tests demonstrated disabling GSO on the physical interface
> serving the ipsec tunnel does not prevent the hangs. It does, though, if sk-
>> sk_gso_max_segs = 1 in sk_setup_caps() if sk_can_gso(sk) returns false. This
> was the first variation of the patch where sk->sk_gso_max_segs was not set if
> dst->header_len() is non zero.
if it's any help: Disabling TX checksumming prevents the hang
even on an unpatched 3.14.x kernel. You could check with your debug statements
in place the path of the packets with and without TX checksumming.
HTH,
Thomas
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.
2014-12-12 21:30 ` Thomas Jarosch
@ 2014-12-12 22:31 ` Eric Dumazet
2014-12-12 23:47 ` Wolfgang Walter
0 siblings, 1 reply; 40+ messages in thread
From: Eric Dumazet @ 2014-12-12 22:31 UTC (permalink / raw)
To: Thomas Jarosch
Cc: Wolfgang Walter, netdev, Eric Dumazet, Herbert Xu,
Steffen Klassert
On Fri, 2014-12-12 at 22:30 +0100, Thomas Jarosch wrote:
> if it's any help: Disabling TX checksumming prevents the hang
> even on an unpatched 3.14.x kernel. You could check with your debug statements
> in place the path of the packets with and without TX checksumming.
Disabling TX checksum automatically disables GSO.
Disabling TSO/GSO is the real 'cure' for the time being, you can keep TX
checksums.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.
2014-12-12 22:31 ` Eric Dumazet
@ 2014-12-12 23:47 ` Wolfgang Walter
2014-12-13 0:15 ` Eric Dumazet
0 siblings, 1 reply; 40+ messages in thread
From: Wolfgang Walter @ 2014-12-12 23:47 UTC (permalink / raw)
To: Eric Dumazet
Cc: Thomas Jarosch, netdev, Eric Dumazet, Herbert Xu,
Steffen Klassert
Am Freitag, 12. Dezember 2014, 14:31:49 schrieben Sie:
> On Fri, 2014-12-12 at 22:30 +0100, Thomas Jarosch wrote:
> > if it's any help: Disabling TX checksumming prevents the hang
> > even on an unpatched 3.14.x kernel. You could check with your debug
> > statements in place the path of the packets with and without TX
> > checksumming.
I can't disable it as the driver will not allow it:
# ethtool -K eth0 tx off
Cannot change tx-checksumming
Could not change any device features
> Disabling TX checksum automatically disables GSO.
>
> Disabling TSO/GSO is the real 'cure' for the time being, you can keep TX
> checksums.
This does not help here. With GSO disabled (for network device serving the
ipsec traffic, here e.g. eth0) the hang still occurs. tcp_set_skb_tso_segs()
gets called and the else-branch for skb->len <= mss_now is taken.
I tested this serveral times.
Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.
2014-12-12 23:47 ` Wolfgang Walter
@ 2014-12-13 0:15 ` Eric Dumazet
2014-12-13 0:43 ` Wolfgang Walter
2014-12-15 18:04 ` Wolfgang Walter
0 siblings, 2 replies; 40+ messages in thread
From: Eric Dumazet @ 2014-12-13 0:15 UTC (permalink / raw)
To: Wolfgang Walter
Cc: Thomas Jarosch, netdev, Eric Dumazet, Herbert Xu,
Steffen Klassert
On Sat, 2014-12-13 at 00:47 +0100, Wolfgang Walter wrote:
> I can't disable it as the driver will not allow it:
> # ethtool -K eth0 tx off
> Cannot change tx-checksumming
> Could not change any device features
Sounds a bug in itself :(
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.
2014-12-13 0:15 ` Eric Dumazet
@ 2014-12-13 0:43 ` Wolfgang Walter
2014-12-15 18:04 ` Wolfgang Walter
1 sibling, 0 replies; 40+ messages in thread
From: Wolfgang Walter @ 2014-12-13 0:43 UTC (permalink / raw)
To: Eric Dumazet
Cc: Thomas Jarosch, netdev, Eric Dumazet, Herbert Xu,
Steffen Klassert
Am Freitag, 12. Dezember 2014, 16:15:40 schrieb Eric Dumazet:
> On Sat, 2014-12-13 at 00:47 +0100, Wolfgang Walter wrote:
> > I can't disable it as the driver will not allow it:
> > # ethtool -K eth0 tx off
> > Cannot change tx-checksumming
> > Could not change any device features
>
> Sounds a bug in itself :(
>
Yes, that is what I meant in my email that there seems also to be bug in the
GSO path.
I don't understand how tcp_sendmsg() marks gso/non-gso skbs.
tcp_send_mss() calls tcp_xmit_size_goal() which calls tcp_xmit_size_goal() and
there is a check for GSO ( sk_can_gso(sk) ). If sk_can_gso() returns false
then tcp_xmit_size_goal() returns mss_now. gso_segs seems to be set to zero.
In in tcp_output.c gso_segs == 0 seems to allow GSO.
What if mss later shrinks? Couldn't it be that then tcp_set_skb_tso_segs()
gets called and then there the gso branch is taken?
Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.
2014-12-13 0:15 ` Eric Dumazet
2014-12-13 0:43 ` Wolfgang Walter
@ 2014-12-15 18:04 ` Wolfgang Walter
1 sibling, 0 replies; 40+ messages in thread
From: Wolfgang Walter @ 2014-12-15 18:04 UTC (permalink / raw)
To: Eric Dumazet
Cc: Thomas Jarosch, netdev, Eric Dumazet, Herbert Xu,
Steffen Klassert
Hello Eric!
Am Freitag, 12. Dezember 2014, 16:15:40 schrieb Eric Dumazet:
> On Sat, 2014-12-13 at 00:47 +0100, Wolfgang Walter wrote:
> > I can't disable it as the driver will not allow it:
> > # ethtool -K eth0 tx off
> > Cannot change tx-checksumming
> > Could not change any device features
>
> Sounds a bug in itself :(
I think the "gso disabled for interface" case is broken because:
tcp_sendmsg() sets tcp_gso_segs to zero
tcp_sendmsg() calls tcp_push_one() or __tcp_push_pending_frames()
tcp_push_one() and __tcp_push_pending_frames() call tcp_write_xmit()
tcp_write_xmit() call tcp_init_tso_segs()
tcp_init_tso_segs() calls tcp_set_skb_tso_segs() because !tso_segs is true
tcp_init_tso_segs() creates a gso-packet if packet size is larger than
mss_now.
I think tcp_init_tso_segs() assumed that tcp_init_tso_segs() checks if the
socket supports gso.
I didn't check the other callers of tcp_init_tso_segs().
Regards,
--
Wolfgang Walter
Studentenwerk München
Anstalt des öffentlichen Rechts
^ permalink raw reply [flat|nested] 40+ messages in thread
* tcp: Do not apply TSO segment limit to non-TSO packets
2014-12-01 10:25 ` Herbert Xu
2014-12-01 11:20 ` Thomas Jarosch
@ 2014-12-31 13:39 ` Herbert Xu
2014-12-31 13:42 ` Herbert Xu
` (3 more replies)
1 sibling, 4 replies; 40+ messages in thread
From: Herbert Xu @ 2014-12-31 13:39 UTC (permalink / raw)
To: Thomas Jarosch
Cc: netdev, edumazet, Steffen Klassert, Ben Hutchings,
David S. Miller
On Mon, Dec 01, 2014 at 06:25:22PM +0800, Herbert Xu wrote:
> Thomas Jarosch <thomas.jarosch@intra2net.com> wrote:
> >
> > When I revert it, even kernel v3.18-rc6 starts working.
> > But I doubt this is the root problem, may be just hiding another issue.
>
> Can you do a tcpdump with this patch reverted? I would like to
> see the size of the packets that are sent out vs. the ICMP message
> that came back.
Thanks for providing the data. Here is a patch that should fix
the problem.
-- >8 --
Thomas Jarosch reported IPsec TCP stalls when a PMTU event occurs.
In fact the problem was completely unrelated to IPsec. The bug is
also reproducible if you just disable TSO/GSO.
The problem is that when the MSS goes down, existing queued packet
on the TX queue that have not been transmitted yet all look like
TSO packets and get treated as such.
This then triggers a bug where tcp_mss_split_point tells us to
generate a zero-sized packet on the TX queue. Once that happens
we're screwed because the zero-sized packet can never be removed
by ACKs.
Fixes: 1485348d242 ("tcp: Apply device TSO segment limit earlier")
Reported-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 7f18262..65caf8b 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2019,7 +2019,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now)))
break;
- if (tso_segs == 1) {
+ if (tso_segs == 1 || !max_segs) {
if (unlikely(!tcp_nagle_test(tp, skb, mss_now,
(tcp_skb_is_last(sk, skb) ?
nonagle : TCP_NAGLE_PUSH))))
@@ -2032,7 +2032,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
}
limit = mss_now;
- if (tso_segs > 1 && !tcp_urg_mode(tp))
+ if (tso_segs > 1 && max_segs && !tcp_urg_mode(tp))
limit = tcp_mss_split_point(sk, skb, mss_now,
min_t(unsigned int,
cwnd_quota,
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: tcp: Do not apply TSO segment limit to non-TSO packets
2014-12-31 13:39 ` tcp: Do not apply TSO segment limit to non-TSO packets Herbert Xu
@ 2014-12-31 13:42 ` Herbert Xu
2015-01-02 18:24 ` Eric Dumazet
2015-01-02 21:13 ` David Miller
` (2 subsequent siblings)
3 siblings, 1 reply; 40+ messages in thread
From: Herbert Xu @ 2014-12-31 13:42 UTC (permalink / raw)
To: Thomas Jarosch
Cc: netdev, edumazet, Steffen Klassert, Ben Hutchings,
David S. Miller
On Thu, Jan 01, 2015 at 12:39:23AM +1100, Herbert Xu wrote:
>
> Thomas Jarosch reported IPsec TCP stalls when a PMTU event occurs.
>
> In fact the problem was completely unrelated to IPsec. The bug is
> also reproducible if you just disable TSO/GSO.
This raises two interesting questions.
Firstly not many people test non-TSO code paths anymore so bugs
are likely to persist for a long time there. Perhaps it's time
to remove the non-TSO code path altogether? The GSO code path
should provide enough speed-up in terms of boosting the effective
MTU to offset the cost of copying.
Secondly why are we dealing with hardware TSO segment limits
by limiting the size of the TSO packet in the TCP stack? Surely
in this case GSO is free since there won't be any copying?
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: tcp: Do not apply TSO segment limit to non-TSO packets
2014-12-31 13:42 ` Herbert Xu
@ 2015-01-02 18:24 ` Eric Dumazet
2015-01-02 20:36 ` David Miller
0 siblings, 1 reply; 40+ messages in thread
From: Eric Dumazet @ 2015-01-02 18:24 UTC (permalink / raw)
To: Herbert Xu
Cc: Thomas Jarosch, netdev, edumazet, Steffen Klassert, Ben Hutchings,
David S. Miller
On Thu, 2015-01-01 at 00:42 +1100, Herbert Xu wrote:
> On Thu, Jan 01, 2015 at 12:39:23AM +1100, Herbert Xu wrote:
> >
> > Thomas Jarosch reported IPsec TCP stalls when a PMTU event occurs.
> >
> > In fact the problem was completely unrelated to IPsec. The bug is
> > also reproducible if you just disable TSO/GSO.
>
> This raises two interesting questions.
>
> Firstly not many people test non-TSO code paths anymore so bugs
> are likely to persist for a long time there. Perhaps it's time
> to remove the non-TSO code path altogether? The GSO code path
> should provide enough speed-up in terms of boosting the effective
> MTU to offset the cost of copying.
> Secondly why are we dealing with hardware TSO segment limits
> by limiting the size of the TSO packet in the TCP stack? Surely
> in this case GSO is free since there won't be any copying?
It might depends on the device capabilities.
Non TSO/GSO path is known to be better for devices unable to perform TX
checksumming, as we compute the checksum at the time we copy data from
user to kernel (csum_and_copy_from_user() from tcp_sendmsg())).
With BQL+TSQ, having to compute the TX hash means bringing data into cpu
caches a second time right before ndo_start_xmit()
But maybe this gain is very relative in a full blown configuration, with
netfilter / complex qdisc being used.
Thanks Herbert !
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: tcp: Do not apply TSO segment limit to non-TSO packets
2015-01-02 18:24 ` Eric Dumazet
@ 2015-01-02 20:36 ` David Miller
2015-01-02 22:01 ` Herbert Xu
0 siblings, 1 reply; 40+ messages in thread
From: David Miller @ 2015-01-02 20:36 UTC (permalink / raw)
To: eric.dumazet
Cc: herbert, thomas.jarosch, netdev, edumazet, steffen.klassert,
bhutchings
From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 02 Jan 2015 10:24:00 -0800
> On Thu, 2015-01-01 at 00:42 +1100, Herbert Xu wrote:
>> Firstly not many people test non-TSO code paths anymore so bugs
>> are likely to persist for a long time there. Perhaps it's time
>> to remove the non-TSO code path altogether? The GSO code path
>> should provide enough speed-up in terms of boosting the effective
>> MTU to offset the cost of copying.
>
>> Secondly why are we dealing with hardware TSO segment limits
>> by limiting the size of the TSO packet in the TCP stack? Surely
>> in this case GSO is free since there won't be any copying?
>
> It might depends on the device capabilities.
>
> Non TSO/GSO path is known to be better for devices unable to perform TX
> checksumming, as we compute the checksum at the time we copy data from
> user to kernel (csum_and_copy_from_user() from tcp_sendmsg())).
Non-SG capable devices suffer in this scenerio as well.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: tcp: Do not apply TSO segment limit to non-TSO packets
2014-12-31 13:39 ` tcp: Do not apply TSO segment limit to non-TSO packets Herbert Xu
2014-12-31 13:42 ` Herbert Xu
@ 2015-01-02 21:13 ` David Miller
2015-01-16 10:45 ` Thomas Jarosch
2015-01-19 13:39 ` Thomas Jarosch
3 siblings, 0 replies; 40+ messages in thread
From: David Miller @ 2015-01-02 21:13 UTC (permalink / raw)
To: herbert; +Cc: thomas.jarosch, netdev, edumazet, steffen.klassert, bhutchings
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Thu, 1 Jan 2015 00:39:23 +1100
> Thomas Jarosch reported IPsec TCP stalls when a PMTU event occurs.
>
> In fact the problem was completely unrelated to IPsec. The bug is
> also reproducible if you just disable TSO/GSO.
>
> The problem is that when the MSS goes down, existing queued packet
> on the TX queue that have not been transmitted yet all look like
> TSO packets and get treated as such.
>
> This then triggers a bug where tcp_mss_split_point tells us to
> generate a zero-sized packet on the TX queue. Once that happens
> we're screwed because the zero-sized packet can never be removed
> by ACKs.
>
> Fixes: 1485348d242 ("tcp: Apply device TSO segment limit earlier")
> Reported-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Applied and queued up for -stable, thanks Herbert.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: tcp: Do not apply TSO segment limit to non-TSO packets
2015-01-02 20:36 ` David Miller
@ 2015-01-02 22:01 ` Herbert Xu
2015-01-02 22:06 ` David Miller
0 siblings, 1 reply; 40+ messages in thread
From: Herbert Xu @ 2015-01-02 22:01 UTC (permalink / raw)
To: David Miller
Cc: eric.dumazet, thomas.jarosch, netdev, edumazet, steffen.klassert,
bhutchings
On Fri, Jan 02, 2015 at 03:36:55PM -0500, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 02 Jan 2015 10:24:00 -0800
>
> > Non TSO/GSO path is known to be better for devices unable to perform TX
> > checksumming, as we compute the checksum at the time we copy data from
> > user to kernel (csum_and_copy_from_user() from tcp_sendmsg())).
>
> Non-SG capable devices suffer in this scenerio as well.
Yes I was referring to using GSO on non-SG/non-checksumming devices.
Anything that supports checksum/SG should obviously be using GSO.
IIRC when I first tested this GSO is basically on par for the non-SG
case as the overhead of the extra copying was offset by the benefit
of a larger MTU through the stack.
So has anyone actually observed worse performance with GSO on these
devices (you could even test on a GSO-capable device simply by
disabling checksumming)?
Also the fact that this bug took two years to surface means that
very few people are actually using non-GSO in the real world as
this bug is easily triggered by a PMTU event.
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: tcp: Do not apply TSO segment limit to non-TSO packets
2015-01-02 22:01 ` Herbert Xu
@ 2015-01-02 22:06 ` David Miller
2015-01-02 22:09 ` Herbert Xu
0 siblings, 1 reply; 40+ messages in thread
From: David Miller @ 2015-01-02 22:06 UTC (permalink / raw)
To: herbert
Cc: eric.dumazet, thomas.jarosch, netdev, edumazet, steffen.klassert,
bhutchings
From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sat, 3 Jan 2015 09:01:07 +1100
> So has anyone actually observed worse performance with GSO on these
> devices (you could even test on a GSO-capable device simply by
> disabling checksumming)?
Good question.
> Also the fact that this bug took two years to surface means that
> very few people are actually using non-GSO in the real world as
> this bug is easily triggered by a PMTU event.
I think the rarity of PMTU events on non-VPN'd connections plays
a part in how long it took as well.
But I totally accept your point, for sure.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: tcp: Do not apply TSO segment limit to non-TSO packets
2015-01-02 22:06 ` David Miller
@ 2015-01-02 22:09 ` Herbert Xu
0 siblings, 0 replies; 40+ messages in thread
From: Herbert Xu @ 2015-01-02 22:09 UTC (permalink / raw)
To: David Miller
Cc: eric.dumazet, thomas.jarosch, netdev, edumazet, steffen.klassert,
bhutchings
On Fri, Jan 02, 2015 at 05:06:29PM -0500, David Miller wrote:
>
> I think the rarity of PMTU events on non-VPN'd connections plays
> a part in how long it took as well.
Maybe in your neck of the woods but certainly in China I observe
loads of PMTU events without involving any VPNs at all :)
In fact even the fibre connections here use PPPOE so PMTU is
sort of unavoidable.
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: tcp: Do not apply TSO segment limit to non-TSO packets
2014-12-31 13:39 ` tcp: Do not apply TSO segment limit to non-TSO packets Herbert Xu
2014-12-31 13:42 ` Herbert Xu
2015-01-02 21:13 ` David Miller
@ 2015-01-16 10:45 ` Thomas Jarosch
2015-01-16 10:50 ` Herbert Xu
2015-01-19 13:39 ` Thomas Jarosch
3 siblings, 1 reply; 40+ messages in thread
From: Thomas Jarosch @ 2015-01-16 10:45 UTC (permalink / raw)
To: Herbert Xu
Cc: netdev, edumazet, Steffen Klassert, Ben Hutchings,
David S. Miller
On Thursday, 1. January 2015 00:39:23 Herbert Xu wrote:
> On Mon, Dec 01, 2014 at 06:25:22PM +0800, Herbert Xu wrote:
> > Thomas Jarosch <thomas.jarosch@intra2net.com> wrote:
> > > When I revert it, even kernel v3.18-rc6 starts working.
> > > But I doubt this is the root problem, may be just hiding another
> > > issue.
> >
> > Can you do a tcpdump with this patch reverted? I would like to
> > see the size of the packets that are sent out vs. the ICMP message
> > that came back.
>
> Thanks for providing the data. Here is a patch that should fix
> the problem.
Thanks for the fix, Herbert! I've verified the patch is working fine
and the tcpdump looks good, too. In fact the PMTU discovery
only takes 0.001s, you can barely notice it ;)
For backporting to -stable: Kernel 3.14 lacks tcp_tso_autosize().
So I've borrowed that from 3.19-rc4+ and also added the max_segs variable.
The final and tested code looks like this:
-- >8 --
Thomas Jarosch reported IPsec TCP stalls when a PMTU event occurs.
In fact the problem was completely unrelated to IPsec. The bug is
also reproducible if you just disable TSO/GSO.
The problem is that when the MSS goes down, existing queued packet
on the TX queue that have not been transmitted yet all look like
TSO packets and get treated as such.
This then triggers a bug where tcp_mss_split_point tells us to
generate a zero-sized packet on the TX queue. Once that happens
we're screwed because the zero-sized packet can never be removed
by ACKs.
Fixes: 1485348d242 ("tcp: Apply device TSO segment limit earlier")
Reported-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Thomas Jarosch <thomas.jarosch@intra2net.com>
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 17a11e6..a109032 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1432,6 +1432,27 @@ static bool tcp_nagle_check(bool partial, const struct tcp_sock *tp,
((nonagle & TCP_NAGLE_CORK) ||
(!nonagle && tp->packets_out && tcp_minshall_check(tp)));
}
+
+/* Return how many segs we'd like on a TSO packet,
+ * to send one TSO packet per ms
+ */
+static u32 tcp_tso_autosize(const struct sock *sk, unsigned int mss_now)
+{
+ u32 bytes, segs;
+
+ bytes = min(sk->sk_pacing_rate >> 10,
+ sk->sk_gso_max_size - 1 - MAX_TCP_HEADER);
+
+ /* Goal is to send at least one packet per ms,
+ * not one big TSO packet every 100 ms.
+ * This preserves ACK clocking and is consistent
+ * with tcp_tso_should_defer() heuristic.
+ */
+ segs = max_t(u32, bytes / mss_now, sysctl_tcp_min_tso_segs);
+
+ return min_t(u32, segs, sk->sk_gso_max_segs);
+}
+
/* Returns the portion of skb which can be sent right away */
static unsigned int tcp_mss_split_point(const struct sock *sk,
const struct sk_buff *skb,
@@ -1857,6 +1878,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
unsigned int tso_segs, sent_pkts;
int cwnd_quota;
int result;
+ u32 max_segs;
sent_pkts = 0;
@@ -1870,6 +1892,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
}
}
+ max_segs = tcp_tso_autosize(sk, mss_now);
while ((skb = tcp_send_head(sk))) {
unsigned int limit;
@@ -1891,7 +1914,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
if (unlikely(!tcp_snd_wnd_test(tp, skb, mss_now)))
break;
- if (tso_segs == 1) {
+ if (tso_segs == 1 || !max_segs) {
if (unlikely(!tcp_nagle_test(tp, skb, mss_now,
(tcp_skb_is_last(sk, skb) ?
nonagle : TCP_NAGLE_PUSH))))
@@ -1928,7 +1951,7 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
}
limit = mss_now;
- if (tso_segs > 1 && !tcp_urg_mode(tp))
+ if (tso_segs > 1 && max_segs && !tcp_urg_mode(tp))
limit = tcp_mss_split_point(sk, skb, mss_now,
min_t(unsigned int,
cwnd_quota,
--
1.9.3
^ permalink raw reply related [flat|nested] 40+ messages in thread
* Re: tcp: Do not apply TSO segment limit to non-TSO packets
2015-01-16 10:45 ` Thomas Jarosch
@ 2015-01-16 10:50 ` Herbert Xu
2015-01-16 11:03 ` Thomas Jarosch
0 siblings, 1 reply; 40+ messages in thread
From: Herbert Xu @ 2015-01-16 10:50 UTC (permalink / raw)
To: Thomas Jarosch
Cc: netdev, edumazet, Steffen Klassert, Ben Hutchings,
David S. Miller
On Fri, Jan 16, 2015 at 11:45:44AM +0100, Thomas Jarosch wrote:
>
> For backporting to -stable: Kernel 3.14 lacks tcp_tso_autosize().
> So I've borrowed that from 3.19-rc4+ and also added the max_segs variable.
> The final and tested code looks like this:
You don't need tcp_tso_autosize. Instead of testing max_segs just
test sk->sk_gso_max_segs.
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: tcp: Do not apply TSO segment limit to non-TSO packets
2015-01-16 10:50 ` Herbert Xu
@ 2015-01-16 11:03 ` Thomas Jarosch
0 siblings, 0 replies; 40+ messages in thread
From: Thomas Jarosch @ 2015-01-16 11:03 UTC (permalink / raw)
To: Herbert Xu
Cc: netdev, edumazet, Steffen Klassert, Ben Hutchings,
David S. Miller
On Friday, 16. January 2015 21:50:13 Herbert Xu wrote:
> On Fri, Jan 16, 2015 at 11:45:44AM +0100, Thomas Jarosch wrote:
> > For backporting to -stable: Kernel 3.14 lacks tcp_tso_autosize().
> > So I've borrowed that from 3.19-rc4+ and also added the max_segs
> > variable.
> > The final and tested code looks like this:
> You don't need tcp_tso_autosize. Instead of testing max_segs just
> test sk->sk_gso_max_segs.
splendid, even better. tcpdump looks good, too. Thanks!
Thomas
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: tcp: Do not apply TSO segment limit to non-TSO packets
2014-12-31 13:39 ` tcp: Do not apply TSO segment limit to non-TSO packets Herbert Xu
` (2 preceding siblings ...)
2015-01-16 10:45 ` Thomas Jarosch
@ 2015-01-19 13:39 ` Thomas Jarosch
[not found] ` <CANn89i+U-PFbuUrp08s3Ec8BmjPFq1zj8Aj2=vPVO4-iiLkTuw@mail.gmail.com>
3 siblings, 1 reply; 40+ messages in thread
From: Thomas Jarosch @ 2015-01-19 13:39 UTC (permalink / raw)
To: Herbert Xu
Cc: netdev, edumazet, Steffen Klassert, Ben Hutchings,
David S. Miller
Hi Herbert,
On Thursday, 1. January 2015 00:39:23 Herbert Xu wrote:
> The problem is that when the MSS goes down, existing queued packet
> on the TX queue that have not been transmitted yet all look like
> TSO packets and get treated as such.
>
> This then triggers a bug where tcp_mss_split_point tells us to
> generate a zero-sized packet on the TX queue. Once that happens
> we're screwed because the zero-sized packet can never be removed
> by ACKs.
picking up this one again: Is there any valid use case to have
zero-sized packets in the TX queue? If not, may be a WARN_ON() could
be added to the processing of the TX queue. That would help to
prevent future issues like this.
Cheers,
Thomas
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: tcp: Do not apply TSO segment limit to non-TSO packets
[not found] ` <CANn89i+U-PFbuUrp08s3Ec8BmjPFq1zj8Aj2=vPVO4-iiLkTuw@mail.gmail.com>
@ 2015-01-19 22:36 ` Herbert Xu
2015-01-19 22:38 ` Eric Dumazet
0 siblings, 1 reply; 40+ messages in thread
From: Herbert Xu @ 2015-01-19 22:36 UTC (permalink / raw)
To: Eric Dumazet
Cc: Thomas Jarosch, netdev, Steffen Klassert, Ben Hutchings,
David S. Miller
On Mon, Jan 19, 2015 at 09:19:48AM -0800, Eric Dumazet wrote:
> Zero sized packets are valid, just take a look at tcpdump ;)
But zero-sized packets in the send queue are not valid and will
not ever be cleaned, which is what triggered the original bug.
Although I'm not sure whether a WARN_ON will actually help since
it won't tell you why there is a zero-sized packet on the queue,
just the fact that there is. You'll know you've got a zero-sized
packet on the queue anyway by looking at the tcpdump.
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: tcp: Do not apply TSO segment limit to non-TSO packets
2015-01-19 22:36 ` Herbert Xu
@ 2015-01-19 22:38 ` Eric Dumazet
2015-01-19 22:40 ` Herbert Xu
0 siblings, 1 reply; 40+ messages in thread
From: Eric Dumazet @ 2015-01-19 22:38 UTC (permalink / raw)
To: Herbert Xu
Cc: Thomas Jarosch, netdev, Steffen Klassert, Ben Hutchings,
David S. Miller
On Mon, Jan 19, 2015 at 2:36 PM, Herbert Xu <herbert@gondor.apana.org.au> wrote:
> On Mon, Jan 19, 2015 at 09:19:48AM -0800, Eric Dumazet wrote:
>> Zero sized packets are valid, just take a look at tcpdump ;)
>
> But zero-sized packets in the send queue are not valid and will
> not ever be cleaned, which is what triggered the original bug.
SYN and FIN packets are in the send queue.
They are valid and might have a zero size.
^ permalink raw reply [flat|nested] 40+ messages in thread
* Re: tcp: Do not apply TSO segment limit to non-TSO packets
2015-01-19 22:38 ` Eric Dumazet
@ 2015-01-19 22:40 ` Herbert Xu
0 siblings, 0 replies; 40+ messages in thread
From: Herbert Xu @ 2015-01-19 22:40 UTC (permalink / raw)
To: Eric Dumazet
Cc: Thomas Jarosch, netdev, Steffen Klassert, Ben Hutchings,
David S. Miller
On Mon, Jan 19, 2015 at 02:38:47PM -0800, Eric Dumazet wrote:
>
> SYN and FIN packets are in the send queue.
>
> They are valid and might have a zero size.
Right, obviously I meant packets in between the SYN and the FIN.
Technically SYN/FIN are not really zero-sided since they carry
one bit of information and yes they do get cleaned while the ones
in the middle will not.
Cheers,
--
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
^ permalink raw reply [flat|nested] 40+ messages in thread
end of thread, other threads:[~2015-01-19 22:40 UTC | newest]
Thread overview: 40+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-11-29 11:44 [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.12+ Thomas Jarosch
2014-12-01 10:25 ` Herbert Xu
2014-12-01 11:20 ` Thomas Jarosch
2014-12-31 13:39 ` tcp: Do not apply TSO segment limit to non-TSO packets Herbert Xu
2014-12-31 13:42 ` Herbert Xu
2015-01-02 18:24 ` Eric Dumazet
2015-01-02 20:36 ` David Miller
2015-01-02 22:01 ` Herbert Xu
2015-01-02 22:06 ` David Miller
2015-01-02 22:09 ` Herbert Xu
2015-01-02 21:13 ` David Miller
2015-01-16 10:45 ` Thomas Jarosch
2015-01-16 10:50 ` Herbert Xu
2015-01-16 11:03 ` Thomas Jarosch
2015-01-19 13:39 ` Thomas Jarosch
[not found] ` <CANn89i+U-PFbuUrp08s3Ec8BmjPFq1zj8Aj2=vPVO4-iiLkTuw@mail.gmail.com>
2015-01-19 22:36 ` Herbert Xu
2015-01-19 22:38 ` Eric Dumazet
2015-01-19 22:40 ` Herbert Xu
2014-12-01 13:17 ` [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3.12+ Wolfgang Walter
2014-12-01 16:41 ` Wolfgang Walter
2014-12-05 12:09 ` Wolfgang Walter
2014-12-05 13:26 ` Eric Dumazet
2014-12-08 22:20 ` Wolfgang Walter
2014-12-09 8:54 ` Thomas Jarosch
2014-12-09 14:26 ` [bisected] xfrm: TCP connection initiating PMTU discovery stalls on v3 Eric Dumazet
2014-12-09 14:49 ` Thomas Jarosch
2014-12-09 20:36 ` Wolfgang Walter
2014-12-09 21:40 ` Eric Dumazet
2014-12-10 18:34 ` Wolfgang Walter
2014-12-10 19:10 ` Eric Dumazet
2014-12-11 0:36 ` Wolfgang Walter
2014-12-12 16:58 ` Wolfgang Walter
2014-12-12 17:27 ` Eric Dumazet
2014-12-12 20:31 ` Wolfgang Walter
2014-12-12 21:30 ` Thomas Jarosch
2014-12-12 22:31 ` Eric Dumazet
2014-12-12 23:47 ` Wolfgang Walter
2014-12-13 0:15 ` Eric Dumazet
2014-12-13 0:43 ` Wolfgang Walter
2014-12-15 18:04 ` Wolfgang Walter
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).