[RFC] ipv6: dst_allfrag() not taken into account by TCP

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [RFC] ipv6: dst_allfrag() not taken into account by TCP
@ 2012-01-17 16:28 Eric Dumazet
  2012-01-17 17:34 ` David Miller
  2012-01-17 20:03 ` Tore Anderson
  0 siblings, 2 replies; 20+ messages in thread
From: Eric Dumazet @ 2012-01-17 16:28 UTC (permalink / raw)
  To: netdev; +Cc: tore

Bugzilla reference :

https://bugzilla.kernel.org/show_bug.cgi?id=42572

> An IPv4 client behind a link with a MTU of 1259 downloading a file from an IPv6
> server
> 
> When RTAX_FEATURE_ALLFRAG is set on a route, the effective TCP segment
> size does not take into account the size of the IPv6 Fragmentation
> header that needs to be included in outbound packets, causing every
> transmitted TCP segment to be fragmented across two IPv6 packets, the
> latter of which will only contain 8 bytes of actual payload.
> 
> RTAX_FEATURE_ALLFRAG is typically set on a route in response to
> receving a ICMPv6 Packet Too Big message indicating a Path MTU of less
> than 1280 bytes. 1280 bytes is the minimum IPv6 MTU, however ICMPv6
> PTBs with MTU < 1280 are still valid, in particular when an IPv6
> packet is sent to an IPv4 destination through a stateless translator.
> Any ICMPv4 Need To Fragment packets originated from the IPv4 part of
> the path will be translated to ICMPv6 PTB which may then indicate an
> MTU of less than 1280.
> 
> RFC 2460 section 5 specifies what an IPv6 stack should do when this
> happens:
> 
> > In response to an IPv6 packet that is sent to an IPv4 destination
> > (i.e., a packet that undergoes translation from IPv6 to IPv4), the
> > originating IPv6 node may receive an ICMP Packet Too Big message
> > reporting a Next-Hop MTU less than 1280.  In that case, the IPv6 node
> > is not required to reduce the size of subsequent packets to less than
> > 1280, but must include a Fragment header in those packets so that the
> > IPv6-to-IPv4 translating router can obtain a suitable Identification
> > value to use in resulting IPv4 fragments.  Note that this means the
> > payload may have to be reduced to 1232 octets (1280 minus 40 for the
> > IPv6 header and 8 for the Fragment header), and smaller still if
> > additional extension headers are used.
> 
> The Linux kernel refuses to reduce the effective MTU to anything below
> 1280 bytes, instead it sets it to exactly 1280 bytes, and
> RTAX_FEATURE_ALLFRAG is also set. However, the TCP segment size appears
> to be set to 1240 bytes (1280 Path MTU - 40 bytes of IPv6 header),
> instead of 1232 (additionally taking into account the 8 bytes required
> by the IPv6 Fragmentation extension header).
> 
> This in turn results in rather inefficient transmission, as every 
> transmitted TCP segment now is split in two fragments containing
> 1232+8 bytes of payload.
> 
> I am attaching a tcpdump that shows this happening. In this case,
> 2a02:c0::46:0:57ee:3d82 is an IPv6-only server running Linux 3.2.0,
> while 2a02:c0::46:0:57ee:2a2a really is 87.238.42.42, a NAT device with
> an IPv4 node behind it. The link between the NAT device and the IPv4
> node has a MTU of 1259. Somewhere between the NAT device and the server
> there's a stateless IPv4/IPv6 translator. When the server sends its
> first full-sized (1500 bytes) packets, the NAT device responds with
> a ICMPv4 Need To Fragment (MTU=1259) which are then received by the
> server in its translated for (ICMPv6 PTB, MTU 1279). After that a
> large number of these mini-fragments containing only 8 bytes of 
> payload are transmitted. They should have been avoided.
> 
> Tore


It seems that dst_allfrag() will force us to use ip6_fragment() and
reduce effective MSS to :

MTU - sizeof(ipv6hdr) - sizeof(frag_hdr) - sizeof(tcphdr)

(not counting TCP options)

But tcp_mtu_to_mss() doesnt take into account dst_allfrag() and computed
TCP MSS might be 8 bytes too big ? (ie sizeof(struct frag_hdr))

For MTU = 1280, we endup with MSS=1240 instead of 1232

/* Calculate MSS. Not accounting for SACKs here.  */
int tcp_mtu_to_mss(const struct sock *sk, int pmtu)
{
        const struct tcp_sock *tp = tcp_sk(sk);
        const struct inet_connection_sock *icsk = inet_csk(sk);
        int mss_now;

        /* Calculate base mss without TCP options:
           It is MMS_S - sizeof(tcphdr) of rfc1122
         */
        mss_now = pmtu - icsk->icsk_af_ops->net_header_len - sizeof(struct tcphdr);

        /* Clamp it (mss_clamp does not include tcp options) */
        if (mss_now > tp->rx_opt.mss_clamp)
                mss_now = tp->rx_opt.mss_clamp;

        /* Now subtract optional transport overhead */
        mss_now -= icsk->icsk_ext_hdr_len;

        /* Then reserve room for full set of TCP options and 8 bytes of data */
        if (mss_now < 48)
                mss_now = 48;

        /* Now subtract TCP options size, not including SACKs */
        mss_now -= tp->tcp_header_len - sizeof(struct tcphdr);

        return mss_now;
}

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] ipv6: dst_allfrag() not taken into account by TCP
  2012-01-17 16:28 [RFC] ipv6: dst_allfrag() not taken into account by TCP Eric Dumazet
@ 2012-01-17 17:34 ` David Miller
  2012-01-17 18:15   ` Eric Dumazet
  2012-01-17 20:03 ` Tore Anderson
  1 sibling, 1 reply; 20+ messages in thread
From: David Miller @ 2012-01-17 17:34 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, tore

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 17 Jan 2012 17:28:19 +0100

> It seems that dst_allfrag() will force us to use ip6_fragment() and
> reduce effective MSS to :

Don't use TCP with forced fragmentation.

Use path MTU discovery or an artificially reduced route MTU on paths
where ICMP blocking is causing problems.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] ipv6: dst_allfrag() not taken into account by TCP
  2012-01-17 17:34 ` David Miller
@ 2012-01-17 18:15   ` Eric Dumazet
  2012-01-17 18:25     ` David Miller
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2012-01-17 18:15 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, tore

Le mardi 17 janvier 2012 à 12:34 -0500, David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Tue, 17 Jan 2012 17:28:19 +0100
> 
> > It seems that dst_allfrag() will force us to use ip6_fragment() and
> > reduce effective MSS to :
> 
> Don't use TCP with forced fragmentation.
> 
> Use path MTU discovery or an artificially reduced route MTU on paths
> where ICMP blocking is causing problems.

We receive a correct ICMP message and we process it correctly.

It should work without fragments, if we hadnt a bug, when we hit the
minimum IPv6 MTU (1280) ?

Problem is we build too large TCP frames (by 8 bytes) and they must be
fragmented in ip6_finish_output(), wasting cpu and network bandwidth.

We do the following in tcp_mtu_to_mss()

mss_now = pmtu - icsk->icsk_af_ops->net_header_len - sizeof(struct tcphdr);

(net_header_len being a constant value : sizeof(struct ipv6hdr) for IPv6)

For IPv6, effective network header len would be either :

1)	sizeof(struct ipv6hdr) 
2)	sizeof(struct ipv6hdr) + sizeof(frag_hdr)

We could change net_header_len to be a method, and for ipv6 do :

static unsigned int ipv6_net_header_len(struct sock *sk)
{
	const struct dst_entry *dst = __sk_dst_get(sk);

	return sizeof(struct ipv6hdr) +
		(dst && dst_allfrag(dst)) ? sizeof(struct frag_hdr) : 0; 
}

/* for ipv4 : */
static unsigned int ipv4_net_header_len(struct sock *sk)
{
	return sizeof(struct iphdr);
}

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] ipv6: dst_allfrag() not taken into account by TCP
  2012-01-17 18:15   ` Eric Dumazet
@ 2012-01-17 18:25     ` David Miller
  0 siblings, 0 replies; 20+ messages in thread
From: David Miller @ 2012-01-17 18:25 UTC (permalink / raw)
  To: eric.dumazet; +Cc: netdev, tore

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Tue, 17 Jan 2012 19:15:11 +0100

> We could change net_header_len to be a method, and for ipv6 do :

Or add a icsk_frag_hdr_len (or something in proto ops) and do the
dst_allfrag() test inline.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] ipv6: dst_allfrag() not taken into account by TCP
  2012-01-17 16:28 [RFC] ipv6: dst_allfrag() not taken into account by TCP Eric Dumazet
  2012-01-17 17:34 ` David Miller
@ 2012-01-17 20:03 ` Tore Anderson
  2012-01-17 20:25   ` Eric Dumazet
  2012-01-17 23:43   ` Bugzilla 42595 Eric Dumazet
  1 sibling, 2 replies; 20+ messages in thread
From: Tore Anderson @ 2012-01-17 20:03 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

* Eric Dumazet

> Bugzilla reference :
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=42572

Hi, and thanks for taking an interest in this issue!

I've got some general comments regarding running IPv6-only Linux servers
behind stateless IPv4/IPv6 translators. (They are not strictly related
to the above bug, but not completely off-topic either I hope.)

1) The Linux kernel doesn't allow reducing the effective IPv6 link MTU
(as recorded in the routing cache) to anything less than 1280. This
means that it can end up in a situation where the effective IPv6 link
MTU is greater than the actual IPv6 Path MTU. In the PCAP in the
bugzilla, they are 1280 and 1279, respectively. However, the kernel
doesn't appear to record the actual Path MTU anywhere, instead setting
the allfrag feature.

While this is perfectly legal behaviour according to the RFC, from an
operational point of view it would have been nice if there were some way
(e.g. a sysctl) to tell the kernel to also actually allow an ICMPv6 PTB
to reduce the effective IPv6 link MTU to values less than 1280 (at least
down to the minimum IPv4 MTU + 20 bytes). That would have avoided the
need for the allfrag feature to come into play completely.

The RFC allows for this behaviour, too.

2) Since the kernel doesn't keep track of the actual Path MTU (if it's
lower than 1280), when the allfrag feature gets set on a route, *every*
packet gets a fragmentation header. (Which is to be expected, really,
given it's name.) However, this means that even tiny packets such as a
TCP SYN/ACK gets the fragmentation header added. This is clearly not
particularly useful.

If the kernel had kept track of the effective Path MTU, and only
included the IPv6 Fragmentation header on packets that were larger than
it *only*, this wouldn't have been a problem. (Alternatively, if it
allowed the effective link MTU to drop below 1280 that would also have
avoided this problem.)

3) There seems to be a bug related to generating the TCP checksum of
SYN/ACK packets to destinations with the allfrag features set. I just
submitted a bug report about this:

https://bugzilla.kernel.org/show_bug.cgi?id=42595

This makes the allfrag feature pretty much useless for me, as I can only
successfully establish a single TCP session from a client behind a <1280
MTU link for the entire lifetime of the routing cache entry.

Best regards,
-- 
Tore Anderson

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] ipv6: dst_allfrag() not taken into account by TCP
  2012-01-17 20:03 ` Tore Anderson
@ 2012-01-17 20:25   ` Eric Dumazet
  2012-01-18 12:42     ` Tore Anderson
  2012-01-17 23:43   ` Bugzilla 42595 Eric Dumazet
  1 sibling, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2012-01-17 20:25 UTC (permalink / raw)
  To: Tore Anderson; +Cc: netdev

Le mardi 17 janvier 2012 à 21:03 +0100, Tore Anderson a écrit :
> * Eric Dumazet
> 
> > Bugzilla reference :
> > 
> > https://bugzilla.kernel.org/show_bug.cgi?id=42572
> 
> Hi, and thanks for taking an interest in this issue!
> 
> I've got some general comments regarding running IPv6-only Linux servers
> behind stateless IPv4/IPv6 translators. (They are not strictly related
> to the above bug, but not completely off-topic either I hope.)
> 
> 1) The Linux kernel doesn't allow reducing the effective IPv6 link MTU
> (as recorded in the routing cache) to anything less than 1280. This
> means that it can end up in a situation where the effective IPv6 link
> MTU is greater than the actual IPv6 Path MTU. In the PCAP in the
> bugzilla, they are 1280 and 1279, respectively. However, the kernel
> doesn't appear to record the actual Path MTU anywhere, instead setting
> the allfrag feature.
> 
> While this is perfectly legal behaviour according to the RFC, from an
> operational point of view it would have been nice if there were some way
> (e.g. a sysctl) to tell the kernel to also actually allow an ICMPv6 PTB
> to reduce the effective IPv6 link MTU to values less than 1280 (at least
> down to the minimum IPv4 MTU + 20 bytes). That would have avoided the
> need for the allfrag feature to come into play completely.
> 
> The RFC allows for this behaviour, too.
> 
> 2) Since the kernel doesn't keep track of the actual Path MTU (if it's
> lower than 1280), when the allfrag feature gets set on a route, *every*
> packet gets a fragmentation header. (Which is to be expected, really,
> given it's name.) However, this means that even tiny packets such as a
> TCP SYN/ACK gets the fragmentation header added. This is clearly not
> particularly useful.
> 
> If the kernel had kept track of the effective Path MTU, and only
> included the IPv6 Fragmentation header on packets that were larger than
> it *only*, this wouldn't have been a problem. (Alternatively, if it
> allowed the effective link MTU to drop below 1280 that would also have
> avoided this problem.)
> 
> 3) There seems to be a bug related to generating the TCP checksum of
> SYN/ACK packets to destinations with the allfrag features set. I just
> submitted a bug report about this:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=42595
> 
> This makes the allfrag feature pretty much useless for me, as I can only
> successfully establish a single TCP session from a client behind a <1280
> MTU link for the entire lifetime of the routing cache entry.
> 
> Best regards,

Thanks Tore, we'll take a look at this second problem.

Could you please test following patch ?

It should apply on current 'net' or 'linux' tree

 include/net/inet_connection_sock.h |    1 +
 include/net/tcp.h                  |    4 ++--
 net/ipv4/tcp_output.c              |   19 +++++++++++++++++--
 net/ipv6/tcp_ipv6.c                |    1 +
 net/sctp/ipv6.c                    |    1 +
 5 files changed, 22 insertions(+), 4 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index dbf9aab..8297691 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -45,6 +45,7 @@ struct inet_connection_sock_af_ops {
 				      struct dst_entry *dst);
 	struct inet_peer *(*get_peer)(struct sock *sk, bool *release_it);
 	u16	    net_header_len;
+	u16	    net_frag_header_len;
 	u16	    sockaddr_len;
 	int	    (*setsockopt)(struct sock *sk, int level, int optname, 
 				  char __user *optval, unsigned int optlen);
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 0118ea9..86e5ef4 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -531,8 +531,8 @@ extern int tcp_read_sock(struct sock *sk, read_descriptor_t *desc,
 
 extern void tcp_initialize_rcv_mss(struct sock *sk);
 
-extern int tcp_mtu_to_mss(const struct sock *sk, int pmtu);
-extern int tcp_mss_to_mtu(const struct sock *sk, int mss);
+extern int tcp_mtu_to_mss(struct sock *sk, int pmtu);
+extern int tcp_mss_to_mtu(struct sock *sk, int mss);
 extern void tcp_mtup_init(struct sock *sk);
 extern void tcp_valid_rtt_meas(struct sock *sk, u32 seq_rtt);
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 8c8de27..5d27b68 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1151,7 +1151,7 @@ int tcp_trim_head(struct sock *sk, struct sk_buff *skb, u32 len)
 }
 
 /* Calculate MSS. Not accounting for SACKs here.  */
-int tcp_mtu_to_mss(const struct sock *sk, int pmtu)
+int tcp_mtu_to_mss(struct sock *sk, int pmtu)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
 	const struct inet_connection_sock *icsk = inet_csk(sk);
@@ -1162,6 +1162,14 @@ int tcp_mtu_to_mss(const struct sock *sk, int pmtu)
 	 */
 	mss_now = pmtu - icsk->icsk_af_ops->net_header_len - sizeof(struct tcphdr);
 
+	/* IPv6 adds a frag_hdr in case RTAX_FEATURE_ALLFRAG is set */
+	if (icsk->icsk_af_ops->net_frag_header_len) {
+		const struct dst_entry *dst = __sk_dst_get(sk);
+
+		if (dst && dst_allfrag(dst))
+			mss_now -= icsk->icsk_af_ops->net_frag_header_len;
+	}
+
 	/* Clamp it (mss_clamp does not include tcp options) */
 	if (mss_now > tp->rx_opt.mss_clamp)
 		mss_now = tp->rx_opt.mss_clamp;
@@ -1180,7 +1188,7 @@ int tcp_mtu_to_mss(const struct sock *sk, int pmtu)
 }
 
 /* Inverse of above */
-int tcp_mss_to_mtu(const struct sock *sk, int mss)
+int tcp_mss_to_mtu(struct sock *sk, int mss)
 {
 	const struct tcp_sock *tp = tcp_sk(sk);
 	const struct inet_connection_sock *icsk = inet_csk(sk);
@@ -1191,6 +1199,13 @@ int tcp_mss_to_mtu(const struct sock *sk, int mss)
 	      icsk->icsk_ext_hdr_len +
 	      icsk->icsk_af_ops->net_header_len;
 
+	/* IPv6 adds a frag_hdr in case RTAX_FEATURE_ALLFRAG is set */
+	if (icsk->icsk_af_ops->net_frag_header_len) {
+		const struct dst_entry *dst = __sk_dst_get(sk);
+
+		if (dst && dst_allfrag(dst))
+			mtu += icsk->icsk_af_ops->net_frag_header_len;
+	}
 	return mtu;
 }
 
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 906c7ca..d90c007 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1883,6 +1883,7 @@ static const struct inet_connection_sock_af_ops ipv6_specific = {
 	.syn_recv_sock	   = tcp_v6_syn_recv_sock,
 	.get_peer	   = tcp_v6_get_peer,
 	.net_header_len	   = sizeof(struct ipv6hdr),
+	.net_frag_header_len = sizeof(struct frag_hdr),
 	.setsockopt	   = ipv6_setsockopt,
 	.getsockopt	   = ipv6_getsockopt,
 	.addr2sockaddr	   = inet6_csk_addr2sockaddr,
diff --git a/net/sctp/ipv6.c b/net/sctp/ipv6.c
index 91f4791..13174e5 100644
--- a/net/sctp/ipv6.c
+++ b/net/sctp/ipv6.c
@@ -1000,6 +1000,7 @@ static struct sctp_af sctp_af_inet6 = {
 	.seq_dump_addr	   = sctp_v6_seq_dump_addr,
 	.ecn_capable	   = sctp_v6_ecn_capable,
 	.net_header_len	   = sizeof(struct ipv6hdr),
+	.net_frag_header_len = sizeof(struct frag_hdr),
 	.sockaddr_len	   = sizeof(struct sockaddr_in6),
 #ifdef CONFIG_COMPAT
 	.compat_setsockopt = compat_ipv6_setsockopt,

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Bugzilla 42595
  2012-01-17 20:03 ` Tore Anderson
  2012-01-17 20:25   ` Eric Dumazet
@ 2012-01-17 23:43   ` Eric Dumazet
  2012-01-18 10:58     ` Eric Dumazet
  1 sibling, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2012-01-17 23:43 UTC (permalink / raw)
  To: Tore Anderson; +Cc: netdev

Le mardi 17 janvier 2012 à 21:03 +0100, Tore Anderson a écrit :

> 
> 3) There seems to be a bug related to generating the TCP checksum of
> SYN/ACK packets to destinations with the allfrag features set. I just
> submitted a bug report about this:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=42595
> 
> This makes the allfrag feature pretty much useless for me, as I can only
> successfully establish a single TCP session from a client behind a <1280
> MTU link for the entire lifetime of the routing cache entry.
> 

It seems we dont handle skb ip_summed CHECKSUM_PARTIAL 
in ip6_fragment() slow_path

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Bugzilla 42595
  2012-01-17 23:43   ` Bugzilla 42595 Eric Dumazet
@ 2012-01-18 10:58     ` Eric Dumazet
  2012-01-18 13:44       ` Eric Dumazet
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2012-01-18 10:58 UTC (permalink / raw)
  To: Tore Anderson; +Cc: netdev

Le mercredi 18 janvier 2012 à 00:43 +0100, Eric Dumazet a écrit :
> Le mardi 17 janvier 2012 à 21:03 +0100, Tore Anderson a écrit :
> 
> > 
> > 3) There seems to be a bug related to generating the TCP checksum of
> > SYN/ACK packets to destinations with the allfrag features set. I just
> > submitted a bug report about this:
> > 
> > https://bugzilla.kernel.org/show_bug.cgi?id=42595
> > 
> > This makes the allfrag feature pretty much useless for me, as I can only
> > successfully establish a single TCP session from a client behind a <1280
> > MTU link for the entire lifetime of the routing cache entry.
> > 
> 
> It seems we dont handle skb ip_summed CHECKSUM_PARTIAL 
> in ip6_fragment() slow_path


and many drivers dont handle the presence of a frag header.

if (ipv6_hdr(skb)->nexthdr == IPPROTO_TCP)

I suspect we'll have to compute the checksum in software...

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] ipv6: dst_allfrag() not taken into account by TCP
  2012-01-17 20:25   ` Eric Dumazet
@ 2012-01-18 12:42     ` Tore Anderson
  2012-01-18 14:06       ` Eric Dumazet
  0 siblings, 1 reply; 20+ messages in thread
From: Tore Anderson @ 2012-01-18 12:42 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

* Eric Dumazet

> Could you please test following patch ?

Hi,

It seems to work fine, now all the outgoing packets that includes a
Fragmentation header all are "atomic" or "non-fragmented" fragments,
i.e., that they both have Offset=0 and More Fragments=0.

I'm uploaded a tcpdump showing the traffic with the patch here:

https://bugzilla.kernel.org/attachment.cgi?id=72106

Best regards,
-- 
Tore Anderson

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Bugzilla 42595
  2012-01-18 10:58     ` Eric Dumazet
@ 2012-01-18 13:44       ` Eric Dumazet
  2012-01-18 14:20         ` Tore Anderson
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2012-01-18 13:44 UTC (permalink / raw)
  To: Tore Anderson; +Cc: netdev

Le mercredi 18 janvier 2012 à 11:58 +0100, Eric Dumazet a écrit :
> Le mercredi 18 janvier 2012 à 00:43 +0100, Eric Dumazet a écrit :
> > Le mardi 17 janvier 2012 à 21:03 +0100, Tore Anderson a écrit :
> > 
> > > 
> > > 3) There seems to be a bug related to generating the TCP checksum of
> > > SYN/ACK packets to destinations with the allfrag features set. I just
> > > submitted a bug report about this:
> > > 
> > > https://bugzilla.kernel.org/show_bug.cgi?id=42595
> > > 
> > > This makes the allfrag feature pretty much useless for me, as I can only
> > > successfully establish a single TCP session from a client behind a <1280
> > > MTU link for the entire lifetime of the routing cache entry.
> > > 
> > 
> > It seems we dont handle skb ip_summed CHECKSUM_PARTIAL 
> > in ip6_fragment() slow_path
> 
> 
> and many drivers dont handle the presence of a frag header.
> 
> if (ipv6_hdr(skb)->nexthdr == IPPROTO_TCP)
> 
> I suspect we'll have to compute the checksum in software...
> 

Following patch seems to help for me, please test it ;)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index d97e071..92ea301 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -791,6 +791,10 @@ slow_path_clean:
 	}
 
 slow_path:
+	if ((skb->ip_summed == CHECKSUM_PARTIAL) &&
+	    skb_checksum_help(skb))
+		goto fail;
+
 	left = skb->len - hlen;		/* Space per frame */
 	ptr = hlen;			/* Where to start from */
 

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [RFC] ipv6: dst_allfrag() not taken into account by TCP
  2012-01-18 12:42     ` Tore Anderson
@ 2012-01-18 14:06       ` Eric Dumazet
  2012-01-18 14:43         ` Tore Anderson
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2012-01-18 14:06 UTC (permalink / raw)
  To: Tore Anderson; +Cc: netdev

Le mercredi 18 janvier 2012 à 13:42 +0100, Tore Anderson a écrit :
> * Eric Dumazet
> 
> > Could you please test following patch ?
> 
> Hi,
> 
> It seems to work fine, now all the outgoing packets that includes a
> Fragmentation header all are "atomic" or "non-fragmented" fragments,
> i.e., that they both have Offset=0 and More Fragments=0.
> 
> I'm uploaded a tcpdump showing the traffic with the patch here:
> 
> https://bugzilla.kernel.org/attachment.cgi?id=72106
> 
> Best regards,


Seems good, thanks !

Now I wonder if some RFC can allow us to send small frames without a
frag header, since this frag header forces us to perform tx checksum in
software. 

I guess 576 should be safe ?

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index d97e071..14cb938 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -153,7 +153,7 @@ static int ip6_finish_output2(struct sk_buff *skb)
 static int ip6_finish_output(struct sk_buff *skb)
 {
 	if ((skb->len > ip6_skb_dst_mtu(skb) && !skb_is_gso(skb)) ||
-	    dst_allfrag(skb_dst(skb)))
+	    (skb->len > 576 && dst_allfrag(skb_dst(skb))))
 		return ip6_fragment(skb, ip6_finish_output2);
 	else
 		return ip6_finish_output2(skb);

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: Bugzilla 42595
  2012-01-18 13:44       ` Eric Dumazet
@ 2012-01-18 14:20         ` Tore Anderson
  2012-01-18 14:42           ` Eric Dumazet
  0 siblings, 1 reply; 20+ messages in thread
From: Tore Anderson @ 2012-01-18 14:20 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev


* Eric Dumazet

> Following patch seems to help for me, please test it ;)

Looks very good! I uploaded a PCAP to the bug report:

https://bugzilla.kernel.org/show_bug.cgi?id=42595#c7

-- 
Tore Anderson

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Bugzilla 42595
  2012-01-18 14:20         ` Tore Anderson
@ 2012-01-18 14:42           ` Eric Dumazet
  2012-01-18 15:42             ` Eric Dumazet
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2012-01-18 14:42 UTC (permalink / raw)
  To: Tore Anderson; +Cc: netdev

Le mercredi 18 janvier 2012 à 15:20 +0100, Tore Anderson a écrit :
> * Eric Dumazet
> 
> > Following patch seems to help for me, please test it ;)
> 
> Looks very good! I uploaded a PCAP to the bug report:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=42595#c7
> 

Thanks

It seems that if you try to have the default ethtool settings (tso, gso,
tx checksum ... on), transfert is very slow... I am investigating, but I
suspect that if dst_allfrag() is true, we must disable gso as well.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] ipv6: dst_allfrag() not taken into account by TCP
  2012-01-18 14:06       ` Eric Dumazet
@ 2012-01-18 14:43         ` Tore Anderson
  2012-01-18 14:59           ` Eric Dumazet
  0 siblings, 1 reply; 20+ messages in thread
From: Tore Anderson @ 2012-01-18 14:43 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

* Eric Dumazet

> Now I wonder if some RFC can allow us to send small frames without a
> frag header, since this frag header forces us to perform tx checksum in
> software.

Yes. In my opinion, the most correct thing to do would be to keep track
of the actual IPv6 Path MTU (the value indicated in the ICMPv6 Packet
Too Big message received) as a separate property that is added to the
routing cache, and to include the Fragment header only if you transmit
a packet that is larger than the PMTU.

So, let's say you get an ICMPv6 PTB with MTU=800. Then the resulting
host route added to the routing cache should contain both "mtu 1280"
as well as "pmtu 800". You can then safely omit the Fragmentation
header from any packet smaller than 800 bytes. You might of course then
get another ICMPv6 PTB with an even smaller MTU originated from an IPv4
router even further out in the path, which should then result in
lowering the cached pmtu value further, of course.

> > I guess 576 should be safe ?

I have difficulty determining whether or not the minimum permitted IPv4
MTU is 576 or 68 bytes, actually. Assuming it's the latter, by hard-
coding the former, you may end up in a situation where you'll be stuck
in a PMTUD loop - you'll be re-transmitting 576 bytes large packets
over and over again, while getting back ICMPv6 PTBs with an even lower
MTU value for each of the packets you transmit.

That said, you can add 20 bytes to whatever value you hard-code, as the
IPv6 header is 20 bytes larger than the IPv4 header. In other words,
596- and 88-byte large IPv6 packets will be translated into 576- or
68-byte large IPv4 packets.

Another way of solving it and avoiding the Fragmentation headers
altogether would be to allow the "route MTU" to drop below 1280 as a
result of Path MTU Discovery. I opened an enchancement bug report
requesting this:

https://bugzilla.kernel.org/show_bug.cgi?id=42599

Best regards,
-- 
Tore Anderson

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] ipv6: dst_allfrag() not taken into account by TCP
  2012-01-18 14:43         ` Tore Anderson
@ 2012-01-18 14:59           ` Eric Dumazet
  2012-01-18 15:14             ` Tore Anderson
  2012-01-18 17:01             ` David Miller
  0 siblings, 2 replies; 20+ messages in thread
From: Eric Dumazet @ 2012-01-18 14:59 UTC (permalink / raw)
  To: Tore Anderson; +Cc: netdev

Le mercredi 18 janvier 2012 à 15:43 +0100, Tore Anderson a écrit :

> Another way of solving it and avoiding the Fragmentation headers
> altogether would be to allow the "route MTU" to drop below 1280 as a
> result of Path MTU Discovery. I opened an enchancement bug report
> requesting this:
> 
> https://bugzilla.kernel.org/show_bug.cgi?id=42599

Please dont open bugzilla entries for this kind of stuff, since there is
no bug in this case.

netdev is more appropriate to discuss future improvements.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] ipv6: dst_allfrag() not taken into account by TCP
  2012-01-18 14:59           ` Eric Dumazet
@ 2012-01-18 15:14             ` Tore Anderson
  2012-01-18 15:40               ` Eric Dumazet
  2012-01-18 17:01             ` David Miller
  1 sibling, 1 reply; 20+ messages in thread
From: Tore Anderson @ 2012-01-18 15:14 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev


* Eric Dumazet

> Le mercredi 18 janvier 2012 à 15:43 +0100, Tore Anderson a écrit :
> 
>> Another way of solving it and avoiding the Fragmentation headers
>> altogether would be to allow the "route MTU" to drop below 1280 as a
>> result of Path MTU Discovery. I opened an enchancement bug report
>> requesting this:
>> 
>> https://bugzilla.kernel.org/show_bug.cgi?id=42599
> 
> Please dont open bugzilla entries for this kind of stuff, since there is
> no bug in this case.

Okay...but then what is the «enhanchement» severity's intended use?

> netdev is more appropriate to discuss future improvements.

No problem. Bottom line, I'd like a "min_pmtu" IPv6 sysctl or something
similar that would allow me to configure it so that the kernel would 
accept an arbitrary low PMTU being received in an ICMPv6 PTB. That 
would prevent having to deal with the Fragmentation headers completely.

Best regards,
-- 
Tore Anderson

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [RFC] ipv6: dst_allfrag() not taken into account by TCP
  2012-01-18 15:14             ` Tore Anderson
@ 2012-01-18 15:40               ` Eric Dumazet
  0 siblings, 0 replies; 20+ messages in thread
From: Eric Dumazet @ 2012-01-18 15:40 UTC (permalink / raw)
  To: Tore Anderson; +Cc: netdev

Le mercredi 18 janvier 2012 à 16:14 +0100, Tore Anderson a écrit :

> No problem. Bottom line, I'd like a "min_pmtu" IPv6 sysctl or something
> similar that would allow me to configure it so that the kernel would 
> accept an arbitrary low PMTU being received in an ICMPv6 PTB. That 
> would prevent having to deal with the Fragmentation headers completely.

I agree, but first lets fix the bugs ;)

Once the allfrag feature works again, we can try to avoid it.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Bugzilla 42595
  2012-01-18 14:42           ` Eric Dumazet
@ 2012-01-18 15:42             ` Eric Dumazet
  2012-01-18 19:26               ` Tore Anderson
  0 siblings, 1 reply; 20+ messages in thread
From: Eric Dumazet @ 2012-01-18 15:42 UTC (permalink / raw)
  To: Tore Anderson; +Cc: netdev

Le mercredi 18 janvier 2012 à 15:42 +0100, Eric Dumazet a écrit :

> It seems that if you try to have the default ethtool settings (tso, gso,
> tx checksum ... on), transfert is very slow... I am investigating, but I
> suspect that if dst_allfrag() is true, we must disable gso as well.
> 
> 

Following patch is a good compromise, since it disables gso on the
socket on the first frame we consider too big in ip6_fragment().

So the added check on dst_allfrag() is only done once per socket,
instead of adding in sk_can_gso(sk) in fast path.

I can now let gso on on the device, and still have good transfert speed
on sockets hitting the ALLFRAG feature.

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index d97e071..e058747 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -647,8 +647,12 @@ int ip6_fragment(struct sk_buff *skb, int
(*output)(struct sk_buff *))
 	 * or if the skb it not generated by a local socket.
 	 */
 	if (!skb->local_df && skb->len > mtu) {
+
+		if (skb->sk && dst_allfrag(skb_dst(skb)))
+			sk_nocaps_add(skb->sk, NETIF_F_GSO_MASK);
+
 		skb->dev = skb_dst(skb)->dev;
 		icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
 		IP6_INC_STATS(net, ip6_dst_idev(skb_dst(skb)),
 			      IPSTATS_MIB_FRAGFAILS);
 		kfree_skb(skb);

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [RFC] ipv6: dst_allfrag() not taken into account by TCP
  2012-01-18 14:59           ` Eric Dumazet
  2012-01-18 15:14             ` Tore Anderson
@ 2012-01-18 17:01             ` David Miller
  1 sibling, 0 replies; 20+ messages in thread
From: David Miller @ 2012-01-18 17:01 UTC (permalink / raw)
  To: eric.dumazet; +Cc: tore, netdev

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Wed, 18 Jan 2012 15:59:06 +0100

> Le mercredi 18 janvier 2012 à 15:43 +0100, Tore Anderson a écrit :
> 
>> Another way of solving it and avoiding the Fragmentation headers
>> altogether would be to allow the "route MTU" to drop below 1280 as a
>> result of Path MTU Discovery. I opened an enchancement bug report
>> requesting this:
>> 
>> https://bugzilla.kernel.org/show_bug.cgi?id=42599
> 
> Please dont open bugzilla entries for this kind of stuff, since there is
> no bug in this case.

Plus, we really don't normally follow the bugzilla reports at all, and
as Eric states just report them here to the netdev list if you want
something looked into.

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: Bugzilla 42595
  2012-01-18 15:42             ` Eric Dumazet
@ 2012-01-18 19:26               ` Tore Anderson
  0 siblings, 0 replies; 20+ messages in thread
From: Tore Anderson @ 2012-01-18 19:26 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

* Eric Dumazet

> Following patch is a good compromise, since it disables gso on the
> socket on the first frame we consider too big in ip6_fragment().
> 
> So the added check on dst_allfrag() is only done once per socket,
> instead of adding in sk_can_gso(sk) in fast path.
> 
> I can now let gso on on the device, and still have good transfert speed
> on sockets hitting the ALLFRAG feature.

I can both confirm that I too saw abysmal performance when using the
default offload settings (with your two earlier patches applied only),
and also that applying this patch in addition fixes the problem, with no
apparent ill effects. Thanks!

Best regards,
-- 
Tore Anderson

^ permalink raw reply	[flat|nested] 20+ messages in thread

end of thread, other threads:[~2012-01-18 19:26 UTC | newest]

Thread overview: 20+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-01-17 16:28 [RFC] ipv6: dst_allfrag() not taken into account by TCP Eric Dumazet
2012-01-17 17:34 ` David Miller
2012-01-17 18:15   ` Eric Dumazet
2012-01-17 18:25     ` David Miller
2012-01-17 20:03 ` Tore Anderson
2012-01-17 20:25   ` Eric Dumazet
2012-01-18 12:42     ` Tore Anderson
2012-01-18 14:06       ` Eric Dumazet
2012-01-18 14:43         ` Tore Anderson
2012-01-18 14:59           ` Eric Dumazet
2012-01-18 15:14             ` Tore Anderson
2012-01-18 15:40               ` Eric Dumazet
2012-01-18 17:01             ` David Miller
2012-01-17 23:43   ` Bugzilla 42595 Eric Dumazet
2012-01-18 10:58     ` Eric Dumazet
2012-01-18 13:44       ` Eric Dumazet
2012-01-18 14:20         ` Tore Anderson
2012-01-18 14:42           ` Eric Dumazet
2012-01-18 15:42             ` Eric Dumazet
2012-01-18 19:26               ` Tore Anderson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).