netdev.vger.kernel.org archive mirror
* [RFC] ipv6: dst_allfrag() not taken into account by TCP
@ 2012-01-17 16:28 Eric Dumazet
From: Eric Dumazet @ 2012-01-17 16:28 UTC
  To: netdev; +Cc: tore

Bugzilla reference :

https://bugzilla.kernel.org/show_bug.cgi?id=42572

> An IPv4 client behind a link with a MTU of 1259 downloading a file from an IPv6
> server
> 
> When RTAX_FEATURE_ALLFRAG is set on a route, the effective TCP segment
> size does not take into account the size of the IPv6 Fragmentation
> header that needs to be included in outbound packets, causing every
> transmitted TCP segment to be fragmented across two IPv6 packets, the
> latter of which will only contain 8 bytes of actual payload.
> 
> RTAX_FEATURE_ALLFRAG is typically set on a route in response to
> receiving an ICMPv6 Packet Too Big message indicating a Path MTU of less
> than 1280 bytes. 1280 bytes is the minimum IPv6 MTU, however ICMPv6
> PTBs with MTU < 1280 are still valid, in particular when an IPv6
> packet is sent to an IPv4 destination through a stateless translator.
> Any ICMPv4 Need To Fragment packets originated from the IPv4 part of
> the path will be translated to ICMPv6 PTB which may then indicate an
> MTU of less than 1280.
> 
> RFC 2460 section 5 specifies what an IPv6 stack should do when this
> happens:
> 
> > In response to an IPv6 packet that is sent to an IPv4 destination
> > (i.e., a packet that undergoes translation from IPv6 to IPv4), the
> > originating IPv6 node may receive an ICMP Packet Too Big message
> > reporting a Next-Hop MTU less than 1280.  In that case, the IPv6 node
> > is not required to reduce the size of subsequent packets to less than
> > 1280, but must include a Fragment header in those packets so that the
> > IPv6-to-IPv4 translating router can obtain a suitable Identification
> > value to use in resulting IPv4 fragments.  Note that this means the
> > payload may have to be reduced to 1232 octets (1280 minus 40 for the
> > IPv6 header and 8 for the Fragment header), and smaller still if
> > additional extension headers are used.
> 
> The Linux kernel refuses to reduce the effective MTU to anything below
> 1280 bytes; instead it sets it to exactly 1280 bytes, and
> RTAX_FEATURE_ALLFRAG is also set. However, the TCP segment size appears
> to be set to 1240 bytes (1280 Path MTU - 40 bytes of IPv6 header),
> instead of 1232 (additionally taking into account the 8 bytes required
> by the IPv6 Fragmentation extension header).
> 
> This in turn results in rather inefficient transmission, as every 
> transmitted TCP segment now is split in two fragments containing
> 1232+8 bytes of payload.
> 
> I am attaching a tcpdump that shows this happening. In this case,
> 2a02:c0::46:0:57ee:3d82 is an IPv6-only server running Linux 3.2.0,
> while 2a02:c0::46:0:57ee:2a2a really is 87.238.42.42, a NAT device with
> an IPv4 node behind it. The link between the NAT device and the IPv4
> node has a MTU of 1259. Somewhere between the NAT device and the server
> there's a stateless IPv4/IPv6 translator. When the server sends its
> first full-sized (1500 bytes) packets, the NAT device responds with
> ICMPv4 Need To Fragment messages (MTU=1259), which are then received by
> the server in their translated form (ICMPv6 PTB, MTU 1279). After that a
> large number of these mini-fragments containing only 8 bytes of 
> payload are transmitted. They should have been avoided.
> 
> Tore
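
To make the arithmetic in the report concrete, here is a small userspace
sketch (not kernel code). The struct route_model and the helpers
apply_ptb() and split_segment() are hypothetical names made up for
illustration; they only model the behaviour described above: a PTB below
1280 clamps the MTU to 1280 and sets allfrag, and every packet then
carries an 8-byte Fragment header, with non-final fragments holding a
multiple of 8 bytes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define IPV6_MIN_MTU 1280 /* minimum IPv6 MTU (RFC 2460) */
#define IPV6_HDR_LEN   40 /* fixed IPv6 header */
#define FRAG_HDR_LEN    8 /* IPv6 Fragment extension header */

/* Hypothetical per-route state; this is NOT a kernel structure. */
struct route_model {
        int  mtu;
        bool allfrag;
};

/* Model of the PTB handling described in the report (assumption:
 * simplified, no timers or per-destination caching). */
static void apply_ptb(struct route_model *rt, int reported_mtu)
{
        if (reported_mtu < IPV6_MIN_MTU) {
                rt->mtu = IPV6_MIN_MTU; /* never go below 1280 */
                rt->allfrag = true;     /* but add a Fragment header */
        } else if (reported_mtu < rt->mtu) {
                rt->mtu = reported_mtu;
        }
}

/* Split seg_len bytes of upper-layer data (TCP header + payload) into
 * per-packet chunks. Stores the chunk sizes in out[] and returns how
 * many were written (at most out_cap). */
static size_t split_segment(const struct route_model *rt, int seg_len,
                            int *out, size_t out_cap)
{
        int room = rt->mtu - IPV6_HDR_LEN;
        int chunk;
        size_t n = 0;

        assert(seg_len > 0 && out_cap > 0);

        /* No Fragment header needed: the segment goes out as one packet. */
        if (!rt->allfrag && seg_len <= room) {
                out[0] = seg_len;
                return 1;
        }

        /* Each packet loses 8 bytes to the Fragment header, and
         * non-final fragments must carry a multiple of 8 bytes. */
        chunk = (room - FRAG_HDR_LEN) & ~7;
        while (seg_len > 0 && n < out_cap) {
                int len = seg_len < chunk ? seg_len : chunk;

                out[n++] = len;
                seg_len -= len;
        }
        return n;
}
```

With a route that starts at MTU 1500, apply_ptb(&rt, 1279) leaves it at
1280 with allfrag set. A 1240-byte segment then splits into 1232 + 8
bytes, while a 1232-byte segment fits in a single packet.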


It seems that dst_allfrag() will force us to use ip6_fragment() and
reduce effective MSS to :

MTU - sizeof(struct ipv6hdr) - sizeof(struct frag_hdr) - sizeof(struct tcphdr)

(not counting TCP options)

But tcp_mtu_to_mss() doesn't take dst_allfrag() into account, so the
computed TCP MSS might be 8 bytes too big (i.e. sizeof(struct frag_hdr))?

For MTU = 1280, we end up with a 1240-byte TCP segment (TCP header
included) instead of 1232, i.e. a base MSS of 1220 instead of 1212.
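
The header sizes behind these numbers can be checked from userspace. The
sketch below assumes a Linux/glibc system and uses the libc structures
struct ip6_hdr, struct ip6_frag and struct tcphdr as stand-ins for the
kernel's ipv6hdr, frag_hdr and tcphdr. base_mss() is a hypothetical
helper, not a kernel function, and it ignores TCP options.

```c
#define _DEFAULT_SOURCE    /* expose the BSD-style struct tcphdr in glibc */
#include <assert.h>
#include <netinet/ip6.h>   /* struct ip6_hdr, struct ip6_frag */
#include <netinet/tcp.h>   /* struct tcphdr */

/* Base MSS (TCP payload, no options) for a given IPv6 path MTU.
 * If allfrag is non-zero, room for the Fragment header is reserved. */
static int base_mss(int mtu, int allfrag)
{
        int mss = mtu - (int)sizeof(struct ip6_hdr)
                      - (int)sizeof(struct tcphdr);

        if (allfrag)
                mss -= (int)sizeof(struct ip6_frag);

        assert(mss > 0);
        return mss;
}
```

For MTU = 1280 this gives 1220 without the Fragment header and 1212 with
it. Adding back the 20-byte TCP header yields the 1240 and 1232 figures
quoted above.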

/* Calculate MSS. Not accounting for SACKs here.  */
int tcp_mtu_to_mss(const struct sock *sk, int pmtu)
{
        const struct tcp_sock *tp = tcp_sk(sk);
        const struct inet_connection_sock *icsk = inet_csk(sk);
        int mss_now;

        /* Calculate base mss without TCP options:
           It is MMS_S - sizeof(tcphdr) of rfc1122
         */
        mss_now = pmtu - icsk->icsk_af_ops->net_header_len - sizeof(struct tcphdr);

        /* Clamp it (mss_clamp does not include tcp options) */
        if (mss_now > tp->rx_opt.mss_clamp)
                mss_now = tp->rx_opt.mss_clamp;

        /* Now subtract optional transport overhead */
        mss_now -= icsk->icsk_ext_hdr_len;

        /* Then reserve room for full set of TCP options and 8 bytes of data */
        if (mss_now < 48)
                mss_now = 48;

        /* Now subtract TCP options size, not including SACKs */
        mss_now -= tp->tcp_header_len - sizeof(struct tcphdr);

        return mss_now;
}
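
As a rough illustration of the question raised above, here is a
userspace model of the same calculation with one extra step that
reserves the Fragment header when allfrag is set. This is only a sketch
under assumptions: struct mss_model, its fields and model_mtu_to_mss()
are hypothetical stand-ins for the socket state used by
tcp_mtu_to_mss(), and the place where the 8 bytes are subtracted is one
possible choice, not a statement of how the kernel should be changed.

```c
#include <assert.h>
#include <stdbool.h>

#define TCP_HDR_LEN  20 /* sizeof(struct tcphdr) without options */
#define FRAG_HDR_LEN  8 /* IPv6 Fragment extension header */

/* Hypothetical stand-in for the fields tcp_mtu_to_mss() reads. */
struct mss_model {
        int  net_header_len; /* network header, 40 for IPv6 */
        int  ext_hdr_len;    /* extension headers / IP options */
        int  mss_clamp;      /* peer's MSS limit, no TCP options */
        int  tcp_header_len; /* TCP header including options */
        bool allfrag;        /* models dst_allfrag() on the route */
};

static int model_mtu_to_mss(const struct mss_model *m, int pmtu)
{
        int mss_now;

        assert(m->tcp_header_len >= TCP_HDR_LEN);

        /* Base MSS without TCP options */
        mss_now = pmtu - m->net_header_len - TCP_HDR_LEN;

        /* Assumed extra step: reserve room for the Fragment header */
        if (m->allfrag)
                mss_now -= FRAG_HDR_LEN;

        /* Clamp it (mss_clamp does not include TCP options) */
        if (mss_now > m->mss_clamp)
                mss_now = m->mss_clamp;

        /* Subtract optional transport overhead */
        mss_now -= m->ext_hdr_len;

        /* Reserve room for full set of TCP options and 8 bytes of data */
        if (mss_now < 48)
                mss_now = 48;

        /* Subtract TCP options size, not including SACKs */
        mss_now -= m->tcp_header_len - TCP_HDR_LEN;

        return mss_now;
}
```

With net_header_len = 40, no extension headers, a large mss_clamp and a
32-byte TCP header (20 bytes plus 12 bytes of timestamp option), a path
MTU of 1280 gives 1208 without allfrag and 1200 with it.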

