Re: [Qemu-devel] TCP performance problems - GSO/TSO, MSS, 8139cp related

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* Re: [Qemu-devel] TCP performance problems - GSO/TSO, MSS, 8139cp related
       [not found] <20161111210500.GE1041@n2100.armlinux.org.uk>
@ 2016-11-11 21:23 ` David Woodhouse
  2016-11-11 22:33   ` Russell King - ARM Linux
  2016-11-11 22:44   ` Russell King - ARM Linux
  0 siblings, 2 replies; 6+ messages in thread
From: David Woodhouse @ 2016-11-11 21:23 UTC (permalink / raw)
  To: Russell King - ARM Linux, netdev; +Cc: qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1494 bytes --]

On Fri, 2016-11-11 at 21:05 +0000, Russell King - ARM Linux wrote:
> 
> 18:59:38.782818 IP (tos 0x0, ttl 52, id 35619, offset 0, flags [DF], proto TCP (6), length 60)
>     84.xx.xxx.196.61236 > 195.92.253.2.http: Flags [S], cksum 0x88db (correct), seq 158975430, win 29200, options [mss 1452,sackOK,TS val 1377914597 ecr 0,nop,wscale 7], length 0

... (MSS 1452)

> 18:59:38.816371 IP (tos 0x0, ttl 64, id 25879, offset 0, flags [DF], proto TCP (6), length 1500)
>     195.92.253.2.http > 84.xx.xxx.196.61236: Flags [.], seq 1:1449, ack 154, win 235, options [nop,nop,TS val 3363137952 ecr 1377914627], length 1448: HTTP, length: 1448
> 18:59:38.816393 IP (tos 0x0, ttl 64, id 25880, offset 0, flags [DF], proto TCP (6), length 1484)
>     195.92.253.2.http > 84.xx.xxx.196.61236: Flags [.], seq 1449:2881, ack 154, win 235, options [nop,nop,TS val 3363137952 ecr 1377914627], length 1432: HTTP

Can you instrument cp_start_xmit() in 8139cp.c and get it to print the
value of 'mss' when this happens?

All we do is take that value from skb_shinfo(skb)->gso_size, shift it a
bit, and shove it in the descriptor ring. There's not much scope for a
driver-specific bug.

It's also *fairly* unlikely that the kernel in the guest has developed
a bug and isn't setting gso_size sanely. I'm more inclined to suspect
that qemu isn't properly emulating those bits. But at first glance at
the code, it looks like *that's* been there for the last decade too...

-- 
dwmw2

[-- Attachment #2: smime.p7s --]
[-- Type: application/x-pkcs7-signature, Size: 5760 bytes --]

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Qemu-devel] TCP performance problems - GSO/TSO, MSS, 8139cp related
  2016-11-11 21:23 ` [Qemu-devel] TCP performance problems - GSO/TSO, MSS, 8139cp related David Woodhouse
@ 2016-11-11 22:33   ` Russell King - ARM Linux
  2016-11-12  2:52     ` David Miller
  2016-11-11 22:44   ` Russell King - ARM Linux
  1 sibling, 1 reply; 6+ messages in thread
From: Russell King - ARM Linux @ 2016-11-11 22:33 UTC (permalink / raw)
  To: David Woodhouse; +Cc: netdev, qemu-devel

On Fri, Nov 11, 2016 at 09:23:43PM +0000, David Woodhouse wrote:
> On Fri, 2016-11-11 at 21:05 +0000, Russell King - ARM Linux wrote:
> > 
> > 18:59:38.782818 IP (tos 0x0, ttl 52, id 35619, offset 0, flags [DF], proto TCP (6), length 60)
> >     84.xx.xxx.196.61236 > 195.92.253.2.http: Flags [S], cksum 0x88db (correct), seq 158975430, win 29200, options [mss 1452,sackOK,TS val 1377914597 ecr 0,nop,wscale 7], length 0
> 
> ... (MSS 1452)
> 
> > 18:59:38.816371 IP (tos 0x0, ttl 64, id 25879, offset 0, flags [DF], proto TCP (6), length 1500)
> >     195.92.253.2.http > 84.xx.xxx.196.61236: Flags [.], seq 1:1449, ack 154, win 235, options [nop,nop,TS val 3363137952 ecr 1377914627], length 1448: HTTP, length: 1448
> > 18:59:38.816393 IP (tos 0x0, ttl 64, id 25880, offset 0, flags [DF], proto TCP (6), length 1484)
> >     195.92.253.2.http > 84.xx.xxx.196.61236: Flags [.], seq 1449:2881, ack 154, win 235, options [nop,nop,TS val 3363137952 ecr 1377914627], length 1432: HTTP
> 
> Can you instrument cp_start_xmit() in 8139cp.c and get it to print the
> value of 'mss' when this happens?

Well, I'm not going to fiddle in such a way with a public box... that
would be utter madness.  I'll fiddle with mvneta locally on 4.9-rc
instead - and yes, I know that's not the F23 4.4 kernel, so doesn't
really tell us very much.

I _could_ ask bryce to setup another VM on ZenV for me to play with,
but we'll have to wait for bryce to be around for that... I don't
want to break zenv or zeniv. :)

> All we do is take that value from skb_shinfo(skb)->gso_size, shift it a
> bit, and shove it in the descriptor ring. There's not much scope for a
> driver-specific bug.

Unless there's a different interpretation of what the MSS field in the
driver means...

Looking at mvneta, which works correctly,

- On mvneta (192.168.1.59):

21:39:38.535549 IP (tos 0x0, ttl 64, id 27668, offset 0, flags [DF], proto TCP (6), length 7252)
    192.168.1.59.55170 > 192.168.1.18.5001: Flags [.], seq 25:7225, ack 1, win 229, options [nop,nop,TS val 62231754 ecr 1387514367], length 7200

- On laptop (192.168.1.18):

21:39:38.537442 IP (tos 0x0, ttl 64, id 27668, offset 0, flags [DF], proto TCP (6), length 1492)
    192.168.1.59.55170 > 192.168.1.18.commplex-link: Flags [.], seq 25:1465, ack 1, win 229, options [nop,nop,TS val 62231754 ecr 1387514367], length 1440
21:39:38.537453 IP (tos 0x0, ttl 64, id 27669, offset 0, flags [DF], proto TCP (6), length 1492)
    192.168.1.59.55170 > 192.168.1.18.commplex-link: Flags [.], seq 1465:2905, ack 1, win 229, options [nop,nop,TS val 62231754 ecr 1387514367], length 1440
21:39:38.537461 IP (tos 0x0, ttl 64, id 27670, offset 0, flags [DF], proto TCP (6), length 1492)
    192.168.1.59.55170 > 192.168.1.18.commplex-link: Flags [.], seq 2905:4345, ack 1, win 229, options [nop,nop,TS val 62231754 ecr 1387514367], length 1440
21:39:38.537464 IP (tos 0x0, ttl 64, id 9968, offset 0, flags [DF], proto TCP (6), length 52)
    192.168.1.18.commplex-link > 192.168.1.59.55170: Flags [.], cksum 0x83c4 (incorrect -> 0xa338), ack 1465, win 249, options [nop,nop,TS val 1387514368 ecr 62231754], length 0
21:39:38.537465 IP (tos 0x0, ttl 64, id 27671, offset 0, flags [DF], proto TCP (6), length 1492)
    192.168.1.59.55170 > 192.168.1.18.commplex-link: Flags [.], seq 4345:5785, ack 1, win 229, options [nop,nop,TS val 62231754 ecr 1387514367], length 1440
21:39:38.537469 IP (tos 0x0, ttl 64, id 27672, offset 0, flags [DF], proto TCP (6), length 1492)
    192.168.1.59.55170 > 192.168.1.18.commplex-link: Flags [.], seq 5785:7225, ack 1, win 229, options [nop,nop,TS val 62231754 ecr 1387514367], length 1440

which is all correct.  Now, these packets have a larger TCP header
due to the options:

        0x0000:  0022 6815 37dd 0050 4321 0201 0800 4500  ."h.7..PC!....E.
                 ^mac                               ^iphdr
        0x0010:  05d4 6c14 4000 4006 4572 c0a8 013b c0a8  ..l.@.@.Er...;..
        0x0020:  0112 d782 1389 4cb4 f8f4 7454 ef10 8010  ......L...tT....
                      ^tcphdr
        0x0030:  00e5 2a80 0000 0101 080a 03b5 94ca 52b3  ..*...........R.
                                ^tcpopts
        0x0040:  c9ff 0000 0000 0000 0001 0000 1389 0000  ................
                      ^start of data
        0x0050:  0000 0000 0000 ffff fc18 3435 3637 3839  ..........456789
        0x0060:  3031 3233 3435 3637 3839 3031 3233 3435  0123456789012345

So the data starts at 66 (0x42) into this packet, followed by 1440 bytes
of data.  Looking at drivers/net/ethernet/marvell/mvneta.c, the only
way this can happen is if skb_shinfo(skb)->gso_size is 1440.  I'll
instrument mvneta to dump this value...

While waiting for the kernel to build, I've been reading the TCP code,
and found this:

/* Compute the current effective MSS, taking SACKs and IP options,
 * and even PMTU discovery events into account.
 */
unsigned int tcp_current_mss(struct sock *sk)
...
        /* The mss_cache is sized based on tp->tcp_header_len, which assumes
         * some common options. If this is an odd packet (because we have SACK
         * blocks etc) then our calculated header_len will be different, and
         * we have to adjust mss_now correspondingly */

mss_now is what becomes gso_size, which means that gso_size will be
adjusted for the TCP options - which makes sense.  So, because there
are 12 bytes of options in the above hex packet dump, negotiated MSS
- 12 gives the data payload size of 1440, and so gso_size will be
1440.

And now, going back to that kernel that finished compiling...

[   53.468319] skb len=7266 hdr len=66 gso_size=1440
...
[   53.728752] skb len=64866 hdr len=66 gso_size=1440

so my guesses were right, at least for 4.9-rc4.  Whether that holds
for the fedora f23 4.4 kernel is another matter.  For the record,
removing the TCPMSS clamp gives the expected result:

[  231.244018] skb len=7306 hdr len=66 gso_size=1448

So, gso_size is the size of the TCP data after the TCP header plus
TCP options.

The other thing to notice is that the SKB length minus header
length is divisible by the gso size for these full-sized packets.
There is no "one packet larger next packet smaller" here.

> It's also *fairly* unlikely that the kernel in the guest has developed
> a bug and isn't setting gso_size sanely. I'm more inclined to suspect
> that qemu isn't properly emulating those bits. But at first glance at
> the code, it looks like *that's* been there for the last decade too...

Whether or not it's been there for a decade is kind of irrelevant -
bugs can be around for a decade and remain undiscovered.

Looking at the 8139C information, it says:

"The new buffer management algorithm provides capabilities of Microsoft
Large-Send offload" and as yet I haven't found anything that describes
what this is or how it works.  How certain are we that the LSO MSS
value is the same as our gso_size, iow the size of the data after
the TCP header and options?

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Qemu-devel] TCP performance problems - GSO/TSO, MSS, 8139cp related
  2016-11-11 21:23 ` [Qemu-devel] TCP performance problems - GSO/TSO, MSS, 8139cp related David Woodhouse
  2016-11-11 22:33   ` Russell King - ARM Linux
@ 2016-11-11 22:44   ` Russell King - ARM Linux
  2016-11-14  9:29     ` Dr. David Alan Gilbert
  1 sibling, 1 reply; 6+ messages in thread
From: Russell King - ARM Linux @ 2016-11-11 22:44 UTC (permalink / raw)
  To: David Woodhouse; +Cc: netdev, qemu-devel

On Fri, Nov 11, 2016 at 09:23:43PM +0000, David Woodhouse wrote:
> It's also *fairly* unlikely that the kernel in the guest has developed
> a bug and isn't setting gso_size sanely. I'm more inclined to suspect
> that qemu isn't properly emulating those bits. But at first glance at
> the code, it looks like *that's* been there for the last decade too...

I take issue with that, having looked at the qemu rtl8139 code:

                if ((txdw0 & CP_TX_LGSEN) && ip_protocol == IP_PROTO_TCP)
                {
                    int large_send_mss = (txdw0 >> 16) & CP_TC_LGSEN_MSS_MASK;

                    DPRINTF("+++ C+ mode offloaded task TSO MTU=%d IP data %d "
                        "frame data %d specified MSS=%d\n", ETH_MTU,
                        ip_data_len, saved_size - ETH_HLEN, large_send_mss);

That's the only reference to "large_send_mss" there, other than that,
the MSS value that gets stuck into the field by 8139cp.c is completely
unused.  Instead, qemu does this:

                eth_payload_data = saved_buffer + ETH_HLEN;
                eth_payload_len  = saved_size   - ETH_HLEN;

                ip = (ip_header*)eth_payload_data;

                    hlen = IP_HEADER_LENGTH(ip);
                    ip_data_len = be16_to_cpu(ip->ip_len) - hlen;

                    tcp_header *p_tcp_hdr = (tcp_header*)(eth_payload_data + hlen);
                    int tcp_hlen = TCP_HEADER_DATA_OFFSET(p_tcp_hdr);

                    /* ETH_MTU = ip header len + tcp header len + payload */
                    int tcp_data_len = ip_data_len - tcp_hlen;
                    int tcp_chunk_size = ETH_MTU - hlen - tcp_hlen;

                    for (tcp_send_offset = 0; tcp_send_offset < tcp_data_len; tcp_send_offset += tcp_chunk_size)
                    {

It uses a fixed value of ETH_MTU to calculate the size of the TCP
data chunks, and this is not surprisingly the well known:

#define ETH_MTU     1500

Qemu seems to be buggy - it ignores the MSS value, and always tries to
send 1500 byte frames.

-- 
RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
according to speedtest.net.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Qemu-devel] TCP performance problems - GSO/TSO, MSS, 8139cp related
  2016-11-11 22:33   ` Russell King - ARM Linux
@ 2016-11-12  2:52     ` David Miller
  0 siblings, 0 replies; 6+ messages in thread
From: David Miller @ 2016-11-12  2:52 UTC (permalink / raw)
  To: linux; +Cc: dwmw2, netdev, qemu-devel

From: Russell King - ARM Linux <linux@armlinux.org.uk>
Date: Fri, 11 Nov 2016 22:33:08 +0000

> "The new buffer management algorithm provides capabilities of Microsoft
> Large-Send offload" and as yet I haven't found anything that describes
> what this is or how it works.

For once I will give Microsoft a big shout out here.

This, and everything a Microsoft networking driver interfaces to, is
_very_ much documented in extreme detail in the Microsoft NDIS
(Network Driver Interface Specification).

Microsoft's networking driver interfaces and expectations are
documented 1,000 times better than that of Linux.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Qemu-devel] TCP performance problems - GSO/TSO, MSS, 8139cp related
  2016-11-11 22:44   ` Russell King - ARM Linux
@ 2016-11-14  9:29     ` Dr. David Alan Gilbert
  2016-11-14 16:25       ` Stefan Hajnoczi
  0 siblings, 1 reply; 6+ messages in thread
From: Dr. David Alan Gilbert @ 2016-11-14  9:29 UTC (permalink / raw)
  To: Russell King - ARM Linux, jasowang, vyasevic, stefanha
  Cc: David Woodhouse, netdev, qemu-devel

* Russell King - ARM Linux (linux@armlinux.org.uk) wrote:
> On Fri, Nov 11, 2016 at 09:23:43PM +0000, David Woodhouse wrote:
> > It's also *fairly* unlikely that the kernel in the guest has developed
> > a bug and isn't setting gso_size sanely. I'm more inclined to suspect
> > that qemu isn't properly emulating those bits. But at first glance at
> > the code, it looks like *that's* been there for the last decade too...
> 
> I take issue with that, having looked at the qemu rtl8139 code:
> 
>                 if ((txdw0 & CP_TX_LGSEN) && ip_protocol == IP_PROTO_TCP)
>                 {
>                     int large_send_mss = (txdw0 >> 16) & CP_TC_LGSEN_MSS_MASK;
> 
>                     DPRINTF("+++ C+ mode offloaded task TSO MTU=%d IP data %d "
>                         "frame data %d specified MSS=%d\n", ETH_MTU,
>                         ip_data_len, saved_size - ETH_HLEN, large_send_mss);
> 
> That's the only reference to "large_send_mss" there, other than that,
> the MSS value that gets stuck into the field by 8139cp.c is completely
> unused.  Instead, qemu does this:
> 
>                 eth_payload_data = saved_buffer + ETH_HLEN;
>                 eth_payload_len  = saved_size   - ETH_HLEN;
> 
>                 ip = (ip_header*)eth_payload_data;
> 
>                     hlen = IP_HEADER_LENGTH(ip);
>                     ip_data_len = be16_to_cpu(ip->ip_len) - hlen;
> 
>                     tcp_header *p_tcp_hdr = (tcp_header*)(eth_payload_data + hlen);
>                     int tcp_hlen = TCP_HEADER_DATA_OFFSET(p_tcp_hdr);
> 
>                     /* ETH_MTU = ip header len + tcp header len + payload */
>                     int tcp_data_len = ip_data_len - tcp_hlen;
>                     int tcp_chunk_size = ETH_MTU - hlen - tcp_hlen;
> 
>                     for (tcp_send_offset = 0; tcp_send_offset < tcp_data_len; tcp_send_offset += tcp_chunk_size)
>                     {
> 
> It uses a fixed value of ETH_MTU to calculate the size of the TCP
> data chunks, and this is not surprisingly the well known:
> 
> #define ETH_MTU     1500
> 
> Qemu seems to be buggy - it ignores the MSS value, and always tries to
> send 1500 byte frames.

cc'ing in Stefan who last touched that code and Jason and Vlad who
know the net code.

Dave

> -- 
> RMK's Patch system: http://www.armlinux.org.uk/developer/patches/
> FTTC broadband for 0.8mile line: currently at 9.6Mbps down 400kbps up
> according to speedtest.net.
> 
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [Qemu-devel] TCP performance problems - GSO/TSO, MSS, 8139cp related
  2016-11-14  9:29     ` Dr. David Alan Gilbert
@ 2016-11-14 16:25       ` Stefan Hajnoczi
  0 siblings, 0 replies; 6+ messages in thread
From: Stefan Hajnoczi @ 2016-11-14 16:25 UTC (permalink / raw)
  To: Russell King - ARM Linux, David Woodhouse
  Cc: jasowang, vyasevic, stefanha, netdev, qemu-devel,
	igor.v.kovalenko, dgilbert

[-- Attachment #1: Type: text/plain, Size: 3792 bytes --]

On Mon, Nov 14, 2016 at 09:29:48AM +0000, Dr. David Alan Gilbert wrote:
> * Russell King - ARM Linux (linux@armlinux.org.uk) wrote:
> > On Fri, Nov 11, 2016 at 09:23:43PM +0000, David Woodhouse wrote:
> > > It's also *fairly* unlikely that the kernel in the guest has developed
> > > a bug and isn't setting gso_size sanely. I'm more inclined to suspect
> > > that qemu isn't properly emulating those bits. But at first glance at
> > > the code, it looks like *that's* been there for the last decade too...
> > 
> > I take issue with that, having looked at the qemu rtl8139 code:
> > 
> >                 if ((txdw0 & CP_TX_LGSEN) && ip_protocol == IP_PROTO_TCP)
> >                 {
> >                     int large_send_mss = (txdw0 >> 16) & CP_TC_LGSEN_MSS_MASK;
> > 
> >                     DPRINTF("+++ C+ mode offloaded task TSO MTU=%d IP data %d "
> >                         "frame data %d specified MSS=%d\n", ETH_MTU,
> >                         ip_data_len, saved_size - ETH_HLEN, large_send_mss);
> > 
> > That's the only reference to "large_send_mss" there, other than that,
> > the MSS value that gets stuck into the field by 8139cp.c is completely
> > unused.  Instead, qemu does this:
> > 
> >                 eth_payload_data = saved_buffer + ETH_HLEN;
> >                 eth_payload_len  = saved_size   - ETH_HLEN;
> > 
> >                 ip = (ip_header*)eth_payload_data;
> > 
> >                     hlen = IP_HEADER_LENGTH(ip);
> >                     ip_data_len = be16_to_cpu(ip->ip_len) - hlen;
> > 
> >                     tcp_header *p_tcp_hdr = (tcp_header*)(eth_payload_data + hlen);
> >                     int tcp_hlen = TCP_HEADER_DATA_OFFSET(p_tcp_hdr);
> > 
> >                     /* ETH_MTU = ip header len + tcp header len + payload */
> >                     int tcp_data_len = ip_data_len - tcp_hlen;
> >                     int tcp_chunk_size = ETH_MTU - hlen - tcp_hlen;
> > 
> >                     for (tcp_send_offset = 0; tcp_send_offset < tcp_data_len; tcp_send_offset += tcp_chunk_size)
> >                     {
> > 
> > It uses a fixed value of ETH_MTU to calculate the size of the TCP
> > data chunks, and this is not surprisingly the well known:
> > 
> > #define ETH_MTU     1500
> > 
> > Qemu seems to be buggy - it ignores the MSS value, and always tries to
> > send 1500 byte frames.
> 
> cc'ing in Stefan who last touched that code and Jason and Vlad who
> know the net code.

CCing Igor Kovalenko who implemented "fixed for TCP segmentation
offloading - removed dependency on slirp.h" in 2006.  I don't actually
expect him to remember this from 10 years ago though :).

Looking at the history the large_send_mss variable was never used for
anything beyond the debug printf.

The datasheet for this NIC is here:
http://realtek.info/pdf/rtl8139cp.pdf.  See 9.2.1 Transmit.

Does this untested patch work for you?

diff --git a/hw/net/rtl8139.c b/hw/net/rtl8139.c
index f05e59c..a3f1af5 100644
--- a/hw/net/rtl8139.c
+++ b/hw/net/rtl8139.c
@@ -2167,9 +2167,13 @@ static int rtl8139_cplus_transmit_one(RTL8139State *s)
                     goto skip_offload;
                 }
 
-                /* ETH_MTU = ip header len + tcp header len + payload */
+                /* MSS too small */
+                if (tcp_hlen + hlen >= large_send_mss) {
+                    goto skip_offload;
+                }
+
                 int tcp_data_len = ip_data_len - tcp_hlen;
-                int tcp_chunk_size = ETH_MTU - hlen - tcp_hlen;
+                int tcp_chunk_size = large_send_mss - hlen - tcp_hlen;
 
                 DPRINTF("+++ C+ mode TSO IP data len %d TCP hlen %d TCP "
                     "data len %d TCP chunk size %d\n", ip_data_len,

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]

^ permalink raw reply related	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2016-11-14 16:25 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <20161111210500.GE1041@n2100.armlinux.org.uk>
2016-11-11 21:23 ` [Qemu-devel] TCP performance problems - GSO/TSO, MSS, 8139cp related David Woodhouse
2016-11-11 22:33   ` Russell King - ARM Linux
2016-11-12  2:52     ` David Miller
2016-11-11 22:44   ` Russell King - ARM Linux
2016-11-14  9:29     ` Dr. David Alan Gilbert
2016-11-14 16:25       ` Stefan Hajnoczi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).