From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:50680) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aglrL-0007Uy-Mb for qemu-devel@nongnu.org; Fri, 18 Mar 2016 00:18:01 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1aglrH-000579-I9 for qemu-devel@nongnu.org; Fri, 18 Mar 2016 00:17:59 -0400 Received: from mx1.redhat.com ([209.132.183.28]:46685) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aglrH-000571-8D for qemu-devel@nongnu.org; Fri, 18 Mar 2016 00:17:55 -0400 Received: from int-mx11.intmail.prod.int.phx2.redhat.com (int-mx11.intmail.prod.int.phx2.redhat.com [10.5.11.24]) by mx1.redhat.com (Postfix) with ESMTPS id 5805F141D for ; Fri, 18 Mar 2016 04:17:54 +0000 (UTC) References: <1458033424-25414-1-git-send-email-wexu@redhat.com> <1458033424-25414-2-git-send-email-wexu@redhat.com> <56EA6DE2.7070502@redhat.com> <56EADF28.3000101@redhat.com> <56EB61D8.8080202@redhat.com> From: Wei Xu Message-ID: <56EB816D.5000804@redhat.com> Date: Fri, 18 Mar 2016 12:17:49 +0800 MIME-Version: 1.0 In-Reply-To: <56EB61D8.8080202@redhat.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: quoted-printable Subject: Re: [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Jason Wang , qemu-devel@nongnu.org On 2016=E5=B9=B403=E6=9C=8818=E6=97=A5 10:03, Jason Wang wrote: > > On 03/18/2016 12:45 AM, Wei Xu wrote: >> On 2016=E5=B9=B403=E6=9C=8817=E6=97=A5 16:42, Jason Wang wrote: >> >>> On 03/15/2016 05:17 PM, wexu@redhat.com wrote: >>>> From: Wei Xu >>>> >>>> All the data packets in a tcp connection will be cached to a big buf= fer >>>> in every receive interval, and will be sent out via a timer, the >>>> 'virtio_net_rsc_timeout' controls the interval, the value will >>>> influent the >>>> performance and response of tcp connection extremely, 50000(50us) is= a >>>> experience value to gain a performance improvement, since the whql t= est >>>> sends packets every 100us, so '300000(300us)' can pass the test case= , >>>> this is also the default value, it's gonna to be tunable. >>>> >>>> The timer will only be triggered if the packets pool is not empty, >>>> and it'll drain off all the cached packets >>>> >>>> 'NetRscChain' is used to save the segments of different protocols in= a >>>> VirtIONet device. >>>> >>>> The main handler of TCP includes TCP window update, duplicated ACK >>>> check >>>> and the real data coalescing if the new segment passed sanity check >>>> and is identified as an 'wanted' one. >>>> >>>> An 'wanted' segment means: >>>> 1. Segment is within current window and the sequence is the expected >>>> one. >>>> 2. ACK of the segment is in the valid window. >>>> 3. If the ACK in the segment is a duplicated one, then it must less >>>> than 2, >>>> this is to notify upper layer TCP starting retransmission due t= o >>>> the spec. >>>> >>>> Sanity check includes: >>>> 1. Incorrect version in IP header >>>> 2. IP options & IP fragment >>>> 3. Not a TCP packets >>>> 4. Sanity size check to prevent buffer overflow attack. >>>> >>>> There maybe more cases should be considered such as ip >>>> identification other >>>> flags, while it broke the test because windows set it to the same >>>> even it's >>>> not a fragment. >>>> >>>> Normally it includes 2 typical ways to handle a TCP control flag, >>>> 'bypass' >>>> and 'finalize', 'bypass' means should be sent out directly, and >>>> 'finalize' >>>> means the packets should also be bypassed, and this should be done >>>> after searching for the same connection packets in the pool and send= ing >>>> all of them out, this is to avoid out of data. >>>> >>>> All the 'SYN' packets will be bypassed since this always begin a new= ' >>>> connection, other flags such 'FIN/RST' will trigger a finalization, >>>> because >>>> this normally happens upon a connection is going to be closed, an >>>> 'URG' packet >>>> also finalize current coalescing unit while there maybe protocol >>>> difference to >>>> different OS. >>> But URG packet should be sent as quickly as possible regardless of >>> ordering, no? >> Yes, you right, URG will terminate the current 'SCU', i'll amend the >> commit log. >> >>>> Statistics can be used to monitor the basic coalescing status, the >>>> 'out of order' >>>> and 'out of window' means how many retransmitting packets, thus >>>> describe the >>>> performance intuitively. >>>> >>>> Signed-off-by: Wei Xu >>>> --- >>>> hw/net/virtio-net.c | 486 >>>> ++++++++++++++++++++++++++++++++++++++++- >>>> include/hw/virtio/virtio-net.h | 1 + >>>> include/hw/virtio/virtio.h | 72 ++++++ >>>> 3 files changed, 558 insertions(+), 1 deletion(-) > [...] > >>>> + } else { >>>> + /* Coalesce window update */ >>>> + o_tcp->th_win =3D n_tcp->th_win; >>>> + chain->stat.win_update++; >>>> + return RSC_COALESCE; >>>> + } >>>> + } else { >>>> + /* pure ack, update ack */ >>>> + o_tcp->th_ack =3D n_tcp->th_ack; >>>> + chain->stat.pure_ack++; >>>> + return RSC_COALESCE; >>> Looks like there're something I missed. The spec said: >>> >>> "In other words, any pure ACK that is not a duplicate ACK or a window >>> update triggers an exception and must not be coalesced. All such pure >>> ACKs must be indicated as individual segments." >>> >>> Does it mean we *should not* coalesce windows update and pure ack? >>> (Since it can wakeup transmission)? >> It's also a little bit inexplicit and flexible due to the spec, please >> see the flowchart I on the same page. >> >> Comments about the flowchart: >> ----------------------------------------------------------------------= -- >> The first of the following two flowcharts describes the rules for >> coalescing segments and updating the TCP headers. >> This flowchart refers to mechanisms for distinguishing valid duplicate >> ACKs and window updates. The second flowchart describes these mechanis= ms. >> ----------------------------------------------------------------------= -- >> As show in the flowchart, only status 'C' will break current scu and >> get finalized, both 'A' and 'B' can be coalesced afaik. >> > Interesting, looks like you're right. > >>>> + } >>>> +} >>>> + >>>> +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain, >>>> NetRscSeg *seg, >>>> + const uint8_t *buf, NetRscUnit >>>> *n_unit) >>>> +{ >>>> + void *data; >>>> + uint16_t o_ip_len; >>>> + uint32_t nseq, oseq; >>>> + NetRscUnit *o_unit; >>>> + >>>> + o_unit =3D &seg->unit; >>>> + o_ip_len =3D htons(*o_unit->ip_plen); >>>> + nseq =3D htonl(n_unit->tcp->th_seq); >>>> + oseq =3D htonl(o_unit->tcp->th_seq); >>>> + >>>> + if (n_unit->tcp_hdrlen > TCP_HDR_SZ) { >>>> + /* Log this only for debugging observation */ >>>> + chain->stat.tcp_option++; >>>> + } >>>> + >>>> + /* Ignore packet with more/larger tcp options */ >>>> + if (n_unit->tcp_hdrlen > o_unit->tcp_hdrlen) { >>> What if n_unit->tcp_hdrlen > o_uint->tcp_hdr_len ? >> do you mean '<'? that also means some option changed, maybe i should >> include it. > Yes. > >>>> + chain->stat.tcp_larger_option++; >>>> + return RSC_FINAL; >>>> + } >>>> + >>>> + /* out of order or retransmitted. */ >>>> + if ((nseq - oseq) > MAX_TCP_PAYLOAD) { >>>> + chain->stat.data_out_of_win++; >>>> + return RSC_FINAL; >>>> + } >>>> + >>>> + data =3D ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen; >>>> + if (nseq =3D=3D oseq) { >>>> + if ((0 =3D=3D o_unit->payload) && n_unit->payload) { >>>> + /* From no payload to payload, normal case, not a dup >>>> ack or etc */ >>>> + chain->stat.data_after_pure_ack++; >>>> + goto coalesce; >>>> + } else { >>>> + return virtio_net_rsc_handle_ack(chain, seg, buf, >>>> + n_unit->tcp, >>>> o_unit->tcp); >>>> + } >>>> + } else if ((nseq - oseq) !=3D o_unit->payload) { >>>> + /* Not a consistent packet, out of order */ >>>> + chain->stat.data_out_of_order++; >>>> + return RSC_FINAL; >>>> + } else { >>>> +coalesce: >>>> + if ((o_ip_len + n_unit->payload) > chain->max_payload) { >>>> + chain->stat.over_size++; >>>> + return RSC_FINAL; >>>> + } >>>> + >>>> + /* Here comes the right data, the payload lengh in v4/v6 is >>>> different, >>>> + so use the field value to update and record the new data >>>> len */ >>>> + o_unit->payload +=3D n_unit->payload; /* update new data le= n */ >>>> + >>>> + /* update field in ip header */ >>>> + *o_unit->ip_plen =3D htons(o_ip_len + n_unit->payload); >>>> + >>>> + /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be >>>> coalesced >>>> + for windows guest, while this may change the behavior >>>> for linux >>>> + guest. */ >>> This needs more thought, 'can' probably means don't. (Linux GRO won't >>> merge PUSH packet). >> Yes, since it's mainly for win guest, how about take this as this >> first and do more test and see how to handle it? > Right, this is not an issue probably but just an optimization for laten= cy. > > [...] > >>>> + return NULL; >>>> + } >>>> + >>>> + QTAILQ_FOREACH(chain, &n->rsc_chains, next) { >>>> + if (chain->proto =3D=3D proto) { >>>> + return chain; >>>> + } >>>> + } >>>> + >>>> + chain =3D g_malloc(sizeof(*chain)); >>>> + chain->hdr_size =3D n->guest_hdr_len; >>> Why introduce a specified field for instead of just use >>> n->guest_hdr_len? >> this is to reduce invoking times to find VirtIONet by 'nc', there are >> a few places will use it. > Okay, then I think you'd better keep a pointer to VirtIONet structure > instead. (Consider you may want to refer more data from it, we don't > want to duplicate fields in two places). sure, that's a good idea. > >>>> + chain->proto =3D proto; >>>> + chain->max_payload =3D MAX_IP4_PAYLOAD; >>>> + chain->drain_timer =3D timer_new_ns(QEMU_CLOCK_VIRTUAL, >>>> + virtio_net_rsc_purge, chain); >>>> + memset(&chain->stat, 0, sizeof(chain->stat)); >>>> + >>>> + QTAILQ_INIT(&chain->buffers); >>>> + QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next); >>>> + >>>> + return chain; >>>> +} >>>> + >>>> +static ssize_t virtio_net_rsc_receive(NetClientState *nc, >>>> + const uint8_t *buf, size_t si= ze) >>>> +{ >>>> + uint16_t proto; >>>> + NetRscChain *chain; >>>> + struct eth_header *eth; >>>> + VirtIONet *n; >>>> + >>>> + n =3D qemu_get_nic_opaque(nc); >>>> + if (size < (n->guest_hdr_len + ETH_HDR_SZ)) { >>>> + return virtio_net_do_receive(nc, buf, size); >>>> + } >>>> + >>>> + eth =3D (struct eth_header *)(buf + n->guest_hdr_len); >>>> + proto =3D htons(eth->h_proto); >>>> + >>>> + chain =3D virtio_net_rsc_lookup_chain(n, nc, proto); >>>> + if (!chain) { >>>> + return virtio_net_do_receive(nc, buf, size); >>>> + } else { >>>> + chain->stat.received++; >>>> + return virtio_net_rsc_receive4(chain, nc, buf, size); >>>> + } >>>> +} >>>> + >>>> +static ssize_t virtio_net_receive(NetClientState *nc, >>>> + const uint8_t *buf, size_t size) >>>> +{ >>>> + if (virtio_net_rsc_bypass) { >>>> + return virtio_net_do_receive(nc, buf, size); >>> You need a feature bit for this and compat it for older machine types= . >>> And also need some work on virtio spec I think. >> yes, not sure which way is good to support this, hmp/qmp/ethtool, this >> is gonna to support win guest, >> so need a well-compatible interface, any comments? > I think this should be implemented through feature bits/negotiation > instead of something like ethtool. Looks this feature should be turn on/off dynamically due to the spec, so=20 maybe this should be managed from the guest, is there any reference code=20 for this? > > [...]