From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:52237) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1agb3I-0001D1-2B for qemu-devel@nongnu.org; Thu, 17 Mar 2016 12:45:39 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1agb3E-0000yE-FP for qemu-devel@nongnu.org; Thu, 17 Mar 2016 12:45:35 -0400 Received: from mx1.redhat.com ([209.132.183.28]:49698) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1agb3E-0000yA-5Q for qemu-devel@nongnu.org; Thu, 17 Mar 2016 12:45:32 -0400 Received: from int-mx14.intmail.prod.int.phx2.redhat.com (int-mx14.intmail.prod.int.phx2.redhat.com [10.5.11.27]) by mx1.redhat.com (Postfix) with ESMTPS id 676FD711EF for ; Thu, 17 Mar 2016 16:45:31 +0000 (UTC) Received: from wei-thinkpad.nay.redhat.com (vpn1-7-24.pek2.redhat.com [10.72.7.24]) by int-mx14.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id u2HGjS1J030440 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO) for ; Thu, 17 Mar 2016 12:45:30 -0400 References: <1458033424-25414-1-git-send-email-wexu@redhat.com> <1458033424-25414-2-git-send-email-wexu@redhat.com> <56EA6DE2.7070502@redhat.com> From: Wei Xu Message-ID: <56EADF28.3000101@redhat.com> Date: Fri, 18 Mar 2016 00:45:28 +0800 MIME-Version: 1.0 In-Reply-To: <56EA6DE2.7070502@redhat.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: quoted-printable Subject: Re: [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: qemu-devel@nongnu.org On 2016=E5=B9=B403=E6=9C=8817=E6=97=A5 16:42, Jason Wang wrote: > > On 03/15/2016 05:17 PM, wexu@redhat.com wrote: >> From: Wei Xu >> >> All the data packets in a tcp connection will be cached to a big buffe= r >> in every receive interval, and will be sent out via a timer, the >> 'virtio_net_rsc_timeout' controls the interval, the value will influen= t the >> performance and response of tcp connection extremely, 50000(50us) is a >> experience value to gain a performance improvement, since the whql tes= t >> sends packets every 100us, so '300000(300us)' can pass the test case, >> this is also the default value, it's gonna to be tunable. >> >> The timer will only be triggered if the packets pool is not empty, >> and it'll drain off all the cached packets >> >> 'NetRscChain' is used to save the segments of different protocols in a >> VirtIONet device. >> >> The main handler of TCP includes TCP window update, duplicated ACK che= ck >> and the real data coalescing if the new segment passed sanity check >> and is identified as an 'wanted' one. >> >> An 'wanted' segment means: >> 1. Segment is within current window and the sequence is the expected o= ne. >> 2. ACK of the segment is in the valid window. >> 3. If the ACK in the segment is a duplicated one, then it must less th= an 2, >> this is to notify upper layer TCP starting retransmission due to t= he spec. >> >> Sanity check includes: >> 1. Incorrect version in IP header >> 2. IP options & IP fragment >> 3. Not a TCP packets >> 4. Sanity size check to prevent buffer overflow attack. >> >> There maybe more cases should be considered such as ip identification = other >> flags, while it broke the test because windows set it to the same even= it's >> not a fragment. >> >> Normally it includes 2 typical ways to handle a TCP control flag, 'byp= ass' >> and 'finalize', 'bypass' means should be sent out directly, and 'final= ize' >> means the packets should also be bypassed, and this should be done >> after searching for the same connection packets in the pool and sendin= g >> all of them out, this is to avoid out of data. >> >> All the 'SYN' packets will be bypassed since this always begin a new' >> connection, other flags such 'FIN/RST' will trigger a finalization, be= cause >> this normally happens upon a connection is going to be closed, an 'URG= ' packet >> also finalize current coalescing unit while there maybe protocol diffe= rence to >> different OS. > But URG packet should be sent as quickly as possible regardless of > ordering, no? Yes, you right, URG will terminate the current 'SCU', i'll amend the comm= it log. > >> Statistics can be used to monitor the basic coalescing status, the 'ou= t of order' >> and 'out of window' means how many retransmitting packets, thus descri= be the >> performance intuitively. >> >> Signed-off-by: Wei Xu >> --- >> hw/net/virtio-net.c | 486 +++++++++++++++++++++++++++++++= +++++++++- >> include/hw/virtio/virtio-net.h | 1 + >> include/hw/virtio/virtio.h | 72 ++++++ >> 3 files changed, 558 insertions(+), 1 deletion(-) >> >> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c >> index 5798f87..c23b45f 100644 >> --- a/hw/net/virtio-net.c >> +++ b/hw/net/virtio-net.c >> @@ -15,10 +15,12 @@ >> #include "qemu/iov.h" >> #include "hw/virtio/virtio.h" >> #include "net/net.h" >> +#include "net/eth.h" >> #include "net/checksum.h" >> #include "net/tap.h" >> #include "qemu/error-report.h" >> #include "qemu/timer.h" >> +#include "qemu/sockets.h" >> #include "hw/virtio/virtio-net.h" >> #include "net/vhost_net.h" >> #include "hw/virtio/virtio-bus.h" >> @@ -38,6 +40,35 @@ >> #define endof(container, field) \ >> (offsetof(container, field) + sizeof(((container *)0)->field)) >> =20 >> +#define ETH_HDR_SZ (sizeof(struct eth_header)) >> +#define IP4_HDR_SZ (sizeof(struct ip_header)) >> +#define TCP_HDR_SZ (sizeof(struct tcp_header)) >> +#define ETH_IP4_HDR_SZ (ETH_HDR_SZ + IP4_HDR_SZ) >> + >> +#define IP4_ADDR_SIZE 8 /* ipv4 saddr + daddr */ >> +#define TCP_PORT_SIZE 4 /* sport + dport */ >> + >> +/* IPv4 max payload, 16 bits in the header */ >> +#define MAX_IP4_PAYLOAD (65535 - IP4_HDR_SZ) >> +#define MAX_TCP_PAYLOAD 65535 >> + >> +/* max payload with virtio header */ >> +#define MAX_VIRTIO_PAYLOAD (sizeof(struct virtio_net_hdr_mrg_rxbuf) = \ >> + + ETH_HDR_SZ + MAX_TCP_PAYLOAD) > Should we use guest_hdr_len instead of sizeof() here? Consider the > vnet_hdr will be extended in the future. Sure. > >> + >> +#define IP4_HEADER_LEN 5 /* header lenght value in ip header without = option */ > type, should be 'length' ok. > >> + >> +/* Purge coalesced packets timer interval */ >> +#define RSC_TIMER_INTERVAL 300000 >> + >> +/* Switcher to enable/disable rsc */ >> +static bool virtio_net_rsc_bypass =3D 1; >> + >> +/* This value affects the performance a lot, and should be tuned care= fully, >> + '300000'(300us) is the recommended value to pass the WHQL test, '5= 0000' can >> + gain 2x netperf throughput with tso/gso/gro 'off'. */ >> +static uint32_t virtio_net_rsc_timeout =3D RSC_TIMER_INTERVAL; >> + >> typedef struct VirtIOFeature { >> uint32_t flags; >> size_t end; >> @@ -1089,7 +1120,8 @@ static int receive_filter(VirtIONet *n, const ui= nt8_t *buf, int size) >> return 0; >> } >> =20 >> -static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *= buf, size_t size) >> +static ssize_t virtio_net_do_receive(NetClientState *nc, >> + const uint8_t *buf, size_t size) > indentation should also changed here. ok. > >> { >> VirtIONet *n =3D qemu_get_nic_opaque(nc); >> VirtIONetQueue *q =3D virtio_net_get_subqueue(nc); >> @@ -1685,6 +1717,456 @@ static int virtio_net_load_device(VirtIODevice= *vdev, QEMUFile *f, >> return 0; >> } >> =20 >> +static void virtio_net_rsc_extract_unit4(NetRscChain *chain, >> + const uint8_t *buf, NetRscUn= it* unit) >> +{ >> + uint16_t ip_hdrlen; >> + >> + unit->ip =3D (struct ip_header *)(buf + chain->hdr_size + ETH_HDR= _SZ); >> + ip_hdrlen =3D ((0xF & unit->ip->ip_ver_len) << 2); >> + unit->ip_plen =3D &unit->ip->ip_len; >> + unit->tcp =3D (struct tcp_header *)(((uint8_t *)unit->ip) + ip_hd= rlen); >> + unit->tcp_hdrlen =3D (htons(unit->tcp->th_offset_flags) & 0xF000)= >> 10; >> + unit->payload =3D htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_h= drlen; >> +} >> + >> +static void virtio_net_rsc_ipv4_checksum(struct ip_header *ip) >> +{ >> + uint32_t sum; >> + >> + ip->ip_sum =3D 0; >> + sum =3D net_checksum_add_cont(IP4_HDR_SZ, (uint8_t *)ip, 0); >> + ip->ip_sum =3D cpu_to_be16(net_checksum_finish(sum)); >> +} >> + >> +static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg = *seg) >> +{ >> + int ret; >> + >> + virtio_net_rsc_ipv4_checksum(seg->unit.ip); >> + ret =3D virtio_net_do_receive(seg->nc, seg->buf, seg->size); >> + QTAILQ_REMOVE(&chain->buffers, seg, next); >> + g_free(seg->buf); >> + g_free(seg); >> + >> + return ret; >> +} >> + >> +static void virtio_net_rsc_purge(void *opq) >> +{ >> + NetRscChain *chain =3D (NetRscChain *)opq; >> + NetRscSeg *seg, *rn; >> + >> + QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) { >> + if (virtio_net_rsc_drain_seg(chain, seg) =3D=3D 0) { >> + chain->stat.purge_failed++; >> + continue; >> + } >> + } >> + >> + chain->stat.timer++; >> + if (!QTAILQ_EMPTY(&chain->buffers)) { >> + timer_mod(chain->drain_timer, >> + qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + virtio_net_rsc_= timeout); >> + } >> +} >> + >> +static void virtio_net_rsc_cleanup(VirtIONet *n) >> +{ >> + NetRscChain *chain, *rn_chain; >> + NetRscSeg *seg, *rn_seg; >> + >> + QTAILQ_FOREACH_SAFE(chain, &n->rsc_chains, next, rn_chain) { >> + QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn_seg) { >> + QTAILQ_REMOVE(&chain->buffers, seg, next); >> + g_free(seg->buf); >> + g_free(seg); >> + } >> + >> + timer_del(chain->drain_timer); >> + timer_free(chain->drain_timer); >> + QTAILQ_REMOVE(&n->rsc_chains, chain, next); >> + g_free(chain); >> + } >> +} >> + >> +static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientSta= te *nc, >> + const uint8_t *buf, size_t size) >> +{ >> + NetRscSeg *seg; >> + >> + seg =3D g_malloc(sizeof(NetRscSeg)); >> + seg->buf =3D g_malloc(MAX_VIRTIO_PAYLOAD); >> + >> + memmove(seg->buf, buf, size); > Can seg->buf overlap with buf? If not, why use memmove() instead of > memcpy()? Yes, they will not overlap, memcpy() is good for it. > >> + seg->size =3D size; >> + seg->dup_ack_count =3D 0; >> + seg->is_coalesced =3D 0; >> + seg->nc =3D nc; >> + >> + QTAILQ_INSERT_TAIL(&chain->buffers, seg, next); >> + chain->stat.cache++; >> + >> + virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit); >> +} >> + >> +static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain, NetRscSe= g *seg, >> + const uint8_t *buf, struct tcp_heade= r *n_tcp, >> + struct tcp_header *o_tcp) >> +{ >> + uint32_t nack, oack; >> + uint16_t nwin, owin; >> + >> + nack =3D htonl(n_tcp->th_ack); >> + nwin =3D htons(n_tcp->th_win); >> + oack =3D htonl(o_tcp->th_ack); >> + owin =3D htons(o_tcp->th_win); >> + >> + if ((nack - oack) >=3D MAX_TCP_PAYLOAD) { >> + chain->stat.ack_out_of_win++; >> + return RSC_FINAL; >> + } else if (nack =3D=3D oack) { >> + /* duplicated ack or window probe */ >> + if (nwin =3D=3D owin) { >> + /* duplicated ack, add dup ack count due to whql test up = to 1 */ >> + chain->stat.dup_ack++; >> + >> + if (seg->dup_ack_count =3D=3D 0) { >> + seg->dup_ack_count++; >> + chain->stat.dup_ack1++; >> + return RSC_COALESCE; >> + } else { >> + /* Spec says should send it directly */ >> + chain->stat.dup_ack2++; >> + return RSC_FINAL; >> + } >> + } else { >> + /* Coalesce window update */ >> + o_tcp->th_win =3D n_tcp->th_win; >> + chain->stat.win_update++; >> + return RSC_COALESCE; >> + } >> + } else { >> + /* pure ack, update ack */ >> + o_tcp->th_ack =3D n_tcp->th_ack; >> + chain->stat.pure_ack++; >> + return RSC_COALESCE; > Looks like there're something I missed. The spec said: > > "In other words, any pure ACK that is not a duplicate ACK or a window > update triggers an exception and must not be coalesced. All such pure > ACKs must be indicated as individual segments." > > Does it mean we *should not* coalesce windows update and pure ack? > (Since it can wakeup transmission)? It's also a little bit inexplicit and flexible due to the spec, please se= e the flowchart I on the same page. Comments about the flowchart: ------------------------------------------------------------------------ The first of the following two flowcharts describes the rules for coalesc= ing segments and updating the TCP headers. This flowchart refers to mechanisms for distinguishing valid duplicate AC= Ks and window updates. The second flowchart describes these mechanisms. ------------------------------------------------------------------------ As show in the flowchart, only status 'C' will break current scu and get = finalized, both 'A' and 'B' can be coalesced afaik. > >> + } >> +} >> + >> +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain, NetRs= cSeg *seg, >> + const uint8_t *buf, NetRscUnit *n= _unit) >> +{ >> + void *data; >> + uint16_t o_ip_len; >> + uint32_t nseq, oseq; >> + NetRscUnit *o_unit; >> + >> + o_unit =3D &seg->unit; >> + o_ip_len =3D htons(*o_unit->ip_plen); >> + nseq =3D htonl(n_unit->tcp->th_seq); >> + oseq =3D htonl(o_unit->tcp->th_seq); >> + >> + if (n_unit->tcp_hdrlen > TCP_HDR_SZ) { >> + /* Log this only for debugging observation */ >> + chain->stat.tcp_option++; >> + } >> + >> + /* Ignore packet with more/larger tcp options */ >> + if (n_unit->tcp_hdrlen > o_unit->tcp_hdrlen) { > What if n_unit->tcp_hdrlen > o_uint->tcp_hdr_len ? do you mean '<'? that also means some option changed, maybe i should=20 include it. > >> + chain->stat.tcp_larger_option++; >> + return RSC_FINAL; >> + } >> + >> + /* out of order or retransmitted. */ >> + if ((nseq - oseq) > MAX_TCP_PAYLOAD) { >> + chain->stat.data_out_of_win++; >> + return RSC_FINAL; >> + } >> + >> + data =3D ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen; >> + if (nseq =3D=3D oseq) { >> + if ((0 =3D=3D o_unit->payload) && n_unit->payload) { >> + /* From no payload to payload, normal case, not a dup ack= or etc */ >> + chain->stat.data_after_pure_ack++; >> + goto coalesce; >> + } else { >> + return virtio_net_rsc_handle_ack(chain, seg, buf, >> + n_unit->tcp, o_unit->tcp= ); >> + } >> + } else if ((nseq - oseq) !=3D o_unit->payload) { >> + /* Not a consistent packet, out of order */ >> + chain->stat.data_out_of_order++; >> + return RSC_FINAL; >> + } else { >> +coalesce: >> + if ((o_ip_len + n_unit->payload) > chain->max_payload) { >> + chain->stat.over_size++; >> + return RSC_FINAL; >> + } >> + >> + /* Here comes the right data, the payload lengh in v4/v6 is d= ifferent, >> + so use the field value to update and record the new data l= en */ >> + o_unit->payload +=3D n_unit->payload; /* update new data len = */ >> + >> + /* update field in ip header */ >> + *o_unit->ip_plen =3D htons(o_ip_len + n_unit->payload); >> + >> + /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be c= oalesced >> + for windows guest, while this may change the behavior for = linux >> + guest. */ > This needs more thought, 'can' probably means don't. (Linux GRO won't > merge PUSH packet). Yes, since it's mainly for win guest, how about take this as this first=20 and do more test and see how to handle it? > >> + o_unit->tcp->th_offset_flags =3D n_unit->tcp->th_offset_flags= ; >> + >> + o_unit->tcp->th_ack =3D n_unit->tcp->th_ack; >> + o_unit->tcp->th_win =3D n_unit->tcp->th_win; >> + >> + memmove(seg->buf + seg->size, data, n_unit->payload); >> + seg->size +=3D n_unit->payload; >> + chain->stat.coalesced++; >> + return RSC_COALESCE; >> + } >> +} >> + >> +static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg= *seg, >> + const uint8_t *buf, size_t size, NetRscUnit *= unit) >> +{ >> + if ((unit->ip->ip_src ^ seg->unit.ip->ip_src) >> + || (unit->ip->ip_dst ^ seg->unit.ip->ip_dst) >> + || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport) >> + || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) { >> + chain->stat.no_match++; >> + return RSC_NO_MATCH; >> + } >> + >> + return virtio_net_rsc_coalesce_data(chain, seg, buf, unit); >> +} >> + >> +/* Pakcets with 'SYN' should bypass, other flag should be sent after = drain >> + * to prevent out of order */ >> +static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain, >> + struct tcp_header *tcp) >> +{ >> + uint16_t tcp_flag; >> + >> + tcp_flag =3D htons(tcp->th_offset_flags) & 0x3F; >> + if (tcp_flag & TH_SYN) { >> + chain->stat.tcp_syn++; >> + return RSC_BYPASS; >> + } >> + >> + if (tcp_flag & (TH_FIN | TH_URG | TH_RST)) { >> + chain->stat.tcp_ctrl_drain++; >> + return RSC_FINAL; >> + } >> + >> + return RSC_WANT; >> +} >> + >> +static bool virtio_net_rsc_empty_cache(NetRscChain *chain, NetClientS= tate *nc, >> + const uint8_t *buf, size_t size) > indentation looks wrong. ok. > >> +{ >> + if (QTAILQ_EMPTY(&chain->buffers)) { >> + chain->stat.empty_cache++; >> + virtio_net_rsc_cache_buf(chain, nc, buf, size); >> + timer_mod(chain->drain_timer, >> + qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + virtio_net_rsc_= timeout); >> + return 1; >> + } >> + >> + return 0; >> +} >> + >> +static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain, NetClien= tState *nc, >> + const uint8_t *buf, size_t size, NetRscUnit= *unit) >> +{ > and here. ok. > >> + int ret; >> + NetRscSeg *seg, *nseg; >> + >> + QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) { >> + ret =3D virtio_net_rsc_coalesce4(chain, seg, buf, size, unit)= ; >> + >> + if (ret =3D=3D RSC_FINAL) { >> + if (virtio_net_rsc_drain_seg(chain, seg) =3D=3D 0) { >> + /* Send failed */ >> + chain->stat.final_failed++; >> + return 0; >> + } >> + >> + /* Send current packet */ >> + return virtio_net_do_receive(nc, buf, size); >> + } else if (ret =3D=3D RSC_NO_MATCH) { >> + continue; >> + } else { >> + /* Coalesced, mark coalesced flag to tell calc cksum for = ipv4 */ >> + seg->is_coalesced =3D 1; >> + return size; >> + } >> + } >> + >> + chain->stat.no_match_cache++; >> + virtio_net_rsc_cache_buf(chain, nc, buf, size); >> + return size; >> +} >> + >> +/* Drain a connection data, this is to avoid out of order segments */ >> +static size_t virtio_net_rsc_drain_flow(NetRscChain *chain, NetClient= State *nc, >> + const uint8_t *buf, size_t size, uint16_t ip_= start, >> + uint16_t ip_size, uint16_t tcp_port, uint16_t= port_size) >> +{ >> + NetRscSeg *seg, *nseg; >> + >> + QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) { >> + if (memcmp(buf + ip_start, seg->buf + ip_start, ip_size) >> + || memcmp(buf + tcp_port, seg->buf + tcp_port, port_size)= ) { >> + continue; >> + } >> + if (virtio_net_rsc_drain_seg(chain, seg) =3D=3D 0) { >> + chain->stat.drain_failed++; >> + } >> + >> + break; >> + } >> + >> + return virtio_net_do_receive(nc, buf, size); >> +} >> + >> +static int32_t virtio_net_rsc_sanity_check4(NetRscChain *chain, >> + struct ip_header *ip, const uint8_t *buf, siz= e_t size) >> +{ >> + uint16_t ip_len; >> + >> + if (size < (chain->hdr_size + ETH_IP4_HDR_SZ + TCP_HDR_SZ)) { >> + return RSC_BYPASS; >> + } >> + >> + /* Not an ipv4 one */ >> + if (((0xF0 & ip->ip_ver_len) >> 4) !=3D IP_HEADER_VERSION_4) { > I've replied this several times, please use a consistent style. E.g > "ip->ip_ver_len & 0xF0". Sorry, i used to think you meant the 'IP_HEADER_VERSION_4'. > >> + chain->stat.ip_option++; >> + return RSC_BYPASS; >> + } >> + >> + /* Don't handle packets with ip option */ >> + if (IP4_HEADER_LEN !=3D (0xF & ip->ip_ver_len)) { >> + chain->stat.ip_option++; >> + return RSC_BYPASS; >> + } >> + >> + /* Don't handle packets with ip fragment */ >> + if (!(htons(ip->ip_off) & IP_DF)) { >> + chain->stat.ip_frag++; >> + return RSC_BYPASS; >> + } >> + >> + if (ip->ip_p !=3D IPPROTO_TCP) { >> + chain->stat.bypass_not_tcp++; >> + return RSC_BYPASS; >> + } >> + >> + /* Sanity check */ >> + ip_len =3D htons(ip->ip_len); >> + if (ip_len < (IP4_HDR_SZ + TCP_HDR_SZ) >> + || ip_len > (size - chain->hdr_size - ETH_HDR_SZ)) { >> + chain->stat.ip_hacked++; >> + return RSC_BYPASS; >> + } >> + >> + return RSC_WANT; >> +} >> + >> +static size_t virtio_net_rsc_receive4(void *opq, NetClientState* nc, >> + const uint8_t *buf, size_t size= ) >> +{ >> + int32_t ret; >> + NetRscChain *chain; >> + NetRscUnit unit; >> + >> + chain =3D (NetRscChain *)opq; >> + virtio_net_rsc_extract_unit4(chain, buf, &unit); >> + if (RSC_WANT !=3D virtio_net_rsc_sanity_check4(chain, unit.ip, bu= f, size)) { >> + return virtio_net_do_receive(nc, buf, size); >> + } >> + >> + ret =3D virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp); >> + if (ret =3D=3D RSC_BYPASS) { >> + return virtio_net_do_receive(nc, buf, size); >> + } else if (ret =3D=3D RSC_FINAL) { >> + return virtio_net_rsc_drain_flow(chain, nc, buf, size, >> + ((chain->hdr_size + ETH_HDR_SZ) + 12), IP4_AD= DR_SIZE, >> + (chain->hdr_size + ETH_IP4_HDR_SZ), TCP_PORT_= SIZE); >> + } >> + >> + if (virtio_net_rsc_empty_cache(chain, nc, buf, size)) { >> + return size; >> + } >> + >> + return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit); >> +} >> + >> +static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n, >> + NetClientState *nc, uint1= 6_t proto) >> +{ >> + NetRscChain *chain; >> + >> + /* Only handle IPv4/6 */ >> + if (proto !=3D (uint16_t)ETH_P_IP) { > The code is conflict with the comment above. ok. > >> + return NULL; >> + } >> + >> + QTAILQ_FOREACH(chain, &n->rsc_chains, next) { >> + if (chain->proto =3D=3D proto) { >> + return chain; >> + } >> + } >> + >> + chain =3D g_malloc(sizeof(*chain)); >> + chain->hdr_size =3D n->guest_hdr_len; > Why introduce a specified field for instead of just use n->guest_hdr_le= n? this is to reduce invoking times to find VirtIONet by 'nc', there are a=20 few places will use it. > >> + chain->proto =3D proto; >> + chain->max_payload =3D MAX_IP4_PAYLOAD; >> + chain->drain_timer =3D timer_new_ns(QEMU_CLOCK_VIRTUAL, >> + virtio_net_rsc_purge, chain); >> + memset(&chain->stat, 0, sizeof(chain->stat)); >> + >> + QTAILQ_INIT(&chain->buffers); >> + QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next); >> + >> + return chain; >> +} >> + >> +static ssize_t virtio_net_rsc_receive(NetClientState *nc, >> + const uint8_t *buf, size_t size= ) >> +{ >> + uint16_t proto; >> + NetRscChain *chain; >> + struct eth_header *eth; >> + VirtIONet *n; >> + >> + n =3D qemu_get_nic_opaque(nc); >> + if (size < (n->guest_hdr_len + ETH_HDR_SZ)) { >> + return virtio_net_do_receive(nc, buf, size); >> + } >> + >> + eth =3D (struct eth_header *)(buf + n->guest_hdr_len); >> + proto =3D htons(eth->h_proto); >> + >> + chain =3D virtio_net_rsc_lookup_chain(n, nc, proto); >> + if (!chain) { >> + return virtio_net_do_receive(nc, buf, size); >> + } else { >> + chain->stat.received++; >> + return virtio_net_rsc_receive4(chain, nc, buf, size); >> + } >> +} >> + >> +static ssize_t virtio_net_receive(NetClientState *nc, >> + const uint8_t *buf, size_t size) >> +{ >> + if (virtio_net_rsc_bypass) { >> + return virtio_net_do_receive(nc, buf, size); > You need a feature bit for this and compat it for older machine types. > And also need some work on virtio spec I think. yes, not sure which way is good to support this, hmp/qmp/ethtool, this=20 is gonna to support win guest, so need a well-compatible interface, any comments? > >> + } else { >> + return virtio_net_rsc_receive(nc, buf, size); >> + } >> +} >> + >> static NetClientInfo net_virtio_info =3D { >> .type =3D NET_CLIENT_OPTIONS_KIND_NIC, >> .size =3D sizeof(NICState), >> @@ -1814,6 +2296,7 @@ static void virtio_net_device_realize(DeviceStat= e *dev, Error **errp) >> nc =3D qemu_get_queue(n->nic); >> nc->rxfilter_notify_enabled =3D 1; >> =20 >> + QTAILQ_INIT(&n->rsc_chains); >> n->qdev =3D dev; >> register_savevm(dev, "virtio-net", -1, VIRTIO_NET_VM_VERSION, >> virtio_net_save, virtio_net_load, n); >> @@ -1848,6 +2331,7 @@ static void virtio_net_device_unrealize(DeviceSt= ate *dev, Error **errp) >> g_free(n->vqs); >> qemu_del_nic(n->nic); >> virtio_cleanup(vdev); >> + virtio_net_rsc_cleanup(n); >> } >> =20 >> static void virtio_net_instance_init(Object *obj) >> diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio= -net.h >> index 0cabdb6..6939e92 100644 >> --- a/include/hw/virtio/virtio-net.h >> +++ b/include/hw/virtio/virtio-net.h >> @@ -59,6 +59,7 @@ typedef struct VirtIONet { >> VirtIONetQueue *vqs; >> VirtQueue *ctrl_vq; >> NICState *nic; >> + QTAILQ_HEAD(, NetRscChain) rsc_chains; >> uint32_t tx_timeout; >> int32_t tx_burst; >> uint32_t has_vnet_hdr; >> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h >> index 2b5b248..3b1dfa8 100644 >> --- a/include/hw/virtio/virtio.h >> +++ b/include/hw/virtio/virtio.h >> @@ -128,6 +128,78 @@ typedef struct VirtioDeviceClass { >> int (*load)(VirtIODevice *vdev, QEMUFile *f, int version_id); >> } VirtioDeviceClass; >> =20 >> +/* Coalesced packets type & status */ >> +typedef enum { >> + RSC_COALESCE, /* Data been coalesced */ >> + RSC_FINAL, /* Will terminate current connection */ >> + RSC_NO_MATCH, /* No matched in the buffer pool */ >> + RSC_BYPASS, /* Packet to be bypass, not tcp, tcp ctrl= , etc */ >> + RSC_WANT /* Data want to be coalesced */ >> +} COALESCE_STATUS; >> + >> +typedef struct NetRscStat { >> + uint32_t received; >> + uint32_t coalesced; >> + uint32_t over_size; >> + uint32_t cache; >> + uint32_t empty_cache; >> + uint32_t no_match_cache; >> + uint32_t win_update; >> + uint32_t no_match; >> + uint32_t tcp_syn; >> + uint32_t tcp_ctrl_drain; >> + uint32_t dup_ack; >> + uint32_t dup_ack1; >> + uint32_t dup_ack2; >> + uint32_t pure_ack; >> + uint32_t ack_out_of_win; >> + uint32_t data_out_of_win; >> + uint32_t data_out_of_order; >> + uint32_t data_after_pure_ack; >> + uint32_t bypass_not_tcp; >> + uint32_t tcp_option; >> + uint32_t tcp_larger_option; >> + uint32_t ip_frag; >> + uint32_t ip_hacked; >> + uint32_t ip_option; >> + uint32_t purge_failed; >> + uint32_t drain_failed; >> + uint32_t final_failed; >> + int64_t timer; >> +} NetRscStat; >> + >> +/* Rsc unit general info used to checking if can coalescing */ >> +typedef struct NetRscUnit { >> + struct ip_header *ip; /* ip header */ >> + uint16_t *ip_plen; /* data len pointer in ip header field */ >> + struct tcp_header *tcp; /* tcp header */ >> + uint16_t tcp_hdrlen; /* tcp header len */ >> + uint16_t payload; /* pure payload without virtio/eth/ip/tcp = */ >> +} NetRscUnit; >> + >> +/* Coalesced segmant */ >> +typedef struct NetRscSeg { >> + QTAILQ_ENTRY(NetRscSeg) next; >> + void *buf; >> + size_t size; >> + uint32_t dup_ack_count; >> + bool is_coalesced; /* need recal ipv4 header checksum, mark = here */ >> + NetRscUnit unit; >> + NetClientState *nc; >> +} NetRscSeg; >> + >> + >> +/* Chain is divided by protocol(ipv4/v6) and NetClientInfo */ >> +typedef struct NetRscChain { >> + QTAILQ_ENTRY(NetRscChain) next; >> + uint16_t hdr_size; >> + uint16_t proto; >> + uint16_t max_payload; >> + QEMUTimer *drain_timer; >> + QTAILQ_HEAD(, NetRscSeg) buffers; >> + NetRscStat stat; >> +} NetRscChain; >> + >> void virtio_instance_init_common(Object *proxy_obj, void *data, >> size_t vdev_size, const char *vdev_= name); >> =20 > >