From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:56786) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aoR8z-0002Ys-Uu for qemu-devel@nongnu.org; Fri, 08 Apr 2016 03:47:56 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1aoR8w-0002dA-LZ for qemu-devel@nongnu.org; Fri, 08 Apr 2016 03:47:53 -0400 Received: from mx1.redhat.com ([209.132.183.28]:35442) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1aoR8w-0002d6-Aq for qemu-devel@nongnu.org; Fri, 08 Apr 2016 03:47:50 -0400 Received: from int-mx13.intmail.prod.int.phx2.redhat.com (int-mx13.intmail.prod.int.phx2.redhat.com [10.5.11.26]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id E79F67F08B for ; Fri, 8 Apr 2016 07:47:49 +0000 (UTC) Received: from wei-thinkpad.nay.redhat.com (vpn1-4-44.pek2.redhat.com [10.72.4.44]) by int-mx13.intmail.prod.int.phx2.redhat.com (8.14.4/8.14.4) with ESMTP id u387llcB027835 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=NO) for ; Fri, 8 Apr 2016 03:47:48 -0400 References: <1459711556-10273-1-git-send-email-wexu@redhat.com> <1459711556-10273-3-git-send-email-wexu@redhat.com> <57032731.8060907@redhat.com> From: Wei Xu Message-ID: <5707622D.7010307@redhat.com> Date: Fri, 8 Apr 2016 15:47:57 +0800 MIME-Version: 1.0 In-Reply-To: <57032731.8060907@redhat.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: quoted-printable Subject: Re: [Qemu-devel] [ RFC Patch v4 2/3] virtio-net rsc: support coalescing ipv4 tcp traffic List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: qemu-devel@nongnu.org On 2016=E5=B9=B404=E6=9C=8805=E6=97=A5 10:47, Jason Wang wrote: > > On 04/04/2016 03:25 AM, wexu@redhat.com wrote: >> From: Wei Xu >> >> All the data packets in a tcp connection will be cached to a big buffe= r >> in every receive interval, and will be sent out via a timer, the >> 'virtio_net_rsc_timeout' controls the interval, the value will influen= t the >> performance and response of tcp connection extremely, 50000(50us) is a >> experience value to gain a performance improvement, since the whql tes= t >> sends packets every 100us, so '300000(300us)' can pass the test case, >> this is also the default value, it's gonna to be tunable. >> >> The timer will only be triggered if the packets pool is not empty, >> and it'll drain off all the cached packets >> >> 'NetRscChain' is used to save the segments of different protocols in a >> VirtIONet device. >> >> The main handler of TCP includes TCP window update, duplicated ACK che= ck >> and the real data coalescing if the new segment passed sanity check >> and is identified as an 'wanted' one. >> >> An 'wanted' segment means: >> 1. Segment is within current window and the sequence is the expected o= ne. >> 2. ACK of the segment is in the valid window. >> 3. If the ACK in the segment is a duplicated one, then it must less th= an 2, >> this is to notify upper layer TCP starting retransmission due to t= he spec. >> >> Sanity check includes: >> 1. Incorrect version in IP header >> 2. IP options & IP fragment >> 3. Not a TCP packets >> 4. Sanity size check to prevent buffer overflow attack. >> >> There maybe more cases should be considered such as ip identification = other >> flags, while it broke the test because windows set it to the same even= it's >> not a fragment. >> >> Normally it includes 2 typical ways to handle a TCP control flag, 'byp= ass' >> and 'finalize', 'bypass' means should be sent out directly, and 'final= ize' >> means the packets should also be bypassed, and this should be done >> after searching for the same connection packets in the pool and sendin= g >> all of them out, this is to avoid out of data. >> >> All the 'SYN' packets will be bypassed since this always begin a new' >> connection, other flags such 'FIN/RST' will trigger a finalization, be= cause >> this normally happens upon a connection is going to be closed, an 'URG= ' packet >> also finalize current coalescing unit. >> >> Statistics can be used to monitor the basic coalescing status, the 'ou= t of order' >> and 'out of window' means how many retransmitting packets, thus descri= be the >> performance intuitively. >> >> Signed-off-by: Wei Xu >> --- >> hw/net/virtio-net.c | 480 +++++++++++++++++++++++++++++++= +++++++++- >> include/hw/virtio/virtio-net.h | 1 + >> include/hw/virtio/virtio.h | 72 +++++++ >> 3 files changed, 552 insertions(+), 1 deletion(-) >> >> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c >> index bd91a4b..81e8e71 100644 >> --- a/hw/net/virtio-net.c >> +++ b/hw/net/virtio-net.c >> @@ -15,10 +15,12 @@ >> #include "qemu/iov.h" >> #include "hw/virtio/virtio.h" >> #include "net/net.h" >> +#include "net/eth.h" >> #include "net/checksum.h" >> #include "net/tap.h" >> #include "qemu/error-report.h" >> #include "qemu/timer.h" >> +#include "qemu/sockets.h" >> #include "hw/virtio/virtio-net.h" >> #include "net/vhost_net.h" >> #include "hw/virtio/virtio-bus.h" >> @@ -38,6 +40,24 @@ >> #define endof(container, field) \ >> (offsetof(container, field) + sizeof(((container *)0)->field)) >> =20 >> +#define VIRTIO_NET_IP4_ADDR_SIZE 8 /* ipv4 saddr + daddr */ >> +#define VIRTIO_NET_TCP_PORT_SIZE 4 /* sport + dport */ >> + >> +/* IPv4 max payload, 16 bits in the header */ >> +#define VIRTIO_NET_MAX_IP4_PAYLOAD (65535 - sizeof(struct ip_header)) >> +#define VIRTIO_NET_MAX_TCP_PAYLOAD 65535 >> + >> +/* header lenght value in ip header without option */ > typo here. Thanks. > >> +#define VIRTIO_NET_IP4_HEADER_LENGTH 5 >> + >> +/* Purge coalesced packets timer interval */ >> +#define VIRTIO_NET_RSC_INTERVAL 300000 >> + >> +/* This value affects the performance a lot, and should be tuned care= fully, >> + '300000'(300us) is the recommended value to pass the WHQL test, '5= 0000' can >> + gain 2x netperf throughput with tso/gso/gro 'off'. */ >> +static uint32_t virtio_net_rsc_timeout =3D VIRTIO_NET_RSC_INTERVAL; > Like we've discussed in previous versions, need we add another property > for this? Do you know how to make this a tunable parameter to guest? can this=20 parameter be set via control queue? > >> + >> typedef struct VirtIOFeature { >> uint32_t flags; >> size_t end; >> @@ -1688,11 +1708,467 @@ static int virtio_net_load_device(VirtIODevic= e *vdev, QEMUFile *f, >> return 0; >> } >> =20 >> +static void virtio_net_rsc_extract_unit4(NetRscChain *chain, >> + const uint8_t *buf, NetRscUn= it* unit) >> +{ >> + uint16_t hdr_len; >> + uint16_t ip_hdrlen; >> + struct ip_header *ip; >> + >> + hdr_len =3D ((VirtIONet *)(chain->n))->guest_hdr_len; > The case seems odd. Why not just use VirtIONet * for chain->n? OK, will take it. > >> + ip =3D (struct ip_header *)(buf + hdr_len + sizeof(struct eth_hea= der)); >> + unit->ip =3D (void *)ip; >> + ip_hdrlen =3D (ip->ip_ver_len & 0xF) << 2; >> + unit->ip_plen =3D &ip->ip_len; >> + unit->tcp =3D (struct tcp_header *)(((uint8_t *)unit->ip) + ip_hd= rlen); >> + unit->tcp_hdrlen =3D (htons(unit->tcp->th_offset_flags) & 0xF000)= >> 10; >> + unit->payload =3D htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_h= drlen; >> +} >> + >> +static void virtio_net_rsc_ipv4_checksum(struct ip_header *ip) >> +{ >> + uint32_t sum; >> + >> + ip->ip_sum =3D 0; >> + sum =3D net_checksum_add_cont(sizeof(struct ip_header), (uint8_t = *)ip, 0); >> + ip->ip_sum =3D cpu_to_be16(net_checksum_finish(sum)); > =20 > Should we set VIRTIO_NET_HDR_F_DATA_VALID for hdr.flags here? have a look at the code, seems should set it, packets will be dropped by=20 guest if don't calculate checksum here with this patch, looks there is an inexplicit logical check, will check it out and test=20 again, thanks. > >> +} >> + >> +static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg = *seg) >> +{ >> + int ret; >> + >> + virtio_net_rsc_ipv4_checksum(seg->unit.ip); >> + ret =3D virtio_net_do_receive(seg->nc, seg->buf, seg->size); >> + QTAILQ_REMOVE(&chain->buffers, seg, next); >> + g_free(seg->buf); >> + g_free(seg); >> + >> + return ret; >> +} >> + >> +static void virtio_net_rsc_purge(void *opq) >> +{ >> + NetRscChain *chain =3D (NetRscChain *)opq; >> + NetRscSeg *seg, *rn; >> + >> + QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) { >> + if (virtio_net_rsc_drain_seg(chain, seg) =3D=3D 0) { >> + chain->stat.purge_failed++; >> + continue; > Is it better to break here, consider we fail to do the receive? Actually this fails only when receive fails according to the test, but=20 should drain all the segments here, if break here, then will have to free all the buffers, is it possible a=20 successful receiving followed by a failed one? if that does happens, i preferred give it a shot other than free it=20 directly. > >> + } >> + } >> + >> + chain->stat.timer++; >> + if (!QTAILQ_EMPTY(&chain->buffers)) { >> + timer_mod(chain->drain_timer, >> + qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + virtio_net_rsc_= timeout); >> + } >> +} >> + >> +static void virtio_net_rsc_cleanup(VirtIONet *n) >> +{ >> + NetRscChain *chain, *rn_chain; >> + NetRscSeg *seg, *rn_seg; >> + >> + QTAILQ_FOREACH_SAFE(chain, &n->rsc_chains, next, rn_chain) { >> + QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn_seg) { >> + QTAILQ_REMOVE(&chain->buffers, seg, next); >> + g_free(seg->buf); >> + g_free(seg); >> + } >> + >> + timer_del(chain->drain_timer); >> + timer_free(chain->drain_timer); >> + QTAILQ_REMOVE(&n->rsc_chains, chain, next); >> + g_free(chain); >> + } >> +} >> + >> +static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientSta= te *nc, >> + const uint8_t *buf, size_t size) >> +{ >> + uint16_t hdr_len; >> + NetRscSeg *seg; >> + >> + hdr_len =3D ((VirtIONet *)(chain->n))->guest_hdr_len; >> + seg =3D g_malloc(sizeof(NetRscSeg)); >> + seg->buf =3D g_malloc(hdr_len + sizeof(struct eth_header)\ >> + + sizeof(struct ip6_header) + VIRTIO_NET_MAX_TCP_P= AYLOAD); > Why ip6_header here? ipv6 packets can carry 65535(2^16) bytes pure data payload, that means=20 the header is not included. but ipv4 is included, i choose the maximum here. > >> + memcpy(seg->buf, buf, size); >> + seg->size =3D size; >> + seg->dup_ack_count =3D 0; >> + seg->is_coalesced =3D 0; >> + seg->nc =3D nc; >> + >> + QTAILQ_INSERT_TAIL(&chain->buffers, seg, next); >> + chain->stat.cache++; >> + >> + virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit); >> +} >> + >> +static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain, NetRscSe= g *seg, >> + const uint8_t *buf, struct tcp_heade= r *n_tcp, >> + struct tcp_header *o_tcp) >> +{ >> + uint32_t nack, oack; >> + uint16_t nwin, owin; >> + >> + nack =3D htonl(n_tcp->th_ack); >> + nwin =3D htons(n_tcp->th_win); >> + oack =3D htonl(o_tcp->th_ack); >> + owin =3D htons(o_tcp->th_win); >> + >> + if ((nack - oack) >=3D VIRTIO_NET_MAX_TCP_PAYLOAD) { >> + chain->stat.ack_out_of_win++; >> + return RSC_FINAL; >> + } else if (nack =3D=3D oack) { >> + /* duplicated ack or window probe */ >> + if (nwin =3D=3D owin) { >> + /* duplicated ack, add dup ack count due to whql test up = to 1 */ >> + chain->stat.dup_ack++; >> + >> + if (seg->dup_ack_count =3D=3D 0) { >> + seg->dup_ack_count++; >> + chain->stat.dup_ack1++; >> + return RSC_COALESCE; >> + } else { >> + /* Spec says should send it directly */ > Better quote the what spec said here or just remove the above comment. OK. > >> + chain->stat.dup_ack2++; >> + return RSC_FINAL; >> + } >> + } else { >> + /* Coalesce window update */ >> + o_tcp->th_win =3D n_tcp->th_win; >> + chain->stat.win_update++; >> + return RSC_COALESCE; >> + } >> + } else { >> + /* pure ack, update ack */ >> + o_tcp->th_ack =3D n_tcp->th_ack; >> + chain->stat.pure_ack++; >> + return RSC_COALESCE; >> + } >> +} >> + >> +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain, NetRs= cSeg *seg, >> + const uint8_t *buf, NetRscUnit *n= _unit) >> +{ >> + void *data; >> + uint16_t o_ip_len; >> + uint32_t nseq, oseq; >> + NetRscUnit *o_unit; >> + >> + o_unit =3D &seg->unit; >> + o_ip_len =3D htons(*o_unit->ip_plen); >> + nseq =3D htonl(n_unit->tcp->th_seq); >> + oseq =3D htonl(o_unit->tcp->th_seq); >> + >> + if (n_unit->tcp_hdrlen > sizeof(struct tcp_header)) { >> + /* Log this only for debugging observation */ >> + chain->stat.tcp_option++; >> + } >> + >> + /* Ignore packet with more/larger tcp options */ >> + if (n_unit->tcp_hdrlen > o_unit->tcp_hdrlen) { >> + chain->stat.tcp_larger_option++; >> + return RSC_FINAL; >> + } >> + >> + /* Remember larger/smaller options may behave different,so keep a >> + separate statitics for each of them for debugging convenience.= */ >> + if (n_unit->tcp_hdrlen < o_unit->tcp_hdrlen) { >> + chain->stat.tcp_smaller_option++; >> + return RSC_FINAL; >> + } >> + >> + /* out of order or retransmitted. */ >> + if ((nseq - oseq) > VIRTIO_NET_MAX_TCP_PAYLOAD) { >> + chain->stat.data_out_of_win++; >> + return RSC_FINAL; >> + } >> + >> + data =3D ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen; >> + if (nseq =3D=3D oseq) { >> + if ((0 =3D=3D o_unit->payload) && n_unit->payload) { > o_unit->payload =3D=3D 0 please. ok, sorry. > >> + /* From no payload to payload, normal case, not a dup ack= or etc */ >> + chain->stat.data_after_pure_ack++; >> + goto coalesce; >> + } else { >> + return virtio_net_rsc_handle_ack(chain, seg, buf, >> + n_unit->tcp, o_unit->tcp= ); >> + } >> + } else if ((nseq - oseq) !=3D o_unit->payload) { >> + /* Not a consistent packet, out of order */ >> + chain->stat.data_out_of_order++; >> + return RSC_FINAL; >> + } else { >> +coalesce: >> + if ((o_ip_len + n_unit->payload) > chain->max_payload) { >> + chain->stat.over_size++; >> + return RSC_FINAL; >> + } >> + >> + /* Here comes the right data, the payload lengh in v4/v6 is d= ifferent, >> + so use the field value to update and record the new data l= en */ >> + o_unit->payload +=3D n_unit->payload; /* update new data len = */ >> + >> + /* update field in ip header */ >> + *o_unit->ip_plen =3D htons(o_ip_len + n_unit->payload); >> + >> + /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be c= oalesced >> + for windows guest, while this may change the behavior for = linux >> + guest. */ > In the future, need to measure TCP_RR performance in this case to see > the impact on Linux guests. OK. > >> + o_unit->tcp->th_offset_flags =3D n_unit->tcp->th_offset_flags= ; >> + >> + o_unit->tcp->th_ack =3D n_unit->tcp->th_ack; >> + o_unit->tcp->th_win =3D n_unit->tcp->th_win; >> + >> + memmove(seg->buf + seg->size, data, n_unit->payload); >> + seg->size +=3D n_unit->payload; >> + chain->stat.coalesced++; >> + return RSC_COALESCE; >> + } >> +} >> + >> +static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg= *seg, >> + const uint8_t *buf, size_t size, NetRscUnit *= unit) > indentation is wrong here. Should it be aligned from parameter begin? i was going to save an extra=20 line parameter. > >> +{ >> + struct ip_header *ip1, *ip2; >> + >> + ip1 =3D (struct ip_header *)(unit->ip); >> + ip2 =3D (struct ip_header *)(seg->unit.ip); >> + if ((ip1->ip_src ^ ip2->ip_src) || (ip1->ip_dst ^ ip2->ip_dst) >> + || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport) >> + || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) { >> + chain->stat.no_match++; >> + return RSC_NO_MATCH; >> + } >> + >> + return virtio_net_rsc_coalesce_data(chain, seg, buf, unit); >> +} >> + >> +/* Pakcets with 'SYN' should bypass, other flag should be sent after = drain >> + * to prevent out of order */ >> +static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain, >> + struct tcp_header *tcp) >> +{ >> + uint16_t tcp_flag; >> + >> + tcp_flag =3D htons(tcp->th_offset_flags) & 0x3F; >> + if (tcp_flag & TH_SYN) { >> + chain->stat.tcp_syn++; >> + return RSC_BYPASS; >> + } >> + >> + if (tcp_flag & (TH_FIN | TH_URG | TH_RST)) { >> + chain->stat.tcp_ctrl_drain++; >> + return RSC_FINAL; >> + } >> + >> + return RSC_WANT; >> +} >> + >> +static bool virtio_net_rsc_empty_cache(NetRscChain *chain, NetClientS= tate *nc, >> + const uint8_t *buf, size_t siz= e) >> +{ >> + if (QTAILQ_EMPTY(&chain->buffers)) { >> + chain->stat.empty_cache++; >> + virtio_net_rsc_cache_buf(chain, nc, buf, size); >> + timer_mod(chain->drain_timer, >> + qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + virtio_net_rsc_= timeout); >> + return 1; >> + } >> + >> + return 0; >> +} >> + >> +static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain, NetClien= tState *nc, >> + const uint8_t *buf, size_t s= ize, >> + NetRscUnit *unit) >> +{ >> + int ret; >> + NetRscSeg *seg, *nseg; >> + >> + QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) { >> + ret =3D virtio_net_rsc_coalesce4(chain, seg, buf, size, unit)= ; >> + >> + if (ret =3D=3D RSC_FINAL) { >> + if (virtio_net_rsc_drain_seg(chain, seg) =3D=3D 0) { >> + /* Send failed */ >> + chain->stat.final_failed++; >> + return 0; >> + } >> + >> + /* Send current packet */ >> + return virtio_net_do_receive(nc, buf, size); >> + } else if (ret =3D=3D RSC_NO_MATCH) { >> + continue; >> + } else { >> + /* Coalesced, mark coalesced flag to tell calc cksum for = ipv4 */ >> + seg->is_coalesced =3D 1; >> + return size; >> + } >> + } >> + >> + chain->stat.no_match_cache++; >> + virtio_net_rsc_cache_buf(chain, nc, buf, size); > Can we rework the code to make it a little bit more readable here? > > RSC_BYPASS, RSC_WANT was checked by the caller > (virtio_net_rsc_receive4()). and the rest were checked here. Can we do > all the check in one function with a switch...case? > > Maybe you can introduce sanity check and tcp check callback for each ch= ain. OK, will check it out. > > >> + return size; >> +} >> + >> +/* Drain a connection data, this is to avoid out of order segments */ >> +static size_t virtio_net_rsc_drain_flow(NetRscChain *chain, NetClient= State *nc, >> + const uint8_t *buf, size_t size, uint16_t ip_= start, >> + uint16_t ip_size, uint16_t tcp_port, uint16_t= port_size) > indentation is wrong. OK. >> +{ >> + NetRscSeg *seg, *nseg; >> + >> + QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) { >> + if (memcmp(buf + ip_start, seg->buf + ip_start, ip_size) >> + || memcmp(buf + tcp_port, seg->buf + tcp_port, port_size)= ) { >> + continue; > May consider to optimize (e.g not using linear search) in the future. Yes. > >> + } >> + if (virtio_net_rsc_drain_seg(chain, seg) =3D=3D 0) { >> + chain->stat.drain_failed++; >> + } >> + >> + break; >> + } >> + >> + return virtio_net_do_receive(nc, buf, size); >> +} >> + >> +static int32_t virtio_net_rsc_sanity_check4(NetRscChain *chain, >> + struct ip_header *ip, const uint8_t *buf, siz= e_t size)+ chain->stat.ip_option++; > indentation. OK. > >> +{ >> + uint16_t ip_len; >> + uint16_t hdr_len; >> + >> + hdr_len =3D ((VirtIONet *)(chain->n))->guest_hdr_len; >> + if (size < (hdr_len + sizeof(struct eth_header) + sizeof(struct i= p_header) >> + + sizeof(struct tcp_header))) { >> + return RSC_BYPASS; >> + } >> + >> + /* Not an ipv4 one */ >> + if (((ip->ip_ver_len & 0xF0) >> 4) !=3D IP_HEADER_VERSION_4) { >> + chain->stat.ip_option++; >> + return RSC_BYPASS; >> + } >> + >> + /* Don't handle packets with ip option */ >> + if (VIRTIO_NET_IP4_HEADER_LENGTH !=3D (ip->ip_ver_len & 0xF)) { >> + chain->stat.ip_option++; >> + return RSC_BYPASS; >> + } >> + >> + /* Don't handle packets with ip fragment */ >> + if (!(htons(ip->ip_off) & IP_DF)) { >> + chain->stat.ip_frag++; >> + return RSC_BYPASS; >> + } >> + >> + if (ip->ip_p !=3D IPPROTO_TCP) { >> + chain->stat.bypass_not_tcp++; >> + return RSC_BYPASS; >> + } >> + >> + ip_len =3D htons(ip->ip_len); >> + if (ip_len < (sizeof(struct ip_header) + sizeof(struct tcp_header= )) >> + || ip_len > (size - hdr_len - sizeof(struct eth_header))) { >> + chain->stat.ip_hacked++; >> + return RSC_BYPASS; >> + } >> + >> + return RSC_WANT; >> +} >> + >> +static size_t virtio_net_rsc_receive4(void *opq, NetClientState* nc, > Why use void * here? Seems there is a type definition error before, will check it out. > >> + const uint8_t *buf, size_t size= ) >> +{ >> + int32_t ret; >> + uint16_t hdr_len; >> + NetRscChain *chain; >> + NetRscUnit unit; >> + >> + chain =3D (NetRscChain *)opq; >> + hdr_len =3D ((VirtIONet *)(chain->n))->guest_hdr_len; >> + virtio_net_rsc_extract_unit4(chain, buf, &unit); >> + if (RSC_WANT !=3D virtio_net_rsc_sanity_check4(chain, unit.ip, bu= f, size)) { >> + return virtio_net_do_receive(nc, buf, size); >> + } >> + >> + ret =3D virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp); >> + if (ret =3D=3D RSC_BYPASS) { >> + return virtio_net_do_receive(nc, buf, size); >> + } else if (ret =3D=3D RSC_FINAL) { >> + return virtio_net_rsc_drain_flow(chain, nc, buf, size, >> + ((hdr_len + sizeof(struct eth_header)) + 12), >> + VIRTIO_NET_IP4_ADDR_SIZE, >> + hdr_len + sizeof(struct eth_header) + sizeof(struct i= p_header), >> + VIRTIO_NET_TCP_PORT_SIZE); >> + } >> + >> + if (virtio_net_rsc_empty_cache(chain, nc, buf, size)) { >> + return size;r >> + } >> + >> + return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit); >> +} >> + >> +static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n, >> + NetClientState *nc, >> + uint16_t proto) >> +{ >> + NetRscChain *chain; >> + >> + if (proto !=3D (uint16_t)ETH_P_IP) { >> + return NULL; >> + } >> + >> + QTAILQ_FOREACH(chain, &n->rsc_chains, next) { >> + if (chain->proto =3D=3D proto) { >> + return chain; >> + } >> + } >> + >> + chain =3D g_malloc(sizeof(*chain)); >> + chain->n =3D n; >> + chain->proto =3D proto; >> + chain->max_payload =3D VIRTIO_NET_MAX_IP4_PAYLOAD; >> + chain->drain_timer =3D timer_new_ns(QEMU_CLOCK_VIRTUAL, >> + virtio_net_rsc_purge, chain); >> + memset(&chain->stat, 0, sizeof(chain->stat)); >> + >> + QTAILQ_INIT(&chain->buffers); >> + QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next); >> + >> + return chain; >> +} >> =20 >> static ssize_t virtio_net_rsc_receive(NetClientState *nc, >> const uint8_t *buf, size_t siz= e) >> { >> - return virtio_net_do_receive(nc, buf, size); >> + uint16_t proto; >> + NetRscChain *chain; >> + struct eth_header *eth; >> + VirtIONet *n; >> + >> + n =3D qemu_get_nic_opaque(nc); >> + if (size < (n->guest_hdr_len + sizeof(struct eth_header))) { >> + return virtio_net_do_receive(nc, buf, size); >> + } >> + >> + eth =3D (struct eth_header *)(buf + n->guest_hdr_len); >> + proto =3D htons(eth->h_proto); >> + >> + chain =3D virtio_net_rsc_lookup_chain(n, nc, proto); >> + if (!chain) { >> + return virtio_net_do_receive(nc, buf, size); >> + } else { >> + chain->stat.received++; >> + return virtio_net_rsc_receive4(chain, nc, buf, size); >> + } >> } >> =20 >> static ssize_t virtio_net_receive(NetClientState *nc, >> @@ -1837,6 +2313,7 @@ static void virtio_net_device_realize(DeviceStat= e *dev, Error **errp) >> nc =3D qemu_get_queue(n->nic); >> nc->rxfilter_notify_enabled =3D 1; >> =20 >> + QTAILQ_INIT(&n->rsc_chains); >> n->qdev =3D dev; >> register_savevm(dev, "virtio-net", -1, VIRTIO_NET_VM_VERSION, >> virtio_net_save, virtio_net_load, n); >> @@ -1871,6 +2348,7 @@ static void virtio_net_device_unrealize(DeviceSt= ate *dev, Error **errp) >> g_free(n->vqs); >> qemu_del_nic(n->nic); >> virtio_cleanup(vdev); >> + virtio_net_rsc_cleanup(n); >> } >> =20 >> static void virtio_net_instance_init(Object *obj) >> diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio= -net.h >> index 0cabdb6..6939e92 100644 >> --- a/include/hw/virtio/virtio-net.h >> +++ b/include/hw/virtio/virtio-net.h >> @@ -59,6 +59,7 @@ typedef struct VirtIONet { >> VirtIONetQueue *vqs; >> VirtQueue *ctrl_vq; >> NICState *nic; >> + QTAILQ_HEAD(, NetRscChain) rsc_chains; >> uint32_t tx_timeout; >> int32_t tx_burst; >> uint32_t has_vnet_hdr; >> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h >> index 2b5b248..6be1463 100644 >> --- a/include/hw/virtio/virtio.h >> +++ b/include/hw/virtio/virtio.h >> @@ -128,6 +128,78 @@ typedef struct VirtioDeviceClass { >> int (*load)(VirtIODevice *vdev, QEMUFile *f, int version_id); >> } VirtioDeviceClass; >> =20 >> +/* Coalesced packets type & status */ >> +typedef enum { >> + RSC_COALESCE, /* Data been coalesced */ >> + RSC_FINAL, /* Will terminate current connection */ >> + RSC_NO_MATCH, /* No matched in the buffer pool */ >> + RSC_BYPASS, /* Packet to be bypass, not tcp, tcp ctrl= , etc */ >> + RSC_WANT /* Data want to be coalesced */ >> +} COALESCE_STATUS; >> + >> +typedef struct NetRscStat { >> + uint32_t received; >> + uint32_t coalesced; >> + uint32_t over_size; >> + uint32_t cache; >> + uint32_t empty_cache; >> + uint32_t no_match_cache; >> + uint32_t win_update; >> + uint32_t no_match; >> + uint32_t tcp_syn; >> + uint32_t tcp_ctrl_drain; >> + uint32_t dup_ack; >> + uint32_t dup_ack1; >> + uint32_t dup_ack2; >> + uint32_t pure_ack; >> + uint32_t ack_out_of_win; >> + uint32_t data_out_of_win; >> + uint32_t data_out_of_order; >> + uint32_t data_after_pure_ack; >> + uint32_t bypass_not_tcp; >> + uint32_t tcp_option; >> + uint32_t tcp_larger_option; >> + uint32_t tcp_smaller_option; >> + uint32_t ip_frag; >> + uint32_t ip_hacked; >> + uint32_t ip_option; >> + uint32_t purge_failed; >> + uint32_t drain_failed; >> + uint32_t final_failed; >> + int64_t timer; >> +} NetRscStat; >> + >> +/* Rsc unit general info used to checking if can coalescing */ >> +typedef struct NetRscUnit { >> + void *ip; /* ip header */ >> + uint16_t *ip_plen; /* data len pointer in ip header field */ >> + struct tcp_header *tcp; /* tcp header */ >> + uint16_t tcp_hdrlen; /* tcp header len */ >> + uint16_t payload; /* pure payload without virtio/eth/ip/tcp= */ >> +} NetRscUnit; >> + >> +/* Coalesced segmant */ >> +typedef struct NetRscSeg { >> + QTAILQ_ENTRY(NetRscSeg) next; >> + void *buf; >> + size_t size; >> + uint32_t dup_ack_count; >> + bool is_coalesced; /* need recal ipv4 header checksum, mark = here */ >> + NetRscUnit unit; >> + NetClientState *nc; >> +} NetRscSeg; >> + >> +/* Chain is divided by protocol(ipv4/v6) and NetClientInfo */ >> +typedef struct NetRscChain { >> + QTAILQ_ENTRY(NetRscChain) next; >> + void *n; /* VirtIONet */ >> + uint16_t proto; >> + uint16_t max_payload; >> + QEMUTimer *drain_timer; >> + QTAILQ_HEAD(, NetRscSeg) buffers; >> + NetRscStat stat; >> +} NetRscChain; >> + >> void virtio_instance_init_common(Object *proxy_obj, void *data, >> size_t vdev_size, const char *vdev_= name); >> =20 >