From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:36780) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1b7Egj-0005co-GC for qemu-devel@nongnu.org; Mon, 30 May 2016 00:20:28 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1b7Egf-0007sI-68 for qemu-devel@nongnu.org; Mon, 30 May 2016 00:20:24 -0400 Received: from mx1.redhat.com ([209.132.183.28]:33961) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1b7Ege-0007sC-Qx for qemu-devel@nongnu.org; Mon, 30 May 2016 00:20:21 -0400 Received: from int-mx10.intmail.prod.int.phx2.redhat.com (int-mx10.intmail.prod.int.phx2.redhat.com [10.5.11.23]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 1C40B64370 for ; Mon, 30 May 2016 04:20:20 +0000 (UTC) References: <1464453454-5703-1-git-send-email-wexu@redhat.com> <1464453454-5703-2-git-send-email-wexu@redhat.com> From: Jason Wang Message-ID: <574BBF7E.4050609@redhat.com> Date: Mon, 30 May 2016 12:20:14 +0800 MIME-Version: 1.0 In-Reply-To: <1464453454-5703-2-git-send-email-wexu@redhat.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: quoted-printable Subject: Re: [Qemu-devel] [ RFC Patch v6 1/3] virtio-net rsc: support coalescing ipv4 tcp traffic List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: wexu@redhat.com, qemu-devel@nongnu.org Cc: marcel@redhat.com, mst@redhat.com, victork@redhat.com, dfleytma@redhat.com, yvugenfi@redhat.com On 2016=E5=B9=B405=E6=9C=8829=E6=97=A5 00:37, wexu@redhat.com wrote: > From: Wei Xu > > All the data packets in a tcp connection will be cached to a big buffer > in every receive interval, and will be sent out via a timer, the > 'virtio_net_rsc_timeout' controls the interval, the value will influent= the > performance and response of tcp connection extremely, 50000(50us) is a > experience value to gain a performance improvement, since the whql test > sends packets every 100us, so '300000(300us)' can pass the test case, > this is also the default value, it's tunable via the command line > parameter 'rsc_interval' with 'virtio-net-pci' device, for example, bel= ow > parameter is to launch a guest with interval set as '500000'. Does the value make sense if it was smaller than 1us? If not, why not=20 just make the unit to be 1us? > > 'virtio-net-pci,netdev=3Dhostnet1,bus=3Dpci.0,id=3Dnet1,mac=3D00,rsc_in= terval=3D500000' > will > > The timer will only be triggered if the packets pool is not empty, > and it'll drain off all the cached packets. > > 'NetRscChain' is used to save the segments of different protocols in a > VirtIONet device. > > The main handler of TCP includes TCP window update, duplicated ACK chec= k > and the real data coalescing if the new segment passed sanity check > and is identified as an 'wanted' one. > > An 'wanted' segment means: > 1. Segment is within current window and the sequence is the expected on= e. > 2. 'ACK' of the segment is in the valid window. > > Sanity check includes: > 1. Incorrect version in IP header > 2. IP options & IP fragment > 3. Not a TCP packets > 4. Sanity size check to prevent buffer overflow attack. > > There maybe more cases should be considered such as ip identification o= ther > flags, while it broke the test because windows set it to the same even = it's > not a fragment. > > Normally it includes 2 typical ways to handle a TCP control flag, 'bypa= ss' > and 'finalize', 'bypass' means should be sent out directly, and 'finali= ze' > means the packets should also be bypassed, and this should be done > after searching for the same connection packets in the pool and sending > all of them out, this is to avoid out of data. > > All the 'SYN' packets will be bypassed since this always begin a new' > connection, other flags such 'FIN/RST' will trigger a finalization, bec= ause > this normally happens upon a connection is going to be closed, an 'URG'= packet > also finalize current coalescing unit. > > Statistics can be used to monitor the basic coalescing status, the 'out= of order' > and 'out of window' means how many retransmitting packets, thus describ= e the > performance intuitively. We usually don't add device specific monitor command. Maybe a new=20 control vq command ethtool -S in guest in the future. I was thinking of=20 removing those counters since it was never used in this series. > > Signed-off-by: Wei Xu > --- > hw/net/virtio-net.c | 498 +++++++++++++++++++= ++++++++- > include/hw/virtio/virtio-net.h | 2 + > include/hw/virtio/virtio.h | 75 +++++ > include/standard-headers/linux/virtio_net.h | 1 + For RFC, it's ok. But for formal patch, this is not the correct way to=20 modify Linux headers. There's a script in=20 scripts/update-linux-headers.sh which was used to sync it from Linux=20 source. This means, it must be merged in Linux or at least in=20 maintainer's tree first. > 4 files changed, 575 insertions(+), 1 deletion(-) > > diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c > index 5798f87..b3bb63b 100644 > --- a/hw/net/virtio-net.c > +++ b/hw/net/virtio-net.c > @@ -15,10 +15,12 @@ > #include "qemu/iov.h" > #include "hw/virtio/virtio.h" > #include "net/net.h" > +#include "net/eth.h" > #include "net/checksum.h" > #include "net/tap.h" > #include "qemu/error-report.h" > #include "qemu/timer.h" > +#include "qemu/sockets.h" > #include "hw/virtio/virtio-net.h" > #include "net/vhost_net.h" > #include "hw/virtio/virtio-bus.h" > @@ -38,6 +40,25 @@ > #define endof(container, field) \ > (offsetof(container, field) + sizeof(((container *)0)->field)) > =20 > +#define VIRTIO_NET_IP4_ADDR_SIZE 8 /* ipv4 saddr + daddr */ > +#define VIRTIO_NET_TCP_PORT_SIZE 4 /* sport + dport */ > + > +#define VIRTIO_NET_TCP_FLAG 0x3F > +#define VIRTIO_NET_TCP_HDR_LENGTH 0xF000 > + > +/* IPv4 max payload, 16 bits in the header */ > +#define VIRTIO_NET_MAX_IP4_PAYLOAD (65535 - sizeof(struct ip_header)) > +#define VIRTIO_NET_MAX_TCP_PAYLOAD 65535 Is this still true if window scaling is enabled? And need to make sure=20 your ack/seq processing can work for window scaling. > + > +/* header length value in ip header without option */ > +#define VIRTIO_NET_IP4_HEADER_LENGTH 5 > + > +/* Purge coalesced packets timer interval, This value affects the perf= ormance > + a lot, and should be tuned carefully, '300000'(300us) is the recomm= ended > + value to pass the WHQL test, '50000' can gain 2x netperf throughput= with > + tso/gso/gro 'off'. */ > +#define VIRTIO_NET_RSC_INTERVAL 300000 Do we need a new property for this value? > + > typedef struct VirtIOFeature { > uint32_t flags; > size_t end; > @@ -1089,7 +1110,8 @@ static int receive_filter(VirtIONet *n, const uin= t8_t *buf, int size) > return 0; > } > =20 > -static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t *b= uf, size_t size) > +static ssize_t virtio_net_do_receive(NetClientState *nc, > + const uint8_t *buf, size_t size) > { > VirtIONet *n =3D qemu_get_nic_opaque(nc); > VirtIONetQueue *q =3D virtio_net_get_subqueue(nc); > @@ -1685,6 +1707,474 @@ static int virtio_net_load_device(VirtIODevice = *vdev, QEMUFile *f, > return 0; > } > =20 > +static void virtio_net_rsc_extract_unit4(NetRscChain *chain, > + const uint8_t *buf, NetRscUni= t* unit) > +{ > + uint16_t hdr_len; > + uint16_t ip_hdrlen; > + struct ip_header *ip; > + > + hdr_len =3D chain->n->guest_hdr_len; > + ip =3D (struct ip_header *)(buf + hdr_len + sizeof(struct eth_head= er)); > + unit->ip =3D (void *)ip; > + ip_hdrlen =3D (ip->ip_ver_len & 0xF) << 2; > + unit->ip_plen =3D &ip->ip_len; > + unit->tcp =3D (struct tcp_header *)(((uint8_t *)unit->ip) + ip_hdr= len); > + unit->tcp_hdrlen =3D (htons(unit->tcp->th_offset_flags) & 0xF000) = >> 10; > + unit->payload =3D htons(*unit->ip_plen) - ip_hdrlen - unit->tcp_hd= rlen; > +} > + > +static void virtio_net_rsc_ipv4_checksum(struct virtio_net_hdr *vhdr, > + struct ip_header *ip) > +{ > + uint32_t sum; > + > + ip->ip_sum =3D 0; > + sum =3D net_checksum_add_cont(sizeof(struct ip_header), (uint8_t *= )ip, 0); > + ip->ip_sum =3D cpu_to_be16(net_checksum_finish(sum)); > + vhdr->flags &=3D ~VIRTIO_NET_HDR_F_NEEDS_CSUM; > + vhdr->flags |=3D VIRTIO_NET_HDR_F_DATA_VALID; > +} > + > +static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg *= seg) > +{ > + int ret; > + struct virtio_net_hdr *h; > + > + h =3D (struct virtio_net_hdr *)seg->buf; > + virtio_net_rsc_ipv4_checksum(h, seg->unit.ip); > + ret =3D virtio_net_do_receive(seg->nc, seg->buf, seg->size); > + QTAILQ_REMOVE(&chain->buffers, seg, next); > + g_free(seg->buf); > + g_free(seg); > + > + return ret; > +} > + > +static void virtio_net_rsc_purge(void *opq) > +{ > + NetRscSeg *seg, *rn; > + NetRscChain *chain =3D (NetRscChain *)opq; > + > + QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) { > + if (virtio_net_rsc_drain_seg(chain, seg) =3D=3D 0) { > + chain->stat.purge_failed++; > + continue; Why need this? > + } > + } > + > + chain->stat.timer++; > + if (!QTAILQ_EMPTY(&chain->buffers)) { > + timer_mod(chain->drain_timer, > + qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + chain->n->rsc_ti= meout); Consider the timer is invisible to guest, why use QEMU_CLOCK_VIRTUAL? > + } > +} > + > +static void virtio_net_rsc_cleanup(VirtIONet *n) > +{ > + NetRscChain *chain, *rn_chain; > + NetRscSeg *seg, *rn_seg; > + > + QTAILQ_FOREACH_SAFE(chain, &n->rsc_chains, next, rn_chain) { > + QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn_seg) { > + QTAILQ_REMOVE(&chain->buffers, seg, next); > + g_free(seg->buf); > + g_free(seg); > + } > + > + timer_del(chain->drain_timer); > + timer_free(chain->drain_timer); > + QTAILQ_REMOVE(&n->rsc_chains, chain, next); > + g_free(chain); > + } > +} > + > +static void virtio_net_rsc_cache_buf(NetRscChain *chain, NetClientStat= e *nc, > + const uint8_t *buf, size_t size) > +{ > + uint16_t hdr_len; > + NetRscSeg *seg; > + > + hdr_len =3D chain->n->guest_hdr_len; > + seg =3D g_malloc(sizeof(NetRscSeg)); > + seg->buf =3D g_malloc(hdr_len + sizeof(struct eth_header)\ > + + VIRTIO_NET_MAX_TCP_PAYLOAD); > + memcpy(seg->buf, buf, size); > + seg->size =3D size; > + seg->packets =3D 1; > + seg->dup_ack =3D 0; > + seg->is_coalesced =3D 0; > + seg->nc =3D nc; > + > + QTAILQ_INSERT_TAIL(&chain->buffers, seg, next); > + chain->stat.cache++; > + > + virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit); > +} > + > +static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain, > + NetRscSeg *seg, const uint8_t= *buf, > + struct tcp_header *n_tcp, > + struct tcp_header *o_tcp) > +{ > + uint32_t nack, oack; > + uint16_t nwin, owin; > + > + nack =3D htonl(n_tcp->th_ack); > + nwin =3D htons(n_tcp->th_win); > + oack =3D htonl(o_tcp->th_ack); > + owin =3D htons(o_tcp->th_win); > + > + if ((nack - oack) >=3D VIRTIO_NET_MAX_TCP_PAYLOAD) { > + chain->stat.ack_out_of_win++; > + return RSC_FINAL; > + } else if (nack =3D=3D oack) { > + /* duplicated ack or window probe */ > + if (nwin =3D=3D owin) { > + /* duplicated ack, add dup ack count due to whql test up t= o 1 */ > + chain->stat.dup_ack++; > + return RSC_FINAL; > + } else { > + /* Coalesce window update */ > + o_tcp->th_win =3D n_tcp->th_win; > + chain->stat.win_update++; > + return RSC_COALESCE; > + } > + } else { > + /* pure ack, go to 'C', finalize*/ > + chain->stat.pure_ack++; > + return RSC_FINAL; > + } > +} > + > +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain, > + NetRscSeg *seg, const uint= 8_t *buf, > + NetRscUnit *n_unit) > +{ > + void *data; > + uint16_t o_ip_len; > + uint32_t nseq, oseq; > + NetRscUnit *o_unit; > + > + o_unit =3D &seg->unit; > + o_ip_len =3D htons(*o_unit->ip_plen); > + nseq =3D htonl(n_unit->tcp->th_seq); > + oseq =3D htonl(o_unit->tcp->th_seq); > + > + /* out of order or retransmitted. */ > + if ((nseq - oseq) > VIRTIO_NET_MAX_TCP_PAYLOAD) { > + chain->stat.data_out_of_win++; > + return RSC_FINAL; > + } > + > + data =3D ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen; > + if (nseq =3D=3D oseq) { > + if ((o_unit->payload =3D=3D 0) && n_unit->payload) { > + /* From no payload to payload, normal case, not a dup ack = or etc */ > + chain->stat.data_after_pure_ack++; > + goto coalesce; > + } else { > + return virtio_net_rsc_handle_ack(chain, seg, buf, > + n_unit->tcp, o_unit->tcp)= ; > + } > + } else if ((nseq - oseq) !=3D o_unit->payload) { > + /* Not a consistent packet, out of order */ > + chain->stat.data_out_of_order++; > + return RSC_FINAL; > + } else { > +coalesce: > + if ((o_ip_len + n_unit->payload) > chain->max_payload) { > + chain->stat.over_size++; > + return RSC_FINAL; > + } > + > + /* Here comes the right data, the payload length in v4/v6 is d= ifferent, > + so use the field value to update and record the new data le= n */ > + o_unit->payload +=3D n_unit->payload; /* update new data len *= / > + > + /* update field in ip header */ > + *o_unit->ip_plen =3D htons(o_ip_len + n_unit->payload); > + > + /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be co= alesced > + for windows guest, while this may change the behavior for l= inux > + guest. */ > + o_unit->tcp->th_offset_flags =3D n_unit->tcp->th_offset_flags; > + > + o_unit->tcp->th_ack =3D n_unit->tcp->th_ack; > + o_unit->tcp->th_win =3D n_unit->tcp->th_win; > + > + memmove(seg->buf + seg->size, data, n_unit->payload); > + seg->size +=3D n_unit->payload; > + seg->packets++; > + chain->stat.coalesced++; > + return RSC_COALESCE; > + } > +} > + > +static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg = *seg, > + const uint8_t *buf, size_t siz= e, > + NetRscUnit *unit) > +{ > + struct ip_header *ip1, *ip2; > + > + ip1 =3D (struct ip_header *)(unit->ip); > + ip2 =3D (struct ip_header *)(seg->unit.ip); > + if ((ip1->ip_src ^ ip2->ip_src) || (ip1->ip_dst ^ ip2->ip_dst) > + || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport) > + || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) { > + chain->stat.no_match++; > + return RSC_NO_MATCH; > + } > + > + return virtio_net_rsc_coalesce_data(chain, seg, buf, unit); > +} > + > +/* Pakcets with 'SYN' should bypass, other flag should be sent after d= rain > + * to prevent out of order */ > +static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain, > + struct tcp_header *tcp) > +{ > + uint16_t tcp_hdr; > + uint16_t tcp_flag; > + > + tcp_flag =3D htons(tcp->th_offset_flags); > + tcp_hdr =3D (tcp_flag & VIRTIO_NET_TCP_HDR_LENGTH) >> 10; > + tcp_flag &=3D VIRTIO_NET_TCP_FLAG; > + tcp_flag =3D htons(tcp->th_offset_flags) & 0x3F; > + if (tcp_flag & TH_SYN) { > + chain->stat.tcp_syn++; > + return RSC_BYPASS; > + } > + > + if (tcp_flag & (TH_FIN | TH_URG | TH_RST)) { > + chain->stat.tcp_ctrl_drain++; > + return RSC_FINAL; > + } > + > + if (tcp_hdr > sizeof(struct tcp_header)) { > + chain->stat.tcp_all_opt++; > + return RSC_FINAL; > + } > + > + return RSC_WANT; > +} > + > +static bool virtio_net_rsc_empty_cache(NetRscChain *chain, NetClientSt= ate *nc, > + const uint8_t *buf, size_t size= ) > +{ The name is really confusing, it in fact cache the buf if the cache is=20 empty. So I guess what you mean is "virtio_net_rsc_cache_if_empty()?".=20 But the question is why need a specific function like this, can we=20 simply add the logic to virtio_net_rsc_do_coalesce()? (And looks like=20 most of the logic has been there, except for timer and counter. > + if (QTAILQ_EMPTY(&chain->buffers)) { > + chain->stat.empty_cache++; > + virtio_net_rsc_cache_buf(chain, nc, buf, size); > + timer_mod(chain->drain_timer, > + qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + chain->n->rsc_ti= meout); > + return 1; > + } > + > + return 0; > +} > + > +static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain, NetClient= State *nc, > + const uint8_t *buf, size_t si= ze, > + NetRscUnit *unit) > +{ > + int ret; > + NetRscSeg *seg, *nseg; > + > + QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) { > + ret =3D virtio_net_rsc_coalesce4(chain, seg, buf, size, unit); > + > + if (ret =3D=3D RSC_FINAL) { > + if (virtio_net_rsc_drain_seg(chain, seg) =3D=3D 0) { > + /* Send failed */ > + chain->stat.final_failed++; > + return 0; > + } > + > + /* Send current packet */ > + return virtio_net_do_receive(nc, buf, size); > + } else if (ret =3D=3D RSC_NO_MATCH) { > + continue; > + } else { > + /* Coalesced, mark coalesced flag to tell calc cksum for i= pv4 */ > + seg->is_coalesced =3D 1; > + return size; > + } > + } > + > + chain->stat.no_match_cache++; > + virtio_net_rsc_cache_buf(chain, nc, buf, size); > + return size; > +} > + > +/* Drain a connection data, this is to avoid out of order segments */ > +static size_t virtio_net_rsc_drain_flow(NetRscChain *chain, NetClientS= tate *nc, > + const uint8_t *buf, size_t siz= e, > + uint16_t ip_start, uint16_t ip= _size, > + uint16_t tcp_port, uint16_t po= rt_size) > +{ > + NetRscSeg *seg, *nseg; > + > + QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) { > + if (memcmp(buf + ip_start, seg->buf + ip_start, ip_size) > + || memcmp(buf + tcp_port, seg->buf + tcp_port, port_size))= { The header comparison is rather similar to what you've done in=20 virtio_net_rsc_coalesce4/6(). Any chance to reduce the duplication? > + continue; > + } > + if (virtio_net_rsc_drain_seg(chain, seg) =3D=3D 0) { > + chain->stat.drain_failed++; > + } > + > + break; > + } > + > + return virtio_net_do_receive(nc, buf, size); > +} > + > +static int32_t virtio_net_rsc_sanity_check4(NetRscChain *chain, > + struct ip_header *ip, > + const uint8_t *buf, size_t= size) > +{ > + uint16_t ip_len; > + uint16_t hdr_len; > + > + hdr_len =3D chain->n->guest_hdr_len; > + if (size < (hdr_len + sizeof(struct eth_header) + sizeof(struct ip= _header) > + + sizeof(struct tcp_header))) { It's interesting that you don't have a stat counter for this case. > + return RSC_BYPASS; > + } > + > + /* Not an ipv4 one */ > + if (((ip->ip_ver_len & 0xF0) >> 4) !=3D IP_HEADER_VERSION_4) { > + chain->stat.ip_option++; > + return RSC_BYPASS; > + } > + > + /* Don't handle packets with ip option */ > + if (VIRTIO_NET_IP4_HEADER_LENGTH !=3D (ip->ip_ver_len & 0xF)) { (ip-> ...) !=3D VIRTIO_NET_IP4_HEADER_LENGTH > + chain->stat.ip_option++; > + return RSC_BYPASS; > + } > + > + if (ip->ip_p !=3D IPPROTO_TCP) { > + chain->stat.bypass_not_tcp++; > + return RSC_BYPASS; > + } > + > + /* Don't handle packets with ip fragment */ > + if (!(htons(ip->ip_off) & IP_DF)) { > + chain->stat.ip_frag++; > + return RSC_BYPASS; > + } > + > + ip_len =3D htons(ip->ip_len); > + if (ip_len < (sizeof(struct ip_header) + sizeof(struct tcp_header)= ) > + || ip_len > (size - hdr_len - sizeof(struct eth_header))) { > + chain->stat.ip_hacked++; > + return RSC_BYPASS; > + } > + > + return RSC_WANT; > +} > + > +static size_t virtio_net_rsc_receive4(NetRscChain *chain, NetClientSta= te* nc, > + const uint8_t *buf, size_t size) > +{ > + int32_t ret; > + uint16_t hdr_len; > + NetRscUnit unit; > + > + hdr_len =3D ((VirtIONet *)(chain->n))->guest_hdr_len; > + virtio_net_rsc_extract_unit4(chain, buf, &unit); Is this safe for you do header analysis before doing sanity check? > + if (RSC_WANT !=3D virtio_net_rsc_sanity_check4(chain, unit.ip, buf= , size)) { virtio_net_rsc_sanity_check4() =3D RSC_WANT please. I've pointed out this= =20 kind of style issues several times. > + return virtio_net_do_receive(nc, buf, size); > + } > + > + ret =3D virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp); > + if (ret =3D=3D RSC_BYPASS) { > + return virtio_net_do_receive(nc, buf, size); > + } else if (ret =3D=3D RSC_FINAL) { > + return virtio_net_rsc_drain_flow(chain, nc, buf, size, > + ((hdr_len + sizeof(struct eth_header)) + 12), > + VIRTIO_NET_IP4_ADDR_SIZE, > + hdr_len + sizeof(struct eth_header) + sizeof(struct ip= _header), > + VIRTIO_NET_TCP_PORT_SIZE); > + } > + > + if (virtio_net_rsc_empty_cache(chain, nc, buf, size)) { > + return size; > + } > + > + return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit); > +} > + > +static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n, > + NetClientState *nc, > + uint16_t proto) > +{ > + NetRscChain *chain; > + > + if (proto !=3D (uint16_t)ETH_P_IP) { > + return NULL; > + } > + > + QTAILQ_FOREACH(chain, &n->rsc_chains, next) { > + if (chain->proto =3D=3D proto) { > + return chain; > + } > + } > + > + chain =3D g_malloc(sizeof(*chain)); > + chain->n =3D n; > + chain->proto =3D proto; > + chain->max_payload =3D VIRTIO_NET_MAX_IP4_PAYLOAD; > + chain->gso_type =3D VIRTIO_NET_HDR_GSO_TCPV4; > + chain->drain_timer =3D timer_new_ns(QEMU_CLOCK_VIRTUAL, > + virtio_net_rsc_purge, chain); > + memset(&chain->stat, 0, sizeof(chain->stat)); > + > + QTAILQ_INIT(&chain->buffers); > + QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next); > + > + return chain; > +} > + > +static ssize_t virtio_net_rsc_receive(NetClientState *nc, > + const uint8_t *buf, size_t size) > +{ > + uint16_t proto; > + NetRscChain *chain; > + struct eth_header *eth; > + VirtIONet *n; > + > + n =3D qemu_get_nic_opaque(nc); > + if (size < (n->guest_hdr_len + sizeof(struct eth_header))) { Here should use host_hdr_len, no? > + return virtio_net_do_receive(nc, buf, size); > + } > + > + eth =3D (struct eth_header *)(buf + n->guest_hdr_len); > + proto =3D htons(eth->h_proto); > + > + chain =3D virtio_net_rsc_lookup_chain(n, nc, proto); > + if (!chain) { > + return virtio_net_do_receive(nc, buf, size); > + } else { > + chain->stat.received++; > + return virtio_net_rsc_receive4(chain, nc, buf, size); > + } > +} > + > +static ssize_t virtio_net_receive(NetClientState *nc, > + const uint8_t *buf, size_t size) > +{ > + VirtIONet *n; > + > + n =3D qemu_get_nic_opaque(nc); > + if (n->host_features & (1ULL << VIRTIO_NET_F_GUEST_RSC)) { > + return virtio_net_rsc_receive(nc, buf, size); > + } else { > + return virtio_net_do_receive(nc, buf, size); > + } > +} > + > static NetClientInfo net_virtio_info =3D { > .type =3D NET_CLIENT_OPTIONS_KIND_NIC, > .size =3D sizeof(NICState), > @@ -1814,6 +2304,7 @@ static void virtio_net_device_realize(DeviceState= *dev, Error **errp) > nc =3D qemu_get_queue(n->nic); > nc->rxfilter_notify_enabled =3D 1; > =20 > + QTAILQ_INIT(&n->rsc_chains); > n->qdev =3D dev; > register_savevm(dev, "virtio-net", -1, VIRTIO_NET_VM_VERSION, > virtio_net_save, virtio_net_load, n); > @@ -1848,6 +2339,7 @@ static void virtio_net_device_unrealize(DeviceSta= te *dev, Error **errp) > g_free(n->vqs); > qemu_del_nic(n->nic); > virtio_cleanup(vdev); > + virtio_net_rsc_cleanup(n); > } > =20 > static void virtio_net_instance_init(Object *obj) > @@ -1909,6 +2401,10 @@ static Property virtio_net_properties[] =3D { > TX_TIMER_INTERVAL), > DEFINE_PROP_INT32("x-txburst", VirtIONet, net_conf.txburst, TX_BU= RST), > DEFINE_PROP_STRING("tx", VirtIONet, net_conf.tx), > + DEFINE_PROP_BIT("guest_rsc", VirtIONet, host_features, > + VIRTIO_NET_F_GUEST_RSC, false), > + DEFINE_PROP_UINT32("rsc_interval", VirtIONet, rsc_timeout, > + VIRTIO_NET_RSC_INTERVAL), > DEFINE_PROP_END_OF_LIST(), > }; > =20 > diff --git a/include/hw/virtio/virtio-net.h b/include/hw/virtio/virtio-= net.h > index 0cabdb6..e995190 100644 > --- a/include/hw/virtio/virtio-net.h > +++ b/include/hw/virtio/virtio-net.h > @@ -59,6 +59,8 @@ typedef struct VirtIONet { > VirtIONetQueue *vqs; > VirtQueue *ctrl_vq; > NICState *nic; > + QTAILQ_HEAD(, NetRscChain) rsc_chains; > + uint32_t rsc_timeout; > uint32_t tx_timeout; > int32_t tx_burst; > uint32_t has_vnet_hdr; > diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h > index 6a37065..a7cb84a 100644 > --- a/include/hw/virtio/virtio.h > +++ b/include/hw/virtio/virtio.h > @@ -30,6 +30,8 @@ > (0x1ULL << VIRTIO_F_ANY_LAYOUT)) > =20 > struct VirtQueue; > +struct VirtIONet; > +typedef struct VirtIONet VirtIONet; > =20 > static inline hwaddr vring_align(hwaddr addr, > unsigned long align) > @@ -128,6 +130,79 @@ typedef struct VirtioDeviceClass { > int (*load)(VirtIODevice *vdev, QEMUFile *f, int version_id); > } VirtioDeviceClass; > =20 > +/* Coalesced packets type & status */ > +typedef enum { > + RSC_COALESCE, /* Data been coalesced */ > + RSC_FINAL, /* Will terminate current connection */ > + RSC_NO_MATCH, /* No matched in the buffer pool */ > + RSC_BYPASS, /* Packet to be bypass, not tcp, tcp ctrl,= etc */ > + RSC_WANT /* Data want to be coalesced */ > +} COALESCE_STATUS; > + > +typedef struct NetRscStat { > + uint32_t received; > + uint32_t coalesced; > + uint32_t over_size; > + uint32_t cache; > + uint32_t empty_cache; > + uint32_t no_match_cache; > + uint32_t win_update; > + uint32_t no_match; > + uint32_t tcp_syn; > + uint32_t tcp_ctrl_drain; > + uint32_t dup_ack; > + uint32_t dup_ack1; > + uint32_t dup_ack2; > + uint32_t pure_ack; > + uint32_t ack_out_of_win; > + uint32_t data_out_of_win; > + uint32_t data_out_of_order; > + uint32_t data_after_pure_ack; > + uint32_t bypass_not_tcp; > + uint32_t tcp_option; > + uint32_t tcp_all_opt; > + uint32_t ip_frag; > + uint32_t ip_hacked; > + uint32_t ip_option; > + uint32_t purge_failed; > + uint32_t drain_failed; > + uint32_t final_failed; > + int64_t timer; > +} NetRscStat; > + > +/* Rsc unit general info used to checking if can coalescing */ > +typedef struct NetRscUnit { > + void *ip; /* ip header */ > + uint16_t *ip_plen; /* data len pointer in ip header field */ > + struct tcp_header *tcp; /* tcp header */ > + uint16_t tcp_hdrlen; /* tcp header len */ > + uint16_t payload; /* pure payload without virtio/eth/ip/tcp = */ > +} NetRscUnit; > + > +/* Coalesced segmant */ > +typedef struct NetRscSeg { > + QTAILQ_ENTRY(NetRscSeg) next; > + void *buf; > + size_t size; > + uint16_t packets; > + uint16_t dup_ack; > + bool is_coalesced; /* need recal ipv4 header checksum, mark h= ere */ > + NetRscUnit unit; > + NetClientState *nc; > +} NetRscSeg; > + > +/* Chain is divided by protocol(ipv4/v6) and NetClientInfo */ > +typedef struct NetRscChain { > + QTAILQ_ENTRY(NetRscChain) next; > + VirtIONet *n; /* VirtIONet */ > + uint16_t proto; > + uint8_t gso_type; > + uint16_t max_payload; > + QEMUTimer *drain_timer; > + QTAILQ_HEAD(, NetRscSeg) buffers; > + NetRscStat stat; > +} NetRscChain; > + > void virtio_instance_init_common(Object *proxy_obj, void *data, > size_t vdev_size, const char *vdev_n= ame); > =20 > diff --git a/include/standard-headers/linux/virtio_net.h b/include/stan= dard-headers/linux/virtio_net.h > index a78f33e..5b95762 100644 > --- a/include/standard-headers/linux/virtio_net.h > +++ b/include/standard-headers/linux/virtio_net.h > @@ -55,6 +55,7 @@ > #define VIRTIO_NET_F_MQ 22 /* Device supports Receive Flow > * Steering */ > #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */ > +#define VIRTIO_NET_F_GUEST_RSC 24 /* Guest can coalesce tcp pack= ets */ The comment should be "Guest can receive coalesced tcp packets" ? > =20 > #ifndef VIRTIO_NET_NO_LEGACY > #define VIRTIO_NET_F_GSO 6 /* Host handles pkts w/ any GSO type */