From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([2001:4830:134:3::10]:55401) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1b8VIF-0002iJ-R7 for qemu-devel@nongnu.org; Thu, 02 Jun 2016 12:16:27 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1b8VIC-0001dx-EB for qemu-devel@nongnu.org; Thu, 02 Jun 2016 12:16:23 -0400 Received: from mx1.redhat.com ([209.132.183.28]:33226) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1b8VIB-0001do-WC for qemu-devel@nongnu.org; Thu, 02 Jun 2016 12:16:20 -0400 Received: from int-mx13.intmail.prod.int.phx2.redhat.com (int-mx13.intmail.prod.int.phx2.redhat.com [10.5.11.26]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mx1.redhat.com (Postfix) with ESMTPS id 9C1494944E for ; Thu, 2 Jun 2016 16:16:18 +0000 (UTC) References: <1464453454-5703-1-git-send-email-wexu@redhat.com> <1464453454-5703-2-git-send-email-wexu@redhat.com> <574BBF7E.4050609@redhat.com> From: Wei Xu Message-ID: <57505BCD.3030600@redhat.com> Date: Fri, 3 Jun 2016 00:16:13 +0800 MIME-Version: 1.0 In-Reply-To: <574BBF7E.4050609@redhat.com> Content-Type: text/plain; charset=utf-8; format=flowed Content-Transfer-Encoding: quoted-printable Subject: Re: [Qemu-devel] [ RFC Patch v6 1/3] virtio-net rsc: support coalescing ipv4 tcp traffic List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Jason Wang , qemu-devel@nongnu.org Cc: marcel@redhat.com, mst@redhat.com, victork@redhat.com, dfleytma@redhat.com, yvugenfi@redhat.com On 2016=E5=B9=B405=E6=9C=8830=E6=97=A5 12:20, Jason Wang wrote: > > > On 2016=E5=B9=B405=E6=9C=8829=E6=97=A5 00:37, wexu@redhat.com wrote: >> From: Wei Xu >> >> All the data packets in a tcp connection will be cached to a big buffe= r >> in every receive interval, and will be sent out via a timer, the >> 'virtio_net_rsc_timeout' controls the interval, the value will >> influent the >> performance and response of tcp connection extremely, 50000(50us) is a >> experience value to gain a performance improvement, since the whql tes= t >> sends packets every 100us, so '300000(300us)' can pass the test case, >> this is also the default value, it's tunable via the command line >> parameter 'rsc_interval' with 'virtio-net-pci' device, for example, be= low >> parameter is to launch a guest with interval set as '500000'. > > Does the value make sense if it was smaller than 1us? If not, why not > just make the unit to be 1us? It's an experience value, 500us is a good candidate for netperf=20 throughput test, too short interval less than 50 us will gain an obvious=20 penalty AFAIK, this is caused only few packets can be coalesced and the=20 cycles wasted for rsc itself. >> >> 'virtio-net-pci,netdev=3Dhostnet1,bus=3Dpci.0,id=3Dnet1,mac=3D00,rsc_i= nterval=3D500000' >> >> will >> >> The timer will only be triggered if the packets pool is not empty, >> and it'll drain off all the cached packets. >> >> 'NetRscChain' is used to save the segments of different protocols in a >> VirtIONet device. >> >> The main handler of TCP includes TCP window update, duplicated ACK che= ck >> and the real data coalescing if the new segment passed sanity check >> and is identified as an 'wanted' one. >> >> An 'wanted' segment means: >> 1. Segment is within current window and the sequence is the expected o= ne. >> 2. 'ACK' of the segment is in the valid window. >> >> Sanity check includes: >> 1. Incorrect version in IP header >> 2. IP options & IP fragment >> 3. Not a TCP packets >> 4. Sanity size check to prevent buffer overflow attack. >> >> There maybe more cases should be considered such as ip identification >> other >> flags, while it broke the test because windows set it to the same even >> it's >> not a fragment. >> >> Normally it includes 2 typical ways to handle a TCP control flag, >> 'bypass' >> and 'finalize', 'bypass' means should be sent out directly, and >> 'finalize' >> means the packets should also be bypassed, and this should be done >> after searching for the same connection packets in the pool and sendin= g >> all of them out, this is to avoid out of data. >> >> All the 'SYN' packets will be bypassed since this always begin a new' >> connection, other flags such 'FIN/RST' will trigger a finalization, >> because >> this normally happens upon a connection is going to be closed, an >> 'URG' packet >> also finalize current coalescing unit. >> >> Statistics can be used to monitor the basic coalescing status, the >> 'out of order' >> and 'out of window' means how many retransmitting packets, thus >> describe the >> performance intuitively. > > We usually don't add device specific monitor command. Maybe a new > control vq command ethtool -S in guest in the future. I was thinking of > removing those counters since it was never used in this series. Yes, that's a good idea, actually i'm doubting whether this feature=20 should be a guest feature or a host feature, the spec says it should be=20 more like a guest feature, but it's provided as a host built-in feature,=20 so how and where to examine it is a problem. I'm using gdb to debugging=20 it currently, normally i would check this counter directly via debug=20 command, and the statistics is quite useful for troubleshooting, it'll=20 be optional when this feature is geting more and more robust, can we=20 keep it and leave if should keep it or where to display it along with=20 qa's testing? > >> >> Signed-off-by: Wei Xu >> --- >> hw/net/virtio-net.c | 498 >> +++++++++++++++++++++++++++- >> include/hw/virtio/virtio-net.h | 2 + >> include/hw/virtio/virtio.h | 75 +++++ >> include/standard-headers/linux/virtio_net.h | 1 + > > For RFC, it's ok. But for formal patch, this is not the correct way to > modify Linux headers. There's a script in > scripts/update-linux-headers.sh which was used to sync it from Linux > source. This means, it must be merged in Linux or at least in > maintainer's tree first. > >> 4 files changed, 575 insertions(+), 1 deletion(-) >> >> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c >> index 5798f87..b3bb63b 100644 >> --- a/hw/net/virtio-net.c >> +++ b/hw/net/virtio-net.c >> @@ -15,10 +15,12 @@ >> #include "qemu/iov.h" >> #include "hw/virtio/virtio.h" >> #include "net/net.h" >> +#include "net/eth.h" >> #include "net/checksum.h" >> #include "net/tap.h" >> #include "qemu/error-report.h" >> #include "qemu/timer.h" >> +#include "qemu/sockets.h" >> #include "hw/virtio/virtio-net.h" >> #include "net/vhost_net.h" >> #include "hw/virtio/virtio-bus.h" >> @@ -38,6 +40,25 @@ >> #define endof(container, field) \ >> (offsetof(container, field) + sizeof(((container *)0)->field)) >> +#define VIRTIO_NET_IP4_ADDR_SIZE 8 /* ipv4 saddr + daddr */ >> +#define VIRTIO_NET_TCP_PORT_SIZE 4 /* sport + dport */ >> + >> +#define VIRTIO_NET_TCP_FLAG 0x3F >> +#define VIRTIO_NET_TCP_HDR_LENGTH 0xF000 >> + >> +/* IPv4 max payload, 16 bits in the header */ >> +#define VIRTIO_NET_MAX_IP4_PAYLOAD (65535 - sizeof(struct ip_header)) >> +#define VIRTIO_NET_MAX_TCP_PAYLOAD 65535 > > Is this still true if window scaling is enabled? And need to make sure > your ack/seq processing can work for window scaling. > >> + >> +/* header length value in ip header without option */ >> +#define VIRTIO_NET_IP4_HEADER_LENGTH 5 >> + >> +/* Purge coalesced packets timer interval, This value affects the >> performance >> + a lot, and should be tuned carefully, '300000'(300us) is the >> recommended >> + value to pass the WHQL test, '50000' can gain 2x netperf >> throughput with >> + tso/gso/gro 'off'. */ >> +#define VIRTIO_NET_RSC_INTERVAL 300000 > > Do we need a new property for this value? Yes, i added this already. + DEFINE_PROP_UINT32("rsc_interval", VirtIONet, rsc_timeout, + VIRTIO_NET_RSC_INTERVAL), > >> + >> typedef struct VirtIOFeature { >> uint32_t flags; >> size_t end; >> @@ -1089,7 +1110,8 @@ static int receive_filter(VirtIONet *n, const >> uint8_t *buf, int size) >> return 0; >> } >> -static ssize_t virtio_net_receive(NetClientState *nc, const uint8_t >> *buf, size_t size) >> +static ssize_t virtio_net_do_receive(NetClientState *nc, >> + const uint8_t *buf, size_t size) >> { >> VirtIONet *n =3D qemu_get_nic_opaque(nc); >> VirtIONetQueue *q =3D virtio_net_get_subqueue(nc); >> @@ -1685,6 +1707,474 @@ static int virtio_net_load_device(VirtIODevice >> *vdev, QEMUFile *f, >> return 0; >> } >> +static void virtio_net_rsc_extract_unit4(NetRscChain *chain, >> + const uint8_t *buf, >> NetRscUnit* unit) >> +{ >> + uint16_t hdr_len; >> + uint16_t ip_hdrlen; >> + struct ip_header *ip; >> + >> + hdr_len =3D chain->n->guest_hdr_len; >> + ip =3D (struct ip_header *)(buf + hdr_len + sizeof(struct >> eth_header)); >> + unit->ip =3D (void *)ip; >> + ip_hdrlen =3D (ip->ip_ver_len & 0xF) << 2; >> + unit->ip_plen =3D &ip->ip_len; >> + unit->tcp =3D (struct tcp_header *)(((uint8_t *)unit->ip) + >> ip_hdrlen); >> + unit->tcp_hdrlen =3D (htons(unit->tcp->th_offset_flags) & 0xF000) >> >> 10; >> + unit->payload =3D htons(*unit->ip_plen) - ip_hdrlen - >> unit->tcp_hdrlen; >> +} >> + >> +static void virtio_net_rsc_ipv4_checksum(struct virtio_net_hdr *vhdr, >> + struct ip_header *ip) >> +{ >> + uint32_t sum; >> + >> + ip->ip_sum =3D 0; >> + sum =3D net_checksum_add_cont(sizeof(struct ip_header), (uint8_t >> *)ip, 0); >> + ip->ip_sum =3D cpu_to_be16(net_checksum_finish(sum)); >> + vhdr->flags &=3D ~VIRTIO_NET_HDR_F_NEEDS_CSUM; >> + vhdr->flags |=3D VIRTIO_NET_HDR_F_DATA_VALID; >> +} >> + >> +static size_t virtio_net_rsc_drain_seg(NetRscChain *chain, NetRscSeg >> *seg) >> +{ >> + int ret; >> + struct virtio_net_hdr *h; >> + >> + h =3D (struct virtio_net_hdr *)seg->buf; >> + virtio_net_rsc_ipv4_checksum(h, seg->unit.ip); >> + ret =3D virtio_net_do_receive(seg->nc, seg->buf, seg->size); >> + QTAILQ_REMOVE(&chain->buffers, seg, next); >> + g_free(seg->buf); >> + g_free(seg); >> + >> + return ret; >> +} >> + >> +static void virtio_net_rsc_purge(void *opq) >> +{ >> + NetRscSeg *seg, *rn; >> + NetRscChain *chain =3D (NetRscChain *)opq; >> + >> + QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn) { >> + if (virtio_net_rsc_drain_seg(chain, seg) =3D=3D 0) { >> + chain->stat.purge_failed++; >> + continue; > > Why need this? ah., We have discussed this before, there maybe momentary congestion and=20 can be resumed quickly for the guest, so it's try to continue receiving=20 rather then drop the cached packets. > >> + } >> + } >> + >> + chain->stat.timer++; >> + if (!QTAILQ_EMPTY(&chain->buffers)) { >> + timer_mod(chain->drain_timer, >> + qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + >> chain->n->rsc_timeout); > > Consider the timer is invisible to guest, why use QEMU_CLOCK_VIRTUAL? Yes, it's invisible to the guest, should so should i choose a host or=20 timer here? > >> + } >> +} >> + >> +static void virtio_net_rsc_cleanup(VirtIONet *n) >> +{ >> + NetRscChain *chain, *rn_chain; >> + NetRscSeg *seg, *rn_seg; >> + >> + QTAILQ_FOREACH_SAFE(chain, &n->rsc_chains, next, rn_chain) { >> + QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, rn_seg) { >> + QTAILQ_REMOVE(&chain->buffers, seg, next); >> + g_free(seg->buf); >> + g_free(seg); >> + } >> + >> + timer_del(chain->drain_timer); >> + timer_free(chain->drain_timer); >> + QTAILQ_REMOVE(&n->rsc_chains, chain, next); >> + g_free(chain); >> + } >> +} >> + >> +static void virtio_net_rsc_cache_buf(NetRscChain *chain, >> NetClientState *nc, >> + const uint8_t *buf, size_t size) >> +{ >> + uint16_t hdr_len; >> + NetRscSeg *seg; >> + >> + hdr_len =3D chain->n->guest_hdr_len; >> + seg =3D g_malloc(sizeof(NetRscSeg)); >> + seg->buf =3D g_malloc(hdr_len + sizeof(struct eth_header)\ >> + + VIRTIO_NET_MAX_TCP_PAYLOAD); >> + memcpy(seg->buf, buf, size); >> + seg->size =3D size; >> + seg->packets =3D 1; >> + seg->dup_ack =3D 0; >> + seg->is_coalesced =3D 0; >> + seg->nc =3D nc; >> + >> + QTAILQ_INSERT_TAIL(&chain->buffers, seg, next); >> + chain->stat.cache++; >> + >> + virtio_net_rsc_extract_unit4(chain, seg->buf, &seg->unit); >> +} >> + >> +static int32_t virtio_net_rsc_handle_ack(NetRscChain *chain, >> + NetRscSeg *seg, const >> uint8_t *buf, >> + struct tcp_header *n_tcp, >> + struct tcp_header *o_tcp) >> +{ >> + uint32_t nack, oack; >> + uint16_t nwin, owin; >> + >> + nack =3D htonl(n_tcp->th_ack); >> + nwin =3D htons(n_tcp->th_win); >> + oack =3D htonl(o_tcp->th_ack); >> + owin =3D htons(o_tcp->th_win); >> + >> + if ((nack - oack) >=3D VIRTIO_NET_MAX_TCP_PAYLOAD) { >> + chain->stat.ack_out_of_win++; >> + return RSC_FINAL; >> + } else if (nack =3D=3D oack) { >> + /* duplicated ack or window probe */ >> + if (nwin =3D=3D owin) { >> + /* duplicated ack, add dup ack count due to whql test up >> to 1 */ >> + chain->stat.dup_ack++; >> + return RSC_FINAL; >> + } else { >> + /* Coalesce window update */ >> + o_tcp->th_win =3D n_tcp->th_win; >> + chain->stat.win_update++; >> + return RSC_COALESCE; >> + } >> + } else { >> + /* pure ack, go to 'C', finalize*/ >> + chain->stat.pure_ack++; >> + return RSC_FINAL; >> + } >> +} >> + >> +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain, >> + NetRscSeg *seg, const >> uint8_t *buf, >> + NetRscUnit *n_unit) >> +{ >> + void *data; >> + uint16_t o_ip_len; >> + uint32_t nseq, oseq; >> + NetRscUnit *o_unit; >> + >> + o_unit =3D &seg->unit; >> + o_ip_len =3D htons(*o_unit->ip_plen); >> + nseq =3D htonl(n_unit->tcp->th_seq); >> + oseq =3D htonl(o_unit->tcp->th_seq); >> + >> + /* out of order or retransmitted. */ >> + if ((nseq - oseq) > VIRTIO_NET_MAX_TCP_PAYLOAD) { >> + chain->stat.data_out_of_win++; >> + return RSC_FINAL; >> + } >> + >> + data =3D ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen; >> + if (nseq =3D=3D oseq) { >> + if ((o_unit->payload =3D=3D 0) && n_unit->payload) { >> + /* From no payload to payload, normal case, not a dup ack >> or etc */ >> + chain->stat.data_after_pure_ack++; >> + goto coalesce; >> + } else { >> + return virtio_net_rsc_handle_ack(chain, seg, buf, >> + n_unit->tcp, o_unit->tcp= ); >> + } >> + } else if ((nseq - oseq) !=3D o_unit->payload) { >> + /* Not a consistent packet, out of order */ >> + chain->stat.data_out_of_order++; >> + return RSC_FINAL; >> + } else { >> +coalesce: >> + if ((o_ip_len + n_unit->payload) > chain->max_payload) { >> + chain->stat.over_size++; >> + return RSC_FINAL; >> + } >> + >> + /* Here comes the right data, the payload length in v4/v6 is >> different, >> + so use the field value to update and record the new data >> len */ >> + o_unit->payload +=3D n_unit->payload; /* update new data len = */ >> + >> + /* update field in ip header */ >> + *o_unit->ip_plen =3D htons(o_ip_len + n_unit->payload); >> + >> + /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be >> coalesced >> + for windows guest, while this may change the behavior for >> linux >> + guest. */ >> + o_unit->tcp->th_offset_flags =3D n_unit->tcp->th_offset_flags= ; >> + >> + o_unit->tcp->th_ack =3D n_unit->tcp->th_ack; >> + o_unit->tcp->th_win =3D n_unit->tcp->th_win; >> + >> + memmove(seg->buf + seg->size, data, n_unit->payload); >> + seg->size +=3D n_unit->payload; >> + seg->packets++; >> + chain->stat.coalesced++; >> + return RSC_COALESCE; >> + } >> +} >> + >> +static int32_t virtio_net_rsc_coalesce4(NetRscChain *chain, NetRscSeg >> *seg, >> + const uint8_t *buf, size_t si= ze, >> + NetRscUnit *unit) >> +{ >> + struct ip_header *ip1, *ip2; >> + >> + ip1 =3D (struct ip_header *)(unit->ip); >> + ip2 =3D (struct ip_header *)(seg->unit.ip); >> + if ((ip1->ip_src ^ ip2->ip_src) || (ip1->ip_dst ^ ip2->ip_dst) >> + || (unit->tcp->th_sport ^ seg->unit.tcp->th_sport) >> + || (unit->tcp->th_dport ^ seg->unit.tcp->th_dport)) { >> + chain->stat.no_match++; >> + return RSC_NO_MATCH; >> + } >> + >> + return virtio_net_rsc_coalesce_data(chain, seg, buf, unit); >> +} >> + >> +/* Pakcets with 'SYN' should bypass, other flag should be sent after >> drain >> + * to prevent out of order */ >> +static int virtio_net_rsc_tcp_ctrl_check(NetRscChain *chain, >> + struct tcp_header *tcp) >> +{ >> + uint16_t tcp_hdr; >> + uint16_t tcp_flag; >> + >> + tcp_flag =3D htons(tcp->th_offset_flags); >> + tcp_hdr =3D (tcp_flag & VIRTIO_NET_TCP_HDR_LENGTH) >> 10; >> + tcp_flag &=3D VIRTIO_NET_TCP_FLAG; >> + tcp_flag =3D htons(tcp->th_offset_flags) & 0x3F; >> + if (tcp_flag & TH_SYN) { >> + chain->stat.tcp_syn++; >> + return RSC_BYPASS; >> + } >> + >> + if (tcp_flag & (TH_FIN | TH_URG | TH_RST)) { >> + chain->stat.tcp_ctrl_drain++; >> + return RSC_FINAL; >> + } >> + >> + if (tcp_hdr > sizeof(struct tcp_header)) { >> + chain->stat.tcp_all_opt++; >> + return RSC_FINAL; >> + } >> + >> + return RSC_WANT; >> +} >> + >> +static bool virtio_net_rsc_empty_cache(NetRscChain *chain, >> NetClientState *nc, >> + const uint8_t *buf, size_t siz= e) >> +{ > > The name is really confusing, it in fact cache the buf if the cache is > empty. So I guess what you mean is "virtio_net_rsc_cache_if_empty()?". > But the question is why need a specific function like this, can we > simply add the logic to virtio_net_rsc_do_coalesce()? (And looks like > most of the logic has been there, except for timer and counter. OK. > >> + if (QTAILQ_EMPTY(&chain->buffers)) { >> + chain->stat.empty_cache++; >> + virtio_net_rsc_cache_buf(chain, nc, buf, size); >> + timer_mod(chain->drain_timer, >> + qemu_clock_get_ns(QEMU_CLOCK_VIRTUAL) + >> chain->n->rsc_timeout); >> + return 1; >> + } >> + >> + return 0; >> +} >> + >> +static size_t virtio_net_rsc_do_coalesce(NetRscChain *chain, >> NetClientState *nc, >> + const uint8_t *buf, size_t >> size, >> + NetRscUnit *unit) >> +{ >> + int ret; >> + NetRscSeg *seg, *nseg; >> + >> + QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) { >> + ret =3D virtio_net_rsc_coalesce4(chain, seg, buf, size, unit)= ; >> + >> + if (ret =3D=3D RSC_FINAL) { >> + if (virtio_net_rsc_drain_seg(chain, seg) =3D=3D 0) { >> + /* Send failed */ >> + chain->stat.final_failed++; >> + return 0; >> + } >> + >> + /* Send current packet */ >> + return virtio_net_do_receive(nc, buf, size); >> + } else if (ret =3D=3D RSC_NO_MATCH) { >> + continue; >> + } else { >> + /* Coalesced, mark coalesced flag to tell calc cksum for >> ipv4 */ >> + seg->is_coalesced =3D 1; >> + return size; >> + } >> + } >> + >> + chain->stat.no_match_cache++; >> + virtio_net_rsc_cache_buf(chain, nc, buf, size); >> + return size; >> +} >> + >> +/* Drain a connection data, this is to avoid out of order segments */ >> +static size_t virtio_net_rsc_drain_flow(NetRscChain *chain, >> NetClientState *nc, >> + const uint8_t *buf, size_t si= ze, >> + uint16_t ip_start, uint16_t >> ip_size, >> + uint16_t tcp_port, uint16_t >> port_size) >> +{ >> + NetRscSeg *seg, *nseg; >> + >> + QTAILQ_FOREACH_SAFE(seg, &chain->buffers, next, nseg) { >> + if (memcmp(buf + ip_start, seg->buf + ip_start, ip_size) >> + || memcmp(buf + tcp_port, seg->buf + tcp_port, port_size)= ) { > > The header comparison is rather similar to what you've done in > virtio_net_rsc_coalesce4/6(). Any chance to reduce the duplication? OK. > >> + continue; >> + } >> + if (virtio_net_rsc_drain_seg(chain, seg) =3D=3D 0) { >> + chain->stat.drain_failed++; >> + } >> + >> + break; >> + } >> + >> + return virtio_net_do_receive(nc, buf, size); >> +} >> + >> +static int32_t virtio_net_rsc_sanity_check4(NetRscChain *chain, >> + struct ip_header *ip, >> + const uint8_t *buf, >> size_t size) >> +{ >> + uint16_t ip_len; >> + uint16_t hdr_len; >> + >> + hdr_len =3D chain->n->guest_hdr_len; >> + if (size < (hdr_len + sizeof(struct eth_header) + sizeof(struct >> ip_header) >> + + sizeof(struct tcp_header))) { > > It's interesting that you don't have a stat counter for this case. Thanks. > >> + return RSC_BYPASS; >> + } >> + >> + /* Not an ipv4 one */ >> + if (((ip->ip_ver_len & 0xF0) >> 4) !=3D IP_HEADER_VERSION_4) { >> + chain->stat.ip_option++; >> + return RSC_BYPASS; >> + } >> + >> + /* Don't handle packets with ip option */ >> + if (VIRTIO_NET_IP4_HEADER_LENGTH !=3D (ip->ip_ver_len & 0xF)) { > > (ip-> ...) !=3D VIRTIO_NET_IP4_HEADER_LENGTH OK. > >> + chain->stat.ip_option++; >> + return RSC_BYPASS; >> + } >> + >> + if (ip->ip_p !=3D IPPROTO_TCP) { >> + chain->stat.bypass_not_tcp++; >> + return RSC_BYPASS; >> + } >> + >> + /* Don't handle packets with ip fragment */ >> + if (!(htons(ip->ip_off) & IP_DF)) { >> + chain->stat.ip_frag++; >> + return RSC_BYPASS; >> + } >> + >> + ip_len =3D htons(ip->ip_len); >> + if (ip_len < (sizeof(struct ip_header) + sizeof(struct tcp_header= )) >> + || ip_len > (size - hdr_len - sizeof(struct eth_header))) { >> + chain->stat.ip_hacked++; >> + return RSC_BYPASS; >> + } >> + >> + return RSC_WANT; >> +} >> + >> +static size_t virtio_net_rsc_receive4(NetRscChain *chain, >> NetClientState* nc, >> + const uint8_t *buf, size_t size= ) >> +{ >> + int32_t ret; >> + uint16_t hdr_len; >> + NetRscUnit unit; >> + >> + hdr_len =3D ((VirtIONet *)(chain->n))->guest_hdr_len; >> + virtio_net_rsc_extract_unit4(chain, buf, &unit); > > Is this safe for you do header analysis before doing sanity check? looks, there should be unsafe case here, will double check it. > >> + if (RSC_WANT !=3D virtio_net_rsc_sanity_check4(chain, unit.ip, bu= f, >> size)) { > > virtio_net_rsc_sanity_check4() =3D RSC_WANT please. I've pointed out th= is > kind of style issues several times. > >> + return virtio_net_do_receive(nc, buf, size); >> + } >> + >> + ret =3D virtio_net_rsc_tcp_ctrl_check(chain, unit.tcp); >> + if (ret =3D=3D RSC_BYPASS) { >> + return virtio_net_do_receive(nc, buf, size); >> + } else if (ret =3D=3D RSC_FINAL) { >> + return virtio_net_rsc_drain_flow(chain, nc, buf, size, >> + ((hdr_len + sizeof(struct eth_header)) + 12), >> + VIRTIO_NET_IP4_ADDR_SIZE, >> + hdr_len + sizeof(struct eth_header) + sizeof(struct >> ip_header), >> + VIRTIO_NET_TCP_PORT_SIZE); >> + } >> + >> + if (virtio_net_rsc_empty_cache(chain, nc, buf, size)) { >> + return size; >> + } >> + >> + return virtio_net_rsc_do_coalesce(chain, nc, buf, size, &unit); >> +} >> + >> +static NetRscChain *virtio_net_rsc_lookup_chain(VirtIONet * n, >> + NetClientState *nc, >> + uint16_t proto) >> +{ >> + NetRscChain *chain; >> + >> + if (proto !=3D (uint16_t)ETH_P_IP) { >> + return NULL; >> + } >> + >> + QTAILQ_FOREACH(chain, &n->rsc_chains, next) { >> + if (chain->proto =3D=3D proto) { >> + return chain; >> + } >> + } >> + >> + chain =3D g_malloc(sizeof(*chain)); >> + chain->n =3D n; >> + chain->proto =3D proto; >> + chain->max_payload =3D VIRTIO_NET_MAX_IP4_PAYLOAD; >> + chain->gso_type =3D VIRTIO_NET_HDR_GSO_TCPV4; >> + chain->drain_timer =3D timer_new_ns(QEMU_CLOCK_VIRTUAL, >> + virtio_net_rsc_purge, chain); >> + memset(&chain->stat, 0, sizeof(chain->stat)); >> + >> + QTAILQ_INIT(&chain->buffers); >> + QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next); >> + >> + return chain; >> +} >> + >> +static ssize_t virtio_net_rsc_receive(NetClientState *nc, >> + const uint8_t *buf, size_t size= ) >> +{ >> + uint16_t proto; >> + NetRscChain *chain; >> + struct eth_header *eth; >> + VirtIONet *n; >> + >> + n =3D qemu_get_nic_opaque(nc); >> + if (size < (n->guest_hdr_len + sizeof(struct eth_header))) { > > Here should use host_hdr_len, no? Not understanding the code well, could you please give a typical usage=20 of in what kind of case will the host_hdr_len be different with=20 guest_hdr_len? i was thinking they are equal in most cases, is it caused=20 by different feature capability such as mergeable buffers, etc? > >> + return virtio_net_do_receive(nc, buf, size); >> + } >> + >> + eth =3D (struct eth_header *)(buf + n->guest_hdr_len); >> + proto =3D htons(eth->h_proto); >> + >> + chain =3D virtio_net_rsc_lookup_chain(n, nc, proto); >> + if (!chain) { >> + return virtio_net_do_receive(nc, buf, size); >> + } else { >> + chain->stat.received++; >> + return virtio_net_rsc_receive4(chain, nc, buf, size); >> + } >> +} >> + >> +static ssize_t virtio_net_receive(NetClientState *nc, >> + const uint8_t *buf, size_t size) >> +{ >> + VirtIONet *n; >> + >> + n =3D qemu_get_nic_opaque(nc); >> + if (n->host_features & (1ULL << VIRTIO_NET_F_GUEST_RSC)) { >> + return virtio_net_rsc_receive(nc, buf, size); >> + } else { >> + return virtio_net_do_receive(nc, buf, size); >> + } >> +} >> + >> static NetClientInfo net_virtio_info =3D { >> .type =3D NET_CLIENT_OPTIONS_KIND_NIC, >> .size =3D sizeof(NICState), >> @@ -1814,6 +2304,7 @@ static void >> virtio_net_device_realize(DeviceState *dev, Error **errp) >> nc =3D qemu_get_queue(n->nic); >> nc->rxfilter_notify_enabled =3D 1; >> + QTAILQ_INIT(&n->rsc_chains); >> n->qdev =3D dev; >> register_savevm(dev, "virtio-net", -1, VIRTIO_NET_VM_VERSION, >> virtio_net_save, virtio_net_load, n); >> @@ -1848,6 +2339,7 @@ static void >> virtio_net_device_unrealize(DeviceState *dev, Error **errp) >> g_free(n->vqs); >> qemu_del_nic(n->nic); >> virtio_cleanup(vdev); >> + virtio_net_rsc_cleanup(n); >> } >> static void virtio_net_instance_init(Object *obj) >> @@ -1909,6 +2401,10 @@ static Property virtio_net_properties[] =3D { >> TX_TIMER_INTERVAL), >> DEFINE_PROP_INT32("x-txburst", VirtIONet, net_conf.txburst, >> TX_BURST), >> DEFINE_PROP_STRING("tx", VirtIONet, net_conf.tx), >> + DEFINE_PROP_BIT("guest_rsc", VirtIONet, host_features, >> + VIRTIO_NET_F_GUEST_RSC, false), >> + DEFINE_PROP_UINT32("rsc_interval", VirtIONet, rsc_timeout, >> + VIRTIO_NET_RSC_INTERVAL), >> DEFINE_PROP_END_OF_LIST(), >> }; >> diff --git a/include/hw/virtio/virtio-net.h >> b/include/hw/virtio/virtio-net.h >> index 0cabdb6..e995190 100644 >> --- a/include/hw/virtio/virtio-net.h >> +++ b/include/hw/virtio/virtio-net.h >> @@ -59,6 +59,8 @@ typedef struct VirtIONet { >> VirtIONetQueue *vqs; >> VirtQueue *ctrl_vq; >> NICState *nic; >> + QTAILQ_HEAD(, NetRscChain) rsc_chains; >> + uint32_t rsc_timeout; >> uint32_t tx_timeout; >> int32_t tx_burst; >> uint32_t has_vnet_hdr; >> diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h >> index 6a37065..a7cb84a 100644 >> --- a/include/hw/virtio/virtio.h >> +++ b/include/hw/virtio/virtio.h >> @@ -30,6 +30,8 @@ >> (0x1ULL << VIRTIO_F_ANY_LAYOUT)) >> struct VirtQueue; >> +struct VirtIONet; >> +typedef struct VirtIONet VirtIONet; >> static inline hwaddr vring_align(hwaddr addr, >> unsigned long align) >> @@ -128,6 +130,79 @@ typedef struct VirtioDeviceClass { >> int (*load)(VirtIODevice *vdev, QEMUFile *f, int version_id); >> } VirtioDeviceClass; >> +/* Coalesced packets type & status */ >> +typedef enum { >> + RSC_COALESCE, /* Data been coalesced */ >> + RSC_FINAL, /* Will terminate current connection */ >> + RSC_NO_MATCH, /* No matched in the buffer pool */ >> + RSC_BYPASS, /* Packet to be bypass, not tcp, tcp >> ctrl, etc */ >> + RSC_WANT /* Data want to be coalesced */ >> +} COALESCE_STATUS; >> + >> +typedef struct NetRscStat { >> + uint32_t received; >> + uint32_t coalesced; >> + uint32_t over_size; >> + uint32_t cache; >> + uint32_t empty_cache; >> + uint32_t no_match_cache; >> + uint32_t win_update; >> + uint32_t no_match; >> + uint32_t tcp_syn; >> + uint32_t tcp_ctrl_drain; >> + uint32_t dup_ack; >> + uint32_t dup_ack1; >> + uint32_t dup_ack2; >> + uint32_t pure_ack; >> + uint32_t ack_out_of_win; >> + uint32_t data_out_of_win; >> + uint32_t data_out_of_order; >> + uint32_t data_after_pure_ack; >> + uint32_t bypass_not_tcp; >> + uint32_t tcp_option; >> + uint32_t tcp_all_opt; >> + uint32_t ip_frag; >> + uint32_t ip_hacked; >> + uint32_t ip_option; >> + uint32_t purge_failed; >> + uint32_t drain_failed; >> + uint32_t final_failed; >> + int64_t timer; >> +} NetRscStat; >> + >> +/* Rsc unit general info used to checking if can coalescing */ >> +typedef struct NetRscUnit { >> + void *ip; /* ip header */ >> + uint16_t *ip_plen; /* data len pointer in ip header field */ >> + struct tcp_header *tcp; /* tcp header */ >> + uint16_t tcp_hdrlen; /* tcp header len */ >> + uint16_t payload; /* pure payload without virtio/eth/ip/tcp= */ >> +} NetRscUnit; >> + >> +/* Coalesced segmant */ >> +typedef struct NetRscSeg { >> + QTAILQ_ENTRY(NetRscSeg) next; >> + void *buf; >> + size_t size; >> + uint16_t packets; >> + uint16_t dup_ack; >> + bool is_coalesced; /* need recal ipv4 header checksum, mark >> here */ >> + NetRscUnit unit; >> + NetClientState *nc; >> +} NetRscSeg; >> + >> +/* Chain is divided by protocol(ipv4/v6) and NetClientInfo */ >> +typedef struct NetRscChain { >> + QTAILQ_ENTRY(NetRscChain) next; >> + VirtIONet *n; /* VirtIONet */ >> + uint16_t proto; >> + uint8_t gso_type; >> + uint16_t max_payload; >> + QEMUTimer *drain_timer; >> + QTAILQ_HEAD(, NetRscSeg) buffers; >> + NetRscStat stat; >> +} NetRscChain; >> + >> void virtio_instance_init_common(Object *proxy_obj, void *data, >> size_t vdev_size, const char >> *vdev_name); >> diff --git a/include/standard-headers/linux/virtio_net.h >> b/include/standard-headers/linux/virtio_net.h >> index a78f33e..5b95762 100644 >> --- a/include/standard-headers/linux/virtio_net.h >> +++ b/include/standard-headers/linux/virtio_net.h >> @@ -55,6 +55,7 @@ >> #define VIRTIO_NET_F_MQ 22 /* Device supports Receive Flow >> * Steering */ >> #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */ >> +#define VIRTIO_NET_F_GUEST_RSC 24 /* Guest can coalesce tcp >> packets */ > > The comment should be "Guest can receive coalesced tcp packets" ? Yes, actually windows driver is using gso_type 'GSO_TCPV4/6' for=20 coalesced packets, i'm doubting if they are overlapped and can be=20 multiplexed. > >> #ifndef VIRTIO_NET_NO_LEGACY >> #define VIRTIO_NET_F_GSO 6 /* Host handles pkts w/ any GSO >> type */ >