All of lore.kernel.org
 help / color / mirror / Atom feed
From: Wei Xu <wexu@redhat.com>
To: Jason Wang <jasowang@redhat.com>, qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic
Date: Fri, 18 Mar 2016 12:17:49 +0800	[thread overview]
Message-ID: <56EB816D.5000804@redhat.com> (raw)
In-Reply-To: <56EB61D8.8080202@redhat.com>



On 2016年03月18日 10:03, Jason Wang wrote:
>
> On 03/18/2016 12:45 AM, Wei Xu wrote:
>> On 2016年03月17日 16:42, Jason Wang wrote:
>>
>>> On 03/15/2016 05:17 PM, wexu@redhat.com wrote:
>>>> From: Wei Xu <wexu@redhat.com>
>>>>
>>>> All the data packets in a tcp connection will be cached to a big buffer
>>>> in every receive interval, and will be sent out via a timer, the
>>>> 'virtio_net_rsc_timeout' controls the interval, the value will
>>>> influent the
>>>> performance and response of tcp connection extremely, 50000(50us) is a
>>>> experience value to gain a performance improvement, since the whql test
>>>> sends packets every 100us, so '300000(300us)' can pass the test case,
>>>> this is also the default value, it's gonna to be tunable.
>>>>
>>>> The timer will only be triggered if the packets pool is not empty,
>>>> and it'll drain off all the cached packets
>>>>
>>>> 'NetRscChain' is used to save the segments of different protocols in a
>>>> VirtIONet device.
>>>>
>>>> The main handler of TCP includes TCP window update, duplicated ACK
>>>> check
>>>> and the real data coalescing if the new segment passed sanity check
>>>> and is identified as an 'wanted' one.
>>>>
>>>> An 'wanted' segment means:
>>>> 1. Segment is within current window and the sequence is the expected
>>>> one.
>>>> 2. ACK of the segment is in the valid window.
>>>> 3. If the ACK in the segment is a duplicated one, then it must less
>>>> than 2,
>>>>      this is to notify upper layer TCP starting retransmission due to
>>>> the spec.
>>>>
>>>> Sanity check includes:
>>>> 1. Incorrect version in IP header
>>>> 2. IP options & IP fragment
>>>> 3. Not a TCP packets
>>>> 4. Sanity size check to prevent buffer overflow attack.
>>>>
>>>> There maybe more cases should be considered such as ip
>>>> identification other
>>>> flags, while it broke the test because windows set it to the same
>>>> even it's
>>>> not a fragment.
>>>>
>>>> Normally it includes 2 typical ways to handle a TCP control flag,
>>>> 'bypass'
>>>> and 'finalize', 'bypass' means should be sent out directly, and
>>>> 'finalize'
>>>> means the packets should also be bypassed, and this should be done
>>>> after searching for the same connection packets in the pool and sending
>>>> all of them out, this is to avoid out of data.
>>>>
>>>> All the 'SYN' packets will be bypassed since this always begin a new'
>>>> connection, other flags such 'FIN/RST' will trigger a finalization,
>>>> because
>>>> this normally happens upon a connection is going to be closed, an
>>>> 'URG' packet
>>>> also finalize current coalescing unit while there maybe protocol
>>>> difference to
>>>> different OS.
>>> But URG packet should be sent as quickly as possible regardless of
>>> ordering, no?
>> Yes, you right, URG will terminate the current 'SCU', i'll amend the
>> commit log.
>>
>>>> Statistics can be used to monitor the basic coalescing status, the
>>>> 'out of order'
>>>> and 'out of window' means how many retransmitting packets, thus
>>>> describe the
>>>> performance intuitively.
>>>>
>>>> Signed-off-by: Wei Xu <wexu@redhat.com>
>>>> ---
>>>>    hw/net/virtio-net.c            | 486
>>>> ++++++++++++++++++++++++++++++++++++++++-
>>>>    include/hw/virtio/virtio-net.h |   1 +
>>>>    include/hw/virtio/virtio.h     |  72 ++++++
>>>>    3 files changed, 558 insertions(+), 1 deletion(-)
> [...]
>
>>>> +        } else {
>>>> +            /* Coalesce window update */
>>>> +            o_tcp->th_win = n_tcp->th_win;
>>>> +            chain->stat.win_update++;
>>>> +            return RSC_COALESCE;
>>>> +        }
>>>> +    } else {
>>>> +        /* pure ack, update ack */
>>>> +        o_tcp->th_ack = n_tcp->th_ack;
>>>> +        chain->stat.pure_ack++;
>>>> +        return RSC_COALESCE;
>>> Looks like there're something I missed. The spec said:
>>>
>>> "In other words, any pure ACK that is not a duplicate ACK or a window
>>> update triggers an exception and must not be coalesced. All such pure
>>> ACKs must be indicated as individual segments."
>>>
>>> Does it mean we *should not* coalesce windows update and pure ack?
>>> (Since it can wakeup transmission)?
>> It's also a little bit inexplicit and flexible due to the spec, please
>> see the flowchart I on the same page.
>>
>> Comments about the  flowchart:
>> ------------------------------------------------------------------------
>> The first of the following two flowcharts describes the rules for
>> coalescing segments and updating the TCP headers.
>> This flowchart refers to mechanisms for distinguishing valid duplicate
>> ACKs and window updates. The second flowchart describes these mechanisms.
>> ------------------------------------------------------------------------
>> As show in the flowchart, only status 'C' will break current scu and
>> get finalized, both 'A' and 'B' can be coalesced afaik.
>>
> Interesting, looks like you're right.
>
>>>> +    }
>>>> +}
>>>> +
>>>> +static int32_t virtio_net_rsc_coalesce_data(NetRscChain *chain,
>>>> NetRscSeg *seg,
>>>> +                                    const uint8_t *buf, NetRscUnit
>>>> *n_unit)
>>>> +{
>>>> +    void *data;
>>>> +    uint16_t o_ip_len;
>>>> +    uint32_t nseq, oseq;
>>>> +    NetRscUnit *o_unit;
>>>> +
>>>> +    o_unit = &seg->unit;
>>>> +    o_ip_len = htons(*o_unit->ip_plen);
>>>> +    nseq = htonl(n_unit->tcp->th_seq);
>>>> +    oseq = htonl(o_unit->tcp->th_seq);
>>>> +
>>>> +    if (n_unit->tcp_hdrlen > TCP_HDR_SZ) {
>>>> +        /* Log this only for debugging observation */
>>>> +        chain->stat.tcp_option++;
>>>> +    }
>>>> +
>>>> +    /* Ignore packet with more/larger tcp options */
>>>> +    if (n_unit->tcp_hdrlen > o_unit->tcp_hdrlen) {
>>> What if n_unit->tcp_hdrlen > o_uint->tcp_hdr_len ?
>> do you mean '<'? that also means some option changed, maybe i should
>> include it.
> Yes.
>
>>>> +        chain->stat.tcp_larger_option++;
>>>> +        return RSC_FINAL;
>>>> +    }
>>>> +
>>>> +    /* out of order or retransmitted. */
>>>> +    if ((nseq - oseq) > MAX_TCP_PAYLOAD) {
>>>> +        chain->stat.data_out_of_win++;
>>>> +        return RSC_FINAL;
>>>> +    }
>>>> +
>>>> +    data = ((uint8_t *)n_unit->tcp) + n_unit->tcp_hdrlen;
>>>> +    if (nseq == oseq) {
>>>> +        if ((0 == o_unit->payload) && n_unit->payload) {
>>>> +            /* From no payload to payload, normal case, not a dup
>>>> ack or etc */
>>>> +            chain->stat.data_after_pure_ack++;
>>>> +            goto coalesce;
>>>> +        } else {
>>>> +            return virtio_net_rsc_handle_ack(chain, seg, buf,
>>>> +                                             n_unit->tcp,
>>>> o_unit->tcp);
>>>> +        }
>>>> +    } else if ((nseq - oseq) != o_unit->payload) {
>>>> +        /* Not a consistent packet, out of order */
>>>> +        chain->stat.data_out_of_order++;
>>>> +        return RSC_FINAL;
>>>> +    } else {
>>>> +coalesce:
>>>> +        if ((o_ip_len + n_unit->payload) > chain->max_payload) {
>>>> +            chain->stat.over_size++;
>>>> +            return RSC_FINAL;
>>>> +        }
>>>> +
>>>> +        /* Here comes the right data, the payload lengh in v4/v6 is
>>>> different,
>>>> +           so use the field value to update and record the new data
>>>> len */
>>>> +        o_unit->payload += n_unit->payload; /* update new data len */
>>>> +
>>>> +        /* update field in ip header */
>>>> +        *o_unit->ip_plen = htons(o_ip_len + n_unit->payload);
>>>> +
>>>> +        /* Bring 'PUSH' big, the whql test guide says 'PUSH' can be
>>>> coalesced
>>>> +           for windows guest, while this may change the behavior
>>>> for linux
>>>> +           guest. */
>>> This needs more thought, 'can' probably means don't. (Linux GRO won't
>>> merge PUSH packet).
>> Yes, since it's mainly for win guest, how about take this as this
>> first and do more test and see how to handle it?
> Right, this is not an issue probably but just an optimization for latency.
>
> [...]
>
>>>> +        return NULL;
>>>> +    }
>>>> +
>>>> +    QTAILQ_FOREACH(chain, &n->rsc_chains, next) {
>>>> +        if (chain->proto == proto) {
>>>> +            return chain;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    chain = g_malloc(sizeof(*chain));
>>>> +    chain->hdr_size = n->guest_hdr_len;
>>> Why introduce a specified field for instead of just use
>>> n->guest_hdr_len?
>> this is to reduce invoking times to find VirtIONet by 'nc', there are
>> a few places will use it.
> Okay, then I think you'd better keep a pointer to VirtIONet structure
> instead. (Consider you may want to refer more data from it, we don't
> want to duplicate fields in two places).
sure, that's a good idea.
>
>>>> +    chain->proto = proto;
>>>> +    chain->max_payload = MAX_IP4_PAYLOAD;
>>>> +    chain->drain_timer = timer_new_ns(QEMU_CLOCK_VIRTUAL,
>>>> +                                      virtio_net_rsc_purge, chain);
>>>> +    memset(&chain->stat, 0, sizeof(chain->stat));
>>>> +
>>>> +    QTAILQ_INIT(&chain->buffers);
>>>> +    QTAILQ_INSERT_TAIL(&n->rsc_chains, chain, next);
>>>> +
>>>> +    return chain;
>>>> +}
>>>> +
>>>> +static ssize_t virtio_net_rsc_receive(NetClientState *nc,
>>>> +                                      const uint8_t *buf, size_t size)
>>>> +{
>>>> +    uint16_t proto;
>>>> +    NetRscChain *chain;
>>>> +    struct eth_header *eth;
>>>> +    VirtIONet *n;
>>>> +
>>>> +    n = qemu_get_nic_opaque(nc);
>>>> +    if (size < (n->guest_hdr_len + ETH_HDR_SZ)) {
>>>> +        return virtio_net_do_receive(nc, buf, size);
>>>> +    }
>>>> +
>>>> +    eth = (struct eth_header *)(buf + n->guest_hdr_len);
>>>> +    proto = htons(eth->h_proto);
>>>> +
>>>> +    chain = virtio_net_rsc_lookup_chain(n, nc, proto);
>>>> +    if (!chain) {
>>>> +        return virtio_net_do_receive(nc, buf, size);
>>>> +    } else {
>>>> +        chain->stat.received++;
>>>> +        return virtio_net_rsc_receive4(chain, nc, buf, size);
>>>> +    }
>>>> +}
>>>> +
>>>> +static ssize_t virtio_net_receive(NetClientState *nc,
>>>> +                                  const uint8_t *buf, size_t size)
>>>> +{
>>>> +    if (virtio_net_rsc_bypass) {
>>>> +        return virtio_net_do_receive(nc, buf, size);
>>> You need a feature bit for this and compat it for older machine types.
>>> And also need some work on virtio spec I think.
>> yes, not sure which way is good to support this, hmp/qmp/ethtool, this
>> is gonna to support win guest,
>> so need a well-compatible interface, any comments?
> I think this should be implemented through feature bits/negotiation
> instead of something like ethtool.
Looks this feature should be turn on/off dynamically due to the spec, so 
maybe this should be managed from the guest, is there any reference code 
for this?
>
> [...]

  reply	other threads:[~2016-03-18  4:18 UTC|newest]

Thread overview: 32+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-03-15  9:17 [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest wexu
2016-03-15  9:17 ` [Qemu-devel] [ Patch 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic wexu
2016-03-15 10:00   ` Michael S. Tsirkin
2016-03-16  3:23     ` Wei Xu
2016-03-17  8:42   ` Jason Wang
2016-03-17 16:45     ` Wei Xu
2016-03-18  2:03       ` Jason Wang
2016-03-18  4:17         ` Wei Xu [this message]
2016-03-18  5:20           ` Jason Wang
2016-03-18  6:38             ` Wei Xu
2016-03-18  6:56               ` Jason Wang
2016-03-18 14:52                 ` Wei Xu
2016-03-15  9:17 ` [Qemu-devel] [ Patch 2/2] virtio-net rsc: support coalescing ipv6 " wexu
2016-03-17  8:50   ` Jason Wang
2016-03-17 16:50     ` Wei Xu
2016-03-15 10:01 ` [Qemu-devel] [ Patch 0/2] Support Receive-Segment-Offload(RSC) for WHQL test of Window guest Michael S. Tsirkin
2016-03-16  3:08   ` Wei Xu
2016-03-17  6:47 ` Jason Wang
2016-03-17 15:21   ` Wei Xu
2016-03-17 15:44     ` Michael S. Tsirkin
2016-03-17 16:57       ` Wei Xu
2016-03-18  2:22         ` Jason Wang
2016-03-18  4:24           ` Wei Xu
2016-03-18  5:21             ` Jason Wang
2016-03-18  6:30               ` Wei Xu
  -- strict thread matches above, loose matches on Subject: below --
2016-10-31 17:41 [Qemu-devel] [ RFC Patch v7 0/2] Support Receive-Segment-Offload(RSC) for WHQL wexu
2016-10-31 17:41 ` [Qemu-devel] [PATCH 1/2] virtio-net rsc: support coalescing ipv4 tcp traffic wexu
2016-11-24  4:17   ` Jason Wang
2016-11-24  4:26     ` Michael S. Tsirkin
2016-11-24  4:31       ` Jason Wang
2016-11-24  5:09         ` Michael S. Tsirkin
2016-11-30  8:55     ` Wei Xu
2016-11-30 11:12       ` Jason Wang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=56EB816D.5000804@redhat.com \
    --to=wexu@redhat.com \
    --cc=jasowang@redhat.com \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.