Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload

Discussion of the implementations of VIRTIO specification
 help / color / mirror / Atom feed

From: "Michael S. Tsirkin" <mst@redhat.com>
To: Jason Wang <jasowang@redhat.com>
Cc: David Riddoch <driddoch@solarflare.com>,
	Virtio-Dev <virtio-dev@lists.oasis-open.org>
Subject: Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
Date: Tue, 19 Feb 2019 09:27:44 -0500	[thread overview]
Message-ID: <20190219092333-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <e6535c70-32ed-1213-3f2c-fcfc05ec107a@redhat.com>

On Tue, Feb 19, 2019 at 02:33:04PM +0800, Jason Wang wrote:
> 
> On 2019/2/14 下午12:04, Michael S. Tsirkin wrote:
> > On Thu, Feb 14, 2019 at 11:34:22AM +0800, Jason Wang wrote:
> > > On 2019/2/14 上午1:30, Michael S. Tsirkin wrote:
> > > > On Wed, Feb 13, 2019 at 06:33:50PM +0800, Jason Wang wrote:
> > > > > On 2019/2/13 上午1:35, Michael S. Tsirkin wrote:
> > > > > > On Tue, Feb 12, 2019 at 04:47:08PM +0000, David Riddoch wrote:
> > > > > > > > > > > I would prefer to have the write barrier before writing the idx.
> > > > > > > > > > Well that's driver overhead for something device might never utilise in
> > > > > > > > > > a given workload. If we are optimizing let's optimize for speed.
> > > > > > > > > I think doing the barrier before writing idx is best for speed (see below).
> > > > > > > > I don't see it below:(
> > > > > > > Sorry, I'm not being clear.  If the write barrier is before the idx, then a
> > > > > > > PV device can read the idx, do a single rmb and read a whole bunch of
> > > > > > > descriptors.  As things stand today a PV device has to do an rmb for each
> > > > > > > descriptor that it reads.
> > > > > > > 
> > > > > > > I'm not sure, but this may be what Jason meant when he said "prefetch".
> > > > > Yes.
> > > > > 
> > > > > 
> > > > > > Oh. Right. So for PV it's easy to convert an rmb to a data dependency
> > > > > > instead.  It remains to be seen whether an extra cache miss per
> > > > > > batch is cheaper or more expensive than such a dependency
> > > > > > per descriptor.
> > > > > > 
> > > > > > I hope Jason can experiment and let us know.
> > > > > > 
> > > > > > 
> > > > > To clarify, actually I did play someting like:
> > > > > 
> > > > > 
> > > > > if (desc_is_avail((last_avail + 32) % size))
> > > > > 
> > > > >       [last_avail, last_avail + 32] is avail [1]
> > > > > 
> > > > > else if (desc_is_avail(last_avail + 1) % size))
> > > > > 
> > > > >       last_avail is avail
> > > > > 
> > > > > else
> > > > > 
> > > > >      no new
> > > > > 
> > > > > 
> > > > > And looks like [1] depends on the driver to do:
> > > > > 
> > > > > for all descriptors
> > > > > 
> > > > >       update desc (but not make it available for the flag)
> > > > > 
> > > > > wmb()
> > > > > 
> > > > > for all descriptors
> > > > > 
> > > > >       update desc.flag
> > > > > 
> > > > > 
> > > > > This means if last_avail + 32 is avail, no need to check flags of
> > > > > [last_avail, last_avail + 31] since the descriptor content (except for the
> > > > > flag) is guaranteed to be correct.
> > > > We can try looking into that, sure. It's not just wmb though which is
> > > > anyway free on x86. There are cases of e.g. chained descriptors
> > > > to consider where we do not want device to see a partial chain.
> > > 
> > > If we saw the last descriptor is in a chain, it's still doesn't harm, just
> > > read more?
> > So it's possible to add a variant that sticks a wmb after each flag
> > update for sure. However if you are writing out batches that's an
> > upfront cost and what you gain is prefetch on the other side which is
> > speculative.
> 
> 
> Yes.
> 
> 
> > 
> > 
> > > > 
> > > > > I get ~10% PPS improvement.
> > > > > 
> > > > > Thanks
> > > > I suspect most of the gain isn't in the check at all though.
> > > 
> > > The gain was:
> > > 
> > > 1) batching reading descriptors, saving the times of userspace access
> > Yes that needs to be fixed. The way linux vhost driver does userspace
> > accesses right now with stac/clac and memory barriers within a loop is
> > plain silly.  Designing a host/guest interface to work around that kind
> > of inefficiency makes no sense, it's purely an implementation issue.
> 
> 
> Actually, it has other advantages, consider the case when driver is faster,
> batch reading may reduce the cache contention on the descriptor ring (when
> the ring is almost full).

A shared index implies always doing a cache miss though.
By comparison speculating a read ahead of a cache line
will sometimes bring you a cache.

> 
> > 
> > > 2) reduce the rmb(), otherwise you need rmb per descriptor
> > Well not really, you can
> > - validate a batch then do a single rmb
> > - replace rmb with a dependency
> > 
> > > > It's probably reading descriptors and maybe loop unrolling.
> > > 
> > > I'm not quite understand why unrolling help in this case.
> > The amount of operations in this loop is tiny but not tiny
> > enough for gcc to decide it's free. Take a look
> > 
> > 
> > #include <stdio.h>
> > 
> > struct desc {
> >          unsigned long long a;
> >          unsigned len;
> >          unsigned short type;
> >          unsigned short flags;
> > };
> > #ifdef UNROLL
> > __attribute__((optimize("unroll-loops")))
> > #endif
> > int foo(struct desc *desc)
> > {
> >       unsigned short aflags = 0xffff;
> >       unsigned short oflags = 0x0;
> > 
> >       int i;
> > 
> >          for (i=0; i < 4; ++i) {
> >               aflags &= desc[i].flags;
> >               oflags |= desc[i].flags;
> >          }
> > 
> >          return aflags == oflags;
> > }
> > 
> > $ gcc -Wall -DUNROLL -O2 -c l.c
> 
> 
> I think you mean without -DUNROLL.


Since there's a jump I think you are right.

> 
> > 
> > 0000000000000000 <foo>:
> > foo():
> >     0:   48 8d 47 0e             lea    0xe(%rdi),%rax
> >     4:   31 c9                   xor    %ecx,%ecx
> >     6:   48 83 c7 4e             add    $0x4e,%rdi
> >     a:   be ff ff ff ff          mov    $0xffffffff,%esi
> >     f:   0f b7 10                movzwl (%rax),%edx
> >    12:   48 83 c0 10             add    $0x10,%rax
> >    16:   21 d6                   and    %edx,%esi
> >    18:   09 d1                   or     %edx,%ecx
> >    1a:   48 39 f8                cmp    %rdi,%rax
> >    1d:   75 f0                   jne    f <foo+0xf>
> >    1f:   31 c0                   xor    %eax,%eax
> >    21:   66 39 ce                cmp    %cx,%si
> >    24:   0f 94 c0                sete   %al
> >    27:   c3                      retq
> > 
> > 
> > 
> > $ gcc -Wall -DUNROLL -O2 -c l.c
> > 
> > Disassembly of section .text:
> > 
> > 0000000000000000 <foo>:
> > foo():
> >     0:   0f b7 47 1e             movzwl 0x1e(%rdi),%eax
> >     4:   0f b7 77 2e             movzwl 0x2e(%rdi),%esi
> >     8:   0f b7 4f 0e             movzwl 0xe(%rdi),%ecx
> >     c:   89 c2                   mov    %eax,%edx
> >     e:   09 f0                   or     %esi,%eax
> >    10:   21 f2                   and    %esi,%edx
> >    12:   09 c8                   or     %ecx,%eax
> >    14:   21 ca                   and    %ecx,%edx
> >    16:   0f b7 4f 3e             movzwl 0x3e(%rdi),%ecx
> >    1a:   09 c8                   or     %ecx,%eax
> >    1c:   21 ca                   and    %ecx,%edx
> >    1e:   66 39 c2                cmp    %ax,%dx
> >    21:   0f 94 c0                sete   %al
> >    24:   0f b6 c0                movzbl %al,%eax
> >    27:   c3                      retq
> > 
> > 
> > See? gcc just does not unroll agressively enough.
> 
> 
> Ok, so this is just for avoiding branches.

Which is important for a variety of reasons, e.g. linux adds in tricks
to block speculation so branches might not be predicted even on intel.
A pipeline stall on each read isn't good for speed.

> 
> > 
> > > > Something like:
> > > > 
> > > > __attribute__((optimize("unroll-loops")))
> > > > bool fast_get_descriptors(vq, desc)
> > > > {
> > > > 
> > > > 	copy X descriptors
> > > > 	u16 aflags = 0xffff;
> > > > 	u16 oflags = 0x0;
> > > > 	for (i = 0; i < X; ++i) {
> > > > 		aflags &= desc[i].flags;
> > > > 		oflags |= desc[i].flags;
> > > > 	}
> > > > 	/* all flags must be same */
> > > > 	if (aflags != oflags)
> > > > 		return false;
> > > > 
> > > > 	return is_avail(vq->wrapcount, aflags);
> > > > }
> > > 
> > > This may work. It should be a little bit less efficient but without any
> > > assumption of driver if I understand correctly.
> > > 
> > > 
> > > > 
> > > > I also think 32 is very aggressive, I would start with 8.
> > > 
> > > Maybe, or adding some heuristic to predict it.
> > > 
> > > Thanks
> > The reason I say 8 is because it's two cache lines which
> > is typically what cpu pulls in anyway (1+linear pfetch).
> > Anything more might have more cost as it might
> > put more pressure on the cache. Further
> > static sized short loops can be unrolled by compiler if
> > you ask it nicely, which saves a bunch of branching,
> > speculation etc. Dynamic sized ones can't,
> > you end up with extra math and branches in the loop.
> 
> 
> You're probably right here, but need some benchmark to check.
> 
> Thanks
> 
> 
> > 
> > 
> > > > 
> > > > With above, you could try:
> > > > 
> > > > 	if (fast_get_descriptors(vq, desc))
> > > > 	     [last_avail, last_avail + 32] is avail [1]
> > > > 	else if (desc_is_avail(last_avail + 1) % size))
> > > > 	     last_avail is avail
> > > > 	 else
> > > >    	 	no new
> > > > 
> > > > 

---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org

next prev parent reply	other threads:[~2019-02-19 14:27 UTC|newest]

Thread overview: 37+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-02-01 14:23 [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload David Riddoch
2019-02-11  7:33 ` Michael S. Tsirkin
2019-02-11  8:52   ` David Riddoch
2019-02-11 14:24     ` Michael S. Tsirkin
2019-02-11 14:58       ` David Riddoch
2019-02-12  5:08         ` Michael S. Tsirkin
2019-02-12  7:28           ` Jason Wang
2019-02-12 13:44             ` Michael S. Tsirkin
2019-02-13 10:00               ` Jason Wang
2019-02-13 15:20                 ` Michael S. Tsirkin
2019-02-14  3:21                   ` Jason Wang
2019-02-14  3:41                     ` Michael S. Tsirkin
2019-02-15  3:59                       ` Jason Wang
2019-02-15  4:23                         ` Michael S. Tsirkin
2019-02-19  6:21                           ` Jason Wang
2019-02-19 14:18                             ` Michael S. Tsirkin
2019-02-12 11:40           ` David Riddoch
2019-02-12 12:51             ` Michael S. Tsirkin
2019-02-12 14:01             ` Michael S. Tsirkin
2019-02-12 16:47               ` David Riddoch
2019-02-12 17:35                 ` Michael S. Tsirkin
2019-02-13  9:49                   ` David Riddoch
2019-02-13 10:33                   ` Jason Wang
2019-02-13 17:30                     ` Michael S. Tsirkin
2019-02-14  3:34                       ` Jason Wang
2019-02-14  4:04                         ` Michael S. Tsirkin
2019-02-19  6:33                           ` Jason Wang
2019-02-19 14:27                             ` Michael S. Tsirkin [this message]
2019-04-30 22:41     ` Michael S. Tsirkin
2019-06-06 12:34       ` Michael S. Tsirkin
2019-06-17 14:41         ` Michael S. Tsirkin
     [not found] <501110004.11631.1549031044295@oodm23.prod.google.com>
2019-02-01 17:43 ` Rob Miller
2019-02-04  5:36   ` Stefan Hajnoczi
2019-02-12 18:58     ` Michael S. Tsirkin
2019-02-12 18:55   ` Michael S. Tsirkin
2019-02-12 20:03     ` Rob Miller
2019-02-13 17:38       ` Michael S. Tsirkin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20190219092333-mutt-send-email-mst@kernel.org \
    --to=mst@redhat.com \
    --cc=driddoch@solarflare.com \
    --cc=jasowang@redhat.com \
    --cc=virtio-dev@lists.oasis-open.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox