From: "Michael S. Tsirkin" <mst@redhat.com>
To: Jason Wang <jasowang@redhat.com>
Cc: David Riddoch <driddoch@solarflare.com>,
Virtio-Dev <virtio-dev@lists.oasis-open.org>
Subject: Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
Date: Tue, 19 Feb 2019 09:27:44 -0500 [thread overview]
Message-ID: <20190219092333-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <e6535c70-32ed-1213-3f2c-fcfc05ec107a@redhat.com>
On Tue, Feb 19, 2019 at 02:33:04PM +0800, Jason Wang wrote:
>
> On 2019/2/14 下午12:04, Michael S. Tsirkin wrote:
> > On Thu, Feb 14, 2019 at 11:34:22AM +0800, Jason Wang wrote:
> > > On 2019/2/14 上午1:30, Michael S. Tsirkin wrote:
> > > > On Wed, Feb 13, 2019 at 06:33:50PM +0800, Jason Wang wrote:
> > > > > On 2019/2/13 上午1:35, Michael S. Tsirkin wrote:
> > > > > > On Tue, Feb 12, 2019 at 04:47:08PM +0000, David Riddoch wrote:
> > > > > > > > > > > I would prefer to have the write barrier before writing the idx.
> > > > > > > > > > Well that's driver overhead for something device might never utilise in
> > > > > > > > > > a given workload. If we are optimizing let's optimize for speed.
> > > > > > > > > I think doing the barrier before writing idx is best for speed (see below).
> > > > > > > > I don't see it below:(
> > > > > > > Sorry, I'm not being clear. If the write barrier is before the idx, then a
> > > > > > > PV device can read the idx, do a single rmb and read a whole bunch of
> > > > > > > descriptors. As things stand today a PV device has to do an rmb for each
> > > > > > > descriptor that it reads.
> > > > > > >
> > > > > > > I'm not sure, but this may be what Jason meant when he said "prefetch".
> > > > > Yes.
> > > > >
> > > > >
> > > > > > Oh. Right. So for PV it's easy to convert an rmb to a data dependency
> > > > > > instead. It remains to be seen whether an extra cache miss per
> > > > > > batch is cheaper or more expensive than such a dependency
> > > > > > per descriptor.
> > > > > >
> > > > > > I hope Jason can experiment and let us know.
> > > > > >
> > > > > >
> > > > > To clarify, actually I did play someting like:
> > > > >
> > > > >
> > > > > if (desc_is_avail((last_avail + 32) % size))
> > > > >
> > > > > [last_avail, last_avail + 32] is avail [1]
> > > > >
> > > > > else if (desc_is_avail(last_avail + 1) % size))
> > > > >
> > > > > last_avail is avail
> > > > >
> > > > > else
> > > > >
> > > > > no new
> > > > >
> > > > >
> > > > > And looks like [1] depends on the driver to do:
> > > > >
> > > > > for all descriptors
> > > > >
> > > > > update desc (but not make it available for the flag)
> > > > >
> > > > > wmb()
> > > > >
> > > > > for all descriptors
> > > > >
> > > > > update desc.flag
> > > > >
> > > > >
> > > > > This means if last_avail + 32 is avail, no need to check flags of
> > > > > [last_avail, last_avail + 31] since the descriptor content (except for the
> > > > > flag) is guaranteed to be correct.
> > > > We can try looking into that, sure. It's not just wmb though which is
> > > > anyway free on x86. There are cases of e.g. chained descriptors
> > > > to consider where we do not want device to see a partial chain.
> > >
> > > If we saw the last descriptor is in a chain, it's still doesn't harm, just
> > > read more?
> > So it's possible to add a variant that sticks a wmb after each flag
> > update for sure. However if you are writing out batches that's an
> > upfront cost and what you gain is prefetch on the other side which is
> > speculative.
>
>
> Yes.
>
>
> >
> >
> > > >
> > > > > I get ~10% PPS improvement.
> > > > >
> > > > > Thanks
> > > > I suspect most of the gain isn't in the check at all though.
> > >
> > > The gain was:
> > >
> > > 1) batching reading descriptors, saving the times of userspace access
> > Yes that needs to be fixed. The way linux vhost driver does userspace
> > accesses right now with stac/clac and memory barriers within a loop is
> > plain silly. Designing a host/guest interface to work around that kind
> > of inefficiency makes no sense, it's purely an implementation issue.
>
>
> Actually, it has other advantages, consider the case when driver is faster,
> batch reading may reduce the cache contention on the descriptor ring (when
> the ring is almost full).
A shared index implies always doing a cache miss though.
By comparison speculating a read ahead of a cache line
will sometimes bring you a cache.
>
> >
> > > 2) reduce the rmb(), otherwise you need rmb per descriptor
> > Well not really, you can
> > - validate a batch then do a single rmb
> > - replace rmb with a dependency
> >
> > > > It's probably reading descriptors and maybe loop unrolling.
> > >
> > > I'm not quite understand why unrolling help in this case.
> > The amount of operations in this loop is tiny but not tiny
> > enough for gcc to decide it's free. Take a look
> >
> >
> > #include <stdio.h>
> >
> > struct desc {
> > unsigned long long a;
> > unsigned len;
> > unsigned short type;
> > unsigned short flags;
> > };
> > #ifdef UNROLL
> > __attribute__((optimize("unroll-loops")))
> > #endif
> > int foo(struct desc *desc)
> > {
> > unsigned short aflags = 0xffff;
> > unsigned short oflags = 0x0;
> >
> > int i;
> >
> > for (i=0; i < 4; ++i) {
> > aflags &= desc[i].flags;
> > oflags |= desc[i].flags;
> > }
> >
> > return aflags == oflags;
> > }
> >
> > $ gcc -Wall -DUNROLL -O2 -c l.c
>
>
> I think you mean without -DUNROLL.
Since there's a jump I think you are right.
>
> >
> > 0000000000000000 <foo>:
> > foo():
> > 0: 48 8d 47 0e lea 0xe(%rdi),%rax
> > 4: 31 c9 xor %ecx,%ecx
> > 6: 48 83 c7 4e add $0x4e,%rdi
> > a: be ff ff ff ff mov $0xffffffff,%esi
> > f: 0f b7 10 movzwl (%rax),%edx
> > 12: 48 83 c0 10 add $0x10,%rax
> > 16: 21 d6 and %edx,%esi
> > 18: 09 d1 or %edx,%ecx
> > 1a: 48 39 f8 cmp %rdi,%rax
> > 1d: 75 f0 jne f <foo+0xf>
> > 1f: 31 c0 xor %eax,%eax
> > 21: 66 39 ce cmp %cx,%si
> > 24: 0f 94 c0 sete %al
> > 27: c3 retq
> >
> >
> >
> > $ gcc -Wall -DUNROLL -O2 -c l.c
> >
> > Disassembly of section .text:
> >
> > 0000000000000000 <foo>:
> > foo():
> > 0: 0f b7 47 1e movzwl 0x1e(%rdi),%eax
> > 4: 0f b7 77 2e movzwl 0x2e(%rdi),%esi
> > 8: 0f b7 4f 0e movzwl 0xe(%rdi),%ecx
> > c: 89 c2 mov %eax,%edx
> > e: 09 f0 or %esi,%eax
> > 10: 21 f2 and %esi,%edx
> > 12: 09 c8 or %ecx,%eax
> > 14: 21 ca and %ecx,%edx
> > 16: 0f b7 4f 3e movzwl 0x3e(%rdi),%ecx
> > 1a: 09 c8 or %ecx,%eax
> > 1c: 21 ca and %ecx,%edx
> > 1e: 66 39 c2 cmp %ax,%dx
> > 21: 0f 94 c0 sete %al
> > 24: 0f b6 c0 movzbl %al,%eax
> > 27: c3 retq
> >
> >
> > See? gcc just does not unroll agressively enough.
>
>
> Ok, so this is just for avoiding branches.
Which is important for a variety of reasons, e.g. linux adds in tricks
to block speculation so branches might not be predicted even on intel.
A pipeline stall on each read isn't good for speed.
>
> >
> > > > Something like:
> > > >
> > > > __attribute__((optimize("unroll-loops")))
> > > > bool fast_get_descriptors(vq, desc)
> > > > {
> > > >
> > > > copy X descriptors
> > > > u16 aflags = 0xffff;
> > > > u16 oflags = 0x0;
> > > > for (i = 0; i < X; ++i) {
> > > > aflags &= desc[i].flags;
> > > > oflags |= desc[i].flags;
> > > > }
> > > > /* all flags must be same */
> > > > if (aflags != oflags)
> > > > return false;
> > > >
> > > > return is_avail(vq->wrapcount, aflags);
> > > > }
> > >
> > > This may work. It should be a little bit less efficient but without any
> > > assumption of driver if I understand correctly.
> > >
> > >
> > > >
> > > > I also think 32 is very aggressive, I would start with 8.
> > >
> > > Maybe, or adding some heuristic to predict it.
> > >
> > > Thanks
> > The reason I say 8 is because it's two cache lines which
> > is typically what cpu pulls in anyway (1+linear pfetch).
> > Anything more might have more cost as it might
> > put more pressure on the cache. Further
> > static sized short loops can be unrolled by compiler if
> > you ask it nicely, which saves a bunch of branching,
> > speculation etc. Dynamic sized ones can't,
> > you end up with extra math and branches in the loop.
>
>
> You're probably right here, but need some benchmark to check.
>
> Thanks
>
>
> >
> >
> > > >
> > > > With above, you could try:
> > > >
> > > > if (fast_get_descriptors(vq, desc))
> > > > [last_avail, last_avail + 32] is avail [1]
> > > > else if (desc_is_avail(last_avail + 1) % size))
> > > > last_avail is avail
> > > > else
> > > > no new
> > > >
> > > >
---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
next prev parent reply other threads:[~2019-02-19 14:27 UTC|newest]
Thread overview: 37+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-02-01 14:23 [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload David Riddoch
2019-02-11 7:33 ` Michael S. Tsirkin
2019-02-11 8:52 ` David Riddoch
2019-02-11 14:24 ` Michael S. Tsirkin
2019-02-11 14:58 ` David Riddoch
2019-02-12 5:08 ` Michael S. Tsirkin
2019-02-12 7:28 ` Jason Wang
2019-02-12 13:44 ` Michael S. Tsirkin
2019-02-13 10:00 ` Jason Wang
2019-02-13 15:20 ` Michael S. Tsirkin
2019-02-14 3:21 ` Jason Wang
2019-02-14 3:41 ` Michael S. Tsirkin
2019-02-15 3:59 ` Jason Wang
2019-02-15 4:23 ` Michael S. Tsirkin
2019-02-19 6:21 ` Jason Wang
2019-02-19 14:18 ` Michael S. Tsirkin
2019-02-12 11:40 ` David Riddoch
2019-02-12 12:51 ` Michael S. Tsirkin
2019-02-12 14:01 ` Michael S. Tsirkin
2019-02-12 16:47 ` David Riddoch
2019-02-12 17:35 ` Michael S. Tsirkin
2019-02-13 9:49 ` David Riddoch
2019-02-13 10:33 ` Jason Wang
2019-02-13 17:30 ` Michael S. Tsirkin
2019-02-14 3:34 ` Jason Wang
2019-02-14 4:04 ` Michael S. Tsirkin
2019-02-19 6:33 ` Jason Wang
2019-02-19 14:27 ` Michael S. Tsirkin [this message]
2019-04-30 22:41 ` Michael S. Tsirkin
2019-06-06 12:34 ` Michael S. Tsirkin
2019-06-17 14:41 ` Michael S. Tsirkin
[not found] <501110004.11631.1549031044295@oodm23.prod.google.com>
2019-02-01 17:43 ` Rob Miller
2019-02-04 5:36 ` Stefan Hajnoczi
2019-02-12 18:58 ` Michael S. Tsirkin
2019-02-12 18:55 ` Michael S. Tsirkin
2019-02-12 20:03 ` Rob Miller
2019-02-13 17:38 ` Michael S. Tsirkin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190219092333-mutt-send-email-mst@kernel.org \
--to=mst@redhat.com \
--cc=driddoch@solarflare.com \
--cc=jasowang@redhat.com \
--cc=virtio-dev@lists.oasis-open.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.