From: "Michael S. Tsirkin" <mst@redhat.com>
To: Jason Wang <jasowang@redhat.com>
Cc: David Riddoch <driddoch@solarflare.com>,
Virtio-Dev <virtio-dev@lists.oasis-open.org>
Subject: Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
Date: Tue, 19 Feb 2019 09:27:44 -0500 [thread overview]
Message-ID: <20190219092333-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <e6535c70-32ed-1213-3f2c-fcfc05ec107a@redhat.com>
On Tue, Feb 19, 2019 at 02:33:04PM +0800, Jason Wang wrote:
>
> On 2019/2/14 下午12:04, Michael S. Tsirkin wrote:
> > On Thu, Feb 14, 2019 at 11:34:22AM +0800, Jason Wang wrote:
> > > On 2019/2/14 上午1:30, Michael S. Tsirkin wrote:
> > > > On Wed, Feb 13, 2019 at 06:33:50PM +0800, Jason Wang wrote:
> > > > > On 2019/2/13 上午1:35, Michael S. Tsirkin wrote:
> > > > > > On Tue, Feb 12, 2019 at 04:47:08PM +0000, David Riddoch wrote:
> > > > > > > > > > > I would prefer to have the write barrier before writing the idx.
> > > > > > > > > > Well that's driver overhead for something device might never utilise in
> > > > > > > > > > a given workload. If we are optimizing let's optimize for speed.
> > > > > > > > > I think doing the barrier before writing idx is best for speed (see below).
> > > > > > > > I don't see it below:(
> > > > > > > Sorry, I'm not being clear. If the write barrier is before the idx, then a
> > > > > > > PV device can read the idx, do a single rmb and read a whole bunch of
> > > > > > > descriptors. As things stand today a PV device has to do an rmb for each
> > > > > > > descriptor that it reads.
> > > > > > >
> > > > > > > I'm not sure, but this may be what Jason meant when he said "prefetch".
> > > > > Yes.
> > > > >
> > > > >
> > > > > > Oh. Right. So for PV it's easy to convert an rmb to a data dependency
> > > > > > instead. It remains to be seen whether an extra cache miss per
> > > > > > batch is cheaper or more expensive than such a dependency
> > > > > > per descriptor.
> > > > > >
> > > > > > I hope Jason can experiment and let us know.
> > > > > >
> > > > > >
> > > > > To clarify, actually I did play someting like:
> > > > >
> > > > >
> > > > > if (desc_is_avail((last_avail + 32) % size))
> > > > >
> > > > > [last_avail, last_avail + 32] is avail [1]
> > > > >
> > > > > else if (desc_is_avail(last_avail + 1) % size))
> > > > >
> > > > > last_avail is avail
> > > > >
> > > > > else
> > > > >
> > > > > no new
> > > > >
> > > > >
> > > > > And looks like [1] depends on the driver to do:
> > > > >
> > > > > for all descriptors
> > > > >
> > > > > update desc (but not make it available for the flag)
> > > > >
> > > > > wmb()
> > > > >
> > > > > for all descriptors
> > > > >
> > > > > update desc.flag
> > > > >
> > > > >
> > > > > This means if last_avail + 32 is avail, no need to check flags of
> > > > > [last_avail, last_avail + 31] since the descriptor content (except for the
> > > > > flag) is guaranteed to be correct.
> > > > We can try looking into that, sure. It's not just wmb though which is
> > > > anyway free on x86. There are cases of e.g. chained descriptors
> > > > to consider where we do not want device to see a partial chain.
> > >
> > > If we saw the last descriptor is in a chain, it's still doesn't harm, just
> > > read more?
> > So it's possible to add a variant that sticks a wmb after each flag
> > update for sure. However if you are writing out batches that's an
> > upfront cost and what you gain is prefetch on the other side which is
> > speculative.
>
>
> Yes.
>
>
> >
> >
> > > >
> > > > > I get ~10% PPS improvement.
> > > > >
> > > > > Thanks
> > > > I suspect most of the gain isn't in the check at all though.
> > >
> > > The gain was:
> > >
> > > 1) batching reading descriptors, saving the times of userspace access
> > Yes that needs to be fixed. The way linux vhost driver does userspace
> > accesses right now with stac/clac and memory barriers within a loop is
> > plain silly. Designing a host/guest interface to work around that kind
> > of inefficiency makes no sense, it's purely an implementation issue.
>
>
> Actually, it has other advantages, consider the case when driver is faster,
> batch reading may reduce the cache contention on the descriptor ring (when
> the ring is almost full).
A shared index implies always doing a cache miss though.
By comparison speculating a read ahead of a cache line
will sometimes bring you a cache.
>
> >
> > > 2) reduce the rmb(), otherwise you need rmb per descriptor
> > Well not really, you can
> > - validate a batch then do a single rmb
> > - replace rmb with a dependency
> >
> > > > It's probably reading descriptors and maybe loop unrolling.
> > >
> > > I'm not quite understand why unrolling help in this case.
> > The amount of operations in this loop is tiny but not tiny
> > enough for gcc to decide it's free. Take a look
> >
> >
> > #include <stdio.h>
> >
> > struct desc {
> > unsigned long long a;
> > unsigned len;
> > unsigned short type;
> > unsigned short flags;
> > };
> > #ifdef UNROLL
> > __attribute__((optimize("unroll-loops")))
> > #endif
> > int foo(struct desc *desc)
> > {
> > unsigned short aflags = 0xffff;
> > unsigned short oflags = 0x0;
> >
> > int i;
> >
> > for (i=0; i < 4; ++i) {
> > aflags &= desc[i].flags;
> > oflags |= desc[i].flags;
> > }
> >
> > return aflags == oflags;
> > }
> >
> > $ gcc -Wall -DUNROLL -O2 -c l.c
>
>
> I think you mean without -DUNROLL.
Since there's a jump I think you are right.
>
> >
> > 0000000000000000 <foo>:
> > foo():
> > 0: 48 8d 47 0e lea 0xe(%rdi),%rax
> > 4: 31 c9 xor %ecx,%ecx
> > 6: 48 83 c7 4e add $0x4e,%rdi
> > a: be ff ff ff ff mov $0xffffffff,%esi
> > f: 0f b7 10 movzwl (%rax),%edx
> > 12: 48 83 c0 10 add $0x10,%rax
> > 16: 21 d6 and %edx,%esi
> > 18: 09 d1 or %edx,%ecx
> > 1a: 48 39 f8 cmp %rdi,%rax
> > 1d: 75 f0 jne f <foo+0xf>
> > 1f: 31 c0 xor %eax,%eax
> > 21: 66 39 ce cmp %cx,%si
> > 24: 0f 94 c0 sete %al
> > 27: c3 retq
> >
> >
> >
> > $ gcc -Wall -DUNROLL -O2 -c l.c
> >
> > Disassembly of section .text:
> >
> > 0000000000000000 <foo>:
> > foo():
> > 0: 0f b7 47 1e movzwl 0x1e(%rdi),%eax
> > 4: 0f b7 77 2e movzwl 0x2e(%rdi),%esi
> > 8: 0f b7 4f 0e movzwl 0xe(%rdi),%ecx
> > c: 89 c2 mov %eax,%edx
> > e: 09 f0 or %esi,%eax
> > 10: 21 f2 and %esi,%edx
> > 12: 09 c8 or %ecx,%eax
> > 14: 21 ca and %ecx,%edx
> > 16: 0f b7 4f 3e movzwl 0x3e(%rdi),%ecx
> > 1a: 09 c8 or %ecx,%eax
> > 1c: 21 ca and %ecx,%edx
> > 1e: 66 39 c2 cmp %ax,%dx
> > 21: 0f 94 c0 sete %al
> > 24: 0f b6 c0 movzbl %al,%eax
> > 27: c3 retq
> >
> >
> > See? gcc just does not unroll agressively enough.
>
>
> Ok, so this is just for avoiding branches.
Which is important for a variety of reasons, e.g. linux adds in tricks
to block speculation so branches might not be predicted even on intel.
A pipeline stall on each read isn't good for speed.
>
> >
> > > > Something like:
> > > >
> > > > __attribute__((optimize("unroll-loops")))
> > > > bool fast_get_descriptors(vq, desc)
> > > > {
> > > >
> > > > copy X descriptors
> > > > u16 aflags = 0xffff;
> > > > u16 oflags = 0x0;
> > > > for (i = 0; i < X; ++i) {
> > > > aflags &= desc[i].flags;
> > > > oflags |= desc[i].flags;
> > > > }
> > > > /* all flags must be same */
> > > > if (aflags != oflags)
> > > > return false;
> > > >
> > > > return is_avail(vq->wrapcount, aflags);
> > > > }
> > >
> > > This may work. It should be a little bit less efficient but without any
> > > assumption of driver if I understand correctly.
> > >
> > >
> > > >
> > > > I also think 32 is very aggressive, I would start with 8.
> > >
> > > Maybe, or adding some heuristic to predict it.
> > >
> > > Thanks
> > The reason I say 8 is because it's two cache lines which
> > is typically what cpu pulls in anyway (1+linear pfetch).
> > Anything more might have more cost as it might
> > put more pressure on the cache. Further
> > static sized short loops can be unrolled by compiler if
> > you ask it nicely, which saves a bunch of branching,
> > speculation etc. Dynamic sized ones can't,
> > you end up with extra math and branches in the loop.
>
>
> You're probably right here, but need some benchmark to check.
>
> Thanks
>
>
> >
> >
> > > >
> > > > With above, you could try:
> > > >
> > > > if (fast_get_descriptors(vq, desc))
> > > > [last_avail, last_avail + 32] is avail [1]
> > > > else if (desc_is_avail(last_avail + 1) % size))
> > > > last_avail is avail
> > > > else
> > > > no new
> > > >
> > > >
---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
next prev parent reply other threads:[~2019-02-19 14:27 UTC|newest]
Thread overview: 37+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-02-01 14:23 [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload David Riddoch
2019-02-11 7:33 ` Michael S. Tsirkin
2019-02-11 8:52 ` David Riddoch
2019-02-11 14:24 ` Michael S. Tsirkin
2019-02-11 14:58 ` David Riddoch
2019-02-12 5:08 ` Michael S. Tsirkin
2019-02-12 7:28 ` Jason Wang
2019-02-12 13:44 ` Michael S. Tsirkin
2019-02-13 10:00 ` Jason Wang
2019-02-13 15:20 ` Michael S. Tsirkin
2019-02-14 3:21 ` Jason Wang
2019-02-14 3:41 ` Michael S. Tsirkin
2019-02-15 3:59 ` Jason Wang
2019-02-15 4:23 ` Michael S. Tsirkin
2019-02-19 6:21 ` Jason Wang
2019-02-19 14:18 ` Michael S. Tsirkin
2019-02-12 11:40 ` David Riddoch
2019-02-12 12:51 ` Michael S. Tsirkin
2019-02-12 14:01 ` Michael S. Tsirkin
2019-02-12 16:47 ` David Riddoch
2019-02-12 17:35 ` Michael S. Tsirkin
2019-02-13 9:49 ` David Riddoch
2019-02-13 10:33 ` Jason Wang
2019-02-13 17:30 ` Michael S. Tsirkin
2019-02-14 3:34 ` Jason Wang
2019-02-14 4:04 ` Michael S. Tsirkin
2019-02-19 6:33 ` Jason Wang
2019-02-19 14:27 ` Michael S. Tsirkin [this message]
2019-04-30 22:41 ` Michael S. Tsirkin
2019-06-06 12:34 ` Michael S. Tsirkin
2019-06-17 14:41 ` Michael S. Tsirkin
[not found] <501110004.11631.1549031044295@oodm23.prod.google.com>
2019-02-01 17:43 ` Rob Miller
2019-02-04 5:36 ` Stefan Hajnoczi
2019-02-12 18:58 ` Michael S. Tsirkin
2019-02-12 18:55 ` Michael S. Tsirkin
2019-02-12 20:03 ` Rob Miller
2019-02-13 17:38 ` Michael S. Tsirkin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190219092333-mutt-send-email-mst@kernel.org \
--to=mst@redhat.com \
--cc=driddoch@solarflare.com \
--cc=jasowang@redhat.com \
--cc=virtio-dev@lists.oasis-open.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox