From: "Michael S. Tsirkin" <mst@redhat.com>
To: David Riddoch <driddoch@solarflare.com>
Cc: Virtio-Dev <virtio-dev@lists.oasis-open.org>
Subject: Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
Date: Mon, 11 Feb 2019 09:24:33 -0500 [thread overview]
Message-ID: <20190211080432-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <d5d3b7f6-95a3-0d1c-4890-6e72f2e4bfc5@solarflare.com>
On Mon, Feb 11, 2019 at 08:52:01AM +0000, David Riddoch wrote:
> On 11/02/2019 07:33, Michael S. Tsirkin wrote:
> > On Fri, Feb 01, 2019 at 02:23:55PM +0000, David Riddoch wrote:
> > > All,
> > >
> > > I'd like to propose a small extension to the packed virtqueue mode. My
> > > proposal is to add an offset/wrap field, written by the driver, indicating
> > > how many available descriptors have been added to the ring.
> > >
> > > The reason for wanting this is to improve performance of hardware devices.
> > > Because of high read latency over a PCIe bus, it is important for hardware
> > > devices to read multiple ring entries in parallel. It is desirable to know
> > > how many descriptors are available prior to issuing these reads, else you
> > > risk fetching descriptors that are not yet available. As well as wasting
> > > bus bandwidth this adds complexity.
> > >
> > > I'd previously hoped that VIRTIO_F_NOTIFICATION_DATA would solve this
> > > problem,
> > Right. And this seems like the ideal solution latency-wise since
> > this pushes data out to device without need for round-trips
> > over PCIe.
>
> Yes, and I'm not proposing getting rid of it. I'd expect a PCIe device to
> use both features together.
>
> > > but we still have a problem. If you rely on doorbells to tell you
> > > how many descriptors are available, then you have to keep doorbells enabled
> > > at all times.
> > I would say right now there are two modes and device can transition
> > between them at will:
> >
> > 1. read each descriptor twice - once speculatively, once
> > to get the actual data
> >
> > optimal for driver suboptimal for device
>
> You might read each descriptor multiple times in some scenarios. Reading
> descriptors in batches is hugely important given the latency and overheads
> of PCIe (and lack of adjacent data fetching that caching gives you).
>
> > 2. enable notification for each descritor and rely on
> > these notifications
> >
> > optimal for device suboptimal for driver
> >
> > > This can result in a very high rate of doorbells with some
> > > drivers, which can become a severe bottleneck (because x86 CPUs can't emit
> > > MMIOs at very high rates).
> > Interesting. Is there any data you could share to help guide the design?
> > E.g. what's the highest rate of MMIO writes supported etc?
>
> On an E3-1270 v5 @ 3.60GHz, max rate across all cores is ~18M/s.
>
> On an E5-2637 v4 @ 3.50GHz, max rate on PCIe-local socket is ~14M/s. On
> PCIe-remote socket ~8M/s.
Is that mega bytes? Or million writes?
With 32 bit writes are we limited to 2M then?
> This doesn't just impose a limit on aggregate packet rate: If you hit this
> bottleneck then the CPU core is back-pressured by the MMIO writes, and so
> instr/cycle takes a huge hit.
>
> > > The proposed offset/wrap field allows devices to disable doorbells when
> > > appropriate, and determine the latest fill level via a PCIe read.
> > This kind of looks like a split ring though, does it not?
>
> I don't agree, because the ring isn't split. Split-ring is very painful for
> hardware because of the potentially random accesses to the descriptor table
> (including pointer chasing) which inhibit batching of descriptor fetches, as
> well as causing complexity in implementation.
That's different with IN_ORDER, right?
> Packed ring is a huge improvement on both dimensions, but introduces a
> specific pain point for hardware offload.
>
> > The issue is
> > we will again need to bounce more cache lines to communicate.
>
> You'll only bounce cache lines if the device chooses to read the idx. A PV
> device need not offer this feature. A PCIe device will, but the cost to the
> driver of writing an in-memory idx is much smaller than the cost of an MMIO,
> which is always a strongly ordered barrier on x86/64.
>
> With vDPA you ideally would have this feature enabled, and the device would
> sometimes be PV and sometimes PCIe. The PV device would ignore the new idx
> field and so cache lines would not bounce then either.
>
> Ie. The only time cache lines are shared is when sharing with a PCIe device,
> which is the scenario when this is a win.
True. OTOH on non-x86 you will need more write barriers :( It would be
natural to say driver can reuse the barrier before flags update, but note
that that would mean hardware still needs to support re-fetching if
flags value is wrong. Such re-fetch is probably rare so fine by me, but it
does add a bit more complexity.
> > So I wonder: what if we made a change to spec that would allow prefetch
> > of ring entries? E.g. you would be able to read at random and if you
> > guessed right then you can just use what you have read, no need to
> > re-fetch?
>
> Unless I've misunderstood I think this would imply that the driver would
> have to ensure strict ordering for each descriptor it wrote, which would
> impose a cost to the driver. At the moment a write barrier is only needed
> before writing the flags of the first descriptor in a batch.
On non-x86 right? OTOH the extra data structure also adds more ordering
requirements.
> > > I suggest the best place to put this would be in the driver area,
> > > immediately after the event suppression structure.
> > Could you comment on why is that a good place though?
>
> The new field is written by the driver, as are the other fields in the
> driver area. Also I expect that devices might be able to read this new idx
> together with the interrupt suppression fields in a single read, reducing
> PCIe overheads.
OK then you need it in the same aligned 256 byte boundary.
> Placing the new field immediately after the descriptor ring would also work,
> but lose the benefit of combining reads, and potentially cause drivers to
> allocate a substantially bigger buffer (as I expect the descriptor ring is
> typically a power-of-2 size and drivers allocate multiples of page size).
>
> Placing the new field in a separate data structure is undesirable, as it
> would require devices to store a further 64bits per virt-queue.
>
> Thanks,
> David
>
>
> > > Presumably we would like this to be an optional feature, as implementations
> > > of packed mode already exist in the wild. How about
> > > VIRTIO_F_RING_PACKED_AVAIL_IDX?
> > > If I prepare a patch to the spec is there still time to get this into v1.1?
> > Any new feature would require another round of public review.
> > I personally think it's better to instead try to do 1.2 soon after,
> > e.g. we could try to target quarterly releases.
> >
> > But ultimately that would be up to the TC vote.
>
>
> --
> David Riddoch <driddoch@solarflare.com> -- Chief Architect, Solarflare
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
next prev parent reply other threads:[~2019-02-11 14:24 UTC|newest]
Thread overview: 37+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-02-01 14:23 [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload David Riddoch
2019-02-11 7:33 ` Michael S. Tsirkin
2019-02-11 8:52 ` David Riddoch
2019-02-11 14:24 ` Michael S. Tsirkin [this message]
2019-02-11 14:58 ` David Riddoch
2019-02-12 5:08 ` Michael S. Tsirkin
2019-02-12 7:28 ` Jason Wang
2019-02-12 13:44 ` Michael S. Tsirkin
2019-02-13 10:00 ` Jason Wang
2019-02-13 15:20 ` Michael S. Tsirkin
2019-02-14 3:21 ` Jason Wang
2019-02-14 3:41 ` Michael S. Tsirkin
2019-02-15 3:59 ` Jason Wang
2019-02-15 4:23 ` Michael S. Tsirkin
2019-02-19 6:21 ` Jason Wang
2019-02-19 14:18 ` Michael S. Tsirkin
2019-02-12 11:40 ` David Riddoch
2019-02-12 12:51 ` Michael S. Tsirkin
2019-02-12 14:01 ` Michael S. Tsirkin
2019-02-12 16:47 ` David Riddoch
2019-02-12 17:35 ` Michael S. Tsirkin
2019-02-13 9:49 ` David Riddoch
2019-02-13 10:33 ` Jason Wang
2019-02-13 17:30 ` Michael S. Tsirkin
2019-02-14 3:34 ` Jason Wang
2019-02-14 4:04 ` Michael S. Tsirkin
2019-02-19 6:33 ` Jason Wang
2019-02-19 14:27 ` Michael S. Tsirkin
2019-04-30 22:41 ` Michael S. Tsirkin
2019-06-06 12:34 ` Michael S. Tsirkin
2019-06-17 14:41 ` Michael S. Tsirkin
[not found] <501110004.11631.1549031044295@oodm23.prod.google.com>
2019-02-01 17:43 ` Rob Miller
2019-02-04 5:36 ` Stefan Hajnoczi
2019-02-12 18:58 ` Michael S. Tsirkin
2019-02-12 18:55 ` Michael S. Tsirkin
2019-02-12 20:03 ` Rob Miller
2019-02-13 17:38 ` Michael S. Tsirkin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190211080432-mutt-send-email-mst@kernel.org \
--to=mst@redhat.com \
--cc=driddoch@solarflare.com \
--cc=virtio-dev@lists.oasis-open.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.