From: "Michael S. Tsirkin" <mst@redhat.com>
To: David Riddoch <driddoch@solarflare.com>
Cc: Virtio-Dev <virtio-dev@lists.oasis-open.org>
Subject: Re: [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload
Date: Mon, 11 Feb 2019 09:24:33 -0500 [thread overview]
Message-ID: <20190211080432-mutt-send-email-mst@kernel.org> (raw)
In-Reply-To: <d5d3b7f6-95a3-0d1c-4890-6e72f2e4bfc5@solarflare.com>
On Mon, Feb 11, 2019 at 08:52:01AM +0000, David Riddoch wrote:
> On 11/02/2019 07:33, Michael S. Tsirkin wrote:
> > On Fri, Feb 01, 2019 at 02:23:55PM +0000, David Riddoch wrote:
> > > All,
> > >
> > > I'd like to propose a small extension to the packed virtqueue mode. My
> > > proposal is to add an offset/wrap field, written by the driver, indicating
> > > how many available descriptors have been added to the ring.
> > >
> > > The reason for wanting this is to improve performance of hardware devices.
> > > Because of high read latency over a PCIe bus, it is important for hardware
> > > devices to read multiple ring entries in parallel. It is desirable to know
> > > how many descriptors are available prior to issuing these reads, else you
> > > risk fetching descriptors that are not yet available. As well as wasting
> > > bus bandwidth this adds complexity.
> > >
> > > I'd previously hoped that VIRTIO_F_NOTIFICATION_DATA would solve this
> > > problem,
> > Right. And this seems like the ideal solution latency-wise since
> > this pushes data out to device without need for round-trips
> > over PCIe.
>
> Yes, and I'm not proposing getting rid of it. I'd expect a PCIe device to
> use both features together.
>
> > > but we still have a problem. If you rely on doorbells to tell you
> > > how many descriptors are available, then you have to keep doorbells enabled
> > > at all times.
> > I would say right now there are two modes and device can transition
> > between them at will:
> >
> > 1. read each descriptor twice - once speculatively, once
> > to get the actual data
> >
> > optimal for driver suboptimal for device
>
> You might read each descriptor multiple times in some scenarios. Reading
> descriptors in batches is hugely important given the latency and overheads
> of PCIe (and lack of adjacent data fetching that caching gives you).
>
> > 2. enable notification for each descritor and rely on
> > these notifications
> >
> > optimal for device suboptimal for driver
> >
> > > This can result in a very high rate of doorbells with some
> > > drivers, which can become a severe bottleneck (because x86 CPUs can't emit
> > > MMIOs at very high rates).
> > Interesting. Is there any data you could share to help guide the design?
> > E.g. what's the highest rate of MMIO writes supported etc?
>
> On an E3-1270 v5 @ 3.60GHz, max rate across all cores is ~18M/s.
>
> On an E5-2637 v4 @ 3.50GHz, max rate on PCIe-local socket is ~14M/s. On
> PCIe-remote socket ~8M/s.
Is that mega bytes? Or million writes?
With 32 bit writes are we limited to 2M then?
> This doesn't just impose a limit on aggregate packet rate: If you hit this
> bottleneck then the CPU core is back-pressured by the MMIO writes, and so
> instr/cycle takes a huge hit.
>
> > > The proposed offset/wrap field allows devices to disable doorbells when
> > > appropriate, and determine the latest fill level via a PCIe read.
> > This kind of looks like a split ring though, does it not?
>
> I don't agree, because the ring isn't split. Split-ring is very painful for
> hardware because of the potentially random accesses to the descriptor table
> (including pointer chasing) which inhibit batching of descriptor fetches, as
> well as causing complexity in implementation.
That's different with IN_ORDER, right?
> Packed ring is a huge improvement on both dimensions, but introduces a
> specific pain point for hardware offload.
>
> > The issue is
> > we will again need to bounce more cache lines to communicate.
>
> You'll only bounce cache lines if the device chooses to read the idx. A PV
> device need not offer this feature. A PCIe device will, but the cost to the
> driver of writing an in-memory idx is much smaller than the cost of an MMIO,
> which is always a strongly ordered barrier on x86/64.
>
> With vDPA you ideally would have this feature enabled, and the device would
> sometimes be PV and sometimes PCIe. The PV device would ignore the new idx
> field and so cache lines would not bounce then either.
>
> Ie. The only time cache lines are shared is when sharing with a PCIe device,
> which is the scenario when this is a win.
True. OTOH on non-x86 you will need more write barriers :( It would be
natural to say driver can reuse the barrier before flags update, but note
that that would mean hardware still needs to support re-fetching if
flags value is wrong. Such re-fetch is probably rare so fine by me, but it
does add a bit more complexity.
> > So I wonder: what if we made a change to spec that would allow prefetch
> > of ring entries? E.g. you would be able to read at random and if you
> > guessed right then you can just use what you have read, no need to
> > re-fetch?
>
> Unless I've misunderstood I think this would imply that the driver would
> have to ensure strict ordering for each descriptor it wrote, which would
> impose a cost to the driver. At the moment a write barrier is only needed
> before writing the flags of the first descriptor in a batch.
On non-x86 right? OTOH the extra data structure also adds more ordering
requirements.
> > > I suggest the best place to put this would be in the driver area,
> > > immediately after the event suppression structure.
> > Could you comment on why is that a good place though?
>
> The new field is written by the driver, as are the other fields in the
> driver area. Also I expect that devices might be able to read this new idx
> together with the interrupt suppression fields in a single read, reducing
> PCIe overheads.
OK then you need it in the same aligned 256 byte boundary.
> Placing the new field immediately after the descriptor ring would also work,
> but lose the benefit of combining reads, and potentially cause drivers to
> allocate a substantially bigger buffer (as I expect the descriptor ring is
> typically a power-of-2 size and drivers allocate multiples of page size).
>
> Placing the new field in a separate data structure is undesirable, as it
> would require devices to store a further 64bits per virt-queue.
>
> Thanks,
> David
>
>
> > > Presumably we would like this to be an optional feature, as implementations
> > > of packed mode already exist in the wild. How about
> > > VIRTIO_F_RING_PACKED_AVAIL_IDX?
> > > If I prepare a patch to the spec is there still time to get this into v1.1?
> > Any new feature would require another round of public review.
> > I personally think it's better to instead try to do 1.2 soon after,
> > e.g. we could try to target quarterly releases.
> >
> > But ultimately that would be up to the TC vote.
>
>
> --
> David Riddoch <driddoch@solarflare.com> -- Chief Architect, Solarflare
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
---------------------------------------------------------------------
To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
next prev parent reply other threads:[~2019-02-11 14:24 UTC|newest]
Thread overview: 37+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-02-01 14:23 [virtio-dev] RFC: Doorbell suppression, packed-ring mode and hardware offload David Riddoch
2019-02-11 7:33 ` Michael S. Tsirkin
2019-02-11 8:52 ` David Riddoch
2019-02-11 14:24 ` Michael S. Tsirkin [this message]
2019-02-11 14:58 ` David Riddoch
2019-02-12 5:08 ` Michael S. Tsirkin
2019-02-12 7:28 ` Jason Wang
2019-02-12 13:44 ` Michael S. Tsirkin
2019-02-13 10:00 ` Jason Wang
2019-02-13 15:20 ` Michael S. Tsirkin
2019-02-14 3:21 ` Jason Wang
2019-02-14 3:41 ` Michael S. Tsirkin
2019-02-15 3:59 ` Jason Wang
2019-02-15 4:23 ` Michael S. Tsirkin
2019-02-19 6:21 ` Jason Wang
2019-02-19 14:18 ` Michael S. Tsirkin
2019-02-12 11:40 ` David Riddoch
2019-02-12 12:51 ` Michael S. Tsirkin
2019-02-12 14:01 ` Michael S. Tsirkin
2019-02-12 16:47 ` David Riddoch
2019-02-12 17:35 ` Michael S. Tsirkin
2019-02-13 9:49 ` David Riddoch
2019-02-13 10:33 ` Jason Wang
2019-02-13 17:30 ` Michael S. Tsirkin
2019-02-14 3:34 ` Jason Wang
2019-02-14 4:04 ` Michael S. Tsirkin
2019-02-19 6:33 ` Jason Wang
2019-02-19 14:27 ` Michael S. Tsirkin
2019-04-30 22:41 ` Michael S. Tsirkin
2019-06-06 12:34 ` Michael S. Tsirkin
2019-06-17 14:41 ` Michael S. Tsirkin
[not found] <501110004.11631.1549031044295@oodm23.prod.google.com>
2019-02-01 17:43 ` Rob Miller
2019-02-04 5:36 ` Stefan Hajnoczi
2019-02-12 18:58 ` Michael S. Tsirkin
2019-02-12 18:55 ` Michael S. Tsirkin
2019-02-12 20:03 ` Rob Miller
2019-02-13 17:38 ` Michael S. Tsirkin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20190211080432-mutt-send-email-mst@kernel.org \
--to=mst@redhat.com \
--cc=driddoch@solarflare.com \
--cc=virtio-dev@lists.oasis-open.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox