From: Stefan Hajnoczi <stefanha@redhat.com>
To: Ilya Maximets <i.maximets@ovn.org>
Cc: qemu-devel@nongnu.org, Hanna Reitz <hreitz@redhat.com>,
qemu-block@nongnu.org, Julia Suvorova <jusual@redhat.com>,
Aarushi Mehta <mehta.aaru20@gmail.com>,
Stefano Garzarella <sgarzare@redhat.com>,
Kevin Wolf <kwolf@redhat.com>,
"Michael S. Tsirkin" <mst@redhat.com>,
Jason Wang <jasowang@redhat.com>
Subject: Re: [PATCH 1/2] virtio: use blk_io_plug_call() in virtio_irqfd_notify()
Date: Wed, 16 Aug 2023 17:16:39 -0400
Message-ID: <20230816211639.GD3454448@fedora>
In-Reply-To: <3e2c2433-df5c-152e-70de-7c6cfe7ea4af@ovn.org>
On Wed, Aug 16, 2023 at 08:30:58PM +0200, Ilya Maximets wrote:
> On 8/16/23 17:30, Stefan Hajnoczi wrote:
> > On Wed, Aug 16, 2023 at 03:36:32PM +0200, Ilya Maximets wrote:
> >> On 8/15/23 14:08, Stefan Hajnoczi wrote:
> >>> virtio-blk and virtio-scsi invoke virtio_irqfd_notify() to send Used
> >>> Buffer Notifications from an IOThread. This involves an eventfd
> >>> write(2) syscall. Calling this repeatedly when completing multiple I/O
> >>> requests in a row is wasteful.
> >>
> >> Hi, Stefan. This is an interesting change!
> >>
> >> There is more or less exactly the same problem with fast network backends,
> >> and I was playing around with similar ideas in this area while working on
> >> the af-xdp network backend recently. Primarily, implementing an Rx BH for
> >> the virtio-net device and taking the RCU lock once before passing packets
> >> from the backend to the device one by one.
> >>
> >>>
> >>> Use the blk_io_plug_call() API to batch together virtio_irqfd_notify()
> >>> calls made during Linux AIO (aio=native) or io_uring (aio=io_uring)
> >>> completion processing. Do not modify the thread pool (aio=threads) to
> >>> avoid introducing a dependency from util/ onto the block layer.
> >>
> >> But you're introducing a dependency from generic virtio code onto the
> >> block layer in this patch. This seems to break the module abstraction.
> >>
> >> It looks like there are 2 options to resolve the semantics issue here:
> >
> > Yes, it's a layering violation.
> >
> >>
> >> 1. Move virtio_notify_irqfd() from virtio.c down to the block layer.
> >> The block layer is the only user, so that may be justified, but it
> >> doesn't seem like a particularly good solution. (I'm actually not
> >> sure why block devices are the only ones using this function...)
> >
> > Yes, this is the easiest way to avoid the layering violation for now.
> >
> > The virtio_notify_irqfd() API is necessary when running in an IOThread
> > because the normal QEMU irq API must run under the Big QEMU Lock. Block
> > devices are the only ones that raise interrupts from IOThreads at the
> > moment.
>
> Ack. Thanks for the explanation!
>
> >
> >>
> >> 2. Move and rename the block/plug library somewhere generic. The plug
> >> library doesn't seem to have any dependencies on the block layer, other
> >> than the name, so it should not be hard to generalize it (well, the
> >> naming might be hard).
> >
> > Yes, it should be possible to make it generic quite easily. I will give
> > this a try in the next version of the patch.
>
> OK. Sounds good to me.
>
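For reference, the semantics that would carry over to a generic version
are simple enough that the library really is self-contained. Here is a
minimal sketch of what blk_io_plug()/blk_io_unplug()/blk_io_plug_call()
do (simplified; the real code in block/plug.c differs in details, e.g.
it is not limited to a fixed-size list):

  #include <stddef.h>

  /* Fixed-size list for the sketch; bounds checks omitted. */
  #define MAX_CALLS 64

  typedef struct {
      void (*fn)(void *);
      void *opaque;
  } DeferredCall;

  /* All state is thread-local, so no locking is needed and there is
   * no block-layer dependency. */
  static __thread unsigned plug_depth;
  static __thread DeferredCall calls[MAX_CALLS];
  static __thread size_t ncalls;

  void blk_io_plug(void)
  {
      plug_depth++; /* sections may nest */
  }

  void blk_io_plug_call(void (*fn)(void *), void *opaque)
  {
      if (plug_depth == 0) {
          fn(opaque); /* outside a plug section: invoke immediately */
          return;
      }
      for (size_t i = 0; i < ncalls; i++) {
          if (calls[i].fn == fn && calls[i].opaque == opaque) {
              return; /* duplicate fn/opaque pair: coalesce */
          }
      }
      calls[ncalls++] = (DeferredCall){ .fn = fn, .opaque = opaque };
  }

  void blk_io_unplug(void)
  {
      if (--plug_depth == 0) {
          for (size_t i = 0; i < ncalls; i++) {
              calls[i].fn(calls[i].opaque); /* flush deferred calls */
          }
          ncalls = 0;
      }
  }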
> >
> >> In general, while looking at the plug library, it seems to do what is
> >> typically implemented via RCU frameworks - the delayed function call.
> >> The only difference is that RCU doesn't check for duplicates and the
> >> callbacks are global. It should not be hard to add new functionality
> >> to the RCU framework in order to address these, e.g. rcu_call_local() for
> >> calls that should be executed once the current thread exits its own
> >> critical section.
> >
> > This rcu_call_local() idea is unrelated to Read Copy Update, so I don't
> > think it should be part of the RCU API.
>
> Agreed.
>
> > Another deferred function call mechanism is QEMUBH. It already supports
> > coalescing. However, BHs are invoked once per AioContext event loop
> > iteration and there is no way to invoke the BH earlier. Also the BH pointer
> > needs to be passed to every function that wishes to schedule a deferred
> > call, which can be tedious (e.g. block/linux-aio.c should defer the
> > io_submit(2) syscall until the end of virtio-blk request processing -
> > there are a lot of layers of code between those two components).
> >
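For comparison, the BH pattern would look roughly like this (a sketch
assuming an existing AioContext *ctx; notify_bh and bh are illustrative
names, not real QEMU identifiers):

  static void notify_bh(void *opaque)
  {
      EventNotifier *notifier = opaque;
      event_notifier_set(notifier);
  }

  /* setup, once: */
  QEMUBH *bh = aio_bh_new(ctx, notify_bh, &vq->guest_notifier);

  /* completion path, possibly hit many times per event loop run;
   * scheduling an already-scheduled BH is a no-op, so the calls
   * coalesce into one notify_bh() per loop iteration: */
  qemu_bh_schedule(bh);

That bh pointer is exactly what would have to be threaded through every
layer of code between the two components mentioned above.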
> >>
> >> Using RCU for non-RCU-protected things might be considered an abuse.
> >> However, we might solve two issues in one shot if, instead of entering
> >> a blk_io_plug/unplug section, we enter an RCU critical section and
> >> call callbacks at the exit. The first issue is the notification batching
> >> that this patch is trying to fix, the second is an excessive number of
> >> thread fences on RCU exits every time virtio_notify_irqfd() and other
> >> virtio functions are invoked. The second issue can be avoided by using
> >> RCU_READ_LOCK_GUARD() in completion functions. Not sure if that will
> >> improve performance, but it definitely removes a lot of noise from the
> >> perf top for network backends. This makes the code a bit less explicit,
> >> though; the lock guard would definitely need a comment. Then again, the
> >> reason for the blk_io_plug() calls is not fully clear from the module
> >> code alone either.
> >
> > util/aio-posix.c:run_poll_handlers() has a top-level
> > RCU_READ_LOCK_GUARD() for this reason.
>
> Nice, didn't know that.
>
> > Maybe we should do the same
> > around aio_bh_poll() + aio_dispatch_ready_handlers() in
> > util/aio-posix.c:aio_poll()? The downside is that latency-sensitive
> > call_rcu() callbacks perform worse.
>
> "latency-sensitive call_rcu() callback" is a bit of an oxymoron.
> There is no real way to tell when the other thread will exit the
> critical section. But I'm not familiar enough with the code to
> make a decision here.
I see your point :).
Another issue occurred to me: drain_call_rcu() is invoked by some QEMU
monitor commands, and the monitor runs in QEMU's event loop.
drain_call_rcu() waits for the call_rcu thread, which needs
synchronize_rcu() to finish, so I guess it would hang if called with the
RCU read lock held. That is a potential deadlock, though it might be
solvable.
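The problematic nesting would look something like this (hypothetical
call chain, not an observed hang):

  aio_poll()                      <- would now hold the RCU read lock
    -> aio_bh_poll()
      -> monitor command handler
        -> drain_call_rcu()
          -> waits for the call_rcu thread
            -> waits for synchronize_rcu()
              -> waits for all readers to leave their read-side
                 critical sections, including the one aio_poll()
                 itself is holding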
>
> >>
> >> I'm not sure what the best way forward is. I'm still trying to figure
> >> that out myself, since a similar solution will eventually be needed for
> >> network backends and so it might be better to have something more generic.
> >> What do you think?
> >
> > I suggest moving blk_io_plug() to util/ and renaming the API to
> > defer_begin()/defer_end()/defer_call(). The interface and implementation
> > can also be improved as needed for other users.
>
> Sounds good to me.
>
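To make the naming concrete, here is a header-level sketch of the
generalized interface (hypothetical util/ header; tentative names from
above, signatures assumed to mirror the existing blk_io_plug() family,
and the final API may well differ):

  /* util/defer-call.h (hypothetical) */

  /* Enter a deferral section; sections may nest. */
  void defer_begin(void);

  /* Leave a deferral section; when the outermost section ends, run
   * all deferred calls. */
  void defer_end(void);

  /* Defer fn(opaque) until defer_end(), or invoke it immediately when
   * called outside a defer_begin()/defer_end() section. Duplicate
   * fn/opaque pairs are coalesced into a single call. */
  void defer_call(void (*fn)(void *), void *opaque);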
> >
> >>
> >>>
> >>> Behavior is unchanged for emulated devices that do not use blk_io_plug()
> >>> since blk_io_plug_call() immediately invokes the callback when called
> >>> outside a blk_io_plug()/blk_io_unplug() region.
> >>>
> >>> fio rw=randread bs=4k iodepth=64 numjobs=8 IOPS increases by ~9% with a
> >>> single IOThread and 8 vCPUs. iodepth=1 decreases by ~1% but this could
> >>> be noise. Detailed performance data and configuration specifics are
> >>> available here:
> >>> https://gitlab.com/stefanha/virt-playbooks/-/tree/blk_io_plug-irqfd
> >>>
> >>> This duplicates the BH that virtio-blk uses for batching. The next
> >>> commit will remove it.
> >>
> >> I'm likely missing something, but could you explain why it is safe to
> >> batch unconditionally here? The current BH code, as you mentioned in
> >> the second patch, only batches if EVENT_IDX is not set.
> >> Maybe worth adding a few words in the commit message for people like
> >> me, who are a bit rusty on QEMU/virtio internals. :)
> >
> > The BH is conditional on EVENT_IDX for performance, not for
> > correctness/safety.
>
> I see. Thanks!
> Do you think using a deferred call in the non-irqfd virtio_notify()
> would make sense? Not in this patch set, but in general, e.g. for
> future use from a network backend.
Yes, a deferred call may improve performance in the non-irqfd case too.
> > Commit 12c1c7d7cefb4dbffeb5712e75a33e4692f0a76b
> > ("virtio-blk: dataplane: Don't batch notifications if EVENT_IDX is
> > present") argued that VIRTIO's EVENT_IDX feature already provides
> > batching so the BH is not needed in that case.
> >
> > Actually, I'm not sure exactly why deferring virtio_notify_irqfd() helps,
> > but IOPS improves on my machine. One theory:
> >
> > Linux drivers/virtio/virtio_ring.c:virtqueue_get_buf_ctx_split() (or the
> > end of the irq handler function) updates EVENT_IDX so that QEMU will
> > inject an interrupt again when the next request is completed. If the
> > vCPU and QEMU are running in parallel, then N requests could result in N
> > interrupts when the vCPU updates EVENT_IDX before QEMU completes the
> > next request. Maybe that's why I see better performance when deferring
> > virtio_notify_irqfd().
>
> Sounds very plausible. I actually see only latency and interrupt
> count measurements in either 12c1c7d7cefb ("virtio-blk: dataplane:
> Don't batch notifications if EVENT_IDX is present") or 5b2ffbe4d998
> ("virtio-blk: dataplane: notify guest as a batch") as well as in
> mailing list discussions about these changes. So, it's possible that
> the impact of batching on throughput in terms of IOPS was not actually
> measured before (it probably was, but not published).
>
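To spell out the interleaving I have in mind (hypothetical; this is
just the theory above, not something I have traced):

  IOThread (QEMU)                    vCPU (guest driver)
  ---------------                    -------------------
  complete request 1
  EVENT_IDX armed -> inject irq
                                     handle irq, pop used buffer 1,
                                     re-arm EVENT_IDX past buffer 1
  complete request 2
  EVENT_IDX armed -> inject irq
  ...                                ... (N requests -> N interrupts)

With the notification deferred, requests 1..N all complete before the
guest can re-arm EVENT_IDX, so one interrupt covers the whole batch.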
> >
> > (I'm thinking of experimenting with rate limiting irq injection so that
> > it's possible to batch even more efficiently, but I worry that it will
> > be hard to avoid latency vs throughput regressions.)
> >
> > Stefan
> >
> >>
> >> Best regards, Ilya Maximets.
> >>
> >>>
> >>> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
> >>> ---
> >>> block/io_uring.c | 6 ++++++
> >>> block/linux-aio.c | 4 ++++
> >>> hw/virtio/virtio.c | 10 +++++++++-
> >>> 3 files changed, 19 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/block/io_uring.c b/block/io_uring.c
> >>> index 69d9820928..749cf83934 100644
> >>> --- a/block/io_uring.c
> >>> +++ b/block/io_uring.c
> >>> @@ -124,6 +124,9 @@ static void luring_process_completions(LuringState *s)
> >>> {
> >>> struct io_uring_cqe *cqes;
> >>> int total_bytes;
> >>> +
> >>> + blk_io_plug();
> >>> +
> >>> /*
> >>> * Request completion callbacks can run the nested event loop.
> >>> * Schedule ourselves so the nested event loop will "see" remaining
> >>> @@ -216,7 +219,10 @@ end:
> >>> aio_co_wake(luringcb->co);
> >>> }
> >>> }
> >>> +
> >>> qemu_bh_cancel(s->completion_bh);
> >>> +
> >>> + blk_io_unplug();
> >>> }
> >>>
> >>> static int ioq_submit(LuringState *s)
> >>> diff --git a/block/linux-aio.c b/block/linux-aio.c
> >>> index 561c71a9ae..cef3d6b1c7 100644
> >>> --- a/block/linux-aio.c
> >>> +++ b/block/linux-aio.c
> >>> @@ -204,6 +204,8 @@ static void qemu_laio_process_completions(LinuxAioState *s)
> >>> {
> >>> struct io_event *events;
> >>>
> >>> + blk_io_plug();
> >>> +
> >>> /* Reschedule so nested event loops see currently pending completions */
> >>> qemu_bh_schedule(s->completion_bh);
> >>>
> >>> @@ -230,6 +232,8 @@ static void qemu_laio_process_completions(LinuxAioState *s)
> >>> * own `for` loop. If we are the last all counters droped to zero. */
> >>> s->event_max = 0;
> >>> s->event_idx = 0;
> >>> +
> >>> + blk_io_unplug();
> >>> }
> >>>
> >>> static void qemu_laio_process_completions_and_submit(LinuxAioState *s)
> >>> diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
> >>> index 309038fd46..a691e8526b 100644
> >>> --- a/hw/virtio/virtio.c
> >>> +++ b/hw/virtio/virtio.c
> >>> @@ -28,6 +28,7 @@
> >>> #include "hw/virtio/virtio-bus.h"
> >>> #include "hw/qdev-properties.h"
> >>> #include "hw/virtio/virtio-access.h"
> >>> +#include "sysemu/block-backend.h"
> >>> #include "sysemu/dma.h"
> >>> #include "sysemu/runstate.h"
> >>> #include "virtio-qmp.h"
> >>> @@ -2426,6 +2427,13 @@ static bool virtio_should_notify(VirtIODevice *vdev, VirtQueue *vq)
> >>> }
> >>> }
> >>>
> >>> +/* Batch irqs while inside a blk_io_plug()/blk_io_unplug() section */
> >>> +static void virtio_notify_irqfd_unplug_fn(void *opaque)
> >>> +{
> >>> + EventNotifier *notifier = opaque;
> >>> + event_notifier_set(notifier);
> >>> +}
> >>> +
> >>> void virtio_notify_irqfd(VirtIODevice *vdev, VirtQueue *vq)
> >>> {
> >>> WITH_RCU_READ_LOCK_GUARD() {
> >>> @@ -2452,7 +2460,7 @@ void virtio_notify_irqfd(VirtIODevice *vdev, VirtQueue *vq)
> >>> * to an atomic operation.
> >>> */
> >>> virtio_set_isr(vq->vdev, 0x1);
> >>> - event_notifier_set(&vq->guest_notifier);
> >>> + blk_io_plug_call(virtio_notify_irqfd_unplug_fn, &vq->guest_notifier);
>
> There is a trace call in this function. Should we move it or add a new
> trace call to the callback? The trace used to fire right before the
> actual notification; now that the notification is delayed, the trace
> is potentially less useful.
>
> >>> }
> >>>
> >>> static void virtio_irq(VirtQueue *vq)
> >>
>