From: Stefan Hajnoczi <stefanha@redhat.com>
To: Sagi Grimberg <sagi@grimberg.me>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Qemu Developers <qemu-devel@nongnu.org>
Subject: Re: virtio-blk using a single iothread
Date: Thu, 27 Jul 2023 11:11:51 -0400
Message-ID: <20230727151151.GA970709@fedora>
In-Reply-To: <d8028f17-8d33-790b-8d3e-fa1170108774@grimberg.me>

On Sun, Jun 11, 2023 at 03:27:57PM +0300, Sagi Grimberg wrote:
> 
> 
> On 6/8/23 19:08, Stefan Hajnoczi wrote:
> > On Thu, Jun 08, 2023 at 10:40:57AM +0300, Sagi Grimberg wrote:
> > > Hey Stefan, Paolo,
> > > 
> > > I just had a report from a user experiencing lower virtio-blk
> > > performance than he expected. This user is running virtio-blk on top of
> > > nvme-tcp device. The guest is running 12 CPU cores.
> > > 
> > > The guest read/write throughput is capped at around 30% of the available
> > > throughput from the host (~800MB/s from the guest vs. 2800MB/s from the
> > > host - 25Gb/s nic). The workload running on the guest is a
> > > multi-threaded fio workload.
> > > 
> > > What is observed is the fact that virtio-blk is using a single disk-wide
> > > iothread processing all the vqs. Specifically nvme-tcp (similar to other
> > > tcp based protocols) is negatively impacted by lack of thread
> > > concurrency that can distribute I/O requests to different TCP
> > > connections.
> > > 
> > > > We also attempted to move the iothread to a dedicated core, however that
> > > > did not yield any meaningful performance improvements. The reason appears
> > > to be less about CPU utilization on the iothread core, but more around
> > > single TCP connection serialization.
> > > 
> > > Moving to io=threads does increase the throughput, however sacrificing
> > > latency significantly.
> > > 
> > > > So the user finds themselves with available host cpus and TCP connections
> > > that it could easily use to get maximum throughput, without the ability
> > > to leverage them. True, other guests will use different
> > > threads/contexts, however the goal here is to allow the full performance
> > > from a single device.
> > > 
> > > I've seen several discussions and attempts in the past to allow a
> > > virtio-blk device leverage multiple iothreads, but around 2 years ago
> > > the discussions over this paused. So wanted to ask, are there any plans
> > > or anything in the works to address this limitation?
> > > 
> > > I've seen that the spdk folks are heading in this direction with their
> > > vhost-blk implementation:
> > > https://review.spdk.io/gerrit/c/spdk/spdk/+/16068
> > 
> > Hi Sagi,
> > Yes, there is an ongoing QEMU multi-queue block layer effort to make it
> > possible for multiple IOThreads to process disk I/O for the same
> > --blockdev in parallel.
> 
> Great to know.
> 
> > Most of my recent QEMU patches have been part of this effort. There is a
> > work-in-progress branch that supports mapping virtio-blk virtqueues to
> > specific IOThreads:
> > https://gitlab.com/stefanha/qemu/-/commits/virtio-blk-iothread-vq-mapping
> 
> Thanks for the pointer.
> 
> > The syntax is:
> > 
> >    --device '{"driver":"virtio-blk-pci","iothread-vq-mapping":[{"iothread":"iothread0"},{"iothread":"iothread1"}],"drive":"drive0"}'
> > 
> > This says "assign virtqueues round-robin to iothread0 and iothread1".
> > Half the virtqueues will be processed by iothread0 and the other half by
> > iothread1. There is also syntax for assigning specific virtqueues to
> > each IOThread, but usually the automatic round-robin assignment is all
> > that's needed.
> > 
> > This work is not finished yet. Basic I/O (e.g. fio) works without
> > crashes, but expect to hit issues if you use blockjobs, hotplug, etc.
> > 
> > Performance optimization work has just begun, so it won't deliver all
> > the benefits yet. I ran a benchmark yesterday where going from 1 to 2
> > IOThreads increased performance by 25%. That's much less than we're
> > aiming for; attaching two independent virtio-blk devices improves the
> > performance by ~100%. I know we can get there eventually. Some of the
> > bottlenecks are known (e.g. block statistics collection causes lock
> > contention) and others are yet to be investigated.
> 
> Hmm, I rebased this branch on top of mainline master and ran a naive
> test, and it seems that performance regressed quite a bit :(
> 
> I'm running this test on my laptop (Intel(R) Core(TM) i7-8650U CPU
> @1.90GHz), so this is more qualitative test for BW only.
> I use null_blk as the host device.
> 
> With mainline master I get ~9GB/s 64k randread, and with your branch
> I get ~5GB/s, this is regardless of assigning iothreads (one or
> two) or not.
> 
> my qemu command:
> taskset -c 0-3 build/qemu-system-x86_64 -cpu host -m 1G -enable-kvm -smp 4
> -drive
> file=/var/lib/libvirt/images/ubuntu-22/root-disk-clone.qcow2,format=qcow2
> -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=/dev/nullb0
> -device virtio-blk-pci,drive=drive0,scsi=off -nographic
> 
> my guest fio jobfile:
> --
> [global]
> group_reporting
> runtime=3000
> time_based
> loops=1
> direct=1
> invalidate=1
> randrepeat=0
> norandommap
> exitall
> cpus_allowed=0-3
> cpus_allowed_policy=split
> 
> [read]
> filename=/dev/vda
> numjobs=4
> iodepth=32
> bs=64k
> rw=randread
> ioengine=io_uring

Hi Sagi,
I have some news and pushed new code to my repo:
https://gitlab.com/stefanha/qemu/-/commits/virtio-blk-iothread-vq-mapping

This branch changes virtio-blk emulation to process requests in
coroutines. The reason for this change was to reduce the number of
coroutines created per request and minimize nested event loops
(AIO_WAIT_WHILE() -> aio_poll()). However, I found a performance issue
with the implementation: request coroutines were yielding and thereby
deferring request processing until later in the event loop.

The new code I pushed yesterday works around this by skipping request
serialization/tracking (bs->tracked_requests) for read requests. I only
modified the code for read requests because that's what I benchmark.
bs->tracked_requests and its lock, bs->reqs_lock, were causing contention
and coroutine yields.

A proper solution that keeps request tracking but makes it SMP-friendly
will need to be implemented, but for now this may solve the issues you
were seeing.

On my system 4 KB randread iodepth=64 numjobs=8 now achieves the same
IOPS on bare metal and in a VM. I'm not sure if this addresses the
performance issue you were seeing but there's a good chance it does.

I'll run your fio jobs and compare against qemu.git/master without my
patches.

(I also added the --device virtio-blk-pci,stats-enabled=off,... option
to skip block I/O statistics collection. The statistics data is
protected by a lock that can cause contention when multiple IOThreads
process requests for the same device. In my testing it doesn't have much
of an effect on IOPS but I can see the difference in traces of futex
syscalls.)
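
For reference, here is a minimal sketch of how the round-robin
"iothread-vq-mapping" syntax quoted earlier in this thread could be
generated for any number of IOThreads. It is only an illustration, not
code from the branch: the helper names (virtio_blk_device_arg, qemu_args,
round_robin) are hypothetical, the IOThread ids simply follow the
iothread0/iothread1 naming used above, and the modulo ordering in
round_robin() is an assumption about how "round-robin" maps queue indices.
stats-enabled is left out on purpose because its JSON spelling is not
shown in this thread.

```python
import json


def virtio_blk_device_arg(num_iothreads, drive="drive0"):
    """Build the JSON value for --device with a round-robin mapping.

    Mirrors the syntax quoted earlier in the thread; every IOThread is
    listed once and QEMU is left to distribute the virtqueues.
    """
    mapping = [{"iothread": f"iothread{i}"} for i in range(num_iothreads)]
    device = {
        "driver": "virtio-blk-pci",
        "iothread-vq-mapping": mapping,
        "drive": drive,
    }
    # Compact separators reproduce the spacing used in the quoted example.
    return json.dumps(device, separators=(",", ":"))


def qemu_args(num_iothreads, drive="drive0"):
    """Return the argument list: one iothread object per thread, then the device."""
    args = []
    for i in range(num_iothreads):
        args += ["--object", f"iothread,id=iothread{i}"]
    args += ["--device", virtio_blk_device_arg(num_iothreads, drive)]
    return args


def round_robin(num_queues, num_iothreads):
    """Illustrate the expected split (assumed ordering: queue i -> i % n)."""
    return {vq: f"iothread{vq % num_iothreads}" for vq in range(num_queues)}


if __name__ == "__main__":
    print(" ".join(qemu_args(2)))
    print(round_robin(4, 2))
```

With two IOThreads and four virtqueues, round_robin() puts queues 0 and 2
on iothread0 and queues 1 and 3 on iothread1, which matches the "half and
half" description above. The JSON value still needs shell quoting (single
quotes, as in the quoted example) when pasted into a command line.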

Thanks,
Stefan
