From: Stefan Hajnoczi <stefanha@redhat.com>
To: Sagi Grimberg <sagi@grimberg.me>
Cc: Paolo Bonzini <pbonzini@redhat.com>,
	Qemu Developers <qemu-devel@nongnu.org>
Subject: Re: virtio-blk using a single iothread
Date: Thu, 27 Jul 2023 11:11:51 -0400
Message-ID: <20230727151151.GA970709@fedora>
In-Reply-To: <d8028f17-8d33-790b-8d3e-fa1170108774@grimberg.me>

On Sun, Jun 11, 2023 at 03:27:57PM +0300, Sagi Grimberg wrote:
> 
> 
> On 6/8/23 19:08, Stefan Hajnoczi wrote:
> > On Thu, Jun 08, 2023 at 10:40:57AM +0300, Sagi Grimberg wrote:
> > > Hey Stefan, Paolo,
> > > 
> > > I just had a report from a user experiencing lower virtio-blk
> > > performance than expected. This user is running virtio-blk on top of an
> > > nvme-tcp device. The guest has 12 CPU cores.
> > > 
> > > The guest read/write throughput is capped at around 30% of the available
> > > throughput from the host (~800MB/s from the guest vs. 2800MB/s from the
> > > host - 25Gb/s nic). The workload running on the guest is a
> > > multi-threaded fio workload.
> > > 
> > > What we observe is that virtio-blk uses a single disk-wide iothread to
> > > process all the vqs. nvme-tcp in particular (like other TCP-based
> > > protocols) is negatively impacted by the lack of thread concurrency
> > > that would distribute I/O requests across different TCP connections.
> > > 
> > > We also attempted to move the iothread to a dedicated core, however that
> > > did not yield any meaningful performance improvement. The reason appears
> > > to be less about CPU utilization on the iothread core and more about
> > > serialization over a single TCP connection.
> > > 
> > > Moving to io=threads does increase the throughput, but it sacrifices
> > > latency significantly.
> > > 
> > > So the user finds itself with available host CPUs and TCP connections
> > > that could easily be used to reach maximum throughput, but without the
> > > ability to leverage them. True, other guests will use different
> > > threads/contexts, however the goal here is to allow the full performance
> > > from a single device.
> > > 
> > > I've seen several discussions and attempts in the past to let a
> > > virtio-blk device leverage multiple iothreads, but the discussions
> > > paused around 2 years ago. So I wanted to ask: are there any plans
> > > or anything in the works to address this limitation?
> > > 
> > > I've seen that the spdk folks are heading in this direction with their
> > > vhost-blk implementation:
> > > https://review.spdk.io/gerrit/c/spdk/spdk/+/16068
> > 
> > Hi Sagi,
> > Yes, there is an ongoing QEMU multi-queue block layer effort to make it
> > possible for multiple IOThreads to process disk I/O for the same
> > --blockdev in parallel.
> 
> Great to know.
> 
> > Most of my recent QEMU patches have been part of this effort. There is a
> > work-in-progress branch that supports mapping virtio-blk virtqueues to
> > specific IOThreads:
> > https://gitlab.com/stefanha/qemu/-/commits/virtio-blk-iothread-vq-mapping
> 
> Thanks for the pointer.
> 
> > The syntax is:
> > 
> >    --device '{"driver":"virtio-blk-pci","iothread-vq-mapping":[{"iothread":"iothread0"},{"iothread":"iothread1"}],"drive":"drive0"}'
> > 
> > This says "assign virtqueues round-robin to iothread0 and iothread1".
> > Half the virtqueues will be processed by iothread0 and the other half by
> > iothread1. There is also syntax for assigning specific virtqueues to
> > each IOThread, but usually the automatic round-robin assignment is all
> > that's needed.
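> > 
> > For completeness, the IOThreads themselves are created with -object
> > iothread,id=..., so a full invocation is roughly the following (a sketch
> > only; the null-co test disk and the object/node names are placeholders):
> > 
> >    --object iothread,id=iothread0 \
> >    --object iothread,id=iothread1 \
> >    --blockdev '{"driver":"null-co","node-name":"drive0","size":1073741824}' \
> >    --device '{"driver":"virtio-blk-pci","iothread-vq-mapping":[{"iothread":"iothread0"},{"iothread":"iothread1"}],"drive":"drive0"}'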
> > 
> > This work is not finished yet. Basic I/O (e.g. fio) works without
> > crashes, but expect to hit issues if you use blockjobs, hotplug, etc.
> > 
> > Performance optimization work has just begun, so it won't deliver all
> > the benefits yet. I ran a benchmark yesterday where going from 1 to 2
> > IOThreads increased performance by 25%. That's much less than we're
> > aiming for; attaching two independent virtio-blk devices improves the
> > performance by ~100%. I know we can get there eventually. Some of the
> > bottlenecks are known (e.g. block statistics collection causes lock
> > contention) and others are yet to be investigated.
> 
> Hmm, I rebased this branch on top of mainline master and ran a naive
> test, and it seems that performance regressed quite a bit :(
> 
> I'm running this test on my laptop (Intel(R) Core(TM) i7-8650U CPU
> @1.90GHz), so this is more of a qualitative test for BW only.
> I use null_blk as the host device.
> 
> With mainline master I get ~9GB/s 64k randread, and with your branch
> I get ~5GB/s, regardless of whether I assign iothreads (one or
> two) or not.
> 
> my qemu command:
> taskset -c 0-3 build/qemu-system-x86_64 -cpu host -m 1G -enable-kvm -smp 4 \
>   -drive file=/var/lib/libvirt/images/ubuntu-22/root-disk-clone.qcow2,format=qcow2 \
>   -drive if=none,id=drive0,cache=none,aio=native,format=raw,file=/dev/nullb0 \
>   -device virtio-blk-pci,drive=drive0,scsi=off -nographic
> 
> my guest fio jobfile:
> --
> [global]
> group_reporting
> runtime=3000
> time_based
> loops=1
> direct=1
> invalidate=1
> randrepeat=0
> norandommap
> exitall
> cpus_allowed=0-3
> cpus_allowed_policy=split
> 
> [read]
> filename=/dev/vda
> numjobs=4
> iodepth=32
> bs=64k
> rw=randread
> ioengine=io_uring

Hi Sagi,
I have some news and pushed new code to my repo:
https://gitlab.com/stefanha/qemu/-/commits/virtio-blk-iothread-vq-mapping

This branch changes virtio-blk emulation to process requests in
coroutines. The reason for this change was to reduce the number of
coroutines created per request and minimize nested event loops
(AIO_WAIT_WHILE() -> aio_poll()). However, I found a performance issue
with the implementation: request coroutines were yielding and thereby
deferring request processing until later in the event loop.

The new code I pushed yesterday works around this by skipping request
serialization/tracking (bs->tracked_requests) for read requests. I only
modified the code for read requests because that's what I benchmark.
bs->tracked_requests and its lock, bs->reqs_lock, were causing contention
and coroutine yields.

A proper solution that keeps request tracking but makes it SMP-friendly
will need to be implemented, but for now this may solve the issues you
were seeing.

On my system 4 KB randread iodepth=64 numjobs=8 now achieves the same
IOPS on bare metal and in a VM. I'm not sure if this addresses the
performance issue you were seeing but there's a good chance it does.
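
For reference, that corresponds to an fio run along these lines inside
the guest (a sketch; the exact invocation I used may differ slightly):

  fio --name=randread --filename=/dev/vda --direct=1 --ioengine=io_uring \
      --rw=randread --bs=4k --iodepth=64 --numjobs=8 --group_reporting \
      --runtime=60 --time_based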

I'll run your fio jobs and compare against qemu.git/master without my
patches.

(I also added the --device virtio-blk-pci,stats-enabled=off,... option
to skip block I/O statistics collection. The statistics data is
protected by a lock that can cause contention when multiple IOThreads
process requests for the same device. In my testing it doesn't have much
of an effect on IOPS but I can see the difference in traces of futex
syscalls.)
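
A sketch of that option on the command line, using the drive id from your
setup (the other device properties elided here):

  --device virtio-blk-pci,drive=drive0,stats-enabled=off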

Thanks,
Stefan
