Re: [RFC patch 0/1] block: vhost-blk backend

From: Stefan Hajnoczi <stefanha@redhat.com>
To: Andrey Zhadchenko <andrey.zhadchenko@virtuozzo.com>
Cc: qemu-block@nongnu.org, qemu-devel@nongnu.org, kwolf@redhat.com,
	hreitz@redhat.com, mst@redhat.com, den@virtuozzo.com
Subject: Re: [RFC patch 0/1] block: vhost-blk backend
Date: Wed, 5 Oct 2022 11:40:17 -0400
Message-ID: <Yz2lYcKVH553MxfM@fedora>
In-Reply-To: <cff288d8-b8b5-76ba-aa90-91ddbd2d95a8@virtuozzo.com>

On Wed, Oct 05, 2022 at 02:50:06PM +0300, Andrey Zhadchenko wrote:
> 
> 
> On 10/4/22 22:00, Stefan Hajnoczi wrote:
> > On Mon, Jul 25, 2022 at 11:55:26PM +0300, Andrey Zhadchenko wrote:
> > > Although QEMU virtio-blk is quite fast, there is still some room for
> > > improvements. Disk latency can be reduced if we handle virtio-blk requests
> > > in the host kernel so we avoid a lot of syscalls and context switches.
> > > 
> > > The biggest disadvantage of this vhost-blk flavor is that it is limited to
> > > the raw format. Luckily Kirill Thai proposed a device mapper driver for the
> > > QCOW2 format to attach files as block devices:
> > > https://www.spinics.net/lists/kernel/msg4292965.html
> > > 
> > > Also by using kernel modules we can bypass the IOThread limitation and finally
> > > scale block requests with cpus for high-performance devices. This is planned
> > > to be implemented in the next version.
> > > 
> > > Linux kernel module part:
> > > https://lore.kernel.org/kvm/20220725202753.298725-1-andrey.zhadchenko@virtuozzo.com/
> > > 
> > > test setups and results:
> > > fio --direct=1 --rw=randread  --bs=4k  --ioengine=libaio --iodepth=128
> > > QEMU drive options: cache=none
> > > filesystem: xfs
> > > 
> > > SSD:
> > >                | randread, IOPS | randwrite, IOPS |
> > > Host           |          95.8k |           85.3k |
> > > QEMU virtio    |          57.5k |           79.4k |
> > > QEMU vhost-blk |          95.6k |           84.3k |
> > > 
> > > RAMDISK (vq == vcpu):
> > >                  | randread, IOPS | randwrite, IOPS |
> > > virtio, 1vcpu    |           123k |            129k |
> > > virtio, 2vcpu    |      253k (??) |       250k (??) |
> > > virtio, 4vcpu    |           158k |            154k |
> > > vhost-blk, 1vcpu |           110k |            113k |
> > > vhost-blk, 2vcpu |           247k |            252k |
> > > vhost-blk, 4vcpu |           576k |            567k |
> > > 
> > > Andrey Zhadchenko (1):
> > >    block: add vhost-blk backend
> > > 
> > >   configure                     |  13 ++
> > >   hw/block/Kconfig              |   5 +
> > >   hw/block/meson.build          |   1 +
> > >   hw/block/vhost-blk.c          | 395 ++++++++++++++++++++++++++++++++++
> > >   hw/virtio/meson.build         |   1 +
> > >   hw/virtio/vhost-blk-pci.c     | 102 +++++++++
> > >   include/hw/virtio/vhost-blk.h |  44 ++++
> > >   linux-headers/linux/vhost.h   |   3 +
> > >   8 files changed, 564 insertions(+)
> > >   create mode 100644 hw/block/vhost-blk.c
> > >   create mode 100644 hw/virtio/vhost-blk-pci.c
> > >   create mode 100644 include/hw/virtio/vhost-blk.h
> > 
> > vhost-blk has been tried several times in the past. That doesn't mean it
> > cannot be merged this time, but past arguments should be addressed:
> > 
> > - What makes it necessary to move the code into the kernel? In the past
> >    the performance results were not very convincing. The fastest
> >    implementations actually tend to be userspace NVMe PCI drivers that
> >    bypass the kernel! Bypassing the VFS and submitting block requests
> >    directly was not a huge boost. The syscall/context switch argument
> >    sounds okay but the numbers didn't really show that kernel block I/O
> >    is much faster than userspace block I/O.
> > 
> >    I've asked for more details on the QEMU command-line to understand
> >    what your numbers show. Maybe something has changed since previous
> >    times when vhost-blk has been tried.
> > 
> >    The only argument I see is QEMU's current 1 IOThread per virtio-blk
> >    device limitation, which is currently being worked on. If that's the
> >    only reason for vhost-blk then is it worth doing all the work of
> >    getting vhost-blk shipped (kernel, QEMU, and libvirt changes)? It
> >    seems like a short-term solution.
> > 
> > - The security impact of bugs in kernel vhost-blk code is more serious
> >    than bugs in a QEMU userspace process.
> > 
> > - The management stack needs to be changed to use vhost-blk whereas
> >    QEMU can be optimized without affecting other layers.
> > 
> > Stefan
> 
> Indeed there were several vhost-blk attempts, but from what I found on the
> mailing lists only the Asias attempt got some attention and discussion.
> Ramdisk performance results were great but ramdisk is more of a benchmark
> than a real use case. I didn't find out why Asias dropped his version except
> for a vague "He concluded performance results was not worth". The storage
> speed is very important for vhost-blk performance, as there is no point in
> cutting cpu costs from 1ms to 0.1ms if the request needs 50ms to complete on
> the actual disk. I think that 10 years ago NVMe was non-existent and SSD +
> SATA was probably a lot faster than HDD but still not enough to utilize this
> technology.

Yes, it's possible that latency improvements are more noticeable now.
Thank you for posting the benchmark results. I will also run benchmarks
so we can compare vhost-blk with today's QEMU as well as multiqueue
IOThreads QEMU (for which I only have a hacky prototype) on a local NVMe
PCI SSD.
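
The latency argument quoted above can be made concrete with a little
arithmetic. The sketch below is only an illustration: it assumes total request
latency is simply software overhead plus device time with no overlap, the
function name relative_saving is made up for this example, and the 0.1ms
"fast device" figure is an assumed value rather than a measurement from this
thread.

```python
def relative_saving(overhead_before_ms, overhead_after_ms, device_ms):
    """Fraction of total request latency removed by cutting software overhead.

    Assumes total latency = software overhead + device time, with no overlap.
    """
    before = overhead_before_ms + device_ms
    after = overhead_after_ms + device_ms
    return (before - after) / before


# Figures from the quoted paragraph: overhead 1ms -> 0.1ms, device time 50ms.
slow_disk = relative_saving(1.0, 0.1, 50.0)

# Hypothetical faster device (0.1ms is an assumed value, not a measurement).
fast_disk = relative_saving(1.0, 0.1, 0.1)

print(f"50ms device:  {slow_disk:.1%} of latency saved")
print(f"0.1ms device: {fast_disk:.1%} of latency saved")
```

Under these assumptions the same overhead reduction removes under 2% of the
request latency on the 50ms device but roughly 80% on the assumed 0.1ms
device, which is why the benefit depends so strongly on storage speed.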

> The tests I did give me 60k IOPS randwrite for VM and 95k for host. And
> vhost-blk is able to negate the difference even using only 1 thread/vq/vcpu.
> And unlike the current QEMU single IOThread it can be easily scaled with the
> number of cpus/vcpus. For sure this can be solved by lifting IOThread
> limitations but this will probably be an even more disastrous amount of
> changes (and adding vhost-blk won't break old setups!).
> 
> Probably the only undisputed advantage of vhost-blk is the reduction in
> syscalls. And again the benefit really depends on storage speed, as it should
> be somehow comparable to syscall time. Also I must note that this may be
> good for high-density servers with a lot of VMs. But for now I do not have
> the exact numbers which show how much time we are really winning for a
> single request on average.
> 
> Overall vhost-blk will only become better as storage speed increases.
> 
> Also I must note that all arguments above apply to vdpa-blk. And unlike
> vhost-blk, which needs its own QEMU code, vdpa-blk can be set up with
> generic virtio-vdpa QEMU code (I am not sure if it is merged yet but still).
> Although vdpa-blk has its own problems for now.

Yes, I think that's why Stefano hasn't pushed for a software vdpa-blk
device yet despite having played with it and is more focussed on
hardware enablement. vdpa-blk has the same issues as vhost-blk.

Stefan
