From: "Daniel P. Berrange" <berrange@redhat.com>
To: Kashyap Chamarthy <kchamart@redhat.com>
Cc: qemu-devel@nongnu.org, qemu-block@nongnu.org,
	Eric Blake <eblake@redhat.com>,
	mbooth@redhat.com
Subject: Re: [Qemu-devel] [Qemu-block] RFC: use case for adding QMP, block jobs & multiple exports to qemu-nbd ?
Date: Thu, 2 Nov 2017 17:04:48 +0000
Message-ID: <20171102170448.GX32533@redhat.com>
In-Reply-To: <20171102164028.lkl3cv52stkdiywj@eukaryote>

On Thu, Nov 02, 2017 at 05:40:28PM +0100, Kashyap Chamarthy wrote:
> [Cc: Matt Booth from Nova upstream; so not snipping the email to retain
> context for Matt.]
> 
> On Thu, Nov 02, 2017 at 12:02:23PM +0000, Daniel P. Berrange wrote:
> > I've been thinking about a potential design/impl improvement for the way
> > that OpenStack Nova handles disk images when booting virtual machines, and
> > thinking if some enhancements to qemu-nbd could be beneficial...
> 
> Just read through it; a very interesting idea.  A couple of things inline.
> 
> > At a high level, OpenStack has a repository of disk images (Glance), and
> > when we go to boot a VM, Nova copies the disk image out of the repository
> > onto the local host's image cache. While doing this, Nova may also enlarge
> > the disk image (e.g. if the original image has 10GB size, it may do a
> > qemu-img resize to 40GB). Nova then creates a qcow2 overlay with a backing
> > file pointing to its local cache. Multiple VMs can be booted in parallel,
> > each with their own overlay pointing to the same backing file.
> > 
> > The problem with this approach is that VM startup is delayed while we copy
> > the disk image from the glance repository to the local cache, and again
> > while we do the image resize (though the latter is pretty quick really
> > since it's just changing metadata in the image and/or host filesystem).
> > 
> > One might suggest that we avoid the local disk copy and just point the
> > VM directly at an NBD server running in the remote image repository, but
> > this introduces a centralized point of failure. With the local disk copy
> > VMs can safely continue running even if the image repository dies. Running
> > from the local image cache can offer better performance too, particularly
> > if it is on SSD storage.
> > 
> > Conceptually what I want to start with is a 3 layer chain
> > 
> >    master-disk1.qcow2  (qemu-nbd)
> >           |
> >           |  (format=raw, proto=nbd)
> >           |
> >    cache-disk1.qcow2   (qemu-system-XXX)
> >           |
> >           |  (format=qcow2, proto=file)
> >           |
> >           +-  vm-a-disk1.qcow2   (qemu-system-XXX)
> > 
> > NB: vm-?-disk.qcow2 sizes may differ from that of the backing file.
> > Sometimes OS disk images are built with a fairly small root filesystem
> > size, and the guest OS will grow its root FS to fill the actual disk
> > size allowed to the specific VM instance.
> > 
> > The cache-disk1.qcow2 is on each local virt host that needs disk1, and
> > created when the first VM is launched. Further launched VMs can all use
> > this same cached disk.  Now the cache-disk1.qcow2 is not useful as is,
> > because it has no allocated clusters, so after it's created we need to
> > be able to stream content into it from master-disk1.qcow2, in parallel
> > with VM A booting off vm-a-disk1.qcow2.
> > 
> > If there was only a single VM, this would be easy enough, because we
> > can use the drive-mirror monitor command to pull master-disk1.qcow2 data
> > into cache-disk1.qcow2 and then remove the backing chain leaving just
> > 
> >    cache-disk1.qcow2   (qemu-system-XXX)
> >           |
> 
> Just for my own understanding: in this hypothetical single VM diagram,
> you denote a QEMU binary ("qemu-system-XXX") for 'cache-disk1.qcow2'
> because it will be issuing 'drive-mirror' / 'blockdev-mirror' to the
> 'qemu-nbd' that exported 'master-disk1.qcow2', and "un-chain"
> post completion of 'mirror' job.  Yes?

In this diagram the same QEMU process has both cache-disk1.qcow2 and
vm-a-disk1.qcow2 open - it's just a regular backing file setup.
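
To make the "one process, whole chain" point concrete, here is a rough
sketch of the nested block options a single QEMU process might be given
for the 3 layer chain. It only builds and prints the JSON; the host name,
port and export name are made up for illustration, and the exact option
layout should be treated as an assumption rather than a tested config.

```python
import json


def file_node(path):
    """Protocol node for a local file (proto=file in the diagrams)."""
    return {"driver": "file", "filename": path}


def nbd_node(host, port, export):
    """Protocol node for a remote NBD export (proto=nbd in the diagrams)."""
    return {
        "driver": "nbd",
        "server": {"type": "inet", "host": host, "port": str(port)},
        "export": export,
    }


# Hypothetical endpoint for the qemu-nbd serving master-disk1.qcow2.
# The export is seen as raw, because qemu-nbd already decodes the qcow2.
master = {"driver": "raw",
          "file": nbd_node("images.example.org", 10809, "disk1")}

# Local cache layer: qcow2 on a local file, backed by the NBD export.
cache = {"driver": "qcow2",
         "file": file_node("cache-disk1.qcow2"),
         "backing": master}

# Per-VM overlay: the only layer the guest ever writes to.
vm_a = {"driver": "qcow2",
        "node-name": "vm-a-disk1",
        "file": file_node("vm-a-disk1.qcow2"),
        "backing": cache}

print(json.dumps(vm_a, indent=2))
```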

> 
> >           |  (format=qcow2, proto=file)
> >           |
> >           +-  vm-a-disk1.qcow2  (qemu-system-XXX)
> > 
> > The problem is that many VMs want to use cache-disk1.qcow2 as
> > their disk's backing file, and only one process is permitted to be
> > writing to a disk backing file at any time.
> 
> Can you explain a bit more about how many VMs are trying to write to
> the same backing file 'cache-disk1.qcow2'?  I'd assume it's
> just the "immutable" local backing store (once the previous 'mirror' job
> is completed), based on which Nova creates a qcow2 overlay for each
> instance it boots.

An arbitrary number of vm-*-disk1.qcow2 files could exist, all using
the same cache-disk1.qcow2 image. It's only limited by how many VMs
you can fit on the host. By definition you can only ever have a single
process writing to a qcow2 file though, otherwise corruption will quickly
follow.
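
As a small illustration of the "many overlays, one backing file" layout,
the sketch below builds the qemu-img command lines that would create one
overlay per VM on top of the shared cache image. It only constructs and
prints the argument lists, nothing is executed; the helper name and the
40G size are illustrative assumptions.

```python
def overlay_create_cmd(backing, overlay, size=None):
    """Return a qemu-img argv creating a qcow2 overlay on top of `backing`.

    Hypothetical helper: the overlay is the only file a VM writes to,
    the shared backing file is only ever read by the VMs.
    """
    cmd = ["qemu-img", "create", "-f", "qcow2",
           "-b", backing, "-F", "qcow2", overlay]
    if size is not None:
        # An overlay may be larger than its backing file.
        cmd.append(size)
    return cmd


cmds = [overlay_create_cmd("cache-disk1.qcow2", f"vm-{name}-disk1.qcow2", "40G")
        for name in "abc"]

for cmd in cmds:
    print(" ".join(cmd))
```

Each VM gets its own writable file, so the single-writer rule only becomes
a problem for whatever process is populating cache-disk1.qcow2 itself.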

> When I pointed this e-mail of yours to Matt Booth on Freenode Nova IRC
> channel, he said the intermediate image (cache-disk1.qcow2) is a COR
> (Copy-On-Read).  I realize what COR is -- every time you read a cluster
> from the backing file, you write that locally, to avoid reading it
> again.

qcow2 doesn't give you COR, only COW. So every read request would miss in
cache-disk1.qcow2 and thus have to be fetched from master-disk1.qcow2. The
use of drive-mirror to pull master-disk1.qcow2 contents into cache-disk1.qcow2
makes up for the lack of COR by populating cache-disk1.qcow2 in the background.
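
A toy model of that behaviour, purely for illustration: it is not how the
block layer is implemented, and the class and function names are invented.
Reads through an unpopulated cache layer always fall through to the master,
until a background copy has filled the cache in.

```python
class Layer:
    """Toy image layer: a map of allocated clusters plus an optional backing layer."""

    def __init__(self, clusters=None, backing=None):
        self.clusters = dict(clusters or {})
        self.backing = backing
        self.served = 0  # reads answered from this layer's own clusters

    def read(self, idx):
        if idx in self.clusters:
            self.served += 1
            return self.clusters[idx]
        if self.backing is None:
            return b"\0"  # unallocated everywhere: reads as zeroes
        # No copy-on-read: the data is returned but NOT stored locally.
        return self.backing.read(idx)


def mirror_step(cache, idx):
    """Background population: copy one cluster from the backing layer into the cache."""
    if idx not in cache.clusters:
        cache.clusters[idx] = cache.backing.read(idx)


master = Layer({i: bytes([i]) for i in range(4)})
cache = Layer(backing=master)

cache.read(0)
cache.read(0)
print(master.served)   # both reads missed the cache and hit the master

for i in range(4):     # the background job fills in the whole cache
    mirror_step(cache, i)

before = master.served
cache.read(0)
print(master.served == before)  # the read is now served locally
```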

> > So I can't use the drive-mirror
> > in the QEMU processes to deal with this; all QEMUs must see their
> > backing file in a consistent read-only state.
> > 
> > I've been wondering if it is possible to add an extra layer of NBD to
> > deal with this scenario. i.e. start off with:
> > 
> >    master-disk1.qcow2  (qemu-nbd)
> >           |
> >           |  (format=raw, proto=nbd)
> >           |
> >    cache-disk1.qcow2  (qemu-nbd)
> >           |
> >           |  (format=raw, proto=nbd)
> >           |
> >           +-  vm-a-disk1.qcow2  (qemu-system-XXX)
> >           +-  vm-b-disk1.qcow2  (qemu-system-XXX)
> >           +-  vm-c-disk1.qcow2  (qemu-system-XXX)
> > 
> > 
> > In this model 'cache-disk1.qcow2' would be opened read-write by a
> > qemu-nbd server process, but exported read-only to QEMU. qemu-nbd
> > would then do a drive mirror to stream the contents of
> > master-disk1.qcow2 into its cache-disk1.qcow2, concurrently with
> > servicing read requests from many QEMU's vm-*-disk1.qcow2 files
> > over NBD. When the drive mirror is complete we would again cut
> > the backing file to give
> > 
> >    cache-disk1.qcow2  (qemu-nbd)
> >           |
> >           |  (format=raw, proto=nbd)
> >           |
> >           +-  vm-a-disk1.qcow2  (qemu-system-XXX)
> >           +-  vm-b-disk1.qcow2  (qemu-system-XXX)
> >           +-  vm-c-disk1.qcow2  (qemu-system-XXX)
> > 
> > Since qemu-nbd no longer needs to write to cache-disk1.qcow2 at this point,
> > we can further pivot all the QEMU servers to make vm-*-disk1.qcow2 use
> > format=qcow2,proto=file, allowing the local qemu-nbd to close the disk
> > image, and potentially exit (assuming it doesn't have other disks to
> > service). This would leave
> > 
> >    cache-disk1.qcow2  (qemu-system-XXX)
> >           |
> >           |  (format=qcow2, proto=file)
> >           |
> >           +-  vm-a-disk1.qcow2  (qemu-system-XXX)
> >           +-  vm-b-disk1.qcow2  (qemu-system-XXX)
> >           +-  vm-c-disk1.qcow2  (qemu-system-XXX)
> > 
> > Conceptually QEMU has all the pieces necessary to support this kind of
> > approach to disk images, but they're not exposed by qemu-nbd as it has
> > no QMP interface of its own.
> > 
> > Another more minor issue is that the disk image repository may have
> > 1000's of images in it, and I don't want to be running 1000's of
> > qemu-nbd instances. I'd like 1 server to export many disks. I could
> > use iscsi in the disk image repository instead to deal with that,
> > only having the qemu-nbd processes running on the local virt host
> > for the duration of populating cache-disk1.qcow2 from master-disk1.qcow2.
> > The iscsi server admin commands are pretty unpleasant to use compared
> > to QMP though, so it's appealing to use NBD for everything.
> > 
> > After all that long background explanation, what I'm wondering is whether
> > there is any interest / desire to extend qemu-nbd to have a more advanced
> > feature set than simply exporting a single disk image which must be listed
> > at startup time.
> > 
> >  - Ability to start qemu-nbd up with no initial disk image connected
> >  - Option to have a QMP interface to control qemu-nbd
> >  - Commands to add / remove individual disk image exports
> >  - Commands for doing the drive-mirror / backing file pivot
> > 
> > It feels like this wouldn't require significant new functionality in either
> > QMP or block layer. It ought to be mostly a case of taking existing QMP
> > code and wiring it up in qemu-nbd, and only exposing a whitelisted subset
> > of existing QMP commands related to block backends. 
> > 
> > One alternative approach to doing this would be to suggest that we should
> > instead just spawn qemu-system-x86_64 with '--machine none' and use that
> > as a replacement for qemu-nbd, since it already has a built-in NBD server
> > which can do many exports at once and arbitrary block jobs.
> > 
> > I'm concerned that this could end up being a game of whack-a-mole
> > though, constantly trying to cut out/down all the bits of system emulation
> > in the machine emulators to get its resource overhead to match the low
> > overhead of standalone qemu-nbd.
> > 

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
