qemu-devel.nongnu.org archive mirror
From: Kashyap Chamarthy <kchamart@redhat.com>
To: "Daniel P. Berrange" <berrange@redhat.com>
Cc: qemu-devel@nongnu.org, qemu-block@nongnu.org,
	Eric Blake <eblake@redhat.com>,
	mbooth@redhat.com
Subject: Re: [Qemu-devel] [Qemu-block] RFC: use case for adding QMP, block jobs & multiple exports to qemu-nbd ?
Date: Thu, 2 Nov 2017 17:40:28 +0100	[thread overview]
Message-ID: <20171102164028.lkl3cv52stkdiywj@eukaryote> (raw)
In-Reply-To: <20171102120223.GI32533@redhat.com>

[Cc: Matt Booth from Nova upstream; so not snipping the email to retain
context for Matt.]

On Thu, Nov 02, 2017 at 12:02:23PM +0000, Daniel P. Berrange wrote:
> I've been thinking about a potential design/impl improvement for the way
> that OpenStack Nova handles disk images when booting virtual machines, and
> thinking if some enhancements to qemu-nbd could be beneficial...

Just read through it; very interesting idea.  A couple of comments inline.

> At a high level, OpenStack has a repository of disk images (Glance), and
> when we go to boot a VM, Nova copies the disk image out of the repository
> onto the local host's image cache. While doing this, Nova may also enlarge
> the disk image (e.g. if the original image is 10GB in size, it may do a
> qemu-img resize to 40GB). Nova then creates a qcow2 overlay with its
> backing file pointing to the local cache. Multiple VMs can be booted in
> parallel, each with their own overlay pointing to the same backing file.
> 
> The problem with this approach is that VM startup is delayed while we copy
> the disk image from the glance repository to the local cache, and again
> while we do the image resize (though the latter is pretty quick, really,
> since it's just changing metadata in the image and/or host filesystem).
> 
> One might suggest that we avoid the local disk copy and just point the
> VM directly at an NBD server running in the remote image repository, but
> this introduces a centralized point of failure. With the local disk copy
> VMs can safely continue running even if the image repository dies. Running
> from the local image cache can offer better performance too, particularly
> if the host has SSD storage.
> 
> Conceptually what I want to start with is a 3 layer chain
> 
>    master-disk1.qcow2  (qemu-nbd)
>           |
>           |  (format=raw, proto=nbd)
>           |
>    cache-disk1.qcow2   (qemu-system-XXX)
>           |
>           |  (format=qcow2, proto=file)
>           |
>           +-  vm-a-disk1.qcow2   (qemu-system-XXX)
> 
> NB: vm-?-disk1.qcow2 sizes may differ from that of the backing file.
> Sometimes OS disk images are built with a fairly small root filesystem
> size, and the guest OS will grow its root FS to fill the actual disk
> size allowed to the specific VM instance.
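
As a concrete aside: the overlays in the chain above would be created
with 'qemu-img create -b'.  A Python sketch that only *builds* the
command lines (so nothing here needs qemu installed); the repository
host, port, and export name are invented:

```python
# Illustrative only: assemble (but do not run) the qemu-img invocations
# that would build the three-layer chain. All paths/hosts are made up.
def create_overlay(path, backing, backing_fmt, size=None):
    cmd = ["qemu-img", "create", "-f", "qcow2",
           "-b", backing, "-F", backing_fmt, path]
    if size:
        cmd.append(size)          # e.g. grow the overlay beyond the base
    return cmd

# cache disk backed by the repository's NBD export (format=raw, proto=nbd)
cache_cmd = create_overlay("cache-disk1.qcow2",
                           "nbd://repo.example.org:10809/master-disk1",
                           "raw")

# per-VM overlay backed by the local cache (format=qcow2, proto=file)
vm_a_cmd = create_overlay("vm-a-disk1.qcow2", "cache-disk1.qcow2",
                          "qcow2", size="40G")
print(" ".join(vm_a_cmd))
```

Pinning the backing format with '-F' matters here, since a backing
"file" that is an NBD URL should not be format-probed.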
> 
> The cache-disk1.qcow2 is on each local virt host that needs disk1, and
> created when the first VM is launched. Further launched VMs can all use
> this same cached disk.  Now cache-disk1.qcow2 is not useful as-is,
> because it has no allocated clusters, so after it's created we need to
> be able to stream content into it from master-disk1.qcow2, in parallel
> with VM A booting off vm-a-disk1.qcow2.
> 
> If there were only a single VM, this would be easy enough, because we
> could use the drive-mirror monitor command to pull master-disk1.qcow2
> data into cache-disk1.qcow2 and then remove the backing chain, leaving
> just:
> 
>    cache-disk1.qcow2   (qemu-system-XXX)
>           |

Just for my own understanding: in this hypothetical single-VM diagram,
you denote a QEMU binary ("qemu-system-XXX") for 'cache-disk1.qcow2'
because it will be the one issuing 'drive-mirror' / 'blockdev-mirror' to
the 'qemu-nbd' that exported 'master-disk1.qcow2', and then "un-chain"
once the 'mirror' job completes.  Yes?

>           |  (format=qcow2, proto=file)
>           |
>           +-  vm-a-disk1.qcow2  (qemu-system-XXX)
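
For reference, this single-VM population step maps onto today's QMP
roughly as follows; 'block-stream' on the intermediate node is arguably
the closest fit to the 'drive-mirror' mentioned above, since it fills
cache-disk1.qcow2 in place and drops the backing link on completion.
Node and job names here are invented:

```python
import json

# Sketch: the QMP message a qemu-system-* process (which opened the
# whole chain) would send to stream master-disk1.qcow2's clusters into
# cache-disk1.qcow2. "cache-disk1" is an invented node-name.
stream = {
    "execute": "block-stream",
    "arguments": {
        "device": "cache-disk1",     # node-name of the intermediate layer
        "job-id": "populate-cache",
    },
}
print(json.dumps(stream))
```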
> 
> The problem is that many VMs are wanting to use cache-disk1.qcow2 as
> their disk's backing file, and only one process is permitted to be
> writing to a disk's backing file at any time.

Can you explain a bit more how many VMs end up trying to write to the
same backing file 'cache-disk1.qcow2'?  I'd assume it's just the
"immutable" local backing store (once the previous 'mirror' job is
completed), based on which Nova creates a qcow2 overlay for each
instance it boots.

When I pointed Matt Booth (on the Freenode Nova IRC channel) at this
e-mail of yours, he said the intermediate image (cache-disk1.qcow2) is
COR (Copy-On-Read).  I realize what COR is -- every time you read a
cluster from the backing file, you write that cluster locally, to avoid
reading it again.
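
QEMU's command line does expose this as a drive option; a sketch of the
option string that would turn it on for one VM (paths invented, and
which layer Nova would actually want it on is an open question):

```python
# Hypothetical: build the -drive option string that enables QEMU's
# copy-on-read behaviour, so reads from the backing file populate the
# local image. File name is invented.
drive_opts = {
    "file": "vm-a-disk1.qcow2",
    "format": "qcow2",
    "copy-on-read": "on",
}
drive_arg = "-drive " + ",".join(f"{k}={v}" for k, v in drive_opts.items())
print(drive_arg)
```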

> So I can't use drive-mirror
> in the QEMU processes to deal with this; all QEMUs must see their
> backing file in a consistent, read-only state.
> 
> I've been wondering if it is possible to add an extra layer of NBD to
> deal with this scenario. i.e. start off with:
> 
>    master-disk1.qcow2  (qemu-nbd)
>           |
>           |  (format=raw, proto=nbd)
>           |
>    cache-disk1.qcow2  (qemu-nbd)
>           |
>           |  (format=raw, proto=nbd)
>           |
>           +-  vm-a-disk1.qcow2  (qemu-system-XXX)
>           +-  vm-b-disk1.qcow2  (qemu-system-XXX)
>           +-  vm-c-disk1.qcow2  (qemu-system-XXX)
> 
> 
> In this model 'cache-disk1.qcow2' would be opened read-write by a
> qemu-nbd server process, but exported read-only to QEMU. qemu-nbd
> would then do a drive mirror to stream the contents of
> master-disk1.qcow2 into its cache-disk1.qcow2, concurrently with
> servicing read requests from the many QEMUs' vm-*-disk1.qcow2 files
> over NBD. When the drive mirror is complete, we would again cut
> the backing file to give:
> 
>    cache-disk1.qcow2  (qemu-nbd)
>           |
>           |  (format=raw, proto=nbd)
>           |
>           +-  vm-a-disk1.qcow2  (qemu-system-XXX)
>           +-  vm-b-disk1.qcow2  (qemu-system-XXX)
>           +-  vm-c-disk1.qcow2  (qemu-system-XXX)
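
To make the proposal concrete, the control conversation with such a
QMP-enabled qemu-nbd might look like this.  The commands themselves
exist in qemu-system-* QMP today; exposing them from qemu-nbd is
exactly the point of this RFC, and all node/job names are invented:

```python
import json

# Entirely hypothetical control session against a qemu-nbd that grew a
# QMP socket. These are real QMP command names in qemu-system-*, but
# qemu-nbd cannot accept them today.
def qmp_cmd(execute, arguments=None):
    msg = {"execute": execute}
    if arguments is not None:
        msg["arguments"] = arguments
    return json.dumps(msg)

conversation = [
    qmp_cmd("qmp_capabilities"),
    # populate cache-disk1.qcow2 from master-disk1.qcow2 in place,
    # while NBD clients keep reading through it
    qmp_cmd("block-stream", {"device": "cache-disk1",
                             "job-id": "populate"}),
    # poll until the job is gone; then the backing link has been cut
    qmp_cmd("query-block-jobs"),
]
print("\n".join(conversation))
```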
> 
> Since qemu-nbd no longer needs to write to cache-disk1.qcow2 at this
> point, we can further pivot all the QEMU processes to make
> vm-*-disk1.qcow2 use format=qcow2,proto=file, allowing the local
> qemu-nbd to close the disk
> image, and potentially exit (assuming it doesn't have other disks to
> service). This would leave
> 
>    cache-disk1.qcow2  (qemu-system-XXX)
>           |
>           |  (format=qcow2, proto=file)
>           |
>           +-  vm-a-disk1.qcow2  (qemu-system-XXX)
>           +-  vm-b-disk1.qcow2  (qemu-system-XXX)
>           +-  vm-c-disk1.qcow2  (qemu-system-XXX)
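
That last pivot might be expressed with the existing
'change-backing-file' QMP command, which rewrites only the backing-file
string in the overlay's header (the runtime reopen side of the graph is
glossed over here, and node names are invented):

```python
import json

# Sketch: each qemu-system-* rewrites its overlay's backing-file string
# from the NBD URL to the local path, so the qemu-nbd middleman can go
# away. change-backing-file is a real QMP command; the node names below
# are invented.
pivot = {
    "execute": "change-backing-file",
    "arguments": {
        "device": "vm-a-disk1",
        "image-node-name": "vm-a-disk1",
        "backing-file": "cache-disk1.qcow2",
    },
}
print(json.dumps(pivot))
```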
> 
> Conceptually QEMU has all the pieces necessary to support this kind of
> approach to disk images, but they're not exposed by qemu-nbd, as it has
> no QMP interface of its own.
> 
> Another, more minor, issue is that the disk image repository may have
> thousands of images in it, and I don't want to be running thousands of
> qemu-nbd instances. I'd like one server to export many disks. I could
> use iSCSI in the disk image repository instead to deal with that,
> only having the qemu-nbd processes running on the local virt host
> for the duration of populating cache-disk1.qcow2 from master-disk1.qcow2.
> The iSCSI server admin commands are pretty unpleasant to use compared
> to QMP though, so it's appealing to use NBD for everything.
> 
> After all that long background explanation, what I'm wondering is whether
> there is any interest / desire to extend qemu-nbd to have a more advanced
> feature set than simply exporting a single disk image that must be listed
> at startup time.
> 
>  - Ability to start qemu-nbd up with no initial disk image connected
>  - Option to have a QMP interface to control qemu-nbd
>  - Commands to add / remove individual disk image exports
>  - Commands for doing the drive-mirror / backing file pivot
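
Sketching that wishlist as a hypothetical invocation: 'nbd-server-add'
is a real command in qemu-system-* QMP, while the '--qmp' flag and
'nbd-server-remove' message below are invented purely for illustration:

```python
import json

# Hypothetical: launching a QMP-enabled qemu-nbd with no initial disk,
# then adding/removing exports over its control socket. None of these
# qemu-nbd options exist today.
launch = ["qemu-nbd", "--qmp", "unix:/run/qemu-nbd-qmp.sock"]

add_export = {"execute": "nbd-server-add",       # real in qemu-system-*
              "arguments": {"device": "cache-disk1", "writable": False}}
remove_export = {"execute": "nbd-server-remove", # invented here
                 "arguments": {"name": "cache-disk1"}}

print(" ".join(launch))
for msg in (add_export, remove_export):
    print(json.dumps(msg))
```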
> 
> It feels like this wouldn't require significant new functionality in
> either QMP or the block layer. It ought to be mostly a case of taking
> existing QMP code and wiring it up in qemu-nbd, exposing only a
> whitelisted subset of existing QMP commands related to block backends.
> 
> One alternative approach to doing this would be to suggest that we should
> instead just spawn qemu-system-x86_64 with '--machine none' and use that
> as a replacement for qemu-nbd, since it already has a built-in NBD server
> which can do many exports at once and arbitrary block jobs.
> 
> I'm concerned that this could end up being a game of whack-a-mole,
> though, constantly trying to cut out/down all the bits of system
> emulation in the machine emulators to get their resource overhead to
> match the low overhead of standalone qemu-nbd.
> 
> Regards,
> Daniel
> -- 
> |: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
> |: https://libvirt.org         -o-            https://fstop138.berrange.com :|
> |: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|
> 

-- 
/kashyap
