[Qemu-devel] Raw notes from a small block layer/QAPI/something pre-christmas meeting

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [Qemu-devel] Raw notes from a small block layer/QAPI/something pre-christmas meeting
@ 2017-12-15 16:38 Max Reitz
  2017-12-18 10:11 ` Markus Armbruster
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Max Reitz @ 2017-12-15 16:38 UTC (permalink / raw)
  To: Qemu-block; +Cc: qemu-devel@nongnu.org

[-- Attachment #1: Type: text/plain, Size: 17006 bytes --]

Hi everyone,

Kevin, Markus, and me had a small personal meeting over the last 1.5
days and discussed a couple of things about the block layer and its QAPI
entanglements.

Here's a rather rough sketch on what we talked about:


== Quorum is broken ==

a) x-blockdev-change is broken.  When adding a new block device, you
   should always set a child name.  This cannot or at least should not
   be inferred automagically.

   Also, adding is just a special case of replacing: If there is no such
   child yet, it will just be added.

   So the new interface could look like this:
       x-blockdev-change(parent, child, node)
   Where nothing is optional, @parent is the node name of the parent,
   @child is the BdrvChild name of the edge to be changed (if it does
   not exist yet, it will be created), and @node is the node name of the
   child (null if the edge is to be removed).

   Note: (b) will say that x-blockdev-change can be replaced by a
   QMP-bdrv_reopen() anyway, so we should probably do so.

b) Quorum child list is pretty much broken right now; since it always
   tries to keep the list continuous from front to back, child names can
   change at any point and you don't have a say in how it's ordered.
   (And probably more things.)
   Needed:
   - Fixed child names for quorum
   - Way to reorder the children

   Ideally, the path to the child's BlockdevRef should be equivalent to
   the child's name (like it is now with "children.0" etc.).  There are
   at least three ways to achieve this (suggestions welcome):
   (Note from the next day: This may be obsolete.[1])
   - Put the child options either directly into the Quorum node's
     options (like it is done for all other block drivers) or in a dict
     under e.g. a "children" key.  This would probably require new QAPI
     infrastructure as we would want to be able to specify an arbitrary
     number of BlockdevRefs with arbitrary keys inside of
     BlockdevOptionsQuorum (or BlockdevOptionsQuorum.children).
     Also, Dicts generally do not have an order, so we would have to add
     a list to BlockdevOptionsQuorum that specifies the child order (by
     child name).

   - Same as the last, but introduce ordered Dicts and ditch the
     explicit ordering list.

   - Keep the current way, but do not keep the children list continuous
     at all cost.  Instead, whenever a child is deleted, replace it by
     QNull.

   The last would require the least QAPI work and the least
   modifications to Quorum (especially its interface) and may thus be
   what we want to do.

   In any case, if you want to change Quorum children and change their
   ordering at the same time, we would need a transaction of
   x-blockdev-change and bdrv_reopen.  Since x-blockdev-change is
   basically a special case of bdrv_reopen, we probably just want
   bdrv_reopen over QMP instead.

   [1] If we implement this through bdrv_reopen anyway, there is no need
       for any of the things mentioned above.  We can keep the current
       list and if you want to add or remove children, you just have to
       specify it from scratch, thus the user knows the new child names
       and they don't change implicitly.

   Other advantages of bdrv_reopen:
   - Allows you to keep state of nodes (instead of creating a new node
     and replacing the old one by the new one)
   - Is already a transaction

Most important conclusion: We should think very hard before removing the
x- before x-blockdev-change.  We probably don't want it at all.  It
breaks quorum in its current state (by implicitly changing its
children's names), and we want bdrv_reopen over it.


== Automatically inserted filter nodes (includes stream/COR) ==

- backing_bs() should automatically skip all filter nodes; except when
  it should not (e.g. bdrv_query_info(): There we only want to skip
  implicitly created nodes).
  - We probably need different functions for different cases: Sometimes
    we want the actual backing child, filter or not; at other times, we
    want to skip implicitly created nodes (for legacy commands); and
    sometimes, we even want to skip all filter nodes (e.g. for the
    commit block job).
    For skipping nodes, we probably need some infrastructure so that
    each node can actually say what its filtered node is instead of
    trying to guess it generally.
    (Also, maybe the commit block job does not want to just skip all
     nodes; say there is a throttling node in the chain, then it will
     not have any effect on writing to the base, which may not be
     intuitively correct (at least).  We could make this an error and
     require the user to reinstate all filter nodes outside of the
     commit chain before we do the commit; or we simply implement
     blockdev-copy (see below) where it is clear that writing to the
     target will not go through any backing chain.)

- Stream node: Basically the same as a COR node, just with more
  functionality -- either add a stream_top node that can do generic COR
  ("sync=none"), or add a COR node that can do everything stream should
  be able to do, or make both effectively the same driver but call them
  differently still (just in the interface) so that the stream driver
  can later be extended if so desired.
  - We need a COR node anyway, but just as with trottling, we will have
    to keep the current code around for -drive (until we either kill
    -drive or implement the -drive option with an automatically inserted
    COR node)

- backup_top node: Should just be done (would be obsoleted by
  blockdev-copy, but that is probably further in the future than we'd
  like)


== 3.0 deprecation ==

- Remove drive-backup, drive-mirror, blockdev-snapshot-sync,
  block_passwd, block_set_io_throttle, block-job-set-speed
  - query-block and query-named-block-nodes need a replacement first
  - Along with block-job-set-speed: Remove block job throttling
- Rework block-commit and block-stream to only use sane options (no
  filenames except for the backing file string and no @device)
- -drive (at least explicitly mark it HMP-unstable)
  - In any case features such as throttling, COR, snapshots
    (COR needs to be a filter driver first)
  - Part of -blockdev, but: Maybe convert detect-zeroes to a filter
    driver and at some point (not necessarily 3.0) remove the generic
    node-level option
- Remove BB names (in QMP commands; but without -drive, they should be
  gone completely, then)
- Remove QMP change
  - Maybe QMP eject, too
- Visibly remove old block drivers, maybe moving them to nbdkit
  (either through completely removing them, or through disabling write
   support (only enabling it in qtest or something), or by using the
   whitelist and not adding them by default)
  Arguably things like bochs, dmg, qcow, qed, ...?
  (In ascending order of radicalness)
  ((vvfat is always a candidate, but some people actually like it))


== Preparation for thorough internal block layer QAPIfication ==

- Blockdev options should always be able to reflect the actual internal
  state -- this is not always the case (e.g. x-image in blkdebug and
  blkverify, which stores the (fat) filename of the image that is
  tested).
  List of issues:
  - blkdebug's x-image
  - blkverify's x-image and x-raw
  - rbd's =keyvalue-pairs
  - ssh's host_key_check

  (A) Add these options to the QAPI schema.
      - Bad because we don't want fat filenames in that schema.
      - If adding =keyvalue-pairs was so easy, we'd have done so
        already.  (Same for ssh, but that doesn't look impossible.)

  (B) Drop x-image and x-raw, disallowing fat filenames for blkdebug and
      blkverify.
      - Basically an idea worth pursuing, but doesn't solve rbd's or
        ssh's issue.

  (C) QAPI options, but continue to give .bdrv_parse_filename() a way to
      store random stuff (that will then be handed to .bdrv_open()).
      - Results in another parameter for .bdrv_open(), but would solve
        the issue for the rest.
      - Holds us back from fixing the real issue, which is that
        everything you can do with a fat filename you should be able to
        do through QAPI, too.

   We can also do a combination: Drop x-image and x-raw, add
   host_key_check to the QAPI schema, and then think long and hard about
   =keyvalue-pairs.


== Single block job for backup/commit/mirror (blockdev-copy) ==

- Special things:
  - mirror: Ready state and block-job-complete
  - mirror: Always replaces some node by the target
  - backup: copy before write and synchronous
    (mirror/commit are copy after write and asynchronous (except
     active-mirror))
  - backup: Has @compressed, that may not work with mirror right now
    (because block drivers assume you only write once)
  - commit: Resizes target if smaller than source
  - commit: Does not share write permissions, maybe because it doesn't
    want to use a dirty bitmap (a unified job would just use a dirty
    bitmap so then it could just generally share write permissions)
  - active commit: Can read from source even if the backing chain is
    garbage because of the ongoing commit (mirror) job
    - block_job_add_bdrv() on nearly whole backing chain so that
      READ_CONSISTENT is not being shared
  - non-active commit: Write target's (base's) filename into all of
    source's (top's) overlays as their backing filename

At least some of these differences would require blockdev-copy runtime
options.

blockdev-copy runtime parameters:
- everything blockdev-mirror has, plus
- @compressed? -- could be done as a block filter that converts normal
  writes to compressed writes
- Some option to suppress node replacement
  (in theory you could put block-job-finalize and bdrv_reopen into a
   transaction to do that replacement yourself)
- Some option to suppress READY event and complete automatically
  (compatibility to backup and non-active commit)
- Options to control the exact copying behavior (before/after write,
  active/passive)
  (passive before-write is impossible)
- Option to resize target if smaller than source
  (maybe just internal and not visible in the interface)
  (compatibility to block-commit)
- Several other commit thingies (like permissions) which can be
  automatically deduced and set if the target is in source's backing
  chain
- Option to specify the target's filename (this is written into all
  overlays of @to_replace as their backing filename; if omitted, the
  filename QEMU knows will be used)


== Image creation ==

Image creation and op blockers:
  At least for the time being, we probably just want file-posix to open
  the new file with O_CREAT | O_RDWR, then claim the appropriate op
  blockers (WRITE and RESIZE) and then invoke ftruncate().
  Alternatively, we could somehow do the truncation from the generic
  block layer, but this may not be possible with all protocols (and
  currently we support file locking only over file-posix anyway).

Image creation job:
  We want to have this anyway (including QAPIfication of the creation
  options), but when, alas, we cannot say.
  Some ideas:
  - Do creation per driver, not automatic "pass-through".
    First, create the protocol file.  Then, format it using the format
    driver.
    So blockdev-create would take the same child BlockdevRefs as
    blockdev-add, and then format those.  For protocol drivers, you just
    don't pass any child and the file is thus created.
    (You then open it with blockdev-add or just use it inline in the
     format driver's blockdev-create to format the image.)

Image creation in qemu-system-* vs. qemu-img:
  In order to get proper introspection for qemu-img create, we need a
  QAPI schema.  If we have a QAPI schema, we might as well add
  blockdev-create to QMP.
  As long as we do not have a really-none (null, void, ...) machine type
  for qemu-system-*, launching such a process just for creating an image
  will bring quite a bit of overhead (e.g. with -M none -accel qtest).
  However, as for libvirt, this is not exactly a regression since
  libvirt currently cannot create images at all (apart from implicitly
  through drive-mirror etc.).  Further work on voidifying qemu-system-*
  will improve performance.
  On the other side, we can also add QAPI introspection to qemu-img.
  (qemu-img already links to QAPI, so this should not be too hard.)
  qemu-img will also need command-line introspection, though.

  Plan B:
    libvirt can use qemu-img now with the currently supported options,
    and as soon as libvirt needs anything better, we will have something
    better done.
    (Also, there is "qemu-img create -f $format -o help"!  Because
     parsing help texts has worked so well in the past.)


== GRAPH_MOD ==

Why do we have this GRAPH_MOD comment in block/mirror.c?
- Do we need the replace_blocker at all?
  - Does it block the check_to_replace_node() information?
    - No, it does not really (anymore, broken 2.5 years ago).
    - Also, check_to_replace() seems to only have one function and that
      is to prevent data corruption (so the user can only replace nodes
      that show the same data as the source node, as there are only
      filters between the two).  But if the user is so inclined to get
      bad data, they should feel free to.
      (QEMU will never implicitly add non-filter nodes, so any
       non-filter node must have been added by the user and thus cannot
       catch them by surprise.)
    - Therefore, we can probably just drop check_to_replace_node()
      (now).
  - Then, the blocker does not seem to serve any useful purpose anymore,
    so we can drop that, too.
    - (Maybe the op blocker had something to do with bdrv_swap(), where
       a pointer to a BDS could point to a completely different BDS
       after a bit of time.  Now we no longer perform such criminal
       procedures and thus should be fine without.)
    - Action item: We should also resolve the to_replace node name only
      once, and that is at the beginning of the job.
- Ergo, nobody knows what the comment is supposed to mean.  In any case
  we don't need GRAPH_MOD there.

OK, so do we need it somewhere else?
- Maybe not?


== Corrupted qcow2 nodes ==

(Note: The current idea was to replace a corrupted qcow2 node by e.g. a
 null-co node or another special node that would always return errors
 when accessed.)

Adding a new driver (or runtime options for null-co) is hard, because
this would require us to still keep the old node around (so that it
doesn't suddenly disappear) and to prevent attachment to it (so that you
cannot use it through a back door).

Therefore the easiest way might be to add a new flag BDS.corrupted that
is simply checked in functions that write to such a BDS (write, discard,
flush, ...) and that then return -ENOMEDIUM (or something, e.g. EIEIO)
in such a case.


== Locking for graph changes ==

Currently, we just do graph changes at any point whenever we so desire.
This gives us nice breakage e.g. in the drain functions and everywhere
we don't expect it (even though drain should expect it...).

Therefore, graph changes should probably only happen whenever nobody
else is looking.

- bdrv_drained_begin()/_end() should be locks to prevent graph changes
  from anyone else
  (the caller can do changes, though)
- Maybe there are exceptions if you change your very own children,
  because if you do so, you already know that you yourself are
  effectively drained


== Dirty bitmaps ==

Omniscient dirty bitmaps are mostly relevant for filter nodes.  In their
case, maybe they should not have their own bitmaps but just forward
bitmap accesses to their children.

Either the filter node itself decides how this forwarding is done;
because e.g. raw allows accessing a windowed portion of its filtered node.
Or we recognize this is a bit stupid (because we would have to add
callbacks for every bitmap access function there is, for rather little
gain) and we just implement a generic function that returns the bitmap
we want to access (and nodes can say whether skipping them is OK or not,
see the note somewhere above on how filter nodes should advertise their
filtered node).
Or we add a BlockDriver callback to let the driver decide which bitmap
to access (instead of generically hopping through the graph).

For all children where this is not implemented (e.g. "file" children of
format nodes), we can prevent this issue by requiring drivers not to
share PERM_WRITE for those children.
Note that we have to unshare PERM_WRITE whenever such pass-through is
impossible, for all children that influence the node's visible data, as
soon as you try to create a bitmap on that node.


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 512 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] Raw notes from a small block layer/QAPI/something pre-christmas meeting
  2017-12-15 16:38 [Qemu-devel] Raw notes from a small block layer/QAPI/something pre-christmas meeting Max Reitz
@ 2017-12-18 10:11 ` Markus Armbruster
  2017-12-20 10:44   ` [Qemu-devel] [Qemu-block] " Kashyap Chamarthy
  2017-12-20 11:11 ` [Qemu-devel] " Daniel P. Berrange
  2017-12-22 11:01 ` [Qemu-devel] " Vladimir Sementsov-Ogievskiy
  2 siblings, 1 reply; 12+ messages in thread
From: Markus Armbruster @ 2017-12-18 10:11 UTC (permalink / raw)
  To: Max Reitz; +Cc: Qemu-block, qemu-devel@nongnu.org

Max Reitz <mreitz@redhat.com> writes:

[...]
> == Image creation ==
>
> Image creation and op blockers:
>   At least for the time being, we probably just want file-posix to open
>   the new file with O_CREAT | O_RDWR, then claim the appropriate op
>   blockers (WRITE and RESIZE) and then invoke ftruncate().
>   Alternatively, we could somehow do the truncation from the generic
>   block layer, but this may not be possible with all protocols (and
>   currently we support file locking only over file-posix anyway).
>
> Image creation job:
>   We want to have this anyway (including QAPIfication of the creation
>   options), but when, alas, we cannot say.
>   Some ideas:
>   - Do creation per driver, not automatic "pass-through".
>     First, create the protocol file.  Then, format it using the format
>     driver.
>     So blockdev-create would take the same child BlockdevRefs as
>     blockdev-add, and then format those.  For protocol drivers, you just
>     don't pass any child and the file is thus created.
>     (You then open it with blockdev-add or just use it inline in the
>      format driver's blockdev-create to format the image.)
>
> Image creation in qemu-system-* vs. qemu-img:
>   In order to get proper introspection for qemu-img create, we need a
>   QAPI schema.  If we have a QAPI schema, we might as well add
>   blockdev-create to QMP.
>   As long as we do not have a really-none (null, void, ...) machine type
>   for qemu-system-*, launching such a process just for creating an image
>   will bring quite a bit of overhead (e.g. with -M none -accel qtest).
>   However, as for libvirt, this is not exactly a regression since
>   libvirt currently cannot create images at all (apart from implicitly
>   through drive-mirror etc.).  Further work on voidifying qemu-system-*
>   will improve performance.

Another thought: do we want to give qemu-system-* the necessary
privileges for creating images?  Two cases: running with and without a
guest.

>   On the other side, we can also add QAPI introspection to qemu-img.
>   (qemu-img already links to QAPI, so this should not be too hard.)

I agree that adding some kind of QAPI introspection shouldn't be too
hard, but providing it via a QMP monitor would be significantly harder,
because the QMP monitor is entangled with HMP and character devices,
which are in turn entangled with lots of stuff.  The disentangling
necessary to make QMP available for qemu-img might be expensive.

>   qemu-img will also need command-line introspection, though.

My RFC patches to provide command-line introspection don't cover this,
yet; they only provide via QMP.  Providing it directly on the command
line feels feasible.

>   Plan B:
>     libvirt can use qemu-img now with the currently supported options,
>     and as soon as libvirt needs anything better, we will have something
>     better done.
>     (Also, there is "qemu-img create -f $format -o help"!  Because
>      parsing help texts has worked so well in the past.)


[...]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [Qemu-block] Raw notes from a small block layer/QAPI/something pre-christmas meeting
  2017-12-18 10:11 ` Markus Armbruster
@ 2017-12-20 10:44   ` Kashyap Chamarthy
  2017-12-20 10:57     ` Daniel P. Berrange
  0 siblings, 1 reply; 12+ messages in thread
From: Kashyap Chamarthy @ 2017-12-20 10:44 UTC (permalink / raw)
  To: Markus Armbruster; +Cc: Max Reitz, qemu-devel@nongnu.org, Qemu-block

On Mon, Dec 18, 2017 at 11:11:00AM +0100, Markus Armbruster wrote:
> Max Reitz <mreitz@redhat.com> writes:

[...]

Thanks, Max, for the detailed notes.

> > Image creation in qemu-system-* vs. qemu-img:
> >   In order to get proper introspection for qemu-img create, we need a
> >   QAPI schema.  If we have a QAPI schema, we might as well add
> >   blockdev-create to QMP.
> >   As long as we do not have a really-none (null, void, ...) machine type
> >   for qemu-system-*, launching such a process just for creating an image
> >   will bring quite a bit of overhead (e.g. with -M none -accel qtest).
> >   However, as for libvirt, this is not exactly a regression since
> >   libvirt currently cannot create images at all (apart from implicitly
> >   through drive-mirror etc.).  Further work on voidifying qemu-system-*
> >   will improve performance.
> 
> Another thought: do we want to give qemu-system-* the necessary
> privileges for creating images?  Two cases: running with and without a
> guest.

Related: Just curious -- was it an explicit design decision to not give
`qemu-system-*` permissions to create disk images?

[...]

-- 
/kashyap

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [Qemu-block] Raw notes from a small block layer/QAPI/something pre-christmas meeting
  2017-12-20 10:44   ` [Qemu-devel] [Qemu-block] " Kashyap Chamarthy
@ 2017-12-20 10:57     ` Daniel P. Berrange
  2017-12-20 11:29       ` Kashyap Chamarthy
  2017-12-20 13:33       ` Kevin Wolf
  0 siblings, 2 replies; 12+ messages in thread
From: Daniel P. Berrange @ 2017-12-20 10:57 UTC (permalink / raw)
  To: Kashyap Chamarthy
  Cc: Markus Armbruster, qemu-devel@nongnu.org, Qemu-block, Max Reitz

On Wed, Dec 20, 2017 at 11:44:36AM +0100, Kashyap Chamarthy wrote:
> On Mon, Dec 18, 2017 at 11:11:00AM +0100, Markus Armbruster wrote:
> > Max Reitz <mreitz@redhat.com> writes:
> 
> [...]
> 
> Thanks, Max, for the detailed notes.
> 
> > > Image creation in qemu-system-* vs. qemu-img:
> > >   In order to get proper introspection for qemu-img create, we need a
> > >   QAPI schema.  If we have a QAPI schema, we might as well add
> > >   blockdev-create to QMP.
> > >   As long as we do not have a really-none (null, void, ...) machine type
> > >   for qemu-system-*, launching such a process just for creating an image
> > >   will bring quite a bit of overhead (e.g. with -M none -accel qtest).
> > >   However, as for libvirt, this is not exactly a regression since
> > >   libvirt currently cannot create images at all (apart from implicitly
> > >   through drive-mirror etc.).  Further work on voidifying qemu-system-*
> > >   will improve performance.
> > 
> > Another thought: do we want to give qemu-system-* the necessary
> > privileges for creating images?  Two cases: running with and without a
> > guest.
> 
> Related: Just curious -- was it an explicit design decision to not give
> `qemu-system-*` permissions to create disk images?

Our security model considers QEMU broadly untrustworthy, and so any resources
it needs to use must either be passed in by libvirt, or have permissions
explicitly assigned to permit usage by QEMU. QEMU is allowed to create tmp
files, and create RAM files for memory backing, but in general we don't want
to have QEMU able to create arbitrary files, only open things that are
already created.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] Raw notes from a small block layer/QAPI/something pre-christmas meeting
  2017-12-15 16:38 [Qemu-devel] Raw notes from a small block layer/QAPI/something pre-christmas meeting Max Reitz
  2017-12-18 10:11 ` Markus Armbruster
@ 2017-12-20 11:11 ` Daniel P. Berrange
  2017-12-20 18:15   ` Markus Armbruster
  2017-12-22 11:01 ` [Qemu-devel] " Vladimir Sementsov-Ogievskiy
  2 siblings, 1 reply; 12+ messages in thread
From: Daniel P. Berrange @ 2017-12-20 11:11 UTC (permalink / raw)
  To: Max Reitz; +Cc: Qemu-block, qemu-devel@nongnu.org

On Fri, Dec 15, 2017 at 05:38:00PM +0100, Max Reitz wrote:

> Image creation in qemu-system-* vs. qemu-img:
>   In order to get proper introspection for qemu-img create, we need a
>   QAPI schema.  If we have a QAPI schema, we might as well add
>   blockdev-create to QMP.
>   As long as we do not have a really-none (null, void, ...) machine type
>   for qemu-system-*, launching such a process just for creating an image
>   will bring quite a bit of overhead (e.g. with -M none -accel qtest).
>   However, as for libvirt, this is not exactly a regression since
>   libvirt currently cannot create images at all (apart from implicitly
>   through drive-mirror etc.).  Further work on voidifying qemu-system-*
>   will improve performance.

In terms of the I/O operations involved, image creation is a already a
pretty slow process, particularly if pre-allocation is used which is
common.  So even QEMU's current slow (circa 300ms) startup time is a
complete non-issue for image creation IMHO - it'll be dwarfed by the
time to actually create the image. 

>   On the other side, we can also add QAPI introspection to qemu-img.
>   (qemu-img already links to QAPI, so this should not be too hard.)
>   qemu-img will also need command-line introspection, though.

I figure the qapi-ificiation is the hard & time consuming bit of
work. Once that's done exporting it via both qemu-img & qemu-system*
is quite straighforward.

>   Plan B:
>     libvirt can use qemu-img now with the currently supported options,
>     and as soon as libvirt needs anything better, we will have something
>     better done.
>     (Also, there is "qemu-img create -f $format -o help"!  Because
>      parsing help texts has worked so well in the past.)

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [Qemu-block] Raw notes from a small block layer/QAPI/something pre-christmas meeting
  2017-12-20 10:57     ` Daniel P. Berrange
@ 2017-12-20 11:29       ` Kashyap Chamarthy
  2017-12-20 13:33       ` Kevin Wolf
  1 sibling, 0 replies; 12+ messages in thread
From: Kashyap Chamarthy @ 2017-12-20 11:29 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Markus Armbruster, qemu-devel@nongnu.org, Qemu-block, Max Reitz

On Wed, Dec 20, 2017 at 10:57:40AM +0000, Daniel P. Berrange wrote:
> On Wed, Dec 20, 2017 at 11:44:36AM +0100, Kashyap Chamarthy wrote:
> > On Mon, Dec 18, 2017 at 11:11:00AM +0100, Markus Armbruster wrote:

[...]

> > > Another thought: do we want to give qemu-system-* the necessary
> > > privileges for creating images?  Two cases: running with and without a
> > > guest.
> > 
> > Related: Just curious -- was it an explicit design decision to not give
> > `qemu-system-*` permissions to create disk images?
> 
> Our security model considers QEMU broadly untrustworthy, and so any resources
> it needs to use must either be passed in by libvirt, or have permissions
> explicitly assigned to permit usage by QEMU. QEMU is allowed to create tmp
> files, and create RAM files for memory backing, but in general we don't want
> to have QEMU able to create arbitrary files, only open things that are
> already created.

Ah, true.  Thanks for the reminder about the security architecture.
(Also I realize that libvirt launches QEMU as an unprivileged user,
'qemu', which is part of the defense-in-depth approach, along w/ sVirt
mechanism, etc.)

[...]

-- 
/kashyap

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [Qemu-block] Raw notes from a small block layer/QAPI/something pre-christmas meeting
  2017-12-20 10:57     ` Daniel P. Berrange
  2017-12-20 11:29       ` Kashyap Chamarthy
@ 2017-12-20 13:33       ` Kevin Wolf
  2017-12-20 13:40         ` Daniel P. Berrange
  1 sibling, 1 reply; 12+ messages in thread
From: Kevin Wolf @ 2017-12-20 13:33 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Kashyap Chamarthy, Max Reitz, Markus Armbruster, pkrempa,
	Qemu-block, qemu-devel@nongnu.org

Am 20.12.2017 um 11:57 hat Daniel P. Berrange geschrieben:
> On Wed, Dec 20, 2017 at 11:44:36AM +0100, Kashyap Chamarthy wrote:
> > On Mon, Dec 18, 2017 at 11:11:00AM +0100, Markus Armbruster wrote:
> > > Max Reitz <mreitz@redhat.com> writes:
> > 
> > [...]
> > 
> > Thanks, Max, for the detailed notes.
> > 
> > > > Image creation in qemu-system-* vs. qemu-img:
> > > >   In order to get proper introspection for qemu-img create, we need a
> > > >   QAPI schema.  If we have a QAPI schema, we might as well add
> > > >   blockdev-create to QMP.
> > > >   As long as we do not have a really-none (null, void, ...) machine type
> > > >   for qemu-system-*, launching such a process just for creating an image
> > > >   will bring quite a bit of overhead (e.g. with -M none -accel qtest).
> > > >   However, as for libvirt, this is not exactly a regression since
> > > >   libvirt currently cannot create images at all (apart from implicitly
> > > >   through drive-mirror etc.).  Further work on voidifying qemu-system-*
> > > >   will improve performance.
> > > 
> > > Another thought: do we want to give qemu-system-* the necessary
> > > privileges for creating images?  Two cases: running with and without a
> > > guest.
> > 
> > Related: Just curious -- was it an explicit design decision to not give
> > `qemu-system-*` permissions to create disk images?
> 
> Our security model considers QEMU broadly untrustworthy, and so any resources
> it needs to use must either be passed in by libvirt, or have permissions
> explicitly assigned to permit usage by QEMU. QEMU is allowed to create tmp
> files, and create RAM files for memory backing, but in general we don't want
> to have QEMU able to create arbitrary files, only open things that are
> already created.

Which kind of suggests that libvirt should use an external qemu-img
process rather than a QMP command to create images?

Or would libvirt only create an empty file externally and then use QMP
to create some image format in it? In other words, it would only ever
use the QMP command to create formats like qcow2, but it would never use
it to create the file-posix layer?

Kevin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [Qemu-block] Raw notes from a small block layer/QAPI/something pre-christmas meeting
  2017-12-20 13:33       ` Kevin Wolf
@ 2017-12-20 13:40         ` Daniel P. Berrange
  2017-12-21 12:04           ` Peter Krempa
  0 siblings, 1 reply; 12+ messages in thread
From: Daniel P. Berrange @ 2017-12-20 13:40 UTC (permalink / raw)
  To: Kevin Wolf
  Cc: Kashyap Chamarthy, Max Reitz, Markus Armbruster, pkrempa,
	Qemu-block, qemu-devel@nongnu.org

On Wed, Dec 20, 2017 at 02:33:54PM +0100, Kevin Wolf wrote:
> Am 20.12.2017 um 11:57 hat Daniel P. Berrange geschrieben:
> > On Wed, Dec 20, 2017 at 11:44:36AM +0100, Kashyap Chamarthy wrote:
> > > On Mon, Dec 18, 2017 at 11:11:00AM +0100, Markus Armbruster wrote:
> > > > Max Reitz <mreitz@redhat.com> writes:
> > > 
> > > [...]
> > > 
> > > Thanks, Max, for the detailed notes.
> > > 
> > > > > Image creation in qemu-system-* vs. qemu-img:
> > > > >   In order to get proper introspection for qemu-img create, we need a
> > > > >   QAPI schema.  If we have a QAPI schema, we might as well add
> > > > >   blockdev-create to QMP.
> > > > >   As long as we do not have a really-none (null, void, ...) machine type
> > > > >   for qemu-system-*, launching such a process just for creating an image
> > > > >   will bring quite a bit of overhead (e.g. with -M none -accel qtest).
> > > > >   However, as for libvirt, this is not exactly a regression since
> > > > >   libvirt currently cannot create images at all (apart from implicitly
> > > > >   through drive-mirror etc.).  Further work on voidifying qemu-system-*
> > > > >   will improve performance.
> > > > 
> > > > Another thought: do we want to give qemu-system-* the necessary
> > > > privileges for creating images?  Two cases: running with and without a
> > > > guest.
> > > 
> > > Related: Just curious -- was it an explicit design decision to not give
> > > `qemu-system-*` permissions to create disk images?
> > 
> > Our security model considers QEMU broadly untrustworthy, and so any resources
> > it needs to use must either be passed in by libvirt, or have permissions
> > explicitly assigned to permit usage by QEMU. QEMU is allowed to create tmp
> > files, and create RAM files for memory backing, but in general we don't want
> > to have QEMU able to create arbitrary files, only open things that are
> > already created.
> 
> Which kind of suggests that libvirt should use an external qemu-img
> process rather than a QMP command to create images?

That is my gut feeling, though Peter K might have other opinions as the
person doing most libvirt work in this area.

> Or would libvirt only create an empty file externally and then use QMP
> to create some image format in it? In other words, it would only ever
> use the QMP command to create formats like qcow2, but it would never use
> it to create the file-posix layer?

The complexity with creating an empty file externally, and then writing
the image format into it is figuring out the right size to use for the
empty "file".

We could just create an empty file and let it grow-on-demand but this
only works for things which really are file backed.

The problem still exists for qcow2-over-LVM or  LUKS-over-iSCSI, etc,
where there's no way for QEMU to ever create the underlying LVM or
iSCSI volume. Indeed libvirt probably can't create them either - the
higher level mgmt app (OpenSTack/OVirt) would have to. I guess this
is where the work Stefan did for a "measure" command would come into
play perhaps.

I'm inclined to say though that if libvirt has to create the base layer
externally with qemu-img, we might as well just crete the entire thing
externally. I'm not seeing the obvious benefit to creating the file
with qemu-img, and formatting it with a QMP command.

I think where QMP becomes interesting / useful, over qemu-img is when
populating the file with data (eg a qemu-img convert equiv), since we
can get proper API for progress monitoring & cancelling via block jobs.

Regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] Raw notes from a small block layer/QAPI/something pre-christmas meeting
  2017-12-20 11:11 ` [Qemu-devel] " Daniel P. Berrange
@ 2017-12-20 18:15   ` Markus Armbruster
  2018-01-08 15:12     ` [Qemu-devel] [Qemu-block] " Peter Krempa
  0 siblings, 1 reply; 12+ messages in thread
From: Markus Armbruster @ 2017-12-20 18:15 UTC (permalink / raw)
  To: Daniel P. Berrange; +Cc: Max Reitz, qemu-devel@nongnu.org, Qemu-block

"Daniel P. Berrange" <berrange@redhat.com> writes:

> On Fri, Dec 15, 2017 at 05:38:00PM +0100, Max Reitz wrote:
>
>> Image creation in qemu-system-* vs. qemu-img:
>>   In order to get proper introspection for qemu-img create, we need a
>>   QAPI schema.  If we have a QAPI schema, we might as well add
>>   blockdev-create to QMP.
>>   As long as we do not have a really-none (null, void, ...) machine type
>>   for qemu-system-*, launching such a process just for creating an image
>>   will bring quite a bit of overhead (e.g. with -M none -accel qtest).
>>   However, as for libvirt, this is not exactly a regression since
>>   libvirt currently cannot create images at all (apart from implicitly
>>   through drive-mirror etc.).  Further work on voidifying qemu-system-*
>>   will improve performance.
>
> In terms of the I/O operations involved, image creation is a already a
> pretty slow process, particularly if pre-allocation is used which is
> common.  So even QEMU's current slow (circa 300ms) startup time is a
> complete non-issue for image creation IMHO - it'll be dwarfed by the
> time to actually create the image. 
>
>>   On the other side, we can also add QAPI introspection to qemu-img.
>>   (qemu-img already links to QAPI, so this should not be too hard.)
>>   qemu-img will also need command-line introspection, though.
>
> I figure the qapi-ificiation is the hard & time consuming bit of
> work. Once that's done exporting it via both qemu-img & qemu-system*
> is quite straighforward.

qemu-system-*: trivial.
qemu-img, via command line: straightforward
qemu-img, via QMP: more difficult, since QMP is entangled with HMP,
    character devices, ...

If libvirt really wants to use QMP for the job, *and* doesn't want to
use the qemu-system-* that's running a guest, the easy solution is
running another qemu-system-* without a guest.

>>   Plan B:
>>     libvirt can use qemu-img now with the currently supported options,
>>     and as soon as libvirt needs anything better, we will have something
>>     better done.
>>     (Also, there is "qemu-img create -f $format -o help"!  Because
>>      parsing help texts has worked so well in the past.)
>
> Regards,
> Daniel

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [Qemu-block] Raw notes from a small block layer/QAPI/something pre-christmas meeting
  2017-12-20 13:40         ` Daniel P. Berrange
@ 2017-12-21 12:04           ` Peter Krempa
  0 siblings, 0 replies; 12+ messages in thread
From: Peter Krempa @ 2017-12-21 12:04 UTC (permalink / raw)
  To: Daniel P. Berrange
  Cc: Kevin Wolf, Kashyap Chamarthy, Max Reitz, Markus Armbruster,
	Qemu-block, qemu-devel@nongnu.org

[-- Attachment #1: Type: text/plain, Size: 3288 bytes --]

On Wed, Dec 20, 2017 at 13:40:18 +0000, Daniel Berrange wrote:
> On Wed, Dec 20, 2017 at 02:33:54PM +0100, Kevin Wolf wrote:
> > Am 20.12.2017 um 11:57 hat Daniel P. Berrange geschrieben:
> > > On Wed, Dec 20, 2017 at 11:44:36AM +0100, Kashyap Chamarthy wrote:
> > > > On Mon, Dec 18, 2017 at 11:11:00AM +0100, Markus Armbruster wrote:
> > > > > Max Reitz <mreitz@redhat.com> writes:
> > > > 
> > > > [...]
> > > > 
> > > > Thanks, Max, for the detailed notes.
> > > > 
> > > > > > Image creation in qemu-system-* vs. qemu-img:
> > > > > >   In order to get proper introspection for qemu-img create, we need a
> > > > > >   QAPI schema.  If we have a QAPI schema, we might as well add
> > > > > >   blockdev-create to QMP.
> > > > > >   As long as we do not have a really-none (null, void, ...) machine type
> > > > > >   for qemu-system-*, launching such a process just for creating an image
> > > > > >   will bring quite a bit of overhead (e.g. with -M none -accel qtest).
> > > > > >   However, as for libvirt, this is not exactly a regression since
> > > > > >   libvirt currently cannot create images at all (apart from implicitly
> > > > > >   through drive-mirror etc.).  Further work on voidifying qemu-system-*
> > > > > >   will improve performance.
> > > > > 
> > > > > Another thought: do we want to give qemu-system-* the necessary
> > > > > privileges for creating images?  Two cases: running with and without a
> > > > > guest.
> > > > 
> > > > Related: Just curious -- was it an explicit design decision to not give
> > > > `qemu-system-*` permissions to create disk images?
> > > 
> > > Our security model considers QEMU broadly untrustworthy, and so any resources
> > > it needs to use must either be passed in by libvirt, or have permissions
> > > explicitly assigned to permit usage by QEMU. QEMU is allowed to create tmp
> > > files, and create RAM files for memory backing, but in general we don't want
> > > to have QEMU able to create arbitrary files, only open things that are
> > > already created.
> > 
> > Which kind of suggests that libvirt should use an external qemu-img
> > process rather than a QMP command to create images?
> 
> That is my gut feeling, though Peter K might have other opinions as the
> person doing most libvirt work in this area.

Well then we need a way to do capability probing for qemu-img. And also
libvirt will then need a way to override the qemu-img binary on a per-VM
basis as we do with qemu binaries, so that users can switch to a more
capable binary.

If that will not be possible, block jobs for a VM might be impossible if
an old qemu-img is installed system-wide.

Also I really doubt it will save any qemu work when forcing us to do it
via qemu-img, since qemu-img still needs to be improved to support e.g.
creating an image on a multi-host gluster server [1], which currently is
not possible (basically we need the same capabilities as when opening an
image).


Peter


[1] You could argue, that if you select the properly functioning volfile
server from the list of servers configured by the user only one is
necessary for creating the image. Libvirt has no way of nowing which one
works though, so let's just have gluster figure it out by themselves.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] Raw notes from a small block layer/QAPI/something pre-christmas meeting
  2017-12-15 16:38 [Qemu-devel] Raw notes from a small block layer/QAPI/something pre-christmas meeting Max Reitz
  2017-12-18 10:11 ` Markus Armbruster
  2017-12-20 11:11 ` [Qemu-devel] " Daniel P. Berrange
@ 2017-12-22 11:01 ` Vladimir Sementsov-Ogievskiy
  2 siblings, 0 replies; 12+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2017-12-22 11:01 UTC (permalink / raw)
  To: Max Reitz, Qemu-block; +Cc: qemu-devel@nongnu.org

15.12.2017 19:38, Max Reitz wrote:
> Hi everyone,
>
> Kevin, Markus, and me had a small personal meeting over the last 1.5
> days and discussed a couple of things about the block layer and its QAPI
> entanglements.
>
> Here's a rather rough sketch on what we talked about:
>
>

[...]

> == Single block job for backup/commit/mirror (blockdev-copy) ==
>
> - Special things:
>    - mirror: Ready state and block-job-complete
>    - mirror: Always replaces some node by the target
>    - backup: copy before write and synchronous
>      (mirror/commit are copy after write and asynchronous (except
>       active-mirror))
>    - backup: Has @compressed, that may not work with mirror right now
>      (because block drivers assume you only write once)
>    - commit: Resizes target if smaller than source
>    - commit: Does not share write permissions, maybe because it doesn't
>      want to use a dirty bitmap (a unified job would just use a dirty
>      bitmap so then it could just generally share write permissions)
>    - active commit: Can read from source even if the backing chain is
>      garbage because of the ongoing commit (mirror) job
>      - block_job_add_bdrv() on nearly whole backing chain so that
>        READ_CONSISTENT is not being shared
>    - non-active commit: Write target's (base's) filename into all of
>      source's (top's) overlays as their backing filename
>
> At least some of these differences would require blockdev-copy runtime
> options.
>
> blockdev-copy runtime parameters:
> - everything blockdev-mirror has, plus
> - @compressed? -- could be done as a block filter that converts normal
>    writes to compressed writes
> - Some option to suppress node replacement
>    (in theory you could put block-job-finalize and bdrv_reopen into a
>     transaction to do that replacement yourself)
> - Some option to suppress READY event and complete automatically
>    (compatibility to backup and non-active commit)
> - Options to control the exact copying behavior (before/after write,
>    active/passive)

also, I think, for backup we need options to control the following:
- it there is a write notifier, which waits for copy, if copy fails, 
what to do:
   fail guest write or fail the backup?
- should we handle guest writes when block-job is paused (now they are 
handled)

>    (passive before-write is impossible)
> - Option to resize target if smaller than source
>    (maybe just internal and not visible in the interface)
>    (compatibility to block-commit)
> - Several other commit thingies (like permissions) which can be
>    automatically deduced and set if the target is in source's backing
>    chain
> - Option to specify the target's filename (this is written into all
>    overlays of @to_replace as their backing filename; if omitted, the
>    filename QEMU knows will be used)
>
>



-- 
Best regards,
Vladimir

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Qemu-devel] [Qemu-block] Raw notes from a small block layer/QAPI/something pre-christmas meeting
  2017-12-20 18:15   ` Markus Armbruster
@ 2018-01-08 15:12     ` Peter Krempa
  0 siblings, 0 replies; 12+ messages in thread
From: Peter Krempa @ 2018-01-08 15:12 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Daniel P. Berrange, qemu-devel@nongnu.org, Qemu-block, Max Reitz

[-- Attachment #1: Type: text/plain, Size: 2409 bytes --]

On Wed, Dec 20, 2017 at 19:15:55 +0100, Markus Armbruster wrote:
> "Daniel P. Berrange" <berrange@redhat.com> writes:
> 
> > On Fri, Dec 15, 2017 at 05:38:00PM +0100, Max Reitz wrote:
> >
> >> Image creation in qemu-system-* vs. qemu-img:
> >>   In order to get proper introspection for qemu-img create, we need a
> >>   QAPI schema.  If we have a QAPI schema, we might as well add
> >>   blockdev-create to QMP.
> >>   As long as we do not have a really-none (null, void, ...) machine type
> >>   for qemu-system-*, launching such a process just for creating an image
> >>   will bring quite a bit of overhead (e.g. with -M none -accel qtest).
> >>   However, as for libvirt, this is not exactly a regression since
> >>   libvirt currently cannot create images at all (apart from implicitly
> >>   through drive-mirror etc.).  Further work on voidifying qemu-system-*
> >>   will improve performance.
> >
> > In terms of the I/O operations involved, image creation is a already a
> > pretty slow process, particularly if pre-allocation is used which is
> > common.  So even QEMU's current slow (circa 300ms) startup time is a
> > complete non-issue for image creation IMHO - it'll be dwarfed by the
> > time to actually create the image. 
> >
> >>   On the other side, we can also add QAPI introspection to qemu-img.
> >>   (qemu-img already links to QAPI, so this should not be too hard.)
> >>   qemu-img will also need command-line introspection, though.
> >
> > I figure the qapi-ificiation is the hard & time consuming bit of
> > work. Once that's done exporting it via both qemu-img & qemu-system*
> > is quite straighforward.
> 
> qemu-system-*: trivial.
> qemu-img, via command line: straightforward
> qemu-img, via QMP: more difficult, since QMP is entangled with HMP,
>     character devices, ...
> 
> If libvirt really wants to use QMP for the job, *and* doesn't want to
> use the qemu-system-* that's running a guest, the easy solution is
> running another qemu-system-* without a guest.

QMP is necessary to have for very-long operations (blockdev-mirror), but
for image creation the command line will be enough. Provided that
interface for -blockdev and "create image" will be similar enough.

Especially the fact that qemu-img create does not really like 'json:{}'
thus some image options are impossible to pass (multiple hosts for
gluster, ...).

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2018-01-08 15:13 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2017-12-15 16:38 [Qemu-devel] Raw notes from a small block layer/QAPI/something pre-christmas meeting Max Reitz
2017-12-18 10:11 ` Markus Armbruster
2017-12-20 10:44   ` [Qemu-devel] [Qemu-block] " Kashyap Chamarthy
2017-12-20 10:57     ` Daniel P. Berrange
2017-12-20 11:29       ` Kashyap Chamarthy
2017-12-20 13:33       ` Kevin Wolf
2017-12-20 13:40         ` Daniel P. Berrange
2017-12-21 12:04           ` Peter Krempa
2017-12-20 11:11 ` [Qemu-devel] " Daniel P. Berrange
2017-12-20 18:15   ` Markus Armbruster
2018-01-08 15:12     ` [Qemu-devel] [Qemu-block] " Peter Krempa
2017-12-22 11:01 ` [Qemu-devel] " Vladimir Sementsov-Ogievskiy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).