Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: Stefan Hajnoczi <stefanha@redhat.com>
To: Francesco Romani <fromani@redhat.com>
Cc: kwolf@redhat.com, Stefan Hajnoczi <stefanha@gmail.com>,
	mdroth@linux.vnet.ibm.com, qemu-devel@nongnu.org,
	lcapitulino@redhat.com
Subject: Re: [Qemu-devel] [RFC][PATCH v2] block: add write threshold reporting for block devices
Date: Wed, 19 Nov 2014 15:52:51 +0000	[thread overview]
Message-ID: <20141119155251.GA16593@stefanha-thinkpad.redhat.com> (raw)
In-Reply-To: <1011194982.313492.1416298332568.JavaMail.zimbra@redhat.com>

[-- Attachment #1: Type: text/plain, Size: 4556 bytes --]

On Tue, Nov 18, 2014 at 03:12:12AM -0500, Francesco Romani wrote:
> > > +static int coroutine_fn before_write_notify(NotifierWithReturn *notifier,
> > > +                                            void *opaque)
> > > +{
> > > +    BdrvTrackedRequest *req = opaque;
> > > +    BlockDriverState *bs = req->bs;
> > > +    int64_t amount = 0;
> > > +
> > > +    assert((req->offset & (BDRV_SECTOR_SIZE - 1)) == 0);
> > > +    assert((req->bytes & (BDRV_SECTOR_SIZE - 1)) == 0);
> > 
> > Does the code still make these assumptions or are the asserts left over
> > from previous versions of the patch?
> 
> It's a leftover.
> I understood they don't hurt and add a bit of safety, but if they are confusing
> I'll remove them.

Yes, it made me wonder why.  Probably best to remove them.

> > > +void bdrv_set_usage_threshold(BlockDriverState *bs, int64_t
> > > threshold_bytes)
> > > +{
> > > +    BlockDriverState *target_bs = bs;
> > > +    if (bs->file) {
> > > +        target_bs = bs->file;
> > > +    }
> > 
> > Hmm...I think now I understand why you are trying to use bs->file.  This
> > is an attempt to make image formats work with the threshold.
> > 
> > Unfortunately the BlockDriverState topology can be more complicated than
> > just 1 level.
> 
> I thought so but I can't reproduce yet more complex topologies.
> Disclosure: I'm testing against the topology I know to be used on oVirt, lacking
> immediate availability of others: suggestions welcome.
> 
> At risk of being redundant, in oVirt the usecase is QCOW2 over lvm block device,
> and we'd like to be notified about the allocation of the lvm block device, which (IIUC)
> is the last bs->file.
> 
> This is a simple topology (unless I'm missing something), and that's
> the reason why I go down just one level.
> 
> Of course I want a general solution for this change, so...

There is a block driver for error injection called "blkdebug" (see
docs/blkdebug.txt).  Here is an example of the following topology:

  raw_bsd (drive0) -> blkdebug -> raw-posix (test.img)

qemu-system-x86_64 -drive if=virtio,format=raw,file.driver=blkdebug,file.image.filename=test.img

The blkdebug driver is interposing between the raw_bsd (drive0) root and
the raw-posix leaf node.

> > If we hardcode a strategy to traverse bs->file then it will work in most
> > cases:
> > 
> >   while (bs->file) {
> >       bs = bs->file;
> >   }
> > 
> > But there are cases like VMDK extent files where a BlockDriverState
> > actually has multiple children.
> > 
> > One way to solve this is to require that the management tool tells QEMU
> > which exact BlockDriverState node the threshold applies to.  Then QEMU
> > doesn't need any hardcoded policy.  But I'm not sure how realistic that
> > it at the moment (whether management tools are uses node names for each
> > node yet), so it may be best to hardcode the bs->file traversal that
> > I've suggested.
> 
> oVirt relies on libvirt[1], and uses device node (e.g. 'vda').
> 
> BTW, I haven't found a way to inspect from the outside (e.g. monitor command) the BlockDriverState
> topology, there is a way I'm missing?

You can get the BlockDriverState and ->backing_hd chain using the
query-block QMP command.  I'm not aware of a command that returns the
full graph of BlockDriverState nodes.

The management tool should not need to inspect the graph because the
graph can change at runtime (e.g. imagine I/O throttling is implemented
as a BlockDriverState node then it could appear/disapper when the
feature is activated/deactivated).  Instead the management tool should
name the nodes it knows about and then use those node names.

> Another issue I don't yet have a proper solution is related to this one.
> 
> The management app will have deal with VM with more than one block device disk, so we need
> a way to map the notification with the corresponding device.
> 
> If we descend the bs->file chain, AFAIU the easiest mapping, being the device name,
> is easily lost because only the outermost BlockDriverState has a device name attached, so when the
> notification trigger
> bdrv_get_device_name() will return NULL

In the worst case a name string can be passed in along with the
threshold values.

> I believe we don't necessarily need a device name in the notification, for example something
> like an opaque cookie set together with the threshold and returned back with the notification
> will suffice. Of course the device name is nicer :)

Agreed.

[-- Attachment #2: Type: application/pgp-signature, Size: 473 bytes --]

next prev parent reply	other threads:[~2014-11-19 15:53 UTC|newest]

Thread overview: 14+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2014-11-07 13:12 [Qemu-devel] [RFC][PATCH v2] add write threshold reporting for block devices Francesco Romani
2014-11-07 13:12 ` [Qemu-devel] [RFC][PATCH v2] block: " Francesco Romani
2014-11-17 16:49   ` Stefan Hajnoczi
2014-11-18  8:12     ` Francesco Romani
2014-11-19 15:52       ` Stefan Hajnoczi [this message]
2014-11-20  8:23         ` Francesco Romani
2014-11-20 10:30     ` Kevin Wolf
2014-11-20 11:04       ` Stefan Hajnoczi
2014-11-20 11:34         ` Kevin Wolf
2014-11-21  8:43           ` Francesco Romani
2014-11-21 10:11             ` Kevin Wolf
2014-11-21 15:32               ` Francesco Romani
2014-11-21 16:24             ` Eric Blake
2014-11-21  0:10         ` Eric Blake

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20141119155251.GA16593@stefanha-thinkpad.redhat.com \
    --to=stefanha@redhat.com \
    --cc=fromani@redhat.com \
    --cc=kwolf@redhat.com \
    --cc=lcapitulino@redhat.com \
    --cc=mdroth@linux.vnet.ibm.com \
    --cc=qemu-devel@nongnu.org \
    --cc=stefanha@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).