From: "Dr. David Alan Gilbert" <dgilbert@redhat.com>
To: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Kevin Wolf <kwolf@redhat.com>, "Denis V. Lunev" <den@openvz.org>,
	qemu-devel@nongnu.org,
	Raushaniya Maksudova <rmaksudova@virtuozzo.com>
Subject: Re: [Qemu-devel] [PATCH 4/5] disk_deadlines: add control of requests time expiration
Date: Mon, 28 Sep 2015 14:55:44 +0100
Message-ID: <20150928135544.GA5478@work-vm>
In-Reply-To: <20150928124200.GH8756@stefanha-thinkpad.redhat.com>

* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Fri, Sep 25, 2015 at 01:34:22PM +0100, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefanha@gmail.com) wrote:
> > > On Tue, Sep 08, 2015 at 04:48:24PM +0200, Kevin Wolf wrote:
> > > > Am 08.09.2015 um 16:23 hat Denis V. Lunev geschrieben:
> > > > > On 09/08/2015 04:05 PM, Kevin Wolf wrote:
> > > > > >Am 08.09.2015 um 13:27 hat Denis V. Lunev geschrieben:
> > > > > >>interesting point. Yes, it flushes all requests and most likely
> > > > > > >>hangs inside waiting for requests to complete. But fortunately
> > > > > >>this happens after the switch to paused state thus
> > > > > >>the guest becomes paused. That's why I have missed this
> > > > > >>fact.
> > > > > >>
> > > > > >>This (could) be considered as a problem but I have no (good)
> > > > > >>solution at the moment. Should think a bit on.
> > > > > >Let me suggest a radically different design. Note that I don't say this
> > > > > >is necessarily how things should be done, I'm just trying to introduce
> > > > > >some new ideas and broaden the discussion, so that we have a larger set
> > > > > >of ideas from which we can pick the right solution(s).
> > > > > >
> > > > > >The core of my idea would be a new filter block driver 'timeout' that
> > > > > >can be added on top of each BDS that could potentially fail, like a
> > > > > >raw-posix BDS pointing to a file on NFS. This way most pieces of the
> > > > > >solution are nicely modularised and don't touch the block layer core.
> > > > > >
> > > > > >During normal operation the driver would just be passing through
> > > > > >requests to the lower layer. When it detects a timeout, however, it
> > > > > >completes the request it received with -ETIMEDOUT. It also completes any
> > > > > >new request it receives with -ETIMEDOUT without passing the request on
> > > > > >until the request that originally timed out returns. This is our safety
> > > > > >measure against anyone seeing whether or how the timed out request
> > > > > >modified data.
> > > > > >
> > > > > >We need to make sure that bdrv_drain() doesn't wait for this request.
> > > > > >Possibly we need to introduce a .bdrv_drain callback that replaces the
> > > > > >default handling, because bdrv_requests_pending() in the default
> > > > > >handling considers bs->file, which would still have the timed out
> > > > > >request. We don't want to see this; bdrv_drain_all() should complete
> > > > > >even though that request is still pending internally (externally, we
> > > > > >returned -ETIMEDOUT, so we can consider it completed). This way the
> > > > > >monitor stays responsive and background jobs can go on if they don't use
> > > > > >the failing block device.
> > > > > >
> > > > > >And then we essentially reuse the rerror/werror mechanism that we
> > > > > >already have to stop the VM. The device models would be extended to
> > > > > >always stop the VM on -ETIMEDOUT, regardless of the error policy. In
> > > > > >this state, the VM would even be migratable if you make sure that the
> > > > > >pending request can't modify the image on the destination host any more.
> > > > > >
> > > > > >Do you think this could work, or did I miss something important?
> > > > > >
> > > > > >Kevin
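
To make the behaviour described above concrete, here is a toy model of
the 'timeout' filter idea. It is only a sketch under stated assumptions:
it is not QEMU code, it uses Python threads where the real block layer
uses coroutines in C, and the names TimeoutFilter, submit and lower are
invented for illustration. It passes requests through, completes a
request with -ETIMEDOUT when the lower layer takes too long, and fails
every new request with -ETIMEDOUT until the stuck one really returns.

```python
import errno
import threading
import time


class TimeoutFilter:
    """Toy model of the proposed 'timeout' filter (hypothetical names)."""

    def __init__(self, lower, timeout):
        self.lower = lower        # callable standing in for the lower layer
        self.timeout = timeout    # seconds before a request counts as stuck
        self.lock = threading.Lock()
        self.stuck = 0            # timed-out requests still pending below

    def submit(self, request):
        with self.lock:
            if self.stuck:
                # Fence: do not pass anything on until the stuck request returns.
                return -errno.ETIMEDOUT

        done = threading.Event()
        state = {"result": None, "finished": False, "timed_out": False}

        def run():
            res = self.lower(request)
            with self.lock:
                state["result"] = res
                state["finished"] = True
                if state["timed_out"]:
                    self.stuck -= 1   # the late completion lifts the fence
            done.set()

        threading.Thread(target=run, daemon=True).start()

        if not done.wait(self.timeout):
            with self.lock:
                if not state["finished"]:
                    state["timed_out"] = True
                    self.stuck += 1
                    return -errno.ETIMEDOUT
        return state["result"]


# Demo: a lower layer that hangs on "slow" until it is released.
release = threading.Event()


def lower(request):
    if request == "slow":
        release.wait()
    return 0


flt = TimeoutFilter(lower, timeout=0.2)
first = flt.submit("slow")    # the lower layer hangs, so this times out
second = flt.submit("fast")   # fenced while the first is still pending
release.set()                 # the stuck request finally returns
while flt.stuck:
    time.sleep(0.01)
third = flt.submit("fast")    # passed through again
print(first, second, third)
```

Note that the model still keeps the stuck request pending internally
after reporting -ETIMEDOUT to the caller, which is exactly the case the
bdrv_drain() paragraph is concerned with.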
> > > > > > could I propose an even more radical solution then?
> > > > > > 
> > > > > > My original approach was based on the fact that
> > > > > > this code should be maintainable out-of-stream.
> > > > > > If the patch is merged - this boundary condition
> > > > > > could be dropped.
> > > > > > 
> > > > > > Why not invent a 'terror' field on BdrvOptions
> > > > > > and process things in the core block layer without
> > > > > > a filter? The RB Tree entry will just not be created if
> > > > > > the policy is set to 'ignore'.
> > > > 
> > > > 'terror' might not be the most fortunate name... ;-)
> > > > 
> > > > The reason why I would prefer a filter driver is so the code and the
> > > > associated data structures are cleanly modularised and we can keep the
> > > > actual block layer core small and clean. The same is true for some other
> > > > functions that I would rather move out of the core into filter drivers
> > > > than add new cases (e.g. I/O throttling, backup notifiers, etc.), but
> > > > which are a bit harder to actually move because we already have old
> > > > interfaces that we can't break (we'll probably do it anyway eventually,
> > > > even if it needs a bit more compatibility code).
> > > > 
> > > > However, it seems that you are mostly touching code that is maintained
> > > > by Stefan, and Stefan used to be a bit more open to adding functionality
> > > > to the core, so my opinion might not be the last word.
> > > 
> > > I've been thinking more about the correctness of this feature:
> > > 
> > > QEMU cannot cancel I/O because there is no Linux userspace API for doing
> > > so.  Linux AIO's io_cancel(2) syscall is a nop since file systems don't
> > > implement a kiocb_cancel_fn.  Sending a signal to a task blocked in
> > > O_DIRECT preadv(2)/pwritev(2) doesn't work either because the task is in
> > > uninterruptible sleep.
> > 
> > There are things that work on some devices, but nothing generic.
> > For NBD/iSCSI/(ceph?) you should be able to issue a shutdown(2) on the socket
> > > that connects to the server and that should cause all existing IO to fail
> > quickly.  Then you could do a drain and be done.    This would
> > be very useful for the fault-tolerant uses (e.g. Wen Congyang's block replication).
> > 
> > There are even ways of killing hard NFS mounts; for example adding
> > > an unreachable route to the NFS server (ip route add unreachable hostname),
> > and then umount -f  seems to cause I/O errors to tasks.   (I can't find
> > a way to do a remount to change the hard flag).  This isn't pretty but
> > it's a reasonable way of getting your host back to useable if one NFS
> > server has died.
> 
> If you just throw away a socket, you don't know the state of the disk
> since some requests may have been handled by the server and others were
> not handled.
> 
> So I doubt these approaches work because cleanly closing a connection
> requires communication between the client and server to determine that
> the connection was closed and which pending requests were completed.
> 
> The trade-off is that the client no longer has DMA buffers that might
> get written to, but now you no longer know the state of the disk!

Right, you don't know what the last successful IOs really were, but if
you know that the NBD/iSCSI/NFS server is dead and is going to need to
get rebooted/replaced anyway, then your current state is that you have
some QEMUs that are running fine except for one disk, but are now very
delicate because anything that tries to do a drain will hang.  There's
no way that you can recover the knowledge of which IOs completed, but
you can recover all your guests for which that device isn't critical.
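
As a rough illustration of the shutdown(2) point, the sketch below shows
a reader blocked on a socket being woken when the client shuts down its
own end. It is only a model under assumptions: a local socketpair stands
in for the TCP connection to an NBD/iSCSI server, the behaviour shown is
what I would expect on Linux, and nothing here is QEMU code.

```python
import socket
import threading

# Stand-in for the client/server connection; a real NBD or iSCSI client
# would hold a TCP socket to the storage server instead.
client, server = socket.socketpair()

result = {}


def blocked_reader():
    # Models an in-flight request waiting on a server that never answers.
    try:
        result["data"] = client.recv(4096)
    except OSError as exc:
        result["error"] = exc


reader = threading.Thread(target=blocked_reader, daemon=True)
reader.start()

# The server is presumed dead, so give up on the connection. shutdown(2)
# wakes the blocked recv() instead of leaving it waiting forever.
client.shutdown(socket.SHUT_RDWR)
reader.join(timeout=5)

finished = not reader.is_alive()
print("reader finished:", finished, result)

client.close()
server.close()
```

The reader comes back with end-of-file or an error, not with any
information about which earlier requests the server actually handled,
which is the trade-off discussed above.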

Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
