Re: [PATCH v2 2/6] vhost-user-blk: Don't reconnect during initialisation

From: "Michael S. Tsirkin" <mst@redhat.com>
To: Kevin Wolf <kwolf@redhat.com>
Cc: den-plotnikov@yandex-team.ru, qemu-devel@nongnu.org,
	qemu-block@nongnu.org, raphael.norwitz@nutanix.com
Subject: Re: [PATCH v2 2/6] vhost-user-blk: Don't reconnect during initialisation
Date: Tue, 4 May 2021 07:08:58 -0400
Message-ID: <20210504070518-mutt-send-email-mst@kernel.org>
In-Reply-To: <YJEomcoHdjFlSwC1@merkur.fritz.box>

On Tue, May 04, 2021 at 12:57:29PM +0200, Kevin Wolf wrote:
> On 04.05.2021 at 11:44, Michael S. Tsirkin wrote:
> > On Tue, May 04, 2021 at 11:27:12AM +0200, Kevin Wolf wrote:
> > > On 04.05.2021 at 10:59, Michael S. Tsirkin wrote:
> > > > On Thu, Apr 29, 2021 at 07:13:12PM +0200, Kevin Wolf wrote:
> > > > > This is a partial revert of commits 77542d43149 and bc79c87bcde.
> > > > > 
> > > > > Usually, an error during initialisation means that the configuration was
> > > > > wrong. Reconnecting won't make the error go away, but just turn the
> > > > > error condition into an endless loop. Avoid this and return errors
> > > > > again.
> > > > 
> > > > So there are several possible reasons for an error:
> > > > 
> > > > 1. remote restarted - we would like to reconnect,
> > > >    this was the original use-case for reconnect.
> > > > 
> > > >    I am not very happy that we are killing this usecase.
> > > 
> > > This patch is killing it only during initialisation, where it's quite
> > > unlikely compared to other cases and where the current implementation is
> > > rather broken. So reverting the broken feature and going back to a
> > > simpler correct state feels like a good idea to me.
> > > 
> > > The idea is to add the "retry during initialisation" feature back on top
> > > of this, but it requires some more changes in the error paths so that we
> > > can actually distinguish different kinds of errors and don't retry when
> > > we already know that it can't succeed.
> > 
> > Okay ... let's make all this explicit in the commit log though, ok?
> 
> That's fair, I'll add a paragraph addressing this case when merging the
> series, like this:
> 
>     Note that this removes the ability to reconnect during
>     initialisation (but not during operation) when there is no permanent
>     error, but the backend restarts, as the implementation was buggy.
>     This feature can be added back in a follow-up series after changing
>     error paths to distinguish cases where retrying could help from
>     cases with permanent errors.
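
To make "distinguish cases where retrying could help from cases with
permanent errors" concrete, here is a minimal sketch. It is not taken from
the QEMU tree: the enum, both function names, and the choice of ECONNRESET
and EPIPE as the "backend went away" errors are assumptions for
illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical classification of an initialisation result. */
typedef enum {
    INIT_ERR_NONE,       /* initialisation succeeded */
    INIT_ERR_TRANSIENT,  /* connection dropped, e.g. the backend restarted */
    INIT_ERR_PERMANENT,  /* bad configuration or an error raised locally */
} InitErrorKind;

/* Classify a return value that follows the 0 / -errno convention. */
static InitErrorKind classify_init_error(int ret)
{
    assert(ret <= 0);
    if (ret == 0) {
        return INIT_ERR_NONE;
    }
    switch (-ret) {
    case ECONNRESET:
    case EPIPE:
        /* Assumption: only a dropped connection is worth retrying. */
        return INIT_ERR_TRANSIENT;
    default:
        return INIT_ERR_PERMANENT;
    }
}

/* Retrying only makes sense for transient errors. */
static bool init_error_is_retryable(int ret)
{
    return classify_init_error(ret) == INIT_ERR_TRANSIENT;
}
```

With something like this in the error paths, a configuration error such as
-EINVAL would be reported straight away instead of feeding a reconnect loop.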
> 
> > > > 2. qemu detected an error and closed the connection
> > > >    looks like we try to handle that by reconnect,
> > > >    this is something we should address.
> > > 
> > > Yes, if qemu produces the error locally, retrying is useless.
> > > 
> > > > 3. remote failed due to a bad command from qemu.
> > > >    this usecase isn't well supported at the moment.
> > > > 
> > > >    How about supporting it on the remote side? I think that if the
> > > >    data is well-formed but describes a configuration the remote
> > > >    cannot support, then instead of closing the connection the remote
> > > >    can wait for commands with need_reply set and respond with an
> > > >    error. Or at least do so if VHOST_USER_PROTOCOL_F_REPLY_ACK has
> > > >    been negotiated. If VHOST_USER_SET_VRING_ERR is used then
> > > >    signalling that fd might also be reasonable.
> > > > 
> > > >    OTOH if qemu is buggy and sends malformed data and the remote
> > > >    detects that, then having qemu retry forever is ok, and might
> > > >    actually be beneficial for debugging.
> > > 
> > > I haven't really checked this case yet, it seems to be less common.
> > > Explicitly communicating an error is certainly better than just cutting
> > > the connection. But as you say, it means QEMU is buggy, so blindly
> > > retrying in this case is kind of acceptable.
> > > 
> > > Raphael suggested that we could limit the number of retries during
> > > initialisation so that it wouldn't result in a hang at least.
> > 
> > Not sure how I feel about random limits ... how would we set the
> > limit?
> 
> To be honest, probably even 1 would already be good enough in practice.
> Make it 5 or something and you definitely cover any realistic case when
> there is no bug involved.
> 
> Even hitting this case once requires bad luck with the timing, so that
> the restart of the backend coincides with already having connected to
> the socket, but not completed the configuration yet, which is a really
> short window. Having the backend drop the connection again in the same
> short window on the second attempt is an almost sure sign of a bug with
> one of the operations done during initialisation.
> 
> Even if this corner case turned out to be a bit less unlikely to happen
> than I'm thinking (which is, it won't happen at all), randomly failing a
> device-add once in a while still feels a lot better than hanging the VM
> once in a while.
> 
> Kevin
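
For reference, the retry limit discussed above could look roughly like
this. It is a sketch under assumptions, not the vhost-user-blk code: the
callback type, the function names and the demo backend are hypothetical,
the limit of 5 is just the figure floated above, and treating only
-ECONNRESET as retryable is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define INIT_MAX_ATTEMPTS 5 /* the "5 or something" suggested above */

/* One initialisation attempt: 0 on success, -errno on failure. */
typedef int (*InitAttemptFn)(void *opaque);

/* Retry a dropped connection up to max_attempts times, then give up. */
static int init_with_retry_limit(InitAttemptFn attempt, void *opaque,
                                 int max_attempts)
{
    int ret = -EINVAL;

    assert(attempt != NULL && max_attempts > 0);
    for (int i = 0; i < max_attempts; i++) {
        ret = attempt(opaque);
        if (ret != -ECONNRESET) {
            break; /* success, or an error that retrying cannot fix */
        }
    }
    return ret;
}

/* Demo backend: drops the connection drops_left times, then succeeds. */
typedef struct {
    int drops_left;
    int calls;
} FlakyBackend;

static int flaky_attempt(void *opaque)
{
    FlakyBackend *b = opaque;

    b->calls++;
    if (b->drops_left > 0) {
        b->drops_left--;
        return -ECONNRESET;
    }
    return 0;
}
```

A backend that drops the connection once succeeds on the second attempt; one
that keeps dropping it makes device-add fail after five attempts instead of
hanging.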

Well, if the backend is e.g. just stuck and the connection does not close,
then the VM hangs anyway. So IMHO it's not such a big deal. If we really
want to address this we should handle all of it asynchronously: make
device-add succeed and then progress in stages without blocking the
monitor. That would be nice, but it's a big change in the code.
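
To illustrate what "progress in stages" could mean, a very rough sketch:
the stage names and the step function are hypothetical and do not
correspond to anything in the current code. Each call handles one completed
event and returns immediately, so nothing waits on the backend.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical initialisation stages for a staged device-add. */
typedef enum {
    STAGE_CONNECT,    /* waiting for the socket connection */
    STAGE_NEGOTIATE,  /* waiting for feature negotiation to finish */
    STAGE_CONFIGURE,  /* waiting for the device config from the backend */
    STAGE_READY,      /* device fully initialised */
    STAGE_FAILED,     /* initialisation failed; report to the user */
} InitStage;

typedef struct {
    InitStage stage;
} StagedInit;

/*
 * Advance by at most one stage per event and return without blocking.
 * event_ok says whether the step being waited on completed successfully.
 */
static InitStage staged_init_step(StagedInit *s, bool event_ok)
{
    assert(s != NULL);
    switch (s->stage) {
    case STAGE_READY:
    case STAGE_FAILED:
        break; /* terminal states: nothing left to do */
    case STAGE_CONNECT:
        s->stage = event_ok ? STAGE_NEGOTIATE : STAGE_FAILED;
        break;
    case STAGE_NEGOTIATE:
        s->stage = event_ok ? STAGE_CONFIGURE : STAGE_FAILED;
        break;
    case STAGE_CONFIGURE:
        s->stage = event_ok ? STAGE_READY : STAGE_FAILED;
        break;
    }
    return s->stage;
}
```

The step would be driven from socket or timer callbacks in the main loop; a
stuck backend then leaves the device in an intermediate stage rather than
hanging the monitor.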

-- 
MST
