qemu-devel.nongnu.org archive mirror
From: Kevin Wolf <kwolf@redhat.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: den-plotnikov@yandex-team.ru, qemu-devel@nongnu.org,
	qemu-block@nongnu.org, raphael.norwitz@nutanix.com
Subject: Re: [PATCH v2 2/6] vhost-user-blk: Don't reconnect during initialisation
Date: Tue, 4 May 2021 11:27:12 +0200
Message-ID: <YJETcFAyQUHB13N6@merkur.fritz.box>
In-Reply-To: <20210504044050-mutt-send-email-mst@kernel.org>

On 04.05.2021 at 10:59, Michael S. Tsirkin wrote:
> On Thu, Apr 29, 2021 at 07:13:12PM +0200, Kevin Wolf wrote:
> > This is a partial revert of commits 77542d43149 and bc79c87bcde.
> > 
> > Usually, an error during initialisation means that the configuration was
> > wrong. Reconnecting won't make the error go away, but just turn the
> > error condition into an endless loop. Avoid this and return errors
> > again.
> 
> So there are several possible reasons for an error:
> 
> 1. remote restarted - we would like to reconnect,
>    this was the original use-case for reconnect.
> 
>    I am not very happy that we are killing this usecase.

This patch is killing it only during initialisation, where it's quite
unlikely compared to other cases and where the current implementation is
rather broken. So reverting the broken feature and going back to a
simpler correct state feels like a good idea to me.

The idea is to add the "retry during initialisation" feature back on top
of this, but it requires some more changes in the error paths so that we
can actually distinguish different kinds of errors and don't retry when
we already know that it can't succeed.
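
To make "distinguish different kinds of errors" a bit more concrete, here
is a rough sketch of what such a classification could look like. This is
not code from the series: the enum, the function names and the choice of
errno values that count as "connection lost" are all assumptions made for
illustration only.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical error classes; these names are not taken from QEMU. */
typedef enum {
    INIT_ERR_NONE,        /* no error */
    INIT_ERR_LOCAL,       /* detected locally, e.g. a bad configuration */
    INIT_ERR_CONNECTION,  /* the connection to the backend was lost */
} InitErrorClass;

/*
 * Map a return value (0 or a negative errno) to an error class.
 * Which errno values indicate a lost connection is an assumption
 * of this sketch, not something the real error paths guarantee.
 */
InitErrorClass classify_init_error(int ret)
{
    assert(ret <= 0);

    switch (-ret) {
    case 0:
        return INIT_ERR_NONE;
    case ECONNRESET:
    case EPIPE:
    case ENOTCONN:
        return INIT_ERR_CONNECTION;
    default:
        return INIT_ERR_LOCAL;
    }
}

/* Only a lost connection can plausibly be cured by reconnecting. */
bool init_error_is_retryable(int ret)
{
    return classify_init_error(ret) == INIT_ERR_CONNECTION;
}
```

With something like this, init_error_is_retryable(-EINVAL) would be false
and the error would be returned to the caller, while -ECONNRESET could
still trigger a reconnect.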

> 2. qemu detected an error and closed the connection
>    looks like we try to handle that by reconnect,
>    this is something we should address.

Yes, if qemu produces the error locally, retrying is useless.

> 3. remote failed due to a bad command from qemu.
>    this usecase isn't well supported at the moment.
> 
>    How about supporting it on the remote side? I think that if the
>    data is well-formed but just has a configuration the remote cannot
>    support, then instead of closing the connection, the remote can wait
>    for commands with need_reply set, and respond with an error. Or at
>    least do it if VHOST_USER_PROTOCOL_F_REPLY_ACK has been negotiated.
>    If VHOST_USER_SET_VRING_ERR is used then signalling that fd might
>    also be reasonable.
> 
>    OTOH if qemu is buggy and sends malformed data and remote detects
>    that then having qemu retry forever is ok, might actually be
>    beneficial for debugging.

I haven't really checked this case yet; it seems to be less common.
Explicitly communicating an error is certainly better than just cutting
the connection. But as you say, it means QEMU is buggy, so blindly
retrying in this case is kind of acceptable.
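
For illustration, the backend-side decision described in point 3 could be
sketched roughly as below. The enum and function are hypothetical and not
taken from any existing backend. The two bit values are what I believe the
vhost-user specification defines for VHOST_USER_PROTOCOL_F_REPLY_ACK and
the need_reply header flag, but they should be checked against the spec
before relying on them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values; verify against the vhost-user specification. */
#define PROTOCOL_F_REPLY_ACK  3          /* protocol feature bit number */
#define HDR_FLAG_NEED_REPLY   (1u << 3)  /* flag in the request header */

/* Hypothetical actions for a well-formed but unsupported request. */
typedef enum {
    BACKEND_CLOSE_CONNECTION,  /* no way to report the error: hang up */
    BACKEND_DEFER_ERROR,       /* remember it, wait for a need_reply request */
    BACKEND_REPLY_ERROR,       /* answer this request with a non-zero status */
} BackendAction;

BackendAction on_unsupported_request(uint64_t protocol_features,
                                     uint32_t hdr_flags)
{
    bool reply_ack = protocol_features & (1ULL << PROTOCOL_F_REPLY_ACK);
    bool need_reply = hdr_flags & HDR_FLAG_NEED_REPLY;

    if (!reply_ack) {
        /* Without REPLY_ACK there is no channel to report the failure. */
        return BACKEND_CLOSE_CONNECTION;
    }
    return need_reply ? BACKEND_REPLY_ERROR : BACKEND_DEFER_ERROR;
}
```

The point of the sketch is only that the connection stays open whenever
the error can be communicated explicitly.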

Raphael suggested that we could limit the number of retries during
initialisation so that it wouldn't result in a hang at least.
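
A bounded retry loop along these lines might look like the sketch below.
It is not taken from any patch: the callback type, the function names and
the toy backend are made up for illustration, and treating only
-ECONNRESET as "connection lost" is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical callback: returns 0 on success or a negative errno. */
typedef int (*InitAttemptFn)(void *opaque);

/*
 * Call attempt() until it succeeds, fails with something other than a
 * lost connection, or max_attempts tries have been used up. Returns the
 * result of the last attempt.
 */
int init_with_bounded_retries(InitAttemptFn attempt, void *opaque,
                              unsigned max_attempts)
{
    int ret = -EINVAL;

    assert(attempt != NULL && max_attempts > 0);

    for (unsigned i = 0; i < max_attempts; i++) {
        ret = attempt(opaque);
        if (ret != -ECONNRESET) {
            break;  /* success, or an error that retrying cannot fix */
        }
    }
    return ret;
}

/* Toy backend: loses the connection a fixed number of times. */
typedef struct {
    unsigned failures_left;
    unsigned calls;
} FakeBackend;

int fake_backend_attempt(void *opaque)
{
    FakeBackend *b = opaque;

    b->calls++;
    if (b->failures_left > 0) {
        b->failures_left--;
        return -ECONNRESET;
    }
    return 0;
}
```

A backend that never comes up then makes initialisation fail with an error
after max_attempts tries instead of hanging.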

> > Additionally, calling vhost_user_blk_disconnect() from the chardev event
> > handler could result in use-after-free because none of the
> > initialisation code expects that the device could just go away in the
> > middle. So removing the call fixes crashes in several places.
> > For example, using a num-queues setting that is incompatible with the
> > backend would result in a crash like this (dereferencing dev->opaque,
> > which is already NULL):
> > 
> >  #0  0x0000555555d0a4bd in vhost_user_read_cb (source=0x5555568f4690, condition=(G_IO_IN | G_IO_HUP), opaque=0x7fffffffcbf0) at ../hw/virtio/vhost-user.c:313
> >  #1  0x0000555555d950d3 in qio_channel_fd_source_dispatch (source=0x555557c3f750, callback=0x555555d0a478 <vhost_user_read_cb>, user_data=0x7fffffffcbf0) at ../io/channel-watch.c:84
> >  #2  0x00007ffff7b32a9f in g_main_context_dispatch () at /lib64/libglib-2.0.so.0
> >  #3  0x00007ffff7b84a98 in g_main_context_iterate.constprop () at /lib64/libglib-2.0.so.0
> >  #4  0x00007ffff7b32163 in g_main_loop_run () at /lib64/libglib-2.0.so.0
> >  #5  0x0000555555d0a724 in vhost_user_read (dev=0x555557bc62f8, msg=0x7fffffffcc50) at ../hw/virtio/vhost-user.c:402
> >  #6  0x0000555555d0ee6b in vhost_user_get_config (dev=0x555557bc62f8, config=0x555557bc62ac "", config_len=60) at ../hw/virtio/vhost-user.c:2133
> >  #7  0x0000555555d56d46 in vhost_dev_get_config (hdev=0x555557bc62f8, config=0x555557bc62ac "", config_len=60) at ../hw/virtio/vhost.c:1566
> >  #8  0x0000555555cdd150 in vhost_user_blk_device_realize (dev=0x555557bc60b0, errp=0x7fffffffcf90) at ../hw/block/vhost-user-blk.c:510
> >  #9  0x0000555555d08f6d in virtio_device_realize (dev=0x555557bc60b0, errp=0x7fffffffcff0) at ../hw/virtio/virtio.c:3660
> 
> Right. So that's definitely something to fix.
> 
> > 
> > Signed-off-by: Kevin Wolf <kwolf@redhat.com>

Kevin


