Re: [PATCH v8] introduce vfio-user protocol specification

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

From: John Levon <john.levon@nutanix.com>
To: Stefan Hajnoczi <stefanha@redhat.com>
Cc: "benjamin.walker@intel.com" <benjamin.walker@intel.com>,
	John G Johnson <john.g.johnson@oracle.com>,
	Swapnil Ingle <swapnil.ingle@nutanix.com>,
	"Michael S . Tsirkin" <mst@redhat.com>,
	Jason Wang <jasowang@redhat.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	John Levon <levon@movementarian.org>,
	Kirti Wankhede <kwankhede@nvidia.com>,
	Gerd Hoffmann <kraxel@redhat.com>,
	"tina.zhang@intel.com" <tina.zhang@intel.com>,
	"jag.raman@oracle.com" <jag.raman@oracle.com>,
	"james.r.harris@intel.com" <james.r.harris@intel.com>,
	Elena Ufimtseva <elena.ufimtseva@oracle.com>,
	"Kanth.Ghatraju@oracle.com" <Kanth.Ghatraju@oracle.com>,
	Felipe Franciosi <felipe@nutanix.com>,
	"marcandre.lureau@redhat.com" <marcandre.lureau@redhat.com>,
	Christophe de Dinechin <cdupontd@redhat.com>,
	Yan Zhao <yan.y.zhao@intel.com>,
	"konrad.wilk@oracle.com" <konrad.wilk@oracle.com>,
	"yuvalkashtan@gmail.com" <yuvalkashtan@gmail.com>,
	"dgilbert@redhat.com" <dgilbert@redhat.com>,
	"ismael@linux.com" <ismael@linux.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	"changpeng.liu@intel.com" <changpeng.liu@intel.com>,
	"tomassetti.andrea@gmail.com" <tomassetti.andrea@gmail.com>,
	"mpiszczek@ddn.com" <mpiszczek@ddn.com>,
	Cornelia Huck <cohuck@redhat.com>,
	"alex.williamson@redhat.com" <alex.williamson@redhat.com>,
	Raphael Norwitz <raphael.norwitz@nutanix.com>,
	"xiuchun.lu@intel.com" <xiuchun.lu@intel.com>,
	Thanos Makatos <thanos.makatos@nutanix.com>
Subject: Re: [PATCH v8] introduce vfio-user protocol specification
Date: Tue, 4 May 2021 14:31:45 +0000	[thread overview]
Message-ID: <20210504143141.GA1078723@sent> (raw)
In-Reply-To: <YJFRcdvcOlwm2bd7@stefanha-x1.localdomain>

On Tue, May 04, 2021 at 02:51:45PM +0100, Stefan Hajnoczi wrote:

> On Wed, Apr 14, 2021 at 04:41:22AM -0700, Thanos Makatos wrote:
> > This patch introduces the vfio-user protocol specification (formerly
> > known as VFIO-over-socket), which is designed to allow devices to be

Thanks for your review so far Stefan!

We'll make sure to respond to all your comments as needed, so this is just a
partial response.

> > +Socket Disconnection Behavior
> > +-----------------------------
> > +The server and the client can disconnect from each other, either intentionally
> > +or unexpectedly. Both the client and the server need to know how to handle such
> > +events.
> > +
> > +Server Disconnection
> > +^^^^^^^^^^^^^^^^^^^^
> > +A server disconnecting from the client may indicate that:
> > +
> > +1) A virtual device has been restarted, either intentionally (e.g. because of a
> > +   device update) or unintentionally (e.g. because of a crash).
> > +2) A virtual device has been shut down with no intention to be restarted.
> > +
> > +It is impossible for the client to know whether or not a failure is
> > +intermittent or innocuous and should be retried, therefore the client should
> > +reset the VFIO device when it detects the socket has been disconnected.
> > +Error recovery will be driven by the guest's device error handling
> > +behavior.
> > +
> > +Client Disconnection
> > +^^^^^^^^^^^^^^^^^^^^
> > +The client disconnecting from the server primarily means that the client
> > +has exited. Currently, this means that the guest is shut down so the device is
> > +no longer needed therefore the server can automatically exit. However, there
> > +can be cases where a client disconnection should not result in a server exit:
> > +
> > +1) A single server serving multiple clients.
> > +2) A multi-process QEMU upgrading itself step by step, which is not yet
> > +   implemented.
> > +
> > +Therefore in order for the protocol to be forward compatible the server should
> > +take no action when the client disconnects. If anything happens to the client
> > +the control stack will know about it and can clean up resources
> > +accordingly.
> 
> Also, hot unplug?
> 
> Does anything need to be said about mmaps and file descriptors on
> disconnected? I guess they need to be cleaned up and are not retained
> for future reconnection?

Yes. We're still iterating on the implications of these scenarios and trying to
figure out the right approaches, so a lot of this is still very much
under-specified.

> Are there rules for avoiding deadlock between client->server and
> server->client messages? For example, the client sends
> VFIO_USER_REGION_WRITE and the server sends VFIO_USER_VM_INTERRUPT
> before replying to the write message.
> 
> Multi-threaded clients and servers could end up deadlocking if messages
> are processed while polling threads handle other device activity (e.g.
> I/O requests that cause DMA messages).
> 
> Pipelining has the nice effect that the oldest message must complete
> before the next pipelined message starts. It imposes a maximum issue
> depth of 1. Still, it seems like it would be relatively easy to hit
> re-entrancy or deadlock issues since both the client and the server can
> initiate messages and may need to wait for a response.

It's certainly the case that implementation-wise right now these are issues, at
least on the library side. I think I'm probably OK with requiring responses to
be provided prior to async messages like VM_INTERRUPT.

> > +The version data is an optional JSON byte array with the following format:
> 
> RFC 7159 The JavaScript Object Notation section 8.1. Character Encoding
> says:
> 
>   JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32.
> 
> Please indicate the character encoding. I guess it is always UTF-8?

Yes.

> > +| ``"max_fds"``      | number           | Maximum number of file descriptors  |
> > +|                    |                  | the can be received by the sender.  |
> > +|                    |                  | Optional. If not specified then the |
> > +|                    |                  | receiver must assume                |
> > +|                    |                  | ``"max_fds"=1``.                    |
> 
> Maximum per message? Please clarify and consider renaming it to
> max_msg_fds (it's also more consistent with max_msg_size).

Yes, that's a much better name.

> > +| Name         | Type   | Description                                   |
> > ++==============+========+===============================================+
> > +| ``"pgsize"`` | number | Page size of dirty pages bitmap. The smallest |
> > +|              |        | between the client and the server is used.    |
> > ++--------------+--------+-----------------------------------------------+
> 
> "in bytes"?

We'll add to the prelude that all memory sizes are in byte units unless
specified otherwise.

> > +Versioning and Feature Support
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +Upon establishing a connection, the client must send a VFIO_USER_VERSION message
> > +proposing a protocol version and a set of capabilities. The server compares
> > +these with the versions and capabilities it supports and sends a
> > +VFIO_USER_VERSION reply according to the following rules.
> > +
> > +* The major version in the reply must be the same as proposed. If the client
> > +  does not support the proposed major, it closes the connection.
> > +* The minor version in the reply must be equal to or less than the minor
> > +  version proposed.
> > +* The capability list must be a subset of those proposed. If the server
> > +  requires a capability the client did not include, it closes the connection.
> 
> Does the server echo back all capabilities it has accepted so the client
> can still close the connection if it sees the server didn't accept a
> capability?

Yes, if a client *requires* a cap that the server doesn't report back, it will
be missing from the server response JSON. The spec needs more detail here.

The response reflects the server's state. If a server doesn't support a cap, it
won't appear in the server's response at all. If a client doesn't support a cap,
it *will* appear in the server's response anyway.

The values for each capability have cap-specific semantics. For example, for
max_msg_fds, the client->server JSON lets a client give a maximum number of fd's
supported in server->client messages. The server's response JSON, in turn, lets
a server give a maximum number of fd's supported in client->server messages.

migration::pgsize is a min() function instead as you can see above, etc.

> By the way, this DMA mapping design is the eager mapping approach.  Another
> approach is the lazy mapping approach where the server requests translations
> as necessary. The advantage is that the client does not have to send each
> mapping to the server. In the case of VFIO_USER_DMA_READ/WRITE no mappings
> need to be sent at all. Only mmaps need mapping messages.

Are you arguing that we should implement this? It would non-trivially complicate
the implementations on the server-side, where the library "owns" the mappings
logic, but an API user is responsible for doing actual read/writes.

> How do potentially large messages work around max_msg_size? It is hard
> for the client/server to anticipate the maximum message size that will
> be required ahead of time, so they can't really know if they will hit a
> situation where max_msg_size is too low.

Are there specific messages you're worried about? would it help to add a
specification stipulation as to minimum size that clients and servers must
support?

Ultimately the max msg size exists solely to ease implementation: with a
reasonable fixed size, we can always consume the entire data in one go, rather
than doing partial reads. Obviously that needs a limit to avoid unbounded
allocations.

> > +VFIO_USER_DEVICE_GET_INFO
> > +-------------------------
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+----------------------------+
> > +| Name         | Value                      |
> > ++==============+============================+
> > +| Message ID   | <ID>                       |
> > ++--------------+----------------------------+
> > +| Command      | 4                          |
> > ++--------------+----------------------------+
> > +| Message size | 32                         |
> > ++--------------+----------------------------+
> > +| Flags        | Reply bit set in reply     |
> > ++--------------+----------------------------+
> > +| Error        | 0/errno                    |
> > ++--------------+----------------------------+
> > +| Device info  | VFIO device info           |
> > ++--------------+----------------------------+
> > +
> > +This command message is sent by the client to the server to query for basic
> > +information about the device. The VFIO device info structure is defined in
> > +``<linux/vfio.h>`` (``struct vfio_device_info``).
> 
> Wait, "VFIO device info format" below is missing the cap_offset field,
> so it's exactly not the same as <linux/vfio.h>?

We had to move away from directly consuming struct vfio_device_info when
cap_offset was added. Generally trying to use vfio.h at all seems like a bad
idea. That's an implementation thing, but this was a dangling reference we need
to clean up.

regards
john

next prev parent reply	other threads:[~2021-05-04 14:35 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-14 11:41 [PATCH v8] introduce vfio-user protocol specification Thanos Makatos
2021-04-26 15:48 ` Stefan Hajnoczi
2021-04-27 12:02   ` Thanos Makatos
2021-04-27 15:01     ` Stefan Hajnoczi
2021-05-04 13:51 ` Stefan Hajnoczi
2021-05-04 14:31   ` John Levon [this message]
2021-05-05 15:51     ` Stefan Hajnoczi
2021-06-14  9:57     ` Thanos Makatos
2021-05-05 16:19   ` John Levon
2021-05-06  8:49     ` Stefan Hajnoczi
2021-05-07 16:10     ` Thanos Makatos
2021-06-14 10:07   ` Thanos Makatos
2021-05-10 16:57 ` Stefan Hajnoczi
2021-05-10 22:25   ` John Levon
2021-05-11 10:09     ` Stefan Hajnoczi
2021-05-11 10:43       ` John Levon
2021-05-11 15:40         ` Stefan Hajnoczi
2021-05-12  5:08     ` John Johnson
2021-05-19 21:08 ` Alex Williamson
2021-05-19 22:38   ` John Levon
2021-06-14  9:47     ` Thanos Makatos

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210504143141.GA1078723@sent \
    --to=john.levon@nutanix.com \
    --cc=Kanth.Ghatraju@oracle.com \
    --cc=alex.williamson@redhat.com \
    --cc=benjamin.walker@intel.com \
    --cc=cdupontd@redhat.com \
    --cc=changpeng.liu@intel.com \
    --cc=cohuck@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=elena.ufimtseva@oracle.com \
    --cc=felipe@nutanix.com \
    --cc=ismael@linux.com \
    --cc=jag.raman@oracle.com \
    --cc=james.r.harris@intel.com \
    --cc=jasowang@redhat.com \
    --cc=john.g.johnson@oracle.com \
    --cc=konrad.wilk@oracle.com \
    --cc=kraxel@redhat.com \
    --cc=kwankhede@nvidia.com \
    --cc=levon@movementarian.org \
    --cc=marcandre.lureau@redhat.com \
    --cc=mpiszczek@ddn.com \
    --cc=mst@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=raphael.norwitz@nutanix.com \
    --cc=stefanha@redhat.com \
    --cc=swapnil.ingle@nutanix.com \
    --cc=thanos.makatos@nutanix.com \
    --cc=tina.zhang@intel.com \
    --cc=tomassetti.andrea@gmail.com \
    --cc=xiuchun.lu@intel.com \
    --cc=yan.y.zhao@intel.com \
    --cc=yuvalkashtan@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).