Re: [PATCH v8] introduce vfio-user protocol specification

All of lore.kernel.org
 help / color / mirror / Atom feed

From: John Levon <john.levon@nutanix.com>
To: Stefan Hajnoczi <stefanha@redhat.com>
Cc: "benjamin.walker@intel.com" <benjamin.walker@intel.com>,
	John G Johnson <john.g.johnson@oracle.com>,
	Swapnil Ingle <swapnil.ingle@nutanix.com>,
	"Michael S . Tsirkin" <mst@redhat.com>,
	Jason Wang <jasowang@redhat.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	John Levon <levon@movementarian.org>,
	Kirti Wankhede <kwankhede@nvidia.com>,
	Gerd Hoffmann <kraxel@redhat.com>,
	"tina.zhang@intel.com" <tina.zhang@intel.com>,
	"jag.raman@oracle.com" <jag.raman@oracle.com>,
	"james.r.harris@intel.com" <james.r.harris@intel.com>,
	Elena Ufimtseva <elena.ufimtseva@oracle.com>,
	"Kanth.Ghatraju@oracle.com" <Kanth.Ghatraju@oracle.com>,
	Felipe Franciosi <felipe@nutanix.com>,
	"marcandre.lureau@redhat.com" <marcandre.lureau@redhat.com>,
	Christophe de Dinechin <cdupontd@redhat.com>,
	Yan Zhao <yan.y.zhao@intel.com>,
	"konrad.wilk@oracle.com" <konrad.wilk@oracle.com>,
	"yuvalkashtan@gmail.com" <yuvalkashtan@gmail.com>,
	"dgilbert@redhat.com" <dgilbert@redhat.com>,
	"ismael@linux.com" <ismael@linux.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	"changpeng.liu@intel.com" <changpeng.liu@intel.com>,
	"tomassetti.andrea@gmail.com" <tomassetti.andrea@gmail.com>,
	"mpiszczek@ddn.com" <mpiszczek@ddn.com>,
	Cornelia Huck <cohuck@redhat.com>,
	"alex.williamson@redhat.com" <alex.williamson@redhat.com>,
	Raphael Norwitz <raphael.norwitz@nutanix.com>,
	"xiuchun.lu@intel.com" <xiuchun.lu@intel.com>,
	Thanos Makatos <thanos.makatos@nutanix.com>
Subject: Re: [PATCH v8] introduce vfio-user protocol specification
Date: Tue, 4 May 2021 14:31:45 +0000	[thread overview]
Message-ID: <20210504143141.GA1078723@sent> (raw)
In-Reply-To: <YJFRcdvcOlwm2bd7@stefanha-x1.localdomain>

On Tue, May 04, 2021 at 02:51:45PM +0100, Stefan Hajnoczi wrote:

> On Wed, Apr 14, 2021 at 04:41:22AM -0700, Thanos Makatos wrote:
> > This patch introduces the vfio-user protocol specification (formerly
> > known as VFIO-over-socket), which is designed to allow devices to be

Thanks for your review so far Stefan!

We'll make sure to respond to all your comments as needed, so this is just a
partial response.

> > +Socket Disconnection Behavior
> > +-----------------------------
> > +The server and the client can disconnect from each other, either intentionally
> > +or unexpectedly. Both the client and the server need to know how to handle such
> > +events.
> > +
> > +Server Disconnection
> > +^^^^^^^^^^^^^^^^^^^^
> > +A server disconnecting from the client may indicate that:
> > +
> > +1) A virtual device has been restarted, either intentionally (e.g. because of a
> > +   device update) or unintentionally (e.g. because of a crash).
> > +2) A virtual device has been shut down with no intention to be restarted.
> > +
> > +It is impossible for the client to know whether or not a failure is
> > +intermittent or innocuous and should be retried, therefore the client should
> > +reset the VFIO device when it detects the socket has been disconnected.
> > +Error recovery will be driven by the guest's device error handling
> > +behavior.
> > +
> > +Client Disconnection
> > +^^^^^^^^^^^^^^^^^^^^
> > +The client disconnecting from the server primarily means that the client
> > +has exited. Currently, this means that the guest is shut down so the device is
> > +no longer needed therefore the server can automatically exit. However, there
> > +can be cases where a client disconnection should not result in a server exit:
> > +
> > +1) A single server serving multiple clients.
> > +2) A multi-process QEMU upgrading itself step by step, which is not yet
> > +   implemented.
> > +
> > +Therefore in order for the protocol to be forward compatible the server should
> > +take no action when the client disconnects. If anything happens to the client
> > +the control stack will know about it and can clean up resources
> > +accordingly.
> 
> Also, hot unplug?
> 
> Does anything need to be said about mmaps and file descriptors on
> disconnected? I guess they need to be cleaned up and are not retained
> for future reconnection?

Yes. We're still iterating on the implications of these scenarios and trying to
figure out the right approaches, so a lot of this is still very much
under-specified.

> Are there rules for avoiding deadlock between client->server and
> server->client messages? For example, the client sends
> VFIO_USER_REGION_WRITE and the server sends VFIO_USER_VM_INTERRUPT
> before replying to the write message.
> 
> Multi-threaded clients and servers could end up deadlocking if messages
> are processed while polling threads handle other device activity (e.g.
> I/O requests that cause DMA messages).
> 
> Pipelining has the nice effect that the oldest message must complete
> before the next pipelined message starts. It imposes a maximum issue
> depth of 1. Still, it seems like it would be relatively easy to hit
> re-entrancy or deadlock issues since both the client and the server can
> initiate messages and may need to wait for a response.

It's certainly the case that implementation-wise right now these are issues, at
least on the library side. I think I'm probably OK with requiring responses to
be provided prior to async messages like VM_INTERRUPT.

> > +The version data is an optional JSON byte array with the following format:
> 
> RFC 7159 The JavaScript Object Notation section 8.1. Character Encoding
> says:
> 
>   JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32.
> 
> Please indicate the character encoding. I guess it is always UTF-8?

Yes.

> > +| ``"max_fds"``      | number           | Maximum number of file descriptors  |
> > +|                    |                  | the can be received by the sender.  |
> > +|                    |                  | Optional. If not specified then the |
> > +|                    |                  | receiver must assume                |
> > +|                    |                  | ``"max_fds"=1``.                    |
> 
> Maximum per message? Please clarify and consider renaming it to
> max_msg_fds (it's also more consistent with max_msg_size).

Yes, that's a much better name.

> > +| Name         | Type   | Description                                   |
> > ++==============+========+===============================================+
> > +| ``"pgsize"`` | number | Page size of dirty pages bitmap. The smallest |
> > +|              |        | between the client and the server is used.    |
> > ++--------------+--------+-----------------------------------------------+
> 
> "in bytes"?

We'll add to the prelude that all memory sizes are in byte units unless
specified otherwise.

> > +Versioning and Feature Support
> > +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > +Upon establishing a connection, the client must send a VFIO_USER_VERSION message
> > +proposing a protocol version and a set of capabilities. The server compares
> > +these with the versions and capabilities it supports and sends a
> > +VFIO_USER_VERSION reply according to the following rules.
> > +
> > +* The major version in the reply must be the same as proposed. If the client
> > +  does not support the proposed major, it closes the connection.
> > +* The minor version in the reply must be equal to or less than the minor
> > +  version proposed.
> > +* The capability list must be a subset of those proposed. If the server
> > +  requires a capability the client did not include, it closes the connection.
> 
> Does the server echo back all capabilities it has accepted so the client
> can still close the connection if it sees the server didn't accept a
> capability?

Yes, if a client *requires* a cap that the server doesn't report back, it will
be missing from the server response JSON. The spec needs more detail here.

The response reflects the server's state. If a server doesn't support a cap, it
won't appear in the server's response at all. If a client doesn't support a cap,
it *will* appear in the server's response anyway.

The values for each capability have cap-specific semantics. For example, for
max_msg_fds, the client->server JSON lets a client give a maximum number of fd's
supported in server->client messages. The server's response JSON, in turn, lets
a server give a maximum number of fd's supported in client->server messages.

migration::pgsize is a min() function instead as you can see above, etc.

> By the way, this DMA mapping design is the eager mapping approach.  Another
> approach is the lazy mapping approach where the server requests translations
> as necessary. The advantage is that the client does not have to send each
> mapping to the server. In the case of VFIO_USER_DMA_READ/WRITE no mappings
> need to be sent at all. Only mmaps need mapping messages.

Are you arguing that we should implement this? It would non-trivially complicate
the implementations on the server-side, where the library "owns" the mappings
logic, but an API user is responsible for doing actual read/writes.

> How do potentially large messages work around max_msg_size? It is hard
> for the client/server to anticipate the maximum message size that will
> be required ahead of time, so they can't really know if they will hit a
> situation where max_msg_size is too low.

Are there specific messages you're worried about? would it help to add a
specification stipulation as to minimum size that clients and servers must
support?

Ultimately the max msg size exists solely to ease implementation: with a
reasonable fixed size, we can always consume the entire data in one go, rather
than doing partial reads. Obviously that needs a limit to avoid unbounded
allocations.

> > +VFIO_USER_DEVICE_GET_INFO
> > +-------------------------
> > +
> > +Message format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+----------------------------+
> > +| Name         | Value                      |
> > ++==============+============================+
> > +| Message ID   | <ID>                       |
> > ++--------------+----------------------------+
> > +| Command      | 4                          |
> > ++--------------+----------------------------+
> > +| Message size | 32                         |
> > ++--------------+----------------------------+
> > +| Flags        | Reply bit set in reply     |
> > ++--------------+----------------------------+
> > +| Error        | 0/errno                    |
> > ++--------------+----------------------------+
> > +| Device info  | VFIO device info           |
> > ++--------------+----------------------------+
> > +
> > +This command message is sent by the client to the server to query for basic
> > +information about the device. The VFIO device info structure is defined in
> > +``<linux/vfio.h>`` (``struct vfio_device_info``).
> 
> Wait, "VFIO device info format" below is missing the cap_offset field,
> so it's exactly not the same as <linux/vfio.h>?

We had to move away from directly consuming struct vfio_device_info when
cap_offset was added. Generally trying to use vfio.h at all seems like a bad
idea. That's an implementation thing, but this was a dangling reference we need
to clean up.

regards
john

next prev parent reply	other threads:[~2021-05-04 14:35 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-14 11:41 [PATCH v8] introduce vfio-user protocol specification Thanos Makatos
2021-04-26 15:48 ` Stefan Hajnoczi
2021-04-27 12:02   ` Thanos Makatos
2021-04-27 15:01     ` Stefan Hajnoczi
2021-05-04 13:51 ` Stefan Hajnoczi
2021-05-04 14:31   ` John Levon [this message]
2021-05-05 15:51     ` Stefan Hajnoczi
2021-06-14  9:57     ` Thanos Makatos
2021-05-05 16:19   ` John Levon
2021-05-06  8:49     ` Stefan Hajnoczi
2021-05-07 16:10     ` Thanos Makatos
2021-06-14 10:07   ` Thanos Makatos
2021-05-10 16:57 ` Stefan Hajnoczi
2021-05-10 22:25   ` John Levon
2021-05-11 10:09     ` Stefan Hajnoczi
2021-05-11 10:43       ` John Levon
2021-05-11 15:40         ` Stefan Hajnoczi
2021-05-12  5:08     ` John Johnson
2021-05-19 21:08 ` Alex Williamson
2021-05-19 22:38   ` John Levon
2021-06-14  9:47     ` Thanos Makatos

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210504143141.GA1078723@sent \
    --to=john.levon@nutanix.com \
    --cc=Kanth.Ghatraju@oracle.com \
    --cc=alex.williamson@redhat.com \
    --cc=benjamin.walker@intel.com \
    --cc=cdupontd@redhat.com \
    --cc=changpeng.liu@intel.com \
    --cc=cohuck@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=elena.ufimtseva@oracle.com \
    --cc=felipe@nutanix.com \
    --cc=ismael@linux.com \
    --cc=jag.raman@oracle.com \
    --cc=james.r.harris@intel.com \
    --cc=jasowang@redhat.com \
    --cc=john.g.johnson@oracle.com \
    --cc=konrad.wilk@oracle.com \
    --cc=kraxel@redhat.com \
    --cc=kwankhede@nvidia.com \
    --cc=levon@movementarian.org \
    --cc=marcandre.lureau@redhat.com \
    --cc=mpiszczek@ddn.com \
    --cc=mst@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=raphael.norwitz@nutanix.com \
    --cc=stefanha@redhat.com \
    --cc=swapnil.ingle@nutanix.com \
    --cc=thanos.makatos@nutanix.com \
    --cc=tina.zhang@intel.com \
    --cc=tomassetti.andrea@gmail.com \
    --cc=xiuchun.lu@intel.com \
    --cc=yan.y.zhao@intel.com \
    --cc=yuvalkashtan@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.