* [RFC v5 01/23] vfio-user: introduce vfio-user protocol specification
From: John Johnson @ 2022-05-05 17:19 UTC
To: qemu-devel
From: Thanos Makatos <thanos.makatos@nutanix.com>
This patch introduces the vfio-user protocol specification (formerly
known as VFIO-over-socket), which is designed to allow devices to be
emulated outside QEMU, in a separate process. vfio-user reuses the
existing VFIO defines, structs and concepts.
It was previously discussed as an RFC in:
"RFC: use VFIO over a UNIX domain socket to implement device offloading"
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Thanos Makatos <thanos.makatos@nutanix.com>
Signed-off-by: John Levon <john.levon@nutanix.com>
---
docs/devel/index-internals.rst | 1 +
docs/devel/vfio-user.rst | 1453 ++++++++++++++++++++++++++++++++++++++++
MAINTAINERS | 6 +
3 files changed, 1460 insertions(+)
create mode 100644 docs/devel/vfio-user.rst
diff --git a/docs/devel/index-internals.rst b/docs/devel/index-internals.rst
index a50889c..c71a79e 100644
--- a/docs/devel/index-internals.rst
+++ b/docs/devel/index-internals.rst
@@ -17,4 +17,5 @@ Details about QEMU's various subsystems including how to add features to them.
s390-dasd-ipl
tracing
vfio-migration
+ vfio-user
writing-monitor-commands
diff --git a/docs/devel/vfio-user.rst b/docs/devel/vfio-user.rst
new file mode 100644
index 0000000..0fb1e33
--- /dev/null
+++ b/docs/devel/vfio-user.rst
@@ -0,0 +1,1453 @@
+.. include:: <isonum.txt>
+********************************
+vfio-user Protocol Specification
+********************************
+
+--------------
+Version_ 0.9.1
+--------------
+
+.. contents:: Table of Contents
+
+Introduction
+============
+vfio-user is a protocol that allows a device to be emulated in a separate
+process outside of a Virtual Machine Monitor (VMM). vfio-user devices consist
+of a generic VFIO device type, living inside the VMM, which we call the client,
+and the core device implementation, living outside the VMM, which we call the
+server.
+
+The vfio-user specification is partly based on the
+`Linux VFIO ioctl interface <https://www.kernel.org/doc/html/latest/driver-api/vfio.html>`_.
+
+VFIO is a mature and stable API, backed by an extensively used framework. The
+existing VFIO client implementation in QEMU (``qemu/hw/vfio/``) can be largely
+re-used, though there is nothing in this specification that requires that
+particular implementation. None of the VFIO kernel modules are required for
+supporting the protocol, on either the client or server side. Some source
+definitions in VFIO are re-used for vfio-user.
+
+The main idea is to allow a virtual device to function in a separate process in
+the same host over a UNIX domain socket. A UNIX domain socket (``AF_UNIX``) is
+chosen because file descriptors can be trivially sent over it, which in turn
+allows:
+
+* Sharing of client memory for DMA with the server.
+* Sharing of server memory with the client for fast MMIO.
+* Efficient sharing of eventfds for triggering interrupts.
+
+Other socket types could be used which allow the server to run in a separate
+guest in the same host (``AF_VSOCK``) or remotely (``AF_INET``). Theoretically
+the underlying transport does not necessarily have to be a socket; however, we
+do not examine such alternatives. In this protocol version we focus on using a UNIX
+domain socket and introduce basic support for the other two types of sockets
+without considering performance implications.
+
+While passing of file descriptors is desirable for performance reasons, support
+is not necessary for either the client or the server in order to implement the
+protocol. There is always an in-band, message-passing fallback mechanism.
+
+Overview
+========
+
+VFIO is a framework that allows a physical device to be securely passed through
+to a user space process; the device-specific kernel driver does not drive the
+device at all. Typically, the user space process is a VMM and the device is
+passed through to it in order to achieve high performance. VFIO provides an API
+and the required functionality in the kernel. QEMU has adopted VFIO to allow a
+guest to directly access physical devices, instead of emulating them in
+software.
+
+vfio-user reuses the core VFIO concepts defined in its API, but implements them
+as messages to be sent over a socket. It does not change the kernel-based VFIO
+in any way; in fact, none of the VFIO kernel modules need to be loaded to use
+vfio-user. It is also possible for the client to concurrently use the current
+kernel-based VFIO for one device, and vfio-user for another device.
+
+VFIO Device Model
+-----------------
+
+A device under VFIO presents a standard interface to the user process. Many of
+the VFIO operations in the existing interface use the ``ioctl()`` system call, and
+references to the existing interface are called the ``ioctl()`` implementation in
+this document.
+
+The following sections describe the set of messages that implement the vfio-user
+interface over a socket. In many cases, the messages are analogous to data
+structures used in the ``ioctl()`` implementation. Messages derived from the
+``ioctl()`` will have a name derived from the ``ioctl()`` command name. E.g., the
+``VFIO_DEVICE_GET_INFO`` ``ioctl()`` command becomes a
+``VFIO_USER_DEVICE_GET_INFO`` message. The purpose of this reuse is to share as
+much code as feasible with the ``ioctl()`` implementation.
+
+Connection Initiation
+^^^^^^^^^^^^^^^^^^^^^
+
+After the client connects to the server, the initial client message is
+``VFIO_USER_VERSION`` to propose a protocol version and set of capabilities to
+apply to the session. The server replies with a compatible version and set of
+capabilities it supports, or closes the connection if it cannot support the
+advertised version.
+
+Device Information
+^^^^^^^^^^^^^^^^^^
+
+The client uses a ``VFIO_USER_DEVICE_GET_INFO`` message to query the server for
+information about the device. This information includes:
+
+* the device type and whether it supports reset (``VFIO_DEVICE_FLAGS_``),
+* the number of device regions, and
+* the number of interrupt types the device supports.
+
+Region Information
+^^^^^^^^^^^^^^^^^^
+
+The client uses ``VFIO_USER_DEVICE_GET_REGION_INFO`` messages to query the
+server for information about the device's regions. This information describes:
+
+* Read and write permissions, whether it can be memory mapped, and whether it
+ supports additional capabilities (``VFIO_REGION_INFO_CAP_``).
+* Region index, size, and offset.
+
+When a device region can be mapped by the client, the server provides a file
+descriptor which the client can ``mmap()``. The server is responsible for
+polling for client updates to memory mapped regions.
+
+Region Capabilities
+"""""""""""""""""""
+
+Some regions have additional capabilities that cannot be described adequately
+by the region info data structure. These capabilities are returned in the
+region info reply in a list similar to PCI capabilities in a PCI device's
+configuration space.
+
+Sparse Regions
+""""""""""""""
+A region can be memory-mappable in whole or in part. When only a subset of a
+region can be mapped by the client, a ``VFIO_REGION_INFO_CAP_SPARSE_MMAP``
+capability is included in the region info reply. This capability describes
+which portions can be mapped by the client.
+
+.. Note::
+ For example, in a virtual NVMe controller, sparse regions can be used so
+ that accesses to the NVMe registers (found in the beginning of BAR0) are
+ trapped (an infrequent event), while allowing direct access to the doorbells
+ (an extremely frequent event as every I/O submission requires a write to
+ BAR0), found in the next page after the NVMe registers in BAR0.
+
+Device-Specific Regions
+"""""""""""""""""""""""
+
+A device can define regions additional to the standard ones (e.g. PCI indexes
+0-8). This is achieved by including a ``VFIO_REGION_INFO_CAP_TYPE`` capability
+in the region info reply of a device-specific region. Such regions are reflected
+in ``struct vfio_user_device_info.num_regions``. Thus, for PCI devices this
+value can be equal to, or higher than, ``VFIO_PCI_NUM_REGIONS``.
+
+Region I/O via file descriptors
+-------------------------------
+
+For unmapped regions, region I/O from the client is done via
+``VFIO_USER_REGION_READ/WRITE``. As an optimization, ioeventfds or ioregionfds
+may be configured for sub-regions of some regions. A client may request
+information on these sub-regions via ``VFIO_USER_DEVICE_GET_REGION_IO_FDS``; by
+configuring the returned file descriptors as ioeventfds or ioregionfds, the
+server can be directly notified of I/O (for example, by KVM) without taking a
+trip through the client.
+
+Interrupts
+^^^^^^^^^^
+
+The client uses ``VFIO_USER_DEVICE_GET_IRQ_INFO`` messages to query the server
+for the device's interrupt types. The interrupt types are specific to the bus
+the device is attached to, and the client is expected to know the capabilities
+of each interrupt type. The server can signal an interrupt by directly injecting
+interrupts into the guest via an event file descriptor. The client configures
+how the server signals an interrupt with ``VFIO_USER_DEVICE_SET_IRQS`` messages.
+
+Device Read and Write
+^^^^^^^^^^^^^^^^^^^^^
+
+When the guest executes load or store operations to an unmapped device region,
+the client forwards these operations to the server with
+``VFIO_USER_REGION_READ`` or ``VFIO_USER_REGION_WRITE`` messages. The server
+will reply with data from the device on read operations or an acknowledgement on
+write operations. See `Read and Write Operations`_.
+
+Client memory access
+--------------------
+
+The client uses ``VFIO_USER_DMA_MAP`` and ``VFIO_USER_DMA_UNMAP`` messages to
+inform the server of the valid DMA ranges that the server can access on behalf
+of a device (typically, VM guest memory). DMA memory may be accessed by the
+server via ``VFIO_USER_DMA_READ`` and ``VFIO_USER_DMA_WRITE`` messages over the
+socket. In this case, the "DMA" part of the naming is a misnomer.
+
+Actual direct memory access of client memory from the server is possible if the
+client provides file descriptors the server can ``mmap()``. Note that ``mmap()``
+privileges cannot be revoked by the client, therefore file descriptors should
+only be exported in environments where the client trusts the server not to
+corrupt guest memory.
+
+See `Read and Write Operations`_.
+
+Client/server interactions
+==========================
+
+Socket
+------
+
+A server can serve:
+
+1) one or more clients, and/or
+2) one or more virtual devices, belonging to one or more clients.
+
+The current protocol specification requires a dedicated socket per
+client/server connection. It is a server-side implementation detail whether a
+single server handles multiple virtual devices from the same or multiple
+clients. The location of the socket is implementation-specific. Multiplexing
+clients, devices, and servers over the same socket is not supported in this
+version of the protocol.
+
+Authentication
+--------------
+
+For ``AF_UNIX``, we rely on OS mandatory access controls on the socket files,
+therefore it is up to the management layer to set up the socket as required.
+Socket types that span guests or hosts will require a proper authentication
+mechanism. Defining that mechanism is deferred to a future version of the
+protocol.
+
+Command Concurrency
+-------------------
+
+A client may pipeline multiple commands without waiting for previous command
+replies. The server will process commands in the order they are received. A
+consequence of this is if a client issues a command with the *No_reply* bit,
+then subsequently issues a command without *No_reply*, the older command will
+have been processed before the reply to the younger command is sent by the
+server. The client must be aware of the device's capability to process
+concurrent commands if pipelining is used. For example, pipelining allows
+multiple client threads to concurrently access device regions; the client must
+ensure these accesses obey device semantics.
+
+An example is a frame buffer device, where the device may allow concurrent
+access to different areas of video memory, but may have indeterminate behavior
+if concurrent accesses are performed to command or status registers.
+
+Note that unrelated messages sent from the server to the client can appear in
+between a client to server request/reply and vice versa.
+
+Implementers should be prepared for certain commands to exhibit potentially
+unbounded latencies. For example, ``VFIO_USER_DEVICE_RESET`` may take an
+arbitrarily long time to complete; clients should take care not to block
+unnecessarily.
+
+Socket Disconnection Behavior
+-----------------------------
+The server and the client can disconnect from each other, either intentionally
+or unexpectedly. Both the client and the server need to know how to handle such
+events.
+
+Server Disconnection
+^^^^^^^^^^^^^^^^^^^^
+A server disconnecting from the client may indicate that:
+
+1) A virtual device has been restarted, either intentionally (e.g. because of a
+ device update) or unintentionally (e.g. because of a crash).
+2) A virtual device has been shut down with no intention to be restarted.
+
+The client cannot know whether a failure is intermittent or innocuous and
+should be retried; therefore, the client should reset the VFIO device when it
+detects that the socket has been disconnected. Error recovery will be driven by
+the guest's device error handling behavior.
+
+Client Disconnection
+^^^^^^^^^^^^^^^^^^^^
+The client disconnecting from the server primarily means that the client
+has exited. Currently, this means that the guest has shut down and the device
+is no longer needed, so the server can automatically exit. However, there
+can be cases where a client disconnection should not result in a server exit:
+
+1) A single server serving multiple clients.
+2) A multi-process QEMU upgrading itself step by step, which is not yet
+ implemented.
+
+Therefore in order for the protocol to be forward compatible, the server should
+respond to a client disconnection as follows:
+
+ - all client memory regions are unmapped and cleaned up (including closing any
+ passed file descriptors)
+ - all IRQ file descriptors passed from the old client are closed
+ - the device state should otherwise be retained
+
+The expectation is that when a client reconnects, it will re-establish IRQ and
+client memory mappings.
+
+If anything happens to the client (for example, QEMU really did exit), the
+control stack will know about it and can clean up resources accordingly.
+
+Security Considerations
+-----------------------
+
+Speaking generally, vfio-user clients should not trust servers, and vice versa.
+Standard tools and mechanisms should be used on both sides to validate input and
+protect against denial of service scenarios, buffer overflow, etc.
+
+Request Retry and Response Timeout
+----------------------------------
+A failed command is a command that has been successfully sent and has been
+responded to with an error code. Failure to send the command in the first place
+(e.g. because the socket is disconnected) is a different type of error examined
+earlier in the disconnect section.
+
+.. Note::
+ QEMU's VFIO retries certain operations if they fail. While this makes sense
+ for real HW, we don't know for sure whether it makes sense for virtual
+ devices.
+
+Defining a retry and timeout scheme is deferred to a future version of the
+protocol.
+
+Message sizes
+-------------
+
+Some requests have an ``argsz`` field. In a request, it defines the maximum
+expected reply payload size, which should be at least the size of the fixed
+reply payload headers defined here. The *request* payload size is defined by the
+usual ``msg_size`` field in the header, not the ``argsz`` field.
+
+In a reply, the server sets the ``argsz`` field to the size needed for the full
+payload. This may be less than the requested maximum size. It may also be
+larger than the requested maximum size: in that case, the full payload is not
+included in the reply, but the ``argsz`` field in the reply indicates the needed
+size, allowing a client to allocate a larger buffer for holding the reply before
+trying again.
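The ``argsz`` retry behavior described above can be sketched as follows. This is a minimal illustration, not part of the specification: ``send_request`` is a hypothetical callable standing in for whatever transport the client uses, and it is assumed to return the reply's ``argsz`` together with the payload bytes it received.

```python
def fetch_full_reply(send_request, initial_argsz):
    """Issue a request, retrying once with a larger argsz if the reply was cut short.

    send_request is a hypothetical callable (an assumption of this sketch): it
    takes the argsz to place in the request and returns
    (reply_argsz, payload_bytes).
    """
    reply_argsz, payload = send_request(initial_argsz)
    if reply_argsz > initial_argsz:
        # The full payload did not fit; reply_argsz is the size the server
        # needs, so retry with a buffer of that size.
        reply_argsz, payload = send_request(reply_argsz)
    return payload
```

A client would typically start with the size of the fixed reply structure and only pay for the second round trip when the server reports additional data, such as region capabilities.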
+
+In addition, during negotiation (see `Version`_), the client and server may
+each specify a ``max_data_xfer_size`` value; this defines the maximum data that
+may be read or written via one of the ``VFIO_USER_DMA/REGION_READ/WRITE``
+messages; see `Read and Write Operations`_.
+
+Protocol Specification
+======================
+
+To distinguish from the base VFIO symbols, all vfio-user symbols are prefixed
+with ``vfio_user`` or ``VFIO_USER``. In this revision, all data is in the
+endianness of the host system, although this may be relaxed in future
+revisions in cases where the client and server run on different hosts
+with different endianness.
+
+Unless otherwise specified, all sizes should be presumed to be in bytes.
+
+.. _Commands:
+
+Commands
+--------
+The following table lists the VFIO message command IDs, and whether the
+message command is sent from the client or the server.
+
+====================================== ========= =================
+Name Command Request Direction
+====================================== ========= =================
+``VFIO_USER_VERSION`` 1 client -> server
+``VFIO_USER_DMA_MAP`` 2 client -> server
+``VFIO_USER_DMA_UNMAP`` 3 client -> server
+``VFIO_USER_DEVICE_GET_INFO`` 4 client -> server
+``VFIO_USER_DEVICE_GET_REGION_INFO`` 5 client -> server
+``VFIO_USER_DEVICE_GET_REGION_IO_FDS`` 6 client -> server
+``VFIO_USER_DEVICE_GET_IRQ_INFO`` 7 client -> server
+``VFIO_USER_DEVICE_SET_IRQS`` 8 client -> server
+``VFIO_USER_REGION_READ`` 9 client -> server
+``VFIO_USER_REGION_WRITE`` 10 client -> server
+``VFIO_USER_DMA_READ`` 11 server -> client
+``VFIO_USER_DMA_WRITE`` 12 server -> client
+``VFIO_USER_DEVICE_RESET`` 13 client -> server
+====================================== ========= =================
+
+Header
+------
+
+All messages, both command messages and reply messages, are preceded by a
+16-byte header that contains basic information about the message. The header is
+followed by message-specific data described in the sections below.
+
++----------------+--------+-------------+
+| Name | Offset | Size |
++================+========+=============+
+| Message ID | 0 | 2 |
++----------------+--------+-------------+
+| Command | 2 | 2 |
++----------------+--------+-------------+
+| Message size | 4 | 4 |
++----------------+--------+-------------+
+| Flags | 8 | 4 |
++----------------+--------+-------------+
+| | +-----+------------+ |
+| | | Bit | Definition | |
+| | +=====+============+ |
+| | | 0-3 | Type | |
+| | +-----+------------+ |
+| | | 4 | No_reply | |
+| | +-----+------------+ |
+| | | 5 | Error | |
+| | +-----+------------+ |
++----------------+--------+-------------+
+| Error | 12 | 4 |
++----------------+--------+-------------+
+| <message data> | 16 | variable |
++----------------+--------+-------------+
+
+* *Message ID* identifies the message, and is echoed in the command's reply
+ message. Message IDs belong entirely to the sender, can be re-used (even
+ concurrently) and the receiver must not make any assumptions about their
+ uniqueness.
+* *Command* specifies the command to be executed, listed in Commands_. It is
+ also set in the reply header.
+* *Message size* contains the size of the entire message, including the header.
+* *Flags* contains attributes of the message:
+
+ * The *Type* bits indicate the message type.
+
+ * *Command* (value 0x0) indicates a command message.
+ * *Reply* (value 0x1) indicates a reply message acknowledging a previous
+ command with the same message ID.
+ * *No_reply* in a command message indicates that no reply is needed for this
+ command. This is commonly used when multiple commands are sent, and only
+ the last needs acknowledgement.
+ * *Error* in a reply message indicates the command being acknowledged had
+ an error. In this case, the *Error* field will be valid.
+
+* *Error* in a reply message is an optional UNIX errno value. It may be zero
+ even if the Error bit is set in Flags. It is reserved in a command message.
+
+Each command message in Commands_ must be replied to with a reply message,
+unless the message sets the *No_reply* bit. The reply consists of the header
+with the *Reply* bit set, plus any additional data.
+
+If an error occurs, the reply message must only include the reply header.
+
+As the header is standard in both requests and replies, it is not included in
+the command-specific specifications below; each message definition should be
+appended to the standard header, and the offsets are given from the end of the
+standard header.
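As an illustration of the header layout above, the following sketch packs and unpacks the 16-byte header with Python's ``struct`` module. The helper names are hypothetical, and the sketch assumes that flag bit 0 is the least significant bit of the *Flags* field. The ``=`` format prefix selects native byte order with standard sizes, matching the host-endianness rule in this revision.

```python
import struct

# Field order and sizes follow the Header table: message ID (2), command (2),
# message size (4), flags (4), error (4).
HEADER = struct.Struct("=HHIII")

TYPE_MASK = 0xF         # flag bits 0-3 (assumed to be the low-order bits)
TYPE_COMMAND = 0x0
TYPE_REPLY = 0x1
FLAG_NO_REPLY = 1 << 4
FLAG_ERROR = 1 << 5

VFIO_USER_VERSION = 1   # command ID from the Commands table


def pack_message(msg_id, command, flags, payload=b"", error=0):
    """Return header + payload; the message size field covers the header too."""
    size = HEADER.size + len(payload)
    return HEADER.pack(msg_id, command, size, flags, error) + payload


def unpack_header(data):
    """Decode the first 16 bytes of a message into a dict."""
    msg_id, command, msg_size, flags, error = HEADER.unpack_from(data)
    return {
        "msg_id": msg_id,
        "command": command,
        "msg_size": msg_size,
        "is_reply": (flags & TYPE_MASK) == TYPE_REPLY,
        "no_reply": bool(flags & FLAG_NO_REPLY),
        "is_error": bool(flags & FLAG_ERROR),
        "error": error,
    }
```

Because the message size includes the header, a receiver can read 16 bytes, decode them, and then read ``msg_size - 16`` further bytes to obtain the payload.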
+
+``VFIO_USER_VERSION``
+---------------------
+
+.. _Version:
+
+This is the initial message sent by the client after the socket connection is
+established; the same format is used for the server's reply.
+
+Upon establishing a connection, the client must send a ``VFIO_USER_VERSION``
+message proposing a protocol version and a set of capabilities. The server
+compares these with the versions and capabilities it supports and sends a
+``VFIO_USER_VERSION`` reply according to the following rules.
+
+* The major version in the reply must be the same as proposed. If the server
+ does not support the proposed major version, it closes the connection.
+* The minor version in the reply must be equal to or less than the minor
+ version proposed.
+* The capability list must be a subset of those proposed. If the server
+ requires a capability the client did not include, it closes the connection.
+
+The protocol major version will only change when incompatible protocol changes
+are made, such as changing the message format. The minor version may change
+when compatible changes are made, such as adding new messages or capabilities.
+Both the client and server must support all minor versions less than the
+maximum minor version they support. E.g., an implementation that supports
+version 1.3 must also support 1.0 through 1.2.
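The reply rules above can be sketched as a small server-side function. This is an illustrative assumption about how a server might apply the rules, not a prescribed algorithm: capabilities are treated as plain names, and the function and parameter names are hypothetical.

```python
def negotiate_version(supported_major, max_minor, supported_caps, required_caps,
                      proposed_major, proposed_minor, proposed_caps):
    """Return (major, minor, capability names) for the reply, or None when
    the server should close the connection instead of replying."""
    # The major version in the reply must equal the proposed one.
    if proposed_major != supported_major:
        return None
    # A capability the server requires must have been proposed by the client.
    if not set(required_caps) <= set(proposed_caps):
        return None
    # The minor version must be equal to or less than the proposed one.
    minor = min(proposed_minor, max_minor)
    # The capability list must be a subset of those proposed.
    caps = sorted(set(proposed_caps) & set(supported_caps))
    return proposed_major, minor, caps
```

Taking the minimum of the two minor versions is safe because each side must support every minor version below its own maximum.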
+
+When making a change to this specification, the protocol version number must
+be included in the form "added in version X.Y".
+
+Request
+^^^^^^^
+
+============== ====== ====
+Name Offset Size
+============== ====== ====
+version major 0 2
+version minor 2 2
+version data 4 variable (including terminating NUL). Optional.
+============== ====== ====
+
+The version data is an optional UTF-8 encoded JSON byte array with the following
+format:
+
++--------------+--------+-----------------------------------+
+| Name | Type | Description |
++==============+========+===================================+
+| capabilities | object | Contains common capabilities that |
+| | | the sender supports. Optional. |
++--------------+--------+-----------------------------------+
+
+Capabilities:
+
++--------------------+--------+------------------------------------------------+
+| Name | Type | Description |
++====================+========+================================================+
+| max_msg_fds | number | Maximum number of file descriptors that can be |
+| | | received by the sender in one message. |
+| | | Optional. If not specified then the receiver |
+| | | must assume a value of ``1``. |
++--------------------+--------+------------------------------------------------+
+| max_data_xfer_size | number | Maximum ``count`` for data transfer messages; |
+| | | see `Read and Write Operations`_. Optional, |
+| | | with a default value of 1048576 bytes. |
++--------------------+--------+------------------------------------------------+
+| migration | object | Migration capability parameters. If missing |
+| | | then migration is not supported by the sender. |
++--------------------+--------+------------------------------------------------+
+
+The migration capability contains the following name/value pairs:
+
++--------+--------+-----------------------------------------------+
+| Name | Type | Description |
++========+========+===============================================+
+| pgsize | number | Page size of dirty pages bitmap. The smaller  |
+|        |        | of the client's and server's values is used.  |
++--------+--------+-----------------------------------------------+
+
+Reply
+^^^^^
+
+The same message format is used in the server's reply with the semantics
+described above.
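A minimal sketch of encoding and decoding the ``VFIO_USER_VERSION`` payload, assuming native byte order and a single NUL terminator after the JSON text. The function names are hypothetical. The defaults applied on decode are the ones stated in the capabilities table.

```python
import json
import struct

# Defaults from the capabilities table, used when a capability is absent.
DEFAULT_CAPS = {"max_msg_fds": 1, "max_data_xfer_size": 1048576}


def pack_version_payload(major, minor, capabilities=None):
    """Build the payload: two 16-bit versions plus optional NUL-terminated JSON."""
    payload = struct.pack("=HH", major, minor)
    if capabilities is not None:
        text = json.dumps({"capabilities": capabilities})
        payload += text.encode("utf-8") + b"\x00"
    return payload


def unpack_version_payload(payload):
    """Return (major, minor, capabilities) with table defaults filled in."""
    major, minor = struct.unpack_from("=HH", payload)
    raw = payload[4:].rstrip(b"\x00")
    caps = dict(DEFAULT_CAPS)
    if raw:
        caps.update(json.loads(raw.decode("utf-8")).get("capabilities", {}))
    return major, minor, caps
```

The payload is appended to the standard 16-byte header, so the header's message size is 16 plus the length returned by ``pack_version_payload``.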
+
+``VFIO_USER_DMA_MAP``
+---------------------
+
+This command message is sent by the client to the server to inform it of the
+memory regions the server can access. It must be sent before the server can
+perform any DMA to the client. It is normally sent directly after the version
+handshake is completed, but may also occur when memory is added to the client,
+or if the client uses a vIOMMU.
+
+Request
+^^^^^^^
+
+The request payload for this message is a structure of the following format:
+
++-------------+--------+-------------+
+| Name | Offset | Size |
++=============+========+=============+
+| argsz | 0 | 4 |
++-------------+--------+-------------+
+| flags | 4 | 4 |
++-------------+--------+-------------+
+| | +-----+------------+ |
+| | | Bit | Definition | |
+| | +=====+============+ |
+| | | 0 | readable | |
+| | +-----+------------+ |
+| | | 1 | writeable | |
+| | +-----+------------+ |
++-------------+--------+-------------+
+| offset | 8 | 8 |
++-------------+--------+-------------+
+| address | 16 | 8 |
++-------------+--------+-------------+
+| size | 24 | 8 |
++-------------+--------+-------------+
+
+* *argsz* is the size of the above structure. Note there is no reply payload,
+ so this field differs from other message types.
+* *flags* contains the following region attributes:
+
+ * *readable* indicates that the region can be read from.
+
+ * *writeable* indicates that the region can be written to.
+
+* *offset* is the file offset of the region with respect to the associated file
+ descriptor, or zero if the region is not mappable.
+* *address* is the base DMA address of the region.
+* *size* is the size of the region.
+
+This structure is 32 bytes in size, so the message size is 16 + 32 bytes.
+
+If the DMA region being added can be directly mapped by the server, a file
+descriptor must be sent as part of the message meta-data. The region can be
+mapped via the ``mmap()`` system call. On ``AF_UNIX`` sockets, the file descriptor
+must be passed as ``SCM_RIGHTS`` type ancillary data. Otherwise, if the DMA
+region cannot be directly mapped by the server, a file descriptor must not be
+sent as part of the message meta-data, and the DMA region can be accessed by the
+server using ``VFIO_USER_DMA_READ`` and ``VFIO_USER_DMA_WRITE`` messages,
+explained in `Read and Write Operations`_. A command to map over an existing
+region must be failed by the server with ``EEXIST`` set in the error field of
+the reply.
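For illustration, the ``VFIO_USER_DMA_MAP`` request payload can be packed as shown below. This is a sketch under the same assumptions as the table (native byte order, no padding between fields); the helper name and constant names are hypothetical, and passing the file descriptor as ancillary data is left out.

```python
import struct

HEADER_SIZE = 16                    # standard message header
# argsz (4), flags (4), offset (8), address (8), size (8) = 32 bytes
DMA_MAP = struct.Struct("=IIQQQ")

DMA_READABLE = 1 << 0               # flags bit 0
DMA_WRITEABLE = 1 << 1              # flags bit 1


def pack_dma_map(address, size, readable=True, writeable=True, offset=0):
    """Build the 32-byte payload; argsz is the size of the structure itself."""
    flags = (DMA_READABLE if readable else 0) | (DMA_WRITEABLE if writeable else 0)
    return DMA_MAP.pack(DMA_MAP.size, flags, offset, address, size)
```

With the 16-byte header prepended, the complete message is 48 bytes, which agrees with the size stated above.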
+
+Reply
+^^^^^
+
+There is no payload in the reply message.
+
+``VFIO_USER_DMA_UNMAP``
+-----------------------
+
+This command message is sent by the client to the server to inform it that a
+DMA region, previously made available via a ``VFIO_USER_DMA_MAP`` command
+message, is no longer available for DMA. It typically occurs when memory is
+removed from the client or if the client uses a vIOMMU. The DMA region is
+described by the following structure:
+
+Request
+^^^^^^^
+
+The request payload for this message is a structure of the following format:
+
++--------------+--------+------------------------+
+| Name | Offset | Size |
++==============+========+========================+
+| argsz | 0 | 4 |
++--------------+--------+------------------------+
+| flags | 4 | 4 |
++--------------+--------+------------------------+
+| address | 8 | 8 |
++--------------+--------+------------------------+
+| size | 16 | 8 |
++--------------+--------+------------------------+
+
+* *argsz* is the maximum size of the reply payload.
+* *flags* is unused in this version.
+* *address* is the base DMA address of the DMA region.
+* *size* is the size of the DMA region.
+
+The address and size of the DMA region being unmapped must match exactly a
+previous mapping.
+
+Reply
+^^^^^
+
+Upon receiving a ``VFIO_USER_DMA_UNMAP`` command, if the file descriptor is
+mapped then the server must release all references to that DMA region before
+replying, which potentially includes in-flight DMA transactions.
+
+The server responds with the original DMA entry in the request.
+
+
+``VFIO_USER_DEVICE_GET_INFO``
+-----------------------------
+
+This command message is sent by the client to the server to query for basic
+information about the device.
+
+Request
+^^^^^^^
+
++-------------+--------+--------------------------+
+| Name | Offset | Size |
++=============+========+==========================+
+| argsz | 0 | 4 |
++-------------+--------+--------------------------+
+| flags | 4 | 4 |
++-------------+--------+--------------------------+
+| | +-----+-------------------------+ |
+| | | Bit | Definition | |
+| | +=====+=========================+ |
+| | | 0 | VFIO_DEVICE_FLAGS_RESET | |
+| | +-----+-------------------------+ |
+| | | 1 | VFIO_DEVICE_FLAGS_PCI | |
+| | +-----+-------------------------+ |
++-------------+--------+--------------------------+
+| num_regions | 8 | 4 |
++-------------+--------+--------------------------+
+| num_irqs | 12 | 4 |
++-------------+--------+--------------------------+
+
+* *argsz* is the maximum size of the reply payload.
+* All other fields must be zero.
+
+Reply
+^^^^^
+
++-------------+--------+--------------------------+
+| Name | Offset | Size |
++=============+========+==========================+
+| argsz | 0 | 4 |
++-------------+--------+--------------------------+
+| flags | 4 | 4 |
++-------------+--------+--------------------------+
+| | +-----+-------------------------+ |
+| | | Bit | Definition | |
+| | +=====+=========================+ |
+| | | 0 | VFIO_DEVICE_FLAGS_RESET | |
+| | +-----+-------------------------+ |
+| | | 1 | VFIO_DEVICE_FLAGS_PCI | |
+| | +-----+-------------------------+ |
++-------------+--------+--------------------------+
+| num_regions | 8 | 4 |
++-------------+--------+--------------------------+
+| num_irqs | 12 | 4 |
++-------------+--------+--------------------------+
+
+* *argsz* is the size required for the full reply payload (16 bytes today).
+* *flags* contains the following device attributes.
+
+ * ``VFIO_DEVICE_FLAGS_RESET`` indicates that the device supports the
+ ``VFIO_USER_DEVICE_RESET`` message.
+ * ``VFIO_DEVICE_FLAGS_PCI`` indicates that the device is a PCI device.
+
+* *num_regions* is the number of memory regions that the device exposes.
+* *num_irqs* is the number of distinct interrupt types that the device supports.
+
+This version of the protocol only supports PCI devices. Additional devices may
+be supported in future versions.
+
+``VFIO_USER_DEVICE_GET_REGION_INFO``
+------------------------------------
+
+This command message is sent by the client to the server to query for
+information about device regions. The VFIO region info structure is defined in
+``<linux/vfio.h>`` (``struct vfio_region_info``).
+
+Request
+^^^^^^^
+
++------------+--------+------------------------------+
+| Name       | Offset | Size                         |
++============+========+==============================+
+| argsz      | 0      | 4                            |
++------------+--------+------------------------------+
+| flags      | 4      | 4                            |
++------------+--------+------------------------------+
+| index      | 8      | 4                            |
++------------+--------+------------------------------+
+| cap_offset | 12     | 4                            |
++------------+--------+------------------------------+
+| size       | 16     | 8                            |
++------------+--------+------------------------------+
+| offset     | 24     | 8                            |
++------------+--------+------------------------------+
+
+* *argsz* is the maximum size of the reply payload.
+* *index* is the index of the memory region being queried; it is the only field
+  that is required to be set in the command message.
+* All other fields must be zero.
+
+Reply
+^^^^^
+
++------------+--------+------------------------------+
+| Name       | Offset | Size                         |
++============+========+==============================+
+| argsz      | 0      | 4                            |
++------------+--------+------------------------------+
+| flags      | 4      | 4                            |
++------------+--------+------------------------------+
+|            | +-----+-----------------------------+ |
+|            | | Bit | Definition                  | |
+|            | +=====+=============================+ |
+|            | | 0   | VFIO_REGION_INFO_FLAG_READ  | |
+|            | +-----+-----------------------------+ |
+|            | | 1   | VFIO_REGION_INFO_FLAG_WRITE | |
+|            | +-----+-----------------------------+ |
+|            | | 2   | VFIO_REGION_INFO_FLAG_MMAP  | |
+|            | +-----+-----------------------------+ |
+|            | | 3   | VFIO_REGION_INFO_FLAG_CAPS  | |
+|            | +-----+-----------------------------+ |
++------------+--------+------------------------------+
+| index      | 8      | 4                            |
++------------+--------+------------------------------+
+| cap_offset | 12     | 4                            |
++------------+--------+------------------------------+
+| size       | 16     | 8                            |
++------------+--------+------------------------------+
+| offset     | 24     | 8                            |
++------------+--------+------------------------------+
+
+* *argsz* is the size required for the full reply payload (region info structure
+  plus the size of any region capabilities)
+* *flags* are attributes of the region:
+
+  * ``VFIO_REGION_INFO_FLAG_READ`` allows client read access to the region.
+  * ``VFIO_REGION_INFO_FLAG_WRITE`` allows client write access to the region.
+  * ``VFIO_REGION_INFO_FLAG_MMAP`` specifies that the client can mmap() the
+    region. When this flag is set, the reply will include a file descriptor in
+    its meta-data. On ``AF_UNIX`` sockets, the file descriptors will be passed
+    as ``SCM_RIGHTS`` type ancillary data.
+  * ``VFIO_REGION_INFO_FLAG_CAPS`` indicates additional capabilities found in the
+    reply.
+
+* *index* is the index of the memory region being queried; it is the only field
+  that is required to be set in the command message.
+* *cap_offset* describes where additional region capabilities can be found.
+  cap_offset is relative to the beginning of the VFIO region info structure.
+  The data structure it points to is a VFIO cap header defined in
+  ``<linux/vfio.h>``.
+* *size* is the size of the region.
+* *offset* is the offset that should be given to the mmap() system call for
+  regions with the MMAP attribute. It is also used as the base offset when
+  mapping a VFIO sparse mmap area, described below.
+
+VFIO region capabilities
+""""""""""""""""""""""""
+
+The VFIO region information can also include a capabilities list. This list is
+similar to a PCI capability list - each entry has a common header that
+identifies a capability and where the next capability in the list can be found.
+The VFIO capability header format is defined in ``<linux/vfio.h>`` (``struct
+vfio_info_cap_header``).
+
+VFIO cap header format
+""""""""""""""""""""""
+
++---------+--------+------+
+| Name    | Offset | Size |
++=========+========+======+
+| id      | 0      | 2    |
++---------+--------+------+
+| version | 2      | 2    |
++---------+--------+------+
+| next    | 4      | 4    |
++---------+--------+------+
+
+* *id* is the capability identity.
+* *version* is a capability-specific version number.
+* *next* specifies the offset of the next capability in the capability list. It
+  is relative to the beginning of the VFIO region info structure.
+
+VFIO sparse mmap cap header
+"""""""""""""""""""""""""""
+
++------------------+----------------------------------+
+| Name             | Value                            |
++==================+==================================+
+| id               | VFIO_REGION_INFO_CAP_SPARSE_MMAP |
++------------------+----------------------------------+
+| version          | 0x1                              |
++------------------+----------------------------------+
+| next             | <next>                           |
++------------------+----------------------------------+
+| sparse mmap info | VFIO region info sparse mmap     |
++------------------+----------------------------------+
+
+This capability is defined when only a subrange of the region supports
+direct access by the client via mmap(). The VFIO sparse mmap area is defined in
+``<linux/vfio.h>`` (``struct vfio_region_sparse_mmap_area`` and ``struct
+vfio_region_info_cap_sparse_mmap``).
+
+VFIO region info cap sparse mmap
+""""""""""""""""""""""""""""""""
+
++----------+--------+------+
+| Name     | Offset | Size |
++==========+========+======+
+| nr_areas | 0      | 4    |
++----------+--------+------+
+| reserved | 4      | 4    |
++----------+--------+------+
+| offset   | 8      | 8    |
++----------+--------+------+
+| size     | 16     | 8    |
++----------+--------+------+
+| ...      |        |      |
++----------+--------+------+
+
+* *nr_areas* is the number of sparse mmap areas in the region.
+* *offset* and *size* describe a single area that can be mapped by the client.
+  There will be *nr_areas* pairs of offset and size. The offset will be added to
+  the base offset given in the ``VFIO_USER_DEVICE_GET_REGION_INFO`` reply to
+  form the offset argument of the subsequent mmap() call.
+
+The VFIO sparse mmap area is defined in ``<linux/vfio.h>`` (``struct
+vfio_region_info_cap_sparse_mmap``).
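+
+The capability chain and the sparse mmap areas described above can be walked
+roughly as follows. This is a minimal sketch under stated assumptions: the
+function name ``sparse_mmap_offsets`` is hypothetical, little-endian byte
+order is assumed, a *next* value of 0 is taken to end the chain (as in the
+kernel VFIO API), and chains that do not move forward are rejected as a
+defensive choice of the sketch.
+

```python
import struct

VFIO_REGION_INFO_FLAG_CAPS = 1 << 3
# Value of VFIO_REGION_INFO_CAP_SPARSE_MMAP in <linux/vfio.h>.
VFIO_REGION_INFO_CAP_SPARSE_MMAP = 1

# Little-endian byte order is an assumption of this sketch.
REGION_INFO = struct.Struct("<IIIIQQ")  # argsz, flags, index, cap_offset, size, offset
CAP_HEADER = struct.Struct("<HHI")      # id, version, next
SPARSE_HEADER = struct.Struct("<II")    # nr_areas, reserved
SPARSE_AREA = struct.Struct("<QQ")      # offset, size


def sparse_mmap_offsets(reply: bytes) -> list:
    """Return (mmap_offset, size) for every sparse mmap area in a reply."""
    _argsz, flags, _index, cap_offset, _size, base = REGION_INFO.unpack_from(reply)
    areas = []
    if not flags & VFIO_REGION_INFO_FLAG_CAPS:
        return areas
    pos = cap_offset
    # Offsets are relative to the beginning of the region info structure.
    while pos != 0:
        cap_id, _version, nxt = CAP_HEADER.unpack_from(reply, pos)
        if cap_id == VFIO_REGION_INFO_CAP_SPARSE_MMAP:
            body = pos + CAP_HEADER.size
            nr_areas, _reserved = SPARSE_HEADER.unpack_from(reply, body)
            first = body + SPARSE_HEADER.size
            for i in range(nr_areas):
                area_off, area_size = SPARSE_AREA.unpack_from(
                    reply, first + i * SPARSE_AREA.size)
                # Area offset plus base offset forms the mmap() offset argument.
                areas.append((base + area_off, area_size))
        if nxt != 0 and nxt <= pos:
            raise ValueError("capability chain does not advance")
        pos = nxt
    return areas
```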
+
+
+``VFIO_USER_DEVICE_GET_REGION_IO_FDS``
+--------------------------------------
+
+Clients can access regions via ``VFIO_USER_REGION_READ/WRITE`` or, if provided, by
+``mmap()`` of a file descriptor provided by the server.
+
+``VFIO_USER_DEVICE_GET_REGION_IO_FDS`` provides an alternative access mechanism via
+file descriptors. This is an optional feature intended for performance
+improvements where an underlying sub-system (such as KVM) supports communication
+across such file descriptors to the vfio-user server, without needing to
+round-trip through the client.
+
+The server returns an array of sub-regions for the requested region. Each
+sub-region describes a span (offset and size) of a region, along with the
+requested file descriptor notification mechanism to use. Each sub-region in the
+response message may choose to use a different method, as defined below. The
+two mechanisms supported in this specification are ioeventfds and ioregionfds.
+
+The server in addition returns a file descriptor in the ancillary data; clients
+are expected to configure each sub-region's file descriptor with the requested
+notification method. For example, a client could configure KVM with the
+requested ioeventfd via a ``KVM_IOEVENTFD`` ``ioctl()``.
+
+Request
+^^^^^^^
+
++-------------+--------+------+
+| Name        | Offset | Size |
++=============+========+======+
+| argsz       | 0      | 4    |
++-------------+--------+------+
+| flags       | 4      | 4    |
++-------------+--------+------+
+| index       | 8      | 4    |
++-------------+--------+------+
+| count       | 12     | 4    |
++-------------+--------+------+
+
+* *argsz* is the maximum size of the reply payload
+* *index* is the index of the memory region being queried
+* all other fields must be zero
+
+The client must set ``flags`` to zero and specify the region being queried in
+the ``index``.
+
+Reply
+^^^^^
+
++-------------+--------+------+
+| Name        | Offset | Size |
++=============+========+======+
+| argsz       | 0      | 4    |
++-------------+--------+------+
+| flags       | 4      | 4    |
++-------------+--------+------+
+| index       | 8      | 4    |
++-------------+--------+------+
+| count       | 12     | 4    |
++-------------+--------+------+
+| sub-regions | 16     | ...  |
++-------------+--------+------+
+
+* *argsz* is the size of the region IO FD info structure plus the
+  total size of the sub-region array. Thus, each array entry "i" is at offset
+  i * ((argsz - 16) / count) from the start of the *sub-regions* array. Note
+  that currently each entry is 40 bytes for both IO FD types, but this is not
+  to be relied on. As elsewhere, this indicates the full reply payload size
+  needed.
+* *flags* must be zero
+* *index* is the index of the memory region being queried
+* *count* is the number of sub-regions in the array
+* *sub-regions* is the array of Sub-Region IO FD info structures
+
+The reply message will additionally include at least one file descriptor in the
+ancillary data. Note that more than one sub-region may share the same file
+descriptor.
+
+Note that it is the client's responsibility to verify the requested values (for
+example, that the requested offset does not exceed the region's bounds).
+
+Each sub-region given in the response has one of two possible structures,
+depending on whether *type* is ``VFIO_USER_IO_FD_TYPE_IOEVENTFD`` or
+``VFIO_USER_IO_FD_TYPE_IOREGIONFD``:
+
+Sub-Region IO FD info format (ioeventfd)
+""""""""""""""""""""""""""""""""""""""""
+
++-----------+--------+------+
+| Name      | Offset | Size |
++===========+========+======+
+| offset    | 0      | 8    |
++-----------+--------+------+
+| size      | 8      | 8    |
++-----------+--------+------+
+| fd_index  | 16     | 4    |
++-----------+--------+------+
+| type      | 20     | 4    |
++-----------+--------+------+
+| flags     | 24     | 4    |
++-----------+--------+------+
+| padding   | 28     | 4    |
++-----------+--------+------+
+| datamatch | 32     | 8    |
++-----------+--------+------+
+
+* *offset* is the offset of the start of the sub-region within the region
+  requested ("physical address offset" for the region)
+* *size* is the length of the sub-region. This may be zero if the access size is
+  not relevant, which may allow for optimizations
+* *fd_index* is the index in the ancillary data of the FD to use for ioeventfd
+  notification; it may be shared.
+* *type* is ``VFIO_USER_IO_FD_TYPE_IOEVENTFD``
+* *flags* is any of:
+
+  * ``KVM_IOEVENTFD_FLAG_DATAMATCH``
+  * ``KVM_IOEVENTFD_FLAG_PIO``
+  * ``KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY`` (FIXME: makes sense?)
+
+* *datamatch* is the datamatch value if needed
+
+See https://www.kernel.org/doc/Documentation/virtual/kvm/api.txt, *4.59
+KVM_IOEVENTFD* for further context on the ioeventfd-specific fields.
+
+Sub-Region IO FD info format (ioregionfd)
+"""""""""""""""""""""""""""""""""""""""""
+
++-----------+--------+------+
+| Name      | Offset | Size |
++===========+========+======+
+| offset    | 0      | 8    |
++-----------+--------+------+
+| size      | 8      | 8    |
++-----------+--------+------+
+| fd_index  | 16     | 4    |
++-----------+--------+------+
+| type      | 20     | 4    |
++-----------+--------+------+
+| flags     | 24     | 4    |
++-----------+--------+------+
+| padding   | 28     | 4    |
++-----------+--------+------+
+| user_data | 32     | 8    |
++-----------+--------+------+
+
+* *offset* is the offset of the start of the sub-region within the region
+  requested ("physical address offset" for the region)
+* *size* is the length of the sub-region. This may be zero if the access size is
+  not relevant, which may allow for optimizations; ``KVM_IOREGION_POSTED_WRITES``
+  must be set in *flags* in this case
+* *fd_index* is the index in the ancillary data of the FD to use for ioregionfd
+  messages; it may be shared
+* *type* is ``VFIO_USER_IO_FD_TYPE_IOREGIONFD``
+* *flags* is any of:
+
+  * ``KVM_IOREGION_PIO``
+  * ``KVM_IOREGION_POSTED_WRITES``
+
+* *user_data* is an opaque value passed back to the server via a message on the
+  file descriptor
+
+For further information on the ioregionfd-specific fields, see:
+https://lore.kernel.org/kvm/cover.1613828726.git.eafanasova@gmail.com/
+
+(FIXME: update with final API docs.)
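+
+To illustrate how a client might locate and check the sub-region entries, the
+following is a minimal sketch. It takes the 16-byte fixed header from the reply
+table, derives the entry stride from *argsz* instead of hard-coding 40 bytes,
+and applies the client-side bounds check mentioned above. The function name
+``parse_io_fds`` and the little-endian byte order are assumptions of this
+sketch.
+

```python
import struct

# Little-endian byte order is an assumption of this sketch.
IO_FDS_HEADER = struct.Struct("<IIII")  # argsz, flags, index, count
# offset, size, fd_index, type, flags, padding, datamatch or user_data
SUB_REGION = struct.Struct("<QQIIIIQ")


def parse_io_fds(reply: bytes, region_size: int) -> list:
    """Decode the sub-region array of a region IO FDs reply (hypothetical helper)."""
    argsz, _flags, _index, count = IO_FDS_HEADER.unpack_from(reply)
    if count == 0:
        return []
    # Entry size is derived from argsz; it is 40 bytes today.
    stride = (argsz - IO_FDS_HEADER.size) // count
    if stride < SUB_REGION.size:
        raise ValueError("sub-region entries are smaller than expected")
    entries = []
    for i in range(count):
        off, size, fd_index, fd_type, sub_flags, _pad, value = SUB_REGION.unpack_from(
            reply, IO_FDS_HEADER.size + i * stride)
        # Verifying the requested span is the client's responsibility.
        if off + size > region_size:
            raise ValueError(f"sub-region {i} exceeds the region bounds")
        entries.append({"offset": off, "size": size, "fd_index": fd_index,
                        "type": fd_type, "flags": sub_flags, "value": value})
    return entries
```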
+
+``VFIO_USER_DEVICE_GET_IRQ_INFO``
+---------------------------------
+
+This command message is sent by the client to the server to query for
+information about device interrupt types. The VFIO IRQ info structure is
+defined in ``<linux/vfio.h>`` (``struct vfio_irq_info``).
+
+Request
+^^^^^^^
+
++-------+--------+---------------------------+
+| Name  | Offset | Size                      |
++=======+========+===========================+
+| argsz | 0      | 4                         |
++-------+--------+---------------------------+
+| flags | 4      | 4                         |
++-------+--------+---------------------------+
+|       | +-----+--------------------------+ |
+|       | | Bit | Definition               | |
+|       | +=====+==========================+ |
+|       | | 0   | VFIO_IRQ_INFO_EVENTFD    | |
+|       | +-----+--------------------------+ |
+|       | | 1   | VFIO_IRQ_INFO_MASKABLE   | |
+|       | +-----+--------------------------+ |
+|       | | 2   | VFIO_IRQ_INFO_AUTOMASKED | |
+|       | +-----+--------------------------+ |
+|       | | 3   | VFIO_IRQ_INFO_NORESIZE   | |
+|       | +-----+--------------------------+ |
++-------+--------+---------------------------+
+| index | 8      | 4                         |
++-------+--------+---------------------------+
+| count | 12     | 4                         |
++-------+--------+---------------------------+
+
+* *argsz* is the maximum size of the reply payload (16 bytes today)
+* *index* is the index of the IRQ type being queried (e.g.
+  ``VFIO_PCI_MSIX_IRQ_INDEX``)
+* all other fields must be zero
+
+Reply
+^^^^^
+
++-------+--------+---------------------------+
+| Name  | Offset | Size                      |
++=======+========+===========================+
+| argsz | 0      | 4                         |
++-------+--------+---------------------------+
+| flags | 4      | 4                         |
++-------+--------+---------------------------+
+|       | +-----+--------------------------+ |
+|       | | Bit | Definition               | |
+|       | +=====+==========================+ |
+|       | | 0   | VFIO_IRQ_INFO_EVENTFD    | |
+|       | +-----+--------------------------+ |
+|       | | 1   | VFIO_IRQ_INFO_MASKABLE   | |
+|       | +-----+--------------------------+ |
+|       | | 2   | VFIO_IRQ_INFO_AUTOMASKED | |
+|       | +-----+--------------------------+ |
+|       | | 3   | VFIO_IRQ_INFO_NORESIZE   | |
+|       | +-----+--------------------------+ |
++-------+--------+---------------------------+
+| index | 8      | 4                         |
++-------+--------+---------------------------+
+| count | 12     | 4                         |
++-------+--------+---------------------------+
+
+* *argsz* is the size required for the full reply payload (16 bytes today)
+* *flags* defines IRQ attributes:
+
+  * ``VFIO_IRQ_INFO_EVENTFD`` indicates the IRQ type can support server eventfd
+    signalling.
+  * ``VFIO_IRQ_INFO_MASKABLE`` indicates that the IRQ type supports the ``MASK``
+    and ``UNMASK`` actions in a ``VFIO_USER_DEVICE_SET_IRQS`` message.
+  * ``VFIO_IRQ_INFO_AUTOMASKED`` indicates the IRQ type masks itself after being
+    triggered, and the client must send an ``UNMASK`` action to receive new
+    interrupts.
+  * ``VFIO_IRQ_INFO_NORESIZE`` indicates ``VFIO_USER_DEVICE_SET_IRQS`` operations
+    set up interrupts as a set, and new sub-indexes cannot be enabled without
+    disabling the entire type.
+
+* *index* is the index of the IRQ type being queried
+* *count* describes the number of interrupts of the queried type.
+
+``VFIO_USER_DEVICE_SET_IRQS``
+-----------------------------
+
+This command message is sent by the client to the server to set actions for
+device interrupt types. The VFIO IRQ set structure is defined in
+``<linux/vfio.h>`` (``struct vfio_irq_set``).
+
+Request
+^^^^^^^
+
++-------+--------+------------------------------+
+| Name  | Offset | Size                         |
++=======+========+==============================+
+| argsz | 0      | 4                            |
++-------+--------+------------------------------+
+| flags | 4      | 4                            |
++-------+--------+------------------------------+
+|       | +-----+-----------------------------+ |
+|       | | Bit | Definition                  | |
+|       | +=====+=============================+ |
+|       | | 0   | VFIO_IRQ_SET_DATA_NONE      | |
+|       | +-----+-----------------------------+ |
+|       | | 1   | VFIO_IRQ_SET_DATA_BOOL      | |
+|       | +-----+-----------------------------+ |
+|       | | 2   | VFIO_IRQ_SET_DATA_EVENTFD   | |
+|       | +-----+-----------------------------+ |
+|       | | 3   | VFIO_IRQ_SET_ACTION_MASK    | |
+|       | +-----+-----------------------------+ |
+|       | | 4   | VFIO_IRQ_SET_ACTION_UNMASK  | |
+|       | +-----+-----------------------------+ |
+|       | | 5   | VFIO_IRQ_SET_ACTION_TRIGGER | |
+|       | +-----+-----------------------------+ |
++-------+--------+------------------------------+
+| index | 8      | 4                            |
++-------+--------+------------------------------+
+| start | 12     | 4                            |
++-------+--------+------------------------------+
+| count | 16     | 4                            |
++-------+--------+------------------------------+
+| data  | 20     | variable                     |
++-------+--------+------------------------------+
+
+* *argsz* is the size of the VFIO IRQ set request payload, including any *data*
+  field. Note there is no reply payload, so this field differs from other
+  message types.
+* *flags* defines the action performed on the interrupt range. The ``DATA``
+  flags describe the data field sent in the message; the ``ACTION`` flags
+  describe the action to be performed. Within each of the two sets, the flags
+  are mutually exclusive.
+
+  * ``VFIO_IRQ_SET_DATA_NONE`` indicates there is no data field in the command.
+    The action is performed unconditionally.
+  * ``VFIO_IRQ_SET_DATA_BOOL`` indicates the data field is an array of boolean
+    bytes. The action is performed if the corresponding boolean is true.
+  * ``VFIO_IRQ_SET_DATA_EVENTFD`` indicates an array of event file descriptors
+    was sent in the message meta-data. These descriptors will be signalled when
+    the action defined by the action flags occurs. In ``AF_UNIX`` sockets, the
+    descriptors are sent as ``SCM_RIGHTS`` type ancillary data.
+    If no file descriptors are provided, this de-assigns the specified
+    previously configured interrupts.
+  * ``VFIO_IRQ_SET_ACTION_MASK`` indicates a masking event. It can be used with
+    ``VFIO_IRQ_SET_DATA_BOOL`` or ``VFIO_IRQ_SET_DATA_NONE`` to mask an interrupt,
+    or with ``VFIO_IRQ_SET_DATA_EVENTFD`` to generate an event when the guest masks
+    the interrupt.
+  * ``VFIO_IRQ_SET_ACTION_UNMASK`` indicates an unmasking event. It can be used
+    with ``VFIO_IRQ_SET_DATA_BOOL`` or ``VFIO_IRQ_SET_DATA_NONE`` to unmask an
+    interrupt, or with ``VFIO_IRQ_SET_DATA_EVENTFD`` to generate an event when the
+    guest unmasks the interrupt.
+  * ``VFIO_IRQ_SET_ACTION_TRIGGER`` indicates a triggering event. It can be used
+    with ``VFIO_IRQ_SET_DATA_BOOL`` or ``VFIO_IRQ_SET_DATA_NONE`` to trigger an
+    interrupt, or with ``VFIO_IRQ_SET_DATA_EVENTFD`` to generate an event when the
+    server triggers the interrupt.
+
+* *index* is the index of the IRQ type being set up.
+* *start* is the first sub-index being set.
+* *count* describes the number of sub-indexes being set. As a special case, a
+  count (and start) of 0, with data flags of ``VFIO_IRQ_SET_DATA_NONE`` disables
+  all interrupts of the index.
+* *data* is an optional field included when the
+  ``VFIO_IRQ_SET_DATA_BOOL`` flag is present. It contains an array of booleans
+  that specify whether the action is to be performed on the corresponding
+  sub-index. It is used when the action is only performed on a subset of the
+  range specified.
+
+Not all interrupt types support every combination of data and action flags.
+The client must know the capabilities of the device and IRQ index before it
+sends a ``VFIO_USER_DEVICE_SET_IRQS`` message.
+
+In typical operation, a specific IRQ may operate as follows:
+
+1. The client sends a ``VFIO_USER_DEVICE_SET_IRQS`` message with
+   ``flags=(VFIO_IRQ_SET_DATA_EVENTFD|VFIO_IRQ_SET_ACTION_TRIGGER)`` along
+   with an eventfd. This associates the IRQ with a particular eventfd on the
+   server side.
+
+#. The client may send a ``VFIO_USER_DEVICE_SET_IRQS`` message with
+   ``flags=(VFIO_IRQ_SET_DATA_EVENTFD|VFIO_IRQ_SET_ACTION_MASK/UNMASK)`` along
+   with another eventfd. This associates the given eventfd with the
+   mask/unmask state on the server side.
+
+#. The server may trigger the IRQ by writing 1 to the eventfd.
+
+#. The server may mask/unmask an IRQ, which writes 1 to the corresponding
+   mask/unmask eventfd, if there is one.
+
+#. A client may trigger a device IRQ itself, by sending a
+   ``VFIO_USER_DEVICE_SET_IRQS`` message with
+   ``flags=(VFIO_IRQ_SET_DATA_NONE/BOOL|VFIO_IRQ_SET_ACTION_TRIGGER)``.
+
+#. A client may mask or unmask the IRQ, by sending a
+   ``VFIO_USER_DEVICE_SET_IRQS`` message with
+   ``flags=(VFIO_IRQ_SET_DATA_NONE/BOOL|VFIO_IRQ_SET_ACTION_MASK/UNMASK)``.
+
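+The request payloads used in the sequence above could be assembled along the
+following lines. This is a minimal sketch: the helper name ``build_set_irqs``
+and the little-endian byte order are assumptions, the flag values follow the
+bit positions in the request table, and eventfds are expected to travel
+separately as ancillary data rather than in the payload.
+

```python
import struct

VFIO_IRQ_SET_DATA_NONE = 1 << 0
VFIO_IRQ_SET_DATA_BOOL = 1 << 1
VFIO_IRQ_SET_DATA_EVENTFD = 1 << 2
VFIO_IRQ_SET_ACTION_MASK = 1 << 3
VFIO_IRQ_SET_ACTION_UNMASK = 1 << 4
VFIO_IRQ_SET_ACTION_TRIGGER = 1 << 5

DATA_BITS = 0x07    # bits 0-2
ACTION_BITS = 0x38  # bits 3-5
# Little-endian byte order is an assumption of this sketch.
IRQ_SET = struct.Struct("<IIIII")  # argsz, flags, index, start, count


def build_set_irqs(flags, index, start, count, bools=None) -> bytes:
    """Build a SET_IRQS request payload (hypothetical helper)."""
    data_flags = flags & DATA_BITS
    action_flags = flags & ACTION_BITS
    # Exactly one DATA flag and exactly one ACTION flag must be set.
    if bin(data_flags).count("1") != 1 or bin(action_flags).count("1") != 1:
        raise ValueError("need exactly one DATA flag and one ACTION flag")
    data = b""
    if data_flags == VFIO_IRQ_SET_DATA_BOOL:
        if bools is None or len(bools) != count:
            raise ValueError("DATA_BOOL needs one boolean per sub-index")
        data = bytes(1 if b else 0 for b in bools)
    elif bools is not None:
        raise ValueError("a data field is only sent with DATA_BOOL")
    # argsz covers the fixed 20-byte structure plus any data field.
    return IRQ_SET.pack(IRQ_SET.size + len(data), flags, index, start, count) + data
```

+For step 1 above, a client would pass
+``VFIO_IRQ_SET_DATA_EVENTFD | VFIO_IRQ_SET_ACTION_TRIGGER`` and send the
+eventfd alongside the resulting 20-byte payload.
+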
+Reply
+^^^^^
+
+There is no payload in the reply.
+
+.. _Read and Write Operations:
+
+Note that all of these operations must be supported by the client and/or server,
+even if the corresponding memory or device region has been shared as mappable.
+
+The ``count`` field must not exceed the value of ``max_data_xfer_size`` of the
+peer, for both reads and writes.
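+
+One way a sender could respect this limit is to split a large transfer into
+several messages. The following is a minimal sketch; the generator name
+``split_transfer`` is hypothetical, and splitting is a choice of this sketch
+rather than a requirement of the protocol. Splitting may not be appropriate
+for device registers whose access width matters.
+

```python
def split_transfer(offset: int, count: int, max_data_xfer_size: int):
    """Yield (offset, count) pairs that each fit within the peer's limit."""
    if max_data_xfer_size <= 0:
        raise ValueError("max_data_xfer_size must be positive")
    while count > 0:
        chunk = min(count, max_data_xfer_size)
        yield offset, chunk
        offset += chunk
        count -= chunk
```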
+
+``VFIO_USER_REGION_READ``
+-------------------------
+
+If a device region is not mappable, it's not directly accessible by the client
+via ``mmap()`` of the underlying file descriptor. In this case, a client can
+read from a device region with this message.
+
+Request
+^^^^^^^
+
++--------+--------+----------+
+| Name   | Offset | Size     |
++========+========+==========+
+| offset | 0      | 8        |
++--------+--------+----------+
+| region | 8      | 4        |
++--------+--------+----------+
+| count  | 12     | 4        |
++--------+--------+----------+
+
+* *offset* is the offset into the region being accessed.
+* *region* is the index of the region being accessed.
+* *count* is the size of the data to be transferred.
+
+Reply
+^^^^^
+
++--------+--------+----------+
+| Name   | Offset | Size     |
++========+========+==========+
+| offset | 0      | 8        |
++--------+--------+----------+
+| region | 8      | 4        |
++--------+--------+----------+
+| count  | 12     | 4        |
++--------+--------+----------+
+| data   | 16     | variable |
++--------+--------+----------+
+
+* *offset* is the offset into the region accessed.
+* *region* is the index of the region accessed.
+* *count* is the size of the data transferred.
+* *data* is the data that was read from the device region.
+
+``VFIO_USER_REGION_WRITE``
+--------------------------
+
+If a device region is not mappable, it's not directly accessible by the client
+via ``mmap()`` of the underlying file descriptor. In this case, a client can
+write to a device region with this message.
+
+Request
+^^^^^^^
+
++--------+--------+----------+
+| Name   | Offset | Size     |
++========+========+==========+
+| offset | 0      | 8        |
++--------+--------+----------+
+| region | 8      | 4        |
++--------+--------+----------+
+| count  | 12     | 4        |
++--------+--------+----------+
+| data   | 16     | variable |
++--------+--------+----------+
+
+* *offset* is the offset into the region being accessed.
+* *region* is the index of the region being accessed.
+* *count* is the size of the data to be transferred.
+* *data* is the data to write.
+
+Reply
+^^^^^
+
++--------+--------+----------+
+| Name   | Offset | Size     |
++========+========+==========+
+| offset | 0      | 8        |
++--------+--------+----------+
+| region | 8      | 4        |
++--------+--------+----------+
+| count  | 12     | 4        |
++--------+--------+----------+
+
+* *offset* is the offset into the region accessed.
+* *region* is the index of the region accessed.
+* *count* is the size of the data transferred.
+
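+As an illustration of the region read and write layouts above, the following
+is a minimal sketch of building the request payloads and extracting the data
+from a read reply. The helper names and the little-endian byte order are
+assumptions of this sketch.
+

```python
import struct

# Little-endian byte order is an assumption of this sketch.
REGION_ACCESS = struct.Struct("<QII")  # offset, region, count


def build_region_read(region: int, offset: int, count: int) -> bytes:
    """Request payload for a region read (hypothetical helper)."""
    return REGION_ACCESS.pack(offset, region, count)


def build_region_write(region: int, offset: int, data: bytes) -> bytes:
    """Request payload for a region write: fixed fields followed by the data."""
    return REGION_ACCESS.pack(offset, region, len(data)) + bytes(data)


def parse_region_read_reply(reply: bytes) -> bytes:
    """Return the data bytes carried by a region read reply."""
    _offset, _region, count = REGION_ACCESS.unpack_from(reply)
    data = reply[REGION_ACCESS.size:REGION_ACCESS.size + count]
    if len(data) != count:
        raise ValueError("reply is shorter than its count field")
    return data
```
+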
+``VFIO_USER_DMA_READ``
+-----------------------
+
+If the client has not shared mappable memory, the server can use this message to
+read from guest memory.
+
+Request
+^^^^^^^
+
++---------+--------+----------+
+| Name    | Offset | Size     |
++=========+========+==========+
+| address | 0      | 8        |
++---------+--------+----------+
+| count   | 8      | 8        |
++---------+--------+----------+
+
+* *address* is the client DMA memory address being accessed. This address must have
+  been previously exported to the server with a ``VFIO_USER_DMA_MAP`` message.
+* *count* is the size of the data to be transferred.
+
+Reply
+^^^^^
+
++---------+--------+----------+
+| Name    | Offset | Size     |
++=========+========+==========+
+| address | 0      | 8        |
++---------+--------+----------+
+| count   | 8      | 8        |
++---------+--------+----------+
+| data    | 16     | variable |
++---------+--------+----------+
+
+* *address* is the client DMA memory address being accessed.
+* *count* is the size of the data transferred.
+* *data* is the data read.
+
+``VFIO_USER_DMA_WRITE``
+-----------------------
+
+If the client has not shared mappable memory, the server can use this message to
+write to guest memory.
+
+Request
+^^^^^^^
+
++---------+--------+----------+
+| Name    | Offset | Size     |
++=========+========+==========+
+| address | 0      | 8        |
++---------+--------+----------+
+| count   | 8      | 8        |
++---------+--------+----------+
+| data    | 16     | variable |
++---------+--------+----------+
+
+* *address* is the client DMA memory address being accessed. This address must have
+  been previously exported to the server with a ``VFIO_USER_DMA_MAP`` message.
+* *count* is the size of the data to be transferred.
+* *data* is the data to write.
+
+Reply
+^^^^^
+
++---------+--------+----------+
+| Name    | Offset | Size     |
++=========+========+==========+
+| address | 0      | 8        |
++---------+--------+----------+
+| count   | 8      | 8        |
++---------+--------+----------+
+
+* *address* is the client DMA memory address being accessed.
+* *count* is the size of the data transferred.
+
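+Because the *address* of a DMA access must have been exported earlier with
+``VFIO_USER_DMA_MAP``, a server may want to check the range before building
+the message. The following is a minimal sketch; the helper names are
+hypothetical, little-endian byte order is assumed, and requiring the access to
+fit inside a single mapping is a simplification of this sketch.
+

```python
import struct

# Little-endian byte order is an assumption of this sketch.
DMA_ACCESS = struct.Struct("<QQ")  # address, count


def find_mapping(mappings, address: int, count: int):
    """Return the (start, size) mapping that fully contains the access, or None."""
    end = address + count
    for start, size in mappings:
        if start <= address and end <= start + size:
            return (start, size)
    return None


def build_dma_write(mappings, address: int, data: bytes) -> bytes:
    """Request payload for a DMA write, refused if the range is not mapped."""
    if find_mapping(mappings, address, len(data)) is None:
        raise ValueError("address range was not exported with VFIO_USER_DMA_MAP")
    return DMA_ACCESS.pack(address, len(data)) + bytes(data)
```
+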
+``VFIO_USER_DEVICE_RESET``
+--------------------------
+
+This command message is sent from the client to the server to reset the device.
+Neither the request nor the reply has a payload.
+
+
+Appendices
+==========
+
+Unused VFIO ``ioctl()`` commands
+--------------------------------
+
+The following VFIO commands do not have an equivalent vfio-user command:
+
+* ``VFIO_GET_API_VERSION``
+* ``VFIO_CHECK_EXTENSION``
+* ``VFIO_SET_IOMMU``
+* ``VFIO_GROUP_GET_STATUS``
+* ``VFIO_GROUP_SET_CONTAINER``
+* ``VFIO_GROUP_UNSET_CONTAINER``
+* ``VFIO_GROUP_GET_DEVICE_FD``
+* ``VFIO_IOMMU_GET_INFO``
+
+However, once support for live migration for VFIO devices is finalized, some
+of the above commands may have to be handled by the client in their
+corresponding vfio-user form. This will be addressed in a future protocol
+version.
+
+VFIO groups and containers
+^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The current VFIO implementation includes group and container idioms that
+describe how a device relates to the host IOMMU. In the vfio-user
+implementation, the IOMMU is implemented in software by the client, and is not
+visible to the server. The simplest approach is for the client to put each
+device into its own group and container.
+
+Backend Program Conventions
+---------------------------
+
+vfio-user backend program conventions are based on the vhost-user ones.
+
+* The backend program must not daemonize itself.
+* No assumptions must be made as to what access the backend program has on the
+  system.
+* File descriptors 0, 1 and 2 must exist, must have regular
+  stdin/stdout/stderr semantics, and can be redirected.
+* The backend program must honor the SIGTERM signal.
+* The backend program must accept the following command line options:
+
+  * ``--socket-path=PATH``: path to UNIX domain socket
+  * ``--fd=FDNUM``: file descriptor for UNIX domain socket, incompatible with
+    ``--socket-path``
+
+* The backend program must be accompanied by a JSON file stored under
+  ``/usr/share/vfio-user``.
+
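+The command line conventions above could be handled roughly as follows. This
+is a minimal sketch: the function names are hypothetical, and treating exactly
+one of the two options as required is an assumption of the sketch.
+

```python
import argparse
import signal
import sys


def build_parser() -> argparse.ArgumentParser:
    """Command line parser for a backend program (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="example vfio-user backend")
    # --socket-path and --fd are incompatible with each other.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--socket-path", metavar="PATH",
                       help="path to UNIX domain socket")
    group.add_argument("--fd", metavar="FDNUM", type=int,
                       help="file descriptor for UNIX domain socket")
    return parser


def install_sigterm_handler() -> None:
    """Honor SIGTERM by exiting cleanly; the process never daemonizes itself."""
    signal.signal(signal.SIGTERM, lambda signum, frame: sys.exit(0))
```
+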
+TODO add schema similar to docs/interop/vhost-user.json.
diff --git a/MAINTAINERS b/MAINTAINERS
index 294c88a..8117241 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1955,6 +1955,12 @@ F: hw/vfio/ap.c
F: docs/system/s390x/vfio-ap.rst
L: qemu-s390x@nongnu.org
+vfio-user
+M: John G Johnson <john.g.johnson@oracle.com>
+M: Thanos Makatos <thanos.makatos@nutanix.com>
+S: Supported
+F: docs/devel/vfio-user.rst
+
vhost
M: Michael S. Tsirkin <mst@redhat.com>
S: Supported
--
1.8.3.1
* [RFC v5 02/23] vfio-user: add VFIO base abstract class
From: John Johnson @ 2022-05-05 17:19 UTC (permalink / raw)
To: qemu-devel
Add an abstract base class both the kernel driver
and user socket implementations can use to share code.
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
hw/vfio/pci.h | 16 +++++++--
hw/vfio/pci.c | 106 +++++++++++++++++++++++++++++++++++-----------------------
2 files changed, 78 insertions(+), 44 deletions(-)
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 6477751..bbc78aa 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -114,8 +114,13 @@ typedef struct VFIOMSIXInfo {
unsigned long *pending;
} VFIOMSIXInfo;
-#define TYPE_VFIO_PCI "vfio-pci"
-OBJECT_DECLARE_SIMPLE_TYPE(VFIOPCIDevice, VFIO_PCI)
+/*
+ * TYPE_VFIO_PCI_BASE is an abstract type used to share code
+ * between VFIO implementations that use a kernel driver
+ * with those that use user sockets.
+ */
+#define TYPE_VFIO_PCI_BASE "vfio-pci-base"
+OBJECT_DECLARE_SIMPLE_TYPE(VFIOPCIDevice, VFIO_PCI_BASE)
struct VFIOPCIDevice {
PCIDevice pdev;
@@ -175,6 +180,13 @@ struct VFIOPCIDevice {
Notifier irqchip_change_notifier;
};
+#define TYPE_VFIO_PCI "vfio-pci"
+OBJECT_DECLARE_SIMPLE_TYPE(VFIOKernPCIDevice, VFIO_PCI)
+
+struct VFIOKernPCIDevice {
+ VFIOPCIDevice device;
+};
+
/* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot match hw */
static inline bool vfio_pci_is(VFIOPCIDevice *vdev, uint32_t vendor, uint32_t device)
{
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 9fd9fae..4ee5215 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -231,7 +231,7 @@ static void vfio_intx_update(VFIOPCIDevice *vdev, PCIINTxRoute *route)
static void vfio_intx_routing_notifier(PCIDevice *pdev)
{
- VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+ VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
PCIINTxRoute route;
if (vdev->interrupt != VFIO_INT_INTx) {
@@ -460,7 +460,7 @@ static void vfio_update_kvm_msi_virq(VFIOMSIVector *vector, MSIMessage msg,
static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
MSIMessage *msg, IOHandler *handler)
{
- VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+ VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
VFIOMSIVector *vector;
int ret;
@@ -545,7 +545,7 @@ static int vfio_msix_vector_use(PCIDevice *pdev,
static void vfio_msix_vector_release(PCIDevice *pdev, unsigned int nr)
{
- VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+ VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
VFIOMSIVector *vector = &vdev->msi_vectors[nr];
trace_vfio_msix_vector_release(vdev->vbasedev.name, nr);
@@ -1066,7 +1066,7 @@ static const MemoryRegionOps vfio_vga_ops = {
*/
static void vfio_sub_page_bar_update_mapping(PCIDevice *pdev, int bar)
{
- VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+ VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
VFIORegion *region = &vdev->bars[bar].region;
MemoryRegion *mmap_mr, *region_mr, *base_mr;
PCIIORegion *r;
@@ -1112,7 +1112,7 @@ static void vfio_sub_page_bar_update_mapping(PCIDevice *pdev, int bar)
*/
uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
{
- VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+ VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
uint32_t emu_bits = 0, emu_val = 0, phys_val = 0, val;
memcpy(&emu_bits, vdev->emulated_config_bits + addr, len);
@@ -1145,7 +1145,7 @@ uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
void vfio_pci_write_config(PCIDevice *pdev,
uint32_t addr, uint32_t val, int len)
{
- VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+ VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
uint32_t val_le = cpu_to_le32(val);
trace_vfio_pci_write_config(vdev->vbasedev.name, addr, val, len);
@@ -2802,7 +2802,7 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
static void vfio_realize(PCIDevice *pdev, Error **errp)
{
- VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+ VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
VFIODevice *vbasedev_iter;
VFIOGroup *group;
char *tmp, *subsys, group_path[PATH_MAX], *group_name;
@@ -3125,7 +3125,7 @@ error:
static void vfio_instance_finalize(Object *obj)
{
- VFIOPCIDevice *vdev = VFIO_PCI(obj);
+ VFIOPCIDevice *vdev = VFIO_PCI_BASE(obj);
VFIOGroup *group = vdev->vbasedev.group;
vfio_display_finalize(vdev);
@@ -3145,7 +3145,7 @@ static void vfio_instance_finalize(Object *obj)
static void vfio_exitfn(PCIDevice *pdev)
{
- VFIOPCIDevice *vdev = VFIO_PCI(pdev);
+ VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
vfio_unregister_req_notifier(vdev);
vfio_unregister_err_notifier(vdev);
@@ -3164,7 +3164,7 @@ static void vfio_exitfn(PCIDevice *pdev)
static void vfio_pci_reset(DeviceState *dev)
{
- VFIOPCIDevice *vdev = VFIO_PCI(dev);
+ VFIOPCIDevice *vdev = VFIO_PCI_BASE(dev);
trace_vfio_pci_reset(vdev->vbasedev.name);
@@ -3204,7 +3204,7 @@ post_reset:
static void vfio_instance_init(Object *obj)
{
PCIDevice *pci_dev = PCI_DEVICE(obj);
- VFIOPCIDevice *vdev = VFIO_PCI(obj);
+ VFIOPCIDevice *vdev = VFIO_PCI_BASE(obj);
device_add_bootindex_property(obj, &vdev->bootindex,
"bootindex", NULL,
@@ -3221,24 +3221,12 @@ static void vfio_instance_init(Object *obj)
pci_dev->cap_present |= QEMU_PCI_CAP_EXPRESS;
}
-static Property vfio_pci_dev_properties[] = {
- DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
- DEFINE_PROP_STRING("sysfsdev", VFIOPCIDevice, vbasedev.sysfsdev),
+static Property vfio_pci_base_dev_properties[] = {
DEFINE_PROP_ON_OFF_AUTO("x-pre-copy-dirty-page-tracking", VFIOPCIDevice,
vbasedev.pre_copy_dirty_page_tracking,
ON_OFF_AUTO_ON),
- DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
- display, ON_OFF_AUTO_OFF),
- DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
- DEFINE_PROP_UINT32("yres", VFIOPCIDevice, display_yres, 0),
DEFINE_PROP_UINT32("x-intx-mmap-timeout-ms", VFIOPCIDevice,
intx.mmap_timeout, 1100),
- DEFINE_PROP_BIT("x-vga", VFIOPCIDevice, features,
- VFIO_FEATURE_ENABLE_VGA_BIT, false),
- DEFINE_PROP_BIT("x-req", VFIOPCIDevice, features,
- VFIO_FEATURE_ENABLE_REQ_BIT, true),
- DEFINE_PROP_BIT("x-igd-opregion", VFIOPCIDevice, features,
- VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
DEFINE_PROP_BOOL("x-enable-migration", VFIOPCIDevice,
vbasedev.enable_migration, false),
DEFINE_PROP_BOOL("x-no-mmap", VFIOPCIDevice, vbasedev.no_mmap, false),
@@ -3247,8 +3235,6 @@ static Property vfio_pci_dev_properties[] = {
DEFINE_PROP_BOOL("x-no-kvm-intx", VFIOPCIDevice, no_kvm_intx, false),
DEFINE_PROP_BOOL("x-no-kvm-msi", VFIOPCIDevice, no_kvm_msi, false),
DEFINE_PROP_BOOL("x-no-kvm-msix", VFIOPCIDevice, no_kvm_msix, false),
- DEFINE_PROP_BOOL("x-no-geforce-quirks", VFIOPCIDevice,
- no_geforce_quirks, false),
DEFINE_PROP_BOOL("x-no-kvm-ioeventfd", VFIOPCIDevice, no_kvm_ioeventfd,
false),
DEFINE_PROP_BOOL("x-no-vfio-ioeventfd", VFIOPCIDevice, no_vfio_ioeventfd,
@@ -3259,10 +3245,6 @@ static Property vfio_pci_dev_properties[] = {
sub_vendor_id, PCI_ANY_ID),
DEFINE_PROP_UINT32("x-pci-sub-device-id", VFIOPCIDevice,
sub_device_id, PCI_ANY_ID),
- DEFINE_PROP_UINT32("x-igd-gms", VFIOPCIDevice, igd_gms, 0),
- DEFINE_PROP_UNSIGNED_NODEFAULT("x-nv-gpudirect-clique", VFIOPCIDevice,
- nv_gpudirect_clique,
- qdev_prop_nv_gpudirect_clique, uint8_t),
DEFINE_PROP_OFF_AUTO_PCIBAR("x-msix-relocation", VFIOPCIDevice, msix_relo,
OFF_AUTOPCIBAR_OFF),
/*
@@ -3273,28 +3255,25 @@ static Property vfio_pci_dev_properties[] = {
DEFINE_PROP_END_OF_LIST(),
};
-static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
+static void vfio_pci_base_dev_class_init(ObjectClass *klass, void *data)
{
DeviceClass *dc = DEVICE_CLASS(klass);
PCIDeviceClass *pdc = PCI_DEVICE_CLASS(klass);
- dc->reset = vfio_pci_reset;
- device_class_set_props(dc, vfio_pci_dev_properties);
- dc->desc = "VFIO-based PCI device assignment";
+ device_class_set_props(dc, vfio_pci_base_dev_properties);
+ dc->desc = "VFIO PCI base device";
set_bit(DEVICE_CATEGORY_MISC, dc->categories);
- pdc->realize = vfio_realize;
pdc->exit = vfio_exitfn;
pdc->config_read = vfio_pci_read_config;
pdc->config_write = vfio_pci_write_config;
}
-static const TypeInfo vfio_pci_dev_info = {
- .name = TYPE_VFIO_PCI,
+static const TypeInfo vfio_pci_base_dev_info = {
+ .name = TYPE_VFIO_PCI_BASE,
.parent = TYPE_PCI_DEVICE,
- .instance_size = sizeof(VFIOPCIDevice),
- .class_init = vfio_pci_dev_class_init,
- .instance_init = vfio_instance_init,
- .instance_finalize = vfio_instance_finalize,
+ .instance_size = 0,
+ .abstract = true,
+ .class_init = vfio_pci_base_dev_class_init,
.interfaces = (InterfaceInfo[]) {
{ INTERFACE_PCIE_DEVICE },
{ INTERFACE_CONVENTIONAL_PCI_DEVICE },
@@ -3302,6 +3281,48 @@ static const TypeInfo vfio_pci_dev_info = {
},
};
+static Property vfio_pci_dev_properties[] = {
+ DEFINE_PROP_PCI_HOST_DEVADDR("host", VFIOPCIDevice, host),
+ DEFINE_PROP_STRING("sysfsdev", VFIOPCIDevice, vbasedev.sysfsdev),
+ DEFINE_PROP_ON_OFF_AUTO("display", VFIOPCIDevice,
+ display, ON_OFF_AUTO_OFF),
+ DEFINE_PROP_UINT32("xres", VFIOPCIDevice, display_xres, 0),
+ DEFINE_PROP_UINT32("yres", VFIOPCIDevice, display_yres, 0),
+ DEFINE_PROP_BIT("x-vga", VFIOPCIDevice, features,
+ VFIO_FEATURE_ENABLE_VGA_BIT, false),
+ DEFINE_PROP_BIT("x-req", VFIOPCIDevice, features,
+ VFIO_FEATURE_ENABLE_REQ_BIT, true),
+ DEFINE_PROP_BIT("x-igd-opregion", VFIOPCIDevice, features,
+ VFIO_FEATURE_ENABLE_IGD_OPREGION_BIT, false),
+ DEFINE_PROP_BOOL("x-no-geforce-quirks", VFIOPCIDevice,
+ no_geforce_quirks, false),
+ DEFINE_PROP_UINT32("x-igd-gms", VFIOPCIDevice, igd_gms, 0),
+ DEFINE_PROP_UNSIGNED_NODEFAULT("x-nv-gpudirect-clique", VFIOPCIDevice,
+ nv_gpudirect_clique,
+ qdev_prop_nv_gpudirect_clique, uint8_t),
+ DEFINE_PROP_END_OF_LIST(),
+};
+
+static void vfio_pci_dev_class_init(ObjectClass *klass, void *data)
+{
+ DeviceClass *dc = DEVICE_CLASS(klass);
+ PCIDeviceClass *pdc = PCI_DEVICE_CLASS(klass);
+
+ dc->reset = vfio_pci_reset;
+ device_class_set_props(dc, vfio_pci_dev_properties);
+ dc->desc = "VFIO-based PCI device assignment";
+ pdc->realize = vfio_realize;
+}
+
+static const TypeInfo vfio_pci_dev_info = {
+ .name = TYPE_VFIO_PCI,
+ .parent = TYPE_VFIO_PCI_BASE,
+ .instance_size = sizeof(VFIOKernPCIDevice),
+ .class_init = vfio_pci_dev_class_init,
+ .instance_init = vfio_instance_init,
+ .instance_finalize = vfio_instance_finalize,
+};
+
static Property vfio_pci_dev_nohotplug_properties[] = {
DEFINE_PROP_BOOL("ramfb", VFIOPCIDevice, enable_ramfb, false),
DEFINE_PROP_END_OF_LIST(),
@@ -3318,12 +3339,13 @@ static void vfio_pci_nohotplug_dev_class_init(ObjectClass *klass, void *data)
static const TypeInfo vfio_pci_nohotplug_dev_info = {
.name = TYPE_VFIO_PCI_NOHOTPLUG,
.parent = TYPE_VFIO_PCI,
- .instance_size = sizeof(VFIOPCIDevice),
+ .instance_size = sizeof(VFIOKernPCIDevice),
.class_init = vfio_pci_nohotplug_dev_class_init,
};
static void register_vfio_pci_dev_type(void)
{
+ type_register_static(&vfio_pci_base_dev_info);
type_register_static(&vfio_pci_dev_info);
type_register_static(&vfio_pci_nohotplug_dev_info);
}
--
1.8.3.1
* [RFC v5 03/23] vfio-user: add container IO ops vector
From: John Johnson @ 2022-05-05 17:19 UTC (permalink / raw)
To: qemu-devel
Add an ops vector for container I/O, used for communication with the
VFIO driver (prep work for vfio-user, which will communicate over a
socket).
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
---
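Note: the ops-vector indirection added by this patch can be illustrated
outside QEMU. The following is only a minimal sketch, assuming made-up
names (DemoContainer, DemoMap, DemoContIO, DEMO_DMA_MAP and the two demo
backends are hypothetical and do not exist in QEMU). It shows callers
going through a dispatch macro shaped like CONT_DMA_MAP, and backends
returning 0 or a negative errno value instead of leaving the error in
errno.

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for VFIOContainer and the DMA map argument. */
typedef struct DemoContainer DemoContainer;

typedef struct DemoMap {
    uint64_t iova;
    uint64_t size;
} DemoMap;

/* Ops vector: one function pointer per backend operation. */
typedef struct DemoContIO {
    int (*dma_map)(DemoContainer *cont, DemoMap *map);
} DemoContIO;

struct DemoContainer {
    const DemoContIO *io_ops;  /* selected once, when the container is set up */
    uint64_t mapped_bytes;     /* bookkeeping for this demo only */
};

/* Dispatch macro in the same shape as CONT_DMA_MAP in the patch. */
#define DEMO_DMA_MAP(cont, map) \
    ((cont)->io_ops->dma_map((cont), (map)))

/* Local backend: rejects empty mappings with -EINVAL. */
static int demo_local_dma_map(DemoContainer *cont, DemoMap *map)
{
    if (map->size == 0) {
        return -EINVAL;
    }
    cont->mapped_bytes += map->size;
    return 0;
}

/* Remote backend: pretends the peer is not connected. */
static int demo_remote_dma_map(DemoContainer *cont, DemoMap *map)
{
    (void)cont;
    (void)map;
    return -ENOTCONN;
}

static const DemoContIO demo_local_io = { .dma_map = demo_local_dma_map };
static const DemoContIO demo_remote_io = { .dma_map = demo_remote_dma_map };
```

Because the error travels in the return value, a caller can report it
with strerror(-ret) regardless of which backend produced it, which is
presumably why the error_report() calls in this patch stop reading errno.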
include/hw/vfio/vfio-common.h | 33 +++++++++++
hw/vfio/common.c | 126 ++++++++++++++++++++++++++++--------------
2 files changed, 117 insertions(+), 42 deletions(-)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 8af11b0..2761a62 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -75,6 +75,7 @@ typedef struct VFIOAddressSpace {
} VFIOAddressSpace;
struct VFIOGroup;
+typedef struct VFIOContIO VFIOContIO;
typedef struct VFIOContainer {
VFIOAddressSpace *space;
@@ -83,6 +84,7 @@ typedef struct VFIOContainer {
MemoryListener prereg_listener;
unsigned iommu_type;
Error *error;
+ VFIOContIO *io_ops;
bool initialized;
bool dirty_pages_supported;
uint64_t dirty_pgsizes;
@@ -154,6 +156,37 @@ struct VFIODeviceOps {
int (*vfio_load_config)(VFIODevice *vdev, QEMUFile *f);
};
+#ifdef CONFIG_LINUX
+
+/*
+ * The next 2 ops vectors are how Devices and Containers
+ * communicate with the server. The default option is
+ * through ioctl() to the kernel VFIO driver, but vfio-user
+ * can use a socket to a remote process.
+ */
+
+struct VFIOContIO {
+ int (*dma_map)(VFIOContainer *container,
+ struct vfio_iommu_type1_dma_map *map);
+ int (*dma_unmap)(VFIOContainer *container,
+ struct vfio_iommu_type1_dma_unmap *unmap,
+ struct vfio_bitmap *bitmap);
+ int (*dirty_bitmap)(VFIOContainer *container,
+ struct vfio_iommu_type1_dirty_bitmap *bitmap,
+ struct vfio_iommu_type1_dirty_bitmap_get *range);
+};
+
+#define CONT_DMA_MAP(cont, map) \
+ ((cont)->io_ops->dma_map((cont), (map)))
+#define CONT_DMA_UNMAP(cont, unmap, bitmap) \
+ ((cont)->io_ops->dma_unmap((cont), (unmap), (bitmap)))
+#define CONT_DIRTY_BITMAP(cont, bitmap, range) \
+ ((cont)->io_ops->dirty_bitmap((cont), (bitmap), (range)))
+
+extern VFIOContIO vfio_cont_io_ioctl;
+
+#endif /* CONFIG_LINUX */
+
typedef struct VFIOGroup {
int fd;
int groupid;
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 2b1f78f..917da0f 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -431,12 +431,12 @@ static int vfio_dma_unmap_bitmap(VFIOContainer *container,
goto unmap_exit;
}
- ret = ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap);
+ ret = CONT_DMA_UNMAP(container, unmap, bitmap);
if (!ret) {
cpu_physical_memory_set_dirty_lebitmap((unsigned long *)bitmap->data,
iotlb->translated_addr, pages);
} else {
- error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %m");
+ error_report("VFIO_UNMAP_DMA with DIRTY_BITMAP : %s", strerror(-ret));
}
g_free(bitmap->data);
@@ -464,30 +464,7 @@ static int vfio_dma_unmap(VFIOContainer *container,
return vfio_dma_unmap_bitmap(container, iova, size, iotlb);
}
- while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, &unmap)) {
- /*
- * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
- * v4.15) where an overflow in its wrap-around check prevents us from
- * unmapping the last page of the address space. Test for the error
- * condition and re-try the unmap excluding the last page. The
- * expectation is that we've never mapped the last page anyway and this
- * unmap request comes via vIOMMU support which also makes it unlikely
- * that this page is used. This bug was introduced well after type1 v2
- * support was introduced, so we shouldn't need to test for v1. A fix
- * is queued for kernel v5.0 so this workaround can be removed once
- * affected kernels are sufficiently deprecated.
- */
- if (errno == EINVAL && unmap.size && !(unmap.iova + unmap.size) &&
- container->iommu_type == VFIO_TYPE1v2_IOMMU) {
- trace_vfio_dma_unmap_overflow_workaround();
- unmap.size -= 1ULL << ctz64(container->pgsizes);
- continue;
- }
- error_report("VFIO_UNMAP_DMA failed: %s", strerror(errno));
- return -errno;
- }
-
- return 0;
+ return CONT_DMA_UNMAP(container, &unmap, NULL);
}
static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
@@ -500,24 +477,18 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
.iova = iova,
.size = size,
};
+ int ret;
if (!readonly) {
map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
}
- /*
- * Try the mapping, if it fails with EBUSY, unmap the region and try
- * again. This shouldn't be necessary, but we sometimes see it in
- * the VGA ROM space.
- */
- if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0 ||
- (errno == EBUSY && vfio_dma_unmap(container, iova, size, NULL) == 0 &&
- ioctl(container->fd, VFIO_IOMMU_MAP_DMA, &map) == 0)) {
- return 0;
- }
+ ret = CONT_DMA_MAP(container, &map);
- error_report("VFIO_MAP_DMA failed: %s", strerror(errno));
- return -errno;
+ if (ret < 0) {
+ error_report("VFIO_MAP_DMA failed: %s", strerror(-ret));
+ }
+ return ret;
}
static void vfio_host_win_add(VFIOContainer *container,
@@ -1230,10 +1201,10 @@ static void vfio_set_dirty_page_tracking(VFIOContainer *container, bool start)
dirty.flags = VFIO_IOMMU_DIRTY_PAGES_FLAG_STOP;
}
- ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, &dirty);
+ ret = CONT_DIRTY_BITMAP(container, &dirty, NULL);
if (ret) {
error_report("Failed to set dirty tracking flag 0x%x errno: %d",
- dirty.flags, errno);
+ dirty.flags, -ret);
}
}
@@ -1283,11 +1254,11 @@ static int vfio_get_dirty_bitmap(VFIOContainer *container, uint64_t iova,
goto err_out;
}
- ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, dbitmap);
+ ret = CONT_DIRTY_BITMAP(container, dbitmap, range);
if (ret) {
error_report("Failed to get dirty bitmap for iova: 0x%"PRIx64
" size: 0x%"PRIx64" err: %d", (uint64_t)range->iova,
- (uint64_t)range->size, errno);
+ (uint64_t)range->size, -ret);
goto err_out;
}
@@ -2058,6 +2029,7 @@ static int vfio_connect_container(VFIOGroup *group, AddressSpace *as,
container->error = NULL;
container->dirty_pages_supported = false;
container->dma_max_mappings = 0;
+ container->io_ops = &vfio_cont_io_ioctl;
QLIST_INIT(&container->giommu_list);
QLIST_INIT(&container->hostwin_list);
QLIST_INIT(&container->vrdl_list);
@@ -2594,3 +2566,73 @@ int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
}
return vfio_eeh_container_op(container, op);
}
+
+/*
+ * Traditional ioctl() based io_ops
+ */
+
+static int vfio_io_dma_map(VFIOContainer *container,
+ struct vfio_iommu_type1_dma_map *map)
+{
+
+ /*
+ * Try the mapping, if it fails with EBUSY, unmap the region and try
+ * again. This shouldn't be necessary, but we sometimes see it in
+ * the VGA ROM space.
+ */
+ if (ioctl(container->fd, VFIO_IOMMU_MAP_DMA, map) == 0 ||
+ (errno == EBUSY &&
+ vfio_dma_unmap(container, map->iova, map->size, NULL) == 0 &&
+ ioctl(container->fd, VFIO_IOMMU_MAP_DMA, map) == 0)) {
+ return 0;
+ }
+ return -errno;
+}
+
+static int vfio_io_dma_unmap(VFIOContainer *container,
+ struct vfio_iommu_type1_dma_unmap *unmap,
+ struct vfio_bitmap *bitmap)
+{
+
+ while (ioctl(container->fd, VFIO_IOMMU_UNMAP_DMA, unmap)) {
+ /*
+ * The type1 backend has an off-by-one bug in the kernel (71a7d3d78e3c
+ * v4.15) where an overflow in its wrap-around check prevents us from
+ * unmapping the last page of the address space. Test for the error
+ * condition and re-try the unmap excluding the last page. The
+ * expectation is that we've never mapped the last page anyway and this
+ * unmap request comes via vIOMMU support which also makes it unlikely
+ * that this page is used. This bug was introduced well after type1 v2
+ * support was introduced, so we shouldn't need to test for v1. A fix
+ * is queued for kernel v5.0 so this workaround can be removed once
+ * affected kernels are sufficiently deprecated.
+ */
+ if (errno == EINVAL && unmap->size && !(unmap->iova + unmap->size) &&
+ container->iommu_type == VFIO_TYPE1v2_IOMMU) {
+ trace_vfio_dma_unmap_overflow_workaround();
+ unmap->size -= 1ULL << ctz64(container->pgsizes);
+ continue;
+ }
+ error_report("VFIO_UNMAP_DMA failed: %s", strerror(errno));
+ return -errno;
+ }
+
+ return 0;
+}
+
+static int vfio_io_dirty_bitmap(VFIOContainer *container,
+ struct vfio_iommu_type1_dirty_bitmap *bitmap,
+ struct vfio_iommu_type1_dirty_bitmap_get *range)
+{
+ int ret;
+
+ ret = ioctl(container->fd, VFIO_IOMMU_DIRTY_PAGES, bitmap);
+
+ return ret < 0 ? -errno : ret;
+}
+
+VFIOContIO vfio_cont_io_ioctl = {
+ .dma_map = vfio_io_dma_map,
+ .dma_unmap = vfio_io_dma_unmap,
+ .dirty_bitmap = vfio_io_dirty_bitmap,
+};
--
1.8.3.1
* [RFC v5 04/23] vfio-user: add region cache
From: John Johnson @ 2022-05-05 17:19 UTC (permalink / raw)
To: qemu-devel
Cache VFIO_DEVICE_GET_REGION_INFO results to reduce memory alloc/free
cycles, and as prep work for vfio-user.
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
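Note: the ownership change behind all the removed g_free() calls can be
shown with a small stand-alone sketch. Everything here is an assumption
made for illustration: DemoDevice, DemoRegionInfo, demo_get_region_info()
and demo_put_device() are hypothetical names, the "query" is faked, and
the bounds check on index is an addition that the patch itself does not
have.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct vfio_region_info. */
typedef struct DemoRegionInfo {
    unsigned index;
    unsigned long size;
} DemoRegionInfo;

/* Hypothetical stand-in for VFIODevice. */
typedef struct DemoDevice {
    unsigned num_regions;
    DemoRegionInfo **regions;  /* lazily allocated cache, one slot per region */
    unsigned queries;          /* counts backend queries, demo only */
} DemoDevice;

/* Return cached info if present; otherwise query once and cache it. */
static int demo_get_region_info(DemoDevice *dev, unsigned index,
                                DemoRegionInfo **info)
{
    if (index >= dev->num_regions) {
        return -EINVAL;        /* extra check, not present in the patch */
    }
    /* create region cache */
    if (dev->regions == NULL) {
        dev->regions = calloc(dev->num_regions, sizeof(*dev->regions));
        if (dev->regions == NULL) {
            return -ENOMEM;
        }
    }
    /* check cache */
    if (dev->regions[index] != NULL) {
        *info = dev->regions[index];
        return 0;
    }
    /* miss: fake the result of a VFIO_DEVICE_GET_REGION_INFO query */
    *info = calloc(1, sizeof(**info));
    if (*info == NULL) {
        return -ENOMEM;
    }
    (*info)->index = index;
    (*info)->size = 4096ul * (index + 1);
    dev->queries++;
    /* fill cache: the cache now owns the allocation, callers must not free */
    dev->regions[index] = *info;
    return 0;
}

/* Teardown frees every cached entry, then the cache array itself. */
static void demo_put_device(DemoDevice *dev)
{
    if (dev->regions != NULL) {
        for (unsigned i = 0; i < dev->num_regions; i++) {
            free(dev->regions[i]);
        }
        free(dev->regions);
        dev->regions = NULL;
    }
}
```

With this ownership model a caller that still frees the returned pointer
would cause a double free at teardown, which is why every g_free(info)
in the callers has to be removed in the same patch.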
include/hw/vfio/vfio-common.h | 2 ++
hw/vfio/ccw.c | 5 -----
hw/vfio/common.c | 41 +++++++++++++++++++++++++++++++++++------
hw/vfio/igd.c | 23 +++++++++--------------
hw/vfio/migration.c | 2 --
hw/vfio/pci-quirks.c | 19 +++++--------------
hw/vfio/pci.c | 8 --------
7 files changed, 51 insertions(+), 49 deletions(-)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 2761a62..1a032f4 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -145,6 +145,7 @@ typedef struct VFIODevice {
VFIOMigration *migration;
Error *migration_blocker;
OnOffAuto pre_copy_dirty_page_tracking;
+ struct vfio_region_info **regions;
} VFIODevice;
struct VFIODeviceOps {
@@ -258,6 +259,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
struct vfio_region_info **info);
int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
uint32_t subtype, struct vfio_region_info **info);
+void vfio_get_all_regions(VFIODevice *vbasedev);
bool vfio_has_region_cap(VFIODevice *vbasedev, int region, uint16_t cap_type);
struct vfio_info_cap_header *
vfio_get_region_info_cap(struct vfio_region_info *info, uint16_t id);
diff --git a/hw/vfio/ccw.c b/hw/vfio/ccw.c
index 0354737..06b588c 100644
--- a/hw/vfio/ccw.c
+++ b/hw/vfio/ccw.c
@@ -517,7 +517,6 @@ static void vfio_ccw_get_region(VFIOCCWDevice *vcdev, Error **errp)
vcdev->io_region_offset = info->offset;
vcdev->io_region = g_malloc0(info->size);
- g_free(info);
/* check for the optional async command region */
ret = vfio_get_dev_region_info(vdev, VFIO_REGION_TYPE_CCW,
@@ -530,7 +529,6 @@ static void vfio_ccw_get_region(VFIOCCWDevice *vcdev, Error **errp)
}
vcdev->async_cmd_region_offset = info->offset;
vcdev->async_cmd_region = g_malloc0(info->size);
- g_free(info);
}
ret = vfio_get_dev_region_info(vdev, VFIO_REGION_TYPE_CCW,
@@ -543,7 +541,6 @@ static void vfio_ccw_get_region(VFIOCCWDevice *vcdev, Error **errp)
}
vcdev->schib_region_offset = info->offset;
vcdev->schib_region = g_malloc(info->size);
- g_free(info);
}
ret = vfio_get_dev_region_info(vdev, VFIO_REGION_TYPE_CCW,
@@ -557,7 +554,6 @@ static void vfio_ccw_get_region(VFIOCCWDevice *vcdev, Error **errp)
}
vcdev->crw_region_offset = info->offset;
vcdev->crw_region = g_malloc(info->size);
- g_free(info);
}
return;
@@ -567,7 +563,6 @@ out_err:
g_free(vcdev->schib_region);
g_free(vcdev->async_cmd_region);
g_free(vcdev->io_region);
- g_free(info);
return;
}
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 917da0f..d9290f3 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1568,8 +1568,6 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
}
}
- g_free(info);
-
trace_vfio_region_setup(vbasedev->name, index, name,
region->flags, region->fd_offset, region->size);
return 0;
@@ -2325,6 +2323,16 @@ void vfio_put_group(VFIOGroup *group)
}
}
+void vfio_get_all_regions(VFIODevice *vbasedev)
+{
+ struct vfio_region_info *info;
+ int i;
+
+ for (i = 0; i < vbasedev->num_regions; i++) {
+ vfio_get_region_info(vbasedev, i, &info);
+ }
+}
+
int vfio_get_device(VFIOGroup *group, const char *name,
VFIODevice *vbasedev, Error **errp)
{
@@ -2380,12 +2388,23 @@ int vfio_get_device(VFIOGroup *group, const char *name,
trace_vfio_get_device(name, dev_info.flags, dev_info.num_regions,
dev_info.num_irqs);
+ vfio_get_all_regions(vbasedev);
vbasedev->reset_works = !!(dev_info.flags & VFIO_DEVICE_FLAGS_RESET);
return 0;
}
void vfio_put_base_device(VFIODevice *vbasedev)
{
+ if (vbasedev->regions != NULL) {
+ int i;
+
+ for (i = 0; i < vbasedev->num_regions; i++) {
+ g_free(vbasedev->regions[i]);
+ }
+ g_free(vbasedev->regions);
+ vbasedev->regions = NULL;
+ }
+
if (!vbasedev->group) {
return;
}
@@ -2400,6 +2419,17 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
{
size_t argsz = sizeof(struct vfio_region_info);
+ /* create region cache */
+ if (vbasedev->regions == NULL) {
+ vbasedev->regions = g_new0(struct vfio_region_info *,
+ vbasedev->num_regions);
+ }
+ /* check cache */
+ if (vbasedev->regions[index] != NULL) {
+ *info = vbasedev->regions[index];
+ return 0;
+ }
+
*info = g_malloc0(argsz);
(*info)->index = index;
@@ -2419,6 +2449,9 @@ retry:
goto retry;
}
+ /* fill cache */
+ vbasedev->regions[index] = *info;
+
return 0;
}
@@ -2437,7 +2470,6 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
hdr = vfio_get_region_info_cap(*info, VFIO_REGION_INFO_CAP_TYPE);
if (!hdr) {
- g_free(*info);
continue;
}
@@ -2449,8 +2481,6 @@ int vfio_get_dev_region_info(VFIODevice *vbasedev, uint32_t type,
if (cap_type->type == type && cap_type->subtype == subtype) {
return 0;
}
-
- g_free(*info);
}
*info = NULL;
@@ -2466,7 +2496,6 @@ bool vfio_has_region_cap(VFIODevice *vbasedev, int region, uint16_t cap_type)
if (vfio_get_region_info_cap(info, cap_type)) {
ret = true;
}
- g_free(info);
}
return ret;
diff --git a/hw/vfio/igd.c b/hw/vfio/igd.c
index afe3fe7..22efa1a 100644
--- a/hw/vfio/igd.c
+++ b/hw/vfio/igd.c
@@ -425,7 +425,7 @@ void vfio_probe_igd_bar4_quirk(VFIOPCIDevice *vdev, int nr)
if ((ret || !rom->size) && !vdev->pdev.romfile) {
error_report("IGD device %s has no ROM, legacy mode disabled",
vdev->vbasedev.name);
- goto out;
+ return;
}
/*
@@ -436,7 +436,7 @@ void vfio_probe_igd_bar4_quirk(VFIOPCIDevice *vdev, int nr)
error_report("IGD device %s hotplugged, ROM disabled, "
"legacy mode disabled", vdev->vbasedev.name);
vdev->rom_read_failed = true;
- goto out;
+ return;
}
/*
@@ -449,7 +449,7 @@ void vfio_probe_igd_bar4_quirk(VFIOPCIDevice *vdev, int nr)
if (ret) {
error_report("IGD device %s does not support OpRegion access,"
"legacy mode disabled", vdev->vbasedev.name);
- goto out;
+ return;
}
ret = vfio_get_dev_region_info(&vdev->vbasedev,
@@ -458,7 +458,7 @@ void vfio_probe_igd_bar4_quirk(VFIOPCIDevice *vdev, int nr)
if (ret) {
error_report("IGD device %s does not support host bridge access,"
"legacy mode disabled", vdev->vbasedev.name);
- goto out;
+ return;
}
ret = vfio_get_dev_region_info(&vdev->vbasedev,
@@ -467,7 +467,7 @@ void vfio_probe_igd_bar4_quirk(VFIOPCIDevice *vdev, int nr)
if (ret) {
error_report("IGD device %s does not support LPC bridge access,"
"legacy mode disabled", vdev->vbasedev.name);
- goto out;
+ return;
}
gmch = vfio_pci_read_config(&vdev->pdev, IGD_GMCH, 4);
@@ -481,7 +481,7 @@ void vfio_probe_igd_bar4_quirk(VFIOPCIDevice *vdev, int nr)
error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
error_report("IGD device %s failed to enable VGA access, "
"legacy mode disabled", vdev->vbasedev.name);
- goto out;
+ return;
}
/* Create our LPC/ISA bridge */
@@ -489,7 +489,7 @@ void vfio_probe_igd_bar4_quirk(VFIOPCIDevice *vdev, int nr)
if (ret) {
error_report("IGD device %s failed to create LPC bridge, "
"legacy mode disabled", vdev->vbasedev.name);
- goto out;
+ return;
}
/* Stuff some host values into the VM PCI host bridge */
@@ -497,7 +497,7 @@ void vfio_probe_igd_bar4_quirk(VFIOPCIDevice *vdev, int nr)
if (ret) {
error_report("IGD device %s failed to modify host bridge, "
"legacy mode disabled", vdev->vbasedev.name);
- goto out;
+ return;
}
/* Setup OpRegion access */
@@ -505,7 +505,7 @@ void vfio_probe_igd_bar4_quirk(VFIOPCIDevice *vdev, int nr)
if (ret) {
error_append_hint(&err, "IGD legacy mode disabled\n");
error_reportf_err(err, VFIO_MSG_PREFIX, vdev->vbasedev.name);
- goto out;
+ return;
}
/* Setup our quirk to munge GTT addresses to the VM allocated buffer */
@@ -608,9 +608,4 @@ void vfio_probe_igd_bar4_quirk(VFIOPCIDevice *vdev, int nr)
trace_vfio_pci_igd_bdsm_enabled(vdev->vbasedev.name, ggms_mb + gms_mb);
-out:
- g_free(rom);
- g_free(opregion);
- g_free(host);
- g_free(lpc);
}
diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c
index ff6b45d..04bfb3a 100644
--- a/hw/vfio/migration.c
+++ b/hw/vfio/migration.c
@@ -876,13 +876,11 @@ int vfio_migration_probe(VFIODevice *vbasedev, Error **errp)
}
trace_vfio_migration_probe(vbasedev->name, info->index);
- g_free(info);
return 0;
add_blocker:
error_setg(&vbasedev->migration_blocker,
"VFIO device doesn't support migration");
- g_free(info);
ret = migrate_add_blocker(vbasedev->migration_blocker, errp);
if (ret < 0) {
diff --git a/hw/vfio/pci-quirks.c b/hw/vfio/pci-quirks.c
index 0cf69a8..5cfd93d 100644
--- a/hw/vfio/pci-quirks.c
+++ b/hw/vfio/pci-quirks.c
@@ -1601,16 +1601,14 @@ int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp)
hdr = vfio_get_region_info_cap(nv2reg, VFIO_REGION_INFO_CAP_NVLINK2_SSATGT);
if (!hdr) {
- ret = -ENODEV;
- goto free_exit;
+ return -ENODEV;
}
cap = (void *) hdr;
p = mmap(NULL, nv2reg->size, PROT_READ | PROT_WRITE,
MAP_SHARED, vdev->vbasedev.fd, nv2reg->offset);
if (p == MAP_FAILED) {
- ret = -errno;
- goto free_exit;
+ return -errno;
}
quirk = vfio_quirk_alloc(1);
@@ -1623,8 +1621,6 @@ int vfio_pci_nvidia_v100_ram_init(VFIOPCIDevice *vdev, Error **errp)
(void *) (uintptr_t) cap->tgt);
trace_vfio_pci_nvidia_gpu_setup_quirk(vdev->vbasedev.name, cap->tgt,
nv2reg->size);
-free_exit:
- g_free(nv2reg);
return ret;
}
@@ -1651,16 +1647,14 @@ int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp)
hdr = vfio_get_region_info_cap(atsdreg,
VFIO_REGION_INFO_CAP_NVLINK2_SSATGT);
if (!hdr) {
- ret = -ENODEV;
- goto free_exit;
+ return -ENODEV;
}
captgt = (void *) hdr;
hdr = vfio_get_region_info_cap(atsdreg,
VFIO_REGION_INFO_CAP_NVLINK2_LNKSPD);
if (!hdr) {
- ret = -ENODEV;
- goto free_exit;
+ return -ENODEV;
}
capspeed = (void *) hdr;
@@ -1669,8 +1663,7 @@ int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp)
p = mmap(NULL, atsdreg->size, PROT_READ | PROT_WRITE,
MAP_SHARED, vdev->vbasedev.fd, atsdreg->offset);
if (p == MAP_FAILED) {
- ret = -errno;
- goto free_exit;
+ return -errno;
}
quirk = vfio_quirk_alloc(1);
@@ -1690,8 +1683,6 @@ int vfio_pci_nvlink2_init(VFIOPCIDevice *vdev, Error **errp)
(void *) (uintptr_t) capspeed->link_speed);
trace_vfio_pci_nvlink2_setup_quirk_lnkspd(vdev->vbasedev.name,
capspeed->link_speed);
-free_exit:
- g_free(atsdreg);
return ret;
}
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 4ee5215..35b6551 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -793,8 +793,6 @@ static void vfio_pci_load_rom(VFIOPCIDevice *vdev)
vdev->rom_size = size = reg_info->size;
vdev->rom_offset = reg_info->offset;
- g_free(reg_info);
-
if (!vdev->rom_size) {
vdev->rom_read_failed = true;
error_report("vfio-pci: Cannot read device rom at "
@@ -2521,7 +2519,6 @@ int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
error_setg(errp, "unexpected VGA info, flags 0x%lx, size 0x%lx",
(unsigned long)reg_info->flags,
(unsigned long)reg_info->size);
- g_free(reg_info);
return -EINVAL;
}
@@ -2530,8 +2527,6 @@ int vfio_populate_vga(VFIOPCIDevice *vdev, Error **errp)
vdev->vga->fd_offset = reg_info->offset;
vdev->vga->fd = vdev->vbasedev.fd;
- g_free(reg_info);
-
vdev->vga->region[QEMU_PCI_VGA_MEM].offset = QEMU_PCI_VGA_MEM_BASE;
vdev->vga->region[QEMU_PCI_VGA_MEM].nr = QEMU_PCI_VGA_MEM;
QLIST_INIT(&vdev->vga->region[QEMU_PCI_VGA_MEM].quirks);
@@ -2626,8 +2621,6 @@ static void vfio_populate_device(VFIOPCIDevice *vdev, Error **errp)
}
vdev->config_offset = reg_info->offset;
- g_free(reg_info);
-
if (vdev->features & VFIO_FEATURE_ENABLE_VGA) {
ret = vfio_populate_vga(vdev, errp);
if (ret) {
@@ -3035,7 +3028,6 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
}
ret = vfio_pci_igd_opregion_init(vdev, opregion, errp);
- g_free(opregion);
if (ret) {
goto out_teardown;
}
--
1.8.3.1
* [RFC v5 05/23] vfio-user: add device IO ops vector
From: John Johnson @ 2022-05-05 17:19 UTC (permalink / raw)
To: qemu-devel
Add an ops vector for device I/O, used for communication with the
VFIO driver (prep work for vfio-user, which will communicate over a
socket).
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
---
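Note: the region read/write paths now distinguish a failed access from a
short one, based on a return value that is either a byte count or a
negative errno. The following is a rough sketch of that convention,
assuming a pread()-based backend on an ordinary file descriptor. DemoDev,
demo_region_read() and demo_read_status() are hypothetical names and are
not part of this series.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for a device whose region lives at an fd offset. */
typedef struct DemoDev {
    int fd;
    off_t region_offset;
} DemoDev;

/* Returns the number of bytes read, or a negative errno value on failure. */
static int demo_region_read(DemoDev *dev, off_t off, uint32_t size,
                            void *data)
{
    ssize_t ret = pread(dev->fd, data, size, dev->region_offset + off);

    return ret < 0 ? -errno : (int)ret;
}

/* Mirrors the "ret < 0 ? strerror(-ret) : short read" test in the patch. */
static const char *demo_read_status(int ret, uint32_t size)
{
    if (ret == (int)size) {
        return "ok";
    }
    return ret < 0 ? strerror(-ret) : "short read";
}
```

A caller compares the result against the requested size, as
vfio_region_read() does after this patch, and only then decides whether
to format the error with strerror() or to report a short transfer.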
include/hw/vfio/vfio-common.h | 27 ++++++++
hw/vfio/common.c | 107 +++++++++++++++++++++++++++-----
hw/vfio/pci.c | 140 ++++++++++++++++++++++++++----------------
3 files changed, 206 insertions(+), 68 deletions(-)
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 1a032f4..826cd98 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -124,6 +124,7 @@ typedef struct VFIOHostDMAWindow {
} VFIOHostDMAWindow;
typedef struct VFIODeviceOps VFIODeviceOps;
+typedef struct VFIODevIO VFIODevIO;
typedef struct VFIODevice {
QLIST_ENTRY(VFIODevice) next;
@@ -139,6 +140,7 @@ typedef struct VFIODevice {
bool ram_block_discard_allowed;
bool enable_migration;
VFIODeviceOps *ops;
+ VFIODevIO *io_ops;
unsigned int num_irqs;
unsigned int num_regions;
unsigned int flags;
@@ -165,6 +167,30 @@ struct VFIODeviceOps {
* through ioctl() to the kernel VFIO driver, but vfio-user
* can use a socket to a remote process.
*/
+struct VFIODevIO {
+ int (*get_info)(VFIODevice *vdev, struct vfio_device_info *info);
+ int (*get_region_info)(VFIODevice *vdev,
+ struct vfio_region_info *info);
+ int (*get_irq_info)(VFIODevice *vdev, struct vfio_irq_info *irq);
+ int (*set_irqs)(VFIODevice *vdev, struct vfio_irq_set *irqs);
+ int (*region_read)(VFIODevice *vdev, uint8_t nr, off_t off, uint32_t size,
+ void *data);
+ int (*region_write)(VFIODevice *vdev, uint8_t nr, off_t off, uint32_t size,
+ void *data);
+};
+
+#define VDEV_GET_INFO(vdev, info) \
+ ((vdev)->io_ops->get_info((vdev), (info)))
+#define VDEV_GET_REGION_INFO(vdev, info) \
+ ((vdev)->io_ops->get_region_info((vdev), (info)))
+#define VDEV_GET_IRQ_INFO(vdev, irq) \
+ ((vdev)->io_ops->get_irq_info((vdev), (irq)))
+#define VDEV_SET_IRQS(vdev, irqs) \
+ ((vdev)->io_ops->set_irqs((vdev), (irqs)))
+#define VDEV_REGION_READ(vdev, nr, off, size, data) \
+ ((vdev)->io_ops->region_read((vdev), (nr), (off), (size), (data)))
+#define VDEV_REGION_WRITE(vdev, nr, off, size, data) \
+ ((vdev)->io_ops->region_write((vdev), (nr), (off), (size), (data)))
struct VFIOContIO {
int (*dma_map)(VFIOContainer *container,
@@ -184,6 +210,7 @@ struct VFIOContIO {
#define CONT_DIRTY_BITMAP(cont, bitmap, range) \
((cont)->io_ops->dirty_bitmap((cont), (bitmap), (range)))
+extern VFIODevIO vfio_dev_io_ioctl;
extern VFIOContIO vfio_cont_io_ioctl;
#endif /* CONFIG_LINUX */
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index d9290f3..0616169 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -70,7 +70,7 @@ void vfio_disable_irqindex(VFIODevice *vbasedev, int index)
.count = 0,
};
- ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+ VDEV_SET_IRQS(vbasedev, &irq_set);
}
void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index)
@@ -83,7 +83,7 @@ void vfio_unmask_single_irqindex(VFIODevice *vbasedev, int index)
.count = 1,
};
- ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+ VDEV_SET_IRQS(vbasedev, &irq_set);
}
void vfio_mask_single_irqindex(VFIODevice *vbasedev, int index)
@@ -96,7 +96,7 @@ void vfio_mask_single_irqindex(VFIODevice *vbasedev, int index)
.count = 1,
};
- ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, &irq_set);
+ VDEV_SET_IRQS(vbasedev, &irq_set);
}
static inline const char *action_to_str(int action)
@@ -177,9 +177,7 @@ int vfio_set_irq_signaling(VFIODevice *vbasedev, int index, int subindex,
pfd = (int32_t *)&irq_set->data;
*pfd = fd;
- if (ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, irq_set)) {
- ret = -errno;
- }
+ ret = VDEV_SET_IRQS(vbasedev, irq_set);
g_free(irq_set);
if (!ret) {
@@ -214,6 +212,7 @@ void vfio_region_write(void *opaque, hwaddr addr,
uint32_t dword;
uint64_t qword;
} buf;
+ int ret;
switch (size) {
case 1:
@@ -233,13 +232,15 @@ void vfio_region_write(void *opaque, hwaddr addr,
break;
}
- if (pwrite(vbasedev->fd, &buf, size, region->fd_offset + addr) != size) {
+ ret = VDEV_REGION_WRITE(vbasedev, region->nr, addr, size, &buf);
+ if (ret != size) {
+ const char *err = ret < 0 ? strerror(-ret) : "short write";
+
error_report("%s(%s:region%d+0x%"HWADDR_PRIx", 0x%"PRIx64
- ",%d) failed: %m",
+ ",%d) failed: %s",
__func__, vbasedev->name, region->nr,
- addr, data, size);
+ addr, data, size, err);
}
-
trace_vfio_region_write(vbasedev->name, region->nr, addr, data, size);
/*
@@ -265,13 +266,18 @@ uint64_t vfio_region_read(void *opaque,
uint64_t qword;
} buf;
uint64_t data = 0;
+ int ret;
+
+ ret = VDEV_REGION_READ(vbasedev, region->nr, addr, size, &buf);
+ if (ret != size) {
+ const char *err = ret < 0 ? strerror(-ret) : "short read";
- if (pread(vbasedev->fd, &buf, size, region->fd_offset + addr) != size) {
- error_report("%s(%s:region%d+0x%"HWADDR_PRIx", %d) failed: %m",
+ error_report("%s(%s:region%d+0x%"HWADDR_PRIx", %d) failed: %s",
__func__, vbasedev->name, region->nr,
- addr, size);
+ addr, size, err);
return (uint64_t)-1;
}
+
switch (size) {
case 1:
data = buf.byte;
@@ -2418,6 +2424,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
struct vfio_region_info **info)
{
size_t argsz = sizeof(struct vfio_region_info);
+ int ret;
/* create region cache */
if (vbasedev->regions == NULL) {
@@ -2436,10 +2443,11 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
retry:
(*info)->argsz = argsz;
- if (ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, *info)) {
+ ret = VDEV_GET_REGION_INFO(vbasedev, *info);
+ if (ret != 0) {
g_free(*info);
*info = NULL;
- return -errno;
+ return ret;
}
if ((*info)->argsz > argsz) {
@@ -2600,6 +2608,75 @@ int vfio_eeh_as_op(AddressSpace *as, uint32_t op)
* Traditional ioctl() based io_ops
*/
+static int vfio_io_get_info(VFIODevice *vbasedev, struct vfio_device_info *info)
+{
+ int ret;
+
+ ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_INFO, info);
+
+ return ret < 0 ? -errno : ret;
+}
+
+static int vfio_io_get_region_info(VFIODevice *vbasedev,
+ struct vfio_region_info *info)
+{
+ int ret;
+
+ ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, info);
+
+ return ret < 0 ? -errno : ret;
+}
+
+static int vfio_io_get_irq_info(VFIODevice *vbasedev,
+ struct vfio_irq_info *info)
+{
+ int ret;
+
+ ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_IRQ_INFO, info);
+
+ return ret < 0 ? -errno : ret;
+}
+
+static int vfio_io_set_irqs(VFIODevice *vbasedev, struct vfio_irq_set *irqs)
+{
+ int ret;
+
+ ret = ioctl(vbasedev->fd, VFIO_DEVICE_SET_IRQS, irqs);
+
+ return ret < 0 ? -errno : ret;
+}
+
+static int vfio_io_region_read(VFIODevice *vbasedev, uint8_t index, off_t off,
+ uint32_t size, void *data)
+{
+ struct vfio_region_info *info = vbasedev->regions[index];
+ int ret;
+
+ ret = pread(vbasedev->fd, data, size, info->offset + off);
+
+ return ret < 0 ? -errno : ret;
+}
+
+static int vfio_io_region_write(VFIODevice *vbasedev, uint8_t index, off_t off,
+ uint32_t size, void *data)
+{
+ struct vfio_region_info *info = vbasedev->regions[index];
+ int ret;
+
+ ret = pwrite(vbasedev->fd, data, size, info->offset + off);
+
+ return ret < 0 ? -errno : ret;
+}
+
+VFIODevIO vfio_dev_io_ioctl = {
+ .get_info = vfio_io_get_info,
+ .get_region_info = vfio_io_get_region_info,
+ .get_irq_info = vfio_io_get_irq_info,
+ .set_irqs = vfio_io_set_irqs,
+ .region_read = vfio_io_region_read,
+ .region_write = vfio_io_region_write,
+};
+
static int vfio_io_dma_map(VFIOContainer *container,
struct vfio_iommu_type1_dma_map *map)
{
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 35b6551..4524342 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -43,6 +43,14 @@
#include "migration/blocker.h"
#include "migration/qemu-file.h"
+/* convenience macros for PCI config space */
+#define VDEV_CONFIG_READ(vbasedev, off, size, data) \
+ VDEV_REGION_READ((vbasedev), VFIO_PCI_CONFIG_REGION_INDEX, (off), \
+ (size), (data))
+#define VDEV_CONFIG_WRITE(vbasedev, off, size, data) \
+ VDEV_REGION_WRITE((vbasedev), VFIO_PCI_CONFIG_REGION_INDEX, (off), \
+ (size), (data))
+
#define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
static void vfio_disable_interrupts(VFIOPCIDevice *vdev);
@@ -402,7 +410,7 @@ static int vfio_enable_vectors(VFIOPCIDevice *vdev, bool msix)
fds[i] = fd;
}
- ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_SET_IRQS, irq_set);
+ ret = VDEV_SET_IRQS(&vdev->vbasedev, irq_set);
g_free(irq_set);
@@ -775,14 +783,16 @@ static void vfio_update_msi(VFIOPCIDevice *vdev)
static void vfio_pci_load_rom(VFIOPCIDevice *vdev)
{
+ VFIODevice *vbasedev = &vdev->vbasedev;
struct vfio_region_info *reg_info;
uint64_t size;
off_t off = 0;
ssize_t bytes;
+ int ret;
- if (vfio_get_region_info(&vdev->vbasedev,
- VFIO_PCI_ROM_REGION_INDEX, &reg_info)) {
- error_report("vfio: Error getting ROM info: %m");
+ ret = vfio_get_region_info(vbasedev, VFIO_PCI_ROM_REGION_INDEX, &reg_info);
+ if (ret < 0) {
+ error_report("vfio: Error getting ROM info: %s", strerror(-ret));
return;
}
@@ -807,18 +817,19 @@ static void vfio_pci_load_rom(VFIOPCIDevice *vdev)
memset(vdev->rom, 0xff, size);
while (size) {
- bytes = pread(vdev->vbasedev.fd, vdev->rom + off,
- size, vdev->rom_offset + off);
+ bytes = VDEV_REGION_READ(vbasedev, VFIO_PCI_ROM_REGION_INDEX, off,
+ size, vdev->rom + off);
if (bytes == 0) {
break;
} else if (bytes > 0) {
off += bytes;
size -= bytes;
} else {
- if (errno == EINTR || errno == EAGAIN) {
+ if (bytes == -EINTR || bytes == -EAGAIN) {
continue;
}
- error_report("vfio: Error reading device ROM: %m");
+ error_report("vfio: Error reading device ROM: %s",
+ strerror(-bytes));
break;
}
}
@@ -906,11 +917,10 @@ static const MemoryRegionOps vfio_rom_ops = {
static void vfio_pci_size_rom(VFIOPCIDevice *vdev)
{
+ VFIODevice *vbasedev = &vdev->vbasedev;
uint32_t orig, size = cpu_to_le32((uint32_t)PCI_ROM_ADDRESS_MASK);
- off_t offset = vdev->config_offset + PCI_ROM_ADDRESS;
DeviceState *dev = DEVICE(vdev);
char *name;
- int fd = vdev->vbasedev.fd;
if (vdev->pdev.romfile || !vdev->pdev.rom_bar) {
/* Since pci handles romfile, just print a message and return */
@@ -927,11 +937,12 @@ static void vfio_pci_size_rom(VFIOPCIDevice *vdev)
* Use the same size ROM BAR as the physical device. The contents
* will get filled in later when the guest tries to read it.
*/
- if (pread(fd, &orig, 4, offset) != 4 ||
- pwrite(fd, &size, 4, offset) != 4 ||
- pread(fd, &size, 4, offset) != 4 ||
- pwrite(fd, &orig, 4, offset) != 4) {
- error_report("%s(%s) failed: %m", __func__, vdev->vbasedev.name);
+ if (VDEV_CONFIG_READ(vbasedev, PCI_ROM_ADDRESS, 4, &orig) != 4 ||
+ VDEV_CONFIG_WRITE(vbasedev, PCI_ROM_ADDRESS, 4, &size) != 4 ||
+ VDEV_CONFIG_READ(vbasedev, PCI_ROM_ADDRESS, 4, &size) != 4 ||
+ VDEV_CONFIG_WRITE(vbasedev, PCI_ROM_ADDRESS, 4, &orig) != 4) {
+
+ error_report("%s(%s) ROM access failed", __func__, vbasedev->name);
return;
}
@@ -1111,6 +1122,7 @@ static void vfio_sub_page_bar_update_mapping(PCIDevice *pdev, int bar)
uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
{
VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
+ VFIODevice *vbasedev = &vdev->vbasedev;
uint32_t emu_bits = 0, emu_val = 0, phys_val = 0, val;
memcpy(&emu_bits, vdev->emulated_config_bits + addr, len);
@@ -1123,12 +1135,13 @@ uint32_t vfio_pci_read_config(PCIDevice *pdev, uint32_t addr, int len)
if (~emu_bits & (0xffffffffU >> (32 - len * 8))) {
ssize_t ret;
- ret = pread(vdev->vbasedev.fd, &phys_val, len,
- vdev->config_offset + addr);
+ ret = VDEV_CONFIG_READ(vbasedev, addr, len, &phys_val);
if (ret != len) {
- error_report("%s(%s, 0x%x, 0x%x) failed: %m",
- __func__, vdev->vbasedev.name, addr, len);
- return -errno;
+ const char *err = ret < 0 ? strerror(-ret) : "short read";
+
+ error_report("%s(%s, 0x%x, 0x%x) failed: %s",
+ __func__, vbasedev->name, addr, len, err);
+ return -1;
}
phys_val = le32_to_cpu(phys_val);
}
@@ -1144,15 +1157,19 @@ void vfio_pci_write_config(PCIDevice *pdev,
uint32_t addr, uint32_t val, int len)
{
VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
+ VFIODevice *vbasedev = &vdev->vbasedev;
uint32_t val_le = cpu_to_le32(val);
+ int ret;
trace_vfio_pci_write_config(vdev->vbasedev.name, addr, val, len);
/* Write everything to VFIO, let it filter out what we can't write */
- if (pwrite(vdev->vbasedev.fd, &val_le, len, vdev->config_offset + addr)
- != len) {
- error_report("%s(%s, 0x%x, 0x%x, 0x%x) failed: %m",
- __func__, vdev->vbasedev.name, addr, val, len);
+ ret = VDEV_CONFIG_WRITE(vbasedev, addr, len, &val_le);
+ if (ret != len) {
+ const char *err = ret < 0 ? strerror(-ret) : "short write";
+
+ error_report("%s(%s, 0x%x, 0x%x, 0x%x) failed: %s",
+ __func__, vbasedev->name, addr, val, len, err);
}
/* MSI/MSI-X Enabling/Disabling */
@@ -1240,10 +1257,13 @@ static int vfio_msi_setup(VFIOPCIDevice *vdev, int pos, Error **errp)
int ret, entries;
Error *err = NULL;
- if (pread(vdev->vbasedev.fd, &ctrl, sizeof(ctrl),
- vdev->config_offset + pos + PCI_CAP_FLAGS) != sizeof(ctrl)) {
- error_setg_errno(errp, errno, "failed reading MSI PCI_CAP_FLAGS");
- return -errno;
+ ret = VDEV_CONFIG_READ(&vdev->vbasedev, pos + PCI_CAP_FLAGS,
+ sizeof(ctrl), &ctrl);
+ if (ret != sizeof(ctrl)) {
+ const char *errstr = ret < 0 ? strerror(-ret) : "short read";
+
+ error_setg(errp, "failed reading MSI PCI_CAP_FLAGS: %s", errstr);
+ return ret < 0 ? ret : -EIO;
}
ctrl = le16_to_cpu(ctrl);
@@ -1445,33 +1465,39 @@ static void vfio_pci_relocate_msix(VFIOPCIDevice *vdev, Error **errp)
*/
static void vfio_msix_early_setup(VFIOPCIDevice *vdev, Error **errp)
{
+ VFIODevice *vbasedev = &vdev->vbasedev;
uint8_t pos;
uint16_t ctrl;
uint32_t table, pba;
- int fd = vdev->vbasedev.fd;
VFIOMSIXInfo *msix;
+ int ret;
pos = pci_find_capability(&vdev->pdev, PCI_CAP_ID_MSIX);
if (!pos) {
return;
}
- if (pread(fd, &ctrl, sizeof(ctrl),
- vdev->config_offset + pos + PCI_MSIX_FLAGS) != sizeof(ctrl)) {
- error_setg_errno(errp, errno, "failed to read PCI MSIX FLAGS");
- return;
+ ret = VDEV_CONFIG_READ(vbasedev, pos + PCI_MSIX_FLAGS,
+ sizeof(ctrl), &ctrl);
+ if (ret != sizeof(ctrl)) {
+ const char *err = ret < 0 ? strerror(-ret) : "short read";
+
+ error_setg(errp, "failed to read PCI MSIX FLAGS: %s", err);
+ return;
}
- if (pread(fd, &table, sizeof(table),
- vdev->config_offset + pos + PCI_MSIX_TABLE) != sizeof(table)) {
- error_setg_errno(errp, errno, "failed to read PCI MSIX TABLE");
- return;
+ ret = VDEV_CONFIG_READ(vbasedev, pos + PCI_MSIX_TABLE,
+ sizeof(table), &table);
+ if (ret != sizeof(table)) {
+ const char *err = ret < 0 ? strerror(-ret) : "short read";
+
+ error_setg(errp, "failed to read PCI MSIX TABLE: %s", err);
+ return;
}
- if (pread(fd, &pba, sizeof(pba),
- vdev->config_offset + pos + PCI_MSIX_PBA) != sizeof(pba)) {
- error_setg_errno(errp, errno, "failed to read PCI MSIX PBA");
- return;
+ ret = VDEV_CONFIG_READ(vbasedev, pos + PCI_MSIX_PBA, sizeof(pba), &pba);
+ if (ret != sizeof(pba)) {
+ const char *err = ret < 0 ? strerror(-ret) : "short read";
+
+ error_setg(errp, "failed to read PCI MSIX PBA: %s", err);
+ return;
}
ctrl = le16_to_cpu(ctrl);
@@ -1609,7 +1635,6 @@ static void vfio_mmap_set_enabled(VFIOPCIDevice *vdev, bool enabled)
static void vfio_bar_prepare(VFIOPCIDevice *vdev, int nr)
{
VFIOBAR *bar = &vdev->bars[nr];
-
uint32_t pci_bar;
int ret;
@@ -1619,10 +1644,12 @@ static void vfio_bar_prepare(VFIOPCIDevice *vdev, int nr)
}
/* Determine what type of BAR this is for registration */
- ret = pread(vdev->vbasedev.fd, &pci_bar, sizeof(pci_bar),
- vdev->config_offset + PCI_BASE_ADDRESS_0 + (4 * nr));
+ ret = VDEV_CONFIG_READ(&vdev->vbasedev, PCI_BASE_ADDRESS_0 + (4 * nr),
+ sizeof(pci_bar), &pci_bar);
if (ret != sizeof(pci_bar)) {
- error_report("vfio: Failed to read BAR %d (%m)", nr);
+ const char *err = ret < 0 ? strerror(-ret) : "short read";
+
+ error_report("vfio: Failed to read BAR %d (%s)", nr, err);
return;
}
@@ -2170,8 +2197,9 @@ static void vfio_pci_pre_reset(VFIOPCIDevice *vdev)
static void vfio_pci_post_reset(VFIOPCIDevice *vdev)
{
+ VFIODevice *vbasedev = &vdev->vbasedev;
Error *err = NULL;
- int nr;
+ int ret, nr;
vfio_intx_enable(vdev, &err);
if (err) {
@@ -2179,13 +2207,16 @@ static void vfio_pci_post_reset(VFIOPCIDevice *vdev)
}
for (nr = 0; nr < PCI_NUM_REGIONS - 1; ++nr) {
- off_t addr = vdev->config_offset + PCI_BASE_ADDRESS_0 + (4 * nr);
+ off_t addr = PCI_BASE_ADDRESS_0 + (4 * nr);
uint32_t val = 0;
uint32_t len = sizeof(val);
- if (pwrite(vdev->vbasedev.fd, &val, len, addr) != len) {
- error_report("%s(%s) reset bar %d failed: %m", __func__,
- vdev->vbasedev.name, nr);
+ ret = VDEV_CONFIG_WRITE(vbasedev, addr, len, &val);
+ if (ret != len) {
+ const char *err = ret < 0 ? strerror(-ret) : "short write";
+
+ error_report("%s(%s) reset bar %d failed: %s", __func__,
+ vbasedev->name, nr, err);
}
}
@@ -2632,7 +2663,7 @@ static void vfio_populate_device(VFIOPCIDevice *vdev, Error **errp)
irq_info.index = VFIO_PCI_ERR_IRQ_INDEX;
- ret = ioctl(vdev->vbasedev.fd, VFIO_DEVICE_GET_IRQ_INFO, &irq_info);
+ ret = VDEV_GET_IRQ_INFO(vbasedev, &irq_info);
if (ret) {
/* This can fail for an old kernel or legacy PCI dev */
trace_vfio_populate_device_get_irq_info_failure(strerror(errno));
@@ -2751,8 +2782,10 @@ static void vfio_register_req_notifier(VFIOPCIDevice *vdev)
return;
}
- if (ioctl(vdev->vbasedev.fd,
- VFIO_DEVICE_GET_IRQ_INFO, &irq_info) < 0 || irq_info.count < 1) {
+ if (VDEV_GET_IRQ_INFO(&vdev->vbasedev, &irq_info) < 0) {
+ return;
+ }
+ if (irq_info.count < 1) {
return;
}
@@ -2830,6 +2863,7 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
vdev->vbasedev.ops = &vfio_pci_ops;
vdev->vbasedev.type = VFIO_DEVICE_TYPE_PCI;
vdev->vbasedev.dev = DEVICE(vdev);
+ vdev->vbasedev.io_ops = &vfio_dev_io_ioctl;
tmp = g_strdup_printf("%s/iommu_group", vdev->vbasedev.sysfsdev);
len = readlink(tmp, group_path, sizeof(group_path));
--
1.8.3.1
* [RFC v5 06/23] vfio-user: Define type vfio_user_pci_dev_info
From: John Johnson @ 2022-05-05 17:19 UTC (permalink / raw)
To: qemu-devel
Add a new QOM class for vfio-user, with its class and instance
constructors and destructors, and its PCI ops.
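
The new ops table for vfio-user leaves the hot reset callback unset, which
is why the reset handler gains a NULL check in this patch. A minimal
standalone sketch of that pattern follows; DemoDevice, DemoOps and
demo_reset_all() are hypothetical stand-ins for illustration and are not
QEMU code:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for VFIODevice and VFIODeviceOps. */
typedef struct DemoDevice DemoDevice;

typedef struct {
    void (*compute_needs_reset)(DemoDevice *dev);
    int (*hot_reset_multi)(DemoDevice *dev);    /* optional, may be NULL */
} DemoOps;

struct DemoDevice {
    const DemoOps *ops;
    bool needs_reset;
    int resets_done;
};

/* Kernel-backed model: always asks for a reset and can perform one. */
static void kern_compute_needs_reset(DemoDevice *dev)
{
    dev->needs_reset = true;
}

static int kern_hot_reset_multi(DemoDevice *dev)
{
    dev->resets_done++;
    dev->needs_reset = false;
    return 0;
}

/* Emulated model: never asks for a host hot reset. */
static void user_compute_needs_reset(DemoDevice *dev)
{
    dev->needs_reset = false;
}

static const DemoOps demo_kern_ops = {
    .compute_needs_reset = kern_compute_needs_reset,
    .hot_reset_multi = kern_hot_reset_multi,
};

static const DemoOps demo_user_ops = {
    .compute_needs_reset = user_compute_needs_reset,
    /* .hot_reset_multi intentionally left NULL */
};

/*
 * First ask every device whether it needs a reset, then reset those
 * that do. Returns the number of devices skipped for lack of a handler.
 */
static int demo_reset_all(DemoDevice *devs, size_t n)
{
    int skipped = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        assert(devs[i].ops != NULL);
        devs[i].ops->compute_needs_reset(&devs[i]);
    }
    for (i = 0; i < n; i++) {
        if (!devs[i].needs_reset) {
            continue;
        }
        if (devs[i].ops->hot_reset_multi == NULL) {
            skipped++;          /* guard: no handler, do not call NULL */
            continue;
        }
        devs[i].ops->hot_reset_multi(&devs[i]);
    }
    return skipped;
}
```

With demo_user_ops the guard is never reached, because needs_reset is
always cleared; it only matters if a future ops table requests a reset
without supplying a handler.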
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
hw/vfio/pci.h | 8 +++++
hw/vfio/common.c | 5 ++++
hw/vfio/pci.c | 90 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
hw/vfio/Kconfig | 10 +++++++
4 files changed, 113 insertions(+)
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index bbc78aa..59e636c 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -187,6 +187,14 @@ struct VFIOKernPCIDevice {
VFIOPCIDevice device;
};
+#define TYPE_VFIO_USER_PCI "vfio-user-pci"
+OBJECT_DECLARE_SIMPLE_TYPE(VFIOUserPCIDevice, VFIO_USER_PCI)
+
+struct VFIOUserPCIDevice {
+ VFIOPCIDevice device;
+ char *sock_name;
+};
+
/* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot match hw */
static inline bool vfio_pci_is(VFIOPCIDevice *vdev, uint32_t vendor, uint32_t device)
{
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 0616169..da18fd5 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -1742,6 +1742,11 @@ void vfio_reset_handler(void *opaque)
QLIST_FOREACH(group, &vfio_group_list, next) {
QLIST_FOREACH(vbasedev, &group->device_list, next) {
if (vbasedev->dev->realized && vbasedev->needs_reset) {
+ if (vbasedev->ops->vfio_hot_reset_multi == NULL) {
+ error_printf("%s: No hot reset handler specified\n",
+ vbasedev->name);
+ continue;
+ }
vbasedev->ops->vfio_hot_reset_multi(vbasedev);
}
}
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 4524342..be8fe1d 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -19,6 +19,7 @@
*/
#include "qemu/osdep.h"
+#include CONFIG_DEVICES
#include <linux/vfio.h>
#include <sys/ioctl.h>
@@ -3377,3 +3378,92 @@ static void register_vfio_pci_dev_type(void)
}
type_init(register_vfio_pci_dev_type)
+
+
+#ifdef CONFIG_VFIO_USER_PCI
+
+/*
+ * vfio-user routines.
+ */
+
+/*
+ * Emulated devices don't use host hot reset
+ */
+static void vfio_user_compute_needs_reset(VFIODevice *vbasedev)
+{
+ vbasedev->needs_reset = false;
+}
+
+static VFIODeviceOps vfio_user_pci_ops = {
+ .vfio_compute_needs_reset = vfio_user_compute_needs_reset,
+ .vfio_eoi = vfio_intx_eoi,
+ .vfio_get_object = vfio_pci_get_object,
+ .vfio_save_config = vfio_pci_save_config,
+ .vfio_load_config = vfio_pci_load_config,
+};
+
+static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
+{
+ ERRP_GUARD();
+ VFIOUserPCIDevice *udev = VFIO_USER_PCI(pdev);
+ VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
+ VFIODevice *vbasedev = &vdev->vbasedev;
+
+ /*
+ * TODO: make option parser understand SocketAddress
+ * and use that instead of having scalar options
+ * for each socket type.
+ */
+ if (!udev->sock_name) {
+ error_setg(errp, "No socket specified");
+ error_append_hint(errp, "Use -device vfio-user-pci,socket=<name>\n");
+ return;
+ }
+
+ vbasedev->name = g_strdup_printf("VFIO user <%s>", udev->sock_name);
+ vbasedev->dev = DEVICE(vdev);
+ vbasedev->fd = -1;
+ vbasedev->type = VFIO_DEVICE_TYPE_PCI;
+ vbasedev->ops = &vfio_user_pci_ops;
+
+}
+
+static void vfio_user_instance_finalize(Object *obj)
+{
+ VFIOPCIDevice *vdev = VFIO_PCI_BASE(obj);
+
+ vfio_put_device(vdev);
+}
+
+static Property vfio_user_pci_dev_properties[] = {
+ DEFINE_PROP_STRING("socket", VFIOUserPCIDevice, sock_name),
+ DEFINE_PROP_END_OF_LIST(),
+};
+
+static void vfio_user_pci_dev_class_init(ObjectClass *klass, void *data)
+{
+ DeviceClass *dc = DEVICE_CLASS(klass);
+ PCIDeviceClass *pdc = PCI_DEVICE_CLASS(klass);
+
+ device_class_set_props(dc, vfio_user_pci_dev_properties);
+ dc->desc = "VFIO over socket PCI device assignment";
+ pdc->realize = vfio_user_pci_realize;
+}
+
+static const TypeInfo vfio_user_pci_dev_info = {
+ .name = TYPE_VFIO_USER_PCI,
+ .parent = TYPE_VFIO_PCI_BASE,
+ .instance_size = sizeof(VFIOUserPCIDevice),
+ .class_init = vfio_user_pci_dev_class_init,
+ .instance_init = vfio_instance_init,
+ .instance_finalize = vfio_user_instance_finalize,
+};
+
+static void register_vfio_user_dev_type(void)
+{
+ type_register_static(&vfio_user_pci_dev_info);
+}
+
+type_init(register_vfio_user_dev_type)
+
+#endif /* CONFIG_VFIO_USER_PCI */
diff --git a/hw/vfio/Kconfig b/hw/vfio/Kconfig
index 7cdba05..301894e 100644
--- a/hw/vfio/Kconfig
+++ b/hw/vfio/Kconfig
@@ -2,6 +2,10 @@ config VFIO
bool
depends on LINUX
+config VFIO_USER
+ bool
+ depends on VFIO
+
config VFIO_PCI
bool
default y
@@ -9,6 +13,12 @@ config VFIO_PCI
select EDID
depends on LINUX && PCI
+config VFIO_USER_PCI
+ bool
+ default y
+ select VFIO_USER
+ depends on VFIO_PCI
+
config VFIO_CCW
bool
default y
--
1.8.3.1
* [RFC v5 07/23] vfio-user: connect vfio proxy to remote server
From: John Johnson @ 2022-05-05 17:19 UTC (permalink / raw)
To: qemu-devel
Add user.c and user.h files for vfio-user code.
Add a proxy struct to handle communication with the remote server.
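
For readers unfamiliar with the transport, the connection the proxy wraps
is an ordinary UNIX domain stream socket. The patch itself goes through
QEMU's QIOChannelSocket layer; the sketch below swaps in plain POSIX calls
to show the same step in isolation. demo_unix_addr(), demo_listen_unix()
and demo_connect_unix() are hypothetical helper names, and the
"fd or -errno" return convention is an assumption borrowed from the io_ops
wrappers earlier in the series:

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Fill in a sockaddr_un for 'path'; returns 0 or -ENAMETOOLONG. */
static int demo_unix_addr(struct sockaddr_un *sa, const char *path)
{
    size_t len = strlen(path);

    if (len >= sizeof(sa->sun_path)) {
        return -ENAMETOOLONG;
    }
    memset(sa, 0, sizeof(*sa));
    sa->sun_family = AF_UNIX;
    memcpy(sa->sun_path, path, len + 1);
    return 0;
}

/* Server side stand-in: listen on 'path'; returns fd or -errno. */
static int demo_listen_unix(const char *path)
{
    struct sockaddr_un sa;
    int fd, ret;

    ret = demo_unix_addr(&sa, path);
    if (ret < 0) {
        return ret;
    }
    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) {
        return -errno;
    }
    unlink(path);               /* remove a stale socket file, if any */
    if (bind(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0 ||
        listen(fd, 1) < 0) {
        ret = -errno;           /* save errno before close() can change it */
        close(fd);
        return ret;
    }
    return fd;
}

/* Client side: connect to 'path'; returns fd or -errno. */
static int demo_connect_unix(const char *path)
{
    struct sockaddr_un sa;
    int fd, ret;

    ret = demo_unix_addr(&sa, path);
    if (ret < 0) {
        return ret;
    }
    fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) {
        return -errno;
    }
    if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
        ret = -errno;
        close(fd);
        return ret;
    }
    return fd;
}
```

The client helper corresponds roughly to the
qio_channel_socket_connect_sync() call in vfio_user_connect_dev(); the
listener exists only so the sketch can be exercised without a real
vfio-user server.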
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
hw/vfio/user.h | 78 +++++++++++++++++++
include/hw/vfio/vfio-common.h | 2 +
hw/vfio/pci.c | 19 +++++
hw/vfio/user.c | 170 ++++++++++++++++++++++++++++++++++++++++++
MAINTAINERS | 4 +
hw/vfio/meson.build | 1 +
6 files changed, 274 insertions(+)
create mode 100644 hw/vfio/user.h
create mode 100644 hw/vfio/user.c
diff --git a/hw/vfio/user.h b/hw/vfio/user.h
new file mode 100644
index 0000000..da92862
--- /dev/null
+++ b/hw/vfio/user.h
@@ -0,0 +1,78 @@
+#ifndef VFIO_USER_H
+#define VFIO_USER_H
+
+/*
+ * vfio protocol over a UNIX socket.
+ *
+ * Copyright © 2018, 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ *
+ */
+
+typedef struct {
+ int send_fds;
+ int recv_fds;
+ int *fds;
+} VFIOUserFDs;
+
+enum msg_type {
+ VFIO_MSG_NONE,
+ VFIO_MSG_ASYNC,
+ VFIO_MSG_WAIT,
+ VFIO_MSG_NOWAIT,
+ VFIO_MSG_REQ,
+};
+
+typedef struct VFIOUserMsg {
+ QTAILQ_ENTRY(VFIOUserMsg) next;
+ VFIOUserFDs *fds;
+ uint32_t rsize;
+ uint32_t id;
+ QemuCond cv;
+ bool complete;
+ enum msg_type type;
+} VFIOUserMsg;
+
+
+enum proxy_state {
+ VFIO_PROXY_CONNECTED = 1,
+ VFIO_PROXY_ERROR = 2,
+ VFIO_PROXY_CLOSING = 3,
+ VFIO_PROXY_CLOSED = 4,
+};
+
+typedef QTAILQ_HEAD(VFIOUserMsgQ, VFIOUserMsg) VFIOUserMsgQ;
+
+typedef struct VFIOProxy {
+ QLIST_ENTRY(VFIOProxy) next;
+ char *sockname;
+ struct QIOChannel *ioc;
+ void (*request)(void *opaque, VFIOUserMsg *msg);
+ void *req_arg;
+ int flags;
+ QemuCond close_cv;
+ AioContext *ctx;
+ QEMUBH *req_bh;
+
+ /*
+ * above only changed when BQL is held
+ * below are protected by per-proxy lock
+ */
+ QemuMutex lock;
+ VFIOUserMsgQ free;
+ VFIOUserMsgQ pending;
+ VFIOUserMsgQ incoming;
+ VFIOUserMsgQ outgoing;
+ VFIOUserMsg *last_nowait;
+ enum proxy_state state;
+} VFIOProxy;
+
+/* VFIOProxy flags */
+#define VFIO_PROXY_CLIENT 0x1
+
+VFIOProxy *vfio_user_connect_dev(SocketAddress *addr, Error **errp);
+void vfio_user_disconnect(VFIOProxy *proxy);
+
+#endif /* VFIO_USER_H */
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 826cd98..3eb0b19 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -76,6 +76,7 @@ typedef struct VFIOAddressSpace {
struct VFIOGroup;
typedef struct VFIOContIO VFIOContIO;
+typedef struct VFIOProxy VFIOProxy;
typedef struct VFIOContainer {
VFIOAddressSpace *space;
@@ -147,6 +148,7 @@ typedef struct VFIODevice {
VFIOMigration *migration;
Error *migration_blocker;
OnOffAuto pre_copy_dirty_page_tracking;
+ VFIOProxy *proxy;
struct vfio_region_info **regions;
} VFIODevice;
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index be8fe1d..8f65074 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -43,6 +43,7 @@
#include "qapi/error.h"
#include "migration/blocker.h"
#include "migration/qemu-file.h"
+#include "hw/vfio/user.h"
/* convenience macros for PCI config space */
#define VDEV_CONFIG_READ(vbasedev, off, size, data) \
@@ -3408,6 +3409,9 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
VFIOUserPCIDevice *udev = VFIO_USER_PCI(pdev);
VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
VFIODevice *vbasedev = &vdev->vbasedev;
+ SocketAddress addr;
+ VFIOProxy *proxy;
+ Error *err = NULL;
/*
* TODO: make option parser understand SocketAddress
@@ -3420,6 +3424,16 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
return;
}
+ memset(&addr, 0, sizeof(addr));
+ addr.type = SOCKET_ADDRESS_TYPE_UNIX;
+ addr.u.q_unix.path = udev->sock_name;
+ proxy = vfio_user_connect_dev(&addr, &err);
+ if (!proxy) {
+ error_setg(errp, "Remote proxy not found");
+ return;
+ }
+ vbasedev->proxy = proxy;
+
vbasedev->name = g_strdup_printf("VFIO user <%s>", udev->sock_name);
vbasedev->dev = DEVICE(vdev);
vbasedev->fd = -1;
@@ -3431,8 +3445,13 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
static void vfio_user_instance_finalize(Object *obj)
{
VFIOPCIDevice *vdev = VFIO_PCI_BASE(obj);
+ VFIODevice *vbasedev = &vdev->vbasedev;
vfio_put_device(vdev);
+
+ if (vbasedev->proxy != NULL) {
+ vfio_user_disconnect(vbasedev->proxy);
+ }
}
static Property vfio_user_pci_dev_properties[] = {
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
new file mode 100644
index 0000000..c843f90
--- /dev/null
+++ b/hw/vfio/user.c
@@ -0,0 +1,170 @@
+/*
+ * vfio protocol over a UNIX socket.
+ *
+ * Copyright © 2018, 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory.
+ *
+ */
+
+#include "qemu/osdep.h"
+#include <linux/vfio.h>
+#include <sys/ioctl.h>
+
+#include "qemu/error-report.h"
+#include "qapi/error.h"
+#include "qemu/main-loop.h"
+#include "hw/hw.h"
+#include "hw/vfio/vfio-common.h"
+#include "hw/vfio/vfio.h"
+#include "qemu/sockets.h"
+#include "io/channel.h"
+#include "io/channel-socket.h"
+#include "io/channel-util.h"
+#include "sysemu/iothread.h"
+#include "user.h"
+
+static IOThread *vfio_user_iothread;
+
+static void vfio_user_shutdown(VFIOProxy *proxy);
+
+
+/*
+ * Functions called by main, CPU, or iothread threads
+ */
+
+static void vfio_user_shutdown(VFIOProxy *proxy)
+{
+ qio_channel_shutdown(proxy->ioc, QIO_CHANNEL_SHUTDOWN_READ, NULL);
+ qio_channel_set_aio_fd_handler(proxy->ioc, proxy->ctx, NULL, NULL, NULL);
+}
+
+/*
+ * Functions only called by iothread
+ */
+
+static void vfio_user_cb(void *opaque)
+{
+ VFIOProxy *proxy = opaque;
+
+ QEMU_LOCK_GUARD(&proxy->lock);
+
+ proxy->state = VFIO_PROXY_CLOSED;
+ qemu_cond_signal(&proxy->close_cv);
+}
+
+
+/*
+ * Functions called by main or CPU threads
+ */
+
+static QLIST_HEAD(, VFIOProxy) vfio_user_sockets =
+ QLIST_HEAD_INITIALIZER(vfio_user_sockets);
+
+VFIOProxy *vfio_user_connect_dev(SocketAddress *addr, Error **errp)
+{
+ VFIOProxy *proxy;
+ QIOChannelSocket *sioc;
+ QIOChannel *ioc;
+ char *sockname;
+
+ if (addr->type != SOCKET_ADDRESS_TYPE_UNIX) {
+ error_setg(errp, "vfio_user_connect - bad address family");
+ return NULL;
+ }
+ sockname = addr->u.q_unix.path;
+
+ sioc = qio_channel_socket_new();
+ ioc = QIO_CHANNEL(sioc);
+ if (qio_channel_socket_connect_sync(sioc, addr, errp)) {
+ object_unref(OBJECT(ioc));
+ return NULL;
+ }
+ qio_channel_set_blocking(ioc, false, NULL);
+
+ proxy = g_malloc0(sizeof(VFIOProxy));
+ proxy->sockname = g_strdup_printf("unix:%s", sockname);
+ proxy->ioc = ioc;
+ proxy->flags = VFIO_PROXY_CLIENT;
+ proxy->state = VFIO_PROXY_CONNECTED;
+
+ qemu_mutex_init(&proxy->lock);
+ qemu_cond_init(&proxy->close_cv);
+
+ if (vfio_user_iothread == NULL) {
+ vfio_user_iothread = iothread_create("VFIO user", errp);
+ }
+
+ proxy->ctx = iothread_get_aio_context(vfio_user_iothread);
+
+ QTAILQ_INIT(&proxy->outgoing);
+ QTAILQ_INIT(&proxy->incoming);
+ QTAILQ_INIT(&proxy->free);
+ QTAILQ_INIT(&proxy->pending);
+ QLIST_INSERT_HEAD(&vfio_user_sockets, proxy, next);
+
+ return proxy;
+}
+
+void vfio_user_disconnect(VFIOProxy *proxy)
+{
+ VFIOUserMsg *r1, *r2;
+
+ qemu_mutex_lock(&proxy->lock);
+
+ /* our side is quitting */
+ if (proxy->state == VFIO_PROXY_CONNECTED) {
+ vfio_user_shutdown(proxy);
+ if (!QTAILQ_EMPTY(&proxy->pending)) {
+ error_printf("vfio_user_disconnect: outstanding requests\n");
+ }
+ }
+ object_unref(OBJECT(proxy->ioc));
+ proxy->ioc = NULL;
+
+ proxy->state = VFIO_PROXY_CLOSING;
+ QTAILQ_FOREACH_SAFE(r1, &proxy->outgoing, next, r2) {
+ qemu_cond_destroy(&r1->cv);
+ QTAILQ_REMOVE(&proxy->outgoing, r1, next);
+ g_free(r1);
+ }
+ QTAILQ_FOREACH_SAFE(r1, &proxy->incoming, next, r2) {
+ qemu_cond_destroy(&r1->cv);
+ QTAILQ_REMOVE(&proxy->incoming, r1, next);
+ g_free(r1);
+ }
+ QTAILQ_FOREACH_SAFE(r1, &proxy->pending, next, r2) {
+ qemu_cond_destroy(&r1->cv);
+ QTAILQ_REMOVE(&proxy->pending, r1, next);
+ g_free(r1);
+ }
+ QTAILQ_FOREACH_SAFE(r1, &proxy->free, next, r2) {
+ qemu_cond_destroy(&r1->cv);
+ QTAILQ_REMOVE(&proxy->free, r1, next);
+ g_free(r1);
+ }
+
+ /*
+ * Make sure the iothread isn't blocking anywhere
+ * with a ref to this proxy by waiting for a BH
+ * handler to run after the proxy fd handlers were
+ * deleted above.
+ */
+ aio_bh_schedule_oneshot(proxy->ctx, vfio_user_cb, proxy);
+ qemu_cond_wait(&proxy->close_cv, &proxy->lock);
+
+ /* we now hold the only ref to proxy */
+ qemu_mutex_unlock(&proxy->lock);
+ qemu_cond_destroy(&proxy->close_cv);
+ qemu_mutex_destroy(&proxy->lock);
+
+ QLIST_REMOVE(proxy, next);
+ if (QLIST_EMPTY(&vfio_user_sockets)) {
+ iothread_destroy(vfio_user_iothread);
+ vfio_user_iothread = NULL;
+ }
+
+ g_free(proxy->sockname);
+ g_free(proxy);
+}
diff --git a/MAINTAINERS b/MAINTAINERS
index 8117241..cd44f91 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1958,8 +1958,12 @@ L: qemu-s390x@nongnu.org
vfio-user
M: John G Johnson <john.g.johnson@oracle.com>
M: Thanos Makatos <thanos.makatos@nutanix.com>
+M: Elena Ufimtseva <elena.ufimtseva@oracle.com>
+M: Jagannathan Raman <jag.raman@oracle.com>
S: Supported
F: docs/devel/vfio-user.rst
+F: hw/vfio/user.c
+F: hw/vfio/user.h
vhost
M: Michael S. Tsirkin <mst@redhat.com>
diff --git a/hw/vfio/meson.build b/hw/vfio/meson.build
index da9af29..2f86f72 100644
--- a/hw/vfio/meson.build
+++ b/hw/vfio/meson.build
@@ -9,6 +9,7 @@ vfio_ss.add(when: 'CONFIG_VFIO_PCI', if_true: files(
'pci-quirks.c',
'pci.c',
))
+vfio_ss.add(when: 'CONFIG_VFIO_USER', if_true: files('user.c'))
vfio_ss.add(when: 'CONFIG_VFIO_CCW', if_true: files('ccw.c'))
vfio_ss.add(when: 'CONFIG_VFIO_PLATFORM', if_true: files('platform.c'))
vfio_ss.add(when: 'CONFIG_VFIO_XGMAC', if_true: files('calxeda-xgmac.c'))
--
1.8.3.1
* [RFC v5 08/23] vfio-user: define socket receive functions
From: John Johnson @ 2022-05-05 17:19 UTC (permalink / raw)
To: qemu-devel
Add the infrastructure needed to receive incoming messages.
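
As a rough guide to what the receive path has to do first, the sketch
below validates and classifies a message header. The VFIOUserHdr layout
and the flag values are copied from user-protocol.h in this patch; the
demo_* helpers are hypothetical, host byte order is assumed, and the rule
that the size field covers the header plus payload is taken from the
protocol specification rather than from this patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Layout and flag values as defined in user-protocol.h. */
typedef struct {
    uint16_t id;
    uint16_t command;
    uint32_t size;
    uint32_t flags;
    uint32_t error_reply;
} VFIOUserHdr;

#define VFIO_USER_REQUEST       0x0
#define VFIO_USER_REPLY         0x1
#define VFIO_USER_TYPE          0xF
#define VFIO_USER_NO_REPLY      0x10
#define VFIO_USER_ERROR         0x20

/*
 * Hypothetical helper: copy a header out of a receive buffer.
 * Returns 0 on success, -1 if the buffer is too short or the size
 * field is smaller than the header itself.
 */
static int demo_parse_hdr(const uint8_t *buf, size_t len, VFIOUserHdr *hdr)
{
    if (len < sizeof(*hdr)) {
        return -1;
    }
    memcpy(hdr, buf, sizeof(*hdr));
    if (hdr->size < sizeof(*hdr)) {
        return -1;
    }
    return 0;
}

/* True if the message is a reply rather than a request. */
static bool demo_is_reply(const VFIOUserHdr *hdr)
{
    return (hdr->flags & VFIO_USER_TYPE) == VFIO_USER_REPLY;
}

/* Error code carried by the message, or 0 if the error flag is clear. */
static uint32_t demo_reply_error(const VFIOUserHdr *hdr)
{
    return (hdr->flags & VFIO_USER_ERROR) ? hdr->error_reply : 0;
}

/* Number of command-specific bytes that follow the header. */
static uint32_t demo_payload_size(const VFIOUserHdr *hdr)
{
    return hdr->size - (uint32_t)sizeof(*hdr);
}
```

A receiver would typically read sizeof(VFIOUserHdr) bytes, run a check
like demo_parse_hdr(), and only then read demo_payload_size() further
bytes, subject to whatever maximum transfer size was negotiated.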
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
hw/vfio/user-protocol.h | 54 +++++++
hw/vfio/user.h | 8 +
hw/vfio/pci.c | 6 +
hw/vfio/user.c | 404 ++++++++++++++++++++++++++++++++++++++++++++++++
MAINTAINERS | 1 +
5 files changed, 473 insertions(+)
create mode 100644 hw/vfio/user-protocol.h
diff --git a/hw/vfio/user-protocol.h b/hw/vfio/user-protocol.h
new file mode 100644
index 0000000..d23877c
--- /dev/null
+++ b/hw/vfio/user-protocol.h
@@ -0,0 +1,54 @@
+#ifndef VFIO_USER_PROTOCOL_H
+#define VFIO_USER_PROTOCOL_H
+
+/*
+ * vfio protocol over a UNIX socket.
+ *
+ * Copyright © 2018, 2021 Oracle and/or its affiliates.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ *
+ * Each message has a standard header that describes the command
+ * being sent, which is almost always a VFIO ioctl().
+ *
+ * The header may be followed by command-specific data, such as the
+ * region and offset info for read and write commands.
+ */
+
+typedef struct {
+ uint16_t id;
+ uint16_t command;
+ uint32_t size;
+ uint32_t flags;
+ uint32_t error_reply;
+} VFIOUserHdr;
+
+/* VFIOUserHdr commands */
+enum vfio_user_command {
+ VFIO_USER_VERSION = 1,
+ VFIO_USER_DMA_MAP = 2,
+ VFIO_USER_DMA_UNMAP = 3,
+ VFIO_USER_DEVICE_GET_INFO = 4,
+ VFIO_USER_DEVICE_GET_REGION_INFO = 5,
+ VFIO_USER_DEVICE_GET_REGION_IO_FDS = 6,
+ VFIO_USER_DEVICE_GET_IRQ_INFO = 7,
+ VFIO_USER_DEVICE_SET_IRQS = 8,
+ VFIO_USER_REGION_READ = 9,
+ VFIO_USER_REGION_WRITE = 10,
+ VFIO_USER_DMA_READ = 11,
+ VFIO_USER_DMA_WRITE = 12,
+ VFIO_USER_DEVICE_RESET = 13,
+ VFIO_USER_DIRTY_PAGES = 14,
+ VFIO_USER_MAX,
+};
+
+/* VFIOUserHdr flags */
+#define VFIO_USER_REQUEST 0x0
+#define VFIO_USER_REPLY 0x1
+#define VFIO_USER_TYPE 0xF
+
+#define VFIO_USER_NO_REPLY 0x10
+#define VFIO_USER_ERROR 0x20
+
+#endif /* VFIO_USER_PROTOCOL_H */
diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index da92862..68a1080 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -11,6 +11,8 @@
*
*/
+#include "user-protocol.h"
+
typedef struct {
int send_fds;
int recv_fds;
@@ -27,6 +29,7 @@ enum msg_type {
typedef struct VFIOUserMsg {
QTAILQ_ENTRY(VFIOUserMsg) next;
+ VFIOUserHdr *hdr;
VFIOUserFDs *fds;
uint32_t rsize;
uint32_t id;
@@ -66,6 +69,8 @@ typedef struct VFIOProxy {
VFIOUserMsgQ incoming;
VFIOUserMsgQ outgoing;
VFIOUserMsg *last_nowait;
+ VFIOUserMsg *part_recv;
+ size_t recv_left;
enum proxy_state state;
} VFIOProxy;
@@ -74,5 +79,8 @@ typedef struct VFIOProxy {
VFIOProxy *vfio_user_connect_dev(SocketAddress *addr, Error **errp);
void vfio_user_disconnect(VFIOProxy *proxy);
+void vfio_user_set_handler(VFIODevice *vbasedev,
+ void (*handler)(void *opaque, VFIOUserMsg *msg),
+ void *reqarg);
#endif /* VFIO_USER_H */
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 8f65074..7ef11c0 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3387,6 +3387,11 @@ type_init(register_vfio_pci_dev_type)
* vfio-user routines.
*/
+static void vfio_user_pci_process_req(void *opaque, VFIOUserMsg *msg)
+{
+
+}
+
/*
* Emulated devices don't use host hot reset
*/
@@ -3433,6 +3438,7 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
return;
}
vbasedev->proxy = proxy;
+ vfio_user_set_handler(vbasedev, vfio_user_pci_process_req, vdev);
vbasedev->name = g_strdup_printf("VFIO user <%s>", udev->sock_name);
vbasedev->dev = DEVICE(vdev);
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index c843f90..16b37cb 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -25,10 +25,26 @@
#include "sysemu/iothread.h"
#include "user.h"
+static uint64_t max_xfer_size;
static IOThread *vfio_user_iothread;
static void vfio_user_shutdown(VFIOProxy *proxy);
+static VFIOUserMsg *vfio_user_getmsg(VFIOProxy *proxy, VFIOUserHdr *hdr,
+ VFIOUserFDs *fds);
+static VFIOUserFDs *vfio_user_getfds(int numfds);
+static void vfio_user_recycle(VFIOProxy *proxy, VFIOUserMsg *msg);
+static void vfio_user_recv(void *opaque);
+static int vfio_user_recv_one(VFIOProxy *proxy);
+static void vfio_user_cb(void *opaque);
+
+static void vfio_user_request(void *opaque);
+
+static inline void vfio_user_set_error(VFIOUserHdr *hdr, uint32_t err)
+{
+ hdr->flags |= VFIO_USER_ERROR;
+ hdr->error_reply = err;
+}
/*
* Functions called by main, CPU, or iothread threads
@@ -40,10 +56,338 @@ static void vfio_user_shutdown(VFIOProxy *proxy)
qio_channel_set_aio_fd_handler(proxy->ioc, proxy->ctx, NULL, NULL, NULL);
}
+static VFIOUserMsg *vfio_user_getmsg(VFIOProxy *proxy, VFIOUserHdr *hdr,
+ VFIOUserFDs *fds)
+{
+ VFIOUserMsg *msg;
+
+ msg = QTAILQ_FIRST(&proxy->free);
+ if (msg != NULL) {
+ QTAILQ_REMOVE(&proxy->free, msg, next);
+ } else {
+ msg = g_malloc0(sizeof(*msg));
+ qemu_cond_init(&msg->cv);
+ }
+
+ msg->hdr = hdr;
+ msg->fds = fds;
+ return msg;
+}
+
+/*
+ * Recycle a message list entry to the free list.
+ */
+static void vfio_user_recycle(VFIOProxy *proxy, VFIOUserMsg *msg)
+{
+ if (msg->type == VFIO_MSG_NONE) {
+ error_printf("vfio_user_recycle - freeing free msg\n");
+ return;
+ }
+
+ /* free msg buffer if no one is waiting to consume the reply */
+ if (msg->type == VFIO_MSG_NOWAIT || msg->type == VFIO_MSG_ASYNC) {
+ g_free(msg->hdr);
+ if (msg->fds != NULL) {
+ g_free(msg->fds);
+ }
+ }
+
+ msg->type = VFIO_MSG_NONE;
+ msg->hdr = NULL;
+ msg->fds = NULL;
+ msg->complete = false;
+ QTAILQ_INSERT_HEAD(&proxy->free, msg, next);
+}
+
+static VFIOUserFDs *vfio_user_getfds(int numfds)
+{
+ VFIOUserFDs *fds = g_malloc0(sizeof(*fds) + (numfds * sizeof(int)));
+
+ fds->fds = (int *)((char *)fds + sizeof(*fds));
+
+ return fds;
+}
+
/*
* Functions only called by iothread
*/
+/*
+ * Process a received message.
+ */
+static void vfio_user_process(VFIOProxy *proxy, VFIOUserMsg *msg, bool isreply)
+{
+
+ /*
+ * Replies signal a waiter; if there is none, just check for errors
+ * and free the message buffer.
+ *
+ * Requests get queued for the BH.
+ */
+ if (isreply) {
+ msg->complete = true;
+ if (msg->type == VFIO_MSG_WAIT) {
+ qemu_cond_signal(&msg->cv);
+ } else {
+ if (msg->hdr->flags & VFIO_USER_ERROR) {
+ error_printf("vfio_user_rcv: error reply on async request ");
+ error_printf("command %x error %s\n", msg->hdr->command,
+ strerror(msg->hdr->error_reply));
+ }
+ /* youngest nowait msg has been ack'd */
+ if (proxy->last_nowait == msg) {
+ proxy->last_nowait = NULL;
+ }
+ vfio_user_recycle(proxy, msg);
+ }
+ } else {
+ QTAILQ_INSERT_TAIL(&proxy->incoming, msg, next);
+ qemu_bh_schedule(proxy->req_bh);
+ }
+}
+
+/*
+ * Complete a partial message read
+ */
+static int vfio_user_complete(VFIOProxy *proxy, Error **errp)
+{
+ VFIOUserMsg *msg = proxy->part_recv;
+ size_t msgleft = proxy->recv_left;
+ bool isreply;
+ char *data;
+ int ret;
+
+ data = (char *)msg->hdr + (msg->hdr->size - msgleft);
+ while (msgleft > 0) {
+ ret = qio_channel_read(proxy->ioc, data, msgleft, errp);
+
+ /* error or would block */
+ if (ret <= 0) {
+ /* try for rest on next iteration */
+ if (ret == QIO_CHANNEL_ERR_BLOCK) {
+ proxy->recv_left = msgleft;
+ }
+ return ret;
+ }
+
+ msgleft -= ret;
+ data += ret;
+ }
+
+ /*
+ * Read complete message, process it.
+ */
+ proxy->part_recv = NULL;
+ proxy->recv_left = 0;
+ isreply = (msg->hdr->flags & VFIO_USER_TYPE) == VFIO_USER_REPLY;
+ vfio_user_process(proxy, msg, isreply);
+
+ /* return positive value */
+ return 1;
+}
+
+static void vfio_user_recv(void *opaque)
+{
+ VFIOProxy *proxy = opaque;
+
+ QEMU_LOCK_GUARD(&proxy->lock);
+
+ if (proxy->state == VFIO_PROXY_CONNECTED) {
+ while (vfio_user_recv_one(proxy) == 0) {
+ ;
+ }
+ }
+}
+
+/*
+ * Receive and process one incoming message.
+ *
+ * For replies, find matching outgoing request and wake any waiters.
+ * For requests, queue in incoming list and run request BH.
+ */
+static int vfio_user_recv_one(VFIOProxy *proxy)
+{
+ VFIOUserMsg *msg = NULL;
+ g_autofree int *fdp = NULL;
+ VFIOUserFDs *reqfds;
+ VFIOUserHdr hdr;
+ struct iovec iov = {
+ .iov_base = &hdr,
+ .iov_len = sizeof(hdr),
+ };
+ bool isreply = false;
+ int i, ret;
+ size_t msgleft, numfds = 0;
+ char *data = NULL;
+ char *buf = NULL;
+ Error *local_err = NULL;
+
+ /*
+ * Complete any partial reads
+ */
+ if (proxy->part_recv != NULL) {
+ ret = vfio_user_complete(proxy, &local_err);
+
+ /* still not complete, try later */
+ if (ret == QIO_CHANNEL_ERR_BLOCK) {
+ return ret;
+ }
+
+ if (ret <= 0) {
+ goto fatal;
+ }
+ /* else fall into reading another msg */
+ }
+
+ /*
+ * Read header
+ */
+ ret = qio_channel_readv_full(proxy->ioc, &iov, 1, &fdp, &numfds,
+ &local_err);
+ if (ret == QIO_CHANNEL_ERR_BLOCK) {
+ return ret;
+ }
+
+ /* read error or other side closed connection */
+ if (ret <= 0) {
+ goto fatal;
+ }
+
+ if (ret < sizeof(hdr)) {
+ error_setg(&local_err, "short read of header");
+ goto fatal;
+ }
+
+ /*
+ * Validate header
+ */
+ if (hdr.size < sizeof(VFIOUserHdr)) {
+ error_setg(&local_err, "bad header size");
+ goto fatal;
+ }
+ switch (hdr.flags & VFIO_USER_TYPE) {
+ case VFIO_USER_REQUEST:
+ isreply = false;
+ break;
+ case VFIO_USER_REPLY:
+ isreply = true;
+ break;
+ default:
+ error_setg(&local_err, "unknown message type");
+ goto fatal;
+ }
+
+ /*
+ * For replies, find the matching pending request.
+ * For requests, reap incoming FDs.
+ */
+ if (isreply) {
+ QTAILQ_FOREACH(msg, &proxy->pending, next) {
+ if (hdr.id == msg->id) {
+ break;
+ }
+ }
+ if (msg == NULL) {
+ error_setg(&local_err, "unexpected reply");
+ goto err;
+ }
+ QTAILQ_REMOVE(&proxy->pending, msg, next);
+
+ /*
+ * Process any received FDs
+ */
+ if (numfds != 0) {
+ if (msg->fds == NULL || msg->fds->recv_fds < numfds) {
+ error_setg(&local_err, "unexpected FDs");
+ goto err;
+ }
+ msg->fds->recv_fds = numfds;
+ memcpy(msg->fds->fds, fdp, numfds * sizeof(int));
+ }
+ } else {
+ if (numfds != 0) {
+ reqfds = vfio_user_getfds(numfds);
+ memcpy(reqfds->fds, fdp, numfds * sizeof(int));
+ } else {
+ reqfds = NULL;
+ }
+ }
+
+ /*
+ * Put the whole message into a single buffer.
+ */
+ if (isreply) {
+ if (hdr.size > msg->rsize) {
+ error_setg(&local_err, "reply larger than recv buffer");
+ goto err;
+ }
+ *msg->hdr = hdr;
+ data = (char *)msg->hdr + sizeof(hdr);
+ } else {
+ if (hdr.size > max_xfer_size) {
+ error_setg(&local_err, "vfio_user_recv request larger than max");
+ goto err;
+ }
+ buf = g_malloc0(hdr.size);
+ memcpy(buf, &hdr, sizeof(hdr));
+ data = buf + sizeof(hdr);
+ msg = vfio_user_getmsg(proxy, (VFIOUserHdr *)buf, reqfds);
+ msg->type = VFIO_MSG_REQ;
+ }
+
+ /*
+ * Read rest of message.
+ */
+ msgleft = hdr.size - sizeof(hdr);
+ while (msgleft > 0) {
+ ret = qio_channel_read(proxy->ioc, data, msgleft, &local_err);
+
+ /* prepare to complete read on next iteration */
+ if (ret == QIO_CHANNEL_ERR_BLOCK) {
+ proxy->part_recv = msg;
+ proxy->recv_left = msgleft;
+ return ret;
+ }
+
+ if (ret <= 0) {
+ goto fatal;
+ }
+
+ msgleft -= ret;
+ data += ret;
+ }
+
+ vfio_user_process(proxy, msg, isreply);
+ return 0;
+
+ /*
+ * fatal means the other side closed or we don't trust the stream
+ * err means this message is corrupt
+ */
+fatal:
+ vfio_user_shutdown(proxy);
+ proxy->state = VFIO_PROXY_ERROR;
+
+ /* set error if server side closed */
+ if (ret == 0) {
+ error_setg(&local_err, "server closed socket");
+ }
+
+err:
+ for (i = 0; i < numfds; i++) {
+ close(fdp[i]);
+ }
+ if (isreply && msg != NULL) {
+ /* force an error to keep sending thread from hanging */
+ vfio_user_set_error(msg->hdr, EINVAL);
+ msg->complete = true;
+ qemu_cond_signal(&msg->cv);
+ }
+ error_prepend(&local_err, "vfio_user_recv: ");
+ error_report_err(local_err);
+ return -1;
+}
+
static void vfio_user_cb(void *opaque)
{
VFIOProxy *proxy = opaque;
@@ -59,6 +403,51 @@ static void vfio_user_cb(void *opaque)
* Functions called by main or CPU threads
*/
+/*
+ * Process incoming requests.
+ *
+ * The bus-specific callback has the form:
+ * request(opaque, msg)
+ * where 'opaque' was specified in vfio_user_set_handler
+ * and 'msg' is the inbound message.
+ *
+ * The callback is responsible for disposing of the message buffer,
+ * usually by re-using it when calling vfio_send_reply or vfio_send_error,
+ * both of which free their message buffer when the reply is sent.
+ *
+ * If the callback uses a new buffer, it needs to free the old one.
+ */
+static void vfio_user_request(void *opaque)
+{
+ VFIOProxy *proxy = opaque;
+ VFIOUserMsgQ new, free;
+ VFIOUserMsg *msg, *m1;
+
+ /* reap all incoming */
+ QTAILQ_INIT(&new);
+ WITH_QEMU_LOCK_GUARD(&proxy->lock) {
+ QTAILQ_FOREACH_SAFE(msg, &proxy->incoming, next, m1) {
+ QTAILQ_REMOVE(&proxy->incoming, msg, next);
+ QTAILQ_INSERT_TAIL(&new, msg, next);
+ }
+ }
+
+ /* process list */
+ QTAILQ_INIT(&free);
+ QTAILQ_FOREACH_SAFE(msg, &new, next, m1) {
+ QTAILQ_REMOVE(&new, msg, next);
+ proxy->request(proxy->req_arg, msg);
+ QTAILQ_INSERT_HEAD(&free, msg, next);
+ }
+
+ /* free list */
+ WITH_QEMU_LOCK_GUARD(&proxy->lock) {
+ QTAILQ_FOREACH_SAFE(msg, &free, next, m1) {
+ vfio_user_recycle(proxy, msg);
+ }
+ }
+}
+
static QLIST_HEAD(, VFIOProxy) vfio_user_sockets =
QLIST_HEAD_INITIALIZER(vfio_user_sockets);
@@ -97,6 +486,7 @@ VFIOProxy *vfio_user_connect_dev(SocketAddress *addr, Error **errp)
}
proxy->ctx = iothread_get_aio_context(vfio_user_iothread);
+ proxy->req_bh = qemu_bh_new(vfio_user_request, proxy);
QTAILQ_INIT(&proxy->outgoing);
QTAILQ_INIT(&proxy->incoming);
@@ -107,6 +497,18 @@ VFIOProxy *vfio_user_connect_dev(SocketAddress *addr, Error **errp)
return proxy;
}
+void vfio_user_set_handler(VFIODevice *vbasedev,
+ void (*handler)(void *opaque, VFIOUserMsg *msg),
+ void *req_arg)
+{
+ VFIOProxy *proxy = vbasedev->proxy;
+
+ proxy->request = handler;
+ proxy->req_arg = req_arg;
+ qio_channel_set_aio_fd_handler(proxy->ioc, proxy->ctx,
+ vfio_user_recv, NULL, proxy);
+}
+
void vfio_user_disconnect(VFIOProxy *proxy)
{
VFIOUserMsg *r1, *r2;
@@ -122,6 +524,8 @@ void vfio_user_disconnect(VFIOProxy *proxy)
}
object_unref(OBJECT(proxy->ioc));
proxy->ioc = NULL;
+ qemu_bh_delete(proxy->req_bh);
+ proxy->req_bh = NULL;
proxy->state = VFIO_PROXY_CLOSING;
QTAILQ_FOREACH_SAFE(r1, &proxy->outgoing, next, r2) {
diff --git a/MAINTAINERS b/MAINTAINERS
index cd44f91..c81f8b6 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -1964,6 +1964,7 @@ S: Supported
F: docs/devel/vfio-user.rst
F: hw/vfio/user.c
F: hw/vfio/user.h
+F: hw/vfio/user-protocol.h
vhost
M: Michael S. Tsirkin <mst@redhat.com>
--
1.8.3.1
* [RFC v5 09/23] vfio-user: define socket send functions
[not found] <cover.1651709440.git.john.g.johnson@oracle.com>
` (7 preceding siblings ...)
2022-05-05 17:19 ` [RFC v5 08/23] vfio-user: define socket receive functions John Johnson
@ 2022-05-05 17:19 ` John Johnson
2022-05-05 17:19 ` [RFC v5 10/23] vfio-user: get device info John Johnson
` (13 subsequent siblings)
22 siblings, 0 replies; 23+ messages in thread
From: John Johnson @ 2022-05-05 17:19 UTC (permalink / raw)
To: qemu-devel
Also negotiate protocol version with remote server
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
---
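The acceptance rule applied to the VFIO_USER_VERSION reply can be stated
compactly: the major version must match, the server's minor version must
not be newer than the client's, and the capabilities string must be
NUL-terminated before it is handed to the JSON parser. A minimal sketch,
assuming a hypothetical helper version_reply_ok() that is not part of this
patch; the empty-string guard is an addition of the sketch, not something
the patch checks.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CLIENT_MAJOR 0      /* VFIO_USER_MAJOR_VER in this patch */
#define CLIENT_MINOR 0      /* VFIO_USER_MINOR_VER in this patch */

/*
 * Hypothetical helper (not in the patch).
 * 'caps' points at the capabilities bytes that follow the version fields,
 * 'caplen' is their length as derived from the header size.
 * Returns 1 if the reply is acceptable, 0 otherwise.
 */
static int version_reply_ok(uint16_t major, uint16_t minor,
                            const char *caps, size_t caplen)
{
    assert(caps != NULL);

    if (major != CLIENT_MAJOR || minor > CLIENT_MINOR) {
        return 0;           /* incompatible server version */
    }
    /* Sketch-only guard: a zero-length capabilities field is rejected. */
    if (caplen == 0 || caps[caplen - 1] != '\0') {
        return 0;           /* corrupt version reply */
    }
    return 1;
}
```

For reference, the request built by caps_json() has roughly this shape
(key order may differ):
{"capabilities": {"migration": {"pgsize": 4096}, "max_msg_fds": 16,
"max_data_xfer_size": 1048576}}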
hw/vfio/pci.h | 1 +
hw/vfio/user-protocol.h | 41 +++++
hw/vfio/user.h | 2 +
hw/vfio/pci.c | 16 ++
hw/vfio/user.c | 414 +++++++++++++++++++++++++++++++++++++++++++++++-
5 files changed, 473 insertions(+), 1 deletion(-)
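The three outgoing message types differ only in what happens after the
send and after the reply. The sketch below summarises those transitions as
two small functions. The enum names are hypothetical stand-ins for the
VFIO_MSG_* values (whose definitions are not shown in this patch), and the
functions only describe the decision, they do not touch any queues.

```c
#include <assert.h>

/* Hypothetical stand-ins for VFIO_MSG_NONE/ASYNC/NOWAIT/WAIT. */
enum sk_msg_type { SK_NONE, SK_ASYNC, SK_NOWAIT, SK_WAIT };

enum after_send  { RECYCLE_NOW, MOVE_TO_PENDING };
enum after_reply { FREE_BUFFER, WAKE_CALLER };

/* After a successful send: async messages expect no reply. */
static enum after_send on_sent(enum sk_msg_type t)
{
    assert(t != SK_NONE);   /* free-list entries are never sent */
    return t == SK_ASYNC ? RECYCLE_NOW : MOVE_TO_PENDING;
}

/* After the reply arrives: only 'wait' messages have a sleeping caller. */
static enum after_reply on_reply(enum sk_msg_type t)
{
    assert(t == SK_NOWAIT || t == SK_WAIT);
    return t == SK_WAIT ? WAKE_CALLER : FREE_BUFFER;
}
```

This matches vfio_user_send_one() (recycle async, queue the rest on
'pending') and vfio_user_process() (signal the waiter, otherwise log any
error and recycle).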
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 59e636c..ec9f345 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -193,6 +193,7 @@ OBJECT_DECLARE_SIMPLE_TYPE(VFIOUserPCIDevice, VFIO_USER_PCI)
struct VFIOUserPCIDevice {
VFIOPCIDevice device;
char *sock_name;
+ bool send_queued; /* all sends are queued */
};
/* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot match hw */
diff --git a/hw/vfio/user-protocol.h b/hw/vfio/user-protocol.h
index d23877c..a0889f6 100644
--- a/hw/vfio/user-protocol.h
+++ b/hw/vfio/user-protocol.h
@@ -51,4 +51,45 @@ enum vfio_user_command {
#define VFIO_USER_NO_REPLY 0x10
#define VFIO_USER_ERROR 0x20
+
+/*
+ * VFIO_USER_VERSION
+ */
+typedef struct {
+ VFIOUserHdr hdr;
+ uint16_t major;
+ uint16_t minor;
+ char capabilities[];
+} VFIOUserVersion;
+
+#define VFIO_USER_MAJOR_VER 0
+#define VFIO_USER_MINOR_VER 0
+
+#define VFIO_USER_CAP "capabilities"
+
+/* "capabilities" members */
+#define VFIO_USER_CAP_MAX_FDS "max_msg_fds"
+#define VFIO_USER_CAP_MAX_XFER "max_data_xfer_size"
+#define VFIO_USER_CAP_MIGR "migration"
+
+/* "migration" member */
+#define VFIO_USER_CAP_PGSIZE "pgsize"
+
+/*
+ * Max FDs mainly comes into play when a device supports multiple interrupts
+ * where each one uses an eventfd to inject it into the guest.
+ * It is clamped by the number of FDs the qio channel supports in a
+ * single message.
+ */
+#define VFIO_USER_DEF_MAX_FDS 8
+#define VFIO_USER_MAX_MAX_FDS 16
+
+/*
+ * Max transfer limits the amount of data in region and DMA messages.
+ * Region R/W will be very small (limited by how much a single instruction
+ * can process) so just use a reasonable limit here.
+ */
+#define VFIO_USER_DEF_MAX_XFER (1024 * 1024)
+#define VFIO_USER_MAX_MAX_XFER (64 * 1024 * 1024)
+
#endif /* VFIO_USER_PROTOCOL_H */
diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index 68a1080..00d21bf 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -76,11 +76,13 @@ typedef struct VFIOProxy {
/* VFIOProxy flags */
#define VFIO_PROXY_CLIENT 0x1
+#define VFIO_PROXY_FORCE_QUEUED 0x4
VFIOProxy *vfio_user_connect_dev(SocketAddress *addr, Error **errp);
void vfio_user_disconnect(VFIOProxy *proxy);
void vfio_user_set_handler(VFIODevice *vbasedev,
void (*handler)(void *opaque, VFIOUserMsg *msg),
void *reqarg);
+int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp);
#endif /* VFIO_USER_H */
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 7ef11c0..7e5b910 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3440,12 +3440,27 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
vbasedev->proxy = proxy;
vfio_user_set_handler(vbasedev, vfio_user_pci_process_req, vdev);
+ if (udev->send_queued) {
+ proxy->flags |= VFIO_PROXY_FORCE_QUEUED;
+ }
+
+ vfio_user_validate_version(vbasedev, &err);
+ if (err != NULL) {
+ error_propagate(errp, err);
+ goto error;
+ }
+
vbasedev->name = g_strdup_printf("VFIO user <%s>", udev->sock_name);
vbasedev->dev = DEVICE(vdev);
vbasedev->fd = -1;
vbasedev->type = VFIO_DEVICE_TYPE_PCI;
vbasedev->ops = &vfio_user_pci_ops;
+ return;
+
+error:
+ vfio_user_disconnect(proxy);
+ error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.name);
}
static void vfio_user_instance_finalize(Object *obj)
@@ -3462,6 +3477,7 @@ static void vfio_user_instance_finalize(Object *obj)
static Property vfio_user_pci_dev_properties[] = {
DEFINE_PROP_STRING("socket", VFIOUserPCIDevice, sock_name),
+ DEFINE_PROP_BOOL("x-send-queued", VFIOUserPCIDevice, send_queued, false),
DEFINE_PROP_END_OF_LIST(),
};
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index 16b37cb..dc3f1a6 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -23,12 +23,20 @@
#include "io/channel-socket.h"
#include "io/channel-util.h"
#include "sysemu/iothread.h"
+#include "qapi/qmp/qdict.h"
+#include "qapi/qmp/qjson.h"
+#include "qapi/qmp/qnull.h"
+#include "qapi/qmp/qstring.h"
+#include "qapi/qmp/qnum.h"
#include "user.h"
-static uint64_t max_xfer_size;
+static uint64_t max_xfer_size = VFIO_USER_DEF_MAX_XFER;
+static uint64_t max_send_fds = VFIO_USER_DEF_MAX_FDS;
+static int wait_time = 1000; /* wait 1 sec for replies */
static IOThread *vfio_user_iothread;
static void vfio_user_shutdown(VFIOProxy *proxy);
+static int vfio_user_send_qio(VFIOProxy *proxy, VFIOUserMsg *msg);
static VFIOUserMsg *vfio_user_getmsg(VFIOProxy *proxy, VFIOUserHdr *hdr,
VFIOUserFDs *fds);
static VFIOUserFDs *vfio_user_getfds(int numfds);
@@ -36,9 +44,16 @@ static void vfio_user_recycle(VFIOProxy *proxy, VFIOUserMsg *msg);
static void vfio_user_recv(void *opaque);
static int vfio_user_recv_one(VFIOProxy *proxy);
+static void vfio_user_send(void *opaque);
+static int vfio_user_send_one(VFIOProxy *proxy, VFIOUserMsg *msg);
static void vfio_user_cb(void *opaque);
static void vfio_user_request(void *opaque);
+static int vfio_user_send_queued(VFIOProxy *proxy, VFIOUserMsg *msg);
+static void vfio_user_send_wait(VFIOProxy *proxy, VFIOUserHdr *hdr,
+ VFIOUserFDs *fds, int rsize, bool nobql);
+static void vfio_user_request_msg(VFIOUserHdr *hdr, uint16_t cmd,
+ uint32_t size, uint32_t flags);
static inline void vfio_user_set_error(VFIOUserHdr *hdr, uint32_t err)
{
@@ -56,6 +71,32 @@ static void vfio_user_shutdown(VFIOProxy *proxy)
qio_channel_set_aio_fd_handler(proxy->ioc, proxy->ctx, NULL, NULL, NULL);
}
+static int vfio_user_send_qio(VFIOProxy *proxy, VFIOUserMsg *msg)
+{
+ VFIOUserFDs *fds = msg->fds;
+ struct iovec iov = {
+ .iov_base = msg->hdr,
+ .iov_len = msg->hdr->size,
+ };
+ size_t numfds = 0;
+ int ret, *fdp = NULL;
+ Error *local_err = NULL;
+
+ if (fds != NULL && fds->send_fds != 0) {
+ numfds = fds->send_fds;
+ fdp = fds->fds;
+ }
+
+ ret = qio_channel_writev_full(proxy->ioc, &iov, 1, fdp, numfds, &local_err);
+
+ if (ret == -1) {
+ vfio_user_set_error(msg->hdr, EIO);
+ vfio_user_shutdown(proxy);
+ error_report_err(local_err);
+ }
+ return ret;
+}
+
static VFIOUserMsg *vfio_user_getmsg(VFIOProxy *proxy, VFIOUserHdr *hdr,
VFIOUserFDs *fds)
{
@@ -388,6 +429,53 @@ err:
return -1;
}
+/*
+ * Send messages from outgoing queue when the socket buffer has space.
+ * If we deplete 'outgoing', remove ourselves from the poll list.
+ */
+static void vfio_user_send(void *opaque)
+{
+ VFIOProxy *proxy = opaque;
+ VFIOUserMsg *msg;
+
+ QEMU_LOCK_GUARD(&proxy->lock);
+
+ if (proxy->state == VFIO_PROXY_CONNECTED) {
+ while (!QTAILQ_EMPTY(&proxy->outgoing)) {
+ msg = QTAILQ_FIRST(&proxy->outgoing);
+ if (vfio_user_send_one(proxy, msg) < 0) {
+ return;
+ }
+ }
+ qio_channel_set_aio_fd_handler(proxy->ioc, proxy->ctx,
+ vfio_user_recv, NULL, proxy);
+ }
+}
+
+/*
+ * Send a single message.
+ *
+ * Sent async messages are freed, others are moved to pending queue.
+ */
+static int vfio_user_send_one(VFIOProxy *proxy, VFIOUserMsg *msg)
+{
+ int ret;
+
+ ret = vfio_user_send_qio(proxy, msg);
+ if (ret < 0) {
+ return ret;
+ }
+
+ QTAILQ_REMOVE(&proxy->outgoing, msg, next);
+ if (msg->type == VFIO_MSG_ASYNC) {
+ vfio_user_recycle(proxy, msg);
+ } else {
+ QTAILQ_INSERT_TAIL(&proxy->pending, msg, next);
+ }
+
+ return 0;
+}
+
static void vfio_user_cb(void *opaque)
{
VFIOProxy *proxy = opaque;
@@ -448,6 +536,130 @@ static void vfio_user_request(void *opaque)
}
}
+/*
+ * Messages are queued onto the proxy's outgoing list.
+ *
+ * It handles 3 types of messages:
+ *
+ * async messages - replies and posted writes
+ *
+ * There will be no reply from the server, so message
+ * buffers are freed after they're sent.
+ *
+ * nowait messages - map/unmap during address space transactions
+ *
+ * These are also sent async, but a reply is expected so that
+ * vfio_wait_reqs() can wait for the youngest nowait request.
+ * They transition from the outgoing list to the pending list
+ * when sent, and are freed when the reply is received.
+ *
+ * wait messages - all other requests
+ *
+ * The reply to these messages is waited for by their caller.
+ * They also transition from outgoing to pending when sent, but
+ * the message buffer is returned to the caller with the reply
+ * contents. The caller is responsible for freeing these messages.
+ *
+ * As an optimization, if the outgoing list and the socket send
+ * buffer are empty, the message is sent inline instead of being
+ * added to the outgoing list. The rest of the transitions are
+ * unchanged.
+ *
+ * returns 0 if the message was sent or queued
+ * returns -1 on send error
+ */
+static int vfio_user_send_queued(VFIOProxy *proxy, VFIOUserMsg *msg)
+{
+ int ret;
+
+ /*
+ * Unsent outgoing msgs - add to tail
+ */
+ if (!QTAILQ_EMPTY(&proxy->outgoing)) {
+ QTAILQ_INSERT_TAIL(&proxy->outgoing, msg, next);
+ return 0;
+ }
+
+ /*
+ * Try inline - if blocked, queue it and kick send poller
+ */
+ if (proxy->flags & VFIO_PROXY_FORCE_QUEUED) {
+ ret = QIO_CHANNEL_ERR_BLOCK;
+ } else {
+ ret = vfio_user_send_qio(proxy, msg);
+ }
+ if (ret == QIO_CHANNEL_ERR_BLOCK) {
+ QTAILQ_INSERT_HEAD(&proxy->outgoing, msg, next);
+ qio_channel_set_aio_fd_handler(proxy->ioc, proxy->ctx,
+ vfio_user_recv, vfio_user_send,
+ proxy);
+ return 0;
+ }
+ if (ret == -1) {
+ return ret;
+ }
+
+ /*
+ * Sent - free async, add others to pending
+ */
+ if (msg->type == VFIO_MSG_ASYNC) {
+ vfio_user_recycle(proxy, msg);
+ } else {
+ QTAILQ_INSERT_TAIL(&proxy->pending, msg, next);
+ }
+
+ return 0;
+}
+
+static void vfio_user_send_wait(VFIOProxy *proxy, VFIOUserHdr *hdr,
+ VFIOUserFDs *fds, int rsize, bool nobql)
+{
+ VFIOUserMsg *msg;
+ bool iolock = false;
+ int ret;
+
+ if (hdr->flags & VFIO_USER_NO_REPLY) {
+ error_printf("vfio_user_send_wait on async message\n");
+ return;
+ }
+
+ /*
+ * We may block later, so use a per-proxy lock and drop
+ * BQL while we sleep unless 'nobql' says not to.
+ */
+ qemu_mutex_lock(&proxy->lock);
+ if (!nobql) {
+ iolock = qemu_mutex_iothread_locked();
+ if (iolock) {
+ qemu_mutex_unlock_iothread();
+ }
+ }
+
+ msg = vfio_user_getmsg(proxy, hdr, fds);
+ msg->id = hdr->id;
+ msg->rsize = rsize ? rsize : hdr->size;
+ msg->type = VFIO_MSG_WAIT;
+
+ ret = vfio_user_send_queued(proxy, msg);
+
+ if (ret == 0) {
+ while (!msg->complete) {
+ if (!qemu_cond_timedwait(&msg->cv, &proxy->lock, wait_time)) {
+ QTAILQ_REMOVE(&proxy->pending, msg, next);
+ vfio_user_set_error(hdr, ETIMEDOUT);
+ break;
+ }
+ }
+ }
+ vfio_user_recycle(proxy, msg);
+
+ /* lock order is BQL->proxy - don't hold proxy when getting BQL */
+ qemu_mutex_unlock(&proxy->lock);
+ if (iolock) {
+ qemu_mutex_lock_iothread();
+ }
+}
+
static QLIST_HEAD(, VFIOProxy) vfio_user_sockets =
QLIST_HEAD_INITIALIZER(vfio_user_sockets);
@@ -572,3 +784,203 @@ void vfio_user_disconnect(VFIOProxy *proxy)
g_free(proxy->sockname);
g_free(proxy);
}
+
+static void vfio_user_request_msg(VFIOUserHdr *hdr, uint16_t cmd,
+ uint32_t size, uint32_t flags)
+{
+ static uint16_t next_id;
+
+ hdr->id = qatomic_fetch_inc(&next_id);
+ hdr->command = cmd;
+ hdr->size = size;
+ hdr->flags = (flags & ~VFIO_USER_TYPE) | VFIO_USER_REQUEST;
+ hdr->error_reply = 0;
+}
+
+struct cap_entry {
+ const char *name;
+ int (*check)(QObject *qobj, Error **errp);
+};
+
+static int caps_parse(QDict *qdict, struct cap_entry caps[], Error **errp)
+{
+ QObject *qobj;
+ struct cap_entry *p;
+
+ for (p = caps; p->name != NULL; p++) {
+ qobj = qdict_get(qdict, p->name);
+ if (qobj != NULL) {
+ if (p->check(qobj, errp)) {
+ return -1;
+ }
+ qdict_del(qdict, p->name);
+ }
+ }
+
+ /* warning, for now */
+ if (qdict_size(qdict) != 0) {
+ error_printf("spurious capabilities\n");
+ }
+ return 0;
+}
+
+static int check_pgsize(QObject *qobj, Error **errp)
+{
+ QNum *qn = qobject_to(QNum, qobj);
+ uint64_t pgsize;
+
+ if (qn == NULL || !qnum_get_try_uint(qn, &pgsize)) {
+ error_setg(errp, "malformed %s", VFIO_USER_CAP_PGSIZE);
+ return -1;
+ }
+ return pgsize == 4096 ? 0 : -1;
+}
+
+static struct cap_entry caps_migr[] = {
+ { VFIO_USER_CAP_PGSIZE, check_pgsize },
+ { NULL }
+};
+
+static int check_max_fds(QObject *qobj, Error **errp)
+{
+ QNum *qn = qobject_to(QNum, qobj);
+
+ if (qn == NULL || !qnum_get_try_uint(qn, &max_send_fds) ||
+ max_send_fds > VFIO_USER_MAX_MAX_FDS) {
+ error_setg(errp, "malformed %s", VFIO_USER_CAP_MAX_FDS);
+ return -1;
+ }
+ return 0;
+}
+
+static int check_max_xfer(QObject *qobj, Error **errp)
+{
+ QNum *qn = qobject_to(QNum, qobj);
+
+ if (qn == NULL || !qnum_get_try_uint(qn, &max_xfer_size) ||
+ max_xfer_size > VFIO_USER_MAX_MAX_XFER) {
+ error_setg(errp, "malformed %s", VFIO_USER_CAP_MAX_XFER);
+ return -1;
+ }
+ return 0;
+}
+
+static int check_migr(QObject *qobj, Error **errp)
+{
+ QDict *qdict = qobject_to(QDict, qobj);
+
+ if (qdict == NULL) {
+ error_setg(errp, "malformed %s", VFIO_USER_CAP_MIGR);
+ return -1;
+ }
+ return caps_parse(qdict, caps_migr, errp);
+}
+
+static struct cap_entry caps_cap[] = {
+ { VFIO_USER_CAP_MAX_FDS, check_max_fds },
+ { VFIO_USER_CAP_MAX_XFER, check_max_xfer },
+ { VFIO_USER_CAP_MIGR, check_migr },
+ { NULL }
+};
+
+static int check_cap(QObject *qobj, Error **errp)
+{
+ QDict *qdict = qobject_to(QDict, qobj);
+
+ if (qdict == NULL) {
+ error_setg(errp, "malformed %s", VFIO_USER_CAP);
+ return -1;
+ }
+ return caps_parse(qdict, caps_cap, errp);
+}
+
+static struct cap_entry ver_0_0[] = {
+ { VFIO_USER_CAP, check_cap },
+ { NULL }
+};
+
+static int caps_check(int minor, const char *caps, Error **errp)
+{
+ QObject *qobj;
+ QDict *qdict;
+ int ret;
+
+ qobj = qobject_from_json(caps, NULL);
+ if (qobj == NULL) {
+ error_setg(errp, "malformed capabilities %s", caps);
+ return -1;
+ }
+ qdict = qobject_to(QDict, qobj);
+ if (qdict == NULL) {
+ error_setg(errp, "capabilities %s not an object", caps);
+ qobject_unref(qobj);
+ return -1;
+ }
+ ret = caps_parse(qdict, ver_0_0, errp);
+
+ qobject_unref(qobj);
+ return ret;
+}
+
+static GString *caps_json(void)
+{
+ QDict *dict = qdict_new();
+ QDict *capdict = qdict_new();
+ QDict *migdict = qdict_new();
+ GString *str;
+
+ qdict_put_int(migdict, VFIO_USER_CAP_PGSIZE, 4096);
+ qdict_put_obj(capdict, VFIO_USER_CAP_MIGR, QOBJECT(migdict));
+
+ qdict_put_int(capdict, VFIO_USER_CAP_MAX_FDS, VFIO_USER_MAX_MAX_FDS);
+ qdict_put_int(capdict, VFIO_USER_CAP_MAX_XFER, VFIO_USER_DEF_MAX_XFER);
+
+ qdict_put_obj(dict, VFIO_USER_CAP, QOBJECT(capdict));
+
+ str = qobject_to_json(QOBJECT(dict));
+ qobject_unref(dict);
+ return str;
+}
+
+int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp)
+{
+ g_autofree VFIOUserVersion *msgp = NULL;
+ GString *caps;
+ char *reply;
+ int size, caplen;
+
+ caps = caps_json();
+ caplen = caps->len + 1;
+ size = sizeof(*msgp) + caplen;
+ msgp = g_malloc0(size);
+
+ vfio_user_request_msg(&msgp->hdr, VFIO_USER_VERSION, size, 0);
+ msgp->major = VFIO_USER_MAJOR_VER;
+ msgp->minor = VFIO_USER_MINOR_VER;
+ memcpy(&msgp->capabilities, caps->str, caplen);
+ g_string_free(caps, true);
+
+ vfio_user_send_wait(vbasedev->proxy, &msgp->hdr, NULL, 0, false);
+ if (msgp->hdr.flags & VFIO_USER_ERROR) {
+ error_setg_errno(errp, msgp->hdr.error_reply, "version reply");
+ return -1;
+ }
+
+ if (msgp->major != VFIO_USER_MAJOR_VER ||
+ msgp->minor > VFIO_USER_MINOR_VER) {
+ error_setg(errp, "incompatible server version");
+ return -1;
+ }
+
+ reply = msgp->capabilities;
+ if (reply[msgp->hdr.size - sizeof(*msgp) - 1] != '\0') {
+ error_setg(errp, "corrupt version reply");
+ return -1;
+ }
+
+ if (caps_check(msgp->minor, reply, errp) != 0) {
+ return -1;
+ }
+
+ return 0;
+}
--
1.8.3.1
* [RFC v5 10/23] vfio-user: get device info
[not found] <cover.1651709440.git.john.g.johnson@oracle.com>
` (8 preceding siblings ...)
2022-05-05 17:19 ` [RFC v5 09/23] vfio-user: define socket send functions John Johnson
@ 2022-05-05 17:19 ` John Johnson
2022-05-05 17:19 ` [RFC v5 11/23] vfio-user: get region info John Johnson
` (12 subsequent siblings)
22 siblings, 0 replies; 23+ messages in thread
From: John Johnson @ 2022-05-05 17:19 UTC (permalink / raw)
To: qemu-devel
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
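The sanity checks applied to the VFIO_USER_DEVICE_GET_INFO reply can be
sketched as a single function. This is an illustration under assumptions:
dev_info_sketch and check_device_info() are hypothetical names, the limits
mirror VFIO_USER_MAX_REGIONS and VFIO_USER_MAX_IRQS from this patch,
FLAG_PCI mirrors VFIO_DEVICE_FLAGS_PCI from linux/vfio.h, and -ENODEV for a
non-PCI device is a choice made for the sketch (the patch reports that case
through error_setg() instead).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_REGIONS 100         /* VFIO_USER_MAX_REGIONS */
#define MAX_IRQS    50          /* VFIO_USER_MAX_IRQS */
#define FLAG_PCI    (1u << 1)   /* VFIO_DEVICE_FLAGS_PCI */

/* Hypothetical subset of struct vfio_device_info. */
struct dev_info_sketch {
    uint32_t argsz;
    uint32_t flags;
    uint32_t num_regions;
    uint32_t num_irqs;
};

/* Returns 0 if the reply is plausible, a negative errno otherwise. */
static int check_device_info(const struct dev_info_sketch *info)
{
    assert(info != NULL);

    /* Bound the counts so a hostile server cannot force huge allocations. */
    if (info->num_regions > MAX_REGIONS || info->num_irqs > MAX_IRQS) {
        return -EINVAL;
    }
    /* Only PCI devices are supported by the vfio-user PCI client. */
    if ((info->flags & FLAG_PCI) == 0) {
        return -ENODEV;
    }
    return 0;
}
```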
hw/vfio/user-protocol.h | 14 +++++++++++++
hw/vfio/user.h | 2 ++
hw/vfio/pci.c | 26 ++++++++++++++++++++++++
hw/vfio/user.c | 54 +++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 96 insertions(+)
diff --git a/hw/vfio/user-protocol.h b/hw/vfio/user-protocol.h
index a0889f6..4ad8f45 100644
--- a/hw/vfio/user-protocol.h
+++ b/hw/vfio/user-protocol.h
@@ -92,4 +92,18 @@ typedef struct {
#define VFIO_USER_DEF_MAX_XFER (1024 * 1024)
#define VFIO_USER_MAX_MAX_XFER (64 * 1024 * 1024)
+
+/*
+ * VFIO_USER_DEVICE_GET_INFO
+ * imported from struct vfio_device_info
+ */
+typedef struct {
+ VFIOUserHdr hdr;
+ uint32_t argsz;
+ uint32_t flags;
+ uint32_t num_regions;
+ uint32_t num_irqs;
+ uint32_t cap_offset;
+} VFIOUserDeviceInfo;
+
#endif /* VFIO_USER_PROTOCOL_H */
diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index 00d21bf..633b3ea 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -85,4 +85,6 @@ void vfio_user_set_handler(VFIODevice *vbasedev,
void *reqarg);
int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp);
+extern VFIODevIO vfio_dev_io_sock;
+
#endif /* VFIO_USER_H */
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 7e5b910..68d6f0c 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3416,6 +3416,8 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
VFIODevice *vbasedev = &vdev->vbasedev;
SocketAddress addr;
VFIOProxy *proxy;
+ struct vfio_device_info info;
+ int ret;
Error *err = NULL;
/*
@@ -3455,6 +3457,30 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
vbasedev->fd = -1;
vbasedev->type = VFIO_DEVICE_TYPE_PCI;
vbasedev->ops = &vfio_user_pci_ops;
+ vbasedev->io_ops = &vfio_dev_io_sock;
+
+ ret = VDEV_GET_INFO(vbasedev, &info);
+ if (ret) {
+ error_setg_errno(errp, -ret, "get info failure");
+ goto error;
+ }
+ /* must be PCI */
+ if ((info.flags & VFIO_DEVICE_FLAGS_PCI) == 0) {
+ error_setg(errp, "remote device not PCI");
+ goto error;
+ }
+
+ vbasedev->num_irqs = info.num_irqs;
+ vbasedev->num_regions = info.num_regions;
+ vbasedev->flags = info.flags;
+ vbasedev->reset_works = !!(info.flags & VFIO_DEVICE_FLAGS_RESET);
+
+ vfio_get_all_regions(vbasedev);
+ vfio_populate_device(vdev, &err);
+ if (err) {
+ error_propagate(errp, err);
+ goto error;
+ }
return;
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index dc3f1a6..51e23dd 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -30,6 +30,13 @@
#include "qapi/qmp/qnum.h"
#include "user.h"
+/*
+ * These are to defend against a malicious server trying
+ * to force us to run out of memory.
+ */
+#define VFIO_USER_MAX_REGIONS 100
+#define VFIO_USER_MAX_IRQS 50
+
static uint64_t max_xfer_size = VFIO_USER_DEF_MAX_XFER;
static uint64_t max_send_fds = VFIO_USER_DEF_MAX_FDS;
static int wait_time = 1000; /* wait 1 sec for replies */
@@ -984,3 +991,50 @@ int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp)
return 0;
}
+
+static int vfio_user_get_info(VFIOProxy *proxy, struct vfio_device_info *info)
+{
+ VFIOUserDeviceInfo msg;
+
+ memset(&msg, 0, sizeof(msg));
+ vfio_user_request_msg(&msg.hdr, VFIO_USER_DEVICE_GET_INFO, sizeof(msg), 0);
+ msg.argsz = sizeof(struct vfio_device_info);
+
+ vfio_user_send_wait(proxy, &msg.hdr, NULL, 0, false);
+ if (msg.hdr.flags & VFIO_USER_ERROR) {
+ return -msg.hdr.error_reply;
+ }
+
+ memcpy(info, &msg.argsz, sizeof(*info));
+ return 0;
+}
+
+
+/*
+ * Socket-based io_ops
+ */
+
+static int vfio_user_io_get_info(VFIODevice *vbasedev,
+ struct vfio_device_info *info)
+{
+ int ret;
+
+ ret = vfio_user_get_info(vbasedev->proxy, info);
+ if (ret) {
+ return ret;
+ }
+
+ /* defend against a malicious server */
+ if (info->num_regions > VFIO_USER_MAX_REGIONS ||
+ info->num_irqs > VFIO_USER_MAX_IRQS) {
+ error_printf("vfio_user_get_info: invalid reply\n");
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
+VFIODevIO vfio_dev_io_sock = {
+ .get_info = vfio_user_io_get_info,
+};
+
--
1.8.3.1
* [RFC v5 11/23] vfio-user: get region info
[not found] <cover.1651709440.git.john.g.johnson@oracle.com>
` (9 preceding siblings ...)
2022-05-05 17:19 ` [RFC v5 10/23] vfio-user: get device info John Johnson
@ 2022-05-05 17:19 ` John Johnson
2022-05-05 17:19 ` [RFC v5 12/23] vfio-user: region read/write John Johnson
` (11 subsequent siblings)
22 siblings, 0 replies; 23+ messages in thread
From: John Johnson @ 2022-05-05 17:19 UTC (permalink / raw)
To: qemu-devel
Add per-region FD to support mmap() of remote device regions
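
As a rough illustration of the fd selection this patch adds to
vfio_region_setup(), the following self-contained sketch uses a
hypothetical, trimmed-down DemoDevice struct (not the real VFIODevice)
and a hypothetical helper demo_region_fd(). It assumes only what the
diff below shows: when a per-region fd array exists it is used,
otherwise the single device fd is used.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for VFIODevice; only the fields needed here. */
typedef struct {
    int fd;               /* device fd (kernel VFIO); -1 for vfio-user */
    int *regfds;          /* per-region fds; NULL for kernel VFIO */
    unsigned num_regions; /* number of regions reported by the device */
} DemoDevice;

/* Return the fd that mmap() should use for region 'index'. */
static int demo_region_fd(const DemoDevice *dev, unsigned index)
{
    assert(index < dev->num_regions);

    if (dev->regfds != NULL) {
        /* vfio-user: the server sent one fd per region (or -1 for none) */
        return dev->regfds[index];
    }
    /* kernel VFIO: every region is mapped through the device fd */
    return dev->fd;
}
```

A returned value of -1 means the server supplied no fd for that region,
so the region cannot be mapped and must be accessed through messages.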
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
hw/vfio/user-protocol.h | 14 ++++++++++
include/hw/vfio/vfio-common.h | 8 +++---
hw/vfio/common.c | 32 ++++++++++++++++++++---
hw/vfio/user.c | 59 +++++++++++++++++++++++++++++++++++++++++++
4 files changed, 107 insertions(+), 6 deletions(-)
diff --git a/hw/vfio/user-protocol.h b/hw/vfio/user-protocol.h
index 4ad8f45..caa523a 100644
--- a/hw/vfio/user-protocol.h
+++ b/hw/vfio/user-protocol.h
@@ -106,4 +106,18 @@ typedef struct {
uint32_t cap_offset;
} VFIOUserDeviceInfo;
+/*
+ * VFIO_USER_DEVICE_GET_REGION_INFO
+ * imported from struct_vfio_region_info
+ */
+typedef struct {
+ VFIOUserHdr hdr;
+ uint32_t argsz;
+ uint32_t flags;
+ uint32_t index;
+ uint32_t cap_offset;
+ uint64_t size;
+ uint64_t offset;
+} VFIOUserRegionInfo;
+
#endif /* VFIO_USER_PROTOCOL_H */
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 3eb0b19..2552557 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -56,6 +56,7 @@ typedef struct VFIORegion {
uint32_t nr_mmaps;
VFIOMmap *mmaps;
uint8_t nr; /* cache the region number for debug */
+ int fd; /* fd to mmap() region */
} VFIORegion;
typedef struct VFIOMigration {
@@ -150,6 +151,7 @@ typedef struct VFIODevice {
OnOffAuto pre_copy_dirty_page_tracking;
VFIOProxy *proxy;
struct vfio_region_info **regions;
+ int *regfds;
} VFIODevice;
struct VFIODeviceOps {
@@ -172,7 +174,7 @@ struct VFIODeviceOps {
struct VFIODevIO {
int (*get_info)(VFIODevice *vdev, struct vfio_device_info *info);
int (*get_region_info)(VFIODevice *vdev,
- struct vfio_region_info *info);
+ struct vfio_region_info *info, int *fd);
int (*get_irq_info)(VFIODevice *vdev, struct vfio_irq_info *irq);
int (*set_irqs)(VFIODevice *vdev, struct vfio_irq_set *irqs);
int (*region_read)(VFIODevice *vdev, uint8_t nr, off_t off, uint32_t size,
@@ -183,8 +185,8 @@ struct VFIODevIO {
#define VDEV_GET_INFO(vdev, info) \
((vdev)->io_ops->get_info((vdev), (info)))
-#define VDEV_GET_REGION_INFO(vdev, info) \
- ((vdev)->io_ops->get_region_info((vdev), (info)))
+#define VDEV_GET_REGION_INFO(vdev, info, fd) \
+ ((vdev)->io_ops->get_region_info((vdev), (info), (fd)))
#define VDEV_GET_IRQ_INFO(vdev, irq) \
((vdev)->io_ops->get_irq_info((vdev), (irq)))
#define VDEV_SET_IRQS(vdev, irqs) \
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index da18fd5..c30da14 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -40,6 +40,7 @@
#include "trace.h"
#include "qapi/error.h"
#include "migration/migration.h"
+#include "hw/vfio/user.h"
VFIOGroupList vfio_group_list =
QLIST_HEAD_INITIALIZER(vfio_group_list);
@@ -1554,6 +1555,11 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
region->size = info->size;
region->fd_offset = info->offset;
region->nr = index;
+ if (vbasedev->regfds != NULL) {
+ region->fd = vbasedev->regfds[index];
+ } else {
+ region->fd = vbasedev->fd;
+ }
if (region->size) {
region->mem = g_new0(MemoryRegion, 1);
@@ -1605,7 +1611,7 @@ int vfio_region_mmap(VFIORegion *region)
for (i = 0; i < region->nr_mmaps; i++) {
region->mmaps[i].mmap = mmap(NULL, region->mmaps[i].size, prot,
- MAP_SHARED, region->vbasedev->fd,
+ MAP_SHARED, region->fd,
region->fd_offset +
region->mmaps[i].offset);
if (region->mmaps[i].mmap == MAP_FAILED) {
@@ -2410,10 +2416,17 @@ void vfio_put_base_device(VFIODevice *vbasedev)
int i;
for (i = 0; i < vbasedev->num_regions; i++) {
+ if (vbasedev->regfds != NULL && vbasedev->regfds[i] != -1) {
+ close(vbasedev->regfds[i]);
+ }
g_free(vbasedev->regions[i]);
}
g_free(vbasedev->regions);
vbasedev->regions = NULL;
+ if (vbasedev->regfds != NULL) {
+ g_free(vbasedev->regfds);
+ vbasedev->regfds = NULL;
+ }
}
if (!vbasedev->group) {
@@ -2429,12 +2442,16 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
struct vfio_region_info **info)
{
size_t argsz = sizeof(struct vfio_region_info);
+ int fd = -1;
int ret;
/* create region cache */
if (vbasedev->regions == NULL) {
vbasedev->regions = g_new0(struct vfio_region_info *,
vbasedev->num_regions);
+ if (vbasedev->proxy != NULL) {
+ vbasedev->regfds = g_new0(int, vbasedev->num_regions);
+ }
}
/* check cache */
if (vbasedev->regions[index] != NULL) {
@@ -2448,7 +2465,7 @@ int vfio_get_region_info(VFIODevice *vbasedev, int index,
retry:
(*info)->argsz = argsz;
- ret = VDEV_GET_REGION_INFO(vbasedev, *info);
+ ret = VDEV_GET_REGION_INFO(vbasedev, *info, &fd);
if (ret != 0) {
g_free(*info);
*info = NULL;
@@ -2458,12 +2475,19 @@ retry:
if ((*info)->argsz > argsz) {
argsz = (*info)->argsz;
*info = g_realloc(*info, argsz);
+ if (fd != -1) {
+ close(fd);
+ fd = -1;
+ }
goto retry;
}
/* fill cache */
vbasedev->regions[index] = *info;
+ if (vbasedev->regfds != NULL) {
+ vbasedev->regfds[index] = fd;
+ }
return 0;
}
@@ -2623,10 +2647,12 @@ static int vfio_io_get_info(VFIODevice *vbasedev, struct vfio_device_info *info)
}
static int vfio_io_get_region_info(VFIODevice *vbasedev,
- struct vfio_region_info *info)
+ struct vfio_region_info *info,
+ int *fd)
{
int ret;
+ *fd = -1;
ret = ioctl(vbasedev->fd, VFIO_DEVICE_GET_REGION_INFO, info);
return ret < 0 ? -errno : ret;
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index 51e23dd..c87699a 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -1009,6 +1009,40 @@ static int vfio_user_get_info(VFIOProxy *proxy, struct vfio_device_info *info)
return 0;
}
+static int vfio_user_get_region_info(VFIOProxy *proxy,
+ struct vfio_region_info *info,
+ VFIOUserFDs *fds)
+{
+ g_autofree VFIOUserRegionInfo *msgp = NULL;
+ uint32_t size;
+
+ /* data returned can be larger than vfio_region_info */
+ if (info->argsz < sizeof(*info)) {
+ error_printf("vfio_user_get_region_info argsz too small\n");
+ return -EINVAL;
+ }
+ if (fds != NULL && fds->send_fds != 0) {
+ error_printf("vfio_user_get_region_info can't send FDs\n");
+ return -EINVAL;
+ }
+
+ size = info->argsz + sizeof(VFIOUserHdr);
+ msgp = g_malloc0(size);
+
+ vfio_user_request_msg(&msgp->hdr, VFIO_USER_DEVICE_GET_REGION_INFO,
+ sizeof(*msgp), 0);
+ msgp->argsz = info->argsz;
+ msgp->index = info->index;
+
+ vfio_user_send_wait(proxy, &msgp->hdr, fds, size, false);
+ if (msgp->hdr.flags & VFIO_USER_ERROR) {
+ return -msgp->hdr.error_reply;
+ }
+
+ memcpy(info, &msgp->argsz, info->argsz);
+ return 0;
+}
+
/*
* Socket-based io_ops
@@ -1034,7 +1068,32 @@ static int vfio_user_io_get_info(VFIODevice *vbasedev,
return 0;
}
+static int vfio_user_io_get_region_info(VFIODevice *vbasedev,
+ struct vfio_region_info *info,
+ int *fd)
+{
+ int ret;
+ VFIOUserFDs fds = { 0, 1, fd};
+
+ ret = vfio_user_get_region_info(vbasedev->proxy, info, &fds);
+ if (ret) {
+ return ret;
+ }
+
+ if (info->index >= vbasedev->num_regions) {
+ return -EINVAL;
+ }
+ /* cap_offset in valid area */
+ if ((info->flags & VFIO_REGION_INFO_FLAG_CAPS) &&
+ (info->cap_offset < sizeof(*info) || info->cap_offset > info->argsz)) {
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
VFIODevIO vfio_dev_io_sock = {
.get_info = vfio_user_io_get_info,
+ .get_region_info = vfio_user_io_get_region_info,
};
--
1.8.3.1
* [RFC v5 12/23] vfio-user: region read/write
[not found] <cover.1651709440.git.john.g.johnson@oracle.com>
` (10 preceding siblings ...)
2022-05-05 17:19 ` [RFC v5 11/23] vfio-user: get region info John Johnson
@ 2022-05-05 17:19 ` John Johnson
2022-05-05 17:19 ` [RFC v5 13/23] vfio-user: pci_user_realize PCI setup John Johnson
` (10 subsequent siblings)
22 siblings, 0 replies; 23+ messages in thread
From: John Johnson @ 2022-05-05 17:19 UTC (permalink / raw)
To: qemu-devel
Add support for posted writes on remote devices
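
The decision of whether a write is posted is spread over several hunks
in this patch, so here is a minimal sketch of how the pieces appear to
combine. DemoRegion, DEMO_PROXY_NO_POST and demo_write_is_posted() are
hypothetical names for illustration; the logic assumes only what the
diff below shows: a region must allow posting, must not be mmap()ed,
and the proxy must not have posted writes disabled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors the value of VFIO_PROXY_NO_POST introduced by this patch. */
#define DEMO_PROXY_NO_POST 0x8

/* Hypothetical stand-in for the VFIORegion fields that matter here. */
typedef struct {
    bool post_wr;      /* writes to this region may be posted */
    uint32_t nr_mmaps; /* non-zero if the guest can access it directly */
} DemoRegion;

/* Return true if a write should be sent without waiting for a reply. */
static bool demo_write_is_posted(const DemoRegion *region,
                                 uint32_t proxy_flags)
{
    bool post;

    assert(region != NULL);
    post = region->post_wr;

    /* read-after-write hazard if the guest can directly access region */
    if (region->nr_mmaps) {
        post = false;
    }
    /* x-no-posted-writes forces every write to be synchronous */
    return post && !(proxy_flags & DEMO_PROXY_NO_POST);
}
```

With this model, a memory BAR without mmaps is posted by default, while
IO BARs, mmap()ed regions and PCI config space writes stay synchronous.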
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
hw/vfio/pci.h | 1 +
hw/vfio/user-protocol.h | 12 +++++
hw/vfio/user.h | 1 +
include/hw/vfio/vfio-common.h | 7 +--
hw/vfio/common.c | 10 +++-
hw/vfio/pci.c | 9 +++-
hw/vfio/user.c | 109 ++++++++++++++++++++++++++++++++++++++++++
7 files changed, 143 insertions(+), 6 deletions(-)
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index ec9f345..643ff75 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -194,6 +194,7 @@ struct VFIOUserPCIDevice {
VFIOPCIDevice device;
char *sock_name;
bool send_queued; /* all sends are queued */
+ bool no_post; /* all region writes are sync */
};
/* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot match hw */
diff --git a/hw/vfio/user-protocol.h b/hw/vfio/user-protocol.h
index caa523a..b1ea55f 100644
--- a/hw/vfio/user-protocol.h
+++ b/hw/vfio/user-protocol.h
@@ -120,4 +120,16 @@ typedef struct {
uint64_t offset;
} VFIOUserRegionInfo;
+/*
+ * VFIO_USER_REGION_READ
+ * VFIO_USER_REGION_WRITE
+ */
+typedef struct {
+ VFIOUserHdr hdr;
+ uint64_t offset;
+ uint32_t region;
+ uint32_t count;
+ char data[];
+} VFIOUserRegionRW;
+
#endif /* VFIO_USER_PROTOCOL_H */
diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index 633b3ea..a641351 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -77,6 +77,7 @@ typedef struct VFIOProxy {
/* VFIOProxy flags */
#define VFIO_PROXY_CLIENT 0x1
#define VFIO_PROXY_FORCE_QUEUED 0x4
+#define VFIO_PROXY_NO_POST 0x8
VFIOProxy *vfio_user_connect_dev(SocketAddress *addr, Error **errp);
void vfio_user_disconnect(VFIOProxy *proxy);
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 2552557..4118b8a 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -57,6 +57,7 @@ typedef struct VFIORegion {
VFIOMmap *mmaps;
uint8_t nr; /* cache the region number for debug */
int fd; /* fd to mmap() region */
+ bool post_wr; /* writes can be posted */
} VFIORegion;
typedef struct VFIOMigration {
@@ -180,7 +181,7 @@ struct VFIODevIO {
int (*region_read)(VFIODevice *vdev, uint8_t nr, off_t off, uint32_t size,
void *data);
int (*region_write)(VFIODevice *vdev, uint8_t nr, off_t off, uint32_t size,
- void *data);
+ void *data, bool post);
};
#define VDEV_GET_INFO(vdev, info) \
@@ -193,8 +194,8 @@ struct VFIODevIO {
((vdev)->io_ops->set_irqs((vdev), (irqs)))
#define VDEV_REGION_READ(vdev, nr, off, size, data) \
((vdev)->io_ops->region_read((vdev), (nr), (off), (size), (data)))
-#define VDEV_REGION_WRITE(vdev, nr, off, size, data) \
- ((vdev)->io_ops->region_write((vdev), (nr), (off), (size), (data)))
+#define VDEV_REGION_WRITE(vdev, nr, off, size, data, post) \
+ ((vdev)->io_ops->region_write((vdev), (nr), (off), (size), (data), (post)))
struct VFIOContIO {
int (*dma_map)(VFIOContainer *container,
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index c30da14..351f727 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -213,6 +213,7 @@ void vfio_region_write(void *opaque, hwaddr addr,
uint32_t dword;
uint64_t qword;
} buf;
+ bool post = region->post_wr;
int ret;
switch (size) {
@@ -233,7 +234,11 @@ void vfio_region_write(void *opaque, hwaddr addr,
break;
}
- ret = VDEV_REGION_WRITE(vbasedev, region->nr, addr, size, &buf);
+ /* read-after-write hazard if guest can directly access region */
+ if (region->nr_mmaps) {
+ post = false;
+ }
+ ret = VDEV_REGION_WRITE(vbasedev, region->nr, addr, size, &buf, post);
if (ret != size) {
const char *err = ret < 0 ? strerror(-ret) : "short write";
@@ -1555,6 +1560,7 @@ int vfio_region_setup(Object *obj, VFIODevice *vbasedev, VFIORegion *region,
region->size = info->size;
region->fd_offset = info->offset;
region->nr = index;
+ region->post_wr = false;
if (vbasedev->regfds != NULL) {
region->fd = vbasedev->regfds[index];
} else {
@@ -2689,7 +2695,7 @@ static int vfio_io_region_read(VFIODevice *vbasedev, uint8_t index, off_t off,
}
static int vfio_io_region_write(VFIODevice *vbasedev, uint8_t index, off_t off,
- uint32_t size, void *data)
+ uint32_t size, void *data, bool post)
{
struct vfio_region_info *info = vbasedev->regions[index];
int ret;
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 68d6f0c..98520dd 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -51,7 +51,7 @@
(size), (data))
#define VDEV_CONFIG_WRITE(vbasedev, off, size, data) \
VDEV_REGION_WRITE((vbasedev), VFIO_PCI_CONFIG_REGION_INDEX, (off), \
- (size), (data))
+ (size), (data), false)
#define TYPE_VFIO_PCI_NOHOTPLUG "vfio-pci-nohotplug"
@@ -1661,6 +1661,9 @@ static void vfio_bar_prepare(VFIOPCIDevice *vdev, int nr)
bar->type = pci_bar & (bar->ioport ? ~PCI_BASE_ADDRESS_IO_MASK :
~PCI_BASE_ADDRESS_MEM_MASK);
bar->size = bar->region.size;
+
+ /* IO regions are sync, memory can be async */
+ bar->region.post_wr = (bar->ioport == 0);
}
static void vfio_bars_prepare(VFIOPCIDevice *vdev)
@@ -3445,6 +3448,9 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
if (udev->send_queued) {
proxy->flags |= VFIO_PROXY_FORCE_QUEUED;
}
+ if (udev->no_post) {
+ proxy->flags |= VFIO_PROXY_NO_POST;
+ }
vfio_user_validate_version(vbasedev, &err);
if (err != NULL) {
@@ -3504,6 +3510,7 @@ static void vfio_user_instance_finalize(Object *obj)
static Property vfio_user_pci_dev_properties[] = {
DEFINE_PROP_STRING("socket", VFIOUserPCIDevice, sock_name),
DEFINE_PROP_BOOL("x-send-queued", VFIOUserPCIDevice, send_queued, false),
+ DEFINE_PROP_BOOL("x-no-posted-writes", VFIOUserPCIDevice, no_post, false),
DEFINE_PROP_END_OF_LIST(),
};
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index c87699a..fb6851e 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -57,6 +57,8 @@ static void vfio_user_cb(void *opaque);
static void vfio_user_request(void *opaque);
static int vfio_user_send_queued(VFIOProxy *proxy, VFIOUserMsg *msg);
+static void vfio_user_send_async(VFIOProxy *proxy, VFIOUserHdr *hdr,
+ VFIOUserFDs *fds);
static void vfio_user_send_wait(VFIOProxy *proxy, VFIOUserHdr *hdr,
VFIOUserFDs *fds, int rsize, bool nobql);
static void vfio_user_request_msg(VFIOUserHdr *hdr, uint16_t cmd,
@@ -618,6 +620,33 @@ static int vfio_user_send_queued(VFIOProxy *proxy, VFIOUserMsg *msg)
return 0;
}
+/*
+ * async send - msg can be queued, but will be freed when sent
+ */
+static void vfio_user_send_async(VFIOProxy *proxy, VFIOUserHdr *hdr,
+ VFIOUserFDs *fds)
+{
+ VFIOUserMsg *msg;
+ int ret;
+
+ if (!(hdr->flags & (VFIO_USER_NO_REPLY | VFIO_USER_REPLY))) {
+ error_printf("vfio_user_send_async on sync message\n");
+ return;
+ }
+
+ QEMU_LOCK_GUARD(&proxy->lock);
+
+ msg = vfio_user_getmsg(proxy, hdr, fds);
+ msg->id = hdr->id;
+ msg->rsize = 0;
+ msg->type = VFIO_MSG_ASYNC;
+
+ ret = vfio_user_send_queued(proxy, msg);
+ if (ret < 0) {
+ vfio_user_recycle(proxy, msg);
+ }
+}
+
static void vfio_user_send_wait(VFIOProxy *proxy, VFIOUserHdr *hdr,
VFIOUserFDs *fds, int rsize, bool nobql)
{
@@ -1043,6 +1072,70 @@ static int vfio_user_get_region_info(VFIOProxy *proxy,
return 0;
}
+static int vfio_user_region_read(VFIOProxy *proxy, uint8_t index, off_t offset,
+ uint32_t count, void *data)
+{
+ g_autofree VFIOUserRegionRW *msgp = NULL;
+ int size = sizeof(*msgp) + count;
+
+ if (count > max_xfer_size) {
+ return -EINVAL;
+ }
+
+ msgp = g_malloc0(size);
+ vfio_user_request_msg(&msgp->hdr, VFIO_USER_REGION_READ, sizeof(*msgp), 0);
+ msgp->offset = offset;
+ msgp->region = index;
+ msgp->count = count;
+
+ vfio_user_send_wait(proxy, &msgp->hdr, NULL, size, false);
+ if (msgp->hdr.flags & VFIO_USER_ERROR) {
+ return -msgp->hdr.error_reply;
+ } else if (msgp->count > count) {
+ return -E2BIG;
+ } else {
+ memcpy(data, &msgp->data, msgp->count);
+ }
+
+ return msgp->count;
+}
+
+static int vfio_user_region_write(VFIOProxy *proxy, uint8_t index, off_t offset,
+ uint32_t count, void *data, bool post)
+{
+ VFIOUserRegionRW *msgp = NULL;
+ int flags = post ? VFIO_USER_NO_REPLY : 0;
+ int size = sizeof(*msgp) + count;
+ int ret;
+
+ if (count > max_xfer_size) {
+ return -EINVAL;
+ }
+
+ msgp = g_malloc0(size);
+ vfio_user_request_msg(&msgp->hdr, VFIO_USER_REGION_WRITE, size, flags);
+ msgp->offset = offset;
+ msgp->region = index;
+ msgp->count = count;
+ memcpy(&msgp->data, data, count);
+
+ /* async send will free msg after it's sent */
+ if (post && !(proxy->flags & VFIO_PROXY_NO_POST)) {
+ vfio_user_send_async(proxy, &msgp->hdr, NULL);
+ return count;
+ }
+
+ vfio_user_send_wait(proxy, &msgp->hdr, NULL, 0, false);
+ if (msgp->hdr.flags & VFIO_USER_ERROR) {
+ ret = -msgp->hdr.error_reply;
+ } else {
+ ret = count;
+ }
+
+ g_free(msgp);
+ return ret;
+}
+
/*
* Socket-based io_ops
@@ -1092,8 +1185,24 @@ static int vfio_user_io_get_region_info(VFIODevice *vbasedev,
return 0;
}
+static int vfio_user_io_region_read(VFIODevice *vbasedev, uint8_t index,
+ off_t off, uint32_t size, void *data)
+{
+ return vfio_user_region_read(vbasedev->proxy, index, off, size, data);
+}
+
+static int vfio_user_io_region_write(VFIODevice *vbasedev, uint8_t index,
+ off_t off, unsigned size, void *data,
+ bool post)
+{
+ return vfio_user_region_write(vbasedev->proxy, index, off, size, data,
+ post);
+}
+
VFIODevIO vfio_dev_io_sock = {
.get_info = vfio_user_io_get_info,
.get_region_info = vfio_user_io_get_region_info,
+ .region_read = vfio_user_io_region_read,
+ .region_write = vfio_user_io_region_write,
};
--
1.8.3.1
* [RFC v5 13/23] vfio-user: pci_user_realize PCI setup
[not found] <cover.1651709440.git.john.g.johnson@oracle.com>
` (11 preceding siblings ...)
2022-05-05 17:19 ` [RFC v5 12/23] vfio-user: region read/write John Johnson
@ 2022-05-05 17:19 ` John Johnson
2022-05-05 17:19 ` [RFC v5 14/23] vfio-user: forward msix BAR accesses to server John Johnson
` (9 subsequent siblings)
22 siblings, 0 replies; 23+ messages in thread
From: John Johnson @ 2022-05-05 17:19 UTC (permalink / raw)
To: qemu-devel
PCI BARs read from remote device
PCI config reads/writes sent to remote server
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
hw/vfio/pci.c | 275 ++++++++++++++++++++++++++++++++++++----------------------
1 file changed, 172 insertions(+), 103 deletions(-)
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 98520dd..1be6683 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -2831,6 +2831,132 @@ static void vfio_unregister_req_notifier(VFIOPCIDevice *vdev)
vdev->req_enabled = false;
}
+static void vfio_pci_config_setup(VFIOPCIDevice *vdev, Error **errp)
+{
+ PCIDevice *pdev = &vdev->pdev;
+ Error *err = NULL;
+
+ /* vfio emulates a lot for us, but some bits need extra love */
+ vdev->emulated_config_bits = g_malloc0(vdev->config_size);
+
+ /* QEMU can choose to expose the ROM or not */
+ memset(vdev->emulated_config_bits + PCI_ROM_ADDRESS, 0xff, 4);
+ /* QEMU can also add or extend BARs */
+ memset(vdev->emulated_config_bits + PCI_BASE_ADDRESS_0, 0xff, 6 * 4);
+
+ /*
+ * The PCI spec reserves vendor ID 0xffff as an invalid value. The
+ * device ID is managed by the vendor and need only be a 16-bit value.
+ * Allow any 16-bit value for subsystem so they can be hidden or changed.
+ */
+ if (vdev->vendor_id != PCI_ANY_ID) {
+ if (vdev->vendor_id >= 0xffff) {
+ error_setg(errp, "invalid PCI vendor ID provided");
+ return;
+ }
+ vfio_add_emulated_word(vdev, PCI_VENDOR_ID, vdev->vendor_id, ~0);
+ trace_vfio_pci_emulated_vendor_id(vdev->vbasedev.name, vdev->vendor_id);
+ } else {
+ vdev->vendor_id = pci_get_word(pdev->config + PCI_VENDOR_ID);
+ }
+
+ if (vdev->device_id != PCI_ANY_ID) {
+ if (vdev->device_id > 0xffff) {
+ error_setg(errp, "invalid PCI device ID provided");
+ return;
+ }
+ vfio_add_emulated_word(vdev, PCI_DEVICE_ID, vdev->device_id, ~0);
+ trace_vfio_pci_emulated_device_id(vdev->vbasedev.name, vdev->device_id);
+ } else {
+ vdev->device_id = pci_get_word(pdev->config + PCI_DEVICE_ID);
+ }
+
+ if (vdev->sub_vendor_id != PCI_ANY_ID) {
+ if (vdev->sub_vendor_id > 0xffff) {
+ error_setg(errp, "invalid PCI subsystem vendor ID provided");
+ return;
+ }
+ vfio_add_emulated_word(vdev, PCI_SUBSYSTEM_VENDOR_ID,
+ vdev->sub_vendor_id, ~0);
+ trace_vfio_pci_emulated_sub_vendor_id(vdev->vbasedev.name,
+ vdev->sub_vendor_id);
+ }
+
+ if (vdev->sub_device_id != PCI_ANY_ID) {
+ if (vdev->sub_device_id > 0xffff) {
+ error_setg(errp, "invalid PCI subsystem device ID provided");
+ return;
+ }
+ vfio_add_emulated_word(vdev, PCI_SUBSYSTEM_ID, vdev->sub_device_id, ~0);
+ trace_vfio_pci_emulated_sub_device_id(vdev->vbasedev.name,
+ vdev->sub_device_id);
+ }
+
+ /* QEMU can change multi-function devices to single function, or reverse */
+ vdev->emulated_config_bits[PCI_HEADER_TYPE] =
+ PCI_HEADER_TYPE_MULTI_FUNCTION;
+
+ /* Restore or clear multifunction, this is always controlled by QEMU */
+ if (vdev->pdev.cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
+ vdev->pdev.config[PCI_HEADER_TYPE] |= PCI_HEADER_TYPE_MULTI_FUNCTION;
+ } else {
+ vdev->pdev.config[PCI_HEADER_TYPE] &= ~PCI_HEADER_TYPE_MULTI_FUNCTION;
+ }
+
+ /*
+ * Clear host resource mapping info. If we choose not to register a
+ * BAR, such as might be the case with the option ROM, we can get
+ * confusing, unwritable, residual addresses from the host here.
+ */
+ memset(&vdev->pdev.config[PCI_BASE_ADDRESS_0], 0, 24);
+ memset(&vdev->pdev.config[PCI_ROM_ADDRESS], 0, 4);
+
+ vfio_pci_size_rom(vdev);
+
+ vfio_bars_prepare(vdev);
+
+ vfio_msix_early_setup(vdev, &err);
+ if (err) {
+ error_propagate(errp, err);
+ return;
+ }
+
+ vfio_bars_register(vdev);
+}
+
+static int vfio_interrupt_setup(VFIOPCIDevice *vdev, Error **errp)
+{
+ PCIDevice *pdev = &vdev->pdev;
+ int ret;
+
+ /* QEMU emulates all of MSI & MSIX */
+ if (pdev->cap_present & QEMU_PCI_CAP_MSIX) {
+ memset(vdev->emulated_config_bits + pdev->msix_cap, 0xff,
+ MSIX_CAP_LENGTH);
+ }
+
+ if (pdev->cap_present & QEMU_PCI_CAP_MSI) {
+ memset(vdev->emulated_config_bits + pdev->msi_cap, 0xff,
+ vdev->msi_cap_size);
+ }
+
+ if (vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1)) {
+ vdev->intx.mmap_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
+ vfio_intx_mmap_enable, vdev);
+ pci_device_set_intx_routing_notifier(&vdev->pdev,
+ vfio_intx_routing_notifier);
+ vdev->irqchip_change_notifier.notify = vfio_irqchip_change;
+ kvm_irqchip_add_change_notifier(&vdev->irqchip_change_notifier);
+ ret = vfio_intx_enable(vdev, errp);
+ if (ret) {
+ pci_device_set_intx_routing_notifier(&vdev->pdev, NULL);
+ kvm_irqchip_remove_change_notifier(&vdev->irqchip_change_notifier);
+ return ret;
+ }
+ }
+ return 0;
+}
+
static void vfio_realize(PCIDevice *pdev, Error **errp)
{
VFIOPCIDevice *vdev = VFIO_PCI_BASE(pdev);
@@ -2946,92 +3072,16 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
goto error;
}
- /* vfio emulates a lot for us, but some bits need extra love */
- vdev->emulated_config_bits = g_malloc0(vdev->config_size);
-
- /* QEMU can choose to expose the ROM or not */
- memset(vdev->emulated_config_bits + PCI_ROM_ADDRESS, 0xff, 4);
- /* QEMU can also add or extend BARs */
- memset(vdev->emulated_config_bits + PCI_BASE_ADDRESS_0, 0xff, 6 * 4);
-
- /*
- * The PCI spec reserves vendor ID 0xffff as an invalid value. The
- * device ID is managed by the vendor and need only be a 16-bit value.
- * Allow any 16-bit value for subsystem so they can be hidden or changed.
- */
- if (vdev->vendor_id != PCI_ANY_ID) {
- if (vdev->vendor_id >= 0xffff) {
- error_setg(errp, "invalid PCI vendor ID provided");
- goto error;
- }
- vfio_add_emulated_word(vdev, PCI_VENDOR_ID, vdev->vendor_id, ~0);
- trace_vfio_pci_emulated_vendor_id(vdev->vbasedev.name, vdev->vendor_id);
- } else {
- vdev->vendor_id = pci_get_word(pdev->config + PCI_VENDOR_ID);
- }
-
- if (vdev->device_id != PCI_ANY_ID) {
- if (vdev->device_id > 0xffff) {
- error_setg(errp, "invalid PCI device ID provided");
- goto error;
- }
- vfio_add_emulated_word(vdev, PCI_DEVICE_ID, vdev->device_id, ~0);
- trace_vfio_pci_emulated_device_id(vdev->vbasedev.name, vdev->device_id);
- } else {
- vdev->device_id = pci_get_word(pdev->config + PCI_DEVICE_ID);
- }
-
- if (vdev->sub_vendor_id != PCI_ANY_ID) {
- if (vdev->sub_vendor_id > 0xffff) {
- error_setg(errp, "invalid PCI subsystem vendor ID provided");
- goto error;
- }
- vfio_add_emulated_word(vdev, PCI_SUBSYSTEM_VENDOR_ID,
- vdev->sub_vendor_id, ~0);
- trace_vfio_pci_emulated_sub_vendor_id(vdev->vbasedev.name,
- vdev->sub_vendor_id);
- }
-
- if (vdev->sub_device_id != PCI_ANY_ID) {
- if (vdev->sub_device_id > 0xffff) {
- error_setg(errp, "invalid PCI subsystem device ID provided");
- goto error;
- }
- vfio_add_emulated_word(vdev, PCI_SUBSYSTEM_ID, vdev->sub_device_id, ~0);
- trace_vfio_pci_emulated_sub_device_id(vdev->vbasedev.name,
- vdev->sub_device_id);
- }
-
- /* QEMU can change multi-function devices to single function, or reverse */
- vdev->emulated_config_bits[PCI_HEADER_TYPE] =
- PCI_HEADER_TYPE_MULTI_FUNCTION;
-
- /* Restore or clear multifunction, this is always controlled by QEMU */
- if (vdev->pdev.cap_present & QEMU_PCI_CAP_MULTIFUNCTION) {
- vdev->pdev.config[PCI_HEADER_TYPE] |= PCI_HEADER_TYPE_MULTI_FUNCTION;
- } else {
- vdev->pdev.config[PCI_HEADER_TYPE] &= ~PCI_HEADER_TYPE_MULTI_FUNCTION;
- }
-
- /*
- * Clear host resource mapping info. If we choose not to register a
- * BAR, such as might be the case with the option ROM, we can get
- * confusing, unwritable, residual addresses from the host here.
- */
- memset(&vdev->pdev.config[PCI_BASE_ADDRESS_0], 0, 24);
- memset(&vdev->pdev.config[PCI_ROM_ADDRESS], 0, 4);
-
- vfio_pci_size_rom(vdev);
-
- vfio_bars_prepare(vdev);
-
- vfio_msix_early_setup(vdev, &err);
+ vfio_pci_config_setup(vdev, &err);
if (err) {
- error_propagate(errp, err);
goto error;
}
- vfio_bars_register(vdev);
+ /*
+ * vfio_pci_config_setup will have registered the device's BARs
+ * and set up any MSIX BARs, so errors after it succeeds must
+ * use out_teardown
+ */
ret = vfio_add_capabilities(vdev, errp);
if (ret) {
@@ -3072,29 +3122,15 @@ static void vfio_realize(PCIDevice *pdev, Error **errp)
}
}
- /* QEMU emulates all of MSI & MSIX */
- if (pdev->cap_present & QEMU_PCI_CAP_MSIX) {
- memset(vdev->emulated_config_bits + pdev->msix_cap, 0xff,
- MSIX_CAP_LENGTH);
- }
-
- if (pdev->cap_present & QEMU_PCI_CAP_MSI) {
- memset(vdev->emulated_config_bits + pdev->msi_cap, 0xff,
- vdev->msi_cap_size);
+ ret = vfio_interrupt_setup(vdev, errp);
+ if (ret) {
+ goto out_teardown;
}
- if (vfio_pci_read_config(&vdev->pdev, PCI_INTERRUPT_PIN, 1)) {
- vdev->intx.mmap_timer = timer_new_ms(QEMU_CLOCK_VIRTUAL,
- vfio_intx_mmap_enable, vdev);
- pci_device_set_intx_routing_notifier(&vdev->pdev,
- vfio_intx_routing_notifier);
- vdev->irqchip_change_notifier.notify = vfio_irqchip_change;
- kvm_irqchip_add_change_notifier(&vdev->irqchip_change_notifier);
- ret = vfio_intx_enable(vdev, errp);
- if (ret) {
- goto out_deregister;
- }
- }
+ /*
+ * vfio_interrupt_setup will have set up INTx's KVM routing
+ * so errors after it succeeds must use out_deregister
+ */
if (vdev->display != ON_OFF_AUTO_OFF) {
ret = vfio_display_probe(vdev, errp);
@@ -3488,8 +3524,41 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
goto error;
}
+ /* Get a copy of config space */
+ ret = VDEV_REGION_READ(vbasedev, VFIO_PCI_CONFIG_REGION_INDEX, 0,
+ MIN(pci_config_size(pdev), vdev->config_size),
+ pdev->config);
+ if (ret < (int)MIN(pci_config_size(&vdev->pdev), vdev->config_size)) {
+ error_setg_errno(errp, -ret, "failed to read device config space");
+ goto error;
+ }
+
+ vfio_pci_config_setup(vdev, &err);
+ if (err) {
+ goto error;
+ }
+
+ /*
+ * vfio_pci_config_setup will have registered the device's BARs
+ * and set up any MSIX BARs, so errors after it succeeds must
+ * use out_teardown
+ */
+
+ ret = vfio_add_capabilities(vdev, errp);
+ if (ret) {
+ goto out_teardown;
+ }
+
+ ret = vfio_interrupt_setup(vdev, errp);
+ if (ret) {
+ goto out_teardown;
+ }
+
return;
+out_teardown:
+ vfio_teardown_msi(vdev);
+ vfio_bars_exit(vdev);
error:
vfio_user_disconnect(proxy);
error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.name);
--
1.8.3.1
* [RFC v5 14/23] vfio-user: forward msix BAR accesses to server
[not found] <cover.1651709440.git.john.g.johnson@oracle.com>
` (12 preceding siblings ...)
2022-05-05 17:19 ` [RFC v5 13/23] vfio-user: pci_user_realize PCI setup John Johnson
@ 2022-05-05 17:19 ` John Johnson
2022-05-05 17:19 ` [RFC v5 15/23] vfio-user: get and set IRQs John Johnson
` (8 subsequent siblings)
22 siblings, 0 replies; 23+ messages in thread
From: John Johnson @ 2022-05-05 17:19 UTC (permalink / raw)
To: qemu-devel
The server holds the device's current pending-interrupt state, so
MSI-X table and PBA accesses are forwarded to it.
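
The handling of each MSI-X access type in the diff below can be
summarised by a small table. The sketch here encodes it with a
hypothetical enum and helper (DemoMsixAction, demo_msix_action); these
names are not part of the patch and only model what the four
MemoryRegionOps callbacks appear to do.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical labels for what the client does with an access. */
typedef enum {
    DEMO_LOCAL,              /* served from QEMU's local MSI-X copy */
    DEMO_FORWARD,            /* sent to the server, server copy wins */
    DEMO_FORWARD_THEN_LOCAL, /* sent to the server, then applied locally */
    DEMO_DROP,               /* silently ignored */
} DemoMsixAction;

/* Classify an access to the MSI-X table or the PBA. */
static DemoMsixAction demo_msix_action(bool is_pba, bool is_write)
{
    if (is_pba) {
        /* the server owns pending state; guest writes are dropped */
        return is_write ? DEMO_DROP : DEMO_FORWARD;
    }
    /* the server does not change the table, so reads can stay local */
    return is_write ? DEMO_FORWARD_THEN_LOCAL : DEMO_LOCAL;
}

/* Usage example: the four cases handled by the patch. */
void demo_msix_check(void)
{
    assert(demo_msix_action(false, false) == DEMO_LOCAL);
    assert(demo_msix_action(false, true) == DEMO_FORWARD_THEN_LOCAL);
    assert(demo_msix_action(true, false) == DEMO_FORWARD);
    assert(demo_msix_action(true, true) == DEMO_DROP);
}
```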
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
hw/vfio/pci.h | 1 +
hw/vfio/pci.c | 112 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 113 insertions(+)
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index 643ff75..a4eb5b9 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -112,6 +112,7 @@ typedef struct VFIOMSIXInfo {
uint32_t table_offset;
uint32_t pba_offset;
unsigned long *pending;
+ MemoryRegion *msix_regions;
} VFIOMSIXInfo;
/*
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 1be6683..bc70968 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3426,6 +3426,111 @@ type_init(register_vfio_pci_dev_type)
* vfio-user routines.
*/
+/*
+ * The server maintains the device's pending interrupts,
+ * via its MSIX table and PBA, so we treat these accesses
+ * like PCI config space and forward them.
+ */
+static uint64_t vfio_user_table_read(void *opaque, hwaddr addr,
+ unsigned size)
+{
+ VFIOPCIDevice *vdev = opaque;
+ uint64_t data;
+
+ /* server doesn't change these, so local copy is good */
+ memory_region_dispatch_read(&vdev->pdev.msix_table_mmio, addr,
+ &data, size_memop(size) | MO_LE,
+ MEMTXATTRS_UNSPECIFIED);
+ return data;
+}
+
+static void vfio_user_table_write(void *opaque, hwaddr addr,
+ uint64_t data, unsigned size)
+{
+ VFIOPCIDevice *vdev = opaque;
+ VFIORegion *region = &vdev->bars[vdev->msix->table_bar].region;
+
+ /* forward, then perform locally */
+ vfio_region_write(region, addr + vdev->msix->table_offset, data, size);
+ memory_region_dispatch_write(&vdev->pdev.msix_table_mmio, addr,
+ data, size_memop(size) | MO_LE,
+ MEMTXATTRS_UNSPECIFIED);
+}
+
+static const MemoryRegionOps vfio_user_table_ops = {
+ .read = vfio_user_table_read,
+ .write = vfio_user_table_write,
+ .endianness = DEVICE_LITTLE_ENDIAN,
+};
+
+static uint64_t vfio_user_pba_read(void *opaque, hwaddr addr,
+ unsigned size)
+{
+ VFIOPCIDevice *vdev = opaque;
+ VFIORegion *region = &vdev->bars[vdev->msix->pba_bar].region;
+ uint64_t data;
+
+ /* server copy is what matters */
+ data = vfio_region_read(region, addr + vdev->msix->pba_offset, size);
+ return data;
+}
+
+static void vfio_user_pba_write(void *opaque, hwaddr addr,
+ uint64_t data, unsigned size)
+{
+ /* dropped */
+}
+
+static const MemoryRegionOps vfio_user_pba_ops = {
+ .read = vfio_user_pba_read,
+ .write = vfio_user_pba_write,
+ .endianness = DEVICE_LITTLE_ENDIAN,
+};
+
+static void vfio_user_msix_setup(VFIOPCIDevice *vdev)
+{
+ MemoryRegion *vfio_reg, *msix_reg, *new_reg;
+
+ vdev->msix->msix_regions = g_new0(MemoryRegion, 2);
+
+ vfio_reg = vdev->bars[vdev->msix->table_bar].mr;
+ msix_reg = &vdev->pdev.msix_table_mmio;
+ new_reg = &vdev->msix->msix_regions[0];
+ memory_region_init_io(new_reg, OBJECT(vdev), &vfio_user_table_ops, vdev,
+ "VFIO MSIX table", int128_get64(msix_reg->size));
+ memory_region_add_subregion_overlap(vfio_reg, vdev->msix->table_offset,
+ new_reg, 1);
+
+ vfio_reg = vdev->bars[vdev->msix->pba_bar].mr;
+ msix_reg = &vdev->pdev.msix_pba_mmio;
+ new_reg = &vdev->msix->msix_regions[1];
+ memory_region_init_io(new_reg, OBJECT(vdev), &vfio_user_pba_ops, vdev,
+ "VFIO MSIX PBA", int128_get64(msix_reg->size));
+ memory_region_add_subregion_overlap(vfio_reg, vdev->msix->pba_offset,
+ new_reg, 1);
+}
+
+static void vfio_user_msix_teardown(VFIOPCIDevice *vdev)
+{
+ MemoryRegion *mr, *sub;
+
+ mr = vdev->bars[vdev->msix->table_bar].mr;
+ sub = &vdev->msix->msix_regions[0];
+ memory_region_del_subregion(mr, sub);
+
+ mr = vdev->bars[vdev->msix->pba_bar].mr;
+ sub = &vdev->msix->msix_regions[1];
+ memory_region_del_subregion(mr, sub);
+
+ g_free(vdev->msix->msix_regions);
+ vdev->msix->msix_regions = NULL;
+}
+
+/*
+ * Incoming request message callback.
+ *
+ * Runs off main loop, so BQL held.
+ */
static void vfio_user_pci_process_req(void *opaque, VFIOUserMsg *msg)
{
@@ -3548,6 +3653,9 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
if (ret) {
goto out_teardown;
}
+ if (vdev->msix != NULL) {
+ vfio_user_msix_setup(vdev);
+ }
ret = vfio_interrupt_setup(vdev, errp);
if (ret) {
@@ -3569,6 +3677,10 @@ static void vfio_user_instance_finalize(Object *obj)
VFIOPCIDevice *vdev = VFIO_PCI_BASE(obj);
VFIODevice *vbasedev = &vdev->vbasedev;
+ if (vdev->msix != NULL) {
+ vfio_user_msix_teardown(vdev);
+ }
+
vfio_put_device(vdev);
if (vbasedev->proxy != NULL) {
--
1.8.3.1
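
The access policy in the patch above (table reads served locally, table writes forwarded and then applied locally, PBA reads taken from the server, PBA writes dropped) can be summarised with a small standalone model. This is only a sketch and not QEMU code: MsixModel, table_write(), table_read(), pba_read() and pba_write() are hypothetical names, the array sizes are arbitrary, and the "server" is just a second set of arrays standing in for the remote process.

```c
#include <stdint.h>
#include <string.h>

#define MSIX_TABLE_WORDS 8   /* hypothetical size, for illustration only */
#define MSIX_PBA_WORDS   1

/* Toy model: "local" is the client's copy, "server" stands in for the remote process. */
typedef struct {
    uint32_t local_table[MSIX_TABLE_WORDS];
    uint32_t server_table[MSIX_TABLE_WORDS];
    uint32_t server_pba[MSIX_PBA_WORDS];
} MsixModel;

/* Table writes are forwarded to the server first, then applied locally. */
static void table_write(MsixModel *m, unsigned idx, uint32_t val)
{
    m->server_table[idx] = val;
    m->local_table[idx] = val;
}

/* Table reads are served from the local copy; the server does not change it. */
static uint32_t table_read(const MsixModel *m, unsigned idx)
{
    return m->local_table[idx];
}

/* PBA reads always go to the server, which owns the pending state. */
static uint32_t pba_read(const MsixModel *m, unsigned idx)
{
    return m->server_pba[idx];
}

/* PBA writes from the guest are dropped. */
static void pba_write(MsixModel *m, unsigned idx, uint32_t val)
{
    (void)m;
    (void)idx;
    (void)val;
}
```

In this model a guest write to the PBA leaves server_pba untouched, which mirrors the empty vfio_user_pba_write() callback in the patch.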
* [RFC v5 15/23] vfio-user: get and set IRQs
From: John Johnson @ 2022-05-05 17:19 UTC (permalink / raw)
To: qemu-devel
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
hw/vfio/user-protocol.h | 25 +++++++++
hw/vfio/pci.c | 9 +++-
hw/vfio/user.c | 131 ++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 163 insertions(+), 2 deletions(-)
diff --git a/hw/vfio/user-protocol.h b/hw/vfio/user-protocol.h
index b1ea55f..4852882 100644
--- a/hw/vfio/user-protocol.h
+++ b/hw/vfio/user-protocol.h
@@ -121,6 +121,31 @@ typedef struct {
} VFIOUserRegionInfo;
/*
+ * VFIO_USER_DEVICE_GET_IRQ_INFO
+ * imported from struct vfio_irq_info
+ */
+typedef struct {
+ VFIOUserHdr hdr;
+ uint32_t argsz;
+ uint32_t flags;
+ uint32_t index;
+ uint32_t count;
+} VFIOUserIRQInfo;
+
+/*
+ * VFIO_USER_DEVICE_SET_IRQS
+ * imported from struct vfio_irq_set
+ */
+typedef struct {
+ VFIOUserHdr hdr;
+ uint32_t argsz;
+ uint32_t flags;
+ uint32_t index;
+ uint32_t start;
+ uint32_t count;
+} VFIOUserIRQSet;
+
+/*
* VFIO_USER_REGION_READ
* VFIO_USER_REGION_WRITE
*/
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index bc70968..0a4208b 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -517,7 +517,7 @@ static int vfio_msix_vector_do_use(PCIDevice *pdev, unsigned int nr,
vdev->nr_vectors = nr + 1;
ret = vfio_enable_vectors(vdev, true);
if (ret) {
- error_report("vfio: failed to enable vectors, %d", ret);
+ error_report("vfio: failed to enable vectors, %s", strerror(-ret));
}
} else {
Error *err = NULL;
@@ -662,7 +662,8 @@ retry:
ret = vfio_enable_vectors(vdev, false);
if (ret) {
if (ret < 0) {
- error_report("vfio: Error: Failed to setup MSI fds: %m");
+ error_report("vfio: Error: Failed to setup MSI fds: %s",
+ strerror(-ret));
} else if (ret != vdev->nr_vectors) {
error_report("vfio: Error: Failed to enable %d "
"MSI vectors, retry with %d", vdev->nr_vectors, ret);
@@ -2669,6 +2670,7 @@ static void vfio_populate_device(VFIOPCIDevice *vdev, Error **errp)
irq_info.index = VFIO_PCI_ERR_IRQ_INDEX;
ret = VDEV_GET_IRQ_INFO(vbasedev, &irq_info);
+
if (ret) {
/* This can fail for an old kernel or legacy PCI dev */
trace_vfio_populate_device_get_irq_info_failure(strerror(errno));
@@ -3662,6 +3664,9 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
goto out_teardown;
}
+ vfio_register_err_notifier(vdev);
+ vfio_register_req_notifier(vdev);
+
return;
out_teardown:
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index fb6851e..d0140d6 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -1072,6 +1072,113 @@ static int vfio_user_get_region_info(VFIOProxy *proxy,
return 0;
}
+static int vfio_user_get_irq_info(VFIOProxy *proxy,
+ struct vfio_irq_info *info)
+{
+ VFIOUserIRQInfo msg;
+
+ memset(&msg, 0, sizeof(msg));
+ vfio_user_request_msg(&msg.hdr, VFIO_USER_DEVICE_GET_IRQ_INFO,
+ sizeof(msg), 0);
+ msg.argsz = info->argsz;
+ msg.index = info->index;
+
+ vfio_user_send_wait(proxy, &msg.hdr, NULL, 0, false);
+ if (msg.hdr.flags & VFIO_USER_ERROR) {
+ return -msg.hdr.error_reply;
+ }
+
+ memcpy(info, &msg.argsz, sizeof(*info));
+ return 0;
+}
+
+static int irq_howmany(int *fdp, int cur, int max)
+{
+ int n = 0;
+
+ if (fdp[cur] != -1) {
+ do {
+ n++;
+ } while (n < max && fdp[cur + n] != -1 && n < max_send_fds);
+ } else {
+ do {
+ n++;
+ } while (n < max && fdp[cur + n] == -1 && n < max_send_fds);
+ }
+
+ return n;
+}
+
+static int vfio_user_set_irqs(VFIOProxy *proxy, struct vfio_irq_set *irq)
+{
+ g_autofree VFIOUserIRQSet *msgp = NULL;
+ uint32_t size, nfds, send_fds, sent_fds;
+
+ if (irq->argsz < sizeof(*irq)) {
+ error_printf("vfio_user_set_irqs argsz too small\n");
+ return -EINVAL;
+ }
+
+ /*
+ * Handle simple case
+ */
+ if ((irq->flags & VFIO_IRQ_SET_DATA_EVENTFD) == 0) {
+ size = sizeof(VFIOUserHdr) + irq->argsz;
+ msgp = g_malloc0(size);
+
+ vfio_user_request_msg(&msgp->hdr, VFIO_USER_DEVICE_SET_IRQS, size, 0);
+ msgp->argsz = irq->argsz;
+ msgp->flags = irq->flags;
+ msgp->index = irq->index;
+ msgp->start = irq->start;
+ msgp->count = irq->count;
+
+ vfio_user_send_wait(proxy, &msgp->hdr, NULL, 0, false);
+ if (msgp->hdr.flags & VFIO_USER_ERROR) {
+ return -msgp->hdr.error_reply;
+ }
+
+ return 0;
+ }
+
+ /*
+ * Calculate the number of FDs to send
+ * and adjust argsz
+ */
+ nfds = (irq->argsz - sizeof(*irq)) / sizeof(int);
+ irq->argsz = sizeof(*irq);
+ msgp = g_malloc0(sizeof(*msgp));
+ /*
+ * Send in chunks if over max_send_fds
+ */
+ for (sent_fds = 0; nfds > sent_fds; sent_fds += send_fds) {
+ VFIOUserFDs *arg_fds, loop_fds;
+
+ /* must send all valid FDs or all invalid FDs in single msg */
+ send_fds = irq_howmany((int *)irq->data, sent_fds, nfds - sent_fds);
+
+ vfio_user_request_msg(&msgp->hdr, VFIO_USER_DEVICE_SET_IRQS,
+ sizeof(*msgp), 0);
+ msgp->argsz = irq->argsz;
+ msgp->flags = irq->flags;
+ msgp->index = irq->index;
+ msgp->start = irq->start + sent_fds;
+ msgp->count = send_fds;
+
+ loop_fds.send_fds = send_fds;
+ loop_fds.recv_fds = 0;
+ loop_fds.fds = (int *)irq->data + sent_fds;
+ arg_fds = loop_fds.fds[0] != -1 ? &loop_fds : NULL;
+
+ vfio_user_send_wait(proxy, &msgp->hdr, arg_fds, 0, false);
+ if (msgp->hdr.flags & VFIO_USER_ERROR) {
+ return -msgp->hdr.error_reply;
+ }
+ }
+
+ return 0;
+}
+
static int vfio_user_region_read(VFIOProxy *proxy, uint8_t index, off_t offset,
uint32_t count, void *data)
{
@@ -1185,6 +1292,28 @@ static int vfio_user_io_get_region_info(VFIODevice *vbasedev,
return 0;
}
+static int vfio_user_io_get_irq_info(VFIODevice *vbasedev,
+ struct vfio_irq_info *irq)
+{
+ int ret;
+
+ ret = vfio_user_get_irq_info(vbasedev->proxy, irq);
+ if (ret) {
+ return ret;
+ }
+
+ if (irq->index > vbasedev->num_irqs) {
+ return -EINVAL;
+ }
+ return 0;
+}
+
+static int vfio_user_io_set_irqs(VFIODevice *vbasedev,
+ struct vfio_irq_set *irqs)
+{
+ return vfio_user_set_irqs(vbasedev->proxy, irqs);
+}
+
static int vfio_user_io_region_read(VFIODevice *vbasedev, uint8_t index,
off_t off, uint32_t size, void *data)
{
@@ -1202,6 +1331,8 @@ static int vfio_user_io_region_write(VFIODevice *vbasedev, uint8_t index,
VFIODevIO vfio_dev_io_sock = {
.get_info = vfio_user_io_get_info,
.get_region_info = vfio_user_io_get_region_info,
+ .get_irq_info = vfio_user_io_get_irq_info,
+ .set_irqs = vfio_user_io_set_irqs,
.region_read = vfio_user_io_region_read,
.region_write = vfio_user_io_region_write,
};
--
1.8.3.1
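
The chunking rule used by vfio_user_set_irqs() in the patch above (each message carries either only valid FDs or only -1 entries, and never more than a fixed number of them) can be illustrated in isolation. The sketch below is a standalone approximation, not the patch's code: fd_run_length() and split_fd_runs() are hypothetical helpers, and max_per_msg stands in for the patch's max_send_fds and is assumed to be at least 1.

```c
#include <stddef.h>

/*
 * Length of the next run starting at fds[cur]: either all valid
 * descriptors or all -1 entries, capped at max_per_msg.
 * 'remaining' is the number of entries left from cur onwards (>= 1).
 */
static size_t fd_run_length(const int *fds, size_t cur, size_t remaining,
                            size_t max_per_msg)
{
    int valid = fds[cur] != -1;
    size_t n = 1;

    while (n < remaining && n < max_per_msg &&
           (fds[cur + n] != -1) == valid) {
        n++;
    }
    return n;
}

/*
 * Split an fd array into runs; stores each run length in out[] and
 * returns the number of messages that would be sent.
 */
static size_t split_fd_runs(const int *fds, size_t nfds, size_t max_per_msg,
                            size_t *out, size_t out_cap)
{
    size_t sent = 0, msgs = 0;

    while (sent < nfds && msgs < out_cap) {
        size_t n = fd_run_length(fds, sent, nfds - sent, max_per_msg);

        out[msgs++] = n;
        sent += n;
    }
    return msgs;
}
```

With the array {3, 4, 5, -1, -1, 7} and a limit of two per message, this yields four runs of lengths 2, 1, 2 and 1; each run would correspond to one SET_IRQS message whose start is offset by the number of entries already sent.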
* [RFC v5 16/23] vfio-user: proxy container connect/disconnect
From: John Johnson @ 2022-05-05 17:19 UTC (permalink / raw)
To: qemu-devel
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
hw/vfio/user.h | 1 +
include/hw/vfio/vfio-common.h | 3 ++
hw/vfio/common.c | 105 ++++++++++++++++++++++++++++++++++++++++++
hw/vfio/pci.c | 25 ++++++++++
hw/vfio/user.c | 3 ++
5 files changed, 137 insertions(+)
diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index a641351..742e1a9 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -87,5 +87,6 @@ void vfio_user_set_handler(VFIODevice *vbasedev,
int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp);
extern VFIODevIO vfio_dev_io_sock;
+extern VFIOContIO vfio_cont_io_sock;
#endif /* VFIO_USER_H */
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 4118b8a..59a8299 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -94,6 +94,7 @@ typedef struct VFIOContainer {
uint64_t max_dirty_bitmap_size;
unsigned long pgsizes;
unsigned int dma_max_mappings;
+ VFIOProxy *proxy;
QLIST_HEAD(, VFIOGuestIOMMU) giommu_list;
QLIST_HEAD(, VFIOHostDMAWindow) hostwin_list;
QLIST_HEAD(, VFIOGroup) group_list;
@@ -278,6 +279,8 @@ VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp);
void vfio_put_group(VFIOGroup *group);
int vfio_get_device(VFIOGroup *group, const char *name,
VFIODevice *vbasedev, Error **errp);
+void vfio_connect_proxy(VFIOProxy *proxy, VFIOGroup *group, AddressSpace *as);
+void vfio_disconnect_proxy(VFIOGroup *group);
extern const MemoryRegionOps vfio_region_ops;
typedef QLIST_HEAD(VFIOGroupList, VFIOGroup) VFIOGroupList;
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index 351f727..beb5689 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -19,6 +19,7 @@
*/
#include "qemu/osdep.h"
+#include CONFIG_DEVICES
#include <sys/ioctl.h>
#ifdef CONFIG_KVM
#include <linux/kvm.h>
@@ -2209,6 +2210,62 @@ put_space_exit:
return ret;
}
+
+#ifdef CONFIG_VFIO_USER
+
+void vfio_connect_proxy(VFIOProxy *proxy, VFIOGroup *group, AddressSpace *as)
+{
+ VFIOAddressSpace *space;
+ VFIOContainer *container;
+
+ if (QLIST_EMPTY(&vfio_group_list)) {
+ qemu_register_reset(vfio_reset_handler, NULL);
+ }
+
+ QLIST_INSERT_HEAD(&vfio_group_list, group, next);
+
+ /*
+ * try to mirror vfio_connect_container()
+ * as much as possible
+ */
+
+ space = vfio_get_address_space(as);
+
+ container = g_malloc0(sizeof(*container));
+ container->space = space;
+ container->fd = -1;
+ container->io_ops = &vfio_cont_io_sock;
+ QLIST_INIT(&container->giommu_list);
+ QLIST_INIT(&container->hostwin_list);
+ container->proxy = proxy;
+
+ /*
+ * The proxy uses a SW IOMMU in lieu of the HW one
+ * used in the ioctl() version. Use TYPE1 with the
+ * target's page size for maximum compatibility
+ */
+ container->iommu_type = VFIO_TYPE1_IOMMU;
+ vfio_host_win_add(container, 0, (hwaddr)-1, TARGET_PAGE_SIZE);
+ container->pgsizes = TARGET_PAGE_SIZE;
+
+ container->dirty_pages_supported = true;
+ container->max_dirty_bitmap_size = VFIO_USER_DEF_MAX_XFER;
+ container->dirty_pgsizes = TARGET_PAGE_SIZE;
+
+ QLIST_INIT(&container->group_list);
+ QLIST_INSERT_HEAD(&space->containers, container, next);
+
+ group->container = container;
+ QLIST_INSERT_HEAD(&container->group_list, group, container_next);
+
+ container->listener = vfio_memory_listener;
+ memory_listener_register(&container->listener, container->space->as);
+ container->initialized = true;
+}
+
+#endif /* CONFIG_VFIO_USER */
+
+
static void vfio_disconnect_container(VFIOGroup *group)
{
VFIOContainer *container = group->container;
@@ -2258,6 +2315,54 @@ static void vfio_disconnect_container(VFIOGroup *group)
}
}
+
+#ifdef CONFIG_VFIO_USER
+
+void vfio_disconnect_proxy(VFIOGroup *group)
+{
+ VFIOContainer *container = group->container;
+ VFIOAddressSpace *space = container->space;
+ VFIOGuestIOMMU *giommu, *tmp;
+ VFIOHostDMAWindow *hostwin, *next;
+
+ /*
+ * try to mirror vfio_disconnect_container()
+ * as much as possible, knowing each device
+ * is in one group and one container
+ */
+
+ QLIST_REMOVE(group, container_next);
+ group->container = NULL;
+
+ /*
+ * Explicitly release the listener first before unset container,
+ * since unset may destroy the backend container if it's the last
+ * group.
+ */
+ memory_listener_unregister(&container->listener);
+
+ QLIST_REMOVE(container, next);
+
+ QLIST_FOREACH_SAFE(giommu, &container->giommu_list, giommu_next, tmp) {
+ memory_region_unregister_iommu_notifier(
+ MEMORY_REGION(giommu->iommu), &giommu->n);
+ QLIST_REMOVE(giommu, giommu_next);
+ g_free(giommu);
+ }
+
+ QLIST_FOREACH_SAFE(hostwin, &container->hostwin_list, hostwin_next,
+ next) {
+ QLIST_REMOVE(hostwin, hostwin_next);
+ g_free(hostwin);
+ }
+
+ g_free(container);
+ vfio_put_address_space(space);
+}
+
+#endif /* CONFIG_VFIO_USER */
+
+
VFIOGroup *vfio_get_group(int groupid, AddressSpace *as, Error **errp)
{
VFIOGroup *group;
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 0a4208b..054a2bd 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3562,6 +3562,7 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
VFIODevice *vbasedev = &vdev->vbasedev;
SocketAddress addr;
VFIOProxy *proxy;
+ VFIOGroup *group = NULL;
struct vfio_device_info info;
int ret;
Error *err = NULL;
@@ -3608,6 +3609,19 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
vbasedev->ops = &vfio_user_pci_ops;
vbasedev->io_ops = &vfio_dev_io_sock;
+ /*
+ * each device gets its own group and container
+ * make them unrelated to any host IOMMU groupings
+ */
+ group = g_malloc0(sizeof(*group));
+ group->fd = -1;
+ group->groupid = -1;
+ QLIST_INIT(&group->device_list);
+ QLIST_INSERT_HEAD(&group->device_list, vbasedev, next);
+ vbasedev->group = group;
+
+ vfio_connect_proxy(proxy, group, pci_device_iommu_address_space(pdev));
+
ret = VDEV_GET_INFO(vbasedev, &info);
if (ret) {
error_setg_errno(errp, -ret, "get info failure");
@@ -3673,6 +3687,10 @@ out_teardown:
vfio_teardown_msi(vdev);
vfio_bars_exit(vdev);
error:
+ if (group != NULL) {
+ vfio_disconnect_proxy(group);
+ g_free(group);
+ }
vfio_user_disconnect(proxy);
error_prepend(errp, VFIO_MSG_PREFIX, vdev->vbasedev.name);
}
@@ -3681,6 +3699,13 @@ static void vfio_user_instance_finalize(Object *obj)
{
VFIOPCIDevice *vdev = VFIO_PCI_BASE(obj);
VFIODevice *vbasedev = &vdev->vbasedev;
+ VFIOGroup *group = vbasedev->group;
+
+ if (group != NULL) {
+ vfio_disconnect_proxy(group);
+ g_free(group);
+ vbasedev->group = NULL;
+ }
if (vdev->msix != NULL) {
vfio_user_msix_teardown(vdev);
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index d0140d6..9906d81 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -1337,3 +1337,6 @@ VFIODevIO vfio_dev_io_sock = {
.region_write = vfio_user_io_region_write,
};
+
+VFIOContIO vfio_cont_io_sock = {
+};
--
1.8.3.1
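
The patch above gives the proxy container a single host window covering the whole address range with the target's page size as its only granularity. As a rough illustration of what that implies for a mapping request, the sketch below checks a range against such a window. It is an assumption-laden model rather than QEMU code: ToyWindow and toy_window_accepts() are hypothetical, and the page size is assumed to be a power of two.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a host DMA window: inclusive bounds plus page size. */
typedef struct {
    uint64_t min_iova;
    uint64_t max_iova;
    uint64_t pgsize;     /* assumed to be a power of two */
} ToyWindow;

/* True if [iova, iova + size) is non-empty, page-aligned and inside the window. */
static bool toy_window_accepts(const ToyWindow *w, uint64_t iova, uint64_t size)
{
    uint64_t mask = w->pgsize - 1;
    uint64_t last;

    if (size == 0 || (iova & mask) || (size & mask)) {
        return false;
    }
    last = iova + size - 1;
    if (last < iova) {          /* range wrapped past the end of the address space */
        return false;
    }
    return iova >= w->min_iova && last <= w->max_iova;
}
```

With a window of { 0, UINT64_MAX, 4096 }, any non-empty, 4 KiB-aligned range that does not wrap is accepted, which is the practical effect of registering one full-range window.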
* [RFC v5 17/23] vfio-user: dma map/unmap operations
From: John Johnson @ 2022-05-05 17:20 UTC (permalink / raw)
To: qemu-devel
Add ability to do async operations during memory transactions
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
---
hw/vfio/user-protocol.h | 32 +++++++
include/hw/vfio/vfio-common.h | 9 +-
hw/vfio/common.c | 63 +++++++++---
hw/vfio/user.c | 217 ++++++++++++++++++++++++++++++++++++++++++
4 files changed, 305 insertions(+), 16 deletions(-)
diff --git a/hw/vfio/user-protocol.h b/hw/vfio/user-protocol.h
index 4852882..ad63f21 100644
--- a/hw/vfio/user-protocol.h
+++ b/hw/vfio/user-protocol.h
@@ -94,6 +94,31 @@ typedef struct {
/*
+ * VFIO_USER_DMA_MAP
+ * imported from struct vfio_iommu_type1_dma_map
+ */
+typedef struct {
+ VFIOUserHdr hdr;
+ uint32_t argsz;
+ uint32_t flags;
+ uint64_t offset; /* FD offset */
+ uint64_t iova;
+ uint64_t size;
+} VFIOUserDMAMap;
+
+/*
+ * VFIO_USER_DMA_UNMAP
+ * imported from struct vfio_iommu_type1_dma_unmap
+ */
+typedef struct {
+ VFIOUserHdr hdr;
+ uint32_t argsz;
+ uint32_t flags;
+ uint64_t iova;
+ uint64_t size;
+} VFIOUserDMAUnmap;
+
+/*
* VFIO_USER_DEVICE_GET_INFO
* imported from struct_device_info
*/
@@ -157,4 +182,11 @@ typedef struct {
char data[];
} VFIOUserRegionRW;
+/* imported from struct vfio_bitmap */
+typedef struct {
+ uint64_t pgsize;
+ uint64_t size;
+ char data[];
+} VFIOUserBitmap;
+
#endif /* VFIO_USER_PROTOCOL_H */
diff --git a/include/hw/vfio/vfio-common.h b/include/hw/vfio/vfio-common.h
index 59a8299..a84e10a 100644
--- a/include/hw/vfio/vfio-common.h
+++ b/include/hw/vfio/vfio-common.h
@@ -90,6 +90,7 @@ typedef struct VFIOContainer {
VFIOContIO *io_ops;
bool initialized;
bool dirty_pages_supported;
+ bool async_ops;
uint64_t dirty_pgsizes;
uint64_t max_dirty_bitmap_size;
unsigned long pgsizes;
@@ -199,7 +200,7 @@ struct VFIODevIO {
((vdev)->io_ops->region_write((vdev), (nr), (off), (size), (data), (post)))
struct VFIOContIO {
- int (*dma_map)(VFIOContainer *container,
+ int (*dma_map)(VFIOContainer *container, MemoryRegion *mr,
struct vfio_iommu_type1_dma_map *map);
int (*dma_unmap)(VFIOContainer *container,
struct vfio_iommu_type1_dma_unmap *unmap,
@@ -207,14 +208,16 @@ struct VFIOContIO {
int (*dirty_bitmap)(VFIOContainer *container,
struct vfio_iommu_type1_dirty_bitmap *bitmap,
struct vfio_iommu_type1_dirty_bitmap_get *range);
+ void (*wait_commit)(VFIOContainer *container);
};
-#define CONT_DMA_MAP(cont, map) \
- ((cont)->io_ops->dma_map((cont), (map)))
+#define CONT_DMA_MAP(cont, mr, map) \
+ ((cont)->io_ops->dma_map((cont), (mr), (map)))
#define CONT_DMA_UNMAP(cont, unmap, bitmap) \
((cont)->io_ops->dma_unmap((cont), (unmap), (bitmap)))
#define CONT_DIRTY_BITMAP(cont, bitmap, range) \
((cont)->io_ops->dirty_bitmap((cont), (bitmap), (range)))
+#define CONT_WAIT_COMMIT(cont) ((cont)->io_ops->wait_commit(cont))
extern VFIODevIO vfio_dev_io_ioctl;
extern VFIOContIO vfio_cont_io_ioctl;
diff --git a/hw/vfio/common.c b/hw/vfio/common.c
index beb5689..a9d9991 100644
--- a/hw/vfio/common.c
+++ b/hw/vfio/common.c
@@ -480,7 +480,7 @@ static int vfio_dma_unmap(VFIOContainer *container,
return CONT_DMA_UNMAP(container, &unmap, NULL);
}
-static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
+static int vfio_dma_map(VFIOContainer *container, MemoryRegion *mr, hwaddr iova,
ram_addr_t size, void *vaddr, bool readonly)
{
struct vfio_iommu_type1_dma_map map = {
@@ -496,7 +496,7 @@ static int vfio_dma_map(VFIOContainer *container, hwaddr iova,
map.flags |= VFIO_DMA_MAP_FLAG_WRITE;
}
- ret = CONT_DMA_MAP(container, &map);
+ ret = CONT_DMA_MAP(container, mr, &map);
if (ret < 0) {
error_report("VFIO_MAP_DMA failed: %s", strerror(-ret));
@@ -559,7 +559,8 @@ static bool vfio_listener_skipped_section(MemoryRegionSection *section)
/* Called with rcu_read_lock held. */
static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
- ram_addr_t *ram_addr, bool *read_only)
+ ram_addr_t *ram_addr, bool *read_only,
+ MemoryRegion **mrp)
{
MemoryRegion *mr;
hwaddr xlat;
@@ -640,6 +641,10 @@ static bool vfio_get_xlat_addr(IOMMUTLBEntry *iotlb, void **vaddr,
*read_only = !writable || mr->readonly;
}
+ if (mrp != NULL) {
+ *mrp = mr;
+ }
+
return true;
}
@@ -647,6 +652,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
{
VFIOGuestIOMMU *giommu = container_of(n, VFIOGuestIOMMU, n);
VFIOContainer *container = giommu->container;
+ MemoryRegion *mr;
hwaddr iova = iotlb->iova + giommu->iommu_offset;
void *vaddr;
int ret;
@@ -665,7 +671,7 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
bool read_only;
- if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only)) {
+ if (!vfio_get_xlat_addr(iotlb, &vaddr, NULL, &read_only, &mr)) {
goto out;
}
/*
@@ -675,14 +681,14 @@ static void vfio_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
* of vaddr will always be there, even if the memory object is
* destroyed and its backing memory munmap-ed.
*/
- ret = vfio_dma_map(container, iova,
+ ret = vfio_dma_map(container, mr, iova,
iotlb->addr_mask + 1, vaddr,
read_only);
if (ret) {
error_report("vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
- "0x%"HWADDR_PRIx", %p) = %d (%m)",
+ "0x%"HWADDR_PRIx", %p)",
container, iova,
- iotlb->addr_mask + 1, vaddr, ret);
+ iotlb->addr_mask + 1, vaddr);
}
} else {
ret = vfio_dma_unmap(container, iova, iotlb->addr_mask + 1, iotlb);
@@ -737,7 +743,7 @@ static int vfio_ram_discard_notify_populate(RamDiscardListener *rdl,
section->offset_within_address_space;
vaddr = memory_region_get_ram_ptr(section->mr) + start;
- ret = vfio_dma_map(vrdl->container, iova, next - start,
+ ret = vfio_dma_map(vrdl->container, section->mr, iova, next - start,
vaddr, section->readonly);
if (ret) {
/* Rollback */
@@ -845,6 +851,29 @@ static void vfio_unregister_ram_discard_listener(VFIOContainer *container,
g_free(vrdl);
}
+static void vfio_listener_begin(MemoryListener *listener)
+{
+ VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+
+ /*
+ * When DMA space is the physical address space,
+ * the region add/del listeners will fire during
+ * memory update transactions. These depend on BQL
+ * being held, so do any resulting map/unmap ops async
+ * while keeping BQL.
+ */
+ container->async_ops = true;
+}
+
+static void vfio_listener_commit(MemoryListener *listener)
+{
+ VFIOContainer *container = container_of(listener, VFIOContainer, listener);
+
+ /* wait here for any async requests sent during the transaction */
+ CONT_WAIT_COMMIT(container);
+ container->async_ops = false;
+}
+
static void vfio_listener_region_add(MemoryListener *listener,
MemoryRegionSection *section)
{
@@ -1044,12 +1073,12 @@ static void vfio_listener_region_add(MemoryListener *listener,
}
}
- ret = vfio_dma_map(container, iova, int128_get64(llsize),
+ ret = vfio_dma_map(container, section->mr, iova, int128_get64(llsize),
vaddr, section->readonly);
if (ret) {
error_setg(&err, "vfio_dma_map(%p, 0x%"HWADDR_PRIx", "
- "0x%"HWADDR_PRIx", %p) = %d (%m)",
- container, iova, int128_get64(llsize), vaddr, ret);
+ "0x%"HWADDR_PRIx", %p)",
+ container, iova, int128_get64(llsize), vaddr);
if (memory_region_is_ram_device(section->mr)) {
/* Allow unexpected mappings not to be fatal for RAM devices */
error_report_err(err);
@@ -1310,7 +1339,7 @@ static void vfio_iommu_map_dirty_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
}
rcu_read_lock();
- if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL)) {
+ if (vfio_get_xlat_addr(iotlb, NULL, &translated_addr, NULL, NULL)) {
int ret;
ret = vfio_get_dirty_bitmap(container, iova, iotlb->addr_mask + 1,
@@ -1428,6 +1457,8 @@ static void vfio_listener_log_sync(MemoryListener *listener,
static const MemoryListener vfio_memory_listener = {
.name = "vfio",
+ .begin = vfio_listener_begin,
+ .commit = vfio_listener_commit,
.region_add = vfio_listener_region_add,
.region_del = vfio_listener_region_del,
.log_global_start = vfio_listener_log_global_start,
@@ -2819,7 +2850,7 @@ VFIODevIO vfio_dev_io_ioctl = {
.region_write = vfio_io_region_write,
};
-static int vfio_io_dma_map(VFIOContainer *container,
+static int vfio_io_dma_map(VFIOContainer *container, MemoryRegion *mr,
struct vfio_iommu_type1_dma_map *map)
{
@@ -2879,8 +2910,14 @@ static int vfio_io_dirty_bitmap(VFIOContainer *container,
return ret < 0 ? -errno : ret;
}
+static void vfio_io_wait_commit(VFIOContainer *container)
+{
+ /* ioctl()s are synchronous */
+}
+
VFIOContIO vfio_cont_io_ioctl = {
.dma_map = vfio_io_dma_map,
.dma_unmap = vfio_io_dma_unmap,
.dirty_bitmap = vfio_io_dirty_bitmap,
+ .wait_commit = vfio_io_wait_commit,
};
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index 9906d81..29eff8a 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -59,8 +59,11 @@ static void vfio_user_request(void *opaque);
static int vfio_user_send_queued(VFIOProxy *proxy, VFIOUserMsg *msg);
static void vfio_user_send_async(VFIOProxy *proxy, VFIOUserHdr *hdr,
VFIOUserFDs *fds);
+static void vfio_user_send_nowait(VFIOProxy *proxy, VFIOUserHdr *hdr,
+ VFIOUserFDs *fds, int rsize);
static void vfio_user_send_wait(VFIOProxy *proxy, VFIOUserHdr *hdr,
VFIOUserFDs *fds, int rsize, bool nobql);
+static void vfio_user_wait_reqs(VFIOProxy *proxy);
static void vfio_user_request_msg(VFIOUserHdr *hdr, uint16_t cmd,
uint32_t size, uint32_t flags);
@@ -647,6 +650,36 @@ static void vfio_user_send_async(VFIOProxy *proxy, VFIOUserHdr *hdr,
}
}
+/*
+ * nowait send - vfio_user_wait_reqs() can wait for it later
+ */
+static void vfio_user_send_nowait(VFIOProxy *proxy, VFIOUserHdr *hdr,
+ VFIOUserFDs *fds, int rsize)
+{
+ VFIOUserMsg *msg;
+ int ret;
+
+ if (hdr->flags & VFIO_USER_NO_REPLY) {
+ error_printf("vfio_user_send_nowait on async message\n");
+ return;
+ }
+
+ QEMU_LOCK_GUARD(&proxy->lock);
+
+ msg = vfio_user_getmsg(proxy, hdr, fds);
+ msg->id = hdr->id;
+ msg->rsize = rsize ? rsize : hdr->size;
+ msg->type = VFIO_MSG_NOWAIT;
+
+ ret = vfio_user_send_queued(proxy, msg);
+ if (ret < 0) {
+ vfio_user_recycle(proxy, msg);
+ return;
+ }
+
+ proxy->last_nowait = msg;
+}
+
static void vfio_user_send_wait(VFIOProxy *proxy, VFIOUserHdr *hdr,
VFIOUserFDs *fds, int rsize, bool nobql)
{
@@ -696,6 +729,57 @@ static void vfio_user_send_wait(VFIOProxy *proxy, VFIOUserHdr *hdr,
}
}
+static void vfio_user_wait_reqs(VFIOProxy *proxy)
+{
+ VFIOUserMsg *msg;
+ bool iolock = false;
+
+ /*
+ * Any DMA map/unmap requests sent in the middle
+ * of a memory region transaction were sent nowait.
+ * Wait for them here.
+ */
+ qemu_mutex_lock(&proxy->lock);
+ if (proxy->last_nowait != NULL) {
+ iolock = qemu_mutex_iothread_locked();
+ if (iolock) {
+ qemu_mutex_unlock_iothread();
+ }
+
+ /*
+ * Change type to WAIT to wait for reply
+ */
+ msg = proxy->last_nowait;
+ msg->type = VFIO_MSG_WAIT;
+ while (!msg->complete) {
+ if (!qemu_cond_timedwait(&msg->cv, &proxy->lock, wait_time)) {
+ QTAILQ_REMOVE(&proxy->pending, msg, next);
+ error_printf("vfio_user_wait_reqs - timed out\n");
+ break;
+ }
+ }
+
+ if (msg->hdr->flags & VFIO_USER_ERROR) {
+ error_printf("vfio_user_wait_reqs - error reply on async request ");
+ error_printf("command %x error %s\n", msg->hdr->command,
+ strerror(msg->hdr->error_reply));
+ }
+
+ proxy->last_nowait = NULL;
+ /*
+ * Change type back to NOWAIT to free
+ */
+ msg->type = VFIO_MSG_NOWAIT;
+ vfio_user_recycle(proxy, msg);
+ }
+
+ /* lock order is BQL->proxy - don't hold proxy when getting BQL */
+ qemu_mutex_unlock(&proxy->lock);
+ if (iolock) {
+ qemu_mutex_lock_iothread();
+ }
+}
+
static QLIST_HEAD(, VFIOProxy) vfio_user_sockets =
QLIST_HEAD_INITIALIZER(vfio_user_sockets);
@@ -1021,6 +1105,103 @@ int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp)
return 0;
}
+static int vfio_user_dma_map(VFIOProxy *proxy,
+ struct vfio_iommu_type1_dma_map *map,
+ int fd, bool will_commit)
+{
+ VFIOUserFDs *fds = NULL;
+ VFIOUserDMAMap *msgp = g_malloc0(sizeof(*msgp));
+ int ret;
+
+ vfio_user_request_msg(&msgp->hdr, VFIO_USER_DMA_MAP, sizeof(*msgp), 0);
+ msgp->argsz = map->argsz;
+ msgp->flags = map->flags;
+ msgp->offset = map->vaddr;
+ msgp->iova = map->iova;
+ msgp->size = map->size;
+
+ /*
+ * The will_commit case sends without blocking or dropping BQL.
+ * They're later waited for in vfio_user_wait_reqs().
+ */
+ if (will_commit) {
+ /* can't use auto variable since we don't block */
+ if (fd != -1) {
+ fds = vfio_user_getfds(1);
+ fds->send_fds = 1;
+ fds->fds[0] = fd;
+ }
+ vfio_user_send_nowait(proxy, &msgp->hdr, fds, 0);
+ ret = 0;
+ } else {
+ VFIOUserFDs local_fds = { 1, 0, &fd };
+
+ fds = fd != -1 ? &local_fds : NULL;
+ vfio_user_send_wait(proxy, &msgp->hdr, fds, 0, will_commit);
+ ret = (msgp->hdr.flags & VFIO_USER_ERROR) ? -msgp->hdr.error_reply : 0;
+ g_free(msgp);
+ }
+
+ return ret;
+}
+
+static int vfio_user_dma_unmap(VFIOProxy *proxy,
+ struct vfio_iommu_type1_dma_unmap *unmap,
+ struct vfio_bitmap *bitmap, bool will_commit)
+{
+ struct {
+ VFIOUserDMAUnmap msg;
+ VFIOUserBitmap bitmap;
+ } *msgp = NULL;
+ int msize, rsize;
+ bool blocking = !will_commit;
+
+ if (bitmap == NULL &&
+ (unmap->flags & VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP)) {
+ error_printf("vfio_user_dma_unmap mismatched flags and bitmap\n");
+ return -EINVAL;
+ }
+
+ /*
+ * If a dirty bitmap is returned, allocate extra space for it
+ * and block for reply even in the will_commit case.
+ * Otherwise, can send the unmap request without waiting.
+ */
+ if (bitmap != NULL) {
+ blocking = true;
+ msize = sizeof(*msgp);
+ rsize = msize + bitmap->size;
+ msgp = g_malloc0(rsize);
+ msgp->bitmap.pgsize = bitmap->pgsize;
+ msgp->bitmap.size = bitmap->size;
+ } else {
+ msize = rsize = sizeof(VFIOUserDMAUnmap);
+ msgp = g_malloc0(rsize);
+ }
+
+ vfio_user_request_msg(&msgp->msg.hdr, VFIO_USER_DMA_UNMAP, msize, 0);
+ msgp->msg.argsz = rsize - sizeof(VFIOUserHdr);
+ msgp->msg.argsz = unmap->argsz;
+ msgp->msg.flags = unmap->flags;
+ msgp->msg.iova = unmap->iova;
+ msgp->msg.size = unmap->size;
+
+ if (blocking) {
+ vfio_user_send_wait(proxy, &msgp->msg.hdr, NULL, rsize, will_commit);
+ if (msgp->msg.hdr.flags & VFIO_USER_ERROR) {
+ return -msgp->msg.hdr.error_reply;
+ }
+ if (bitmap != NULL) {
+ memcpy(bitmap->data, &msgp->bitmap.data, bitmap->size);
+ }
+ g_free(msgp);
+ } else {
+ vfio_user_send_nowait(proxy, &msgp->msg.hdr, NULL, rsize);
+ }
+
+ return 0;
+}
+
static int vfio_user_get_info(VFIOProxy *proxy, struct vfio_device_info *info)
{
VFIOUserDeviceInfo msg;
@@ -1338,5 +1519,41 @@ VFIODevIO vfio_dev_io_sock = {
};
+static int vfio_user_io_dma_map(VFIOContainer *container, MemoryRegion *mr,
+ struct vfio_iommu_type1_dma_map *map)
+{
+ int fd = memory_region_get_fd(mr);
+
+ /*
+ * map->vaddr enters as a QEMU process address
+ * make it either a file offset for mapped areas or 0
+ */
+ if (fd != -1) {
+ void *addr = (void *)(uintptr_t)map->vaddr;
+
+ map->vaddr = qemu_ram_block_host_offset(mr->ram_block, addr);
+ } else {
+ map->vaddr = 0;
+ }
+
+ return vfio_user_dma_map(container->proxy, map, fd, container->async_ops);
+}
+
+static int vfio_user_io_dma_unmap(VFIOContainer *container,
+ struct vfio_iommu_type1_dma_unmap *unmap,
+ struct vfio_bitmap *bitmap)
+{
+ return vfio_user_dma_unmap(container->proxy, unmap, bitmap,
+ container->async_ops);
+}
+
+static void vfio_user_io_wait_commit(VFIOContainer *container)
+{
+ vfio_user_wait_reqs(container->proxy);
+}
+
VFIOContIO vfio_cont_io_sock = {
+ .dma_map = vfio_user_io_dma_map,
+ .dma_unmap = vfio_user_io_dma_unmap,
+ .wait_commit = vfio_user_io_wait_commit,
};
--
1.8.3.1
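The request and reply sizing in vfio_user_dma_unmap() above can be sketched in isolation. This is a minimal illustration, not the QEMU code: WireSizes, UnmapPlan and plan_unmap() are hypothetical names, and the byte sizes are passed in by the caller rather than taken from sizeof() on the real VFIOUserDMAUnmap and VFIOUserBitmap structs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for sizeof() on the wire structs. */
typedef struct {
    size_t unmap_size;      /* unmap message, header included */
    size_t bitmap_hdr_size; /* bitmap descriptor, without its data */
} WireSizes;

typedef struct {
    size_t msize;   /* bytes sent in the request */
    size_t rsize;   /* bytes reserved for the reply */
    int blocking;   /* non-zero if the caller must wait for the reply */
} UnmapPlan;

/*
 * A dirty bitmap comes back in the reply, so that case reserves extra
 * reply space and always blocks. Without a bitmap the request is only
 * sent synchronously when the caller is not batching (will_commit == 0).
 */
static UnmapPlan plan_unmap(const WireSizes *ws, int want_bitmap,
                            size_t bitmap_bytes, int will_commit)
{
    UnmapPlan p;

    assert(ws != NULL);
    if (want_bitmap) {
        p.blocking = 1;
        p.msize = ws->unmap_size + ws->bitmap_hdr_size;
        p.rsize = p.msize + bitmap_bytes;
    } else {
        p.blocking = !will_commit;
        p.msize = p.rsize = ws->unmap_size;
    }
    return p;
}
```

The point of the split is that only the reply carries the bitmap data, so the request stays small while the receive buffer is sized for the larger reply.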
* [RFC v5 18/23] vfio-user: secure DMA support
[not found] <cover.1651709440.git.john.g.johnson@oracle.com>
` (16 preceding siblings ...)
2022-05-05 17:20 ` [RFC v5 17/23] vfio-user: dma map/unmap operations John Johnson
@ 2022-05-05 17:20 ` John Johnson
2022-05-05 17:20 ` [RFC v5 19/23] vfio-user: dma read/write operations John Johnson
` (4 subsequent siblings)
22 siblings, 0 replies; 23+ messages in thread
From: John Johnson @ 2022-05-05 17:20 UTC (permalink / raw)
To: qemu-devel
Secure DMA forces the remote process to use DMA read/write messages
instead of directly mapping guest memory.
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
hw/vfio/pci.h | 1 +
hw/vfio/user.h | 1 +
hw/vfio/pci.c | 4 ++++
hw/vfio/user.c | 2 +-
4 files changed, 7 insertions(+), 1 deletion(-)
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index a4eb5b9..c207847 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -194,6 +194,7 @@ OBJECT_DECLARE_SIMPLE_TYPE(VFIOUserPCIDevice, VFIO_USER_PCI)
struct VFIOUserPCIDevice {
VFIOPCIDevice device;
char *sock_name;
+ bool secure_dma; /* disable shared mem for DMA */
bool send_queued; /* all sends are queued */
bool no_post; /* all regions write are sync */
};
diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index 742e1a9..ec764d3 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -76,6 +76,7 @@ typedef struct VFIOProxy {
/* VFIOProxy flags */
#define VFIO_PROXY_CLIENT 0x1
+#define VFIO_PROXY_SECURE 0x2
#define VFIO_PROXY_FORCE_QUEUED 0x4
#define VFIO_PROXY_NO_POST 0x8
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 054a2bd..2faf890 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3589,6 +3589,9 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
vbasedev->proxy = proxy;
vfio_user_set_handler(vbasedev, vfio_user_pci_process_req, vdev);
+ if (udev->secure_dma) {
+ proxy->flags |= VFIO_PROXY_SECURE;
+ }
if (udev->send_queued) {
proxy->flags |= VFIO_PROXY_FORCE_QUEUED;
}
@@ -3720,6 +3723,7 @@ static void vfio_user_instance_finalize(Object *obj)
static Property vfio_user_pci_dev_properties[] = {
DEFINE_PROP_STRING("socket", VFIOUserPCIDevice, sock_name),
+ DEFINE_PROP_BOOL("secure-dma", VFIOUserPCIDevice, secure_dma, false),
DEFINE_PROP_BOOL("x-send-queued", VFIOUserPCIDevice, send_queued, false),
DEFINE_PROP_BOOL("x-no-posted-writes", VFIOUserPCIDevice, no_post, false),
DEFINE_PROP_END_OF_LIST(),
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index 29eff8a..b08108c 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -1528,7 +1528,7 @@ static int vfio_user_io_dma_map(VFIOContainer *container, MemoryRegion *mr,
* map->vaddr enters as a QEMU process address
* make it either a file offset for mapped areas or 0
*/
- if (fd != -1) {
+ if (fd != -1 && (container->proxy->flags & VFIO_PROXY_SECURE) == 0) {
void *addr = (void *)(uintptr_t)map->vaddr;
map->vaddr = qemu_ram_block_host_offset(mr->ram_block, addr);
--
1.8.3.1
* [RFC v5 19/23] vfio-user: dma read/write operations
[not found] <cover.1651709440.git.john.g.johnson@oracle.com>
` (17 preceding siblings ...)
2022-05-05 17:20 ` [RFC v5 18/23] vfio-user: secure DMA support John Johnson
@ 2022-05-05 17:20 ` John Johnson
2022-05-05 17:20 ` [RFC v5 20/23] vfio-user: pci reset John Johnson
` (3 subsequent siblings)
22 siblings, 0 replies; 23+ messages in thread
From: John Johnson @ 2022-05-05 17:20 UTC (permalink / raw)
To: qemu-devel
Messages from server to client that perform device DMA.
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
hw/vfio/user-protocol.h | 11 ++++++
hw/vfio/user.h | 4 ++
hw/vfio/pci.c | 100 ++++++++++++++++++++++++++++++++++++++++++++++++
hw/vfio/user.c | 60 ++++++++++++++++++++++++++++-
4 files changed, 174 insertions(+), 1 deletion(-)
diff --git a/hw/vfio/user-protocol.h b/hw/vfio/user-protocol.h
index ad63f21..8932311 100644
--- a/hw/vfio/user-protocol.h
+++ b/hw/vfio/user-protocol.h
@@ -182,6 +182,17 @@ typedef struct {
char data[];
} VFIOUserRegionRW;
+/*
+ * VFIO_USER_DMA_READ
+ * VFIO_USER_DMA_WRITE
+ */
+typedef struct {
+ VFIOUserHdr hdr;
+ uint64_t offset;
+ uint32_t count;
+ char data[];
+} VFIOUserDMARW;
+
/*imported from struct vfio_bitmap */
typedef struct {
uint64_t pgsize;
diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index ec764d3..412c77a 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -82,9 +82,13 @@ typedef struct VFIOProxy {
VFIOProxy *vfio_user_connect_dev(SocketAddress *addr, Error **errp);
void vfio_user_disconnect(VFIOProxy *proxy);
+uint64_t vfio_user_max_xfer(void);
void vfio_user_set_handler(VFIODevice *vbasedev,
void (*handler)(void *opaque, VFIOUserMsg *msg),
void *reqarg);
+void vfio_user_send_reply(VFIOProxy *proxy, VFIOUserHdr *hdr, int size);
+void vfio_user_send_error(VFIOProxy *proxy, VFIOUserHdr *hdr, int error);
+void vfio_user_putfds(VFIOUserMsg *msg);
int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp);
extern VFIODevIO vfio_dev_io_sock;
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 2faf890..25b3ebb 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3528,6 +3528,85 @@ static void vfio_user_msix_teardown(VFIOPCIDevice *vdev)
vdev->msix->msix_regions = NULL;
}
+static void vfio_user_dma_read(VFIOPCIDevice *vdev, VFIOUserDMARW *msg)
+{
+ PCIDevice *pdev = &vdev->pdev;
+ VFIOProxy *proxy = vdev->vbasedev.proxy;
+ VFIOUserDMARW *res;
+ MemTxResult r;
+ size_t size;
+
+ if (msg->hdr.size < sizeof(*msg)) {
+ vfio_user_send_error(proxy, &msg->hdr, EINVAL);
+ return;
+ }
+ if (msg->count > vfio_user_max_xfer()) {
+ vfio_user_send_error(proxy, &msg->hdr, E2BIG);
+ return;
+ }
+
+ /* switch to our own message buffer */
+ size = msg->count + sizeof(VFIOUserDMARW);
+ res = g_malloc0(size);
+ memcpy(res, msg, sizeof(*res));
+ g_free(msg);
+
+ r = pci_dma_read(pdev, res->offset, &res->data, res->count);
+
+ switch (r) {
+ case MEMTX_OK:
+ if (res->hdr.flags & VFIO_USER_NO_REPLY) {
+ g_free(res);
+ return;
+ }
+ vfio_user_send_reply(proxy, &res->hdr, size);
+ break;
+ case MEMTX_ERROR:
+ default: /* MemTxResult bits may be OR-ed together */
+ vfio_user_send_error(proxy, &res->hdr, EFAULT);
+ break;
+ case MEMTX_DECODE_ERROR:
+ vfio_user_send_error(proxy, &res->hdr, ENODEV);
+ break;
+ }
+}
+
+static void vfio_user_dma_write(VFIOPCIDevice *vdev, VFIOUserDMARW *msg)
+{
+ PCIDevice *pdev = &vdev->pdev;
+ VFIOProxy *proxy = vdev->vbasedev.proxy;
+ MemTxResult r;
+
+ if (msg->hdr.size < sizeof(*msg)) {
+ vfio_user_send_error(proxy, &msg->hdr, EINVAL);
+ return;
+ }
+ /* make sure transfer count isn't larger than the message data */
+ if (msg->count > msg->hdr.size - sizeof(*msg)) {
+ vfio_user_send_error(proxy, &msg->hdr, E2BIG);
+ return;
+ }
+
+ r = pci_dma_write(pdev, msg->offset, &msg->data, msg->count);
+
+ switch (r) {
+ case MEMTX_OK:
+ if ((msg->hdr.flags & VFIO_USER_NO_REPLY) == 0) {
+ vfio_user_send_reply(proxy, &msg->hdr, sizeof(msg->hdr));
+ } else {
+ g_free(msg);
+ }
+ break;
+ case MEMTX_ERROR:
+ default: /* MemTxResult bits may be OR-ed together */
+ vfio_user_send_error(proxy, &msg->hdr, EFAULT);
+ break;
+ case MEMTX_DECODE_ERROR:
+ vfio_user_send_error(proxy, &msg->hdr, ENODEV);
+ break;
+ }
+
+ return;
+}
+
/*
* Incoming request message callback.
*
@@ -3535,9 +3614,30 @@ static void vfio_user_msix_teardown(VFIOPCIDevice *vdev)
*/
static void vfio_user_pci_process_req(void *opaque, VFIOUserMsg *msg)
{
+ VFIOPCIDevice *vdev = opaque;
+ VFIOUserHdr *hdr = msg->hdr;
+
+ /* no incoming PCI requests pass FDs */
+ if (msg->fds != NULL) {
+ vfio_user_send_error(vdev->vbasedev.proxy, hdr, EINVAL);
+ vfio_user_putfds(msg);
+ return;
+ }
+ switch (hdr->command) {
+ case VFIO_USER_DMA_READ:
+ vfio_user_dma_read(vdev, (VFIOUserDMARW *)hdr);
+ break;
+ case VFIO_USER_DMA_WRITE:
+ vfio_user_dma_write(vdev, (VFIOUserDMARW *)hdr);
+ break;
+ default:
+ error_printf("vfio_user_process_req unknown cmd %d\n", hdr->command);
+ vfio_user_send_error(vdev->vbasedev.proxy, hdr, ENOSYS);
+ }
}
+
/*
* Emulated devices don't use host hot reset
*/
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index b08108c..1a0d002 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -77,6 +77,11 @@ static inline void vfio_user_set_error(VFIOUserHdr *hdr, uint32_t err)
* Functions called by main, CPU, or iothread threads
*/
+uint64_t vfio_user_max_xfer(void)
+{
+ return max_xfer_size;
+}
+
static void vfio_user_shutdown(VFIOProxy *proxy)
{
qio_channel_shutdown(proxy->ioc, QIO_CHANNEL_SHUTDOWN_READ, NULL);
@@ -377,7 +382,7 @@ static int vfio_user_recv_one(VFIOProxy *proxy)
*msg->hdr = hdr;
data = (char *)msg->hdr + sizeof(hdr);
} else {
- if (hdr.size > max_xfer_size) {
+ if (hdr.size > max_xfer_size + sizeof(VFIOUserDMARW)) {
error_setg(&local_err, "vfio_user_recv request larger than max");
goto err;
}
@@ -780,6 +785,59 @@ static void vfio_user_wait_reqs(VFIOProxy *proxy)
}
}
+/*
+ * Reply to an incoming request.
+ */
+void vfio_user_send_reply(VFIOProxy *proxy, VFIOUserHdr *hdr, int size)
+{
+
+ if (size < sizeof(VFIOUserHdr)) {
+ error_printf("vfio_user_send_reply - size too small\n");
+ g_free(hdr);
+ return;
+ }
+
+ /*
+ * convert header to associated reply
+ */
+ hdr->flags = VFIO_USER_REPLY;
+ hdr->size = size;
+
+ vfio_user_send_async(proxy, hdr, NULL);
+}
+
+/*
+ * Send an error reply to an incoming request.
+ */
+void vfio_user_send_error(VFIOProxy *proxy, VFIOUserHdr *hdr, int error)
+{
+
+ /*
+ * convert header to associated reply
+ */
+ hdr->flags = VFIO_USER_REPLY;
+ hdr->flags |= VFIO_USER_ERROR;
+ hdr->error_reply = error;
+ hdr->size = sizeof(*hdr);
+
+ vfio_user_send_async(proxy, hdr, NULL);
+}
+
+/*
+ * Close FDs erroneously received in an incoming request.
+ */
+void vfio_user_putfds(VFIOUserMsg *msg)
+{
+ VFIOUserFDs *fds = msg->fds;
+ int i;
+
+ for (i = 0; i < fds->recv_fds; i++) {
+ close(fds->fds[i]);
+ }
+ g_free(fds);
+ msg->fds = NULL;
+}
+
static QLIST_HEAD(, VFIOProxy) vfio_user_sockets =
QLIST_HEAD_INITIALIZER(vfio_user_sockets);
--
1.8.3.1
* [RFC v5 20/23] vfio-user: pci reset
[not found] <cover.1651709440.git.john.g.johnson@oracle.com>
` (18 preceding siblings ...)
2022-05-05 17:20 ` [RFC v5 19/23] vfio-user: dma read/write operations John Johnson
@ 2022-05-05 17:20 ` John Johnson
2022-05-05 17:20 ` [RFC v5 21/23] vfio-user: add 'x-msg-timeout' option that specifies msg wait times John Johnson
` (2 subsequent siblings)
22 siblings, 0 replies; 23+ messages in thread
From: John Johnson @ 2022-05-05 17:20 UTC (permalink / raw)
To: qemu-devel
Message to tell the server to reset the device.
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
hw/vfio/user.h | 1 +
hw/vfio/pci.c | 15 +++++++++++++++
hw/vfio/user.c | 12 ++++++++++++
3 files changed, 28 insertions(+)
diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index 412c77a..902facf 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -90,6 +90,7 @@ void vfio_user_send_reply(VFIOProxy *proxy, VFIOUserHdr *hdr, int size);
void vfio_user_send_error(VFIOProxy *proxy, VFIOUserHdr *hdr, int error);
void vfio_user_putfds(VFIOUserMsg *msg);
int vfio_user_validate_version(VFIODevice *vbasedev, Error **errp);
+void vfio_user_reset(VFIOProxy *proxy);
extern VFIODevIO vfio_dev_io_sock;
extern VFIOContIO vfio_cont_io_sock;
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index 25b3ebb..f4b4a30 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3821,6 +3821,20 @@ static void vfio_user_instance_finalize(Object *obj)
}
}
+static void vfio_user_pci_reset(DeviceState *dev)
+{
+ VFIOPCIDevice *vdev = VFIO_PCI_BASE(dev);
+ VFIODevice *vbasedev = &vdev->vbasedev;
+
+ vfio_pci_pre_reset(vdev);
+
+ if (vbasedev->reset_works) {
+ vfio_user_reset(vbasedev->proxy);
+ }
+
+ vfio_pci_post_reset(vdev);
+}
+
static Property vfio_user_pci_dev_properties[] = {
DEFINE_PROP_STRING("socket", VFIOUserPCIDevice, sock_name),
DEFINE_PROP_BOOL("secure-dma", VFIOUserPCIDevice, secure_dma, false),
@@ -3834,6 +3848,7 @@ static void vfio_user_pci_dev_class_init(ObjectClass *klass, void *data)
DeviceClass *dc = DEVICE_CLASS(klass);
PCIDeviceClass *pdc = PCI_DEVICE_CLASS(klass);
+ dc->reset = vfio_user_pci_reset;
device_class_set_props(dc, vfio_user_pci_dev_properties);
dc->desc = "VFIO over socket PCI device assignment";
pdc->realize = vfio_user_pci_realize;
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index 1a0d002..262d1a7 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -1482,6 +1482,18 @@ static int vfio_user_region_write(VFIOProxy *proxy, uint8_t index, off_t offset,
return ret;
}
+void vfio_user_reset(VFIOProxy *proxy)
+{
+ VFIOUserHdr msg;
+
+ vfio_user_request_msg(&msg, VFIO_USER_DEVICE_RESET, sizeof(msg), 0);
+
+ vfio_user_send_wait(proxy, &msg, NULL, 0, false);
+ if (msg.flags & VFIO_USER_ERROR) {
+ error_printf("reset reply error %d\n", msg.error_reply);
+ }
+}
+
/*
* Socket-based io_ops
--
1.8.3.1
* [RFC v5 21/23] vfio-user: add 'x-msg-timeout' option that specifies msg wait times
[not found] <cover.1651709440.git.john.g.johnson@oracle.com>
` (19 preceding siblings ...)
2022-05-05 17:20 ` [RFC v5 20/23] vfio-user: pci reset John Johnson
@ 2022-05-05 17:20 ` John Johnson
2022-05-05 17:20 ` [RFC v5 22/23] vfio-user: add tracing to send/recv paths John Johnson
2022-05-05 17:20 ` [RFC v5 23/23] vfio-user: add dirty_bitmap stub until it supports migration John Johnson
22 siblings, 0 replies; 23+ messages in thread
From: John Johnson @ 2022-05-05 17:20 UTC (permalink / raw)
To: qemu-devel
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
hw/vfio/pci.h | 1 +
hw/vfio/user.h | 1 +
hw/vfio/pci.c | 4 ++++
hw/vfio/user.c | 7 ++++---
4 files changed, 10 insertions(+), 3 deletions(-)
diff --git a/hw/vfio/pci.h b/hw/vfio/pci.h
index c207847..ca50858 100644
--- a/hw/vfio/pci.h
+++ b/hw/vfio/pci.h
@@ -197,6 +197,7 @@ struct VFIOUserPCIDevice {
bool secure_dma; /* disable shared mem for DMA */
bool send_queued; /* all sends are queued */
bool no_post; /* all regions write are sync */
+ uint32_t wait_time; /* timeout for message replies */
};
/* Use uin32_t for vendor & device so PCI_ANY_ID expands and cannot match hw */
diff --git a/hw/vfio/user.h b/hw/vfio/user.h
index 902facf..18c6404 100644
--- a/hw/vfio/user.h
+++ b/hw/vfio/user.h
@@ -55,6 +55,7 @@ typedef struct VFIOProxy {
void (*request)(void *opaque, VFIOUserMsg *msg);
void *req_arg;
int flags;
+ uint32_t wait_time;
QemuCond close_cv;
AioContext *ctx;
QEMUBH *req_bh;
diff --git a/hw/vfio/pci.c b/hw/vfio/pci.c
index f4b4a30..b103d08 100644
--- a/hw/vfio/pci.c
+++ b/hw/vfio/pci.c
@@ -3698,6 +3698,9 @@ static void vfio_user_pci_realize(PCIDevice *pdev, Error **errp)
if (udev->no_post) {
proxy->flags |= VFIO_PROXY_NO_POST;
}
+ if (udev->wait_time) {
+ proxy->wait_time = udev->wait_time;
+ }
vfio_user_validate_version(vbasedev, &err);
if (err != NULL) {
@@ -3840,6 +3843,7 @@ static Property vfio_user_pci_dev_properties[] = {
DEFINE_PROP_BOOL("secure-dma", VFIOUserPCIDevice, secure_dma, false),
DEFINE_PROP_BOOL("x-send-queued", VFIOUserPCIDevice, send_queued, false),
DEFINE_PROP_BOOL("x-no-posted-writes", VFIOUserPCIDevice, no_post, false),
+ DEFINE_PROP_UINT32("x-msg-timeout", VFIOUserPCIDevice, wait_time, 0),
DEFINE_PROP_END_OF_LIST(),
};
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index 262d1a7..ec2d89b 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -39,7 +39,7 @@
static uint64_t max_xfer_size = VFIO_USER_DEF_MAX_XFER;
static uint64_t max_send_fds = VFIO_USER_DEF_MAX_FDS;
-static int wait_time = 1000; /* wait 1 sec for replies */
+static uint32_t wait_time = 1000; /* wait 1 sec for replies */
static IOThread *vfio_user_iothread;
static void vfio_user_shutdown(VFIOProxy *proxy);
@@ -718,7 +718,7 @@ static void vfio_user_send_wait(VFIOProxy *proxy, VFIOUserHdr *hdr,
if (ret == 0) {
while (!msg->complete) {
- if (!qemu_cond_timedwait(&msg->cv, &proxy->lock, wait_time)) {
+ if (!qemu_cond_timedwait(&msg->cv, &proxy->lock, proxy->wait_time)) {
QTAILQ_REMOVE(&proxy->pending, msg, next);
vfio_user_set_error(hdr, ETIMEDOUT);
break;
@@ -757,7 +757,7 @@ static void vfio_user_wait_reqs(VFIOProxy *proxy)
msg = proxy->last_nowait;
msg->type = VFIO_MSG_WAIT;
while (!msg->complete) {
- if (!qemu_cond_timedwait(&msg->cv, &proxy->lock, wait_time)) {
+ if (!qemu_cond_timedwait(&msg->cv, &proxy->lock, proxy->wait_time)) {
QTAILQ_REMOVE(&proxy->pending, msg, next);
error_printf("vfio_wait_reqs - timed out\n");
break;
@@ -867,6 +867,7 @@ VFIOProxy *vfio_user_connect_dev(SocketAddress *addr, Error **errp)
proxy->ioc = ioc;
proxy->flags = VFIO_PROXY_CLIENT;
proxy->state = VFIO_PROXY_CONNECTED;
+ proxy->wait_time = wait_time;
qemu_mutex_init(&proxy->lock);
qemu_cond_init(&proxy->close_cv);
--
1.8.3.1
* [RFC v5 22/23] vfio-user: add tracing to send/recv paths
[not found] <cover.1651709440.git.john.g.johnson@oracle.com>
` (20 preceding siblings ...)
2022-05-05 17:20 ` [RFC v5 21/23] vfio-user: add 'x-msg-timeout' option that specifies msg wait times John Johnson
@ 2022-05-05 17:20 ` John Johnson
2022-05-05 17:20 ` [RFC v5 23/23] vfio-user: add dirty_bitmap stub until it supports migration John Johnson
22 siblings, 0 replies; 23+ messages in thread
From: John Johnson @ 2022-05-05 17:20 UTC (permalink / raw)
To: qemu-devel
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
hw/vfio/user.c | 8 ++++++++
hw/vfio/trace-events | 5 +++++
2 files changed, 13 insertions(+)
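To see what the vfio_user_recv_hdr event text looks like, its format string can be rendered with plain snprintf(). This is only a sketch: format_recv_hdr() is a hypothetical helper, the argument values are made up, and any prefix a QEMU trace backend adds in front of the text is not reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Render the body of the vfio_user_recv_hdr trace event using the
 * format string from hw/vfio/trace-events. All numeric fields are
 * printed in hex without a 0x prefix.
 */
static int format_recv_hdr(char *buf, size_t len, const char *name,
                           uint16_t id, uint16_t cmd,
                           uint32_t size, uint32_t flags)
{
    assert(buf != NULL && name != NULL);
    return snprintf(buf, len, " (%s) id %x cmd %x size %x flags %x",
                    name, (unsigned)id, (unsigned)cmd,
                    (unsigned)size, (unsigned)flags);
}
```

For example, calling it with the name "/tmp/vfio.sock", id 0x1a, cmd 0xb, size 0x20 and flags 0x1 yields " (/tmp/vfio.sock) id 1a cmd b size 20 flags 1".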
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index ec2d89b..a3e4dc8 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -29,6 +29,8 @@
#include "qapi/qmp/qstring.h"
#include "qapi/qmp/qnum.h"
#include "user.h"
+#include "trace.h"
+
/*
* These are to defend against a malign server trying
@@ -111,6 +113,8 @@ static int vfio_user_send_qio(VFIOProxy *proxy, VFIOUserMsg *msg)
vfio_user_shutdown(proxy);
error_report_err(local_err);
}
+ trace_vfio_user_send_write(msg->hdr->id, ret);
+
return ret;
}
@@ -227,6 +231,7 @@ static int vfio_user_complete(VFIOProxy *proxy, Error **errp)
}
return ret;
}
+ trace_vfio_user_recv_read(msg->hdr->id, ret);
msgleft -= ret;
data += ret;
@@ -334,6 +339,8 @@ static int vfio_user_recv_one(VFIOProxy *proxy)
error_setg(&local_err, "unknown message type");
goto fatal;
}
+ trace_vfio_user_recv_hdr(proxy->sockname, hdr.id, hdr.command, hdr.size,
+ hdr.flags);
/*
* For replies, find the matching pending request.
@@ -410,6 +417,7 @@ static int vfio_user_recv_one(VFIOProxy *proxy)
if (ret <= 0) {
goto fatal;
}
+ trace_vfio_user_recv_read(hdr.id, ret);
msgleft -= ret;
data += ret;
diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events
index 0ef1b5f..ea4bd7e 100644
--- a/hw/vfio/trace-events
+++ b/hw/vfio/trace-events
@@ -165,3 +165,8 @@ vfio_load_state_device_data(const char *name, uint64_t data_offset, uint64_t dat
vfio_load_cleanup(const char *name) " (%s)"
vfio_get_dirty_bitmap(int fd, uint64_t iova, uint64_t size, uint64_t bitmap_size, uint64_t start) "container fd=%d, iova=0x%"PRIx64" size= 0x%"PRIx64" bitmap_size=0x%"PRIx64" start=0x%"PRIx64
vfio_iommu_map_dirty_notify(uint64_t iova_start, uint64_t iova_end) "iommu dirty @ 0x%"PRIx64" - 0x%"PRIx64
+
+# user.c
+vfio_user_recv_hdr(const char *name, uint16_t id, uint16_t cmd, uint32_t size, uint32_t flags) " (%s) id %x cmd %x size %x flags %x"
+vfio_user_recv_read(uint16_t id, int read) " id %x read %x"
+vfio_user_send_write(uint16_t id, int wrote) " id %x wrote %x"
--
1.8.3.1
* [RFC v5 23/23] vfio-user: add dirty_bitmap stub until it supports migration
[not found] <cover.1651709440.git.john.g.johnson@oracle.com>
` (21 preceding siblings ...)
2022-05-05 17:20 ` [RFC v5 22/23] vfio-user: add tracing to send/recv paths John Johnson
@ 2022-05-05 17:20 ` John Johnson
22 siblings, 0 replies; 23+ messages in thread
From: John Johnson @ 2022-05-05 17:20 UTC (permalink / raw)
To: qemu-devel
Signed-off-by: John G Johnson <john.g.johnson@oracle.com>
Signed-off-by: Elena Ufimtseva <elena.ufimtseva@oracle.com>
Signed-off-by: Jagannathan Raman <jag.raman@oracle.com>
---
hw/vfio/user.c | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/hw/vfio/user.c b/hw/vfio/user.c
index a3e4dc8..eb79785 100644
--- a/hw/vfio/user.c
+++ b/hw/vfio/user.c
@@ -1626,6 +1626,15 @@ static int vfio_user_io_dma_unmap(VFIOContainer *container,
container->async_ops);
}
+static int vfio_user_io_dirty_bitmap(VFIOContainer *container,
+ struct vfio_iommu_type1_dirty_bitmap *bitmap,
+ struct vfio_iommu_type1_dirty_bitmap_get *range)
+{
+
+ /* vfio-user doesn't support migration */
+ return -EINVAL;
+}
+
static void vfio_user_io_wait_commit(VFIOContainer *container)
{
vfio_user_wait_reqs(container->proxy);
@@ -1634,5 +1643,6 @@ static void vfio_user_io_wait_commit(VFIOContainer *container)
VFIOContIO vfio_cont_io_sock = {
.dma_map = vfio_user_io_dma_map,
.dma_unmap = vfio_user_io_dma_unmap,
+ .dirty_bitmap = vfio_user_io_dirty_bitmap,
.wait_commit = vfio_user_io_wait_commit,
};
--
1.8.3.1