qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: John Levon <john.levon@nutanix.com>
To: Alex Williamson <alex.williamson@redhat.com>
Cc: "benjamin.walker@intel.com" <benjamin.walker@intel.com>,
	John G Johnson <john.g.johnson@oracle.com>,
	Swapnil Ingle <swapnil.ingle@nutanix.com>,
	"Michael S . Tsirkin" <mst@redhat.com>,
	Jason Wang <jasowang@redhat.com>,
	"qemu-devel@nongnu.org" <qemu-devel@nongnu.org>,
	John Levon <levon@movementarian.org>,
	Kirti Wankhede <kwankhede@nvidia.com>,
	Gerd Hoffmann <kraxel@redhat.com>,
	"tina.zhang@intel.com" <tina.zhang@intel.com>,
	"jag.raman@oracle.com" <jag.raman@oracle.com>,
	"james.r.harris@intel.com" <james.r.harris@intel.com>,
	Elena Ufimtseva <elena.ufimtseva@oracle.com>,
	"Kanth.Ghatraju@oracle.com" <Kanth.Ghatraju@oracle.com>,
	Felipe Franciosi <felipe@nutanix.com>,
	"marcandre.lureau@redhat.com" <marcandre.lureau@redhat.com>,
	Yan Zhao <yan.y.zhao@intel.com>,
	"konrad.wilk@oracle.com" <konrad.wilk@oracle.com>,
	"yuvalkashtan@gmail.com" <yuvalkashtan@gmail.com>,
	"dgilbert@redhat.com" <dgilbert@redhat.com>,
	"ismael@linux.com" <ismael@linux.com>,
	Stefan Hajnoczi <stefanha@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	"changpeng.liu@intel.com" <changpeng.liu@intel.com>,
	"tomassetti.andrea@gmail.com" <tomassetti.andrea@gmail.com>,
	"mpiszczek@ddn.com" <mpiszczek@ddn.com>,
	Cornelia Huck <cohuck@redhat.com>,
	Christophe de Dinechin <cdupontd@redhat.com>,
	Raphael Norwitz <raphael.norwitz@nutanix.com>,
	"xiuchun.lu@intel.com" <xiuchun.lu@intel.com>,
	Thanos Makatos <thanos.makatos@nutanix.com>
Subject: Re: [PATCH v8] introduce vfio-user protocol specification
Date: Wed, 19 May 2021 22:38:48 +0000	[thread overview]
Message-ID: <20210519223844.GA1036405@sent> (raw)
In-Reply-To: <20210519150817.6a90985b.alex.williamson@redhat.com>

On Wed, May 19, 2021 at 03:08:17PM -0600, Alex Williamson wrote:

> > +VFIO_USER_DMA_MAP
> > +-----------------
> > +
> > +Message Format
> > +^^^^^^^^^^^^^^
> > +
> > ++--------------+------------------------+
> > +| Name         | Value                  |
> > ++==============+========================+
> > +| Message ID   | <ID>                   |
> > ++--------------+------------------------+
> > +| Command      | 2                      |
> > ++--------------+------------------------+
> > +| Message size | 16 + table size        |
> > ++--------------+------------------------+
> > +| Flags        | Reply bit set in reply |
> > ++--------------+------------------------+
> > +| Error        | 0/errno                |
> > ++--------------+------------------------+
> > +| Table        | array of table entries |
> > ++--------------+------------------------+
> > +
> > +This command message is sent by the client to the server to inform it of the
> > +memory regions the server can access. It must be sent before the server can
> > +perform any DMA to the client. It is normally sent directly after the version
> > +handshake is completed, but may also occur when memory is added to the client,
> > +or if the client uses a vIOMMU. If the client does not expect the server to
> > +perform DMA then it does not need to send to the server VFIO_USER_DMA_MAP
> > +commands. If the server does not need to perform DMA then it can ignore such
> > +commands but it must still reply to them. The table is an array of the
> > +following structure:
> > +
> > +Table entry format
> > +^^^^^^^^^^^^^^^^^^
> > +
> > ++-------------+--------+-------------+
> > +| Name        | Offset | Size        |
> > ++=============+========+=============+
> > +| Address     | 0      | 8           |
> > ++-------------+--------+-------------+
> > +| Size        | 8      | 8           |
> > ++-------------+--------+-------------+
> > +| Offset      | 16     | 8           |
> > ++-------------+--------+-------------+
> > +| Protections | 24     | 4           |
> > ++-------------+--------+-------------+
> > +| Flags       | 28     | 4           |
> > ++-------------+--------+-------------+
> > +|             | +-----+------------+ |
> > +|             | | Bit | Definition | |
> > +|             | +=====+============+ |
> > +|             | | 0   | Mappable   | |
> > +|             | +-----+------------+ |
> > ++-------------+--------+-------------+
> > +
> > +* *Address* is the base DMA address of the region.
> > +* *Size* is the size of the region.
> > +* *Offset* is the file offset of the region with respect to the associated file
> > +  descriptor.
> 
> It might help to explicitly state this value is ignored by the server
> for non-mappable DMA, otherwise a server might assume a specific value
> is required (ex. zero) for such cases.

Generally we say that unused inputs must be zero, but yes, this should be
clarified, thanks.

> > +* *Protections* are the region's protection attributes as encoded in
> > +  ``<sys/mman.h>``.
> > +* *Flags* contains the following region attributes:
> > +
> > +  * *Mappable* indicates that the region can be mapped via the mmap() system
> > +    call using the file descriptor provided in the message meta-data.
> > +
> > +This structure is 32 bytes in size, so the message size is:
> > +16 + (# of table entries * 32).
> > +
> > +If a DMA region being added can be directly mapped by the server, an array of
> > +file descriptors must be sent as part of the message meta-data. Each mappable
> > +region entry must have a corresponding file descriptor. On AF_UNIX sockets, the
> 
> Is this saying that if the client provides table entries where indexes
> 1, 3, and 5 are indicated as mappable, then there must be an ancillary
> file descriptor array with 3 entries, where fd[0] maps to entry[1],
> fd[1]:entry[3], and fd[2]:entry[5], even if fd[0-2] are all the
> same file descriptor?

Yes. Though we are planning to change these calls to only support single regions
which would make this moot.

> > +file descriptors must be passed as SCM_RIGHTS type ancillary data. Otherwise,
> > +if a DMA region cannot be directly mapped by the server, it can be accessed by
> > +the server using VFIO_USER_DMA_READ and VFIO_USER_DMA_WRITE messages, explained
> > +in `Read and Write Operations`_. A command to map over an existing region must
> > +be failed by the server with ``EEXIST`` set in error field in the reply.
> > +
> > +Adding multiple DMA regions can partially fail. The response does not indicate
> > +which regions were added and which were not, therefore it is a client
> > +implementation detail how to recover from the failure.
> > +
> > +.. Note::
> > +   The server can optionally remove succesfully added DMA regions making this
> > +   operation atomic.
> > +   The client can recover by attempting to unmap one by one all the DMA regions
> > +   in the VFIO_USER_DMA_MAP command, ignoring failures for regions that do not
> > +   exist.
> 
> What's the benefit of specifying this server behavior as optional?  I'm
> afraid this unspecified error recovery behavior might actually deter
> clients from performing batch mappings.  Servers also have little
> incentive to do their own cleanup if the client has no way to detect
> that behavior.

This may well be moot too.

> > +VFIO_USER_DMA_UNMAP
> > +-------------------
> > +
> > +This command message is sent by the client to the server to inform it that a
> > +DMA region, previously made available via a VFIO_USER_DMA_MAP command message,
> > +is no longer available for DMA. It typically occurs when memory is subtracted
> > +from the client or if the client uses a vIOMMU. If the client does not expect
> > +the server to perform DMA then it does not need to send to the server
> > +VFIO_USER_DMA_UNMAP commands. If the server does not need to perform DMA then
> > +it can ignore such commands but it must still reply to them. The table is an
> 
> I'm confused why expectation of DMA plays a factor here.  For example,
> if QEMU unplugs a DIMM and the server has an mmap of the file descriptor
> related to that DIMM, does it get to retain the mmap if it doesn't
> currently have any DMA queued targeting that address range?  Can QEMU
> skip sending an unmap if the PCI bus master bit is disabled on the
> device preventing further DMA?  How can the associated file descriptor
> get released?  This doesn't feel strongly specified.

I thought we'd removed those sentences actually, as they're just confusing. In
reality, everything is going to both send and handle map/unmap messages.

> Are there any assumptions about address and size of the unmap command
> relative to the original map command or is the client freely allowed to
> bisect, overlap, or overextend previous mappings?

Good question. Filed https://github.com/nutanix/libvfio-user/issues/504 to
track this.

I actually don't know what clients would like to be able to do in this respect.

> > +* *Offset* is the file offset of the region with respect to the associated file
> > +  descriptor.
> > +* *Protections* are the region's protection attributes as encoded in
> > +  ``<sys/mman.h>``.
> > +* *Flags* contains the following region attributes:
> > +
> > +  * *VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP* indicates that a dirty page bitmap
> > +    must be populated before unmapping the DMA region. The client must provide
> > +    a ``struct vfio_bitmap`` in the VFIO bitmaps field for each region, with
> > +    the ``vfio_bitmap.pgsize`` and ``vfio_bitmap.size`` fields initialized.
> > +
> > +* *VFIO Bitmaps* contains one ``struct vfio_bitmap`` per region (explained
> > +  below) if ``VFIO_DMA_UNMAP_FLAG_GET_DIRTY_BITMAP`` is set in Flags.
> 
> How will this be extended when new flags are added to get new data or
> new data formats?  Note for instance that the kernel struct
> vfio_iommu_type1_dma_unmap specifies the data[] as opaque in general and
> only specifies it as struct vfio_bitmap for the case where
> GET_DIRTY_BITMAP is specified.

We're planning to do something like that for the new (actually more like vfio
again) map/unmap format.

> > +Upon receiving a VFIO_USER_DMA_UNMAP command, if the file descriptor is mapped
> > +then the server must release all references to that DMA region before replying,
> > +which includes potentially in flight DMA transactions. Removing a portion of a
> > +DMA region is possible.
> 
> Ah, maybe this answers my question about unmap vs map, but it also seems
> to contradict the description allowing the server to ignore unmap
> requests if no DMA is expected when we state here that the server MUST
> release references.

The text is confusing (which is why I've removed it again). What it's really
trying to say is:

If there is a server implementation (such as the gpio-pci-idio-16 sample) that
never needs to access guest memory at all, then the server can choose to ignore
DMA_MAP/UNMAP - so it would never keep any references around in the first place.

It's not a useful thing to mention in the spec IMHO.

> Is this also a good place to point out that the max messages size of
> 4096 is extremely limiting for returning a dirty bitmap for most use
> cases?  Some discussion of the error codes for such a case might be
> relevant here too.

It's a silly low default. The only implementation so far reports 65536 for what
it's worth.

We are also prototyping implementation changes such that this limit can be
removed altogether; hopefully that will come in a future spec update.

> > +VFIO_USER_VM_INTERRUPT
> > +
> > +Interrupt info format
> > +^^^^^^^^^^^^^^^^^^^^^
> > +
> > ++-----------+--------+------+
> > +| Name      | Offset | Size |
> > ++===========+========+======+
> > +| Sub-index | 16     | 4    |
> > ++-----------+--------+------+
> > +
> > +* *Sub-index* is relative to the IRQ index, e.g., the vector number used in PCI
> > +  MSI/X type interrupts.
> 
> Sorry if I'm blind, but where is the index itself provided?

You were confused because the message makes no sense as it's defined. It's been
removed.

Thanks for taking a look, much appreciated!

regards
john

  reply	other threads:[~2021-05-19 22:40 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-04-14 11:41 [PATCH v8] introduce vfio-user protocol specification Thanos Makatos
2021-04-26 15:48 ` Stefan Hajnoczi
2021-04-27 12:02   ` Thanos Makatos
2021-04-27 15:01     ` Stefan Hajnoczi
2021-05-04 13:51 ` Stefan Hajnoczi
2021-05-04 14:31   ` John Levon
2021-05-05 15:51     ` Stefan Hajnoczi
2021-06-14  9:57     ` Thanos Makatos
2021-05-05 16:19   ` John Levon
2021-05-06  8:49     ` Stefan Hajnoczi
2021-05-07 16:10     ` Thanos Makatos
2021-06-14 10:07   ` Thanos Makatos
2021-05-10 16:57 ` Stefan Hajnoczi
2021-05-10 22:25   ` John Levon
2021-05-11 10:09     ` Stefan Hajnoczi
2021-05-11 10:43       ` John Levon
2021-05-11 15:40         ` Stefan Hajnoczi
2021-05-12  5:08     ` John Johnson
2021-05-19 21:08 ` Alex Williamson
2021-05-19 22:38   ` John Levon [this message]
2021-06-14  9:47     ` Thanos Makatos

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20210519223844.GA1036405@sent \
    --to=john.levon@nutanix.com \
    --cc=Kanth.Ghatraju@oracle.com \
    --cc=alex.williamson@redhat.com \
    --cc=benjamin.walker@intel.com \
    --cc=cdupontd@redhat.com \
    --cc=changpeng.liu@intel.com \
    --cc=cohuck@redhat.com \
    --cc=dgilbert@redhat.com \
    --cc=elena.ufimtseva@oracle.com \
    --cc=felipe@nutanix.com \
    --cc=ismael@linux.com \
    --cc=jag.raman@oracle.com \
    --cc=james.r.harris@intel.com \
    --cc=jasowang@redhat.com \
    --cc=john.g.johnson@oracle.com \
    --cc=konrad.wilk@oracle.com \
    --cc=kraxel@redhat.com \
    --cc=kwankhede@nvidia.com \
    --cc=levon@movementarian.org \
    --cc=marcandre.lureau@redhat.com \
    --cc=mpiszczek@ddn.com \
    --cc=mst@redhat.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=raphael.norwitz@nutanix.com \
    --cc=stefanha@redhat.com \
    --cc=swapnil.ingle@nutanix.com \
    --cc=thanos.makatos@nutanix.com \
    --cc=tina.zhang@intel.com \
    --cc=tomassetti.andrea@gmail.com \
    --cc=xiuchun.lu@intel.com \
    --cc=yan.y.zhao@intel.com \
    --cc=yuvalkashtan@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).