public inbox for kvm@vger.kernel.org
From: Jason Wang <jasowang@redhat.com>
To: Elena Afanasova <eafanasova@gmail.com>, kvm@vger.kernel.org
Cc: mst@redhat.com, john.g.johnson@oracle.com, dinechin@redhat.com,
	cohuck@redhat.com, felipe@nutanix.com,
	Stefan Hajnoczi <stefanha@redhat.com>,
	Elena Ufimtseva <elena.ufimtseva@oracle.com>,
	Jag Raman <jag.raman@oracle.com>
Subject: Re: MMIO/PIO dispatch file descriptors (ioregionfd) design discussion
Date: Thu, 26 Nov 2020 11:37:30 +0800
Message-ID: <0447ec50-6fe8-4f10-73db-e3feec2da61c@redhat.com>
In-Reply-To: <CAFO2pHzmVf7g3z0RikQbYnejwcWRtHKV=npALs49eRDJdt4mJQ@mail.gmail.com>


On 2020/11/26 3:21 AM, Elena Afanasova wrote:
> Hello,
>
> I'm an Outreachy intern with QEMU and I’m working on implementing the
> ioregionfd API in KVM. So I’d like to resume the ioregionfd design
> discussion. The latest version of the ioregionfd API document is provided
> below.
>
> Overview
> --------
> ioregionfd is a KVM dispatch mechanism for handling MMIO/PIO accesses over a
> file descriptor without returning from ioctl(KVM_RUN). This allows device
> emulation to run in another task separate from the vCPU task.
>
> This is achieved through KVM ioctls for registering MMIO/PIO regions and a
> wire protocol that KVM uses to communicate with a task handling an MMIO/PIO
> access.
>
> The traditional ioctl(KVM_RUN) dispatch mechanism with device emulation in a
> separate task looks like this:
>
>    kvm.ko  <---ioctl(KVM_RUN)---> VMM vCPU task <---messages---> device task
>
> ioregionfd improves performance by eliminating the need for the vCPU task to
> forward MMIO/PIO exits to device emulation tasks:


I wonder in which cases we care about performance like this. (Note that
vhost-user has supported set|get_config() for a while.)


>
>    kvm.ko  <---ioctl(KVM_RUN)---> VMM vCPU task
>      ^
>      `---ioregionfd---> device task


It would be better to also draw the device task on the KVM_RUN path here,
so that the possible advantage is visible in the diagram.


>
> Both multi-threaded and multi-process VMMs can take advantage of ioregionfd
> to run device emulation in dedicated threads and processes, respectively.
>
> This mechanism is similar to ioeventfd except it supports all read and write
> accesses, whereas ioeventfd only supports posted doorbell writes.
>
> Traditional ioctl(KVM_RUN) dispatch and ioeventfd continue to work alongside
> the new mechanism, but only one mechanism handles a MMIO/PIO access.
>
> KVM_CREATE_IOREGIONFD
> ---------------------
> :Capability: KVM_CAP_IOREGIONFD
> :Architectures: all
> :Type: system ioctl
> :Parameters: none
> :Returns: an ioregionfd file descriptor, -1 on error
>
> This ioctl creates a new ioregionfd and returns the file descriptor. The fd
> can be used to handle MMIO/PIO accesses instead of returning from
> ioctl(KVM_RUN) with KVM_EXIT_MMIO or KVM_EXIT_PIO. One or more MMIO or PIO
> regions must be registered with KVM_SET_IOREGION in order to receive MMIO/PIO
> accesses on the fd. An ioregionfd can be used with multiple VMs and its
> lifecycle is not tied to a specific VM.
>
> When the last file descriptor for an ioregionfd is closed, all regions
> registered with KVM_SET_IOREGION are dropped and guest accesses to those
> regions cause ioctl(KVM_RUN) to return again.


I may be missing something, but I don't see any special requirement for
this fd. The fd is just a transport for a protocol between KVM and a
userspace process. So instead of mandating a new fd type, it might be
better to allow any type of fd to be attached (e.g. a pipe or a socket).
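
To make the suggestion concrete, here is a rough userspace sketch, not a
proposed implementation. It assumes the ioregionfd_cmd/ioregionfd_resp
layouts from the wire protocol section further down, and uses an ordinary
AF_UNIX socketpair as the transport. The function name and the choice of
SOCK_STREAM are mine; one end merely plays KVM's role for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Message layouts as given in the wire protocol section of the proposal. */
struct ioregionfd_cmd {
    uint32_t info;
    uint32_t padding;
    uint64_t user_data;
    uint64_t offset;
    uint64_t data;
};

struct ioregionfd_resp {
    uint64_t data;
    uint8_t  pad[24];
};

/*
 * Hypothetical demo: one end of a socketpair stands in for KVM, the other
 * for the device task. Returns 0 and stores the value the "guest" read in
 * *out, or -1 on any I/O failure.
 */
static int socketpair_read_roundtrip(uint64_t device_value, uint64_t *out)
{
    struct ioregionfd_cmd cmd, seen;
    struct ioregionfd_resp resp, got;
    int sv[2];
    int ret = -1;

    assert(out != NULL);
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    /* "KVM" side: emit a 32-bit read (cmd = 0, size code = 2) at offset 0x10. */
    memset(&cmd, 0, sizeof(cmd));
    cmd.info = 0u | (2u << 4);
    cmd.offset = 0x10;
    if (write(sv[0], &cmd, sizeof(cmd)) != (ssize_t)sizeof(cmd))
        goto done;

    /* "Device" side: receive the command and answer with the register value. */
    if (read(sv[1], &seen, sizeof(seen)) != (ssize_t)sizeof(seen))
        goto done;
    memset(&resp, 0, sizeof(resp));
    resp.data = device_value;
    if (write(sv[1], &resp, sizeof(resp)) != (ssize_t)sizeof(resp))
        goto done;

    /* "KVM" side: collect the response before re-entering the guest. */
    if (read(sv[0], &got, sizeof(got)) != (ssize_t)sizeof(got))
        goto done;
    *out = got.data;
    ret = 0;
done:
    close(sv[0]);
    close(sv[1]);
    return ret;
}
```

Nothing in this exchange depends on the fd being a dedicated type, which is
why a pipe pair or a socket looks sufficient to me.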


>
> KVM_SET_IOREGION
> ----------------
> :Capability: KVM_CAP_IOREGIONFD
> :Architectures: all
> :Type: vm ioctl
> :Parameters: struct kvm_ioregion (in)
> :Returns: 0 on success, -1 on error
>
> This ioctl adds, modifies, or removes an ioregionfd MMIO or PIO region.
> Guest read and write accesses are dispatched through the given ioregionfd
> instead of returning from ioctl(KVM_RUN).
>
> ::
>
>   struct kvm_ioregion {
>       __u64 guest_paddr; /* guest physical address */
>       __u64 memory_size; /* bytes */
>       __u64 user_data;
>       __s32 fd; /* previously created with KVM_CREATE_IOREGIONFD */
>       __u32 flags;
>       __u8  pad[32];
>   };
>
>   /* for kvm_ioregion::flags */
>   #define KVM_IOREGION_PIO           (1u << 0)
>   #define KVM_IOREGION_POSTED_WRITES (1u << 1)
>
> If a new region would split an existing region -1 is returned and errno is
> EINVAL.
>
> Regions can be deleted by setting fd to -1. If no existing region matches
> guest_paddr and memory_size then -1 is returned and errno is ENOENT.
>
> Existing regions can be modified as long as guest_paddr and memory_size
> match an existing region.
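
The add/modify/delete rules above could be modelled in userspace roughly as
follows. This is only a sketch of the described semantics, not kernel code.
I am assuming that "split" means any overlap that is not an exact match,
the table type and function name are made up, the model ignores the PIO
flag, and it returns -errno directly instead of -1 plus errno.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Layout copied from the proposal, with fixed-width C types. */
struct kvm_ioregion {
    uint64_t guest_paddr;
    uint64_t memory_size;
    uint64_t user_data;
    int32_t  fd;
    uint32_t flags;
    uint8_t  pad[32];
};

#define MAX_REGIONS 8 /* hypothetical limit for this model */

struct region_table {
    struct kvm_ioregion slot[MAX_REGIONS];
    int used[MAX_REGIONS];
};

static int model_set_ioregion(struct region_table *t,
                              const struct kvm_ioregion *r)
{
    int free_slot = -1;

    assert(t != NULL && r != NULL);

    /* Pass 1: an exact match means modify, or delete when fd == -1. */
    for (int i = 0; i < MAX_REGIONS; i++) {
        if (!t->used[i]) {
            if (free_slot < 0)
                free_slot = i;
            continue;
        }
        if (t->slot[i].guest_paddr == r->guest_paddr &&
            t->slot[i].memory_size == r->memory_size) {
            if (r->fd == -1)
                t->used[i] = 0;
            else
                t->slot[i] = *r;
            return 0;
        }
    }
    if (r->fd == -1)
        return -ENOENT; /* nothing to delete */

    /* Pass 2: a new region must not overlap an existing one. */
    for (int i = 0; i < MAX_REGIONS; i++) {
        const struct kvm_ioregion *e = &t->slot[i];

        if (t->used[i] &&
            r->guest_paddr < e->guest_paddr + e->memory_size &&
            e->guest_paddr < r->guest_paddr + r->memory_size)
            return -EINVAL;
    }
    if (free_slot < 0)
        return -ENOSPC; /* limit of this model only, not in the proposal */

    t->slot[free_slot] = *r;
    t->used[free_slot] = 1;
    return 0;
}
```
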
>
> MMIO is the default. The KVM_IOREGION_PIO flag selects PIO instead.
>
> The user_data value is included in messages KVM writes to the ioregionfd
> upon guest access. KVM does not interpret user_data.
>
> Both read and write guest accesses wait for a response before entering the
> guest again. The KVM_IOREGION_POSTED_WRITES flag does not wait for a response
> and immediately enters the guest again. This is suitable for accesses that do
> not require synchronous emulation, such as posted doorbell register writes.
> Note that guest writes may block the vCPU despite KVM_IOREGION_POSTED_WRITES
> if the device is too slow in reading from the ioregionfd.
>
> Wire protocol
> -------------
> The protocol spoken over the file descriptor is as follows. The device reads
> commands from the file descriptor with the following layout::
>
>   struct ioregionfd_cmd {
>       __u32 info;
>       __u32 padding;
>       __u64 user_data;
>       __u64 offset;
>       __u64 data;
>   };
>
> The info field layout is as follows::
>
>   bits:  | 31 ... 8 |  6   | 5 ... 4 | 3 ... 0 |
>   field: | reserved | resp |   size  |   cmd   |
>
> The cmd field identifies the operation to perform::
>
>   #define IOREGIONFD_CMD_READ  0
>   #define IOREGIONFD_CMD_WRITE 1
>
> The size field indicates the size of the access::
>
>   #define IOREGIONFD_SIZE_8BIT  0
>   #define IOREGIONFD_SIZE_16BIT 1
>   #define IOREGIONFD_SIZE_32BIT 2
>   #define IOREGIONFD_SIZE_64BIT 3
>
> If the command is IOREGIONFD_CMD_WRITE then the resp bit indicates whether
> or not a response must be sent.
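
For reference, decoding the info field as drawn above might look like the
sketch below. Only the bit positions (cmd in bits 3..0, size in bits 5..4,
resp in bit 6) and the IOREGIONFD_CMD_* values come from the document; the
mask/shift macro names and helper functions are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the proposal. */
#define IOREGIONFD_CMD_READ  0
#define IOREGIONFD_CMD_WRITE 1

/* Hypothetical names; only the bit positions come from the diagram. */
#define INFO_CMD_MASK   0xfu
#define INFO_SIZE_SHIFT 4
#define INFO_SIZE_MASK  0x3u
#define INFO_RESP_SHIFT 6

/* Operation requested by the guest access. */
static inline unsigned info_cmd(uint32_t info)
{
    return info & INFO_CMD_MASK;
}

/* Size codes 0..3 map to 1, 2, 4 and 8 bytes. */
static inline unsigned info_size_bytes(uint32_t info)
{
    return 1u << ((info >> INFO_SIZE_SHIFT) & INFO_SIZE_MASK);
}

/* Only meaningful for writes, per the paragraph above. */
static inline bool info_resp(uint32_t info)
{
    return (info >> INFO_RESP_SHIFT) & 1u;
}

/* Inverse of the accessors above, handy for tests. */
static inline uint32_t info_pack(unsigned cmd, unsigned size_code, bool resp)
{
    assert(cmd <= INFO_CMD_MASK && size_code <= INFO_SIZE_MASK);
    return cmd | (size_code << INFO_SIZE_SHIFT) |
           ((uint32_t)resp << INFO_RESP_SHIFT);
}
```
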
>
> The user_data field contains the opaque value provided to KVM_SET_IOREGION.
> Applications can use this to uniquely identify the region that is being
> accessed.
>
> The offset field contains the byte offset being accessed within a region
> that was registered with KVM_SET_IOREGION.
>
> If the command is IOREGIONFD_CMD_WRITE then data contains the value
> being written. The data value is a 64-bit integer in host endianness,
> regardless of the access size.
>
> The device sends responses by writing the following structure to the
> file descriptor::
>
>   struct ioregionfd_resp {
>       __u64 data;
>       __u8 pad[24];
>   };
>
> The data field contains the value read by an IOREGIONFD_CMD_READ
> command. This field is zero for other commands. The data value is a 64-bit
> integer in host endianness, regardless of the access size.
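
A device-side handler for one command/response cycle could be sketched like
this. The struct layouts and command values are taken from the text above;
the register file, its size, the out-of-range policy (reads return zero,
writes are ignored) and the function name are assumptions made only for the
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

struct ioregionfd_cmd {
    uint32_t info;
    uint32_t padding;
    uint64_t user_data;
    uint64_t offset;
    uint64_t data;
};

struct ioregionfd_resp {
    uint64_t data;
    uint8_t  pad[24];
};

/* Both messages are 32 bytes with the layouts given in the proposal. */
static_assert(sizeof(struct ioregionfd_cmd) == 32, "unexpected cmd size");
static_assert(sizeof(struct ioregionfd_resp) == 32, "unexpected resp size");

#define IOREGIONFD_CMD_READ  0
#define IOREGIONFD_CMD_WRITE 1

#define REGS_LEN 64 /* hypothetical register file size in bytes */

/*
 * Apply one command to a byte-addressed register file and fill in *resp.
 * Returns true when the response must be written back to the fd: always
 * for reads, and for writes only when the resp bit (bit 6) is set.
 */
static bool handle_cmd(uint8_t regs[REGS_LEN],
                       const struct ioregionfd_cmd *cmd,
                       struct ioregionfd_resp *resp)
{
    unsigned op = cmd->info & 0xfu;
    unsigned len = 1u << ((cmd->info >> 4) & 0x3u);
    bool want_resp = (cmd->info >> 6) & 1u;

    memset(resp, 0, sizeof(*resp)); /* data stays zero for non-reads */

    if (cmd->offset <= REGS_LEN - len) {
        for (unsigned i = 0; i < len; i++) {
            if (op == IOREGIONFD_CMD_WRITE)
                regs[cmd->offset + i] = (cmd->data >> (8 * i)) & 0xff;
            else if (op == IOREGIONFD_CMD_READ)
                resp->data |= (uint64_t)regs[cmd->offset + i] << (8 * i);
        }
    }
    return op == IOREGIONFD_CMD_READ || want_resp;
}
```

The shifts treat data purely as an integer value, so the handler does not
care about host byte order; the register file here happens to be stored
least significant byte first.
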
>
> Ordering
> --------
> Guest accesses are delivered in order, including posted writes.
>
> Signals
> -------
> The vCPU task can be interrupted by a signal while waiting for an ioregionfd
> response. In this case ioctl(KVM_RUN) returns with -EINTR. Guest entry is
> deferred until ioctl(KVM_RUN) is called again and the response has been
> written to the ioregionfd.
>
> Security
> --------
> Device emulation processes may be untrusted in multi-process VMM
> architectures. Therefore the control plane and the data plane of ioregionfd
> are separate. A task that only has access to an ioregionfd is unable to
> add/modify/remove regions since that requires ioctls on a KVM vm fd. This
> ensures that device emulation processes can only service MMIO/PIO accesses
> for regions that the VMM registered on their behalf.
>
> Multi-queue scalability
> -----------------------
> The protocol is synchronous - only one command/response cycle is in flight
> at a time - but the vCPU will be blocked until the response has been
> processed anyway. If another vCPU accesses an MMIO or PIO region belonging to
> the same ioregionfd during this time then it waits for the first access to
> complete.
>
> Per-queue ioregionfds can be set up to take advantage of concurrency on
> multi-queue devices.
>
> Polling
> -------
> Userspace can poll ioregionfd by submitting an io_uring IORING_OP_READ
> request and polling the cq ring to detect when the read has completed.
> Although this dispatch mechanism incurs more overhead than polling directly
> on guest RAM, it captures each write access and supports reads.
>
> Does it obsolete ioeventfd?
> ---------------------------
> No, although KVM_IOREGION_POSTED_WRITES offers somewhat similar functionality
> to ioeventfd, there are differences. The datamatch functionality of ioeventfd
> is not available and would need to be implemented by the device emulation
> program.


This means another dispatching layer in the device emulation.
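
As a rough idea of what that extra layer would have to do, assuming the
device keeps one filter per doorbell (the struct and function names here
are hypothetical, and the matching rule is my reading of ioeventfd's
datamatch behaviour):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-doorbell filter kept by the device emulation program. */
struct doorbell_match {
    uint64_t offset;    /* byte offset within the region */
    unsigned len;       /* access size in bytes */
    uint64_t datamatch; /* value that must be written */
    bool     use_match; /* false: any written value triggers the doorbell */
};

/* Decide whether a posted write should trigger this doorbell. */
static bool doorbell_hit(const struct doorbell_match *m, uint64_t offset,
                         unsigned len, uint64_t value)
{
    assert(m != NULL);
    if (offset != m->offset || len != m->len)
        return false;
    return !m->use_match || value == m->datamatch;
}
```

With ioeventfd this filtering happens in the kernel before userspace is
woken; here every write reaches the device task first.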

Thanks


> Due to the counter semantics of eventfds there is automatic coalescing
> of repeated accesses with ioeventfd. Overall ioeventfd is lighter weight but
> also more limited.


