* [RFC PATCH] KVM: optionally commit write on ioeventfd write
@ 2022-10-05 21:15 Thanos Makatos
2025-09-05 14:09 ` Sean Christopherson
0 siblings, 1 reply; 15+ messages in thread
From: Thanos Makatos @ 2022-10-05 21:15 UTC (permalink / raw)
To: kvm
Cc: john.levon, mst, john.g.johnson, dinechin, cohuck, jasowang,
stefanha, jag.raman, eafanasova, elena.ufimtseva, changpeng.liu,
james.r.harris, benjamin.walker, Thanos Makatos
This patch is a slightly different take on the ioregionfd mechanism
previously described here:
https://lore.kernel.org/all/88ca79d2e378dcbfb3988b562ad2c16c4f929ac7.camel@gmail.com/
The goal of this new mechanism, which we tentatively call shadow
ioeventfd in lack of a better name, is to speed up doorbell writes on
NVMe controllers emulated outside of the VMM. Currently, a doorbell
write to an NVMe SQ tail doorbell requires returning from ioctl(KVM_RUN)
and the VMM communicating the event, along with the doorbell value, to
the NVMe controller emulation task. With the shadow ioeventfd, the NVMe
emulation task is directly notified of the doorbell write and can find
the doorbell value in a known location, without the interference of the
VMM.
To demonstrate the performance benefit of the shadow ioeventfd
mechanism, I've implemented a test using the vfio-user protocol for
enabling out-of-process device emulation, which can be found here:
https://github.com/tmakatos/muser/commit/7adfe45 I've patched QEMU to
enable shadow ioeventfd here:
https://github.com/tmakatos/qemu-oracle/commit/55f2781 This is based on
John Johnson's not-yet-merged vfio-user server patches. In this test,
the guest repeatedly writes to two pieces of memory: one accelarated by
a shadow ioeventfd and the other not. Writing to the piece of memory
accelarated by the shadow ioeventfd is 4 times faster.
Signed-off-by: Thanos Makatos <thanos.makatos@nutanix.com>
---
include/uapi/linux/kvm.h | 5 ++++-
tools/include/uapi/linux/kvm.h | 2 ++
virt/kvm/eventfd.c | 9 +++++++++
3 files changed, 15 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index eed0315a77a6..0a884ac1cc76 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -804,6 +804,7 @@ enum {
kvm_ioeventfd_flag_nr_deassign,
kvm_ioeventfd_flag_nr_virtio_ccw_notify,
kvm_ioeventfd_flag_nr_fast_mmio,
+ kvm_ioevetnfd_flag_nr_commit_write,
kvm_ioeventfd_flag_nr_max,
};
@@ -812,16 +813,18 @@ enum {
#define KVM_IOEVENTFD_FLAG_DEASSIGN (1 << kvm_ioeventfd_flag_nr_deassign)
#define KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY \
(1 << kvm_ioeventfd_flag_nr_virtio_ccw_notify)
+#define KVM_IOEVENTFD_FLAG_COMMIT_WRITE (1 << kvm_ioevetnfd_flag_nr_commit_write)
#define KVM_IOEVENTFD_VALID_FLAG_MASK ((1 << kvm_ioeventfd_flag_nr_max) - 1)
struct kvm_ioeventfd {
__u64 datamatch;
__u64 addr; /* legal pio/mmio address */
+ __u64 vaddr; /* user address to write to if COMMIT_WRITE is set */
__u32 len; /* 1, 2, 4, or 8 bytes; or 0 to ignore length */
__s32 fd;
__u32 flags;
- __u8 pad[36];
+ __u8 pad[28];
};
#define KVM_X86_DISABLE_EXITS_MWAIT (1 << 0)
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index eed0315a77a6..ee64ff1abccc 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -804,6 +804,7 @@ enum {
kvm_ioeventfd_flag_nr_deassign,
kvm_ioeventfd_flag_nr_virtio_ccw_notify,
kvm_ioeventfd_flag_nr_fast_mmio,
+ kvm_ioevetnfd_flag_nr_commit_write,
kvm_ioeventfd_flag_nr_max,
};
@@ -812,6 +813,7 @@ enum {
#define KVM_IOEVENTFD_FLAG_DEASSIGN (1 << kvm_ioeventfd_flag_nr_deassign)
#define KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY \
(1 << kvm_ioeventfd_flag_nr_virtio_ccw_notify)
+#define KVM_IOEVENTFD_FLAG_COMMIT_WRITE (1 << kvm_ioevetnfd_flag_nr_commit_write)
#define KVM_IOEVENTFD_VALID_FLAG_MASK ((1 << kvm_ioeventfd_flag_nr_max) - 1)
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 2a3ed401ce46..c98e7b54fafa 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -682,6 +682,8 @@ struct _ioeventfd {
struct kvm_io_device dev;
u8 bus_idx;
bool wildcard;
+ bool commit_write;
+ void *vaddr;
};
static inline struct _ioeventfd *
@@ -753,6 +755,10 @@ ioeventfd_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
if (!ioeventfd_in_range(p, addr, len, val))
return -EOPNOTSUPP;
+ if (p->commit_write) {
+ if (unlikely(copy_to_user(p->vaddr, val, len)))
+ return -EFAULT;
+ }
eventfd_signal(p->eventfd, 1);
return 0;
}
@@ -832,6 +838,9 @@ static int kvm_assign_ioeventfd_idx(struct kvm *kvm,
else
p->wildcard = true;
+ p->commit_write = args->flags & KVM_IOEVENTFD_FLAG_COMMIT_WRITE;
+ p->vaddr = (void *)args->vaddr;
+
mutex_lock(&kvm->slots_lock);
/* Verify that there isn't a match already */
--
2.22.3
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [RFC PATCH] KVM: optionally commit write on ioeventfd write
2022-10-05 21:15 [RFC PATCH] KVM: optionally commit write on ioeventfd write Thanos Makatos
@ 2025-09-05 14:09 ` Sean Christopherson
2025-12-02 9:15 ` Thanos Makatos
0 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2025-09-05 14:09 UTC (permalink / raw)
To: Thanos Makatos
Cc: kvm, john.levon, mst, john.g.johnson, dinechin, cohuck, jasowang,
stefanha, jag.raman, eafanasova, elena.ufimtseva, changpeng.liu,
james.r.harris, benjamin.walker
On Wed, Oct 05, 2022, Thanos Makatos wrote:
Amusingly, I floated this exact idea internally without ever seeing this patch
(we ended up going a different direction). Sadly, I can't claim infringement,
as my suggestion was timestamped from December 2022 :-D
If this is useful for y'all, I don't see a reason not to do it.
> ---
> include/uapi/linux/kvm.h | 5 ++++-
> tools/include/uapi/linux/kvm.h | 2 ++
> virt/kvm/eventfd.c | 9 +++++++++
> 3 files changed, 15 insertions(+), 1 deletion(-)
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index eed0315a77a6..0a884ac1cc76 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -804,6 +804,7 @@ enum {
> kvm_ioeventfd_flag_nr_deassign,
> kvm_ioeventfd_flag_nr_virtio_ccw_notify,
> kvm_ioeventfd_flag_nr_fast_mmio,
> + kvm_ioevetnfd_flag_nr_commit_write,
> kvm_ioeventfd_flag_nr_max,
> };
>
> @@ -812,16 +813,18 @@ enum {
> #define KVM_IOEVENTFD_FLAG_DEASSIGN (1 << kvm_ioeventfd_flag_nr_deassign)
> #define KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY \
> (1 << kvm_ioeventfd_flag_nr_virtio_ccw_notify)
> +#define KVM_IOEVENTFD_FLAG_COMMIT_WRITE (1 << kvm_ioevetnfd_flag_nr_commit_write)
Maybe POST_WRITE to try to capture the effective semantics? As for read after
write hazards, my vote is to document that KVM provides no guarantees on that
front. I can't envision a use case where it makes sense to provide guarantees
in the kernel, since doing so would largely defeat the purpose of handling writes
in the fastpath.
> #define KVM_IOEVENTFD_VALID_FLAG_MASK ((1 << kvm_ioeventfd_flag_nr_max) - 1)
>
> struct kvm_ioeventfd {
> __u64 datamatch;
> __u64 addr; /* legal pio/mmio address */
> + __u64 vaddr; /* user address to write to if COMMIT_WRITE is set */
This needs to be placed at the end, i.e. actually needs to consume the pad[]
bytes. Inserting into the middle changes the layout of the structure and thus
breaks ABI.
And maybe post_addr (or commit_addr)? Because vaddr might be interpreted as the
host virtual address that corresponds to "addr", which may or may not be the case.
> __u32 len; /* 1, 2, 4, or 8 bytes; or 0 to ignore length */
> __s32 fd;
> __u32 flags;
> - __u8 pad[36];
> + __u8 pad[28];
> };
...
> @@ -812,6 +813,7 @@ enum {
> #define KVM_IOEVENTFD_FLAG_DEASSIGN (1 << kvm_ioeventfd_flag_nr_deassign)
> #define KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY \
> (1 << kvm_ioeventfd_flag_nr_virtio_ccw_notify)
> +#define KVM_IOEVENTFD_FLAG_COMMIT_WRITE (1 << kvm_ioevetnfd_flag_nr_commit_write)
>
> #define KVM_IOEVENTFD_VALID_FLAG_MASK ((1 << kvm_ioeventfd_flag_nr_max) - 1)
>
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index 2a3ed401ce46..c98e7b54fafa 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -682,6 +682,8 @@ struct _ioeventfd {
> struct kvm_io_device dev;
> u8 bus_idx;
> bool wildcard;
> + bool commit_write;
> + void *vaddr;
There's no need for a separate bool, just pivot on the validity of the pointer.
The simplest approach is to disallow NULL pointers (which aren't technically
illegal for userspace, but I doubt any use case actually cares). Alternatively,
set the internal pointer to e.g. -EINVAL and then act on !IS_ERR().
The pointer also needs to be tagged __user.
> };
>
> static inline struct _ioeventfd *
> @@ -753,6 +755,10 @@ ioeventfd_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
> if (!ioeventfd_in_range(p, addr, len, val))
> return -EOPNOTSUPP;
>
> + if (p->commit_write) {
> + if (unlikely(copy_to_user(p->vaddr, val, len)))
This needs to check that len > 0. I think it's also worth hoisting the validity
checks into kvm_assign_ioeventfd_idx() so that this can use the slightly more
optimal __copy_to_user().
E.g.
if (args->flags & KVM_IOEVENTFD_FLAG_REDIRECT) {
if (!args->len || !args->post_addr ||
args->redirect != untagged_addr(args->post_addr) ||
!access_ok((void __user *)(unsigned long)args->post_addr, args->len)) {
ret = -EINVAL;
goto fail;
}
p->post_addr = (void __user *)(unsigned long)args->post_addr;
}
And then the usage here can be
if (p->post_addr && __copy_to_user(p->post_addr, val, len))
return -EFAULT;
I assume the spinlock in eventfd_signal() provides ordering even on weakly
ordered architectures, but we should double check that, i.e. that we don't need
an explicitly barrier of some kind.
Lastly, I believe kvm_deassign_ioeventfd_idx() needs to check for a match on
post_addr (or whatever it gets named).
> + return -EFAULT;
> + }
> eventfd_signal(p->eventfd, 1);
> return 0;
> }
> @@ -832,6 +838,9 @@ static int kvm_assign_ioeventfd_idx(struct kvm *kvm,
> else
> p->wildcard = true;
>
> + p->commit_write = args->flags & KVM_IOEVENTFD_FLAG_COMMIT_WRITE;
> + p->vaddr = (void *)args->vaddr;
> +
> mutex_lock(&kvm->slots_lock);
>
> /* Verify that there isn't a match already */
> --
> 2.22.3
>
^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: [RFC PATCH] KVM: optionally commit write on ioeventfd write
2025-09-05 14:09 ` Sean Christopherson
@ 2025-12-02 9:15 ` Thanos Makatos
2025-12-02 22:53 ` Sean Christopherson
0 siblings, 1 reply; 15+ messages in thread
From: Thanos Makatos @ 2025-12-02 9:15 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm@vger.kernel.org, John Levon, mst@redhat.com,
john.g.johnson@oracle.com, dinechin@redhat.com, cohuck@redhat.com,
jasowang@redhat.com, stefanha@redhat.com, jag.raman@oracle.com,
eafanasova@gmail.com, elena.ufimtseva@oracle.com,
changpeng.liu@intel.com, james.r.harris@intel.com,
benjamin.walker@intel.com
> -----Original Message-----
> From: Sean Christopherson <seanjc@google.com>
> Sent: 05 September 2025 15:10
> To: Thanos Makatos <thanos.makatos@nutanix.com>
> Cc: kvm@vger.kernel.org; John Levon <john.levon@nutanix.com>;
> mst@redhat.com; john.g.johnson@oracle.com; dinechin@redhat.com;
> cohuck@redhat.com; jasowang@redhat.com; stefanha@redhat.com;
> jag.raman@oracle.com; eafanasova@gmail.com;
> elena.ufimtseva@oracle.com; changpeng.liu@intel.com;
> james.r.harris@intel.com; benjamin.walker@intel.com
> Subject: Re: [RFC PATCH] KVM: optionally commit write on ioeventfd write
>
> !-------------------------------------------------------------------|
> CAUTION: External Email
>
> |-------------------------------------------------------------------!
>
> On Wed, Oct 05, 2022, Thanos Makatos wrote:
>
> Amusingly, I floated this exact idea internally without ever seeing this patch
> (we ended up going a different direction). Sadly, I can't claim infringement,
> as my suggestion was timestamped from December 2022 :-D
>
> If this is useful for y'all, I don't see a reason not to do it.
>
> > ---
> > include/uapi/linux/kvm.h | 5 ++++-
> > tools/include/uapi/linux/kvm.h | 2 ++
> > virt/kvm/eventfd.c | 9 +++++++++
> > 3 files changed, 15 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index eed0315a77a6..0a884ac1cc76 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -804,6 +804,7 @@ enum {
> > kvm_ioeventfd_flag_nr_deassign,
> > kvm_ioeventfd_flag_nr_virtio_ccw_notify,
> > kvm_ioeventfd_flag_nr_fast_mmio,
> > + kvm_ioevetnfd_flag_nr_commit_write,
> > kvm_ioeventfd_flag_nr_max,
> > };
> >
> > @@ -812,16 +813,18 @@ enum {
> > #define KVM_IOEVENTFD_FLAG_DEASSIGN (1 <<
> kvm_ioeventfd_flag_nr_deassign)
> > #define KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY \
> > (1 << kvm_ioeventfd_flag_nr_virtio_ccw_notify)
> > +#define KVM_IOEVENTFD_FLAG_COMMIT_WRITE (1 <<
> kvm_ioevetnfd_flag_nr_commit_write)
>
> Maybe POST_WRITE to try to capture the effective semantics?
Ack.
> As for read after
> write hazards, my vote is to document that KVM provides no guarantees on
> that
> front. I can't envision a use case where it makes sense to provide guarantees
> in the kernel, since doing so would largely defeat the purpose of handling
> writes
> in the fastpath.
Ack.
>
> > #define KVM_IOEVENTFD_VALID_FLAG_MASK ((1 <<
> kvm_ioeventfd_flag_nr_max) - 1)
> >
> > struct kvm_ioeventfd {
> > __u64 datamatch;
> > __u64 addr; /* legal pio/mmio address */
> > + __u64 vaddr; /* user address to write to if COMMIT_WRITE is set
> */
>
> This needs to be placed at the end, i.e. actually needs to consume the pad[]
> bytes. Inserting into the middle changes the layout of the structure and thus
> breaks ABI.
Ack.
>
> And maybe post_addr (or commit_addr)? Because vaddr might be
> interpreted as the
> host virtual address that corresponds to "addr", which may or may not be the
> case.
Ack.
>
> > __u32 len; /* 1, 2, 4, or 8 bytes; or 0 to ignore length */
> > __s32 fd;
> > __u32 flags;
> > - __u8 pad[36];
> > + __u8 pad[28];
> > };
>
> ...
>
> > @@ -812,6 +813,7 @@ enum {
> > #define KVM_IOEVENTFD_FLAG_DEASSIGN (1 <<
> kvm_ioeventfd_flag_nr_deassign)
> > #define KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY \
> > (1 << kvm_ioeventfd_flag_nr_virtio_ccw_notify)
> > +#define KVM_IOEVENTFD_FLAG_COMMIT_WRITE (1 <<
> kvm_ioevetnfd_flag_nr_commit_write)
> >
> > #define KVM_IOEVENTFD_VALID_FLAG_MASK ((1 <<
> kvm_ioeventfd_flag_nr_max) - 1)
> >
> > diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> > index 2a3ed401ce46..c98e7b54fafa 100644
> > --- a/virt/kvm/eventfd.c
> > +++ b/virt/kvm/eventfd.c
> > @@ -682,6 +682,8 @@ struct _ioeventfd {
> > struct kvm_io_device dev;
> > u8 bus_idx;
> > bool wildcard;
> > + bool commit_write;
> > + void *vaddr;
>
> There's no need for a separate bool, just pivot on the validity of the pointer.
> The simplest approach is to disallow NULL pointers (which aren't technically
> illegal for userspace, but I doubt any use case actually cares).
Ack.
> Alternatively, set the internal pointer to e.g. -EINVAL and then act on !IS_ERR().
>
> The pointer also needs to be tagged __user.
Ack.
>
> > };
> >
> > static inline struct _ioeventfd *
> > @@ -753,6 +755,10 @@ ioeventfd_write(struct kvm_vcpu *vcpu, struct
> kvm_io_device *this, gpa_t addr,
> > if (!ioeventfd_in_range(p, addr, len, val))
> > return -EOPNOTSUPP;
> >
> > + if (p->commit_write) {
> > + if (unlikely(copy_to_user(p->vaddr, val, len)))
>
> This needs to check that len > 0.
Ack.
> I think it's also worth hoisting the validity
> checks into kvm_assign_ioeventfd_idx() so that this can use the slightly more
> optimal __copy_to_user().
>
> E.g.
>
> if (args->flags & KVM_IOEVENTFD_FLAG_REDIRECT) {
> if (!args->len || !args->post_addr ||
> != untagged_addr(args->post_addr) ||
> !access_ok((void __user *)(unsigned long)args->post_addr, args->len)) {
> ret = -EINVAL;
> goto fail;
> }
>
> p->post_addr = (void __user *)(unsigned long)args-
> >post_addr;
> }
>
> And then the usage here can be
>
> if (p->post_addr && __copy_to_user(p->post_addr, val, len))
> return -EFAULT;
>
Did you mean to write __copy_to_user(p->redirect, val, len) here?
> I assume the spinlock in eventfd_signal() provides ordering even on weakly
> ordered architectures, but we should double check that, i.e. that we don't
> need
> an explicitly barrier of some kind.
Are you talking about the possibility of whoever polls the eventfd not observing the value being written?
>
> Lastly, I believe kvm_deassign_ioeventfd_idx() needs to check for a match on
> post_addr (or whatever it gets named).
Ack.
>
> > + return -EFAULT;
> > + }
> > eventfd_signal(p->eventfd, 1);
> > return 0;
> > }
> > @@ -832,6 +838,9 @@ static int kvm_assign_ioeventfd_idx(struct kvm
> *kvm,
> > else
> > p->wildcard = true;
> >
> > + p->commit_write = args->flags &
> KVM_IOEVENTFD_FLAG_COMMIT_WRITE;
> > + p->vaddr = (void *)args->vaddr;
> > +
> > mutex_lock(&kvm->slots_lock);
> >
> > /* Verify that there isn't a match already */
> > --
> > 2.22.3
> >
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC PATCH] KVM: optionally commit write on ioeventfd write
2025-12-02 9:15 ` Thanos Makatos
@ 2025-12-02 22:53 ` Sean Christopherson
2025-12-03 14:25 ` Thanos Makatos
0 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2025-12-02 22:53 UTC (permalink / raw)
To: Thanos Makatos
Cc: kvm@vger.kernel.org, John Levon, mst@redhat.com,
john.g.johnson@oracle.com, dinechin@redhat.com, cohuck@redhat.com,
jasowang@redhat.com, stefanha@redhat.com, jag.raman@oracle.com,
eafanasova@gmail.com, elena.ufimtseva@oracle.com,
changpeng.liu@intel.com, james.r.harris@intel.com,
benjamin.walker@intel.com
On Tue, Dec 02, 2025, Thanos Makatos wrote:
> > -----Original Message-----
> > I think it's also worth hoisting the validity
> > checks into kvm_assign_ioeventfd_idx() so that this can use the slightly more
> > optimal __copy_to_user().
> >
> > E.g.
> >
> > if (args->flags & KVM_IOEVENTFD_FLAG_REDIRECT) {
> > if (!args->len || !args->post_addr ||
> > != untagged_addr(args->post_addr) ||
> > !access_ok((void __user *)(unsigned long)args->post_addr, args->len)) {
> > ret = -EINVAL;
> > goto fail;
> > }
> >
> > p->post_addr = (void __user *)(unsigned long)args-
> > >post_addr;
> > }
> >
> > And then the usage here can be
> >
> > if (p->post_addr && __copy_to_user(p->post_addr, val, len))
> > return -EFAULT;
> >
>
> Did you mean to write __copy_to_user(p->redirect, val, len) here?
I don't think so? Ah, it's KVM_IOEVENTFD_FLAG_REDIRECT that's stale. That
should have been something like KVM_IOEVENTFD_FLAG_POST_WRITES.
> > I assume the spinlock in eventfd_signal() provides ordering even on weakly
> > ordered architectures, but we should double check that, i.e. that we don't
> > need an explicitly barrier of some kind.
>
> Are you talking about the possibility of whoever polls the eventfd not
> observing the value being written?
Ya, KVM needs to ensure the write is visible before the wakeup occurs.
Side topic, Paolo had an off-the-cuff idea of adding uAPI to support notifications
on memslot ranges, as opposed to posting writes via ioeventfd. E.g. add a memslot
flag, or maybe a memory attribute, that causes KVM to write-protect a region,
emulate in response to writes, and then notify an eventfd after emulating the
write. It'd be a lot like KVM_MEM_READONLY, except that KVM would commit the
write to memory and notify, as opposed to exiting to userspace.
^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: [RFC PATCH] KVM: optionally commit write on ioeventfd write
2025-12-02 22:53 ` Sean Christopherson
@ 2025-12-03 14:25 ` Thanos Makatos
2025-12-15 22:42 ` Thanos Makatos
2025-12-19 1:20 ` Sean Christopherson
0 siblings, 2 replies; 15+ messages in thread
From: Thanos Makatos @ 2025-12-03 14:25 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm@vger.kernel.org, John Levon, mst@redhat.com,
dinechin@redhat.com, cohuck@redhat.com, jasowang@redhat.com,
stefanha@redhat.com, jag.raman@oracle.com, eafanasova@gmail.com,
elena.ufimtseva@oracle.com
> -----Original Message-----
> From: Sean Christopherson <seanjc@google.com>
> Sent: 02 December 2025 22:54
> To: Thanos Makatos <thanos.makatos@nutanix.com>
> Cc: kvm@vger.kernel.org; John Levon <john.levon@nutanix.com>;
> mst@redhat.com; john.g.johnson@oracle.com; dinechin@redhat.com;
> cohuck@redhat.com; jasowang@redhat.com; stefanha@redhat.com;
> jag.raman@oracle.com; eafanasova@gmail.com;
> elena.ufimtseva@oracle.com; changpeng.liu@intel.com;
> james.r.harris@intel.com; benjamin.walker@intel.com
> Subject: Re: [RFC PATCH] KVM: optionally commit write on ioeventfd write
>
> !-------------------------------------------------------------------|
> CAUTION: External Email
>
> |-------------------------------------------------------------------!
>
> On Tue, Dec 02, 2025, Thanos Makatos wrote:
> > > -----Original Message-----
> > > I think it's also worth hoisting the validity
> > > checks into kvm_assign_ioeventfd_idx() so that this can use the slightly
> more
> > > optimal __copy_to_user().
> > >
> > > E.g.
> > >
> > > if (args->flags & KVM_IOEVENTFD_FLAG_REDIRECT) {
> > > if (!args->len || !args->post_addr ||
> > > != untagged_addr(args->post_addr) ||
> > > !access_ok((void __user *)(unsigned long)args->post_addr,
> args->len)) {
> > > ret = -EINVAL;
> > > goto fail;
> > > }
> > >
> > > p->post_addr = (void __user *)(unsigned long)args-
> > > >post_addr;
> > > }
> > >
> > > And then the usage here can be
> > >
> > > if (p->post_addr && __copy_to_user(p->post_addr, val, len))
> > > return -EFAULT;
> > >
> >
> > Did you mean to write __copy_to_user(p->redirect, val, len) here?
>
> I don't think so? Ah, it's KVM_IOEVENTFD_FLAG_REDIRECT that's stale. That
> should have been something like KVM_IOEVENTFD_FLAG_POST_WRITES.
My quoting somehow removed a line, here's what you wrote initially:
if (args->flags & KVM_IOEVENTFD_FLAG_REDIRECT) {
if (!args->len || !args->post_addr ||
args->redirect != untagged_addr(args->post_addr) ||
!access_ok((void __user *)(unsigned long)args->post_addr, args->len)) {
ret = -EINVAL;
goto fail;
}
p->post_addr = (void __user *)(unsigned long)args->post_addr;
}
It's the "args->redirect != untagged_addr(args->post_addr)" part that's confused me,
should this be the address we copy the value to?
>
> > > I assume the spinlock in eventfd_signal() provides ordering even on
> weakly
> > > ordered architectures, but we should double check that, i.e. that we don't
> > > need an explicitly barrier of some kind.
> >
> > Are you talking about the possibility of whoever polls the eventfd not
> > observing the value being written?
>
> Ya, KVM needs to ensure the write is visible before the wakeup occurs.
According to https://docs.kernel.org/dev-tools/lkmm/docs/locking.html the spinlock is enough:
"Locking is well-known and the common use cases are straightforward: Any
CPU holding a given lock sees any changes previously seen or made by any
CPU before it previously released that same lock."
>
> Side topic, Paolo had an off-the-cuff idea of adding uAPI to support
> notifications
> on memslot ranges, as opposed to posting writes via ioeventfd. E.g. add a
> memslot
> flag, or maybe a memory attribute, that causes KVM to write-protect a region,
> emulate in response to writes, and then notify an eventfd after emulating the
> write. It'd be a lot like KVM_MEM_READONLY, except that KVM would commit
> the
> write to memory and notify, as opposed to exiting to userspace.
Are you thinking for reusing/adapting the mechanism in this patch for that?
^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: [RFC PATCH] KVM: optionally commit write on ioeventfd write
2025-12-03 14:25 ` Thanos Makatos
@ 2025-12-15 22:42 ` Thanos Makatos
2025-12-19 1:20 ` Sean Christopherson
1 sibling, 0 replies; 15+ messages in thread
From: Thanos Makatos @ 2025-12-15 22:42 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm@vger.kernel.org, John Levon, mst@redhat.com,
dinechin@redhat.com, cohuck@redhat.com, jasowang@redhat.com,
stefanha@redhat.com, jag.raman@oracle.com, eafanasova@gmail.com,
elena.ufimtseva@oracle.com
> -----Original Message-----
> From: Thanos Makatos
> Sent: 03 December 2025 14:26
> To: Sean Christopherson <seanjc@google.com>
> Cc: kvm@vger.kernel.org; John Levon <john.levon@nutanix.com>;
> mst@redhat.com; dinechin@redhat.com; cohuck@redhat.com;
> jasowang@redhat.com; stefanha@redhat.com; jag.raman@oracle.com;
> eafanasova@gmail.com; elena.ufimtseva@oracle.com
> Subject: RE: [RFC PATCH] KVM: optionally commit write on ioeventfd write
>
> > -----Original Message-----
> > From: Sean Christopherson <seanjc@google.com>
> > Sent: 02 December 2025 22:54
> > To: Thanos Makatos <thanos.makatos@nutanix.com>
> > Cc: kvm@vger.kernel.org; John Levon <john.levon@nutanix.com>;
> > mst@redhat.com; john.g.johnson@oracle.com; dinechin@redhat.com;
> > cohuck@redhat.com; jasowang@redhat.com; stefanha@redhat.com;
> > jag.raman@oracle.com; eafanasova@gmail.com;
> > elena.ufimtseva@oracle.com; changpeng.liu@intel.com;
> > james.r.harris@intel.com; benjamin.walker@intel.com
> > Subject: Re: [RFC PATCH] KVM: optionally commit write on ioeventfd write
> >
> > !-------------------------------------------------------------------|
> > CAUTION: External Email
> >
> > |-------------------------------------------------------------------!
> >
> > On Tue, Dec 02, 2025, Thanos Makatos wrote:
> > > > -----Original Message-----
> > > > I think it's also worth hoisting the validity
> > > > checks into kvm_assign_ioeventfd_idx() so that this can use the slightly
> > more
> > > > optimal __copy_to_user().
> > > >
> > > > E.g.
> > > >
> > > > if (args->flags & KVM_IOEVENTFD_FLAG_REDIRECT) {
> > > > if (!args->len || !args->post_addr ||
> > > > != untagged_addr(args->post_addr) ||
> > > > !access_ok((void __user *)(unsigned long)args->post_addr,
> > args->len)) {
> > > > ret = -EINVAL;
> > > > goto fail;
> > > > }
> > > >
> > > > p->post_addr = (void __user *)(unsigned long)args-
> > > > >post_addr;
> > > > }
> > > >
> > > > And then the usage here can be
> > > >
> > > > if (p->post_addr && __copy_to_user(p->post_addr, val, len))
> > > > return -EFAULT;
> > > >
> > >
> > > Did you mean to write __copy_to_user(p->redirect, val, len) here?
> >
> > I don't think so? Ah, it's KVM_IOEVENTFD_FLAG_REDIRECT that's stale.
> That
> > should have been something like KVM_IOEVENTFD_FLAG_POST_WRITES.
>
> My quoting somehow removed a line, here's what you wrote initially:
>
> if (args->flags & KVM_IOEVENTFD_FLAG_REDIRECT) {
> if (!args->len || !args->post_addr ||
> args->redirect != untagged_addr(args->post_addr) ||
> !access_ok((void __user *)(unsigned long)args->post_addr,
> args->len)) {
> ret = -EINVAL;
> goto fail;
> }
>
> p->post_addr = (void __user *)(unsigned long)args-
> >post_addr;
> }
>
> It's the "args->redirect != untagged_addr(args->post_addr)" part that's
> confused me,
> should this be the address we copy the value to?
>
> >
> > > > I assume the spinlock in eventfd_signal() provides ordering even on
> > weakly
> > > > ordered architectures, but we should double check that, i.e. that we
> don't
> > > > need an explicitly barrier of some kind.
> > >
> > > Are you talking about the possibility of whoever polls the eventfd not
> > > observing the value being written?
> >
> > Ya, KVM needs to ensure the write is visible before the wakeup occurs.
>
> According to https://docs.kernel.org/dev-tools/lkmm/docs/locking.html the
> spinlock is enough:
> "Locking is well-known and the common use cases are straightforward: Any
> CPU holding a given lock sees any changes previously seen or made by any
> CPU before it previously released that same lock."
>
> >
> > Side topic, Paolo had an off-the-cuff idea of adding uAPI to support
> > notifications
> > on memslot ranges, as opposed to posting writes via ioeventfd. E.g. add a
> > memslot
> > flag, or maybe a memory attribute, that causes KVM to write-protect a
> region,
> > emulate in response to writes, and then notify an eventfd after emulating
> the
> > write. It'd be a lot like KVM_MEM_READONLY, except that KVM would
> commit
> > the
> > write to memory and notify, as opposed to exiting to userspace.
>
> Are you thinking for reusing/adapting the mechanism in this patch for that?
ping
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC PATCH] KVM: optionally commit write on ioeventfd write
2025-12-03 14:25 ` Thanos Makatos
2025-12-15 22:42 ` Thanos Makatos
@ 2025-12-19 1:20 ` Sean Christopherson
2025-12-19 10:36 ` John Levon
2026-01-13 19:43 ` Thanos Makatos
1 sibling, 2 replies; 15+ messages in thread
From: Sean Christopherson @ 2025-12-19 1:20 UTC (permalink / raw)
To: Thanos Makatos
Cc: kvm@vger.kernel.org, John Levon, mst@redhat.com,
dinechin@redhat.com, cohuck@redhat.com, jasowang@redhat.com,
stefanha@redhat.com, jag.raman@oracle.com, eafanasova@gmail.com,
elena.ufimtseva@oracle.com, Paolo Bonzini
+Paolo (just realized Paolo isn't on the Cc)
On Wed, Dec 03, 2025, Thanos Makatos wrote:
> > From: Sean Christopherson <seanjc@google.com>
> > Side topic, Paolo had an off-the-cuff idea of adding uAPI to support
> > notifications on memslot ranges, as opposed to posting writes via
> > ioeventfd. E.g. add a memslot flag, or maybe a memory attribute, that
> > causes KVM to write-protect a region, emulate in response to writes, and
> > then notify an eventfd after emulating the write. It'd be a lot like
> > KVM_MEM_READONLY, except that KVM would commit the write to memory and
> > notify, as opposed to exiting to userspace.
>
> Are you thinking for reusing/adapting the mechanism in this patch for that?
Paolo's idea was to forego this patch entirely and instead add a more generic
write-notify mechanism. In practice, the only real difference is that the writes
would be fully in-place instead of a redirection, which in turn would allow the
guest to read without triggering a VM-Exit, and I suppose might save userspace
from some dirty logging operations.
While I really like the mechanics of the idea, after sketching out the basic
gist (see below), I'm not convinced the additional complexity is worth the gains.
Unless reading from NVMe submission queues is a common operation, it doesn't seem
like eliding VM-Exits on reads buys much.
Every arch would need to be updated to handle the new way of handling emulated
writes, with varying degrees of complexity. E.g. on x86 I think it would just be
teaching the MMU about the new "emulate on write" behavior, but for arm64 (and
presumably any other architecture without a generic emulator), it would be that
plus new code to actually commit the write to guest memory.
The other scary aspect is correctly handling "writable from KVM" and "can't be
mapped writable". Getting that correct in all places is non-trivial, and seems
like it could be a pain to maintain, which potentially fatal failure modes, e.g.
if KVM writes guest memory but fails to notify, tracking down the bug would be
"fun".
So my vote is to add POST_WRITE functionality to I/O eventfd, and hold off on a
generic write-notify mechanism until there's a (really) strong use case.
Paolo, thoughts?
---
arch/x86/kvm/mmu/mmu.c | 6 +++--
include/linux/kvm_host.h | 2 ++
include/uapi/linux/kvm.h | 3 ++-
virt/kvm/kvm_main.c | 51 +++++++++++++++++++++++++++++++++-------
4 files changed, 51 insertions(+), 11 deletions(-)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 02c450686b4a..acad277ba2a1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3493,7 +3493,8 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
* into the spte otherwise read access on readonly gfn also can
* caused mmio page fault and treat it as mmio access.
*/
- if (fault->pfn == KVM_PFN_ERR_RO_FAULT)
+ if (fault->pfn == KVM_PFN_ERR_RO_FAULT ||
+ fault->pfn == KVM_PFN_ERR_WRITE_NOTIFY)
return RET_PF_EMULATE;
if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
@@ -4582,7 +4583,8 @@ static int kvm_mmu_faultin_pfn_gmem(struct kvm_vcpu *vcpu,
return r;
}
- fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
+ fault->map_writable = !(fault->slot->flags &
+ (KVM_MEM_READONLY | KVM_MEM_WRITE_NOTIFY));
fault->max_level = kvm_max_level_for_order(max_order);
return RET_PF_CONTINUE;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d93f75b05ae2..e75dc5c2a279 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -99,6 +99,7 @@
#define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
#define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3)
#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4)
+#define KVM_PFN_ERR_WRITE_NOTIFY (KVM_PFN_ERR_MASK + 5)
/*
* error pfns indicate that the gfn is in slot but faild to
@@ -615,6 +616,7 @@ struct kvm_memory_slot {
pgoff_t pgoff;
} gmem;
#endif
+ struct eventfd_ctx *eventfd;
};
static inline bool kvm_slot_has_gmem(const struct kvm_memory_slot *slot)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index dddb781b0507..c3d084a09d6c 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -39,7 +39,7 @@ struct kvm_userspace_memory_region2 {
__u64 userspace_addr;
__u64 guest_memfd_offset;
__u32 guest_memfd;
- __u32 pad1;
+ __u32 eventfd;
__u64 pad2[14];
};
@@ -51,6 +51,7 @@ struct kvm_userspace_memory_region2 {
#define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
#define KVM_MEM_READONLY (1UL << 1)
#define KVM_MEM_GUEST_MEMFD (1UL << 2)
+#define KVM_MEM_WRITE_NOTIFY (1UL << 3)
/* for KVM_IRQ_LINE */
struct kvm_irq_level {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f1f6a71b2b5f..e58d43bae757 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -953,6 +953,8 @@ static void kvm_destroy_dirty_bitmap(struct kvm_memory_slot *memslot)
/* This does not remove the slot from struct kvm_memslots data structures */
static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
{
+ if (slot->flags & KVM_MEM_WRITE_NOTIFY)
+ eventfd_ctx_put(slot->eventfd);
if (slot->flags & KVM_MEM_GUEST_MEMFD)
kvm_gmem_unbind(slot);
@@ -1607,11 +1609,15 @@ static int check_memory_region_flags(struct kvm *kvm,
/*
* GUEST_MEMFD is incompatible with read-only memslots, as writes to
* read-only memslots have emulated MMIO, not page fault, semantics,
- * and KVM doesn't allow emulated MMIO for private memory.
+ * and KVM doesn't allow emulated MMIO for private memory. Ditto for
+ * write-notify memslots (emulated exitless MMIO).
*/
- if (kvm_arch_has_readonly_mem(kvm) &&
- !(mem->flags & KVM_MEM_GUEST_MEMFD))
- valid_flags |= KVM_MEM_READONLY;
+ if (!mem->flags & KVM_MEM_GUEST_MEMFD) {
+ if (kvm_arch_has_readonly_mem(kvm))
+ valid_flags |= KVM_MEM_READONLY;
+ if (kvm_arch_has_write_notify_mem(kvm))
+ valid_flags |= KVM_MEM_WRITE_NOTIFY;
+ }
if (mem->flags & ~valid_flags)
return -EINVAL;
@@ -2100,7 +2106,9 @@ static int kvm_set_memory_region(struct kvm *kvm,
return -EINVAL;
if ((mem->userspace_addr != old->userspace_addr) ||
(npages != old->npages) ||
- ((mem->flags ^ old->flags) & (KVM_MEM_READONLY | KVM_MEM_GUEST_MEMFD)))
+ ((mem->flags ^ old->flags) & (KVM_MEM_READONLY |
+ KVM_MEM_GUEST_MEMFD |
+ KVM_MEM_WRITE_NOTIFY)))
return -EINVAL;
if (base_gfn != old->base_gfn)
@@ -2131,13 +2139,29 @@ static int kvm_set_memory_region(struct kvm *kvm,
if (r)
goto out;
}
+ if (mem->flags & KVM_MEM_WRITE_NOTIFY) {
+ CLASS(fd, f)(mem->eventfd);
+ if (fd_empty(f)) {
+ r = -EBADF;
+ goto out_unbind;
+ }
+
+ new->eventfd = eventfd_ctx_fileget(fd_file(f));
+ if (IS_ERR(new->eventfd)) {
+ r = PTR_ERR(new->eventfd);
+ goto out_unbind;
+ }
+ }
r = kvm_set_memslot(kvm, old, new, change);
if (r)
- goto out_unbind;
+ goto out_eventfd;
return 0;
+out_eventfd:
+ if (mem->flags & KVM_MEM_WRITE_NOTIFY)
+ eventfd_ctx_put(new->eventfd);
out_unbind:
if (mem->flags & KVM_MEM_GUEST_MEMFD)
kvm_gmem_unbind(new);
@@ -2727,6 +2751,11 @@ unsigned long kvm_host_page_size(struct kvm_vcpu *vcpu, gfn_t gfn)
return size;
}
+static bool memslot_is_write_notify(const struct kvm_memory_slot *slot)
+{
+ return slot->flags & KVM_MEM_WRITE_NOTIFY;
+}
+
static bool memslot_is_readonly(const struct kvm_memory_slot *slot)
{
return slot->flags & KVM_MEM_READONLY;
@@ -2786,7 +2815,7 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot,
unsigned long hva = __gfn_to_hva_many(slot, gfn, NULL, false);
if (!kvm_is_error_hva(hva) && writable)
- *writable = !memslot_is_readonly(slot);
+ *writable = !memslot_is_readonly(slot)
return hva;
}
@@ -3060,7 +3089,11 @@ static kvm_pfn_t kvm_follow_pfn(struct kvm_follow_pfn *kfp)
if (kvm_is_error_hva(kfp->hva))
return KVM_PFN_NOSLOT;
- if (memslot_is_readonly(kfp->slot) && kfp->map_writable) {
+ if ((kfp->flags & FOLL_WRITE) && memslot_is_write_notify(kfp->slot))
+ return KVM_PFN_ERR_WRITE_NOTIFY;
+
+ if (kfp->map_writable &&
+ (memslot_is_readonly(kfp->slot) || memslot_is_write_notify(kfp->slot)) {
*kfp->map_writable = false;
kfp->map_writable = NULL;
}
@@ -3324,6 +3357,8 @@ static int __kvm_write_guest_page(struct kvm *kvm,
r = __copy_to_user((void __user *)addr + offset, data, len);
if (r)
return -EFAULT;
+ if (memslot_is_write_notify(memslot))
+ eventfd_signal(memslot->eventfd);
mark_page_dirty_in_slot(kvm, memslot, gfn);
return 0;
}
base-commit: 58e10b63777d0aebee2cf4e6c67e1a83e7edbe0f
--
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [RFC PATCH] KVM: optionally commit write on ioeventfd write
2025-12-19 1:20 ` Sean Christopherson
@ 2025-12-19 10:36 ` John Levon
2026-01-13 19:43 ` Thanos Makatos
1 sibling, 0 replies; 15+ messages in thread
From: John Levon @ 2025-12-19 10:36 UTC (permalink / raw)
To: Sean Christopherson
Cc: Thanos Makatos, kvm@vger.kernel.org, mst@redhat.com,
dinechin@redhat.com, cohuck@redhat.com, jasowang@redhat.com,
stefanha@redhat.com, jag.raman@oracle.com, eafanasova@gmail.com,
elena.ufimtseva@oracle.com, Paolo Bonzini
On Thu, Dec 18, 2025 at 05:20:44PM -0800, Sean Christopherson wrote:
> > Are you thinking for reusing/adapting the mechanism in this patch for that?
>
> While I really like the mechanics of the idea, after sketching out the basic
> gist (see below), I'm not convinced the additional complexity is worth the gains.
> Unless reading from NVMe submission queues is a common operation
NVMe over PCIe, 3.1.2.1:
"...The host should not read the doorbell registers. If a doorbell register is
read, the value returned is vendor specific."
So, no, not common for NVMe at least.
regards
john
^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: [RFC PATCH] KVM: optionally commit write on ioeventfd write
2025-12-19 1:20 ` Sean Christopherson
2025-12-19 10:36 ` John Levon
@ 2026-01-13 19:43 ` Thanos Makatos
2026-01-13 20:00 ` Sean Christopherson
1 sibling, 1 reply; 15+ messages in thread
From: Thanos Makatos @ 2026-01-13 19:43 UTC (permalink / raw)
To: Sean Christopherson
Cc: kvm@vger.kernel.org, John Levon, mst@redhat.com,
dinechin@redhat.com, cohuck@redhat.com, jasowang@redhat.com,
stefanha@redhat.com, jag.raman@oracle.com, eafanasova@gmail.com,
elena.ufimtseva@oracle.com, Paolo Bonzini
> -----Original Message-----
> From: Sean Christopherson <seanjc@google.com>
> Sent: 19 December 2025 01:21
> To: Thanos Makatos <thanos.makatos@nutanix.com>
> Cc: kvm@vger.kernel.org; John Levon <john.levon@nutanix.com>;
> mst@redhat.com; dinechin@redhat.com; cohuck@redhat.com;
> jasowang@redhat.com; stefanha@redhat.com; jag.raman@oracle.com;
> eafanasova@gmail.com; elena.ufimtseva@oracle.com; Paolo Bonzini
> <pbonzini@redhat.com>
> Subject: Re: [RFC PATCH] KVM: optionally commit write on ioeventfd write
>
> !-------------------------------------------------------------------|
> CAUTION: External Email
>
> |-------------------------------------------------------------------!
>
> +Paolo (just realized Paolo isn't on the Cc)
>
> On Wed, Dec 03, 2025, Thanos Makatos wrote:
> > > From: Sean Christopherson <seanjc@google.com>
> > > Side topic, Paolo had an off-the-cuff idea of adding uAPI to support
> > > notifications on memslot ranges, as opposed to posting writes via
> > > ioeventfd. E.g. add a memslot flag, or maybe a memory attribute, that
> > > causes KVM to write-protect a region, emulate in response to writes, and
> > > then notify an eventfd after emulating the write. It'd be a lot like
> > > KVM_MEM_READONLY, except that KVM would commit the write to
> memory and
> > > notify, as opposed to exiting to userspace.
> >
> > Are you thinking for reusing/adapting the mechanism in this patch for that?
>
> Paolo's idea was to forego this patch entirely and instead add a more generic
> write-notify mechanism. In practice, the only real difference is that the writes
> would be fully in-place instead of a redirection, which in turn would allow the
> guest to read without triggering a VM-Exit, and I suppose might save
> userspace
> from some dirty logging operations.
>
> While I really like the mechanics of the idea, after sketching out the basic
> gist (see below), I'm not convinced the additional complexity is worth the
> gains.
> Unless reading from NVMe submission queues is a common operation, it
> doesn't seem
> like eliding VM-Exits on reads buys much.
>
> Every arch would need to be updated to handle the new way of handling
> emulated
> writes, with varying degrees of complexity. E.g. on x86 I think it would just be
> teaching the MMU about the new "emulate on write" behavior, but for arm64
> (and
> presumably any other architecture without a generic emulator), it would be
> that
> plus new code to actually commit the write to guest memory.
>
> The other scary aspect is correctly handling "writable from KVM" and "can't
> be
> mapped writable". Getting that correct in all places is non-trivial, and seems
> like it could be a pain to maintain, which potentially fatal failure modes, e.g.
> if KVM writes guest memory but fails to notify, tracking down the bug would
> be
> "fun".
>
> So my vote is to add POST_WRITE functionality to I/O eventfd, and hold off on
> a
> generic write-notify mechanism until there's a (really) strong use case.
>
> Paolo, thoughts?
In the absence of a response, shall we go ahead with POST_WRITE? I have the revised patch ready.
>
>
> ---
> arch/x86/kvm/mmu/mmu.c | 6 +++--
> include/linux/kvm_host.h | 2 ++
> include/uapi/linux/kvm.h | 3 ++-
> virt/kvm/kvm_main.c | 51 +++++++++++++++++++++++++++++++++-------
> 4 files changed, 51 insertions(+), 11 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 02c450686b4a..acad277ba2a1 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3493,7 +3493,8 @@ static int kvm_handle_error_pfn(struct kvm_vcpu
> *vcpu, struct kvm_page_fault *fa
> * into the spte otherwise read access on readonly gfn also can
> * caused mmio page fault and treat it as mmio access.
> */
> - if (fault->pfn == KVM_PFN_ERR_RO_FAULT)
> + if (fault->pfn == KVM_PFN_ERR_RO_FAULT ||
> + fault->pfn == KVM_PFN_ERR_WRITE_NOTIFY)
> return RET_PF_EMULATE;
>
> if (fault->pfn == KVM_PFN_ERR_HWPOISON) {
> @@ -4582,7 +4583,8 @@ static int kvm_mmu_faultin_pfn_gmem(struct
> kvm_vcpu *vcpu,
> return r;
> }
>
> - fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
> + fault->map_writable = !(fault->slot->flags &
> + (KVM_MEM_READONLY |
> KVM_MEM_WRITE_NOTIFY));
> fault->max_level = kvm_max_level_for_order(max_order);
>
> return RET_PF_CONTINUE;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index d93f75b05ae2..e75dc5c2a279 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -99,6 +99,7 @@
> #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
> #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3)
> #define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4)
> +#define KVM_PFN_ERR_WRITE_NOTIFY (KVM_PFN_ERR_MASK + 5)
>
> /*
> * error pfns indicate that the gfn is in slot but faild to
> @@ -615,6 +616,7 @@ struct kvm_memory_slot {
> pgoff_t pgoff;
> } gmem;
> #endif
> + struct eventfd_ctx *eventfd;
> };
>
> static inline bool kvm_slot_has_gmem(const struct kvm_memory_slot *slot)
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index dddb781b0507..c3d084a09d6c 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -39,7 +39,7 @@ struct kvm_userspace_memory_region2 {
> __u64 userspace_addr;
> __u64 guest_memfd_offset;
> __u32 guest_memfd;
> - __u32 pad1;
> + __u32 eventfd;
> __u64 pad2[14];
> };
>
> @@ -51,6 +51,7 @@ struct kvm_userspace_memory_region2 {
> #define KVM_MEM_LOG_DIRTY_PAGES (1UL << 0)
> #define KVM_MEM_READONLY (1UL << 1)
> #define KVM_MEM_GUEST_MEMFD (1UL << 2)
> +#define KVM_MEM_WRITE_NOTIFY (1UL << 3)
>
> /* for KVM_IRQ_LINE */
> struct kvm_irq_level {
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index f1f6a71b2b5f..e58d43bae757 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -953,6 +953,8 @@ static void kvm_destroy_dirty_bitmap(struct
> kvm_memory_slot *memslot)
> /* This does not remove the slot from struct kvm_memslots data structures
> */
> static void kvm_free_memslot(struct kvm *kvm, struct kvm_memory_slot
> *slot)
> {
> + if (slot->flags & KVM_MEM_WRITE_NOTIFY)
> + eventfd_ctx_put(slot->eventfd);
> if (slot->flags & KVM_MEM_GUEST_MEMFD)
> kvm_gmem_unbind(slot);
>
> @@ -1607,11 +1609,15 @@ static int check_memory_region_flags(struct kvm
> *kvm,
> /*
> * GUEST_MEMFD is incompatible with read-only memslots, as writes
> to
> * read-only memslots have emulated MMIO, not page fault,
> semantics,
> - * and KVM doesn't allow emulated MMIO for private memory.
> + * and KVM doesn't allow emulated MMIO for private memory. Ditto
> for
> + * write-notify memslots (emulated exitless MMIO).
> */
> - if (kvm_arch_has_readonly_mem(kvm) &&
> - !(mem->flags & KVM_MEM_GUEST_MEMFD))
> - valid_flags |= KVM_MEM_READONLY;
> + if (!mem->flags & KVM_MEM_GUEST_MEMFD) {
> + if (kvm_arch_has_readonly_mem(kvm))
> + valid_flags |= KVM_MEM_READONLY;
> + if (kvm_arch_has_write_notify_mem(kvm))
> + valid_flags |= KVM_MEM_WRITE_NOTIFY;
> + }
>
> if (mem->flags & ~valid_flags)
> return -EINVAL;
> @@ -2100,7 +2106,9 @@ static int kvm_set_memory_region(struct kvm
> *kvm,
> return -EINVAL;
> if ((mem->userspace_addr != old->userspace_addr) ||
> (npages != old->npages) ||
> - ((mem->flags ^ old->flags) & (KVM_MEM_READONLY |
> KVM_MEM_GUEST_MEMFD)))
> + ((mem->flags ^ old->flags) & (KVM_MEM_READONLY |
> + KVM_MEM_GUEST_MEMFD
> |
> +
> KVM_MEM_WRITE_NOTIFY)))
> return -EINVAL;
>
> if (base_gfn != old->base_gfn)
> @@ -2131,13 +2139,29 @@ static int kvm_set_memory_region(struct kvm
> *kvm,
> if (r)
> goto out;
> }
> + if (mem->flags & KVM_MEM_WRITE_NOTIFY) {
> + CLASS(fd, f)(mem->eventfd);
> + if (fd_empty(f)) {
> + r = -EBADF;
> + goto out_unbind;
> + }
> +
> + new->eventfd = eventfd_ctx_fileget(fd_file(f));
> + if (IS_ERR(new->eventfd)) {
> + r = PTR_ERR(new->eventfd);
> + goto out_unbind;
> + }
> + }
>
> r = kvm_set_memslot(kvm, old, new, change);
> if (r)
> - goto out_unbind;
> + goto out_eventfd;
>
> return 0;
>
> +out_eventfd:
> + if (mem->flags & KVM_MEM_WRITE_NOTIFY)
> + eventfd_ctx_put(new->eventfd);
> out_unbind:
> if (mem->flags & KVM_MEM_GUEST_MEMFD)
> kvm_gmem_unbind(new);
> @@ -2727,6 +2751,11 @@ unsigned long kvm_host_page_size(struct
> kvm_vcpu *vcpu, gfn_t gfn)
> return size;
> }
>
> +static bool memslot_is_write_notify(const struct kvm_memory_slot *slot)
> +{
> + return slot->flags & KVM_MEM_WRITE_NOTIFY;
> +}
> +
> static bool memslot_is_readonly(const struct kvm_memory_slot *slot)
> {
> return slot->flags & KVM_MEM_READONLY;
> @@ -2786,7 +2815,7 @@ unsigned long gfn_to_hva_memslot_prot(struct
> kvm_memory_slot *slot,
> unsigned long hva = __gfn_to_hva_many(slot, gfn, NULL, false);
>
> if (!kvm_is_error_hva(hva) && writable)
> - *writable = !memslot_is_readonly(slot);
> + *writable = !memslot_is_readonly(slot)
>
> return hva;
> }
> @@ -3060,7 +3089,11 @@ static kvm_pfn_t kvm_follow_pfn(struct
> kvm_follow_pfn *kfp)
> if (kvm_is_error_hva(kfp->hva))
> return KVM_PFN_NOSLOT;
>
> - if (memslot_is_readonly(kfp->slot) && kfp->map_writable) {
> + if ((kfp->flags & FOLL_WRITE) && memslot_is_write_notify(kfp-
> >slot))
> + return KVM_PFN_ERR_WRITE_NOTIFY;
> +
> + if (kfp->map_writable &&
> + (memslot_is_readonly(kfp->slot) || memslot_is_write_notify(kfp-
> >slot)) {
> *kfp->map_writable = false;
> kfp->map_writable = NULL;
> }
> @@ -3324,6 +3357,8 @@ static int __kvm_write_guest_page(struct kvm
> *kvm,
> r = __copy_to_user((void __user *)addr + offset, data, len);
> if (r)
> return -EFAULT;
> + if (memslot_is_write_notify(memslot))
> + eventfd_signal(memslot->eventfd);
> mark_page_dirty_in_slot(kvm, memslot, gfn);
> return 0;
> }
>
> base-commit: 58e10b63777d0aebee2cf4e6c67e1a83e7edbe0f
> --
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [RFC PATCH] KVM: optionally commit write on ioeventfd write
2026-01-13 19:43 ` Thanos Makatos
@ 2026-01-13 20:00 ` Sean Christopherson
2026-03-02 12:28 ` [PATCH] KVM: optionally post " Thanos Makatos
0 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2026-01-13 20:00 UTC (permalink / raw)
To: Thanos Makatos
Cc: kvm@vger.kernel.org, John Levon, mst@redhat.com,
dinechin@redhat.com, cohuck@redhat.com, jasowang@redhat.com,
stefanha@redhat.com, jag.raman@oracle.com, eafanasova@gmail.com,
elena.ufimtseva@oracle.com, Paolo Bonzini
On Tue, Jan 13, 2026, Thanos Makatos wrote:
> > +Paolo (just realized Paolo isn't on the Cc)
> >
> > On Wed, Dec 03, 2025, Thanos Makatos wrote:
> > > > From: Sean Christopherson <seanjc@google.com>
> > > > Side topic, Paolo had an off-the-cuff idea of adding uAPI to support
> > > > notifications on memslot ranges, as opposed to posting writes via
> > > > ioeventfd. E.g. add a memslot flag, or maybe a memory attribute, that
> > > > causes KVM to write-protect a region, emulate in response to writes,
> > > > and then notify an eventfd after emulating the write. It'd be a lot
> > > > like KVM_MEM_READONLY, except that KVM would commit the write to
> > > > memory and notify, as opposed to exiting to userspace.
> > >
> > > Are you thinking for reusing/adapting the mechanism in this patch for that?
> >
> > Paolo's idea was to forego this patch entirely and instead add a more
> > generic write-notify mechanism. In practice, the only real difference is
> > that the writes would be fully in-place instead of a redirection, which in
> > turn would allow the guest to read without triggering a VM-Exit, and I
> > suppose might save userspace from some dirty logging operations.
> >
> > While I really like the mechanics of the idea, after sketching out the
> > basic gist (see below), I'm not convinced the additional complexity is
> > worth the gains. Unless reading from NVMe submission queues is a common
> > operation, it doesn't seem like eliding VM-Exits on reads buys much.
> >
> > Every arch would need to be updated to handle the new way of handling
> > emulated writes, with varying degrees of complexity. E.g. on x86 I think
> > it would just be teaching the MMU about the new "emulate on write"
> > behavior, but for arm64 (and presumably any other architecture without a
> > generic emulator), it would be that plus new code to actually commit the
> > write to guest memory.
> >
> > The other scary aspect is correctly handling "writable from KVM" and "can't
> > be mapped writable". Getting that correct in all places is non-trivial,
> > and seems like it could be a pain to maintain, which potentially fatal
> > failure modes, e.g. if KVM writes guest memory but fails to notify,
> > tracking down the bug would be "fun".
> >
> > So my vote is to add POST_WRITE functionality to I/O eventfd, and hold off
> > on a generic write-notify mechanism until there's a (really) strong use
> > case.
> >
> > Paolo, thoughts?
>
> In the absence of a response, shall we go ahead with POST_WRITE? I have the
> revised patch ready.
Ya, fire away.
^ permalink raw reply [flat|nested] 15+ messages in thread
* [PATCH] KVM: optionally post write on ioeventfd write
2026-01-13 20:00 ` Sean Christopherson
@ 2026-03-02 12:28 ` Thanos Makatos
2026-03-05 1:26 ` Sean Christopherson
` (2 more replies)
0 siblings, 3 replies; 15+ messages in thread
From: Thanos Makatos @ 2026-03-02 12:28 UTC (permalink / raw)
To: seanjc@google.com
Cc: pbonzini@redhat.com, John Levon, kvm@vger.kernel.org,
Thanos Makatos
This patch is a slightly different take on the ioregionfd mechanism
previously described here:
https://lore.kernel.org/all/88ca79d2e378dcbfb3988b562ad2c16c4f929ac7.camel@gmail.com/
The goal of this new mechanism is to speed up doorbell writes on NVMe
controllers emulated outside of the VMM. Currently, a doorbell write to
an NVMe SQ tail doorbell requires returning from ioctl(KVM_RUN) and the
VMM communicating the event, along with the doorbell value, to the NVMe
controller emulation task. With the shadow ioeventfd, the NVMe
emulation task is directly notified of the doorbell write and can find
the doorbell value in a known location, without the interference of the
VMM.
Signed-off-by: Thanos Makatos <thanos.makatos@nutanix.com>
---
include/uapi/linux/kvm.h | 11 ++++++++++-
tools/include/uapi/linux/kvm.h | 2 ++
virt/kvm/eventfd.c | 32 ++++++++++++++++++++++++++++++--
3 files changed, 42 insertions(+), 3 deletions(-)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 65500f5db379..f3ff559de60d 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -639,6 +639,7 @@ enum {
kvm_ioeventfd_flag_nr_deassign,
kvm_ioeventfd_flag_nr_virtio_ccw_notify,
kvm_ioeventfd_flag_nr_fast_mmio,
+ kvm_ioevetnfd_flag_nr_post_write,
kvm_ioeventfd_flag_nr_max,
};
@@ -648,6 +649,12 @@ enum {
#define KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY \
(1 << kvm_ioeventfd_flag_nr_virtio_ccw_notify)
+/*
+ * KVM does not provide any guarantees regarding read-after-write ordering for
+ * such updates.
+ */
+#define KVM_IOEVENTFD_FLAG_POST_WRITE (1 << kvm_ioevetnfd_flag_nr_post_write)
+
#define KVM_IOEVENTFD_VALID_FLAG_MASK ((1 << kvm_ioeventfd_flag_nr_max) - 1)
struct kvm_ioeventfd {
@@ -656,8 +663,10 @@ struct kvm_ioeventfd {
__u32 len; /* 1, 2, 4, or 8 bytes; or 0 to ignore length */
__s32 fd;
__u32 flags;
- __u8 pad[36];
+ void __user *post_addr; /* address to write to if POST_WRITE is set */
+ __u8 pad[24];
};
+_Static_assert(sizeof(struct kvm_ioeventfd) == 1 << 6, "bad size");
#define KVM_X86_DISABLE_EXITS_MWAIT (1 << 0)
#define KVM_X86_DISABLE_EXITS_HLT (1 << 1)
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index dddb781b0507..1fb481c90b57 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -629,6 +629,7 @@ enum {
kvm_ioeventfd_flag_nr_deassign,
kvm_ioeventfd_flag_nr_virtio_ccw_notify,
kvm_ioeventfd_flag_nr_fast_mmio,
+ kvm_ioevetnfd_flag_nr_commit_write,
kvm_ioeventfd_flag_nr_max,
};
@@ -637,6 +638,7 @@ enum {
#define KVM_IOEVENTFD_FLAG_DEASSIGN (1 << kvm_ioeventfd_flag_nr_deassign)
#define KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY \
(1 << kvm_ioeventfd_flag_nr_virtio_ccw_notify)
+#define KVM_IOEVENTFD_FLAG_COMMIT_WRITE (1 << kvm_ioevetnfd_flag_nr_commit_write)
#define KVM_IOEVENTFD_VALID_FLAG_MASK ((1 << kvm_ioeventfd_flag_nr_max) - 1)
diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
index 0e8b8a2c5b79..019cf3606aef 100644
--- a/virt/kvm/eventfd.c
+++ b/virt/kvm/eventfd.c
@@ -741,6 +741,7 @@ struct _ioeventfd {
struct kvm_io_device dev;
u8 bus_idx;
bool wildcard;
+ void __user *post_addr;
};
static inline struct _ioeventfd *
@@ -812,6 +813,9 @@ ioeventfd_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
if (!ioeventfd_in_range(p, addr, len, val))
return -EOPNOTSUPP;
+ if (p->post_addr && len > 0 && __copy_to_user(p->post_addr, val, len))
+ return -EFAULT;
+
eventfd_signal(p->eventfd);
return 0;
}
@@ -879,6 +883,27 @@ static int kvm_assign_ioeventfd_idx(struct kvm *kvm,
goto fail;
}
+ if (args->flags & KVM_IOEVENTFD_FLAG_POST_WRITE) {
+ /*
+ * Although a NULL pointer it technically valid for userspace, it's
+ * unlikely that any use case actually cares.
+ */
+ if (!args->len || !args->post_addr ||
+ args->post_addr != untagged_addr(args->post_addr) ||
+ !access_ok((void __user *)(unsigned long)args->post_addr, args->len)) {
+ ret = -EINVAL;
+ goto free_fail;
+ }
+ p->post_addr = args->post_addr;
+ } else if (!args->post_addr) {
+ /*
+ * Ensure that post_addr isn't set without POST_WRITE to avoid accidental
+ * userspace errors.
+ */
+ ret = -EINVAL;
+ goto free_fail;
+ }
+
INIT_LIST_HEAD(&p->list);
p->addr = args->addr;
p->bus_idx = bus_idx;
@@ -915,8 +940,8 @@ static int kvm_assign_ioeventfd_idx(struct kvm *kvm,
unlock_fail:
mutex_unlock(&kvm->slots_lock);
+free_fail:
kfree(p);
-
fail:
eventfd_ctx_put(eventfd);
@@ -932,12 +957,14 @@ kvm_deassign_ioeventfd_idx(struct kvm *kvm, enum kvm_bus bus_idx,
struct kvm_io_bus *bus;
int ret = -ENOENT;
bool wildcard;
+ void __user *post_addr;
eventfd = eventfd_ctx_fdget(args->fd);
if (IS_ERR(eventfd))
return PTR_ERR(eventfd);
wildcard = !(args->flags & KVM_IOEVENTFD_FLAG_DATAMATCH);
+ post_addr = args->post_addr;
mutex_lock(&kvm->slots_lock);
@@ -946,7 +973,8 @@ kvm_deassign_ioeventfd_idx(struct kvm *kvm, enum kvm_bus bus_idx,
p->eventfd != eventfd ||
p->addr != args->addr ||
p->length != args->len ||
- p->wildcard != wildcard)
+ p->wildcard != wildcard ||
+ p->post_addr != post_addr)
continue;
if (!p->wildcard && p->datamatch != args->datamatch)
--
2.47.3
^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH] KVM: optionally post write on ioeventfd write
2026-03-02 12:28 ` [PATCH] KVM: optionally post " Thanos Makatos
@ 2026-03-05 1:26 ` Sean Christopherson
2026-03-06 11:14 ` Thanos Makatos
2026-03-05 1:49 ` kernel test robot
2026-03-05 9:39 ` kernel test robot
2 siblings, 1 reply; 15+ messages in thread
From: Sean Christopherson @ 2026-03-05 1:26 UTC (permalink / raw)
To: Thanos Makatos; +Cc: pbonzini@redhat.com, John Levon, kvm@vger.kernel.org
Please don't send patches in-reply to the previous version(s), it tends to mess
up b4.
On Mon, Mar 02, 2026, Thanos Makatos wrote:
> This patch is a slightly different take on the ioregionfd mechanism
> previously described here:
> https://lore.kernel.org/all/88ca79d2e378dcbfb3988b562ad2c16c4f929ac7.camel@gmail.com/
>
> The goal of this new mechanism is to speed up doorbell writes on NVMe
> controllers emulated outside of the VMM. Currently, a doorbell write to
> an NVMe SQ tail doorbell requires returning from ioctl(KVM_RUN) and the
> VMM communicating the event, along with the doorbell value, to the NVMe
> controller emulation task. With the shadow ioeventfd, the NVMe
> emulation task is directly notified of the doorbell write and can find
> the doorbell value in a known location, without the interference of the
> VMM.
Please add a KVM selftest to verify this works, and to verify that KVM rejects
bad configurations.
> Signed-off-by: Thanos Makatos <thanos.makatos@nutanix.com>
> ---
> include/uapi/linux/kvm.h | 11 ++++++++++-
> tools/include/uapi/linux/kvm.h | 2 ++
> virt/kvm/eventfd.c | 32 ++++++++++++++++++++++++++++++--
> 3 files changed, 42 insertions(+), 3 deletions(-)
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 65500f5db379..f3ff559de60d 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -639,6 +639,7 @@ enum {
> kvm_ioeventfd_flag_nr_deassign,
> kvm_ioeventfd_flag_nr_virtio_ccw_notify,
> kvm_ioeventfd_flag_nr_fast_mmio,
> + kvm_ioevetnfd_flag_nr_post_write,
> kvm_ioeventfd_flag_nr_max,
> };
>
> @@ -648,6 +649,12 @@ enum {
> #define KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY \
> (1 << kvm_ioeventfd_flag_nr_virtio_ccw_notify)
>
> +/*
> + * KVM does not provide any guarantees regarding read-after-write ordering for
> + * such updates.
Please document this (and more) in Documentation/virt/kvm/api.rst, not here.
> + */
> +#define KVM_IOEVENTFD_FLAG_POST_WRITE (1 << kvm_ioevetnfd_flag_nr_post_write)
> +
> #define KVM_IOEVENTFD_VALID_FLAG_MASK ((1 << kvm_ioeventfd_flag_nr_max) - 1)
>
> struct kvm_ioeventfd {
> @@ -656,8 +663,10 @@ struct kvm_ioeventfd {
> __u32 len; /* 1, 2, 4, or 8 bytes; or 0 to ignore length */
> __s32 fd;
> __u32 flags;
> - __u8 pad[36];
> + void __user *post_addr; /* address to write to if POST_WRITE is set */
> + __u8 pad[24];
> };
> +_Static_assert(sizeof(struct kvm_ioeventfd) == 1 << 6, "bad size");
>
> #define KVM_X86_DISABLE_EXITS_MWAIT (1 << 0)
> #define KVM_X86_DISABLE_EXITS_HLT (1 << 1)
> diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
> index dddb781b0507..1fb481c90b57 100644
> --- a/tools/include/uapi/linux/kvm.h
> +++ b/tools/include/uapi/linux/kvm.h
Don't bother updating tools, the copy of uapi headers in tools is maintained by
the perf folks (perf-the-tool needs all of the headers, nothing else does).
> @@ -629,6 +629,7 @@ enum {
> kvm_ioeventfd_flag_nr_deassign,
> kvm_ioeventfd_flag_nr_virtio_ccw_notify,
> kvm_ioeventfd_flag_nr_fast_mmio,
> + kvm_ioevetnfd_flag_nr_commit_write,
Then you won't have amusing mistakes like this :-)
> kvm_ioeventfd_flag_nr_max,
> };
>
> @@ -637,6 +638,7 @@ enum {
> #define KVM_IOEVENTFD_FLAG_DEASSIGN (1 << kvm_ioeventfd_flag_nr_deassign)
> #define KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY \
> (1 << kvm_ioeventfd_flag_nr_virtio_ccw_notify)
> +#define KVM_IOEVENTFD_FLAG_COMMIT_WRITE (1 << kvm_ioevetnfd_flag_nr_commit_write)
>
> #define KVM_IOEVENTFD_VALID_FLAG_MASK ((1 << kvm_ioeventfd_flag_nr_max) - 1)
>
> diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> index 0e8b8a2c5b79..019cf3606aef 100644
> --- a/virt/kvm/eventfd.c
> +++ b/virt/kvm/eventfd.c
> @@ -741,6 +741,7 @@ struct _ioeventfd {
> struct kvm_io_device dev;
> u8 bus_idx;
> bool wildcard;
> + void __user *post_addr;
> };
>
> static inline struct _ioeventfd *
> @@ -812,6 +813,9 @@ ioeventfd_write(struct kvm_vcpu *vcpu, struct kvm_io_device *this, gpa_t addr,
> if (!ioeventfd_in_range(p, addr, len, val))
> return -EOPNOTSUPP;
>
> + if (p->post_addr && len > 0 && __copy_to_user(p->post_addr, val, len))
> + return -EFAULT;
> +
> eventfd_signal(p->eventfd);
> return 0;
> }
> @@ -879,6 +883,27 @@ static int kvm_assign_ioeventfd_idx(struct kvm *kvm,
> goto fail;
> }
>
> + if (args->flags & KVM_IOEVENTFD_FLAG_POST_WRITE) {
> + /*
> + * Although a NULL pointer it technically valid for userspace, it's
> + * unlikely that any use case actually cares.
This is fine for a changelog, but for a code comment, simply state that KVM's ABI
is that NULL is disallowed.
> + */
> + if (!args->len || !args->post_addr ||
> + args->post_addr != untagged_addr(args->post_addr) ||
> + !access_ok((void __user *)(unsigned long)args->post_addr, args->len)) {
Align indentation. And use u64_to_user_ptr().
> + ret = -EINVAL;
> + goto free_fail;
This is rather silly. Put the checks before allocating. Then the post-alloc
code can simply be:
if (args->flags & KVM_IOEVENTFD_FLAG_POST_WRITE)
p->post_addr = args->post_addr;
I.e. your burning more code to try and save code. E.g.
if ((args->flags & KVM_IOEVENTFD_FLAG_POST_WRITE) &&
(!args->len || !args->post_addr ||
args->post_addr != untagged_addr(args->post_addr) ||
!access_ok(u64_to_user_ptr(args->post_addr), args->len)))
return -EINVAL;
p = kzalloc(sizeof(*p), GFP_KERNEL_ACCOUNT);
if (!p) {
ret = -ENOMEM;
goto fail;
}
INIT_LIST_HEAD(&p->list);
p->addr = args->addr;
p->bus_idx = bus_idx;
p->length = args->len;
p->eventfd = eventfd;
/* The datamatch feature is optional, otherwise this is a wildcard */
if (args->flags & KVM_IOEVENTFD_FLAG_DATAMATCH)
p->datamatch = args->datamatch;
else
p->wildcard = true;
if (args->flags & KVM_IOEVENTFD_FLAG_POST_WRITE)
p->post_addr = args->post_addr;
> + }
> + p->post_addr = args->post_addr;
> + } else if (!args->post_addr) {
This isn't a valid check. KVM didn't/doesn't require args->pad to be zero, so
it would be entirely legal for existing userspace to pass in a non-zero value and
expect success. If this added truly meaningful value, then maybe it would be
worth risking breakage, but in this case trying to help userspace is more likely
to do harm than good.
> + /*
> + * Ensure that post_addr isn't set without POST_WRITE to avoid accidental
Wrap at 80 since the comment carries over to a new line anyways. But as above,
it's moot.
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] KVM: optionally post write on ioeventfd write
2026-03-02 12:28 ` [PATCH] KVM: optionally post " Thanos Makatos
2026-03-05 1:26 ` Sean Christopherson
@ 2026-03-05 1:49 ` kernel test robot
2026-03-05 9:39 ` kernel test robot
2 siblings, 0 replies; 15+ messages in thread
From: kernel test robot @ 2026-03-05 1:49 UTC (permalink / raw)
To: Thanos Makatos, seanjc@google.com
Cc: oe-kbuild-all, pbonzini@redhat.com, John Levon,
kvm@vger.kernel.org, Thanos Makatos
Hi Thanos,
kernel test robot noticed the following build errors:
[auto build test ERROR on kvm/queue]
[also build test ERROR on kvm/next linus/master v7.0-rc2 next-20260304]
[cannot apply to kvm/linux-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Thanos-Makatos/KVM-optionally-post-write-on-ioeventfd-write/20260302-204031
base: https://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
patch link: https://lore.kernel.org/r/20260302122826.2572-1-thanos.makatos%40nutanix.com
patch subject: [PATCH] KVM: optionally post write on ioeventfd write
config: i386-allnoconfig (https://download.01.org/0day-ci/archive/20260305/202603050920.Lmf80GaE-lkp@intel.com/config)
compiler: gcc-14 (Debian 14.2.0-19) 14.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260305/202603050920.Lmf80GaE-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603050920.Lmf80GaE-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from include/linux/kvm_host.h:40,
from arch/x86/events/intel/core.c:17:
>> include/uapi/linux/kvm.h:669:1: error: static assertion failed: "bad size"
669 | _Static_assert(sizeof(struct kvm_ioeventfd) == 1 << 6, "bad size");
| ^~~~~~~~~~~~~~
vim +669 include/uapi/linux/kvm.h
659
660 struct kvm_ioeventfd {
661 __u64 datamatch;
662 __u64 addr; /* legal pio/mmio address */
663 __u32 len; /* 1, 2, 4, or 8 bytes; or 0 to ignore length */
664 __s32 fd;
665 __u32 flags;
666 void __user *post_addr; /* address to write to if POST_WRITE is set */
667 __u8 pad[24];
668 };
> 669 _Static_assert(sizeof(struct kvm_ioeventfd) == 1 << 6, "bad size");
670
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: [PATCH] KVM: optionally post write on ioeventfd write
2026-03-02 12:28 ` [PATCH] KVM: optionally post " Thanos Makatos
2026-03-05 1:26 ` Sean Christopherson
2026-03-05 1:49 ` kernel test robot
@ 2026-03-05 9:39 ` kernel test robot
2 siblings, 0 replies; 15+ messages in thread
From: kernel test robot @ 2026-03-05 9:39 UTC (permalink / raw)
To: Thanos Makatos, seanjc@google.com
Cc: oe-kbuild-all, pbonzini@redhat.com, John Levon,
kvm@vger.kernel.org, Thanos Makatos
Hi Thanos,
kernel test robot noticed the following build errors:
[auto build test ERROR on kvm/queue]
[also build test ERROR on kvm/next linus/master v7.0-rc2 next-20260304]
[cannot apply to kvm/linux-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]
url: https://github.com/intel-lab-lkp/linux/commits/Thanos-Makatos/KVM-optionally-post-write-on-ioeventfd-write/20260302-204031
base: https://git.kernel.org/pub/scm/virt/kvm/kvm.git queue
patch link: https://lore.kernel.org/r/20260302122826.2572-1-thanos.makatos%40nutanix.com
patch subject: [PATCH] KVM: optionally post write on ioeventfd write
config: i386-randconfig-141-20260305 (https://download.01.org/0day-ci/archive/20260305/202603051704.nmQyEAnO-lkp@intel.com/config)
compiler: clang version 20.1.8 (https://github.com/llvm/llvm-project 87f0227cb60147a26a1eeb4fb06e3b505e9c7261)
smatch: v0.5.0-9004-gb810ac53
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260305/202603051704.nmQyEAnO-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202603051704.nmQyEAnO-lkp@intel.com/
All errors (new ones prefixed by >>):
In file included from arch/x86/events/intel/core.c:17:
In file included from include/linux/kvm_host.h:40:
>> include/uapi/linux/kvm.h:669:16: error: static assertion failed due to requirement 'sizeof(struct kvm_ioeventfd) == 1 << 6': bad size
669 | _Static_assert(sizeof(struct kvm_ioeventfd) == 1 << 6, "bad size");
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
include/uapi/linux/kvm.h:669:45: note: expression evaluates to '56 == 64'
669 | _Static_assert(sizeof(struct kvm_ioeventfd) == 1 << 6, "bad size");
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~
1 error generated.
vim +669 include/uapi/linux/kvm.h
659
660 struct kvm_ioeventfd {
661 __u64 datamatch;
662 __u64 addr; /* legal pio/mmio address */
663 __u32 len; /* 1, 2, 4, or 8 bytes; or 0 to ignore length */
664 __s32 fd;
665 __u32 flags;
666 void __user *post_addr; /* address to write to if POST_WRITE is set */
667 __u8 pad[24];
668 };
> 669 _Static_assert(sizeof(struct kvm_ioeventfd) == 1 << 6, "bad size");
670
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply [flat|nested] 15+ messages in thread
* RE: [PATCH] KVM: optionally post write on ioeventfd write
2026-03-05 1:26 ` Sean Christopherson
@ 2026-03-06 11:14 ` Thanos Makatos
0 siblings, 0 replies; 15+ messages in thread
From: Thanos Makatos @ 2026-03-06 11:14 UTC (permalink / raw)
To: Sean Christopherson; +Cc: pbonzini@redhat.com, John Levon, kvm@vger.kernel.org
> -----Original Message-----
> From: Sean Christopherson <seanjc@google.com>
> Sent: 05 March 2026 01:27
> To: Thanos Makatos <thanos.makatos@nutanix.com>
> Cc: pbonzini@redhat.com; John Levon <john.levon@nutanix.com>;
> kvm@vger.kernel.org
> Subject: Re: [PATCH] KVM: optionally post write on ioeventfd write
>
> !-------------------------------------------------------------------|
> CAUTION: External Email
>
> |-------------------------------------------------------------------!
>
> Please don't send patches in-reply to the previous version(s), it tends to mess
> up b4.
>
> On Mon, Mar 02, 2026, Thanos Makatos wrote:
> > This patch is a slightly different take on the ioregionfd mechanism
> > previously described here:
> > https://urldefense.proofpoint.com/v2/url?u=https-
> 3A__lore.kernel.org_all_88ca79d2e378dcbfb3988b562ad2c16c4f929ac7.ca
> mel-
> 40gmail.com_&d=DwIBAg&c=s883GpUCOChKOHiocYtGcg&r=XTpYsh5Ps2zJvt
> w6ogtti46atk736SI4vgsJiUKIyDE&m=XKZUVVKO9SGqV_txMzP2_tgrfJrgB2lU
> 50rbshSY1i91mYQgU2LKO23a_If0S6GB&s=kp06dnwO7ESRSZ1iL_VQw0yKD
> OOED0L4jHbNjj4FqgI&e=
> >
> > The goal of this new mechanism is to speed up doorbell writes on NVMe
> > controllers emulated outside of the VMM. Currently, a doorbell write to
> > an NVMe SQ tail doorbell requires returning from ioctl(KVM_RUN) and the
> > VMM communicating the event, along with the doorbell value, to the NVMe
> > controller emulation task. With the shadow ioeventfd, the NVMe
> > emulation task is directly notified of the doorbell write and can find
> > the doorbell value in a known location, without the interference of the
> > VMM.
>
> Please add a KVM selftest to verify this works, and to verify that KVM rejects
> bad configurations.
Ack
>
> > Signed-off-by: Thanos Makatos <thanos.makatos@nutanix.com>
> > ---
> > include/uapi/linux/kvm.h | 11 ++++++++++-
> > tools/include/uapi/linux/kvm.h | 2 ++
> > virt/kvm/eventfd.c | 32 ++++++++++++++++++++++++++++++--
> > 3 files changed, 42 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> > index 65500f5db379..f3ff559de60d 100644
> > --- a/include/uapi/linux/kvm.h
> > +++ b/include/uapi/linux/kvm.h
> > @@ -639,6 +639,7 @@ enum {
> > kvm_ioeventfd_flag_nr_deassign,
> > kvm_ioeventfd_flag_nr_virtio_ccw_notify,
> > kvm_ioeventfd_flag_nr_fast_mmio,
> > + kvm_ioevetnfd_flag_nr_post_write,
> > kvm_ioeventfd_flag_nr_max,
> > };
> >
> > @@ -648,6 +649,12 @@ enum {
> > #define KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY \
> > (1 << kvm_ioeventfd_flag_nr_virtio_ccw_notify)
> >
> > +/*
> > + * KVM does not provide any guarantees regarding read-after-write
> ordering for
> > + * such updates.
>
> Please document this (and more) in Documentation/virt/kvm/api.rst, not
> here.
Ack
>
> > + */
> > +#define KVM_IOEVENTFD_FLAG_POST_WRITE (1 <<
> kvm_ioevetnfd_flag_nr_post_write)
> > +
> > #define KVM_IOEVENTFD_VALID_FLAG_MASK ((1 <<
> kvm_ioeventfd_flag_nr_max) - 1)
> >
> > struct kvm_ioeventfd {
> > @@ -656,8 +663,10 @@ struct kvm_ioeventfd {
> > __u32 len; /* 1, 2, 4, or 8 bytes; or 0 to ignore length */
> > __s32 fd;
> > __u32 flags;
> > - __u8 pad[36];
> > + void __user *post_addr; /* address to write to if POST_WRITE is set */
> > + __u8 pad[24];
> > };
> > +_Static_assert(sizeof(struct kvm_ioeventfd) == 1 << 6, "bad size");
> >
> > #define KVM_X86_DISABLE_EXITS_MWAIT (1 << 0)
> > #define KVM_X86_DISABLE_EXITS_HLT (1 << 1)
> > diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
> > index dddb781b0507..1fb481c90b57 100644
> > --- a/tools/include/uapi/linux/kvm.h
> > +++ b/tools/include/uapi/linux/kvm.h
>
> Don't bother updating tools, the copy of uapi headers in tools is maintained by
> the perf folks (perf-the-tool needs all of the headers, nothing else does).
>
> > @@ -629,6 +629,7 @@ enum {
> > kvm_ioeventfd_flag_nr_deassign,
> > kvm_ioeventfd_flag_nr_virtio_ccw_notify,
> > kvm_ioeventfd_flag_nr_fast_mmio,
> > + kvm_ioevetnfd_flag_nr_commit_write,
>
> Then you won't have amusing mistakes like this :-)
Ack
>
> > kvm_ioeventfd_flag_nr_max,
> > };
> >
> > @@ -637,6 +638,7 @@ enum {
> > #define KVM_IOEVENTFD_FLAG_DEASSIGN (1 <<
> kvm_ioeventfd_flag_nr_deassign)
> > #define KVM_IOEVENTFD_FLAG_VIRTIO_CCW_NOTIFY \
> > (1 << kvm_ioeventfd_flag_nr_virtio_ccw_notify)
> > +#define KVM_IOEVENTFD_FLAG_COMMIT_WRITE (1 <<
> kvm_ioevetnfd_flag_nr_commit_write)
> >
> > #define KVM_IOEVENTFD_VALID_FLAG_MASK ((1 <<
> kvm_ioeventfd_flag_nr_max) - 1)
> >
> > diff --git a/virt/kvm/eventfd.c b/virt/kvm/eventfd.c
> > index 0e8b8a2c5b79..019cf3606aef 100644
> > --- a/virt/kvm/eventfd.c
> > +++ b/virt/kvm/eventfd.c
> > @@ -741,6 +741,7 @@ struct _ioeventfd {
> > struct kvm_io_device dev;
> > u8 bus_idx;
> > bool wildcard;
> > + void __user *post_addr;
> > };
> >
> > static inline struct _ioeventfd *
> > @@ -812,6 +813,9 @@ ioeventfd_write(struct kvm_vcpu *vcpu, struct
> kvm_io_device *this, gpa_t addr,
> > if (!ioeventfd_in_range(p, addr, len, val))
> > return -EOPNOTSUPP;
> >
> > + if (p->post_addr && len > 0 && __copy_to_user(p->post_addr, val,
> len))
> > + return -EFAULT;
> > +
> > eventfd_signal(p->eventfd);
> > return 0;
> > }
> > @@ -879,6 +883,27 @@ static int kvm_assign_ioeventfd_idx(struct kvm
> *kvm,
> > goto fail;
> > }
> >
> > + if (args->flags & KVM_IOEVENTFD_FLAG_POST_WRITE) {
> > + /*
> > + * Although a NULL pointer it technically valid for userspace,
> it's
> > + * unlikely that any use case actually cares.
>
> This is fine for a changelog, but for a code comment, simply state that KVM's
> ABI
> is that NULL is disallowed.
Ack
>
> > + */
> > + if (!args->len || !args->post_addr ||
> > + args->post_addr != untagged_addr(args->post_addr)
> ||
> > + !access_ok((void __user *)(unsigned long)args-
> >post_addr, args->len)) {
>
> Align indentation. And use u64_to_user_ptr().
>
> > + ret = -EINVAL;
> > + goto free_fail;
>
> This is rather silly. Put the checks before allocating. Then the post-alloc
> code can simply be:
>
> if (args->flags & KVM_IOEVENTFD_FLAG_POST_WRITE)
> p->post_addr = args->post_addr;
>
> I.e. your burning more code to try and save code. E.g.
>
> if ((args->flags & KVM_IOEVENTFD_FLAG_POST_WRITE) &&
> (!args->len || !args->post_addr ||
> args->post_addr != untagged_addr(args->post_addr) ||
> !access_ok(u64_to_user_ptr(args->post_addr), args->len)))
> return -EINVAL;
>
> p = kzalloc(sizeof(*p), GFP_KERNEL_ACCOUNT);
> if (!p) {
> ret = -ENOMEM;
> goto fail;
> }
>
> INIT_LIST_HEAD(&p->list);
> p->addr = args->addr;
> p->bus_idx = bus_idx;
> p->length = args->len;
> p->eventfd = eventfd;
>
> /* The datamatch feature is optional, otherwise this is a wildcard */
> if (args->flags & KVM_IOEVENTFD_FLAG_DATAMATCH)
> p->datamatch = args->datamatch;
> else
> p->wildcard = true;
>
> if (args->flags & KVM_IOEVENTFD_FLAG_POST_WRITE)
> p->post_addr = args->post_addr;
Ack
>
>
> > + }
> > + p->post_addr = args->post_addr;
> > + } else if (!args->post_addr) {
>
> This isn't a valid check. KVM didn't/doesn't require args->pad to be zero, so
> it would be entirely legal for existing userspace to pass in a non-zero value and
> expect success. If this added truly meaningful value, then maybe it would be
> worth risking breakage, but in this case trying to help userspace is more likely
> to do harm than good.
Ack
>
> > + /*
> > + * Ensure that post_addr isn't set without POST_WRITE to
> avoid accidental
>
> Wrap at 80 since the comment carries over to a new line anyways. But as
> above,
> it's moot.
^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2026-03-06 11:14 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2022-10-05 21:15 [RFC PATCH] KVM: optionally commit write on ioeventfd write Thanos Makatos
2025-09-05 14:09 ` Sean Christopherson
2025-12-02 9:15 ` Thanos Makatos
2025-12-02 22:53 ` Sean Christopherson
2025-12-03 14:25 ` Thanos Makatos
2025-12-15 22:42 ` Thanos Makatos
2025-12-19 1:20 ` Sean Christopherson
2025-12-19 10:36 ` John Levon
2026-01-13 19:43 ` Thanos Makatos
2026-01-13 20:00 ` Sean Christopherson
2026-03-02 12:28 ` [PATCH] KVM: optionally post " Thanos Makatos
2026-03-05 1:26 ` Sean Christopherson
2026-03-06 11:14 ` Thanos Makatos
2026-03-05 1:49 ` kernel test robot
2026-03-05 9:39 ` kernel test robot
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox