* Re: Re: [BUG REPORT] USE_AFTER_FREE in complete_emulated_mmio found by KASAN/Syzkaller fuzz test (v5.10.0)
@ 2026-02-10 11:56 Zhangjiaji
2026-02-10 19:11 ` Sean Christopherson
0 siblings, 1 reply; 12+ messages in thread
From: Zhangjiaji @ 2026-02-10 11:56 UTC (permalink / raw)
To: Sean Christopherson
Cc: Paolo Bonzini, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
Wangqinxiao (Tom), zhangyashu, wangyanan (Y)
> I think there's a not-completely-awful solution buried in this gigantic cesspool.
> The only time KVM uses on-stack variables is for qword or smaller accesses, i.e.
> 8 bytes in size or less. For larger fragments, e.g. AVX to/from MMIO, the target
> value will always be an operand in the emulator context. And so rather than
> disallow stack variables, for "small" fragments, we can rework the handling to
> copy the value to/from each fragment on-demand instead of stashing a pointer to
> the value.
Since we can store the value in struct kvm_mmio_fragment as frag->val,
why not just point frag->data at it? This way we can save a lot of the
code that checks (frag->data == NULL).
Though this patch will block any read-into-stack calls, we could add a special
path in emulator_read_write() to handle feasible read-into-stack calls, i.e.
those where the target is released just after emulator_read_write() returns.
---
arch/x86/kvm/x86.c | 9 ++++++++-
include/linux/kvm_host.h | 3 ++-
2 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 72d37c8930ad..12d53d441a39 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8197,7 +8197,14 @@ static int emulator_read_write_onepage(unsigned long addr, void *val,
WARN_ON(vcpu->mmio_nr_fragments >= KVM_MAX_MMIO_FRAGMENTS);
frag = &vcpu->mmio_fragments[vcpu->mmio_nr_fragments++];
frag->gpa = gpa;
-	frag->data = val;
+	if (bytes > 8u || !write) {
+		if (WARN_ON_ONCE(object_is_on_stack(val)))
+			return X86EMUL_UNHANDLEABLE;
+		frag->data = val;
+	} else {
+		memcpy(&frag->val, val, bytes);
+		frag->data = &frag->val;
+	}
frag->len = bytes;
return X86EMUL_CONTINUE;
}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index d93f75b05ae2..f8419d632394 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -320,7 +320,8 @@ static inline bool kvm_vcpu_can_poll(ktime_t cur, ktime_t stop)
struct kvm_mmio_fragment {
gpa_t gpa;
void *data;
- unsigned len;
+ unsigned int len;
+ u64 val;
};
struct kvm_vcpu {
base-commit: 72c395024dac5e215136cbff793455f065603b06
--
2.33.0
-----Original Message-----
From: Sean Christopherson <seanjc@google.com>
Sent: Tuesday, February 10, 2026 4:22 AM
To: Zhangjiaji <zhangjiaji1@huawei.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>; kvm@vger.kernel.org; linux-kernel@vger.kernel.org; Wangqinxiao (Tom) <wangqinxiao@huawei.com>; zhangyashu <zhangyashu2@h-partners.com>; wangyanan (Y) <wangyanan55@huawei.com>
Subject: Re: [BUG REPORT] USE_AFTER_FREE in complete_emulated_mmio found by KASAN/Syzkaller fuzz test (v5.10.0)
On Fri, Feb 06, 2026, Sean Christopherson wrote:
> On Mon, Feb 02, 2026, Zhangjiaji wrote:
> > Syzkaller hit 'KASAN: use-after-free Read in complete_emulated_mmio' bug.
> >
> > ==================================================================
> > BUG: KASAN: use-after-free in complete_emulated_mmio+0x305/0x420
> > Read of size 1 at addr ffff888009c378d1 by task syz-executor417/984
> >
> > CPU: 1 PID: 984 Comm: syz-executor417 Not tainted 5.10.0-182.0.0.95.h2627.eulerosv2r13.x86_64 #3
> > Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
> > Call Trace:
> > dump_stack+0xbe/0xfd
> > print_address_description.constprop.0+0x19/0x170
> > __kasan_report.cold+0x6c/0x84
> > kasan_report+0x3a/0x50
> > check_memory_region+0xfd/0x1f0
> > memcpy+0x20/0x60
> > complete_emulated_mmio+0x305/0x420
> > kvm_arch_vcpu_ioctl_run+0x63f/0x6d0
> > kvm_vcpu_ioctl+0x413/0xb20
> > __se_sys_ioctl+0x111/0x160
> > do_syscall_64+0x30/0x40
> > entry_SYSCALL_64_after_hwframe+0x67/0xd1
> > RIP: 0033:0x42477d
...
> > I've analyzed the Syzkaller output and the complete_emulated_mmio()
> > code path. The buggy address is created in em_enter(), which passes
> > its local variable `ulong rbp` to emulate_push() and finally ends with
> > emulator_read_write_onepage() putting the address into
> > vcpu->mmio_fragments[].data. The bug happens when the KVM guest
> > executes an "enter" instruction and the top of the stack crosses a
> > memory page. In that case, em_enter() cannot complete the instruction
> > by itself and leaves the rest of the data (which is in the other page)
> > to complete_emulated_mmio(). By the time complete_emulated_mmio()
> > runs, em_enter() has exited, so the local variable `ulong rbp` has
> > been released. complete_emulated_mmio() then tries to access
> > vcpu->mmio_fragments[].data, and the bug happens.
> >
> > any idea?
>
> Egad, sorry! I had reproduced this shortly after you sent the report
> and prepped a fix, but got distracted and lost this in my inbox.
>
> Can you test this on your end? I repro'd by modifying a KVM-Unit-Test
> and hacking KVM to tweak the stack, so I haven't confirmed the syzkaller version.
>
> It's a bit gross, as it abuses an unused field as scratch space, but
> AFAICT that's "fine". The alternative would be to add a dedicated field,
> which seems like overkill?
>
> I'm also going to try and add a WARN to detect if the @val parameter
> passed to emulator_read_write() is ever on the kernel stack, e.g. to
> help detect lurking bugs like this one without relying on KASAN.
>
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index c8e292e9a24d..dacef51c2565 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -1897,13 +1897,12 @@ static int em_enter(struct x86_emulate_ctxt *ctxt)
> int rc;
> unsigned frame_size = ctxt->src.val;
> unsigned nesting_level = ctxt->src2.val & 31;
> - ulong rbp;
>
> if (nesting_level)
> return X86EMUL_UNHANDLEABLE;
>
> -	rbp = reg_read(ctxt, VCPU_REGS_RBP);
> -	rc = emulate_push(ctxt, &rbp, stack_size(ctxt));
> +	ctxt->memop.orig_val = reg_read(ctxt, VCPU_REGS_RBP);
> +	rc = emulate_push(ctxt, &ctxt->memop.orig_val, stack_size(ctxt));
> if (rc != X86EMUL_CONTINUE)
> return rc;
>  	assign_masked(reg_rmw(ctxt, VCPU_REGS_RBP), reg_read(ctxt, VCPU_REGS_RSP),
*sigh*
This isn't going to work. Or rather, it's far from a complete fix, as there are
several other instructions and flows that use stack variables to service reads
and writes. Hacking all of them to not use stack variables isn't feasible, as
flows like emulate_iret_real() and em_popa() perform multiple accesses, i.e.
hijacking an unused operand simply won't work.
I'm tempted to go with a straightforward "fix" of rejecting userspace MMIO if the destination is on the stack, e.g. like so:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index db3f393192d9..113287612acd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8272,6 +8272,9 @@ static int emulator_read_write(struct x86_emulate_ctxt *ctxt,
if (!vcpu->mmio_nr_fragments)
return X86EMUL_CONTINUE;
+ if (object_is_on_stack(val))
+ return X86EMUL_UNHANDLEABLE;
+
gpa = vcpu->mmio_fragments[0].gpa;
vcpu->mmio_needed = 1;
But I hate how arbitrary that is, and I'm somewhat concerned that it could break existing setups.
I think there's a not-completely-awful solution buried in this gigantic cesspool.
The only time KVM uses on-stack variables is for qword or smaller accesses, i.e.
8 bytes in size or less. For larger fragments, e.g. AVX to/from MMIO, the target value will always be an operand in the emulator context. And so rather than disallow stack variables, for "small" fragments, we can rework the handling to copy the value to/from each fragment on-demand instead of stashing a pointer to the value.
This needs a _lot_ more documentation, the SEV-ES flows aren't converted, and there is a ton of cleanup that can and should be done on top, but this appears to work.
Note the "clobber" in complete_emulated_mmio() to fill 4096 bytes of the stack with 0xaa, i.e. to simulate some of the stack being "freed".
---
arch/x86/kvm/kvm_emulate.h | 8 +++
arch/x86/kvm/x86.c | 125 +++++++++++++++++++++++++++----------
include/linux/kvm_host.h | 6 +-
3 files changed, 103 insertions(+), 36 deletions(-)
diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
index fb3dab4b5a53..f735158af05e 100644
--- a/arch/x86/kvm/kvm_emulate.h
+++ b/arch/x86/kvm/kvm_emulate.h
@@ -284,6 +284,14 @@ struct fetch_cache {
u8 *end;
};
+/*
+ * To complete userspace MMIO and I/O reads, KVM re-emulates (NO_DECODE) the
+ * _entire_ instruction to propagate the read data to its final destination.
+ * To avoid re-reading values from memory and/or getting "stuck" on the access
+ * that triggered an exit to userspace, KVM caches all values that have been
+ * read for a given instruction, and reads from this cache instead of reading
+ * from guest memory or from userspace.
+ */
struct read_cache {
u8 data[1024];
unsigned long pos;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index db3f393192d9..1b99de3e3236 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8109,7 +8109,7 @@ int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
 }
struct read_write_emulator_ops {
- int (*read_write_prepare)(struct kvm_vcpu *vcpu, void *val,
+ int (*read_mmio_fragment)(struct kvm_vcpu *vcpu, void *val,
int bytes);
int (*read_write_emulate)(struct kvm_vcpu *vcpu, gpa_t gpa,
void *val, int bytes);
@@ -8120,16 +8120,37 @@ struct read_write_emulator_ops {
bool write;
};
-static int read_prepare(struct kvm_vcpu *vcpu, void *val, int bytes)
+static int read_mmio_fragment(struct kvm_vcpu *vcpu, void *val, int bytes)
{
- if (vcpu->mmio_read_completed) {
- trace_kvm_mmio(KVM_TRACE_MMIO_READ, bytes,
- vcpu->mmio_fragments[0].gpa, val);
- vcpu->mmio_read_completed = 0;
- return 1;
+ struct kvm_mmio_fragment *frag;
+ unsigned int len = bytes;
+
+ while (len && vcpu->mmio_head_fragment < vcpu->mmio_tail_fragment) {
+ int i = vcpu->mmio_head_fragment++;
+
+ if (WARN_ON_ONCE(i >= vcpu->mmio_nr_fragments))
+ break;
+
+ frag = &vcpu->mmio_fragments[i];
+ if (!frag->data) {
+ if (WARN_ON_ONCE(len < frag->len || frag->len > 8u)) {
+ pr_warn("len = %u, bytes = %u, frag->len = %u, gpa = %llx, rip = %lx\n",
+ len, bytes, frag->len, frag->gpa, kvm_rip_read(vcpu));
+ break;
+ }
+
+ memcpy(val, &frag->val, min(8u, frag->len));
+ }
+
+ val += frag->len;
+ len -= min(len, frag->len);
}
- return 0;
+ if ((int)len == bytes)
+ return 0;
+
+ trace_kvm_mmio(KVM_TRACE_MMIO_READ, bytes, frag->gpa, val);
+ return 1;
}
 static int read_emulate(struct kvm_vcpu *vcpu, gpa_t gpa,
@@ -8150,6 +8171,11 @@ static int write_mmio(struct kvm_vcpu *vcpu, gpa_t gpa, int bytes, void *val)
 	return vcpu_mmio_write(vcpu, gpa, bytes, val);
 }
+static void *mmio_frag_data(struct kvm_mmio_fragment *frag)
+{
+	return frag->data ?: &frag->val;
+}
+
+
static int read_exit_mmio(struct kvm_vcpu *vcpu, gpa_t gpa,
void *val, int bytes)
{
@@ -8162,12 +8188,12 @@ static int write_exit_mmio(struct kvm_vcpu *vcpu, gpa_t gpa,
 {
 	struct kvm_mmio_fragment *frag = &vcpu->mmio_fragments[0];
- memcpy(vcpu->run->mmio.data, frag->data, min(8u, frag->len));
-	memcpy(vcpu->run->mmio.data, frag->data, min(8u, frag->len));
+	memcpy(vcpu->run->mmio.data, mmio_frag_data(frag), min(8u, frag->len));
return X86EMUL_CONTINUE;
}
static const struct read_write_emulator_ops read_emultor = {
- .read_write_prepare = read_prepare,
+ .read_mmio_fragment = read_mmio_fragment,
.read_write_emulate = read_emulate,
.read_write_mmio = vcpu_mmio_read,
 	.read_write_exit_mmio = read_exit_mmio,
@@ -8226,7 +8252,13 @@ static int emulator_read_write_onepage(unsigned long addr, void *val,
WARN_ON(vcpu->mmio_nr_fragments >= KVM_MAX_MMIO_FRAGMENTS);
frag = &vcpu->mmio_fragments[vcpu->mmio_nr_fragments++];
frag->gpa = gpa;
- frag->data = val;
+ if (bytes > 8u) {
+ frag->data = val;
+ } else {
+ frag->data = NULL;
+ if (write)
+ memcpy(&frag->val, val, bytes);
+ }
frag->len = bytes;
return X86EMUL_CONTINUE;
}
@@ -8241,11 +8273,16 @@ static int emulator_read_write(struct x86_emulate_ctxt *ctxt,
gpa_t gpa;
int rc;
- if (ops->read_write_prepare &&
- ops->read_write_prepare(vcpu, val, bytes))
+ if (WARN_ON_ONCE(bytes > 8u && object_is_on_stack(val)))
+ return X86EMUL_UNHANDLEABLE;
+
+ if (ops->read_mmio_fragment &&
+ ops->read_mmio_fragment(vcpu, val, bytes))
return X86EMUL_CONTINUE;
vcpu->mmio_nr_fragments = 0;
+ vcpu->mmio_head_fragment = 0;
+ vcpu->mmio_tail_fragment = 0;
/* Crossing a page boundary? */
 	if (((addr + bytes - 1) ^ addr) & PAGE_MASK) {
@@ -8275,7 +8312,6 @@ static int emulator_read_write(struct x86_emulate_ctxt *ctxt,
gpa = vcpu->mmio_fragments[0].gpa;
vcpu->mmio_needed = 1;
- vcpu->mmio_cur_fragment = 0;
vcpu->run->mmio.len = min(8u, vcpu->mmio_fragments[0].len);
 	vcpu->run->mmio.is_write = vcpu->mmio_is_write = ops->write;
@@ -11832,42 +11868,61 @@ static int complete_emulated_mmio(struct kvm_vcpu *vcpu)
 {
struct kvm_run *run = vcpu->run;
struct kvm_mmio_fragment *frag;
- unsigned len;
+ unsigned int len;
+ u8 clobber[4096];
- BUG_ON(!vcpu->mmio_needed);
+ memset(clobber, 0xaa, sizeof(clobber));
+
+ if (WARN_ON_ONCE(!vcpu->mmio_needed))
+ return -EIO;
+
+ /* Complete MMIO for the current fragment. */
+ frag = &vcpu->mmio_fragments[vcpu->mmio_tail_fragment];
- /* Complete previous fragment */
- frag = &vcpu->mmio_fragments[vcpu->mmio_cur_fragment];
len = min(8u, frag->len);
if (!vcpu->mmio_is_write)
- memcpy(frag->data, run->mmio.data, len);
+ memcpy(mmio_frag_data(frag), run->mmio.data, len);
if (frag->len <= 8) {
/* Switch to the next fragment. */
frag++;
- vcpu->mmio_cur_fragment++;
+ vcpu->mmio_tail_fragment++;
} else {
+ if (WARN_ON_ONCE(!frag->data))
+ return -EIO;
+
/* Go forward to the next mmio piece. */
frag->data += len;
frag->gpa += len;
frag->len -= len;
}
- if (vcpu->mmio_cur_fragment >= vcpu->mmio_nr_fragments) {
+ if (vcpu->mmio_tail_fragment >= vcpu->mmio_nr_fragments) {
vcpu->mmio_needed = 0;
+ WARN_ON_ONCE(vcpu->mmio_head_fragment);
- /* FIXME: return into emulator if single-stepping. */
- if (vcpu->mmio_is_write)
+ /*
+ * Don't re-emulate the instruction for MMIO writes, as KVM has
+ * already committed all side effects (and the emulator simply
+ * isn't equipped to handle that).
+ *
+ * FIXME: Return into emulator if single-stepping.
+ */
+ if (vcpu->mmio_is_write) {
+ vcpu->mmio_tail_fragment = 0;
return 1;
- vcpu->mmio_read_completed = 1;
+ }
+
return complete_emulated_io(vcpu);
}
+ len = min(8u, frag->len);
+
run->exit_reason = KVM_EXIT_MMIO;
run->mmio.phys_addr = frag->gpa;
if (vcpu->mmio_is_write)
- memcpy(run->mmio.data, frag->data, min(8u, frag->len));
- run->mmio.len = min(8u, frag->len);
+ memcpy(run->mmio.data, mmio_frag_data(frag), len);
+ run->mmio.len = len;
run->mmio.is_write = vcpu->mmio_is_write;
vcpu->arch.complete_userspace_io = complete_emulated_mmio;
return 0;
@@ -14247,15 +14302,15 @@ static int complete_sev_es_emulated_mmio(struct kvm_vcpu *vcpu)
BUG_ON(!vcpu->mmio_needed);
/* Complete previous fragment */
- frag = &vcpu->mmio_fragments[vcpu->mmio_cur_fragment];
+ frag = &vcpu->mmio_fragments[vcpu->mmio_tail_fragment];
len = min(8u, frag->len);
- if (!vcpu->mmio_is_write)
+ if (!vcpu->mmio_is_write && frag->data)
memcpy(frag->data, run->mmio.data, len);
if (frag->len <= 8) {
/* Switch to the next fragment. */
frag++;
- vcpu->mmio_cur_fragment++;
+ vcpu->mmio_tail_fragment++;
} else {
/* Go forward to the next mmio piece. */
frag->data += len;
@@ -14263,7 +14318,7 @@ static int complete_sev_es_emulated_mmio(struct kvm_vcpu *vcpu)
frag->len -= len;
}
- if (vcpu->mmio_cur_fragment >= vcpu->mmio_nr_fragments) {
+ if (vcpu->mmio_tail_fragment >= vcpu->mmio_nr_fragments) {
vcpu->mmio_needed = 0;
 		// VMG change, at this point, we're always done
@@ -14303,13 +14358,14 @@ int kvm_sev_es_mmio_write(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned int bytes,
/*TODO: Check if need to increment number of frags */
frag = vcpu->mmio_fragments;
- vcpu->mmio_nr_fragments = 1;
frag->len = bytes;
frag->gpa = gpa;
frag->data = data;
vcpu->mmio_needed = 1;
- vcpu->mmio_cur_fragment = 0;
+ vcpu->mmio_nr_fragments = 1;
+ vcpu->mmio_head_fragment = 0;
+ vcpu->mmio_tail_fragment = 0;
vcpu->run->mmio.phys_addr = gpa;
 	vcpu->run->mmio.len = min(8u, frag->len);
@@ -14342,13 +14398,14 @@ int kvm_sev_es_mmio_read(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned int bytes,
/*TODO: Check if need to increment number of frags */
frag = vcpu->mmio_fragments;
- vcpu->mmio_nr_fragments = 1;
frag->len = bytes;
frag->gpa = gpa;
frag->data = data;
vcpu->mmio_needed = 1;
- vcpu->mmio_cur_fragment = 0;
+ vcpu->mmio_nr_fragments = 1;
+ vcpu->mmio_head_fragment = 0;
+ vcpu->mmio_tail_fragment = 0;
vcpu->run->mmio.phys_addr = gpa;
 	vcpu->run->mmio.len = min(8u, frag->len);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 782f4d670793..be4b9de5b8c9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -320,7 +320,8 @@ static inline bool kvm_vcpu_can_poll(ktime_t cur, ktime_t stop)
 struct kvm_mmio_fragment {
gpa_t gpa;
void *data;
- unsigned len;
+ u64 val;
+ unsigned int len;
};
struct kvm_vcpu {
@@ -356,7 +357,8 @@ struct kvm_vcpu {
int mmio_needed;
int mmio_read_completed;
int mmio_is_write;
- int mmio_cur_fragment;
+ int mmio_head_fragment;
+ int mmio_tail_fragment;
int mmio_nr_fragments;
struct kvm_mmio_fragment mmio_fragments[KVM_MAX_MMIO_FRAGMENTS];
#endif
base-commit: e944fe2c09f405a2e2d147145c9b470084bc4c9a
--
^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: Re: [BUG REPORT] USE_AFTER_FREE in complete_emulated_mmio found by KASAN/Syzkaller fuzz test (v5.10.0)
2026-02-10 11:56 Re: [BUG REPORT] USE_AFTER_FREE in complete_emulated_mmio found by KASAN/Syzkaller fuzz test (v5.10.0) Zhangjiaji
@ 2026-02-10 19:11 ` Sean Christopherson
2026-02-18 20:56 ` Sean Christopherson
0 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2026-02-10 19:11 UTC (permalink / raw)
To: Zhangjiaji
Cc: Paolo Bonzini, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
Wangqinxiao (Tom), zhangyashu, wangyanan (Y)
On Tue, Feb 10, 2026, Zhangjiaji wrote:
> > I think there's a not-completely-awful solution buried in this gigantic cesspool.
> > The only time KVM uses on-stack variables is for qword or smaller accesses, i.e.
> > 8 bytes in size or less. For larger fragments, e.g. AVX to/from MMIO, the target
> > value will always be an operand in the emulator context. And so rather than
> > disallow stack variables, for "small" fragments, we can rework the handling to
> > copy the value to/from each fragment on-demand instead of stashing a pointer to
> > the value.
>
> Since we can store the frag->val in struct kvm_mmio_fragment,
> why not just point frag->data to it? This Way we can save a lot code about
> (frag->data == NULL).
It's not quite that simple, because we need to handle reads as well.
> Though this patch will block any read-into-stack calls, we can add a special path
> in function emulator_read_write handling feasible read-into-stack calls -- the
> target is released just after emulator_read_write returns.
>
> ---
> arch/x86/kvm/x86.c | 9 ++++++++-
> include/linux/kvm_host.h | 3 ++-
> 2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 72d37c8930ad..12d53d441a39 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8197,7 +8197,14 @@ static int emulator_read_write_onepage(unsigned long addr, void *val,
> WARN_ON(vcpu->mmio_nr_fragments >= KVM_MAX_MMIO_FRAGMENTS);
> frag = &vcpu->mmio_fragments[vcpu->mmio_nr_fragments++];
> frag->gpa = gpa;
> - frag->data = val;
> + if (bytes > 8u || ! write) {
> + if (WARN_ON_ONCE(object_is_on_stack(val)))
This is user-triggerable, e.g. em_popa(), em_pop_sreg(), emulate_iret_real(),
em_ret_near_imm(), em_ret_far(), and em_ret().
That said, I do like redirecting frag->data to &frag->val instead of nullifying
the pointer. If the change to tracking head+tail is isolated as a prep commit,
the diff isn't actually that scary (see below). Combined with a reworked loop
for read_mmio_fragment() to make it easier to follow (still need to add comments),
fixing this isn't as insane as I originally worried.
static int read_mmio_fragment(struct kvm_vcpu *vcpu, void *val, int bytes)
{
	int *head = &vcpu->mmio_head_fragment;
	int tail = vcpu->mmio_tail_fragment;
	struct kvm_mmio_fragment *frag;

	if (vcpu->mmio_head_fragment >= vcpu->mmio_tail_fragment)
		return 0;

	if (WARN_ON_ONCE(tail > vcpu->mmio_nr_fragments ||
			 tail > ARRAY_SIZE(vcpu->mmio_fragments)))
		return 0;

	for ( ; *head < tail; ++(*head)) {
		frag = &vcpu->mmio_fragments[*head];

		if (WARN_ON_ONCE(bytes < frag->len))
			break;

		if (frag->data == &frag->val)
			memcpy(val, frag->data, frag->len);

		val += frag->len;
		bytes -= frag->len;
	}

	trace_kvm_mmio(KVM_TRACE_MMIO_READ, bytes, frag->gpa, val);
	return 1;
}
I'll put together a series, there are a pile of cleanups that can be done, and
I want to comment the snot out of all of this because every time I end up in this
code I have to re-learn the subtleties.
---
arch/x86/kvm/x86.c | 28 +++++++++++++++++++++++-----
include/linux/kvm_host.h | 3 ++-
2 files changed, 25 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 8f698d68d85e..5886f082b5d6 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8138,6 +8138,9 @@ static int read_mmio_fragment(struct kvm_vcpu *vcpu, void *val, int bytes)
if (WARN_ON_ONCE(bytes < frag->len))
break;
+ if (frag->data == &frag->val)
+ memcpy(val, frag->data, frag->len);
+
val += frag->len;
bytes -= frag->len;
}
@@ -8240,7 +8243,14 @@ static int emulator_read_write_onepage(unsigned long addr, void *val,
WARN_ON(vcpu->mmio_nr_fragments >= KVM_MAX_MMIO_FRAGMENTS);
frag = &vcpu->mmio_fragments[vcpu->mmio_nr_fragments++];
frag->gpa = gpa;
- frag->data = val;
+ if (bytes > 8u) {
+ frag->data = val;
+ } else {
+ frag->val = 0;
+ frag->data = &frag->val;
+ if (write)
+ memcpy(&frag->val, val, bytes);
+ }
frag->len = bytes;
return X86EMUL_CONTINUE;
}
@@ -8255,6 +8265,9 @@ static int emulator_read_write(struct x86_emulate_ctxt *ctxt,
gpa_t gpa;
int rc;
+ if (WARN_ON_ONCE(bytes > 8u && object_is_on_stack(val)))
+ return X86EMUL_UNHANDLEABLE;
+
if (ops->read_mmio_fragment &&
ops->read_mmio_fragment(vcpu, val, bytes))
return X86EMUL_CONTINUE;
@@ -11863,6 +11876,9 @@ static int complete_emulated_mmio(struct kvm_vcpu *vcpu)
frag++;
vcpu->mmio_tail_fragment++;
} else {
+ if (WARN_ON_ONCE(frag->data == &frag->val))
+ return -EIO;
+
/* Go forward to the next mmio piece. */
frag->data += len;
frag->gpa += len;
@@ -14291,8 +14307,10 @@ static int complete_sev_es_emulated_mmio(struct kvm_vcpu *vcpu)
vcpu->mmio_needed = 0;
vcpu->mmio_tail_fragment = 0;
- // VMG change, at this point, we're always done
- // RIP has already been advanced
+ /*
+ * All done, as frag->data always points at the GHCB scratch
+ * area and VMGEXIT is trap-like (RIP is advanced by hardware).
+ */
return 1;
}
@@ -14315,7 +14333,7 @@ int kvm_sev_es_mmio_write(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned int bytes,
int handled;
struct kvm_mmio_fragment *frag;
- if (!data)
+ if (!data || WARN_ON_ONCE(object_is_on_stack(data)))
return -EINVAL;
handled = write_emultor.read_write_mmio(vcpu, gpa, bytes, data);
@@ -14355,7 +14373,7 @@ int kvm_sev_es_mmio_read(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned int bytes,
int handled;
struct kvm_mmio_fragment *frag;
- if (!data)
+ if (!data || WARN_ON_ONCE(object_is_on_stack(data)))
return -EINVAL;
handled = read_emultor.read_write_mmio(vcpu, gpa, bytes, data);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 919682c6faeb..be4b9de5b8c9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -320,7 +320,8 @@ static inline bool kvm_vcpu_can_poll(ktime_t cur, ktime_t stop)
struct kvm_mmio_fragment {
gpa_t gpa;
void *data;
- unsigned len;
+ u64 val;
+ unsigned int len;
};
struct kvm_vcpu {
--
^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: Re: [BUG REPORT] USE_AFTER_FREE in complete_emulated_mmio found by KASAN/Syzkaller fuzz test (v5.10.0)
2026-02-10 19:11 ` Sean Christopherson
@ 2026-02-18 20:56 ` Sean Christopherson
2026-05-08 7:57 ` Xinyu Zheng
0 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2026-02-18 20:56 UTC (permalink / raw)
To: Zhangjiaji
Cc: Paolo Bonzini, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
Wangqinxiao (Tom), zhangyashu, wangyanan (Y)
On Tue, Feb 10, 2026, Sean Christopherson wrote:
> On Tue, Feb 10, 2026, Zhangjiaji wrote:
> > > I think there's a not-completely-awful solution buried in this gigantic cesspool.
> > > The only time KVM uses on-stack variables is for qword or smaller accesses, i.e.
> > > 8 bytes in size or less. For larger fragments, e.g. AVX to/from MMIO, the target
> > > value will always be an operand in the emulator context. And so rather than
> > > disallow stack variables, for "small" fragments, we can rework the handling to
> > > copy the value to/from each fragment on-demand instead of stashing a pointer to
> > > the value.
> >
> > Since we can store the frag->val in struct kvm_mmio_fragment,
> > why not just point frag->data to it? This Way we can save a lot code about
> > (frag->data == NULL).
>
> It's not quite that simple, because we need to handle reads as well.
>
> > Though this patch will block any read-into-stack calls, we can add a special path
> > in function emulator_read_write handling feasible read-into-stack calls -- the
> > target is released just after emulator_read_write returns.
> >
> > ---
> > arch/x86/kvm/x86.c | 9 ++++++++-
> > include/linux/kvm_host.h | 3 ++-
> > 2 files changed, 10 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 72d37c8930ad..12d53d441a39 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -8197,7 +8197,14 @@ static int emulator_read_write_onepage(unsigned long addr, void *val,
> > WARN_ON(vcpu->mmio_nr_fragments >= KVM_MAX_MMIO_FRAGMENTS);
> > frag = &vcpu->mmio_fragments[vcpu->mmio_nr_fragments++];
> > frag->gpa = gpa;
> > - frag->data = val;
> > + if (bytes > 8u || ! write) {
> > + if (WARN_ON_ONCE(object_is_on_stack(val)))
>
> This is user-triggerable, e.g. em_popa(), em_pop_sreg(), emulate_iret_real(),
> em_ret_near_imm(), em_ret_far(), and em_ret().
*sigh*
And I was wrong. I finally sat down to write some comments for all of this, and
realized that reads _never_ pass an on-stack @val to emulator_read_write_onepage(),
because read_emulated() always buffers reads through ctxt->mem_read.
So not only is my fancy, complex code unnecessary, it's actively broken. If a
read splits a page boundary, and the first page is NOT emulated MMIO, trying to
fulfill the read on-demand falls apart because the @val points at the start of
the operand (technically its cache "entry"). I'm sure that's a solvable problem,
but I don't see any point in manufacturing a problem in the first place.
I need to write a changelog, but as Yashu suggested, the fix can more simply be:
--
From: Sean Christopherson <seanjc@google.com>
Date: Tue, 10 Feb 2026 09:45:37 -0800
Subject: [PATCH 01/14] KVM: x86: Use scratch field in MMIO fragment to hold
small write values
Fixes: f78146b0f923 ("KVM: Fix page-crossing MMIO")
Suggested-by: Yashu Zhang <zhangjiaji1@huawei.com>
Reported-by: Yashu Zhang <zhangjiaji1@huawei.com>
Closes: https://lore.kernel.org/all/369eaaa2b3c1425c85e8477066391bc7@huawei.com
Cc: stable@vger.kernel.org
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
arch/x86/kvm/x86.c | 14 +++++++++++++-
include/linux/kvm_host.h | 3 ++-
2 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index db3f393192d9..ff3a6f86973f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8226,7 +8226,13 @@ static int emulator_read_write_onepage(unsigned long addr, void *val,
WARN_ON(vcpu->mmio_nr_fragments >= KVM_MAX_MMIO_FRAGMENTS);
frag = &vcpu->mmio_fragments[vcpu->mmio_nr_fragments++];
frag->gpa = gpa;
- frag->data = val;
+ if (write && bytes <= 8u) {
+ frag->val = 0;
+ frag->data = &frag->val;
+ memcpy(&frag->val, val, bytes);
+ } else {
+ frag->data = val;
+ }
frag->len = bytes;
return X86EMUL_CONTINUE;
}
@@ -8241,6 +8247,9 @@ static int emulator_read_write(struct x86_emulate_ctxt *ctxt,
gpa_t gpa;
int rc;
+ if (WARN_ON_ONCE((bytes > 8u || !ops->write) && object_is_on_stack(val)))
+ return X86EMUL_UNHANDLEABLE;
+
if (ops->read_write_prepare &&
ops->read_write_prepare(vcpu, val, bytes))
return X86EMUL_CONTINUE;
@@ -11847,6 +11856,9 @@ static int complete_emulated_mmio(struct kvm_vcpu *vcpu)
frag++;
vcpu->mmio_cur_fragment++;
} else {
+ if (WARN_ON_ONCE(frag->data == &frag->val))
+ return -EIO;
+
/* Go forward to the next mmio piece. */
frag->data += len;
frag->gpa += len;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 2c7d76262898..0bb2a34fb93d 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -320,7 +320,8 @@ static inline bool kvm_vcpu_can_poll(ktime_t cur, ktime_t stop)
struct kvm_mmio_fragment {
gpa_t gpa;
void *data;
- unsigned len;
+ u64 val;
+ unsigned int len;
};
struct kvm_vcpu {
base-commit: 183bb0ce8c77b0fd1fb25874112bc8751a461e49
--
^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [BUG REPORT] USE_AFTER_FREE in complete_emulated_mmio found by KASAN/Syzkaller fuzz test (v5.10.0)
2026-02-18 20:56 ` Sean Christopherson
@ 2026-05-08 7:57 ` Xinyu Zheng
2026-05-08 14:25 ` Sean Christopherson
0 siblings, 1 reply; 12+ messages in thread
From: Xinyu Zheng @ 2026-05-08 7:57 UTC (permalink / raw)
To: Sean Christopherson, Zhangjiaji
Cc: Paolo Bonzini, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
Wangqinxiao (Tom), zhangyashu, wangyanan (Y), zouyipeng,
wangyanan55
On 2/19/2026 4:56 AM, Sean Christopherson wrote:
> On Tue, Feb 10, 2026, Sean Christopherson wrote:
>> On Tue, Feb 10, 2026, Zhangjiaji wrote:
>>>> I think there's a not-completely-awful solution buried in this gigantic cesspool.
>>>> The only time KVM uses on-stack variables is for qword or smaller accesses, i.e.
>>>> 8 bytes in size or less. For larger fragments, e.g. AVX to/from MMIO, the target
>>>> value will always be an operand in the emulator context. And so rather than
>>>> disallow stack variables, for "small" fragments, we can rework the handling to
>>>> copy the value to/from each fragment on-demand instead of stashing a pointer to
>>>> the value.
>>>
>>> Since we can store the frag->val in struct kvm_mmio_fragment,
>>> why not just point frag->data to it? This Way we can save a lot code about
>>> (frag->data == NULL).
>>
>> It's not quite that simple, because we need to handle reads as well.
>>
>>> Though this patch will block any read-into-stack calls, we can add a special path
>>> in function emulator_read_write handling feasible read-into-stack calls -- the
>>> target is released just after emulator_read_write returns.
>>>
>>> ---
>>> arch/x86/kvm/x86.c | 9 ++++++++-
>>> include/linux/kvm_host.h | 3 ++-
>>> 2 files changed, 10 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>>> index 72d37c8930ad..12d53d441a39 100644
>>> --- a/arch/x86/kvm/x86.c
>>> +++ b/arch/x86/kvm/x86.c
>>> @@ -8197,7 +8197,14 @@ static int emulator_read_write_onepage(unsigned long addr, void *val,
>>> WARN_ON(vcpu->mmio_nr_fragments >= KVM_MAX_MMIO_FRAGMENTS);
>>> frag = &vcpu->mmio_fragments[vcpu->mmio_nr_fragments++];
>>> frag->gpa = gpa;
>>> - frag->data = val;
>>> + if (bytes > 8u || ! write) {
>>> + if (WARN_ON_ONCE(object_is_on_stack(val)))
>>
>> This is user-triggerable, e.g. em_popa(), em_pop_sreg(), emulate_iret_real(),
>> em_ret_near_imm(), em_ret_far(), and em_ret().
>
> *sigh*
>
> And I was wrong. I finally sat down to write some comments for all of this, and
> realized that reads _never_ pass an on-stack @val to emulator_read_write_onepage(),
> because read_emulated() always buffers reads through ctxt->mem_read.
>
> So not only is my fancy, complex code unnecessary, it's actively broken. If a
> read splits a page boundary, and the first page is NOT emulated MMIO, trying to
> fulfill the read on-demand falls apart because the @val points at the start of
> the operand (technically its cache "entry"). I'm sure that's a solvable problem,
> but I don't see any point in manufacturing a problem in the first place.
>
> I need to write a changelog, but as Yashu suggested, the fix can more simply be:
>
> --
> From: Sean Christopherson <seanjc@google.com>
> Date: Tue, 10 Feb 2026 09:45:37 -0800
> Subject: [PATCH 01/14] KVM: x86: Use scratch field in MMIO fragment to hold
> small write values
>
> Fixes: f78146b0f923 ("KVM: Fix page-crossing MMIO")
> Suggested-by: Yashu Zhang <zhangjiaji1@huawei.com>
> Reported-by: Yashu Zhang <zhangjiaji1@huawei.com>
> Closes: https://lore.kernel.org/all/369eaaa2b3c1425c85e8477066391bc7@huawei.com
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kvm/x86.c | 14 +++++++++++++-
> include/linux/kvm_host.h | 3 ++-
> 2 files changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index db3f393192d9..ff3a6f86973f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -8226,7 +8226,13 @@ static int emulator_read_write_onepage(unsigned long addr, void *val,
> WARN_ON(vcpu->mmio_nr_fragments >= KVM_MAX_MMIO_FRAGMENTS);
> frag = &vcpu->mmio_fragments[vcpu->mmio_nr_fragments++];
> frag->gpa = gpa;
> - frag->data = val;
> + if (write && bytes <= 8u) {
> + frag->val = 0;
> + frag->data = &frag->val;
> + memcpy(&frag->val, val, bytes);
> + } else {
> + frag->data = val;
> + }
> frag->len = bytes;
> return X86EMUL_CONTINUE;
> }
> @@ -8241,6 +8247,9 @@ static int emulator_read_write(struct x86_emulate_ctxt *ctxt,
> gpa_t gpa;
> int rc;
>
> + if (WARN_ON_ONCE((bytes > 8u || !ops->write) && object_is_on_stack(val)))
> + return X86EMUL_UNHANDLEABLE;
> +
> if (ops->read_write_prepare &&
> ops->read_write_prepare(vcpu, val, bytes))
> return X86EMUL_CONTINUE;
> @@ -11847,6 +11856,9 @@ static int complete_emulated_mmio(struct kvm_vcpu *vcpu)
> frag++;
> vcpu->mmio_cur_fragment++;
> } else {
> + if (WARN_ON_ONCE(frag->data == &frag->val))
> + return -EIO;
> +
> /* Go forward to the next mmio piece. */
> frag->data += len;
> frag->gpa += len;
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 2c7d76262898..0bb2a34fb93d 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -320,7 +320,8 @@ static inline bool kvm_vcpu_can_poll(ktime_t cur, ktime_t stop)
> struct kvm_mmio_fragment {
> gpa_t gpa;
> void *data;
> - unsigned len;
> + u64 val;
Hi, Jiayi and Sean,
Since this change caused a KABI consistency break for me, I am looking for a
way to avoid adding kvm_mmio_fragment.val.
Could I instead directly allocate an 8-byte buffer for kvm_mmio_fragment.data,
and free that buffer in complete_emulated_mmio() once all fragments have been
copied?
Thanks!
> + unsigned int len;
> };
>
> struct kvm_vcpu {
>
> base-commit: 183bb0ce8c77b0fd1fb25874112bc8751a461e49
> --
--
Xinyu
^ permalink raw reply [flat|nested] 12+ messages in thread

* Re: [BUG REPORT] USE_AFTER_FREE in complete_emulated_mmio found by KASAN/Syzkaller fuzz test (v5.10.0)
2026-05-08 7:57 ` Xinyu Zheng
@ 2026-05-08 14:25 ` Sean Christopherson
2026-05-09 1:55 ` Xinyu
0 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2026-05-08 14:25 UTC (permalink / raw)
To: Xinyu Zheng
Cc: Zhangjiaji, Paolo Bonzini, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, Wangqinxiao (Tom), zhangyashu,
wangyanan (Y), zouyipeng
On Fri, May 08, 2026, Xinyu Zheng wrote:
> On 2/19/2026 4:56 AM, Sean Christopherson wrote:
> > On Tue, Feb 10, 2026, Sean Christopherson wrote:
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 2c7d76262898..0bb2a34fb93d 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -320,7 +320,8 @@ static inline bool kvm_vcpu_can_poll(ktime_t cur, ktime_t stop)
> > struct kvm_mmio_fragment {
> > gpa_t gpa;
> > void *data;
> > - unsigned len;
> > + u64 val;
>
> Hi, Jiayi and Sean,
>
> Since this change caused a KABI consistency break for me, I am looking for a
> way to avoid adding kvm_mmio_fragment.val.
I assume you're looking for a solution for a private/proprietary kernel? I.e.
not trying to figure out a solution for an upstream LTS kernel?
> Could I instead directly allocate an 8-byte buffer for kvm_mmio_fragment.data,
> and free that buffer in complete_emulated_mmio() once all fragments have been
> copied?
I highly doubt that will work, because you'd still need to stash the pointer
somewhere. And it pretty much would have to be somewhere in kvm_vcpu, which
would likely mean a change in KABI. FWIW, freeing the allocation in
complete_emulated_mmio() wouldn't suffice; you'd also need to free the memory
on vCPU destruction, because there's no guarantee userspace would complete
KVM_RUN.
You should be able to use the padding in the "kvm_run.s". Thanks to s390's
massive regs size, there's a huge amount of unused space in the union on x86.
Note, because there can be two fragments in-flight, you'd need to index the
array using the correct fragment number.
Userspace can scribble the value, but that's completely irrelevant from a host
safety perspective.
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6c8afa2047bf..29c7123d5467 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -515,7 +515,10 @@ struct kvm_run {
__u64 kvm_valid_regs;
__u64 kvm_dirty_regs;
union {
- struct kvm_sync_regs regs;
+ struct {
+ struct kvm_sync_regs regs;
+ u64 x86_mmio_val[2];
+ };
char padding[SYNC_REGS_SIZE_BYTES];
} s;
};
^ permalink raw reply related [flat|nested] 12+ messages in thread

* Re: [BUG REPORT] USE_AFTER_FREE in complete_emulated_mmio found by KASAN/Syzkaller fuzz test (v5.10.0)
2026-05-08 14:25 ` Sean Christopherson
@ 2026-05-09 1:55 ` Xinyu
0 siblings, 0 replies; 12+ messages in thread
From: Xinyu @ 2026-05-09 1:55 UTC (permalink / raw)
To: Sean Christopherson
Cc: Zhangjiaji, Paolo Bonzini, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, Wangqinxiao (Tom), zhangyashu,
wangyanan (Y), zouyipeng, zhengxinyu6
On 5/8/2026 10:25 PM, Sean Christopherson wrote:
> On Fri, May 08, 2026, Xinyu Zheng wrote:
>> On 2/19/2026 4:56 AM, Sean Christopherson wrote:
>>> On Tue, Feb 10, 2026, Sean Christopherson wrote:
>>> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
>>> index 2c7d76262898..0bb2a34fb93d 100644
>>> --- a/include/linux/kvm_host.h
>>> +++ b/include/linux/kvm_host.h
>>> @@ -320,7 +320,8 @@ static inline bool kvm_vcpu_can_poll(ktime_t cur, ktime_t stop)
>>> struct kvm_mmio_fragment {
>>> gpa_t gpa;
>>> void *data;
>>> - unsigned len;
>>> + u64 val;
>>
>> Hi, Jiayi and Sean,
>>
>> Since this change caused a KABI consistency break for me, I am looking for a
>> way to avoid adding kvm_mmio_fragment.val.
>
> I assume you're looking for a solution for a private/proprietary kernel? I.e.
> not trying to figure out a solution for an upstream LTS kernel?
Yes.
>
>> Could I instead directly allocate an 8-byte buffer for kvm_mmio_fragment.data,
>> and free that buffer in complete_emulated_mmio() once all fragments have
>> been copied?
>
> I highly doubt that will work, because you'd still need to stash the pointer
> somewhere. And it pretty much would have to be somewhere in kvm_vcpu, which
> would likely mean a change in KABI. FWIW, freeing the allocation in
> complete_emulated_mmio() wouldn't suffice; you'd also need to free the memory
> on vCPU destruction, because there's no guarantee userspace would complete
> KVM_RUN.
Thanks for your correction and patient reply!
>
> You should be able to use the padding in the "kvm_run.s". Thanks to s390's
> massive regs size, there's a huge amount of unused space in the union on x86.
> Note, because there can be two fragments in-flight, you'd need to index the
> array using the correct fragment number.
>
> Userspace can scribble the value, but that's completely irrelevant from a host
> safety perspective.
>
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 6c8afa2047bf..29c7123d5467 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -515,7 +515,10 @@ struct kvm_run {
> __u64 kvm_valid_regs;
> __u64 kvm_dirty_regs;
> union {
> - struct kvm_sync_regs regs;
> + struct {
> + struct kvm_sync_regs regs;
> + u64 x86_mmio_val[2];
> + };
> char padding[SYNC_REGS_SIZE_BYTES];
> } s;
> };
I will take this suggestion.
--
Xinyu Zheng
^ permalink raw reply [flat|nested] 12+ messages in thread
* [BUG REPORT] USE_AFTER_FREE in complete_emulated_mmio found by KASAN/Syzkaller fuzz test (v5.10.0)
@ 2026-02-02 1:24 Zhangjiaji
2026-02-06 23:30 ` Sean Christopherson
2026-02-10 6:26 ` Paolo Bonzini
0 siblings, 2 replies; 12+ messages in thread
From: Zhangjiaji @ 2026-02-02 1:24 UTC (permalink / raw)
To: Sean Christopherson, Paolo Bonzini, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org
Cc: Wangqinxiao (Tom), zhangyashu, wangyanan (Y)
Syzkaller hit 'KASAN: use-after-free Read in complete_emulated_mmio' bug.
==================================================================
BUG: KASAN: use-after-free in complete_emulated_mmio+0x305/0x420
Read of size 1 at addr ffff888009c378d1 by task syz-executor417/984
CPU: 1 PID: 984 Comm: syz-executor417 Not tainted 5.10.0-182.0.0.95.h2627.eulerosv2r13.x86_64 #3
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
Call Trace:
dump_stack+0xbe/0xfd
print_address_description.constprop.0+0x19/0x170
__kasan_report.cold+0x6c/0x84
kasan_report+0x3a/0x50
check_memory_region+0xfd/0x1f0
memcpy+0x20/0x60
complete_emulated_mmio+0x305/0x420
kvm_arch_vcpu_ioctl_run+0x63f/0x6d0
kvm_vcpu_ioctl+0x413/0xb20
__se_sys_ioctl+0x111/0x160
do_syscall_64+0x30/0x40
entry_SYSCALL_64_after_hwframe+0x67/0xd1
RIP: 0033:0x42477d
Code: 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007faa8e6890e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 00000000004d7338 RCX: 000000000042477d
RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000005
RBP: 00000000004d7330 R08: 00007fff28d546df R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004d733c
R13: 0000000000000000 R14: 000000000040a200 R15: 00007fff28d54720
The buggy address belongs to the page:
page:0000000029f6a428 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x9c37
flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
raw: 000fffffc0000000 0000000000000000 ffffea0000270dc8 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888009c37780: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff888009c37800: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>ffff888009c37880: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
^
ffff888009c37900: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
ffff888009c37980: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
==================================================================
Syzkaller reproducer:
# {Threaded:true Repeat:true RepeatTimes:0 Procs:1 Slowdown:1 Sandbox: SandboxArg:0 Leak:false NetInjection:false NetDevices:false NetReset:false Cgroups:false BinfmtMisc:false CloseFDs:false KCSAN:false DevlinkPCI:false NicVF:false USB:false VhciInjection:false Wifi:false IEEE802154:false Sysctl:false Swap:false UseTmpDir:false HandleSegv:true Repro:false Trace:false LegacyOptions:{Collide:false Fault:false FaultCall:0 FaultNth:0}}
r0 = openat$kvm(0xffffffffffffff9c, &(0x7f00000001c0), 0x0, 0x0)
r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
r2 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x0)
syz_kvm_setup_cpu$x86(r1, r2, &(0x7f0000fe2000/0x18000)=nil, &(0x7f0000000080)=[@text32={0x20, &(0x7f0000000000)="44c8df2020c020c003000000000f22c0671e26653e360f2224660f65b600000000b9e0450200f5e8f5e8f30f1ed6c744240000100000c744240200000000c7442406000000000f011424eacf5700000301b8010000000f01c1", 0x59}], 0x1, 0x27, 0x0, 0x1)
ioctl$KVM_RUN(r2, 0xae80, 0x0)
ioctl$KVM_SMI(0xffffffffffffffff, 0xaeb7) (async)
ioctl$KVM_RUN(r2, 0xae80, 0x0)
----------------------------
Hi,
I've analyzed the Syzkaller output and the complete_emulated_mmio() code path.
The buggy address originates in em_enter(), which passes its local variable `ulong rbp` to emulate_push(); the address ultimately ends up in emulator_read_write_onepage(), which stores it in vcpu->mmio_fragments[].data.
The bug triggers when the KVM guest executes an "enter" instruction and the top of the stack crosses a memory page boundary.
In that case, em_enter() cannot complete the instruction by itself and leaves the remaining data (in the other page) to complete_emulated_mmio().
By the time complete_emulated_mmio() runs, em_enter() has returned, so the local variable `ulong rbp` has been released.
complete_emulated_mmio() then accesses vcpu->mmio_fragments[].data, and the use-after-free occurs.
Any ideas?
--
Best regards,
Yashu Zhang
^ permalink raw reply [flat|nested] 12+ messages in thread

* Re: [BUG REPORT] USE_AFTER_FREE in complete_emulated_mmio found by KASAN/Syzkaller fuzz test (v5.10.0)
2026-02-02 1:24 Zhangjiaji
@ 2026-02-06 23:30 ` Sean Christopherson
2026-02-09 20:21 ` Sean Christopherson
2026-02-10 6:26 ` Paolo Bonzini
1 sibling, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2026-02-06 23:30 UTC (permalink / raw)
To: Zhangjiaji
Cc: Paolo Bonzini, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
Wangqinxiao (Tom), zhangyashu, wangyanan (Y)
On Mon, Feb 02, 2026, Zhangjiaji wrote:
> Syzkaller hit 'KASAN: use-after-free Read in complete_emulated_mmio' bug.
>
> ==================================================================
> BUG: KASAN: use-after-free in complete_emulated_mmio+0x305/0x420
> Read of size 1 at addr ffff888009c378d1 by task syz-executor417/984
>
> CPU: 1 PID: 984 Comm: syz-executor417 Not tainted 5.10.0-182.0.0.95.h2627.eulerosv2r13.x86_64 #3
> Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
> Call Trace:
> dump_stack+0xbe/0xfd
> print_address_description.constprop.0+0x19/0x170
> __kasan_report.cold+0x6c/0x84
> kasan_report+0x3a/0x50
> check_memory_region+0xfd/0x1f0
> memcpy+0x20/0x60
> complete_emulated_mmio+0x305/0x420
> kvm_arch_vcpu_ioctl_run+0x63f/0x6d0
> kvm_vcpu_ioctl+0x413/0xb20
> __se_sys_ioctl+0x111/0x160
> do_syscall_64+0x30/0x40
> entry_SYSCALL_64_after_hwframe+0x67/0xd1
> RIP: 0033:0x42477d
> Code: 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
> RSP: 002b:00007faa8e6890e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
> RAX: ffffffffffffffda RBX: 00000000004d7338 RCX: 000000000042477d
> RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000005
> RBP: 00000000004d7330 R08: 00007fff28d546df R09: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000246 R12: 00000000004d733c
> R13: 0000000000000000 R14: 000000000040a200 R15: 00007fff28d54720
>
> The buggy address belongs to the page:
> page:0000000029f6a428 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x9c37
> flags: 0xfffffc0000000(node=0|zone=1|lastcpupid=0x1fffff)
> raw: 000fffffc0000000 0000000000000000 ffffea0000270dc8 0000000000000000
> raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
> page dumped because: kasan: bad access detected
>
> Memory state around the buggy address:
> ffff888009c37780: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> ffff888009c37800: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> >ffff888009c37880: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> ^
> ffff888009c37900: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> ffff888009c37980: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
> ==================================================================
>
>
> Syzkaller reproducer:
> # {Threaded:true Repeat:true RepeatTimes:0 Procs:1 Slowdown:1 Sandbox: SandboxArg:0 Leak:false NetInjection:false NetDevices:false NetReset:false Cgroups:false BinfmtMisc:false CloseFDs:false KCSAN:false DevlinkPCI:false NicVF:false USB:false VhciInjection:false Wifi:false IEEE802154:false Sysctl:false Swap:false UseTmpDir:false HandleSegv:true Repro:false Trace:false LegacyOptions:{Collide:false Fault:false FaultCall:0 FaultNth:0}}
> r0 = openat$kvm(0xffffffffffffff9c, &(0x7f00000001c0), 0x0, 0x0)
> r1 = ioctl$KVM_CREATE_VM(r0, 0xae01, 0x0)
> r2 = ioctl$KVM_CREATE_VCPU(r1, 0xae41, 0x0)
> syz_kvm_setup_cpu$x86(r1, r2, &(0x7f0000fe2000/0x18000)=nil, &(0x7f0000000080)=[@text32={0x20, &(0x7f0000000000)="44c8df2020c020c003000000000f22c0671e26653e360f2224660f65b600000000b9e0450200f5e8f5e8f30f1ed6c744240000100000c744240200000000c7442406000000000f011424eacf5700000301b8010000000f01c1", 0x59}], 0x1, 0x27, 0x0, 0x1)
> ioctl$KVM_RUN(r2, 0xae80, 0x0)
> ioctl$KVM_SMI(0xffffffffffffffff, 0xaeb7) (async)
> ioctl$KVM_RUN(r2, 0xae80, 0x0)
>
>
> ----------------------------
> Hi,
>
> I've analyzed the Syzkaller output and the complete_emulated_mmio() code
> path. The buggy address originates in em_enter(), which passes its local
> variable `ulong rbp` to emulate_push(); the address ultimately ends up in
> emulator_read_write_onepage(), which stores it in
> vcpu->mmio_fragments[].data. The bug triggers when the KVM guest executes
> an "enter" instruction and the top of the stack crosses a memory page
> boundary. In that case, em_enter() cannot complete the instruction by
> itself and leaves the remaining data (in the other page) to
> complete_emulated_mmio(). By the time complete_emulated_mmio() runs,
> em_enter() has returned, so the local variable `ulong rbp` has been
> released. complete_emulated_mmio() then accesses
> vcpu->mmio_fragments[].data, and the use-after-free occurs.
>
> Any ideas?
Egad, sorry! I had reproduced this shortly after you sent the report and prepped
a fix, but got distracted and lost this in my inbox.
Can you test this on your end? I repro'd by modifying a KVM-Unit-Test and hacking
KVM to tweak the stack, so I haven't confirmed the syzkaller version.
It's a bit gross, as it abuses an unused field as scratch space, but AFAICT that's
"fine". The alternative would be add a dedicated field, which seems like overkill?
I'm also going to try and add a WARN to detect if the @val parameter passed to
emulator_read_write() is ever on the kernel stack, e.g. to help detect lurking
bugs like this one without relying on KASAN.
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index c8e292e9a24d..dacef51c2565 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1897,13 +1897,12 @@ static int em_enter(struct x86_emulate_ctxt *ctxt)
int rc;
unsigned frame_size = ctxt->src.val;
unsigned nesting_level = ctxt->src2.val & 31;
- ulong rbp;
if (nesting_level)
return X86EMUL_UNHANDLEABLE;
- rbp = reg_read(ctxt, VCPU_REGS_RBP);
- rc = emulate_push(ctxt, &rbp, stack_size(ctxt));
+ ctxt->memop.orig_val = reg_read(ctxt, VCPU_REGS_RBP);
+ rc = emulate_push(ctxt, &ctxt->memop.orig_val, stack_size(ctxt));
if (rc != X86EMUL_CONTINUE)
return rc;
assign_masked(reg_rmw(ctxt, VCPU_REGS_RBP), reg_read(ctxt, VCPU_REGS_RSP),
^ permalink raw reply related [flat|nested] 12+ messages in thread

* Re: [BUG REPORT] USE_AFTER_FREE in complete_emulated_mmio found by KASAN/Syzkaller fuzz test (v5.10.0)
2026-02-06 23:30 ` Sean Christopherson
@ 2026-02-09 20:21 ` Sean Christopherson
0 siblings, 0 replies; 12+ messages in thread
From: Sean Christopherson @ 2026-02-09 20:21 UTC (permalink / raw)
To: Zhangjiaji
Cc: Paolo Bonzini, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
Wangqinxiao (Tom), zhangyashu, wangyanan (Y)
On Fri, Feb 06, 2026, Sean Christopherson wrote:
> On Mon, Feb 02, 2026, Zhangjiaji wrote:
> > Syzkaller hit 'KASAN: use-after-free Read in complete_emulated_mmio' bug.
> >
> > ==================================================================
> > BUG: KASAN: use-after-free in complete_emulated_mmio+0x305/0x420
> > Read of size 1 at addr ffff888009c378d1 by task syz-executor417/984
> >
> > CPU: 1 PID: 984 Comm: syz-executor417 Not tainted 5.10.0-182.0.0.95.h2627.eulerosv2r13.x86_64 #3 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014 Call Trace:
> > dump_stack+0xbe/0xfd
> > print_address_description.constprop.0+0x19/0x170
> > __kasan_report.cold+0x6c/0x84
> > kasan_report+0x3a/0x50
> > check_memory_region+0xfd/0x1f0
> > memcpy+0x20/0x60
> > complete_emulated_mmio+0x305/0x420
> > kvm_arch_vcpu_ioctl_run+0x63f/0x6d0
> > kvm_vcpu_ioctl+0x413/0xb20
> > __se_sys_ioctl+0x111/0x160
> > do_syscall_64+0x30/0x40
> > entry_SYSCALL_64_after_hwframe+0x67/0xd1
> > RIP: 0033:0x42477d
...
> > I've analyzed the Syzkaller output and the complete_emulated_mmio() code
> > path. The buggy address originates in em_enter(), which passes its local
> > variable `ulong rbp` to emulate_push(); the address ultimately ends up in
> > emulator_read_write_onepage(), which stores it in
> > vcpu->mmio_fragments[].data. The bug triggers when the KVM guest executes
> > an "enter" instruction and the top of the stack crosses a memory page
> > boundary. In that case, em_enter() cannot complete the instruction by
> > itself and leaves the remaining data (in the other page) to
> > complete_emulated_mmio(). By the time complete_emulated_mmio() runs,
> > em_enter() has returned, so the local variable `ulong rbp` has been
> > released. complete_emulated_mmio() then accesses
> > vcpu->mmio_fragments[].data, and the use-after-free occurs.
> >
> > Any ideas?
>
> Egad, sorry! I had reproduced this shortly after you sent the report and prepped
> a fix, but got distracted and lost this in my inbox.
>
> Can you test this on your end? I repro'd by modifying a KVM-Unit-Test and hacking
> KVM to tweak the stack, so I haven't confirmed the syzkaller version.
>
> It's a bit gross, as it abuses an unused field as scratch space, but AFAICT that's
> "fine". The alternative would be add a dedicated field, which seems like overkill?
>
> I'm also going to try and add a WARN to detect if the @val parameter passed to
> emulator_read_write() is ever on the kernel stack, e.g. to help detect lurking
> bugs like this one without relying on KASAN.
>
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index c8e292e9a24d..dacef51c2565 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -1897,13 +1897,12 @@ static int em_enter(struct x86_emulate_ctxt *ctxt)
> int rc;
> unsigned frame_size = ctxt->src.val;
> unsigned nesting_level = ctxt->src2.val & 31;
> - ulong rbp;
>
> if (nesting_level)
> return X86EMUL_UNHANDLEABLE;
>
> - rbp = reg_read(ctxt, VCPU_REGS_RBP);
> - rc = emulate_push(ctxt, &rbp, stack_size(ctxt));
> + ctxt->memop.orig_val = reg_read(ctxt, VCPU_REGS_RBP);
> + rc = emulate_push(ctxt, &ctxt->memop.orig_val, stack_size(ctxt));
> if (rc != X86EMUL_CONTINUE)
> return rc;
> assign_masked(reg_rmw(ctxt, VCPU_REGS_RBP), reg_read(ctxt, VCPU_REGS_RSP),
*sigh*
This isn't going to work. Or rather, it's far from a complete fix, as there are
several other instructions and flows that use stack variables to service reads
and writes. Hacking all of them to not use stack variables isn't feasible, as
flows like emulate_iret_real() and em_popa() perform multiple accesses, i.e.
hijacking an unused operand simply won't work.
I'm tempted to go with a straightforward "fix" of rejecting userspace MMIO if
the destination is on the stack, e.g. like so:
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index db3f393192d9..113287612acd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8272,6 +8272,9 @@ static int emulator_read_write(struct x86_emulate_ctxt *ctxt,
if (!vcpu->mmio_nr_fragments)
return X86EMUL_CONTINUE;
+ if (object_is_on_stack(val))
+ return X86EMUL_UNHANDLEABLE;
+
gpa = vcpu->mmio_fragments[0].gpa;
vcpu->mmio_needed = 1;
But I hate how arbitrary that is, and I'm somewhat concerned that it could break
existing setups.
I think there's a not-completely-awful solution buried in this gigantic cesspool.
The only time KVM uses on-stack variables is for qword or smaller accesses, i.e.
8 bytes in size or less. For larger fragments, e.g. AVX to/from MMIO, the target
value will always be an operand in the emulator context. And so rather than
disallow stack variables, for "small" fragments, we can rework the handling to
copy the value to/from each fragment on-demand instead of stashing a pointer to
the value.
This needs a _lot_ more documentation, the SEV-ES flows aren't converted, and
there is a ton of cleanup that can and should be done on top, but this appears
to work.
Note the "clobber" in complete_emulated_mmio() to fill 4096 bytes of the stack
with 0xaa, i.e. to simulate some of the stack being "freed".
---
arch/x86/kvm/kvm_emulate.h | 8 +++
arch/x86/kvm/x86.c | 125 +++++++++++++++++++++++++++----------
include/linux/kvm_host.h | 6 +-
3 files changed, 103 insertions(+), 36 deletions(-)
diff --git a/arch/x86/kvm/kvm_emulate.h b/arch/x86/kvm/kvm_emulate.h
index fb3dab4b5a53..f735158af05e 100644
--- a/arch/x86/kvm/kvm_emulate.h
+++ b/arch/x86/kvm/kvm_emulate.h
@@ -284,6 +284,14 @@ struct fetch_cache {
u8 *end;
};
+/*
+ * To complete userspace MMIO and I/O reads, KVM re-emulates (NO_DECODE) the
+ * _entire_ instruction to propagate the read data to its final destination.
+ * To avoid re-reading values from memory and/or getting "stuck" on the access
+ * that triggered an exit to userspace, KVM caches all values that have been
+ * read for a given instruction, and reads from this cache instead of reading
+ * from guest memory or from userspace.
+ */
struct read_cache {
u8 data[1024];
unsigned long pos;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index db3f393192d9..1b99de3e3236 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -8109,7 +8109,7 @@ int emulator_write_phys(struct kvm_vcpu *vcpu, gpa_t gpa,
}
struct read_write_emulator_ops {
- int (*read_write_prepare)(struct kvm_vcpu *vcpu, void *val,
+ int (*read_mmio_fragment)(struct kvm_vcpu *vcpu, void *val,
int bytes);
int (*read_write_emulate)(struct kvm_vcpu *vcpu, gpa_t gpa,
void *val, int bytes);
@@ -8120,16 +8120,37 @@ struct read_write_emulator_ops {
bool write;
};
-static int read_prepare(struct kvm_vcpu *vcpu, void *val, int bytes)
+static int read_mmio_fragment(struct kvm_vcpu *vcpu, void *val, int bytes)
{
- if (vcpu->mmio_read_completed) {
- trace_kvm_mmio(KVM_TRACE_MMIO_READ, bytes,
- vcpu->mmio_fragments[0].gpa, val);
- vcpu->mmio_read_completed = 0;
- return 1;
+ struct kvm_mmio_fragment *frag;
+ unsigned int len = bytes;
+
+ while (len && vcpu->mmio_head_fragment < vcpu->mmio_tail_fragment) {
+ int i = vcpu->mmio_head_fragment++;
+
+ if (WARN_ON_ONCE(i >= vcpu->mmio_nr_fragments))
+ break;
+
+ frag = &vcpu->mmio_fragments[i];
+ if (!frag->data) {
+ if (WARN_ON_ONCE(len < frag->len || frag->len > 8u)) {
+ pr_warn("len = %u, bytes = %u, frag->len = %u, gpa = %llx, rip = %lx\n",
+ len, bytes, frag->len, frag->gpa, kvm_rip_read(vcpu));
+ break;
+ }
+
+ memcpy(val, &frag->val, min(8u, frag->len));
+ }
+
+ val += frag->len;
+ len -= min(len, frag->len);
}
- return 0;
+ if ((int)len == bytes)
+ return 0;
+
+ trace_kvm_mmio(KVM_TRACE_MMIO_READ, bytes, frag->gpa, val);
+ return 1;
}
static int read_emulate(struct kvm_vcpu *vcpu, gpa_t gpa,
@@ -8150,6 +8171,11 @@ static int write_mmio(struct kvm_vcpu *vcpu, gpa_t gpa, int bytes, void *val)
return vcpu_mmio_write(vcpu, gpa, bytes, val);
}
+static void *mmio_frag_data(struct kvm_mmio_fragment *frag)
+{
+ return frag->data ?: &frag->val;
+}
+
static int read_exit_mmio(struct kvm_vcpu *vcpu, gpa_t gpa,
void *val, int bytes)
{
@@ -8162,12 +8188,12 @@ static int write_exit_mmio(struct kvm_vcpu *vcpu, gpa_t gpa,
{
struct kvm_mmio_fragment *frag = &vcpu->mmio_fragments[0];
- memcpy(vcpu->run->mmio.data, frag->data, min(8u, frag->len));
+ memcpy(vcpu->run->mmio.data, mmio_frag_data(frag), min(8u, frag->len));
return X86EMUL_CONTINUE;
}
static const struct read_write_emulator_ops read_emultor = {
- .read_write_prepare = read_prepare,
+ .read_mmio_fragment = read_mmio_fragment,
.read_write_emulate = read_emulate,
.read_write_mmio = vcpu_mmio_read,
.read_write_exit_mmio = read_exit_mmio,
@@ -8226,7 +8252,13 @@ static int emulator_read_write_onepage(unsigned long addr, void *val,
WARN_ON(vcpu->mmio_nr_fragments >= KVM_MAX_MMIO_FRAGMENTS);
frag = &vcpu->mmio_fragments[vcpu->mmio_nr_fragments++];
frag->gpa = gpa;
- frag->data = val;
+ if (bytes > 8u) {
+ frag->data = val;
+ } else {
+ frag->data = NULL;
+ if (write)
+ memcpy(&frag->val, val, bytes);
+ }
frag->len = bytes;
return X86EMUL_CONTINUE;
}
@@ -8241,11 +8273,16 @@ static int emulator_read_write(struct x86_emulate_ctxt *ctxt,
gpa_t gpa;
int rc;
- if (ops->read_write_prepare &&
- ops->read_write_prepare(vcpu, val, bytes))
+ if (WARN_ON_ONCE(bytes > 8u && object_is_on_stack(val)))
+ return X86EMUL_UNHANDLEABLE;
+
+ if (ops->read_mmio_fragment &&
+ ops->read_mmio_fragment(vcpu, val, bytes))
return X86EMUL_CONTINUE;
vcpu->mmio_nr_fragments = 0;
+ vcpu->mmio_head_fragment = 0;
+ vcpu->mmio_tail_fragment = 0;
/* Crossing a page boundary? */
if (((addr + bytes - 1) ^ addr) & PAGE_MASK) {
@@ -8275,7 +8312,6 @@ static int emulator_read_write(struct x86_emulate_ctxt *ctxt,
gpa = vcpu->mmio_fragments[0].gpa;
vcpu->mmio_needed = 1;
- vcpu->mmio_cur_fragment = 0;
vcpu->run->mmio.len = min(8u, vcpu->mmio_fragments[0].len);
vcpu->run->mmio.is_write = vcpu->mmio_is_write = ops->write;
@@ -11832,42 +11868,61 @@ static int complete_emulated_mmio(struct kvm_vcpu *vcpu)
{
struct kvm_run *run = vcpu->run;
struct kvm_mmio_fragment *frag;
- unsigned len;
+ unsigned int len;
+ u8 clobber[4096];
- BUG_ON(!vcpu->mmio_needed);
+ memset(clobber, 0xaa, sizeof(clobber));
+
+ if (WARN_ON_ONCE(!vcpu->mmio_needed))
+ return -EIO;
+
+ /* Complete MMIO for the current fragment. */
+ frag = &vcpu->mmio_fragments[vcpu->mmio_tail_fragment];
- /* Complete previous fragment */
- frag = &vcpu->mmio_fragments[vcpu->mmio_cur_fragment];
len = min(8u, frag->len);
if (!vcpu->mmio_is_write)
- memcpy(frag->data, run->mmio.data, len);
+ memcpy(mmio_frag_data(frag), run->mmio.data, len);
if (frag->len <= 8) {
/* Switch to the next fragment. */
frag++;
- vcpu->mmio_cur_fragment++;
+ vcpu->mmio_tail_fragment++;
} else {
+ if (WARN_ON_ONCE(!frag->data))
+ return -EIO;
+
/* Go forward to the next mmio piece. */
frag->data += len;
frag->gpa += len;
frag->len -= len;
}
- if (vcpu->mmio_cur_fragment >= vcpu->mmio_nr_fragments) {
+ if (vcpu->mmio_tail_fragment >= vcpu->mmio_nr_fragments) {
vcpu->mmio_needed = 0;
+ WARN_ON_ONCE(vcpu->mmio_head_fragment);
- /* FIXME: return into emulator if single-stepping. */
- if (vcpu->mmio_is_write)
+ /*
+ * Don't re-emulate the instruction for MMIO writes, as KVM has
+ * already committed all side effects (and the emulator simply
> + * isn't equipped to replay a completed write).
+ *
+ * FIXME: Return into emulator if single-stepping.
+ */
+ if (vcpu->mmio_is_write) {
+ vcpu->mmio_tail_fragment = 0;
return 1;
- vcpu->mmio_read_completed = 1;
+ }
+
return complete_emulated_io(vcpu);
}
+ len = min(8u, frag->len);
+
run->exit_reason = KVM_EXIT_MMIO;
run->mmio.phys_addr = frag->gpa;
if (vcpu->mmio_is_write)
- memcpy(run->mmio.data, frag->data, min(8u, frag->len));
- run->mmio.len = min(8u, frag->len);
+ memcpy(run->mmio.data, mmio_frag_data(frag), len);
+ run->mmio.len = len;
run->mmio.is_write = vcpu->mmio_is_write;
vcpu->arch.complete_userspace_io = complete_emulated_mmio;
return 0;
@@ -14247,15 +14302,15 @@ static int complete_sev_es_emulated_mmio(struct kvm_vcpu *vcpu)
BUG_ON(!vcpu->mmio_needed);
/* Complete previous fragment */
- frag = &vcpu->mmio_fragments[vcpu->mmio_cur_fragment];
+ frag = &vcpu->mmio_fragments[vcpu->mmio_tail_fragment];
len = min(8u, frag->len);
- if (!vcpu->mmio_is_write)
+ if (!vcpu->mmio_is_write && frag->data)
memcpy(frag->data, run->mmio.data, len);
if (frag->len <= 8) {
/* Switch to the next fragment. */
frag++;
- vcpu->mmio_cur_fragment++;
+ vcpu->mmio_tail_fragment++;
} else {
/* Go forward to the next mmio piece. */
frag->data += len;
@@ -14263,7 +14318,7 @@ static int complete_sev_es_emulated_mmio(struct kvm_vcpu *vcpu)
frag->len -= len;
}
- if (vcpu->mmio_cur_fragment >= vcpu->mmio_nr_fragments) {
+ if (vcpu->mmio_tail_fragment >= vcpu->mmio_nr_fragments) {
vcpu->mmio_needed = 0;
// VMG change, at this point, we're always done
@@ -14303,13 +14358,14 @@ int kvm_sev_es_mmio_write(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned int bytes,
/*TODO: Check if need to increment number of frags */
frag = vcpu->mmio_fragments;
- vcpu->mmio_nr_fragments = 1;
frag->len = bytes;
frag->gpa = gpa;
frag->data = data;
vcpu->mmio_needed = 1;
- vcpu->mmio_cur_fragment = 0;
+ vcpu->mmio_nr_fragments = 1;
+ vcpu->mmio_head_fragment = 0;
+ vcpu->mmio_tail_fragment = 0;
vcpu->run->mmio.phys_addr = gpa;
vcpu->run->mmio.len = min(8u, frag->len);
@@ -14342,13 +14398,14 @@ int kvm_sev_es_mmio_read(struct kvm_vcpu *vcpu, gpa_t gpa, unsigned int bytes,
/*TODO: Check if need to increment number of frags */
frag = vcpu->mmio_fragments;
- vcpu->mmio_nr_fragments = 1;
frag->len = bytes;
frag->gpa = gpa;
frag->data = data;
vcpu->mmio_needed = 1;
- vcpu->mmio_cur_fragment = 0;
+ vcpu->mmio_nr_fragments = 1;
+ vcpu->mmio_head_fragment = 0;
+ vcpu->mmio_tail_fragment = 0;
vcpu->run->mmio.phys_addr = gpa;
vcpu->run->mmio.len = min(8u, frag->len);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 782f4d670793..be4b9de5b8c9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -320,7 +320,8 @@ static inline bool kvm_vcpu_can_poll(ktime_t cur, ktime_t stop)
struct kvm_mmio_fragment {
gpa_t gpa;
void *data;
- unsigned len;
+ u64 val;
+ unsigned int len;
};
struct kvm_vcpu {
@@ -356,7 +357,8 @@ struct kvm_vcpu {
int mmio_needed;
int mmio_read_completed;
int mmio_is_write;
- int mmio_cur_fragment;
+ int mmio_head_fragment;
+ int mmio_tail_fragment;
int mmio_nr_fragments;
struct kvm_mmio_fragment mmio_fragments[KVM_MAX_MMIO_FRAGMENTS];
#endif
base-commit: e944fe2c09f405a2e2d147145c9b470084bc4c9a
--
^ permalink raw reply related [flat|nested] 12+ messages in thread
* Re: [BUG REPORT] USE_AFTER_FREE in complete_emulated_mmio found by KASAN/Syzkaller fuzz test (v5.10.0)
2026-02-02 1:24 Zhangjiaji
2026-02-06 23:30 ` Sean Christopherson
@ 2026-02-10 6:26 ` Paolo Bonzini
2026-02-10 14:35 ` Sean Christopherson
1 sibling, 1 reply; 12+ messages in thread
From: Paolo Bonzini @ 2026-02-10 6:26 UTC (permalink / raw)
To: Zhangjiaji
Cc: Sean Christopherson, kvm@vger.kernel.org,
linux-kernel@vger.kernel.org, Wangqinxiao (Tom), zhangyashu,
wangyanan (Y)
> I've analyzed the Syzkaller output and the complete_emulated_mmio() code path.
> The buggy address is created in em_enter(), which passes its local variable `ulong rbp` to emulate_push(); the call chain ends with emulator_read_write_onepage() storing that address in vcpu->mmio_fragments[].data.
> The bug triggers when the KVM guest executes an "enter" instruction and the top of the stack crosses a page boundary.
> In that case, em_enter() cannot complete the instruction by itself and leaves the remaining data (the part in the other page) to complete_emulated_mmio().
> By the time complete_emulated_mmio() runs, em_enter() has returned, so the local variable `ulong rbp` is gone.
> complete_emulated_mmio() then tries to access vcpu->mmio_fragments[].data, and the use-after-free occurs.
>
> any idea?
Ouch, the bug is certainly legit. The easiest way to fix it is something
like this:
diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
index c8e292e9a24d..1c8698139dd5 100644
--- a/arch/x86/kvm/emulate.c
+++ b/arch/x86/kvm/emulate.c
@@ -1905,7 +1905,7 @@ static int em_enter(struct x86_emulate_ctxt *ctxt)
rbp = reg_read(ctxt, VCPU_REGS_RBP);
rc = emulate_push(ctxt, &rbp, stack_size(ctxt));
if (rc != X86EMUL_CONTINUE)
- return rc;
+ return X86EMUL_UNHANDLEABLE;
assign_masked(reg_rmw(ctxt, VCPU_REGS_RBP), reg_read(ctxt, VCPU_REGS_RSP),
stack_mask(ctxt));
assign_masked(reg_rmw(ctxt, VCPU_REGS_RSP),
The hard part is auditing all similar places that lead to passing a
local variable to emulator_read_write_onepage().
For example I found this one:
write_segment_descriptor(ctxt, old_tss_sel, &curr_tss_desc);
which however is caught at kvm_task_switch() and changed to an
emulation error.
It may be a good idea to stick a defensive "vcpu->mmio_needed = false;"
in prepare_emulation_failure_exit(), as well as
"vcpu->arch.complete_userspace_io = NULL" both there and in
kvm_task_switch().
Paolo
* Re: [BUG REPORT] USE_AFTER_FREE in complete_emulated_mmio found by KASAN/Syzkaller fuzz test (v5.10.0)
2026-02-10 6:26 ` Paolo Bonzini
@ 2026-02-10 14:35 ` Sean Christopherson
2026-02-10 17:41 ` Paolo Bonzini
0 siblings, 1 reply; 12+ messages in thread
From: Sean Christopherson @ 2026-02-10 14:35 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Zhangjiaji, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
Wangqinxiao (Tom), zhangyashu, wangyanan (Y)
On Tue, Feb 10, 2026, Paolo Bonzini wrote:
>
> > I've analyzed the Syzkaller output and the complete_emulated_mmio() code path.
> > The buggy address is created in em_enter(), which passes its local variable `ulong rbp` to emulate_push(); the call chain ends with emulator_read_write_onepage() storing that address in vcpu->mmio_fragments[].data.
> > The bug triggers when the KVM guest executes an "enter" instruction and the top of the stack crosses a page boundary.
> > In that case, em_enter() cannot complete the instruction by itself and leaves the remaining data (the part in the other page) to complete_emulated_mmio().
> > By the time complete_emulated_mmio() runs, em_enter() has returned, so the local variable `ulong rbp` is gone.
> > complete_emulated_mmio() then tries to access vcpu->mmio_fragments[].data, and the use-after-free occurs.
> >
> > any idea?
>
> Ouch, the bug is certainly legit. The easiest way to fix it is something
> like this:
>
> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
> index c8e292e9a24d..1c8698139dd5 100644
> --- a/arch/x86/kvm/emulate.c
> +++ b/arch/x86/kvm/emulate.c
> @@ -1905,7 +1905,7 @@ static int em_enter(struct x86_emulate_ctxt *ctxt)
> rbp = reg_read(ctxt, VCPU_REGS_RBP);
> rc = emulate_push(ctxt, &rbp, stack_size(ctxt));
> if (rc != X86EMUL_CONTINUE)
> - return rc;
> + return X86EMUL_UNHANDLEABLE;
This won't do anything, rc == X86EMUL_CONTINUE when MMIO is needed. The check
would need to be something equivalent to this hack:
rc = emulate_push(ctxt, &rbp, stack_size(ctxt));
if (((struct kvm_vcpu *)ctxt->vcpu)->mmio_needed)
return X86EMUL_UNHANDLEABLE;
but that's largely a moot point.
> assign_masked(reg_rmw(ctxt, VCPU_REGS_RBP), reg_read(ctxt, VCPU_REGS_RSP),
> stack_mask(ctxt));
> assign_masked(reg_rmw(ctxt, VCPU_REGS_RSP),
>
> The hard part is auditing all similar places that lead to passing a
> local variable to emulator_read_write_onepage().
>
> For example I found this one:
>
> write_segment_descriptor(ctxt, old_tss_sel, &curr_tss_desc);
>
> which however is caught at kvm_task_switch() and changed to an
> emulation error.
>
> It may be a good idea to stick a defensive "vcpu->mmio_needed = false;"
> in prepare_emulation_failure_exit(), as well as
> "vcpu->arch.complete_userspace_io = NULL" both there and in
> kvm_task_switch().
Please see my other reply[*]. There are a pile of instructions and flows that
read/write using on-stack variables. IMO, anything that relies on updating
callers is a non-starter, and I don't love the idea of simply disallowing MMIO
for any instruction that uses an on-stack variable.
[*] https://lore.kernel.org/all/aYpBz1wDdsDfl8Al@google.com
* Re: [BUG REPORT] USE_AFTER_FREE in complete_emulated_mmio found by KASAN/Syzkaller fuzz test (v5.10.0)
2026-02-10 14:35 ` Sean Christopherson
@ 2026-02-10 17:41 ` Paolo Bonzini
0 siblings, 0 replies; 12+ messages in thread
From: Paolo Bonzini @ 2026-02-10 17:41 UTC (permalink / raw)
To: Sean Christopherson
Cc: Zhangjiaji, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
Wangqinxiao (Tom), zhangyashu, wangyanan (Y)
On 2/10/26 15:35, Sean Christopherson wrote:
> On Tue, Feb 10, 2026, Paolo Bonzini wrote:
>>
>>> I've analyzed the Syzkaller output and the complete_emulated_mmio() code path.
>>> The buggy address is created in em_enter(), which passes its local variable `ulong rbp` to emulate_push(); the call chain ends with emulator_read_write_onepage() storing that address in vcpu->mmio_fragments[].data.
>>> The bug triggers when the KVM guest executes an "enter" instruction and the top of the stack crosses a page boundary.
>>> In that case, em_enter() cannot complete the instruction by itself and leaves the remaining data (the part in the other page) to complete_emulated_mmio().
>>> By the time complete_emulated_mmio() runs, em_enter() has returned, so the local variable `ulong rbp` is gone.
>>> complete_emulated_mmio() then tries to access vcpu->mmio_fragments[].data, and the use-after-free occurs.
>>>
>>> any idea?
>>
>> Ouch, the bug is certainly legit. The easiest way to fix it is something
>> like this:
>>
>> diff --git a/arch/x86/kvm/emulate.c b/arch/x86/kvm/emulate.c
>> index c8e292e9a24d..1c8698139dd5 100644
>> --- a/arch/x86/kvm/emulate.c
>> +++ b/arch/x86/kvm/emulate.c
>> @@ -1905,7 +1905,7 @@ static int em_enter(struct x86_emulate_ctxt *ctxt)
>> rbp = reg_read(ctxt, VCPU_REGS_RBP);
>> rc = emulate_push(ctxt, &rbp, stack_size(ctxt));
>> if (rc != X86EMUL_CONTINUE)
>> - return rc;
>> + return X86EMUL_UNHANDLEABLE;
>
> This won't do anything, rc == X86EMUL_CONTINUE when MMIO is needed.
Yeah, I was thinking of X86EMUL_IO_NEEDED but that's only for reads.
Paolo
Thread overview: 12+ messages
2026-02-10 11:56 Re: [BUG REPORT] USE_AFTER_FREE in complete_emulated_mmio found by KASAN/Syzkaller fuzz test (v5.10.0) Zhangjiaji
2026-02-10 19:11 ` Sean Christopherson
2026-02-18 20:56 ` Sean Christopherson
2026-05-08 7:57 ` Xinyu Zheng
2026-05-08 14:25 ` Sean Christopherson
2026-05-09 1:55 ` Xinyu
-- strict thread matches above, loose matches on Subject: below --
2026-02-02 1:24 Zhangjiaji
2026-02-06 23:30 ` Sean Christopherson
2026-02-09 20:21 ` Sean Christopherson
2026-02-10 6:26 ` Paolo Bonzini
2026-02-10 14:35 ` Sean Christopherson
2026-02-10 17:41 ` Paolo Bonzini