Kernel KVM virtualization development
 help / color / mirror / Atom feed
From: sashiko-bot@kernel.org
To: leixiang <leenollei@gmail.com>
Cc: kvm@vger.kernel.org, ojeda@kernel.org
Subject: Re: [RFC PATCH] KVM: rust: add Rust reimplementation of eventfd
Date: Fri, 26 Jun 2026 08:48:41 +0000	[thread overview]
Message-ID: <20260626084842.172581F000E9@smtp.kernel.org> (raw)
In-Reply-To: <20260626083212.39611-1-leixiang@kylinos.cn>

Thank you for your contribution! Sashiko AI review found 7 potential issue(s) to consider:
- [Critical] Recursive spinlock acquisition causing guaranteed deadlock in `irqfd_wakeup`.
- [Critical] ABBA deadlock between `irqfds.lock` and waitqueue lock intentionally hidden from lockdep.
- [High] `kvm_deassign_ioeventfd_idx` fails to free the `Ioeventfd` device, leaking memory and eventfd references.
- [High] MutexGuard dropped too late, holding `irq_lock` across `synchronize_srcu_expedited`.
- [High] Logic error in `irqfd_wakeup` injects spurious interrupts for deactivating irqfds.
- [High] Unsafe RCU list traversal in `kvm_notify_irqfd_resampler` lacks `rcu_dereference` memory barriers.
- [Medium] Unaligned memory read causes undefined behavior in `ioeventfd_in_range`.
--

commit 3427a3ec37eb356c05fc7c711d26f74d692d5086
Author: ninol <leenollei@gmail.com>
Subject: KVM: rust: add Rust reimplementation of eventfd

This patch introduces a Rust reimplementation of KVM's eventfd support, providing irqfd and ioeventfd functionality. It uses RAII guards for lock management and is selectable via a new Kconfig option.

> diff --git a/rust/kernel/kvm/eventfd.rs b/rust/kernel/kvm/eventfd.rs
> new file mode 100644
> index 0000000000000..d9b12c7caccb0
> --- /dev/null
> +++ b/rust/kernel/kvm/eventfd.rs
[ ... ]
> +unsafe extern "C" fn irqfd_wakeup(
> +    wait: *mut bindings::wait_queue_entry,
> +    _mode: c_uint,
> +    _sync: c_int,
> +    key: *mut c_void,
> +) -> c_int {
> +    // SAFETY: wait is embedded in kvm_kernel_irqfd.
> +    let irqfd: *mut bindings::kvm_kernel_irqfd =
> +        crate::container_of!(wait, bindings::kvm_kernel_irqfd, wait);
> +    let flags = (key as usize) as bindings::__poll_t;
> +    let kvm = (*irqfd).kvm;
> +    let mut ret: c_int = 0;
> +
> +    if (flags & bindings::POLLIN) != 0 {
> +        let mut cnt: u64 = 0;
> +        eventfd_ctx_do_read((*irqfd).eventfd, &raw mut cnt);

[Severity: Critical]
Does this cause a recursive spinlock deadlock? This irqfd_wakeup() function is
invoked from the eventfd waitqueue wakeup path where the waitqueue spinlock
(wqh.lock) is already held. Calling eventfd_ctx_do_read() here will try to
acquire that exact same spinlock again, resulting in an immediate deadlock.

> +
> +        // SAFETY: kvm->irq_srcu is valid.
> +        let _srcu_guard = unsafe { SrcuGuard::new(&raw mut (*kvm).irq_srcu) };
> +
> +        // Read irq_entry under seqcount protection to avoid torn reads
> +        // when kvm_irq_routing_update() writes concurrently on another CPU.
> +        let mut irq = loop {
> +            let seq = unsafe { bindings::read_seqcount_begin(&raw mut (*irqfd).irq_entry_sc) };
> +            let entry = unsafe { core::ptr::read(&(*irqfd).irq_entry) };
> +            if unsafe { bindings::read_seqcount_retry(&raw mut (*irqfd).irq_entry_sc, seq) } == 0 {
> +                break entry;
> +            }
> +        };
> +
> +        let deactivating = !irqfd_is_active(irqfd);
> +        if deactivating
> +            || unsafe {
> +                bindings::kvm_arch_set_irq_inatomic(
> +                    &raw mut irq,
> +                    kvm,
> +                    bindings::KVM_USERSPACE_IRQ_SOURCE_ID as i32,
> +                    1,
> +                    false,
> +                )
> +            } == -(bindings::EWOULDBLOCK as c_int)
> +        {
> +            // SAFETY: FFI call.
> +            unsafe { bindings::schedule_work(&raw mut (*irqfd).inject) };
> +        }

[Severity: High]
Does this logic mistakenly inject spurious interrupts when an irqfd is
deactivating? If the deactivating condition is true, the || operator
short-circuits and evaluates the entire condition to true, meaning we will
schedule work for deactivating irqfds. This could lead to use-after-free
issues if the work executes after the irqfd memory has been freed.

> +
> +        // _srcu_guard is dropped here
> +        ret = 1;
> +    }
> +
> +    if (flags & bindings::POLLHUP) != 0 {
> +        // SAFETY: Taking irqfds.lock is safe here per original C comment.
> +        // SAFETY: kvm is valid, irqfds.lock is valid.
> +        let _spin_guard = unsafe { SpinLockIrqSaveGuard::new(&raw mut (*kvm).irqfds.lock) };

[Severity: Critical]
Is there an ABBA deadlock risk here? This acquires irqfds.lock while the
waitqueue lock (wqh.lock) is already held by the caller. However, in
kvm_irqfd_register(), irqfds.lock is acquired before calling
add_wait_queue_priority_exclusive(), which in turn acquires wqh.lock. This
lock inversion can trigger a deadlock when registering and closing eventfds
concurrently.

[ ... ]
> +unsafe extern "C" fn kvm_irqfd_register(
> +    _file: *mut bindings::file,
> +    wqh: *mut bindings::wait_queue_head,
> +    pt: *mut bindings::poll_table,
> +) {
[ ... ]
> +    // Initialize wait queue entry with our custom wakeup handler.
> +    init_waitqueue_func_entry(&raw mut (*irqfd).wait, Some(irqfd_wakeup));
> +
> +    // Temporarily lie to lockdep about holding irqfds.lock to avoid a
> +    // false positive: add_wait_queue_priority_exclusive takes wqh->lock
> +    // internally, and lockdep would see irqfds.lock -> wqh->lock, which
> +    // matches the reverse order in irqfd_wakeup(EPOLLHUP).
> +    // SAFETY: `kvm` is a valid pointer, and `irqfds.lock` is held.
> +    unsafe { bindings::rust_helper_kvm_irqfds_spin_release(kvm) };
> +
> +    // Add as priority waiter on eventfd.
> +    (*p).ret = add_wait_queue_priority_exclusive(wqh, &raw mut (*irqfd).wait);
> +
> +    // SAFETY: `kvm` is valid, lockdep state is restored after
> +    // releasing wait queue lock.
> +    unsafe { bindings::rust_helper_kvm_irqfds_spin_acquire(kvm) };

[Severity: Critical]
This isn't a false positive; silencing lockdep here hides the real ABBA
deadlock mentioned above between irqfd_wakeup() and this function. Could we
restructure the locking to avoid this inversion instead of hiding it from
lockdep?

[ ... ]
> +#[inline(always)]
> +unsafe fn ioeventfd_in_range(
> +    p: &Ioeventfd,
> +    addr: bindings::gpa_t,
> +    len: c_int,
> +    val: *const c_void,
> +) -> bool {
[ ... ]
> +    if let Some(datamatch) = p.datamatch() {
> +        // KVM guarantees val is naturally aligned; assert to match C's BUG_ON.
> +        debug_assert!(
> +            (val as usize) % (len as usize) == 0,
> +            "ioeventfd_in_range: val pointer not aligned to len"
> +        );
> +        let _val: u64 = match len {
> +            1 => unsafe { (val as *const u8).read() as u64 },
> +            2 => unsafe { (val as *const u16).read() as u64 },
> +            4 => unsafe { (val as *const u32).read() as u64 },
> +            8 => unsafe { (val as *const u64).read() as u64 },
> +            _ => return false,
> +        };

[Severity: Medium]
Could this result in undefined behavior? In release builds, debug_assert!
compiles away, and ptr::read() strictly requires the input pointer to be
aligned. If the pointer happens to be unaligned, this invokes undefined
behavior. Using ptr::read_unaligned() would be safer here.

[ ... ]
> +unsafe fn kvm_deassign_ioeventfd_idx(
> +    kvm: *mut bindings::kvm,
> +    bus_idx: bindings::kvm_bus,
> +    args: *const bindings::kvm_ioeventfd,
> +) -> c_int {
[ ... ]
> +            // SAFETY: kvm_io_bus_unregister_dev will trigger the iodevice destructor,
> +            // which implicitly performs list_del and kfree(p). Using `next` above avoids UAF.
> +            unsafe { bindings::kvm_io_bus_unregister_dev(kvm, bus_idx, &raw mut (*p).dev) };
> +            let bus = unsafe { bindings::kvm_get_bus(kvm, bus_idx) };
> +            if !bus.is_null() {
> +                unsafe { (*bus).ioeventfd_count -= 1 };
> +            }
> +            ret = 0;
> +            break;
> +        }
> +        // _mutex_guard drops here, unlocking before eventfd_ctx_put.
> +    }

[Severity: High]
Does this code leak the Ioeventfd and eventfd references? The assumption in
the comment that kvm_io_bus_unregister_dev() triggers the destructor is
incorrect; it only removes the device from the bus array. The caller is still
responsible for explicitly deleting the list entry and freeing the memory.

[ ... ]
> +pub unsafe extern "C" fn kvm_notify_irqfd_resampler(
> +    kvm: *mut bindings::kvm,
> +    irqchip: c_uint,
> +    pin: c_uint,
> +) -> bool {
> +    // SAFETY: kvm->irq_srcu is valid.
> +    let _srcu_guard = unsafe { SrcuGuard::new(&raw mut (*kvm).irq_srcu) };
> +    let gsi = unsafe { bindings::kvm_irq_map_chip_pin(kvm, irqchip, pin) };
> +
> +    if gsi != -1 {
> +        let mut pos = (*kvm).irqfds.resampler_list.next;
> +        let head = &raw mut (*kvm).irqfds.resampler_list;
> +        while pos != head {
> +            let resampler: *mut bindings::kvm_kernel_irqfd_resampler =
> +                crate::container_of!(pos, bindings::kvm_kernel_irqfd_resampler, link);

[Severity: High]
Can this plain pointer traversal cause a data race? Accessing an RCU list via
direct pos.next dereferences under an SRCU read lock lacks the memory barrier
provided by rcu_dereference(), which is required on weakly ordered
architectures to prevent reading uninitialized state.

[ ... ]
> +pub unsafe extern "C" fn kvm_unregister_irq_ack_notifier(
> +    kvm: *mut bindings::kvm,
> +    kian: *mut bindings::kvm_irq_ack_notifier,
> +) {
> +    // SAFETY: Caller guarantees kvm is valid.
> +    let _mutex_guard = unsafe { MutexGuard::new(&raw mut (*kvm).irq_lock) };
> +    unsafe { bindings::rust_helper_hlist_del_init_rcu(&raw mut (*kian).link) };
> +    // _mutex_guard dropped here
> +    unsafe { bindings::synchronize_srcu_expedited(&raw mut (*kvm).irq_srcu) };
> +    unsafe { bindings::kvm_arch_post_irq_ack_notifier_list_update(kvm) };
> +}

[Severity: High]
Is the _mutex_guard actually dropped where the comment claims? Because
_mutex_guard is not explicitly dropped (e.g., via drop(_mutex_guard) or
wrapping in an inner block), its lifetime extends to the end of the function
scope. This means irq_lock is held across the blocking
synchronize_srcu_expedited() call, violating the locking hierarchy and
potentially causing a deadlock.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260626083212.39611-1-leixiang@kylinos.cn?part=1

  reply	other threads:[~2026-06-26  8:48 UTC|newest]

Thread overview: 4+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-26  8:29 [RFC PATCH] KVM: rust: add Rust reimplementation of eventfd leixiang
2026-06-26  8:48 ` sashiko-bot [this message]
2026-06-29  8:29 ` Onur Özkan
2026-06-30  7:16   ` ninol

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260626084842.172581F000E9@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=kvm@vger.kernel.org \
    --cc=leenollei@gmail.com \
    --cc=ojeda@kernel.org \
    --cc=sashiko-reviews@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox