Re: [RFC PATCH] KVM: rust: add Rust reimplementation of eventfd

All of lore.kernel.org
 help / color / mirror / Atom feed

From: sashiko-bot@kernel.org
To: leixiang <leenollei@gmail.com>
Cc: kvm@vger.kernel.org, ojeda@kernel.org
Subject: Re: [RFC PATCH] KVM: rust: add Rust reimplementation of eventfd
Date: Fri, 26 Jun 2026 08:48:41 +0000	[thread overview]
Message-ID: <20260626084842.172581F000E9@smtp.kernel.org> (raw)
In-Reply-To: <20260626083212.39611-1-leixiang@kylinos.cn>

Thank you for your contribution! Sashiko AI review found 7 potential issue(s) to consider:
- [Critical] Recursive spinlock acquisition causing guaranteed deadlock in `irqfd_wakeup`.
- [Critical] ABBA deadlock between `irqfds.lock` and waitqueue lock intentionally hidden from lockdep.
- [High] `kvm_deassign_ioeventfd_idx` fails to free the `Ioeventfd` device, leaking memory and eventfd references.
- [High] MutexGuard dropped too late, holding `irq_lock` across `synchronize_srcu_expedited`.
- [High] Logic error in `irqfd_wakeup` injects spurious interrupts for deactivating irqfds.
- [High] Unsafe RCU list traversal in `kvm_notify_irqfd_resampler` lacks `rcu_dereference` memory barriers.
- [Medium] Unaligned memory read causes undefined behavior in `ioeventfd_in_range`.
--

commit 3427a3ec37eb356c05fc7c711d26f74d692d5086
Author: ninol <leenollei@gmail.com>
Subject: KVM: rust: add Rust reimplementation of eventfd

This patch introduces a Rust reimplementation of KVM's eventfd support, providing irqfd and ioeventfd functionality. It uses RAII guards for lock management and is selectable via a new Kconfig option.

> diff --git a/rust/kernel/kvm/eventfd.rs b/rust/kernel/kvm/eventfd.rs
> new file mode 100644
> index 0000000000000..d9b12c7caccb0
> --- /dev/null
> +++ b/rust/kernel/kvm/eventfd.rs
[ ... ]
> +unsafe extern "C" fn irqfd_wakeup(
> +    wait: *mut bindings::wait_queue_entry,
> +    _mode: c_uint,
> +    _sync: c_int,
> +    key: *mut c_void,
> +) -> c_int {
> +    // SAFETY: wait is embedded in kvm_kernel_irqfd.
> +    let irqfd: *mut bindings::kvm_kernel_irqfd =
> +        crate::container_of!(wait, bindings::kvm_kernel_irqfd, wait);
> +    let flags = (key as usize) as bindings::__poll_t;
> +    let kvm = (*irqfd).kvm;
> +    let mut ret: c_int = 0;
> +
> +    if (flags & bindings::POLLIN) != 0 {
> +        let mut cnt: u64 = 0;
> +        eventfd_ctx_do_read((*irqfd).eventfd, &raw mut cnt);

[Severity: Critical]
Does this cause a recursive spinlock deadlock? This irqfd_wakeup() function is
invoked from the eventfd waitqueue wakeup path where the waitqueue spinlock
(wqh.lock) is already held. Calling eventfd_ctx_do_read() here will try to
acquire that exact same spinlock again, resulting in an immediate deadlock.

> +
> +        // SAFETY: kvm->irq_srcu is valid.
> +        let _srcu_guard = unsafe { SrcuGuard::new(&raw mut (*kvm).irq_srcu) };
> +
> +        // Read irq_entry under seqcount protection to avoid torn reads
> +        // when kvm_irq_routing_update() writes concurrently on another CPU.
> +        let mut irq = loop {
> +            let seq = unsafe { bindings::read_seqcount_begin(&raw mut (*irqfd).irq_entry_sc) };
> +            let entry = unsafe { core::ptr::read(&(*irqfd).irq_entry) };
> +            if unsafe { bindings::read_seqcount_retry(&raw mut (*irqfd).irq_entry_sc, seq) } == 0 {
> +                break entry;
> +            }
> +        };
> +
> +        let deactivating = !irqfd_is_active(irqfd);
> +        if deactivating
> +            || unsafe {
> +                bindings::kvm_arch_set_irq_inatomic(
> +                    &raw mut irq,
> +                    kvm,
> +                    bindings::KVM_USERSPACE_IRQ_SOURCE_ID as i32,
> +                    1,
> +                    false,
> +                )
> +            } == -(bindings::EWOULDBLOCK as c_int)
> +        {
> +            // SAFETY: FFI call.
> +            unsafe { bindings::schedule_work(&raw mut (*irqfd).inject) };
> +        }

[Severity: High]
Does this logic mistakenly inject spurious interrupts when an irqfd is
deactivating? If the deactivating condition is true, the || operator
short-circuits and evaluates the entire condition to true, meaning we will
schedule work for deactivating irqfds. This could lead to use-after-free
issues if the work executes after the irqfd memory has been freed.

> +
> +        // _srcu_guard is dropped here
> +        ret = 1;
> +    }
> +
> +    if (flags & bindings::POLLHUP) != 0 {
> +        // SAFETY: Taking irqfds.lock is safe here per original C comment.
> +        // SAFETY: kvm is valid, irqfds.lock is valid.
> +        let _spin_guard = unsafe { SpinLockIrqSaveGuard::new(&raw mut (*kvm).irqfds.lock) };

[Severity: Critical]
Is there an ABBA deadlock risk here? This acquires irqfds.lock while the
waitqueue lock (wqh.lock) is already held by the caller. However, in
kvm_irqfd_register(), irqfds.lock is acquired before calling
add_wait_queue_priority_exclusive(), which in turn acquires wqh.lock. This
lock inversion can trigger a deadlock when registering and closing eventfds
concurrently.

[ ... ]
> +unsafe extern "C" fn kvm_irqfd_register(
> +    _file: *mut bindings::file,
> +    wqh: *mut bindings::wait_queue_head,
> +    pt: *mut bindings::poll_table,
> +) {
[ ... ]
> +    // Initialize wait queue entry with our custom wakeup handler.
> +    init_waitqueue_func_entry(&raw mut (*irqfd).wait, Some(irqfd_wakeup));
> +
> +    // Temporarily lie to lockdep about holding irqfds.lock to avoid a
> +    // false positive: add_wait_queue_priority_exclusive takes wqh->lock
> +    // internally, and lockdep would see irqfds.lock -> wqh->lock, which
> +    // matches the reverse order in irqfd_wakeup(EPOLLHUP).
> +    // SAFETY: `kvm` is a valid pointer, and `irqfds.lock` is held.
> +    unsafe { bindings::rust_helper_kvm_irqfds_spin_release(kvm) };
> +
> +    // Add as priority waiter on eventfd.
> +    (*p).ret = add_wait_queue_priority_exclusive(wqh, &raw mut (*irqfd).wait);
> +
> +    // SAFETY: `kvm` is valid, lockdep state is restored after
> +    // releasing wait queue lock.
> +    unsafe { bindings::rust_helper_kvm_irqfds_spin_acquire(kvm) };

[Severity: Critical]
This isn't a false positive; silencing lockdep here hides the real ABBA
deadlock mentioned above between irqfd_wakeup() and this function. Could we
restructure the locking to avoid this inversion instead of hiding it from
lockdep?

[ ... ]
> +#[inline(always)]
> +unsafe fn ioeventfd_in_range(
> +    p: &Ioeventfd,
> +    addr: bindings::gpa_t,
> +    len: c_int,
> +    val: *const c_void,
> +) -> bool {
[ ... ]
> +    if let Some(datamatch) = p.datamatch() {
> +        // KVM guarantees val is naturally aligned; assert to match C's BUG_ON.
> +        debug_assert!(
> +            (val as usize) % (len as usize) == 0,
> +            "ioeventfd_in_range: val pointer not aligned to len"
> +        );
> +        let _val: u64 = match len {
> +            1 => unsafe { (val as *const u8).read() as u64 },
> +            2 => unsafe { (val as *const u16).read() as u64 },
> +            4 => unsafe { (val as *const u32).read() as u64 },
> +            8 => unsafe { (val as *const u64).read() as u64 },
> +            _ => return false,
> +        };

[Severity: Medium]
Could this result in undefined behavior? In release builds, debug_assert!
compiles away, and ptr::read() strictly requires the input pointer to be
aligned. If the pointer happens to be unaligned, this invokes undefined
behavior. Using ptr::read_unaligned() would be safer here.

[ ... ]
> +unsafe fn kvm_deassign_ioeventfd_idx(
> +    kvm: *mut bindings::kvm,
> +    bus_idx: bindings::kvm_bus,
> +    args: *const bindings::kvm_ioeventfd,
> +) -> c_int {
[ ... ]
> +            // SAFETY: kvm_io_bus_unregister_dev will trigger the iodevice destructor,
> +            // which implicitly performs list_del and kfree(p). Using `next` above avoids UAF.
> +            unsafe { bindings::kvm_io_bus_unregister_dev(kvm, bus_idx, &raw mut (*p).dev) };
> +            let bus = unsafe { bindings::kvm_get_bus(kvm, bus_idx) };
> +            if !bus.is_null() {
> +                unsafe { (*bus).ioeventfd_count -= 1 };
> +            }
> +            ret = 0;
> +            break;
> +        }
> +        // _mutex_guard drops here, unlocking before eventfd_ctx_put.
> +    }

[Severity: High]
Does this code leak the Ioeventfd and eventfd references? The assumption in
the comment that kvm_io_bus_unregister_dev() triggers the destructor is
incorrect; it only removes the device from the bus array. The caller is still
responsible for explicitly deleting the list entry and freeing the memory.

[ ... ]
> +pub unsafe extern "C" fn kvm_notify_irqfd_resampler(
> +    kvm: *mut bindings::kvm,
> +    irqchip: c_uint,
> +    pin: c_uint,
> +) -> bool {
> +    // SAFETY: kvm->irq_srcu is valid.
> +    let _srcu_guard = unsafe { SrcuGuard::new(&raw mut (*kvm).irq_srcu) };
> +    let gsi = unsafe { bindings::kvm_irq_map_chip_pin(kvm, irqchip, pin) };
> +
> +    if gsi != -1 {
> +        let mut pos = (*kvm).irqfds.resampler_list.next;
> +        let head = &raw mut (*kvm).irqfds.resampler_list;
> +        while pos != head {
> +            let resampler: *mut bindings::kvm_kernel_irqfd_resampler =
> +                crate::container_of!(pos, bindings::kvm_kernel_irqfd_resampler, link);

[Severity: High]
Can this plain pointer traversal cause a data race? Accessing an RCU list via
direct pos.next dereferences under an SRCU read lock lacks the memory barrier
provided by rcu_dereference(), which is required on weakly ordered
architectures to prevent reading uninitialized state.

[ ... ]
> +pub unsafe extern "C" fn kvm_unregister_irq_ack_notifier(
> +    kvm: *mut bindings::kvm,
> +    kian: *mut bindings::kvm_irq_ack_notifier,
> +) {
> +    // SAFETY: Caller guarantees kvm is valid.
> +    let _mutex_guard = unsafe { MutexGuard::new(&raw mut (*kvm).irq_lock) };
> +    unsafe { bindings::rust_helper_hlist_del_init_rcu(&raw mut (*kian).link) };
> +    // _mutex_guard dropped here
> +    unsafe { bindings::synchronize_srcu_expedited(&raw mut (*kvm).irq_srcu) };
> +    unsafe { bindings::kvm_arch_post_irq_ack_notifier_list_update(kvm) };
> +}

[Severity: High]
Is the _mutex_guard actually dropped where the comment claims? Because
_mutex_guard is not explicitly dropped (e.g., via drop(_mutex_guard) or
wrapping in an inner block), its lifetime extends to the end of the function
scope. This means irq_lock is held across the blocking
synchronize_srcu_expedited() call, violating the locking hierarchy and
potentially causing a deadlock.

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260626083212.39611-1-leixiang@kylinos.cn?part=1

next prev parent reply	other threads:[~2026-06-26  8:48 UTC|newest]

Thread overview: 3+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-06-26  8:29 [RFC PATCH] KVM: rust: add Rust reimplementation of eventfd leixiang
2026-06-26  8:48 ` sashiko-bot [this message]
2026-06-29  8:29 ` Onur Özkan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260626084842.172581F000E9@smtp.kernel.org \
    --to=sashiko-bot@kernel.org \
    --cc=kvm@vger.kernel.org \
    --cc=leenollei@gmail.com \
    --cc=ojeda@kernel.org \
    --cc=sashiko-reviews@lists.linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.