All of lore.kernel.org
 help / color / mirror / Atom feed
From: Zhao Liu <zhao1.liu@intel.com>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: qemu-devel@nongnu.org, qemu-rust@nongnu.org,
	Zhao Liu <zhao1.liu@intel.com>
Subject: Re: [PATCH 0/5] rust/hpet: complete moving state out of HPETTimer
Date: Mon, 16 Mar 2026 23:21:59 +0800	[thread overview]
Message-ID: <abggFzbWA2Dfkuvm@intel.com> (raw)
In-Reply-To: <20251117084752.203219-1-pbonzini@redhat.com>

Hi,

> I'm leaving out the conversion to Mutex because, as Zhao noticed, it
> has a deadlock - the timer callback tries to grab the HPET mutex inside
> the BQL, where as the vCPU tries to grab the BQL inside the HPET mutex.
> This is not present in the C code only because... it doesn't take the
> lock at all in places where it should.  In particular hpet_timer() reads
> and writes t->cmp and t->cmp64 outside the lock, while hpet_ram_write()
> does so within the lock via hpet_set_timer().

About this issue, I'd like to continue previous discussion and share
some my thoughts.


Review the Current Issue
========================

To better explain the scope of locks and compare different options later,
let me go over the previous issue in more detail.

HPET's timer callback hpet_timer() currently does NOT hold s->lock (the
device mutex). This is a data race bug: it modifies t->cmp64, t->cmp,
t->wrap_flag, and calls update_irq() -- all of which race with
hpet_ram_write().

However, fixing this by adding QEMU_LOCK_GUARD(&s->lock) to hpet_timer()
creates an AB-BA deadlock:

  Path A (timer callback):  BQL → mutex
    main_loop_wait() holds BQL
    → qemu_clock_run_all_timers()
      → timerlist_run_timers() 
        → hpet_timer()
          → QEMU_LOCK_GUARD(&s->lock)    // mutex under BQL

  Path B (MMIO write):      mutex → BQL
    lockless_io = true, so no BQL on entry
    → hpet_ram_write()
      → QEMU_LOCK_GUARD(&s->lock)      // mutex first
        → update_irq()
          → BQL_LOCK_GUARD()            // BQL under mutex
            → qemu_set_irq()

I think we could have these 3 options in hand:
 * lockless timer: unlock bql for HPET timer handler - like lockless MMIO did
 * lockless irq: implement lockless ioapic/gic/...
 * defer call: defer BQL related call outside of mutex context.

Since the Rust implementations are currently mirrors of the C code, I'll use
C-side pseudocode as the main examples.


Option 1: Lockless timer attribute
==================================

Add a QEMU_TIMER_ATTR_LOCKLESS flag. In timerlist_run_timers(), release
BQL before invoking a timer callback marked with this attribute, and
re-acquire it afterward:

    /* in timerlist_run_timers(), around cb(opaque) call: */
    bool release_bql = (ts->attributes & QEMU_TIMER_ATTR_LOCKLESS)
                       && bql_locked();
    if (release_bql) {
        bql_unlock();
    }
    cb(opaque);
    if (release_bql) {
        bql_lock();
    }

With this, HPET's timer callback runs without BQL, and can do:

    hpet_timer() {
        QEMU_LOCK_GUARD(&s->lock);   // mutex first, no BQL held
        ...
        update_irq(t, 1);            // takes BQL inside → mutex→BQL
    }

Both paths become mutex→BQL. Deadlock eliminated.

This option has minimal change, and it also aligns well with the concept
of lockless devices - its approach is consistent with lockless MMIO, i.e,
transferring the lock-protected implementation and responsibility to the
device itself.

But I'm not sure if modifying the timer is a sufficiently general and
thorough approach: if other types of devices also implement lockless,
would we need to convert more callbacks to bql-free? More bql-free
callbacks sounds a bit messy?


Option 2: Lockless interrupt controllers (IOAPIC/PIC/GSI)
---------------------------------------------------------

Make the entire GSI→IOAPIC/PIC path BQL-free by converting interrupt
controller state to atomic operations or fine-grained locks. Then
update_irq() would never need BQL, eliminating the mutex→BQL nesting
in the MMIO path entirely.

This refers Paolo's previous idea:

https://lore.kernel.org/qemu-devel/ac058be3-274a-4896-b01d-f433d036b5d0@redhat.com/

This is the most thorough solution: IRQ path becomes fully BQL-free and
all lockless_io devices benefit automatically.

I find the first blocking issue is gsi_handler() under CONFIG_XEN_EMU
calls xen_evtchn_set_gsi() which has assert(bql_locked()). The Xen
event channel's callback_gsi logic uses a synchronous recursive flag
(setting_callback_gsi) that deeply depends on BQL's mutual exclusion
semantics. This looks like it needs refactoring?

I think PIC and IOAPIC also need device locks or atomic operations for
protection, but their lockless implementations all have complexity.


Option 3: defer_call to reorder lock acquisition in MMIO path
=============================================================

Previously, I had a POC (in Rust) to defer IRQ injection after Mutex
context:
https://gitlab.com/zhao.liu/qemu/-/tree/rust-hpet-lockless-v0-01-04-2026

I hold it in hand and didn't post it since I thought it's not easy to
make C side apply this method. Until I realized that QEMU actually
already has defer call! Although what I need is still somewhat different
(like allowing repeated calls, BQL context), the thread-local handling
is more concise and elegant than my POC.

So it's possible to use QEMU's defer_call mechanism (with some
additional minor enhancements: allowing repeated calls, BQL context) to
defer qemu_set_irq() calls to after the device mutex is released in the
MMIO write path:

    hpet_ram_write() {
        defer_call_begin();
        QEMU_LOCK_GUARD(&s->lock);
        // state updates under mutex, record pending IRQ actions
        defer_call(hpet_flush_irqs, s);
        // mutex released here
        defer_call_end();
        // hpet_flush_irqs runs here: BQL_LOCK_GUARD() + qemu_set_irq()
        // mutex and BQL are never nested
    }

A problem is, then the MMIO is not atomic: With lockless_io, multiple
vCPUs can concurrently enter MMIO handlers. Between mutex release and
defer_call_end(), another vCPU could modify state, causing the deferred
IRQ action to operate on stale decisions.

This situation is relatively rare, and I think the delayed updates in a
few cases of state are also acceptable?

This option is more complex than option 1, since for Rust side, we will
need a defer call binding. But this option doesn't need to justify
whether other callbacks (like timer) could or should support BQL-free.


Discussion
==========

Overall, I think option 1 or 3 are general lockless infrastructure
enhancements, while option 2 is an i386-specific lockless feature. So
maybe we could choose one from options 1 and 3, decouple it from option
2, meaning both could proceed in parallel? What do you think?

Thanks and Best Regards,
Zhao





      parent reply	other threads:[~2026-03-16 14:56 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-17  8:47 [PATCH 0/5] rust/hpet: complete moving state out of HPETTimer Paolo Bonzini
2025-11-17  8:47 ` [PATCH 1/5] rust/hpet: move hidden registers to HPETTimerRegisters Paolo Bonzini
2025-11-18  8:35   ` Zhao Liu
2025-11-17  8:47 ` [PATCH 2/5] rust/hpet: move hpet_offset to HPETRegisters Paolo Bonzini
2025-11-18 13:54   ` Zhao Liu
2025-11-24  8:57   ` Zhao Liu
2025-11-17  8:47 ` [PATCH 3/5] rust/hpet: remove BqlRefCell around HPETTimer Paolo Bonzini
2025-11-19 15:17   ` Zhao Liu
2025-11-19 22:28     ` Paolo Bonzini
2025-11-17  8:47 ` [PATCH 4/5] rust: migration: implement ToMigrationState for Timer Paolo Bonzini
2025-11-20 14:31   ` Zhao Liu
2025-11-17  8:47 ` [PATCH 5/5] rust/hpet: Apply Migratable<> wrapper and ToMigrationState Paolo Bonzini
2025-11-19 15:31   ` Zhao Liu
2025-11-19 15:59 ` [PATCH 0/5] rust/hpet: complete moving state out of HPETTimer Zhao Liu
2026-03-16 15:21 ` Zhao Liu [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=abggFzbWA2Dfkuvm@intel.com \
    --to=zhao1.liu@intel.com \
    --cc=pbonzini@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=qemu-rust@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.