From: Zhao Liu <zhao1.liu@intel.com>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: qemu-devel@nongnu.org, qemu-rust@nongnu.org,
Zhao Liu <zhao1.liu@intel.com>
Subject: Re: [PATCH 0/5] rust/hpet: complete moving state out of HPETTimer
Date: Mon, 16 Mar 2026 23:21:59 +0800 [thread overview]
Message-ID: <abggFzbWA2Dfkuvm@intel.com> (raw)
In-Reply-To: <20251117084752.203219-1-pbonzini@redhat.com>
Hi,
> I'm leaving out the conversion to Mutex because, as Zhao noticed, it
> has a deadlock - the timer callback tries to grab the HPET mutex inside
> the BQL, whereas the vCPU tries to grab the BQL inside the HPET mutex.
> This is not present in the C code only because... it doesn't take the
> lock at all in places where it should. In particular hpet_timer() reads
> and writes t->cmp and t->cmp64 outside the lock, while hpet_ram_write()
> does so within the lock via hpet_set_timer().
About this issue, I'd like to continue the previous discussion and
share some of my thoughts.
Review the Current Issue
========================
To better explain the scope of locks and compare different options later,
let me go over the previous issue in more detail.
HPET's timer callback hpet_timer() currently does NOT hold s->lock (the
device mutex). This is a data-race bug: it modifies t->cmp64, t->cmp,
and t->wrap_flag, and calls update_irq() -- all of which race with
hpet_ram_write().
However, fixing this by adding QEMU_LOCK_GUARD(&s->lock) to hpet_timer()
creates an AB-BA deadlock:
Path A (timer callback): BQL → mutex

    main_loop_wait() holds BQL
    → qemu_clock_run_all_timers()
      → timerlist_run_timers()
        → hpet_timer()
          → QEMU_LOCK_GUARD(&s->lock)   // mutex under BQL

Path B (MMIO write): mutex → BQL

    lockless_io = true, so no BQL on entry
    → hpet_ram_write()
      → QEMU_LOCK_GUARD(&s->lock)       // mutex first
        → update_irq()
          → BQL_LOCK_GUARD()            // BQL under mutex
            → qemu_set_irq()
I think we have three options at hand:

* lockless timer: release the BQL for the HPET timer handler, like
  lockless MMIO does
* lockless IRQ: implement a lockless IOAPIC/GIC/...
* defer call: defer BQL-related calls outside of the mutex context.
Since the Rust implementations are currently mirrors of the C code, I'll use
C-side pseudocode as the main examples.
Option 1: Lockless timer attribute
==================================
Add a QEMU_TIMER_ATTR_LOCKLESS flag. In timerlist_run_timers(), release
BQL before invoking a timer callback marked with this attribute, and
re-acquire it afterward:
    /* in timerlist_run_timers(), around the cb(opaque) call: */
    bool release_bql = (ts->attributes & QEMU_TIMER_ATTR_LOCKLESS) &&
                       bql_locked();

    if (release_bql) {
        bql_unlock();
    }
    cb(opaque);
    if (release_bql) {
        bql_lock();
    }
With this, HPET's timer callback runs without the BQL, and can do:

    hpet_timer() {
        QEMU_LOCK_GUARD(&s->lock);   // mutex first, no BQL held
        ...
        update_irq(t, 1);            // takes BQL inside → mutex→BQL
    }
Both paths become mutex→BQL. Deadlock eliminated.
This option requires minimal change, and it also aligns well with the
concept of lockless devices - its approach is consistent with lockless
MMIO, i.e., transferring the locking responsibility to the device
itself.

But I'm not sure whether modifying the timer is a sufficiently general
and thorough approach: if other types of devices also go lockless,
would we need to convert more and more callbacks to be BQL-free? More
BQL-free callbacks sounds a bit messy.
Option 2: Lockless interrupt controllers (IOAPIC/PIC/GSI)
=========================================================
Make the entire GSI→IOAPIC/PIC path BQL-free by converting interrupt
controller state to atomic operations or fine-grained locks. Then
update_irq() would never need BQL, eliminating the mutex→BQL nesting
in the MMIO path entirely.
This follows Paolo's previous idea:
https://lore.kernel.org/qemu-devel/ac058be3-274a-4896-b01d-f433d036b5d0@redhat.com/
This is the most thorough solution: the IRQ path becomes fully BQL-free
and all lockless_io devices benefit automatically.
The first blocking issue I found is that gsi_handler(), under
CONFIG_XEN_EMU, calls xen_evtchn_set_gsi(), which has
assert(bql_locked()). The Xen event channel's callback_gsi logic uses a
synchronous recursion flag (setting_callback_gsi) that deeply depends
on the BQL's mutual-exclusion semantics. This looks like it needs
refactoring.

I think PIC and IOAPIC would also need device locks or atomic
operations for protection, but a lockless implementation of either adds
complexity.
Option 3: defer_call to reorder lock acquisition in MMIO path
=============================================================
Previously, I had a POC (in Rust) to defer IRQ injection until after
the mutex context:

https://gitlab.com/zhao.liu/qemu/-/tree/rust-hpet-lockless-v0-01-04-2026

I held it back and didn't post it, since I thought it would not be easy
to apply this method on the C side - until I realized that QEMU already
has defer_call()! Although what I need is still somewhat different
(e.g., allowing repeated calls and a BQL context), its thread-local
handling is more concise and elegant than my POC.
So it's possible to use QEMU's defer_call mechanism (with some
additional minor enhancements: allowing repeated calls, a BQL context)
to defer qemu_set_irq() calls until after the device mutex is released
in the MMIO write path:

    hpet_ram_write() {
        defer_call_begin();
        {
            QEMU_LOCK_GUARD(&s->lock);
            /* state updates under mutex; record pending IRQ actions */
            defer_call(hpet_flush_irqs, s);
        }   /* mutex released at the end of this inner scope */
        defer_call_end();
        /* hpet_flush_irqs runs here: BQL_LOCK_GUARD() + qemu_set_irq();
         * mutex and BQL are never nested */
    }
One problem is that the MMIO handler is then no longer atomic: with
lockless_io, multiple vCPUs can concurrently enter MMIO handlers.
Between the mutex release and defer_call_end(), another vCPU could
modify state, causing the deferred IRQ action to operate on a stale
decision.

This situation is relatively rare, though, and I think slightly delayed
updates for a few pieces of state are acceptable?
This option is more complex than option 1, since on the Rust side we
will need a defer_call binding. But it doesn't need to justify whether
other callbacks (like timers) could or should become BQL-free.
Discussion
==========
Overall, I think options 1 and 3 are general lockless infrastructure
enhancements, while option 2 is an i386-specific lockless feature. So
maybe we could choose one of options 1 and 3 and decouple it from
option 2, so that both could proceed in parallel? What do you think?
Thanks and Best Regards,
Zhao