public inbox for qemu-devel@nongnu.org
From: Zhao Liu <zhao1.liu@intel.com>
To: Paolo Bonzini <pbonzini@redhat.com>
Cc: qemu-devel@nongnu.org, qemu-rust@nongnu.org,
	Zhao Liu <zhao1.liu@intel.com>
Subject: Re: [PATCH 0/5] rust/hpet: complete moving state out of HPETTimer
Date: Mon, 16 Mar 2026 23:21:59 +0800	[thread overview]
Message-ID: <abggFzbWA2Dfkuvm@intel.com> (raw)
In-Reply-To: <20251117084752.203219-1-pbonzini@redhat.com>

Hi,

> I'm leaving out the conversion to Mutex because, as Zhao noticed, it
> has a deadlock - the timer callback tries to grab the HPET mutex inside
> the BQL, whereas the vCPU tries to grab the BQL inside the HPET mutex.
> This is not present in the C code only because... it doesn't take the
> lock at all in places where it should.  In particular hpet_timer() reads
> and writes t->cmp and t->cmp64 outside the lock, while hpet_ram_write()
> does so within the lock via hpet_set_timer().

Regarding this issue, I'd like to continue the previous discussion and
share some of my thoughts.


Review the Current Issue
========================

To better explain the scope of locks and compare different options later,
let me go over the previous issue in more detail.

HPET's timer callback hpet_timer() currently does NOT hold s->lock (the
device mutex). This is a data race bug: it modifies t->cmp64, t->cmp,
t->wrap_flag, and calls update_irq() -- all of which race with
hpet_ram_write().

However, fixing this by adding QEMU_LOCK_GUARD(&s->lock) to hpet_timer()
creates an AB-BA deadlock:

  Path A (timer callback):  BQL → mutex
    main_loop_wait() holds BQL
    → qemu_clock_run_all_timers()
      → timerlist_run_timers() 
        → hpet_timer()
          → QEMU_LOCK_GUARD(&s->lock)    // mutex under BQL

  Path B (MMIO write):      mutex → BQL
    lockless_io = true, so no BQL on entry
    → hpet_ram_write()
      → QEMU_LOCK_GUARD(&s->lock)      // mutex first
        → update_irq()
          → BQL_LOCK_GUARD()            // BQL under mutex
            → qemu_set_irq()

I think we have three options in hand:
 * lockless timer: drop the BQL around the HPET timer handler, as
   lockless MMIO does
 * lockless irq: implement a lockless ioapic/gic/...
 * defer call: defer BQL-dependent calls until after the mutex context.

Since the Rust implementations are currently mirrors of the C code, I'll use
C-side pseudocode as the main examples.


Option 1: Lockless timer attribute
==================================

Add a QEMU_TIMER_ATTR_LOCKLESS flag. In timerlist_run_timers(), release
BQL before invoking a timer callback marked with this attribute, and
re-acquire it afterward:

    /* in timerlist_run_timers(), around cb(opaque) call: */
    bool release_bql = (ts->attributes & QEMU_TIMER_ATTR_LOCKLESS)
                       && bql_locked();
    if (release_bql) {
        bql_unlock();
    }
    cb(opaque);
    if (release_bql) {
        bql_lock();
    }

With this, HPET's timer callback runs without BQL, and can do:

    hpet_timer() {
        QEMU_LOCK_GUARD(&s->lock);   // mutex first, no BQL held
        ...
        update_irq(t, 1);            // takes BQL inside → mutex→BQL
    }

Both paths become mutex→BQL. Deadlock eliminated.

This option requires minimal change, and it also aligns well with the
concept of lockless devices: its approach is consistent with lockless
MMIO, i.e., transferring the locking implementation and responsibility
to the device itself.

But I'm not sure whether modifying the timer is a sufficiently general
and thorough approach: if other types of devices also go lockless,
would we need to convert more callbacks to be BQL-free? More BQL-free
callbacks sounds a bit messy.


Option 2: Lockless interrupt controllers (IOAPIC/PIC/GSI)
=========================================================

Make the entire GSI→IOAPIC/PIC path BQL-free by converting interrupt
controller state to atomic operations or fine-grained locks. Then
update_irq() would never need BQL, eliminating the mutex→BQL nesting
in the MMIO path entirely.

This follows Paolo's previous idea:

https://lore.kernel.org/qemu-devel/ac058be3-274a-4896-b01d-f433d036b5d0@redhat.com/

This is the most thorough solution: the IRQ path becomes fully BQL-free
and all lockless_io devices benefit automatically.

The first blocking issue I found is that, under CONFIG_XEN_EMU,
gsi_handler() calls xen_evtchn_set_gsi(), which has
assert(bql_locked()). The Xen event channel's callback_gsi logic uses a
synchronous recursion flag (setting_callback_gsi) that deeply depends
on the BQL's mutual exclusion semantics. This looks like it needs
refactoring?

I think the PIC and IOAPIC would also need device locks or atomic
operations for protection, and each of those lockless implementations
adds its own complexity.


Option 3: defer_call to reorder lock acquisition in MMIO path
=============================================================

Previously, I had a PoC (in Rust) that deferred IRQ injection until
after the mutex context:
https://gitlab.com/zhao.liu/qemu/-/tree/rust-hpet-lockless-v0-01-04-2026

I held it back and didn't post it, since I thought it wouldn't be easy
to apply this method on the C side. Then I realized that QEMU actually
already has defer_call! Although what I need is still somewhat
different (e.g., allowing repeated calls and a BQL context), its
thread-local handling is more concise and elegant than my PoC.

So it's possible to use QEMU's defer_call mechanism (with some minor
enhancements: allowing repeated calls and a BQL context) to defer
qemu_set_irq() calls until after the device mutex is released in the
MMIO write path:

    hpet_ram_write() {
        defer_call_begin();
        QEMU_LOCK_GUARD(&s->lock);
        // state updates under mutex, record pending IRQ actions
        defer_call(hpet_flush_irqs, s);
        // mutex released here
        defer_call_end();
        // hpet_flush_irqs runs here: BQL_LOCK_GUARD() + qemu_set_irq()
        // mutex and BQL are never nested
    }

One problem is that the MMIO access is then no longer atomic: with
lockless_io, multiple vCPUs can enter the MMIO handlers concurrently.
Between the mutex release and defer_call_end(), another vCPU could
modify the state, causing the deferred IRQ action to operate on stale
decisions.

This situation is relatively rare, though, and I think delayed updates
in a few such cases are also acceptable?

This option is more complex than option 1, since on the Rust side we
would need a defer_call binding. But it avoids having to justify
whether other callbacks (like timers) could or should support running
BQL-free.


Discussion
==========

Overall, I think options 1 and 3 are general lockless infrastructure
enhancements, while option 2 is an i386-specific lockless feature. So
maybe we could pick one of options 1 and 3 and decouple it from option
2, so that both could proceed in parallel? What do you think?

Thanks and Best Regards,
Zhao




