* [PATCH v2 0/6] Reinvent BQL-free PIO/MMIO
@ 2025-07-30 12:39 Igor Mammedov
2025-07-30 12:39 ` [PATCH v2 1/6] memory: reintroduce BQL-free fine-grained PIO/MMIO Igor Mammedov
` (6 more replies)
0 siblings, 7 replies; 22+ messages in thread
From: Igor Mammedov @ 2025-07-30 12:39 UTC (permalink / raw)
To: qemu-devel; +Cc: pbonzini, peterx, mst, mtosatti, kraxel, peter.maydell
v2:
* Make both read and write paths BQL-less (Gerd)
* Refactor HPET to handle lock-less access correctly
when stopping/starting counter in parallel. (Peter Maydell)
* Publish kvm-unit-tests HPET bench/torture test [1] to verify
HPET lock-less handling
When booting WS2025 with the following CLI
1) -M q35,hpet=off -cpu host -enable-kvm -smp 240,sockets=4
the guest boots very slowly and is sluggish after boot,
or (most of the time) it gets stuck at the spinning circle during boot.
perf shows that the VM is experiencing heavy BQL contention on the IO path,
which happens to be ACPI PM timer read access. A variation with
HPET enabled moves the contention to HPET timer read access.
And it only gets worse with an increasing number of vCPUs.
The series prevents vCPUs of large VMs from contending on the BQL due to
PM|HPET timer access and lets Windows move on with the boot process.
Testing lock-less IO with the HPET micro benchmark [1] shows approx 80%
better performance than the current BQL-locked path.
[chart https://ibb.co/MJY9999 shows much better scaling of lock-less
IO compared to the BQL one.]
In my tests, with the CLI above, the WS2025 guest wasn't able to boot within
30 min on either of these hosts:
* 32 cores, 2 NUMA nodes
* 448 cores, 8 NUMA nodes
With the ACPI PM timer in BQL-free read mode, the guest boots within approx:
* 2 min
* 1 min
respectively.
With HPET enabled, boot time shrinks ~2x:
* 4m13 -> 2m21
* 2m19 -> 1m15
respectively.
1) "[kvm-unit-tests PATCH v4 0/5] x86: add HPET counter tests"
https://lore.kernel.org/kvm/20250725095429.1691734-1-imammedo@redhat.com/T/#t
PS:
Using the hv-time=on cpu option helps a lot (when it works) and
lets guest [1] boot fine in ~1-2 min. The series doesn't make
a significant impact in this case.
PS2:
Tested series with a bunch of different guests:
RHEL-[6..10]x64, WS2012R2, WS2016, WS2022, WS2025
PS3:
dropped mention of https://bugzilla.redhat.com/show_bug.cgi?id=1322713
as it's not reproducible with the current software stack or even with
the same qemu/seabios as reported (the kernel versions mentioned in
the report were interim ones and are no longer available,
so I've used the nearest ones released at the time for testing)
Igor Mammedov (6):
memory: reintroduce BQL-free fine-grained PIO/MMIO
acpi: mark PMTIMER as unlocked
hpet: switch to fine-grained device locking
hpet: move out main counter read into a separate block
hpet: make main counter read lock-less
kvm: i386: irqchip: take BQL only if there is an interrupt
include/system/memory.h | 10 +++++++
hw/acpi/core.c | 1 +
hw/timer/hpet.c | 64 +++++++++++++++++++++++++++++++++++------
system/memory.c | 6 ++++
system/physmem.c | 2 +-
target/i386/kvm/kvm.c | 58 +++++++++++++++++++++++--------------
6 files changed, 111 insertions(+), 30 deletions(-)
--
2.47.1
^ permalink raw reply [flat|nested] 22+ messages in thread
* [PATCH v2 1/6] memory: reintroduce BQL-free fine-grained PIO/MMIO
2025-07-30 12:39 [PATCH v2 0/6] Reinvent BQL-free PIO/MMIO Igor Mammedov
@ 2025-07-30 12:39 ` Igor Mammedov
2025-07-30 21:47 ` Peter Xu
2025-07-30 12:39 ` [PATCH v2 2/6] acpi: mark PMTIMER as unlocked Igor Mammedov
` (5 subsequent siblings)
6 siblings, 1 reply; 22+ messages in thread
From: Igor Mammedov @ 2025-07-30 12:39 UTC (permalink / raw)
To: qemu-devel; +Cc: pbonzini, peterx, mst, mtosatti, kraxel, peter.maydell
This patch brings back Jan's idea [1] of BQL-free IO access.
This will let us make access to the ACPI PM/HPET timers cheaper,
and prevent BQL contention in case of workloads that heavily
use the timers with a lot of vCPUs.
1) 196ea13104f (memory: Add global-locking property to memory regions)
... de7ea885c539 (kvm: Switch to unlocked MMIO)
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
include/system/memory.h | 10 ++++++++++
system/memory.c | 6 ++++++
system/physmem.c | 2 +-
3 files changed, 17 insertions(+), 1 deletion(-)
diff --git a/include/system/memory.h b/include/system/memory.h
index e2cd6ed126..d04366c994 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -833,6 +833,7 @@ struct MemoryRegion {
bool nonvolatile;
bool rom_device;
bool flush_coalesced_mmio;
+ bool lockless_io;
bool unmergeable;
uint8_t dirty_log_mask;
bool is_iommu;
@@ -2341,6 +2342,15 @@ void memory_region_set_flush_coalesced(MemoryRegion *mr);
*/
void memory_region_clear_flush_coalesced(MemoryRegion *mr);
+/**
+ * memory_region_enable_lockless_io: Enable lockless (BQL-free) access.
+ *
+ * Enable BQL-free access for devices with fine-grained locking.
+ *
+ * @mr: the memory region to be updated.
+ */
+void memory_region_enable_lockless_io(MemoryRegion *mr);
+
/**
* memory_region_add_eventfd: Request an eventfd to be triggered when a word
* is written to a location.
diff --git a/system/memory.c b/system/memory.c
index 5646547940..9a5a262112 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2546,6 +2546,12 @@ void memory_region_clear_flush_coalesced(MemoryRegion *mr)
}
}
+void memory_region_enable_lockless_io(MemoryRegion *mr)
+{
+ mr->lockless_io = true;
+ mr->disable_reentrancy_guard = true;
+}
+
void memory_region_add_eventfd(MemoryRegion *mr,
hwaddr addr,
unsigned size,
diff --git a/system/physmem.c b/system/physmem.c
index 130c148ffb..107871e2b3 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2909,7 +2909,7 @@ bool prepare_mmio_access(MemoryRegion *mr)
{
bool release_lock = false;
- if (!bql_locked()) {
+ if (!bql_locked() && !mr->lockless_io) {
bql_lock();
release_lock = true;
}
--
2.47.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
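For illustration, a minimal sketch of how a device model with its own
fine-grained lock could opt into the new API (the MyDev* names and ops are
hypothetical; memory_region_enable_lockless_io() is the helper added above,
the remaining calls are existing QEMU APIs):

    static const MemoryRegionOps mydev_ops = {
        .read = mydev_read,     /* must rely only on the device's own lock */
        .write = mydev_write,
        .endianness = DEVICE_LITTLE_ENDIAN,
    };

    static void mydev_init(Object *obj)
    {
        MyDevState *s = MYDEV(obj);

        /* fine-grained lock, taken inside mydev_read()/mydev_write() */
        qemu_mutex_init(&s->lock);
        memory_region_init_io(&s->iomem, obj, &mydev_ops, s, "mydev", 0x100);
        /* skip the BQL on the PIO/MMIO dispatch path for this region */
        memory_region_enable_lockless_io(&s->iomem);
        sysbus_init_mmio(SYS_BUS_DEVICE(obj), &s->iomem);
    }

Patches 2 and 3 apply this to the ACPI PM timer and HPET regions respectively.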
* [PATCH v2 2/6] acpi: mark PMTIMER as unlocked
2025-07-30 12:39 [PATCH v2 0/6] Reinvent BQL-free PIO/MMIO Igor Mammedov
2025-07-30 12:39 ` [PATCH v2 1/6] memory: reintroduce BQL-free fine-grained PIO/MMIO Igor Mammedov
@ 2025-07-30 12:39 ` Igor Mammedov
2025-07-30 12:39 ` [PATCH v2 3/6] hpet: switch to fine-grained device locking Igor Mammedov
` (4 subsequent siblings)
6 siblings, 0 replies; 22+ messages in thread
From: Igor Mammedov @ 2025-07-30 12:39 UTC (permalink / raw)
To: qemu-devel; +Cc: pbonzini, peterx, mst, mtosatti, kraxel, peter.maydell
Reading QEMU_CLOCK_VIRTUAL is thread-safe, and write access is a NOP.
This makes it possible to boot Windows with a large vCPU count when
hv-time is not used.
Reproducer:
-M q35,hpet=off -cpu host -enable-kvm -smp 240,sockets=4 -m 8G WS2025.img
fails to boot within 30 min.
With this fix it boots within approximately 1-2 min.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
hw/acpi/core.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/hw/acpi/core.c b/hw/acpi/core.c
index 58f8964e13..ff16582803 100644
--- a/hw/acpi/core.c
+++ b/hw/acpi/core.c
@@ -547,6 +547,7 @@ void acpi_pm_tmr_init(ACPIREGS *ar, acpi_update_sci_fn update_sci,
ar->tmr.timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, acpi_pm_tmr_timer, ar);
memory_region_init_io(&ar->tmr.io, memory_region_owner(parent),
&acpi_pm_tmr_ops, ar, "acpi-tmr", 4);
+ memory_region_enable_lockless_io(&ar->tmr.io);
memory_region_add_subregion(parent, 8, &ar->tmr.io);
}
--
2.47.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v2 3/6] hpet: switch to fine-grained device locking
2025-07-30 12:39 [PATCH v2 0/6] Reinvent BQL-free PIO/MMIO Igor Mammedov
2025-07-30 12:39 ` [PATCH v2 1/6] memory: reintroduce BQL-free fine-grained PIO/MMIO Igor Mammedov
2025-07-30 12:39 ` [PATCH v2 2/6] acpi: mark PMTIMER as unlocked Igor Mammedov
@ 2025-07-30 12:39 ` Igor Mammedov
2025-07-30 12:39 ` [PATCH v2 4/6] hpet: move out main counter read into a separate block Igor Mammedov
` (3 subsequent siblings)
6 siblings, 0 replies; 22+ messages in thread
From: Igor Mammedov @ 2025-07-30 12:39 UTC (permalink / raw)
To: qemu-devel; +Cc: pbonzini, peterx, mst, mtosatti, kraxel, peter.maydell
As a step towards lock-less HPET counter reads,
use per-device locking instead of the BQL.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
hw/timer/hpet.c | 6 ++++++
1 file changed, 6 insertions(+)
diff --git a/hw/timer/hpet.c b/hw/timer/hpet.c
index cb48cc151f..ab5aa59ae4 100644
--- a/hw/timer/hpet.c
+++ b/hw/timer/hpet.c
@@ -38,6 +38,7 @@
#include "hw/timer/i8254.h"
#include "system/address-spaces.h"
#include "qom/object.h"
+#include "qemu/lockable.h"
#include "trace.h"
struct hpet_fw_config hpet_fw_cfg = {.count = UINT8_MAX};
@@ -69,6 +70,7 @@ struct HPETState {
SysBusDevice parent_obj;
/*< public >*/
+ QemuMutex lock;
MemoryRegion iomem;
uint64_t hpet_offset;
bool hpet_offset_saved;
@@ -428,6 +430,7 @@ static uint64_t hpet_ram_read(void *opaque, hwaddr addr,
trace_hpet_ram_read(addr);
addr &= ~4;
+ QEMU_LOCK_GUARD(&s->lock);
/*address range of all global regs*/
if (addr <= 0xff) {
switch (addr) {
@@ -482,6 +485,7 @@ static void hpet_ram_write(void *opaque, hwaddr addr,
int len = MIN(size * 8, 64 - shift);
uint64_t old_val, new_val, cleared;
+ QEMU_LOCK_GUARD(&s->lock);
trace_hpet_ram_write(addr, value);
addr &= ~4;
@@ -679,8 +683,10 @@ static void hpet_init(Object *obj)
SysBusDevice *sbd = SYS_BUS_DEVICE(obj);
HPETState *s = HPET(obj);
+ qemu_mutex_init(&s->lock);
/* HPET Area */
memory_region_init_io(&s->iomem, obj, &hpet_ram_ops, s, "hpet", HPET_LEN);
+ memory_region_enable_lockless_io(&s->iomem);
sysbus_init_mmio(sbd, &s->iomem);
}
--
2.47.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v2 4/6] hpet: move out main counter read into a separate block
2025-07-30 12:39 [PATCH v2 0/6] Reinvent BQL-free PIO/MMIO Igor Mammedov
` (2 preceding siblings ...)
2025-07-30 12:39 ` [PATCH v2 3/6] hpet: switch to fine-grained device locking Igor Mammedov
@ 2025-07-30 12:39 ` Igor Mammedov
2025-07-30 12:39 ` [PATCH v2 5/6] hpet: make main counter read lock-less Igor Mammedov
` (2 subsequent siblings)
6 siblings, 0 replies; 22+ messages in thread
From: Igor Mammedov @ 2025-07-30 12:39 UTC (permalink / raw)
To: qemu-devel; +Cc: pbonzini, peterx, mst, mtosatti, kraxel, peter.maydell
A follow-up patch will switch the main counter read to
lock-less mode. As preparation for that, move the relevant
branch into a separate top-level block to make the follow-up
patch cleaner/simpler by reducing contextual noise
when the lock-less read is introduced.
No functional changes.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
hw/timer/hpet.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)
diff --git a/hw/timer/hpet.c b/hw/timer/hpet.c
index ab5aa59ae4..97687697c9 100644
--- a/hw/timer/hpet.c
+++ b/hw/timer/hpet.c
@@ -431,6 +431,16 @@ static uint64_t hpet_ram_read(void *opaque, hwaddr addr,
addr &= ~4;
QEMU_LOCK_GUARD(&s->lock);
+ if ((addr <= 0xff) && (addr == HPET_COUNTER)) {
+ if (hpet_enabled(s)) {
+ cur_tick = hpet_get_ticks(s);
+ } else {
+ cur_tick = s->hpet_counter;
+ }
+ trace_hpet_ram_read_reading_counter(addr & 4, cur_tick);
+ return cur_tick >> shift;
+ }
+
/*address range of all global regs*/
if (addr <= 0xff) {
switch (addr) {
@@ -438,14 +448,6 @@ static uint64_t hpet_ram_read(void *opaque, hwaddr addr,
return s->capability >> shift;
case HPET_CFG:
return s->config >> shift;
- case HPET_COUNTER:
- if (hpet_enabled(s)) {
- cur_tick = hpet_get_ticks(s);
- } else {
- cur_tick = s->hpet_counter;
- }
- trace_hpet_ram_read_reading_counter(addr & 4, cur_tick);
- return cur_tick >> shift;
case HPET_STATUS:
return s->isr >> shift;
default:
--
2.47.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
* [PATCH v2 5/6] hpet: make main counter read lock-less
2025-07-30 12:39 [PATCH v2 0/6] Reinvent BQL-free PIO/MMIO Igor Mammedov
` (3 preceding siblings ...)
2025-07-30 12:39 ` [PATCH v2 4/6] hpet: move out main counter read into a separate block Igor Mammedov
@ 2025-07-30 12:39 ` Igor Mammedov
2025-07-30 22:15 ` Peter Xu
2025-07-30 12:39 ` [PATCH v2 6/6] kvm: i386: irqchip: take BQL only if there is an interrupt Igor Mammedov
2025-07-31 21:15 ` [PATCH v2 0/6] Reinvent BQL-free PIO/MMIO Philippe Mathieu-Daudé
6 siblings, 1 reply; 22+ messages in thread
From: Igor Mammedov @ 2025-07-30 12:39 UTC (permalink / raw)
To: qemu-devel; +Cc: pbonzini, peterx, mst, mtosatti, kraxel, peter.maydell
Make access to the main HPET counter lock-less when the enable/disable
state isn't changing (which is most of the time).
A read falls back to locked access if a state change happens
in the middle of the read, or the read happens in the middle of a
state change.
This basically uses the same approach as cpu_get_clock(),
except that instead of busy-waiting it piggybacks on taking the device
lock to wait until the HPET reaches a consistent state.
As a result, a micro benchmark of concurrent reads of the HPET counter
with a large number of vCPUs shows over 80% better (lower) latency.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
hw/timer/hpet.c | 48 ++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 44 insertions(+), 4 deletions(-)
diff --git a/hw/timer/hpet.c b/hw/timer/hpet.c
index 97687697c9..d822ca1cd0 100644
--- a/hw/timer/hpet.c
+++ b/hw/timer/hpet.c
@@ -74,6 +74,7 @@ struct HPETState {
MemoryRegion iomem;
uint64_t hpet_offset;
bool hpet_offset_saved;
+ unsigned state_version;
qemu_irq irqs[HPET_NUM_IRQ_ROUTES];
uint32_t flags;
uint8_t rtc_irq_level;
@@ -430,17 +431,44 @@ static uint64_t hpet_ram_read(void *opaque, hwaddr addr,
trace_hpet_ram_read(addr);
addr &= ~4;
- QEMU_LOCK_GUARD(&s->lock);
if ((addr <= 0xff) && (addr == HPET_COUNTER)) {
- if (hpet_enabled(s)) {
- cur_tick = hpet_get_ticks(s);
- } else {
+ unsigned version;
+ bool release_lock = false;
+redo:
+ version = qatomic_load_acquire(&s->state_version);
+ if (unlikely(version & 1)) {
+ /*
+ * Updater is running, state can be inconsistent
+ * wait till it's done before reading counter
+ */
+ release_lock = true;
+ qemu_mutex_lock(&s->lock);
+ }
+
+ if (unlikely(!hpet_enabled(s))) {
cur_tick = s->hpet_counter;
+ } else {
+ cur_tick = hpet_get_ticks(s);
+ }
+
+ /*
+ * ensure counter math happens before we check version again
+ */
+ smp_rmb();
+ if (unlikely(version != qatomic_load_acquire(&s->state_version))) {
+ /*
+ * counter state has changed, re-read counter again
+ */
+ goto redo;
+ }
+ if (unlikely(release_lock)) {
+ qemu_mutex_unlock(&s->lock);
}
trace_hpet_ram_read_reading_counter(addr & 4, cur_tick);
return cur_tick >> shift;
}
+ QEMU_LOCK_GUARD(&s->lock);
/*address range of all global regs*/
if (addr <= 0xff) {
switch (addr) {
@@ -500,6 +528,11 @@ static void hpet_ram_write(void *opaque, hwaddr addr,
old_val = s->config;
new_val = deposit64(old_val, shift, len, value);
new_val = hpet_fixup_reg(new_val, old_val, HPET_CFG_WRITE_MASK);
+ /*
+ * Odd versions mark the critical section, any readers will be
+ * forced into lock protected read if they come in the middle of it
+ */
+ qatomic_inc(&s->state_version);
s->config = new_val;
if (activating_bit(old_val, new_val, HPET_CFG_ENABLE)) {
/* Enable main counter and interrupt generation. */
@@ -518,6 +551,13 @@ static void hpet_ram_write(void *opaque, hwaddr addr,
hpet_del_timer(&s->timer[i]);
}
}
+ /*
+ * even versions mark the end of critical section,
+ * any readers started before config change, but were still executed
+ * during the change, will be forced to re-read counter state
+ */
+ qatomic_inc(&s->state_version);
+
/* i8254 and RTC output pins are disabled
* when HPET is in legacy mode */
if (activating_bit(old_val, new_val, HPET_CFG_LEGACY)) {
--
2.47.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
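The version-counter protocol used by the read path above can be sketched as a
standalone program (plain C11, names are illustrative; the actual patch parks
readers on the device mutex instead of spinning when it sees an odd version):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    static _Atomic unsigned state_version;  /* odd = writer mid-update */
    static _Atomic uint64_t counter_base;   /* stand-ins for the HPET state */
    static _Atomic bool     enabled;

    /* Writer: bump to odd before mutating, back to even once consistent. */
    static void writer_update(uint64_t base, bool en)
    {
        atomic_fetch_add(&state_version, 1);
        atomic_store_explicit(&counter_base, base, memory_order_relaxed);
        atomic_store_explicit(&enabled, en, memory_order_relaxed);
        atomic_fetch_add(&state_version, 1);
    }

    /* Reader: retry if a writer was active or completed during the read. */
    static uint64_t reader_get(uint64_t now)
    {
        unsigned v;
        uint64_t val;

        for (;;) {
            v = atomic_load_explicit(&state_version, memory_order_acquire);
            if (v & 1) {
                continue;                   /* writer in progress, start over */
            }
            val = atomic_load_explicit(&counter_base, memory_order_relaxed);
            if (atomic_load_explicit(&enabled, memory_order_relaxed)) {
                val += now;
            }
            /* order the data reads before re-checking the version */
            atomic_thread_fence(memory_order_acquire);
            if (v == atomic_load_explicit(&state_version,
                                          memory_order_acquire)) {
                return val;
            }
        }
    }

    int main(void)
    {
        writer_update(100, true);
        printf("%llu\n", (unsigned long long)reader_get(5));   /* 105 */
        return 0;
    }

QEMU's include/qemu/seqlock.h wraps the same pattern, which is what the
review discussion below suggests reusing.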
* [PATCH v2 6/6] kvm: i386: irqchip: take BQL only if there is an interrupt
2025-07-30 12:39 [PATCH v2 0/6] Reinvent BQL-free PIO/MMIO Igor Mammedov
` (4 preceding siblings ...)
2025-07-30 12:39 ` [PATCH v2 5/6] hpet: make main counter read lock-less Igor Mammedov
@ 2025-07-30 12:39 ` Igor Mammedov
2025-07-31 19:24 ` Peter Xu
2025-08-01 10:26 ` Paolo Bonzini
2025-07-31 21:15 ` [PATCH v2 0/6] Reinvent BQL-free PIO/MMIO Philippe Mathieu-Daudé
6 siblings, 2 replies; 22+ messages in thread
From: Igor Mammedov @ 2025-07-30 12:39 UTC (permalink / raw)
To: qemu-devel; +Cc: pbonzini, peterx, mst, mtosatti, kraxel, peter.maydell
When kernel-irqchip=split is used, QEMU still hits the BQL
contention issue when reading the ACPI PM/HPET timers
(despite the timer[s] access being lock-less).
So Windows with more than 255 CPUs is still not able to
boot (since that requires an IOMMU -> split irqchip).
The problematic path is in kvm_arch_pre_run(), where the BQL is taken
unconditionally when the split irqchip is in use.
There are a few parts that the BQL protects there:
1. interrupt check and injection
however we do not take the BQL when checking for a pending
interrupt (even within the same function), so the patch
takes the same approach for the cpu->interrupt_request checks
and takes the BQL only if there is a job to do.
2. request_interrupt_window access
CPUState::kvm_run::request_interrupt_window doesn't need the BQL
as on the QEMU side it's accessed only by its own vCPU thread.
The only thing the BQL provides there is an implicit barrier,
which can be replaced by a cheaper explicit barrier.
3. cr8/cpu_get_apic_tpr access
the same (as #2) applies to the CPUState::kvm_run::cr8 write,
and the APIC registers are also cached/synced (get/put) within
the vCPU thread they belong to.
Taking the BQL only when necessary eliminates the BQL bottleneck on
the IO/MMIO-only exit path, improving latency by 80% on the HPET micro
benchmark.
This lets Windows boot successfully (in case hv-time isn't used)
when more than 255 vCPUs are in use.
Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
target/i386/kvm/kvm.c | 58 +++++++++++++++++++++++++++----------------
1 file changed, 37 insertions(+), 21 deletions(-)
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 369626f8c8..32024d50f5 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -5450,6 +5450,7 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
{
X86CPU *x86_cpu = X86_CPU(cpu);
CPUX86State *env = &x86_cpu->env;
+ bool release_bql = 0;
int ret;
/* Inject NMI */
@@ -5478,15 +5479,16 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
}
}
- if (!kvm_pic_in_kernel()) {
- bql_lock();
- }
/* Force the VCPU out of its inner loop to process any INIT requests
* or (for userspace APIC, but it is cheap to combine the checks here)
* pending TPR access reports.
*/
if (cpu->interrupt_request & (CPU_INTERRUPT_INIT | CPU_INTERRUPT_TPR)) {
+ if (!kvm_pic_in_kernel()) {
+ bql_lock();
+ release_bql = true;
+ }
if ((cpu->interrupt_request & CPU_INTERRUPT_INIT) &&
!(env->hflags & HF_SMM_MASK)) {
cpu->exit_request = 1;
@@ -5497,24 +5499,31 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
}
if (!kvm_pic_in_kernel()) {
- /* Try to inject an interrupt if the guest can accept it */
- if (run->ready_for_interrupt_injection &&
- (cpu->interrupt_request & CPU_INTERRUPT_HARD) &&
- (env->eflags & IF_MASK)) {
- int irq;
-
- cpu->interrupt_request &= ~CPU_INTERRUPT_HARD;
- irq = cpu_get_pic_interrupt(env);
- if (irq >= 0) {
- struct kvm_interrupt intr;
-
- intr.irq = irq;
- DPRINTF("injected interrupt %d\n", irq);
- ret = kvm_vcpu_ioctl(cpu, KVM_INTERRUPT, &intr);
- if (ret < 0) {
- fprintf(stderr,
- "KVM: injection failed, interrupt lost (%s)\n",
- strerror(-ret));
+ if (cpu->interrupt_request & CPU_INTERRUPT_HARD) {
+ if (!release_bql) {
+ bql_lock();
+ release_bql = true;
+ }
+
+ /* Try to inject an interrupt if the guest can accept it */
+ if (run->ready_for_interrupt_injection &&
+ (cpu->interrupt_request & CPU_INTERRUPT_HARD) &&
+ (env->eflags & IF_MASK)) {
+ int irq;
+
+ cpu->interrupt_request &= ~CPU_INTERRUPT_HARD;
+ irq = cpu_get_pic_interrupt(env);
+ if (irq >= 0) {
+ struct kvm_interrupt intr;
+
+ intr.irq = irq;
+ DPRINTF("injected interrupt %d\n", irq);
+ ret = kvm_vcpu_ioctl(cpu, KVM_INTERRUPT, &intr);
+ if (ret < 0) {
+ fprintf(stderr,
+ "KVM: injection failed, interrupt lost (%s)\n",
+ strerror(-ret));
+ }
}
}
}
@@ -5531,7 +5540,14 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
DPRINTF("setting tpr\n");
run->cr8 = cpu_get_apic_tpr(x86_cpu->apic_state);
+ /*
+ * make sure that request_interrupt_window/cr8 are set
+ * before KVM_RUN might read them
+ */
+ smp_mb();
+ }
+ if (release_bql) {
bql_unlock();
}
}
--
2.47.1
^ permalink raw reply related [flat|nested] 22+ messages in thread
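The locking shape introduced above (take the big lock lazily, only on the
branches that need it, and pair the lock-free fast path with an explicit
barrier) can be shown with a self-contained sketch; pthread_mutex_t stands in
for the BQL and the predicates are made-up placeholders:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Only the branches that touch shared state pay for the big lock,
     * and it is dropped exactly once at the end. */
    static void pre_run_step(bool init_pending, bool irq_pending)
    {
        bool release = false;

        if (init_pending) {
            pthread_mutex_lock(&big_lock);
            release = true;
            puts("handled INIT/TPR under the big lock");
        }
        if (irq_pending) {
            if (!release) {
                pthread_mutex_lock(&big_lock);
                release = true;
            }
            puts("injected interrupt under the big lock");
        }
        /* fast path: per-thread state only; an explicit barrier replaces
         * the ordering the unconditional lock/unlock pair used to give */
        atomic_thread_fence(memory_order_seq_cst);

        if (release) {
            pthread_mutex_unlock(&big_lock);
        }
    }

    int main(void)
    {
        pre_run_step(false, false);     /* common case: lock never taken */
        pre_run_step(false, true);
        return 0;
    }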
* Re: [PATCH v2 1/6] memory: reintroduce BQL-free fine-grained PIO/MMIO
2025-07-30 12:39 ` [PATCH v2 1/6] memory: reintroduce BQL-free fine-grained PIO/MMIO Igor Mammedov
@ 2025-07-30 21:47 ` Peter Xu
2025-07-31 8:15 ` Igor Mammedov
2025-08-01 12:42 ` Igor Mammedov
0 siblings, 2 replies; 22+ messages in thread
From: Peter Xu @ 2025-07-30 21:47 UTC (permalink / raw)
To: Igor Mammedov; +Cc: qemu-devel, pbonzini, mst, mtosatti, kraxel, peter.maydell
On Wed, Jul 30, 2025 at 02:39:29PM +0200, Igor Mammedov wrote:
> diff --git a/system/memory.c b/system/memory.c
> index 5646547940..9a5a262112 100644
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -2546,6 +2546,12 @@ void memory_region_clear_flush_coalesced(MemoryRegion *mr)
> }
> }
>
> +void memory_region_enable_lockless_io(MemoryRegion *mr)
> +{
> + mr->lockless_io = true;
> + mr->disable_reentrancy_guard = true;
IIUC this is needed only because the re-entrancy guard is not
per-transaction but per-device, am I right?
Maybe some comment would be nice here to explain how mmio concurrency could
affect this. If my above comment is correct, it could also be a TODO so we
could re-enable this when it is per-transaction (even though I don't know
whether it's easy / useful to do..).
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 5/6] hpet: make main counter read lock-less
2025-07-30 12:39 ` [PATCH v2 5/6] hpet: make main counter read lock-less Igor Mammedov
@ 2025-07-30 22:15 ` Peter Xu
2025-07-31 8:32 ` Igor Mammedov
0 siblings, 1 reply; 22+ messages in thread
From: Peter Xu @ 2025-07-30 22:15 UTC (permalink / raw)
To: Igor Mammedov; +Cc: qemu-devel, pbonzini, mst, mtosatti, kraxel, peter.maydell
On Wed, Jul 30, 2025 at 02:39:33PM +0200, Igor Mammedov wrote:
> Make access to main HPET counter lock-less when enable/disable
> state isn't changing (which is the most of the time).
>
> A read will fallback to locked access if the state change happens
> in the middle of read or read happens in the middle of the state
> change.
>
> This basically uses the same approach as cpu_get_clock(),
> modulo instead of busy wait it piggibacks to taking device lock
> to wait until HPET reaches consistent state.
The open-coded seqlock will slightly add to the complexity of the hpet code.
Is it required? IOW, is it common to have concurrent writers while reading?
How bad is it to spin on read waiting for the writer to finish?
Thanks,
>
> As result micro benchmark of concurrent reading of HPET counter
> with large number of vCPU shows over 80% better (less) latency.
>
> Signed-off-by: Igor Mammedov <imammedo@redhat.com>
> ---
> hw/timer/hpet.c | 48 ++++++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 44 insertions(+), 4 deletions(-)
>
> diff --git a/hw/timer/hpet.c b/hw/timer/hpet.c
> index 97687697c9..d822ca1cd0 100644
> --- a/hw/timer/hpet.c
> +++ b/hw/timer/hpet.c
> @@ -74,6 +74,7 @@ struct HPETState {
> MemoryRegion iomem;
> uint64_t hpet_offset;
> bool hpet_offset_saved;
> + unsigned state_version;
> qemu_irq irqs[HPET_NUM_IRQ_ROUTES];
> uint32_t flags;
> uint8_t rtc_irq_level;
> @@ -430,17 +431,44 @@ static uint64_t hpet_ram_read(void *opaque, hwaddr addr,
> trace_hpet_ram_read(addr);
> addr &= ~4;
>
> - QEMU_LOCK_GUARD(&s->lock);
> if ((addr <= 0xff) && (addr == HPET_COUNTER)) {
> - if (hpet_enabled(s)) {
> - cur_tick = hpet_get_ticks(s);
> - } else {
> + unsigned version;
> + bool release_lock = false;
> +redo:
> + version = qatomic_load_acquire(&s->state_version);
> + if (unlikely(version & 1)) {
> + /*
> + * Updater is running, state can be inconsistent
> + * wait till it's done before reading counter
> + */
> + release_lock = true;
> + qemu_mutex_lock(&s->lock);
> + }
> +
> + if (unlikely(!hpet_enabled(s))) {
> cur_tick = s->hpet_counter;
> + } else {
> + cur_tick = hpet_get_ticks(s);
> + }
> +
> + /*
> + * ensure counter math happens before we check version again
> + */
> + smp_rmb();
> + if (unlikely(version != qatomic_load_acquire(&s->state_version))) {
> + /*
> + * counter state has changed, re-read counter again
> + */
> + goto redo;
> + }
> + if (unlikely(release_lock)) {
> + qemu_mutex_unlock(&s->lock);
> }
> trace_hpet_ram_read_reading_counter(addr & 4, cur_tick);
> return cur_tick >> shift;
> }
>
> + QEMU_LOCK_GUARD(&s->lock);
> /*address range of all global regs*/
> if (addr <= 0xff) {
> switch (addr) {
> @@ -500,6 +528,11 @@ static void hpet_ram_write(void *opaque, hwaddr addr,
> old_val = s->config;
> new_val = deposit64(old_val, shift, len, value);
> new_val = hpet_fixup_reg(new_val, old_val, HPET_CFG_WRITE_MASK);
> + /*
> + * Odd versions mark the critical section, any readers will be
> + * forced into lock protected read if they come in the middle of it
> + */
> + qatomic_inc(&s->state_version);
> s->config = new_val;
> if (activating_bit(old_val, new_val, HPET_CFG_ENABLE)) {
> /* Enable main counter and interrupt generation. */
> @@ -518,6 +551,13 @@ static void hpet_ram_write(void *opaque, hwaddr addr,
> hpet_del_timer(&s->timer[i]);
> }
> }
> + /*
> + * even versions mark the end of critical section,
> + * any readers started before config change, but were still executed
> + * during the change, will be forced to re-read counter state
> + */
> + qatomic_inc(&s->state_version);
> +
> /* i8254 and RTC output pins are disabled
> * when HPET is in legacy mode */
> if (activating_bit(old_val, new_val, HPET_CFG_LEGACY)) {
> --
> 2.47.1
>
--
Peter Xu
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 1/6] memory: reintroduce BQL-free fine-grained PIO/MMIO
2025-07-30 21:47 ` Peter Xu
@ 2025-07-31 8:15 ` Igor Mammedov
2025-08-01 12:42 ` Igor Mammedov
1 sibling, 0 replies; 22+ messages in thread
From: Igor Mammedov @ 2025-07-31 8:15 UTC (permalink / raw)
To: Peter Xu; +Cc: qemu-devel, pbonzini, mst, mtosatti, kraxel, peter.maydell
On Wed, 30 Jul 2025 17:47:52 -0400
Peter Xu <peterx@redhat.com> wrote:
> On Wed, Jul 30, 2025 at 02:39:29PM +0200, Igor Mammedov wrote:
> > diff --git a/system/memory.c b/system/memory.c
> > index 5646547940..9a5a262112 100644
> > --- a/system/memory.c
> > +++ b/system/memory.c
> > @@ -2546,6 +2546,12 @@ void memory_region_clear_flush_coalesced(MemoryRegion *mr)
> > }
> > }
> >
> > +void memory_region_enable_lockless_io(MemoryRegion *mr)
> > +{
> > + mr->lockless_io = true;
> > + mr->disable_reentrancy_guard = true;
>
> IIUC this is needed only because the re-entrancy guard is not
> per-transaction but per-device, am I right?
As far as I understood, it was per memory region (device in this case).
> Maybe some comment would be nice here to explain how mmio concurrency could
> affect this. If my above comment is correct, it could also be a TODO so we
> could re-enable this when it is per-transaction (even though I don't know
> whether it's easy / useful to do..).
I can add a comment on respin.
>
> Thanks,
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 5/6] hpet: make main counter read lock-less
2025-07-30 22:15 ` Peter Xu
@ 2025-07-31 8:32 ` Igor Mammedov
2025-07-31 14:02 ` Peter Xu
0 siblings, 1 reply; 22+ messages in thread
From: Igor Mammedov @ 2025-07-31 8:32 UTC (permalink / raw)
To: Peter Xu; +Cc: qemu-devel, pbonzini, mst, mtosatti, kraxel, peter.maydell
On Wed, 30 Jul 2025 18:15:03 -0400
Peter Xu <peterx@redhat.com> wrote:
> On Wed, Jul 30, 2025 at 02:39:33PM +0200, Igor Mammedov wrote:
> > Make access to main HPET counter lock-less when enable/disable
> > state isn't changing (which is the most of the time).
> >
> > A read will fallback to locked access if the state change happens
> > in the middle of read or read happens in the middle of the state
> > change.
> >
> > This basically uses the same approach as cpu_get_clock(),
> > modulo instead of busy wait it piggibacks to taking device lock
> > to wait until HPET reaches consistent state.
>
> The open-coded seqlock will slightly add complexity of the hpet code. Is
> it required? IOW, is it common to have concurrent writters while reading?
The write path has to be lock-protected for correctness' sake even though
concurrent writers are not likely.
I've tried seqlock as well; the difference wrt seqlock is only a few LOC
and it didn't make the HPET code any simpler.
> How bad it is to spin on read waiting for the writer to finish?
That will waste CPU cycles, and on a large NUMA system it will generate
more cross-node traffic (i.e. it would scale badly, though TBH
I don't have numbers; I think measuring it would be hard as it
would drown in the noise).
Hence I've opted for a more effective option: halt readers
until the update is done (at the cost of a latency spike when that
unlikely event happens).
> Thanks,
>
> >
> > As result micro benchmark of concurrent reading of HPET counter
> > with large number of vCPU shows over 80% better (less) latency.
> >
> > Signed-off-by: Igor Mammedov <imammedo@redhat.com>
> > ---
> > hw/timer/hpet.c | 48 ++++++++++++++++++++++++++++++++++++++++++++----
> > 1 file changed, 44 insertions(+), 4 deletions(-)
> >
> > diff --git a/hw/timer/hpet.c b/hw/timer/hpet.c
> > index 97687697c9..d822ca1cd0 100644
> > --- a/hw/timer/hpet.c
> > +++ b/hw/timer/hpet.c
> > @@ -74,6 +74,7 @@ struct HPETState {
> > MemoryRegion iomem;
> > uint64_t hpet_offset;
> > bool hpet_offset_saved;
> > + unsigned state_version;
> > qemu_irq irqs[HPET_NUM_IRQ_ROUTES];
> > uint32_t flags;
> > uint8_t rtc_irq_level;
> > @@ -430,17 +431,44 @@ static uint64_t hpet_ram_read(void *opaque, hwaddr addr,
> > trace_hpet_ram_read(addr);
> > addr &= ~4;
> >
> > - QEMU_LOCK_GUARD(&s->lock);
> > if ((addr <= 0xff) && (addr == HPET_COUNTER)) {
> > - if (hpet_enabled(s)) {
> > - cur_tick = hpet_get_ticks(s);
> > - } else {
> > + unsigned version;
> > + bool release_lock = false;
> > +redo:
> > + version = qatomic_load_acquire(&s->state_version);
> > + if (unlikely(version & 1)) {
> > + /*
> > + * Updater is running, state can be inconsistent
> > + * wait till it's done before reading counter
> > + */
> > + release_lock = true;
> > + qemu_mutex_lock(&s->lock);
> > + }
> > +
> > + if (unlikely(!hpet_enabled(s))) {
> > cur_tick = s->hpet_counter;
> > + } else {
> > + cur_tick = hpet_get_ticks(s);
> > + }
> > +
> > + /*
> > + * ensure counter math happens before we check version again
> > + */
> > + smp_rmb();
> > + if (unlikely(version != qatomic_load_acquire(&s->state_version))) {
> > + /*
> > + * counter state has changed, re-read counter again
> > + */
> > + goto redo;
> > + }
> > + if (unlikely(release_lock)) {
> > + qemu_mutex_unlock(&s->lock);
> > }
> > trace_hpet_ram_read_reading_counter(addr & 4, cur_tick);
> > return cur_tick >> shift;
> > }
> >
> > + QEMU_LOCK_GUARD(&s->lock);
> > /*address range of all global regs*/
> > if (addr <= 0xff) {
> > switch (addr) {
> > @@ -500,6 +528,11 @@ static void hpet_ram_write(void *opaque, hwaddr addr,
> > old_val = s->config;
> > new_val = deposit64(old_val, shift, len, value);
> > new_val = hpet_fixup_reg(new_val, old_val, HPET_CFG_WRITE_MASK);
> > + /*
> > + * Odd versions mark the critical section, any readers will be
> > + * forced into lock protected read if they come in the middle of it
> > + */
> > + qatomic_inc(&s->state_version);
> > s->config = new_val;
> > if (activating_bit(old_val, new_val, HPET_CFG_ENABLE)) {
> > /* Enable main counter and interrupt generation. */
> > @@ -518,6 +551,13 @@ static void hpet_ram_write(void *opaque, hwaddr addr,
> > hpet_del_timer(&s->timer[i]);
> > }
> > }
> > + /*
> > + * even versions mark the end of critical section,
> > + * any readers started before config change, but were still executed
> > + * during the change, will be forced to re-read counter state
> > + */
> > + qatomic_inc(&s->state_version);
> > +
> > /* i8254 and RTC output pins are disabled
> > * when HPET is in legacy mode */
> > if (activating_bit(old_val, new_val, HPET_CFG_LEGACY)) {
> > --
> > 2.47.1
> >
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 5/6] hpet: make main counter read lock-less
2025-07-31 8:32 ` Igor Mammedov
@ 2025-07-31 14:02 ` Peter Xu
2025-08-01 8:06 ` Igor Mammedov
0 siblings, 1 reply; 22+ messages in thread
From: Peter Xu @ 2025-07-31 14:02 UTC (permalink / raw)
To: Igor Mammedov; +Cc: qemu-devel, pbonzini, mst, mtosatti, kraxel, peter.maydell
On Thu, Jul 31, 2025 at 10:32:10AM +0200, Igor Mammedov wrote:
> On Wed, 30 Jul 2025 18:15:03 -0400
> Peter Xu <peterx@redhat.com> wrote:
>
> > On Wed, Jul 30, 2025 at 02:39:33PM +0200, Igor Mammedov wrote:
> > > Make access to main HPET counter lock-less when enable/disable
> > > state isn't changing (which is the most of the time).
> > >
> > > A read will fallback to locked access if the state change happens
> > > in the middle of read or read happens in the middle of the state
> > > change.
> > >
> > > This basically uses the same approach as cpu_get_clock(),
> > > modulo instead of busy wait it piggibacks to taking device lock
> > > to wait until HPET reaches consistent state.
> >
> > The open-coded seqlock will slightly add complexity of the hpet code. Is
> > it required? IOW, is it common to have concurrent writters while reading?
>
> Write path has to be lock protected for correctness sake even though
> concurrent writers are not likely.
Right. Though we have seqlock_write_lock() for that, IIUC (even though
maybe in hpet's use case we don't need it..).
>
> I've tried seqlock as well, the difference wrt seqlock is few LOC only
> it didn't make HPET code any simpler.
I tried to do this and it still looks worthwhile, but maybe I missed
something along the lines. Please have a look if so. That is still a lot
of LOC saved; meanwhile IMHO the important part is that mem barriers are just
tricky to always hard-code in users, so I thought it would always be nice
to reuse the lock APIs whenever possible.
One example is, IIUC this current patch may have missed the mem barriers
when boosting state_version in hpet_ram_write().
>
> > How bad it is to spin on read waiting for the writer to finish?
> that will waste CPU cycles, and on large NUMA system it will generate
> more cross node traffic. (i.e. it would scale badly, though TBH
> I don't have numbers. I think measuring it would be hard as it
> would drown in the noise.)
>
> hence I've opted for a more effective option, to halt readers
> until update is done. (at the cost of latency spike when that
> unlikely event happens)
If it is extremely unlikely (IIUC, disabling HPET while someone is using /
reading the counter.. should never happen in normal production?), would
spinning read also be fine? Maybe that's also why I can save more LOCs in
the diff below.
In the diff I also removed an "addr <= 0xff" check, which might belong to a
prior patch; I thought it is not needed.
Thanks,
diff --git a/hw/timer/hpet.c b/hw/timer/hpet.c
index d822ca1cd0..09a84d19f3 100644
--- a/hw/timer/hpet.c
+++ b/hw/timer/hpet.c
@@ -39,6 +39,7 @@
#include "system/address-spaces.h"
#include "qom/object.h"
#include "qemu/lockable.h"
+#include "qemu/seqlock.h"
#include "trace.h"
struct hpet_fw_config hpet_fw_cfg = {.count = UINT8_MAX};
@@ -74,7 +75,7 @@ struct HPETState {
MemoryRegion iomem;
uint64_t hpet_offset;
bool hpet_offset_saved;
- unsigned state_version;
+ QemuSeqLock state_version;
qemu_irq irqs[HPET_NUM_IRQ_ROUTES];
uint32_t flags;
uint8_t rtc_irq_level;
@@ -431,39 +432,17 @@ static uint64_t hpet_ram_read(void *opaque, hwaddr addr,
trace_hpet_ram_read(addr);
addr &= ~4;
- if ((addr <= 0xff) && (addr == HPET_COUNTER)) {
+ if (addr == HPET_COUNTER) {
unsigned version;
- bool release_lock = false;
-redo:
- version = qatomic_load_acquire(&s->state_version);
- if (unlikely(version & 1)) {
- /*
- * Updater is running, state can be inconsistent
- * wait till it's done before reading counter
- */
- release_lock = true;
- qemu_mutex_lock(&s->lock);
- }
-
- if (unlikely(!hpet_enabled(s))) {
- cur_tick = s->hpet_counter;
- } else {
- cur_tick = hpet_get_ticks(s);
- }
-
- /*
- * ensure counter math happens before we check version again
- */
- smp_rmb();
- if (unlikely(version != qatomic_load_acquire(&s->state_version))) {
- /*
- * counter state has changed, re-read counter again
- */
- goto redo;
- }
- if (unlikely(release_lock)) {
- qemu_mutex_unlock(&s->lock);
- }
+ /* Write update is extremely rare, so spinning is fine */
+ do {
+ version = seqlock_read_begin(&s->state_version);
+ if (unlikely(!hpet_enabled(s))) {
+ cur_tick = s->hpet_counter;
+ } else {
+ cur_tick = hpet_get_ticks(s);
+ }
+ } while (seqlock_read_retry(&s->state_version, version));
trace_hpet_ram_read_reading_counter(addr & 4, cur_tick);
return cur_tick >> shift;
}
@@ -528,11 +507,7 @@ static void hpet_ram_write(void *opaque, hwaddr addr,
old_val = s->config;
new_val = deposit64(old_val, shift, len, value);
new_val = hpet_fixup_reg(new_val, old_val, HPET_CFG_WRITE_MASK);
- /*
- * Odd versions mark the critical section, any readers will be
- * forced into lock protected read if they come in the middle of it
- */
- qatomic_inc(&s->state_version);
+ seqlock_write_begin(&s->state_version);
s->config = new_val;
if (activating_bit(old_val, new_val, HPET_CFG_ENABLE)) {
/* Enable main counter and interrupt generation. */
@@ -551,12 +526,7 @@ static void hpet_ram_write(void *opaque, hwaddr addr,
hpet_del_timer(&s->timer[i]);
}
}
- /*
- * even versions mark the end of critical section,
- * any readers started before config change, but were still executed
- * during the change, will be forced to re-read counter state
- */
- qatomic_inc(&s->state_version);
+ seqlock_write_end(&s->state_version);
/* i8254 and RTC output pins are disabled
* when HPET is in legacy mode */
--
Peter Xu
^ permalink raw reply related [flat|nested] 22+ messages in thread
* Re: [PATCH v2 6/6] kvm: i386: irqchip: take BQL only if there is an interrupt
2025-07-30 12:39 ` [PATCH v2 6/6] kvm: i386: irqchip: take BQL only if there is an interrupt Igor Mammedov
@ 2025-07-31 19:24 ` Peter Xu
2025-08-01 8:42 ` Igor Mammedov
2025-08-01 10:26 ` Paolo Bonzini
1 sibling, 1 reply; 22+ messages in thread
From: Peter Xu @ 2025-07-31 19:24 UTC (permalink / raw)
To: Igor Mammedov; +Cc: qemu-devel, pbonzini, mst, mtosatti, kraxel, peter.maydell
On Wed, Jul 30, 2025 at 02:39:34PM +0200, Igor Mammedov wrote:
> when kernel-irqchip=split is used, QEMU still hits BQL
> contention issue when reading ACPI PM/HPET timers
> (despite of timer[s] access being lock-less).
>
> So Windows with more than 255 cpus is still not able to
> boot (since it requires iommu -> split irqchip).
>
> Problematic path is in kvm_arch_pre_run() where BQL is taken
> unconditionally when split irqchip is in use.
>
> There are a few parts tha BQL protects there:
> 1. interrupt check and injecting
>
> however we do not take BQL when checking for pending
> interrupt (even within the same function), so the patch
> takes the same approach for cpu->interrupt_request checks
> and takes BQL only if there is a job to do.
>
> 2. request_interrupt_window access
> CPUState::kvm_run::request_interrupt_window doesn't need BQL
> as it's accessed on side QEMU only by its own vCPU thread.
> The only thing that BQL provides there is implict barrier.
> Which can be done by using cheaper explicit barrier there.
>
> 3. cr8/cpu_get_apic_tpr access
> the same (as #2) applies to CPUState::kvm_run::cr8 write,
> and APIC registers are also cached/synced (get/put) within
> the vCPU thread it belongs to.
>
> Taking BQL only when is necessary, eleminates BQL bottleneck on
> IO/MMIO only exit path, improoving latency by 80% on HPET micro
> benchmark.
>
> This lets Windows to boot succesfully (in case hv-time isn't used)
> when more than 255 vCPUs are in use.
Not familiar with this path, but the change looks reasonable, a few pure
questions inline.
>
> Signed-off-by: Igor Mammedov <imammedo@redhat.com>
> ---
> target/i386/kvm/kvm.c | 58 +++++++++++++++++++++++++++----------------
> 1 file changed, 37 insertions(+), 21 deletions(-)
>
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index 369626f8c8..32024d50f5 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c
> @@ -5450,6 +5450,7 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
> {
> X86CPU *x86_cpu = X86_CPU(cpu);
> CPUX86State *env = &x86_cpu->env;
> + bool release_bql = 0;
> int ret;
>
> /* Inject NMI */
> @@ -5478,15 +5479,16 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
> }
> }
>
> - if (!kvm_pic_in_kernel()) {
> - bql_lock();
> - }
>
> /* Force the VCPU out of its inner loop to process any INIT requests
> * or (for userspace APIC, but it is cheap to combine the checks here)
> * pending TPR access reports.
> */
> if (cpu->interrupt_request & (CPU_INTERRUPT_INIT | CPU_INTERRUPT_TPR)) {
> + if (!kvm_pic_in_kernel()) {
> + bql_lock();
> + release_bql = true;
> + }
Does updating exit_request need bql at all? I saw the pattern is this:
kvm_arch_pre_run(cpu, run);
if (qatomic_read(&cpu->exit_request)) {
trace_kvm_interrupt_exit_request();
/*
* KVM requires us to reenter the kernel after IO exits to complete
* instruction emulation. This self-signal will ensure that we
* leave ASAP again.
*/
kvm_cpu_kick_self();
}
So exit_request=1 set here will likely be read very soon afterwards; in this
case it seems the lock isn't needed.
> if ((cpu->interrupt_request & CPU_INTERRUPT_INIT) &&
> !(env->hflags & HF_SMM_MASK)) {
> cpu->exit_request = 1;
> @@ -5497,24 +5499,31 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
> }
>
> if (!kvm_pic_in_kernel()) {
> - /* Try to inject an interrupt if the guest can accept it */
> - if (run->ready_for_interrupt_injection &&
> - (cpu->interrupt_request & CPU_INTERRUPT_HARD) &&
> - (env->eflags & IF_MASK)) {
> - int irq;
> -
> - cpu->interrupt_request &= ~CPU_INTERRUPT_HARD;
> - irq = cpu_get_pic_interrupt(env);
> - if (irq >= 0) {
> - struct kvm_interrupt intr;
> -
> - intr.irq = irq;
> - DPRINTF("injected interrupt %d\n", irq);
> - ret = kvm_vcpu_ioctl(cpu, KVM_INTERRUPT, &intr);
> - if (ret < 0) {
> - fprintf(stderr,
> - "KVM: injection failed, interrupt lost (%s)\n",
> - strerror(-ret));
> + if (cpu->interrupt_request & CPU_INTERRUPT_HARD) {
> + if (!release_bql) {
> + bql_lock();
> + release_bql = true;
> + }
> +
> + /* Try to inject an interrupt if the guest can accept it */
> + if (run->ready_for_interrupt_injection &&
> + (cpu->interrupt_request & CPU_INTERRUPT_HARD) &&
> + (env->eflags & IF_MASK)) {
> + int irq;
> +
> + cpu->interrupt_request &= ~CPU_INTERRUPT_HARD;
> + irq = cpu_get_pic_interrupt(env);
> + if (irq >= 0) {
> + struct kvm_interrupt intr;
> +
> + intr.irq = irq;
> + DPRINTF("injected interrupt %d\n", irq);
> + ret = kvm_vcpu_ioctl(cpu, KVM_INTERRUPT, &intr);
> + if (ret < 0) {
> + fprintf(stderr,
> + "KVM: injection failed, interrupt lost (%s)\n",
> + strerror(-ret));
> + }
> }
> }
> }
> @@ -5531,7 +5540,14 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
>
> DPRINTF("setting tpr\n");
> run->cr8 = cpu_get_apic_tpr(x86_cpu->apic_state);
> + /*
> + * make sure that request_interrupt_window/cr8 are set
> + * before KVM_RUN might read them
> + */
> + smp_mb();
Is this mb() needed if KVM_RUN will always happen in the same thread anyway?
Thanks,
> + }
>
> + if (release_bql) {
> bql_unlock();
> }
> }
> --
> 2.47.1
>
--
Peter Xu
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 0/6] Reinvent BQL-free PIO/MMIO
2025-07-30 12:39 [PATCH v2 0/6] Reinvent BQL-free PIO/MMIO Igor Mammedov
` (5 preceding siblings ...)
2025-07-30 12:39 ` [PATCH v2 6/6] kvm: i386: irqchip: take BQL only if there is an interrupt Igor Mammedov
@ 2025-07-31 21:15 ` Philippe Mathieu-Daudé
6 siblings, 0 replies; 22+ messages in thread
From: Philippe Mathieu-Daudé @ 2025-07-31 21:15 UTC (permalink / raw)
To: Igor Mammedov, qemu-devel
Cc: pbonzini, peterx, mst, mtosatti, kraxel, peter.maydell,
Alexander Bulekov, Bandan Das, Darren Kenny
Cc'ing Alex, Darren and Bandan.
On 30/7/25 14:39, Igor Mammedov wrote:
> v2:
> * Make both read and write pathes BQL-less (Gerd)
> * Refactor HPET to handle lock-less access correctly
> when stopping/starting counter in parallel. (Peter Maydell)
> * Publish kvm-unit-tests HPET bench/torture test [1] to verify
> HPET lock-less handling
>
> When booting WS2025 with following CLI
> 1) -M q35,hpet=off -cpu host -enable-kvm -smp 240,sockets=4
> the guest boots very slow and is sluggish after boot
> or it's stuck on boot at spinning circle (most of the time).
>
> pref shows that VM is experiencing heavy BQL contention on IO path
> which happens to be ACPI PM timer read access. A variation with
> HPET enabled moves contention to HPET timer read access.
> And it only gets worse with increasing number of VCPUs.
>
> Series prevents large VM vCPUs contending on BQL due to PM|HPET timer
> access and lets Windows to move on with boot process.
>
> Testing lock-less IO with HPET micro benchmark [1] shows approx 80%
> better performance than the current BLQ locked path.
> [chart https://ibb.co/MJY9999 shows much better scaling of lock-less
> IO compared to BQL one.]
>
> In my tests, with CLI WS2025 guest wasn't able to boot within 30min
> on both hosts
> * 32 core 2NUMA nodes
> * 448 cores 8NUMA nodes
> With ACPI PM timer in BQL-free read mode, guest boots within approx:
> * 2min
> * 1min
> respectively.
>
> With HPET enabled boot time shrinks ~2x
> * 4m13 -> 2m21
> * 2m19 -> 1m15
> respectively.
>
> 1) "[kvm-unit-tests PATCH v4 0/5] x86: add HPET counter tests"
> https://lore.kernel.org/kvm/20250725095429.1691734-1-imammedo@redhat.com/T/#t
> PS:
> Using hv-time=on cpu option helps a lot (when it works) and
> lets [1] guest boot fine in ~1-2min. Series doesn't make
> a significant impact in this case.
>
> PS2:
> Tested series with a bunch of different guests:
> RHEL-[6..10]x64, WS2012R2, WS2016, WS2022, WS2025
>
> PS3:
> dropped mention of https://bugzilla.redhat.com/show_bug.cgi?id=1322713
> as it's not reproducible with current software stack or even with
> the same qemu/seabios as reported (kernel versions mentioned in
> the report were interim ones and no longer available,
> so I've used nearest released at the time for testing)
>
> Igor Mammedov (6):
> memory: reintroduce BQL-free fine-grained PIO/MMIO
> acpi: mark PMTIMER as unlocked
> hpet: switch to fain-grained device locking
> hpet: move out main counter read into a separate block
> hpet: make main counter read lock-less
> kvm: i386: irqchip: take BQL only if there is an interrupt
>
> include/system/memory.h | 10 +++++++
> hw/acpi/core.c | 1 +
> hw/timer/hpet.c | 64 +++++++++++++++++++++++++++++++++++------
> system/memory.c | 6 ++++
> system/physmem.c | 2 +-
> target/i386/kvm/kvm.c | 58 +++++++++++++++++++++++--------------
> 6 files changed, 111 insertions(+), 30 deletions(-)
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 5/6] hpet: make main counter read lock-less
2025-07-31 14:02 ` Peter Xu
@ 2025-08-01 8:06 ` Igor Mammedov
2025-08-01 13:32 ` Peter Xu
0 siblings, 1 reply; 22+ messages in thread
From: Igor Mammedov @ 2025-08-01 8:06 UTC (permalink / raw)
To: Peter Xu; +Cc: qemu-devel, pbonzini, mst, mtosatti, kraxel, peter.maydell
On Thu, 31 Jul 2025 10:02:06 -0400
Peter Xu <peterx@redhat.com> wrote:
> On Thu, Jul 31, 2025 at 10:32:10AM +0200, Igor Mammedov wrote:
> > On Wed, 30 Jul 2025 18:15:03 -0400
> > Peter Xu <peterx@redhat.com> wrote:
> >
> > > On Wed, Jul 30, 2025 at 02:39:33PM +0200, Igor Mammedov wrote:
> > > > Make access to main HPET counter lock-less when enable/disable
> > > > state isn't changing (which is the most of the time).
> > > >
> > > > A read will fallback to locked access if the state change happens
> > > > in the middle of read or read happens in the middle of the state
> > > > change.
> > > >
> > > > This basically uses the same approach as cpu_get_clock(),
> > > > modulo instead of busy wait it piggibacks to taking device lock
> > > > to wait until HPET reaches consistent state.
> > >
> > > The open-coded seqlock will slightly add complexity of the hpet code. Is
> > > it required? IOW, is it common to have concurrent writters while reading?
> >
> > Write path has to be lock protected for correctness sake even though
> > concurrent writers are not likely.
>
> Right. Though we have seqlock_write_lock() for that, IIUC (even though
> maybe in hpet's use case we don't need it..).
>
> >
> > I've tried seqlock as well, the difference wrt seqlock is few LOC only
> > it didn't make HPET code any simpler.
>
> I tried to do this and it looks still worthwhile to do, but maybe I missed
> something alone the lines. Please have a look if so. That is still a lot
> of LOC saved, meanwhile IMHO the important part is mem barriers are just
> tricky to always hard-code in users, so I thought it would always be nice
> to reuse the lock APIs whenever possible.
I'll try it for the next respin
> One example is, IIUC this current patch may have missed the mem barriers
> when boosting state_version in hpet_ram_write().
The docs put qatomic_inc() in the 'Sequentially consistent' category,
hence no manual barrier.
Before that I used the weaker qatomic_store_release(), but
qatomic_inc() should do the increment and the store in one go.
> > > How bad it is to spin on read waiting for the writer to finish?
> > that will waste CPU cycles, and on large NUMA system it will generate
> > more cross node traffic. (i.e. it would scale badly, though TBH
> > I don't have numbers. I think measuring it would be hard as it
> > would drown in the noise.)
> >
> > hence I've opted for a more effective option, to halt readers
> > until update is done. (at the cost of latency spike when that
> > unlikely event happens)
>
> If it is extremely unlikely (IIUC, disabling HPET while someone is using /
> reading the counter.. should never happen in normal production?), would
> spinning read also be fine? Maybe that's also why I can save more LOCs in
> the diff below.
It's mostly the need for comments that goes away.
But you are right,
it's very unlikely to happen, so busy-wait vs lock probably won't matter.
>
> In the diff I also removed a "addr <= 0xff" check, that might belong to a
> prior patch that I thought is not needed.
Indeed, the check is not really needed.
>
> Thanks,
>
> diff --git a/hw/timer/hpet.c b/hw/timer/hpet.c
> index d822ca1cd0..09a84d19f3 100644
> --- a/hw/timer/hpet.c
> +++ b/hw/timer/hpet.c
> @@ -39,6 +39,7 @@
> #include "system/address-spaces.h"
> #include "qom/object.h"
> #include "qemu/lockable.h"
> +#include "qemu/seqlock.h"
> #include "trace.h"
>
> struct hpet_fw_config hpet_fw_cfg = {.count = UINT8_MAX};
> @@ -74,7 +75,7 @@ struct HPETState {
> MemoryRegion iomem;
> uint64_t hpet_offset;
> bool hpet_offset_saved;
> - unsigned state_version;
> + QemuSeqLock state_version;
> qemu_irq irqs[HPET_NUM_IRQ_ROUTES];
> uint32_t flags;
> uint8_t rtc_irq_level;
> @@ -431,39 +432,17 @@ static uint64_t hpet_ram_read(void *opaque, hwaddr addr,
> trace_hpet_ram_read(addr);
> addr &= ~4;
>
> - if ((addr <= 0xff) && (addr == HPET_COUNTER)) {
> + if (addr == HPET_COUNTER) {
> unsigned version;
> - bool release_lock = false;
> -redo:
> - version = qatomic_load_acquire(&s->state_version);
> - if (unlikely(version & 1)) {
> - /*
> - * Updater is running, state can be inconsistent
> - * wait till it's done before reading counter
> - */
> - release_lock = true;
> - qemu_mutex_lock(&s->lock);
> - }
> -
> - if (unlikely(!hpet_enabled(s))) {
> - cur_tick = s->hpet_counter;
> - } else {
> - cur_tick = hpet_get_ticks(s);
> - }
> -
> - /*
> - * ensure counter math happens before we check version again
> - */
> - smp_rmb();
> - if (unlikely(version != qatomic_load_acquire(&s->state_version))) {
> - /*
> - * counter state has changed, re-read counter again
> - */
> - goto redo;
> - }
> - if (unlikely(release_lock)) {
> - qemu_mutex_unlock(&s->lock);
> - }
> + /* Write update is extremely rare, so spinning is fine */
> + do {
> + version = seqlock_read_begin(&s->state_version);
> + if (unlikely(!hpet_enabled(s))) {
> + cur_tick = s->hpet_counter;
> + } else {
> + cur_tick = hpet_get_ticks(s);
> + }
> + } while (seqlock_read_retry(&s->state_version, version));
> trace_hpet_ram_read_reading_counter(addr & 4, cur_tick);
> return cur_tick >> shift;
> }
> @@ -528,11 +507,7 @@ static void hpet_ram_write(void *opaque, hwaddr addr,
> old_val = s->config;
> new_val = deposit64(old_val, shift, len, value);
> new_val = hpet_fixup_reg(new_val, old_val, HPET_CFG_WRITE_MASK);
> - /*
> - * Odd versions mark the critical section, any readers will be
> - * forced into lock protected read if they come in the middle of it
> - */
> - qatomic_inc(&s->state_version);
> + seqlock_write_begin(&s->state_version);
> s->config = new_val;
> if (activating_bit(old_val, new_val, HPET_CFG_ENABLE)) {
> /* Enable main counter and interrupt generation. */
> @@ -551,12 +526,7 @@ static void hpet_ram_write(void *opaque, hwaddr addr,
> hpet_del_timer(&s->timer[i]);
> }
> }
> - /*
> - * even versions mark the end of critical section,
> - * any readers started before config change, but were still executed
> - * during the change, will be forced to re-read counter state
> - */
> - qatomic_inc(&s->state_version);
> + seqlock_write_end(&s->state_version);
>
> /* i8254 and RTC output pins are disabled
> * when HPET is in legacy mode */
>
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 6/6] kvm: i386: irqchip: take BQL only if there is an interrupt
2025-07-31 19:24 ` Peter Xu
@ 2025-08-01 8:42 ` Igor Mammedov
2025-08-01 13:08 ` Paolo Bonzini
0 siblings, 1 reply; 22+ messages in thread
From: Igor Mammedov @ 2025-08-01 8:42 UTC (permalink / raw)
To: Peter Xu; +Cc: qemu-devel, pbonzini, mst, mtosatti, kraxel, peter.maydell
On Thu, 31 Jul 2025 15:24:59 -0400
Peter Xu <peterx@redhat.com> wrote:
> On Wed, Jul 30, 2025 at 02:39:34PM +0200, Igor Mammedov wrote:
> > when kernel-irqchip=split is used, QEMU still hits BQL
> > contention issue when reading ACPI PM/HPET timers
> > (despite of timer[s] access being lock-less).
> >
> > So Windows with more than 255 cpus is still not able to
> > boot (since it requires iommu -> split irqchip).
> >
> > Problematic path is in kvm_arch_pre_run() where BQL is taken
> > unconditionally when split irqchip is in use.
> >
> > There are a few parts tha BQL protects there:
> > 1. interrupt check and injecting
> >
> > however we do not take BQL when checking for pending
> > interrupt (even within the same function), so the patch
> > takes the same approach for cpu->interrupt_request checks
> > and takes BQL only if there is a job to do.
> >
> > 2. request_interrupt_window access
> > CPUState::kvm_run::request_interrupt_window doesn't need BQL
> > as it's accessed on side QEMU only by its own vCPU thread.
> > The only thing that BQL provides there is implict barrier.
> > Which can be done by using cheaper explicit barrier there.
> >
> > 3. cr8/cpu_get_apic_tpr access
> > the same (as #2) applies to CPUState::kvm_run::cr8 write,
> > and APIC registers are also cached/synced (get/put) within
> > the vCPU thread it belongs to.
> >
> > Taking BQL only when is necessary, eleminates BQL bottleneck on
> > IO/MMIO only exit path, improoving latency by 80% on HPET micro
> > benchmark.
> >
> > This lets Windows to boot succesfully (in case hv-time isn't used)
> > when more than 255 vCPUs are in use.
>
> Not familiar with this path, but the change looks reasonable, a few pure
> questions inline.
perhaps Paolo can give an expert opinion.
>
> >
> > Signed-off-by: Igor Mammedov <imammedo@redhat.com>
> > ---
> > target/i386/kvm/kvm.c | 58 +++++++++++++++++++++++++++----------------
> > 1 file changed, 37 insertions(+), 21 deletions(-)
> >
> > diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> > index 369626f8c8..32024d50f5 100644
> > --- a/target/i386/kvm/kvm.c
> > +++ b/target/i386/kvm/kvm.c
> > @@ -5450,6 +5450,7 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
> > {
> > X86CPU *x86_cpu = X86_CPU(cpu);
> > CPUX86State *env = &x86_cpu->env;
> > + bool release_bql = 0;
> > int ret;
> >
> > /* Inject NMI */
> > @@ -5478,15 +5479,16 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
> > }
> > }
> >
> > - if (!kvm_pic_in_kernel()) {
> > - bql_lock();
> > - }
> >
> > /* Force the VCPU out of its inner loop to process any INIT requests
> > * or (for userspace APIC, but it is cheap to combine the checks here)
> > * pending TPR access reports.
> > */
> > if (cpu->interrupt_request & (CPU_INTERRUPT_INIT | CPU_INTERRUPT_TPR)) {
> > + if (!kvm_pic_in_kernel()) {
> > + bql_lock();
> > + release_bql = true;
> > + }
>
> Does updating exit_request need bql at all? I saw the pattern is this:
>
> kvm_arch_pre_run(cpu, run);
> if (qatomic_read(&cpu->exit_request)) {
> trace_kvm_interrupt_exit_request();
> /*
> * KVM requires us to reenter the kernel after IO exits to complete
> * instruction emulation. This self-signal will ensure that we
> * leave ASAP again.
> */
> kvm_cpu_kick_self();
> }
>
> So setting exit_request=1 here will likely be read very soon later, in this
> case it seems the lock isn't needed.
Given I'm not familiar with the code, I'm changing the locking logic only in
places I have to and preserving the current behavior elsewhere.
Looking at the code, it seems we always hold the BQL when setting
exit_request.
>
> > if ((cpu->interrupt_request & CPU_INTERRUPT_INIT) &&
> > !(env->hflags & HF_SMM_MASK)) {
> > cpu->exit_request = 1;
> > @@ -5497,24 +5499,31 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
> > }
> >
> > if (!kvm_pic_in_kernel()) {
> > - /* Try to inject an interrupt if the guest can accept it */
> > - if (run->ready_for_interrupt_injection &&
> > - (cpu->interrupt_request & CPU_INTERRUPT_HARD) &&
> > - (env->eflags & IF_MASK)) {
> > - int irq;
> > -
> > - cpu->interrupt_request &= ~CPU_INTERRUPT_HARD;
> > - irq = cpu_get_pic_interrupt(env);
> > - if (irq >= 0) {
> > - struct kvm_interrupt intr;
> > -
> > - intr.irq = irq;
> > - DPRINTF("injected interrupt %d\n", irq);
> > - ret = kvm_vcpu_ioctl(cpu, KVM_INTERRUPT, &intr);
> > - if (ret < 0) {
> > - fprintf(stderr,
> > - "KVM: injection failed, interrupt lost (%s)\n",
> > - strerror(-ret));
> > + if (cpu->interrupt_request & CPU_INTERRUPT_HARD) {
> > + if (!release_bql) {
> > + bql_lock();
> > + release_bql = true;
> > + }
> > +
> > + /* Try to inject an interrupt if the guest can accept it */
> > + if (run->ready_for_interrupt_injection &&
> > + (cpu->interrupt_request & CPU_INTERRUPT_HARD) &&
> > + (env->eflags & IF_MASK)) {
> > + int irq;
> > +
> > + cpu->interrupt_request &= ~CPU_INTERRUPT_HARD;
> > + irq = cpu_get_pic_interrupt(env);
> > + if (irq >= 0) {
> > + struct kvm_interrupt intr;
> > +
> > + intr.irq = irq;
> > + DPRINTF("injected interrupt %d\n", irq);
> > + ret = kvm_vcpu_ioctl(cpu, KVM_INTERRUPT, &intr);
> > + if (ret < 0) {
> > + fprintf(stderr,
> > + "KVM: injection failed, interrupt lost (%s)\n",
> > + strerror(-ret));
> > + }
> > }
> > }
> > }
> > @@ -5531,7 +5540,14 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
> >
> > DPRINTF("setting tpr\n");
> > run->cr8 = cpu_get_apic_tpr(x86_cpu->apic_state);
> > + /*
> > + * make sure that request_interrupt_window/cr8 are set
> > + * before KVM_RUN might read them
> > + */
> > + smp_mb();
>
> Is this mb() needed if KVM_RUN will always happen in the same thread anyway?
it matches a similar pattern:
/* Read cpu->exit_request before KVM_RUN reads run->immediate_exit.
* Matching barrier in kvm_eat_signals.
*/
smp_rmb();
run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0);
To be on the safe side, this preserves the barrier that the BQL provided before.
I can drop it if it's not really needed.
>
> Thanks,
>
> > + }
> >
> > + if (release_bql) {
> > bql_unlock();
> > }
> > }
> > --
> > 2.47.1
> >
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 6/6] kvm: i386: irqchip: take BQL only if there is an interrupt
2025-07-30 12:39 ` [PATCH v2 6/6] kvm: i386: irqchip: take BQL only if there is an interrupt Igor Mammedov
2025-07-31 19:24 ` Peter Xu
@ 2025-08-01 10:26 ` Paolo Bonzini
2025-08-01 12:47 ` Igor Mammedov
1 sibling, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2025-08-01 10:26 UTC (permalink / raw)
To: Igor Mammedov, qemu-devel; +Cc: peterx, mst, mtosatti, kraxel, peter.maydell
The patch is not wrong but complicates things more than it should.
Also, as we do more of these tricks it may be worth adding wrapper APIs
for interrupt_request access, but that needs to be done tree-wide so you
can do it separately.
On 7/30/25 14:39, Igor Mammedov wrote:
> if (cpu->interrupt_request & (CPU_INTERRUPT_INIT | CPU_INTERRUPT_TPR)) {
> + if (!kvm_pic_in_kernel()) {
> + bql_lock();
> + release_bql = true;
> + }
This bql_lock() is not needed, all the writes in the "if" are local to
the current CPU.
When the outer bql_lock() was added, cpu_interrupt() was not thread-safe
at all, and taking the lock was needed in order to read
cpu->interrupt_request. But now it is ok to read outside the lock,
which you can use to simplify this patch a lot.
> if ((cpu->interrupt_request & CPU_INTERRUPT_INIT) &&
> !(env->hflags & HF_SMM_MASK)) {
> cpu->exit_request = 1;
A patch that changes all these accesses to
qatomic_set(&cpu->exit_request, 1), tree-wide, would be nice.
> + if (cpu->interrupt_request & CPU_INTERRUPT_HARD) {
This should be qatomic_read(&cpu->interrupt_request). Not a blocker for
now, but this is where I would suggest adding a wrapper like
cpu_test_interrupt(cpu, CPU_INTERRUPT_HARD).
> + if (!release_bql) {
> + bql_lock();
> + release_bql = true;
> + }
With the above simplification, this can be done unconditionally.
> + /* Try to inject an interrupt if the guest can accept it */
> + if (run->ready_for_interrupt_injection &&
> + (cpu->interrupt_request & CPU_INTERRUPT_HARD) &&
> + (env->eflags & IF_MASK)) {
> + int irq;
> +
> + cpu->interrupt_request &= ~CPU_INTERRUPT_HARD;
Reads and writes to cpu->interrupt_request still take the BQL, which is
consistent with include/hw/core/cpu.h, so yeah here the bql_lock() is
needed.
As above, writing it is a data race with readers outside the BQL, so
qatomic_read()/qatomic_set() would be needed to respect the C standard.
Even better could be to add a function cpu_reset_interrupt_locked() that
does
assert(bql_locked());
qatomic_set(&cpu->interrupt_request, cpu->interrupt_request & ~mask);
But neither of these wrappers (which should be applied tree-wide) are an
absolute necessity for this series.
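Spelled out as a minimal sketch (hypothetical helper, per the suggestion
above; not existing tree code):

    /* Clear bits in cpu->interrupt_request under the BQL while keeping the
     * store atomic, so concurrent lock-free readers see either the old or
     * the new value rather than racing with a plain read-modify-write. */
    static inline void cpu_reset_interrupt_locked(CPUState *cpu, int mask)
    {
        assert(bql_locked());
        qatomic_set(&cpu->interrupt_request, cpu->interrupt_request & ~mask);
    }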
> @@ -5531,7 +5540,14 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
>
> DPRINTF("setting tpr\n");
> run->cr8 = cpu_get_apic_tpr(x86_cpu->apic_state);
> + /*
> + * make sure that request_interrupt_window/cr8 are set
> + * before KVM_RUN might read them
> + */
> + smp_mb();
This is not needed, ->cr8 is only read for the same CPU (in
kvm_arch_vcpu_ioctl_run).
> + }
>
> + if (release_bql) {
> bql_unlock();
> }
And since release_bql is not needed anymore, the bql_unlock() can be
left where it was.
Paolo
> }
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 1/6] memory: reintroduce BQL-free fine-grained PIO/MMIO
2025-07-30 21:47 ` Peter Xu
2025-07-31 8:15 ` Igor Mammedov
@ 2025-08-01 12:42 ` Igor Mammedov
2025-08-01 13:19 ` Peter Xu
1 sibling, 1 reply; 22+ messages in thread
From: Igor Mammedov @ 2025-08-01 12:42 UTC (permalink / raw)
To: Peter Xu; +Cc: qemu-devel, pbonzini, mst, mtosatti, kraxel, peter.maydell
On Wed, 30 Jul 2025 17:47:52 -0400
Peter Xu <peterx@redhat.com> wrote:
> On Wed, Jul 30, 2025 at 02:39:29PM +0200, Igor Mammedov wrote:
> > diff --git a/system/memory.c b/system/memory.c
> > index 5646547940..9a5a262112 100644
> > --- a/system/memory.c
> > +++ b/system/memory.c
> > @@ -2546,6 +2546,12 @@ void memory_region_clear_flush_coalesced(MemoryRegion *mr)
> > }
> > }
> >
> > +void memory_region_enable_lockless_io(MemoryRegion *mr)
> > +{
> > + mr->lockless_io = true;
/*
* reentrancy_guard has per-device scope; when enabled it
* effectively prevents concurrent access to the device's IO
* MemoryRegion(s) by not calling the accessor callback.
*
* Turn it off for devices with lock-less IO enabled, to allow
* concurrent IO.
* TODO: remove this when reentrancy_guard becomes per-transaction.
*/
would something like this be sufficient?
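For reference, a device that wants BQL-free dispatch for one of its regions
would opt in with something like the sketch below (region, ops and size
names are illustrative, not an exact hunk from this series):

    /* device realize/init path: create the IO region, then allow
     * BQL-free dispatch for it */
    memory_region_init_io(&s->timer_io, OBJECT(s), &pm_tmr_ops, s,
                          "acpi-pm-tmr", 4);
    memory_region_enable_lockless_io(&s->timer_io);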
> > + mr->disable_reentrancy_guard = true;
>
> IIUC this is needed only because the re-entrancy guard is not
> per-transaction but per-device, am I right?
>
> Maybe some comment would be nice here to explain how mmio concurrency could
> affect this. If my above comment is correct, it could also be a TODO so we
> could re-enable this when it is per-transaction (even though I don't know
> whether it's easy / useful to do..).
>
> Thanks,
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 6/6] kvm: i386: irqchip: take BQL only if there is an interrupt
2025-08-01 10:26 ` Paolo Bonzini
@ 2025-08-01 12:47 ` Igor Mammedov
0 siblings, 0 replies; 22+ messages in thread
From: Igor Mammedov @ 2025-08-01 12:47 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: qemu-devel, peterx, mst, mtosatti, kraxel, peter.maydell
On Fri, 1 Aug 2025 12:26:26 +0200
Paolo Bonzini <pbonzini@redhat.com> wrote:
> The patch is not wrong but complicates things more than it should.
>
> Also, as we do more of these tricks it may be worth adding wrapper APIs
> for interrupt_request access, but that needs to be done tree-wide so you
> can do it separately.
Thanks,
I'll respin this with minimal changes for this series
and post another one on top with the tree-wide refactoring as suggested.
>
> On 7/30/25 14:39, Igor Mammedov wrote:
> > if (cpu->interrupt_request & (CPU_INTERRUPT_INIT | CPU_INTERRUPT_TPR)) {
> > + if (!kvm_pic_in_kernel()) {
> > + bql_lock();
> > + release_bql = true;
> > + }
>
> This bql_lock() is not needed, all the writes in the "if" are local to
> the current CPU.
>
> When the outer bql_lock() was added, cpu_interrupt() was not thread-safe
> at all, and taking the lock was needed in order to read
> cpu->interrupt_request. But now it is ok to read outside the lock,
> which you can use to simplify this patch a lot.
>
> > if ((cpu->interrupt_request & CPU_INTERRUPT_INIT) &&
> > !(env->hflags & HF_SMM_MASK)) {
> > cpu->exit_request = 1;
>
> A patch that changes all these accesses to
> qatomic_set(&cpu->exit_request, 1), tree-wide, would be nice.
>
> > + if (cpu->interrupt_request & CPU_INTERRUPT_HARD) {
>
> This should be qatomic_read(&cpu->interrupt_request). Not a blocker for
> now, but this is where I would suggest adding a wrapper like
> cpu_test_interrupt(cpu, CPU_INTERRUPT_HARD).
>
> > + if (!release_bql) {
> > + bql_lock();
> > + release_bql = true;
> > + }
>
> With the above simplification, this can be done unconditionally.
>
> > + /* Try to inject an interrupt if the guest can accept it */
> > + if (run->ready_for_interrupt_injection &&
> > + (cpu->interrupt_request & CPU_INTERRUPT_HARD) &&
> > + (env->eflags & IF_MASK)) {
> > + int irq;
> > +
> > + cpu->interrupt_request &= ~CPU_INTERRUPT_HARD;
>
> Reads and writes to cpu->interrupt_request still take the BQL, which is
> consistent with include/hw/core/cpu.h, so yeah here the bql_lock() is
> needed.
>
> As above, writing it is a data race with readers outside the BQL, so
> qatomic_read()/qatomic_set() would be needed to respect the C standard.
> Even better could be to add a function cpu_reset_interrupt_locked() that
> does
>
> assert(bql_locked());
> qatomic_set(&cpu->interrupt_request, cpu->interrupt_request & ~mask);
>
> But neither of these wrappers (which should be applied tree-wide) are an
> absolute necessity for this series.
>
> > @@ -5531,7 +5540,14 @@ void kvm_arch_pre_run(CPUState *cpu, struct kvm_run *run)
> >
> > DPRINTF("setting tpr\n");
> > run->cr8 = cpu_get_apic_tpr(x86_cpu->apic_state);
> > + /*
> > + * make sure that request_interrupt_window/cr8 are set
> > + * before KVM_RUN might read them
> > + */
> > + smp_mb();
>
> This is not needed, ->cr8 is only read for the same CPU (in
> kvm_arch_vcpu_ioctl_run).
>
> > + }
> >
> > + if (release_bql) {
> > bql_unlock();
> > }
>
> And since release_bql is not needed anymore, the bql_unlock() can be
> left where it was.
>
> Paolo
>
> > }
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 6/6] kvm: i386: irqchip: take BQL only if there is an interrupt
2025-08-01 8:42 ` Igor Mammedov
@ 2025-08-01 13:08 ` Paolo Bonzini
0 siblings, 0 replies; 22+ messages in thread
From: Paolo Bonzini @ 2025-08-01 13:08 UTC (permalink / raw)
To: Igor Mammedov, Peter Xu; +Cc: qemu-devel, mst, mtosatti, kraxel, peter.maydell
On 8/1/25 10:42, Igor Mammedov wrote:
> Looking at the code, it seems we always hold the BQL when setting
> exit_request.
While the BQL is taken, it is for other reasons (e.g. because of
cpu->halt_cond).
In this case it's set and read from within the same thread so it's okay.
> it matches with similar pattern:
>
> /* Read cpu->exit_request before KVM_RUN reads run->immediate_exit.
> * Matching barrier in kvm_eat_signals.
> */
> smp_rmb();
>
> run_ret = kvm_vcpu_ioctl(cpu, KVM_RUN, 0);
>
> to be on the safe side, this is preserving barrier that BQL has provided before
> I can drop it if it's not really needed.
That comment is wrong... The actual pairing here is with cpu_exit(),
though the logic of cpu_exit() is messed up and only fully works for
TCG, and immediate_exit does not matter at all. I'll clean it up and
write a comment.
A correct ordering would be:
(a) store other flags that will be checked if cpu->exit_request is 1
(b) cpu_exit(): store-release cpu->exit_request
(c) cpu_interrupt(): store-release cpu->interrupt_request
- broadcast cpu->halt_cond if needed; right now it's always done in
qemu_cpu_kick()
>>> now you can release the BQL
(d) do the accelerator-specific kick (e.g. write icount_decr for TCG,
pthread_kill for KVM, etc.)
The other side then does the checks in the opposite direction:
(d) the accelerator's execution loop exits thanks to the kick
(c) then check cpu->interrupt_request - any work that's needed here may
take the BQL or not, and may set cpu->exit_request
(b) then check cpu->exit_request to see if it should do slow path work
(a) then (under the BQL) it possibly goes to sleep, waiting on
cpu->halt_cond.
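Roughly, in code (illustrative only, using QEMU's qatomic helpers; not the
actual tree code):

    /* kicker side, e.g. cpu_exit(): (b) publish the request, then (d) kick */
    qatomic_store_release(&cpu->exit_request, 1);
    qemu_cpu_kick(cpu);

    /* vCPU thread side, after the kick forces it out of the run loop: */
    if (qatomic_load_acquire(&cpu->exit_request)) {
        /* (b)->(a): slow-path work, possibly sleeping on cpu->halt_cond
         * under the BQL */
    }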
cpu->exit_request and cpu->interrupt_request are not a
load-acquire/store-release pair right now, but they should be. Probably
everything is protected one way or the other by the BQL, but it's not clear.
I'll handle cpu->exit_request and leave cpu->interrupt_request to you.
For the sake of this series, please do the following:
- contrary to what I said in my earlier review, do introduce a
cpu_test_interrupt() now and make it use a load-acquire. There aren't
many occurrences:
accel/tcg/cpu-exec.c: if (unlikely(qatomic_read(&cpu->interrupt_request))) {
target/alpha/cpu.c: return cs->interrupt_request & (CPU_INTERRUPT_HARD
target/arm/cpu.c: && cs->interrupt_request &
target/arm/hvf/hvf.c: if (cpu->interrupt_request & (CPU_INTERRUPT_HARD | CPU_INTERRUPT_FIQ)) {
target/avr/cpu.c: return (cs->interrupt_request & (CPU_INTERRUPT_HARD | CPU_INTERRUPT_RESET))
target/hppa/cpu.c: return cs->interrupt_request & (CPU_INTERRUPT_HARD | CPU_INTERRUPT_NMI);
target/i386/kvm/kvm.c: if (cpu->interrupt_request & (CPU_INTERRUPT_NMI | CPU_INTERRUPT_SMI)) {
target/i386/kvm/kvm.c: if (cpu->interrupt_request & (CPU_INTERRUPT_INIT | CPU_INTERRUPT_TPR)) {
target/i386/nvmm/nvmm-all.c: if (cpu->interrupt_request & (CPU_INTERRUPT_INIT | CPU_INTERRUPT_TPR)) {
target/i386/whpx/whpx-all.c: cpu->interrupt_request & (CPU_INTERRUPT_NMI | CPU_INTERRUPT_SMI)) {
target/i386/whpx/whpx-all.c: if (cpu->interrupt_request & (CPU_INTERRUPT_INIT | CPU_INTERRUPT_TPR)) {
target/microblaze/cpu.c: return cs->interrupt_request & (CPU_INTERRUPT_HARD | CPU_INTERRUPT_NMI);
target/openrisc/cpu.c: return cs->interrupt_request & (CPU_INTERRUPT_HARD |
target/rx/cpu.c: return cs->interrupt_request &
- in tcg_handle_interrupt and generic_handle_interrupt, change like this
/* Pairs with load_acquire in cpu_test_interrupt(). */
qatomic_store_release(&cpu->interrupt_request,
cpu->interrupt_request | mask);
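A minimal sketch of the matching read-side wrapper (hypothetical; the exact
name and placement are up to the tree-wide patch):

    /* Test interrupt_request bits without holding the BQL; the load-acquire
     * pairs with the store-release in tcg_handle_interrupt() /
     * generic_handle_interrupt() above. */
    static inline bool cpu_test_interrupt(CPUState *cpu, int mask)
    {
        return (qatomic_load_acquire(&cpu->interrupt_request) & mask) != 0;
    }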
I'll take care of properly adding the store-release/load-acquire
for exit_request and removing the unnecessary memory barriers in kvm-all.c.
Paolo
>>
>> Thanks,
>>
>>> + }
>>>
>>> + if (release_bql) {
>>> bql_unlock();
>>> }
>>> }
>>> --
>>> 2.47.1
>>>
>>
>
>
>
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 1/6] memory: reintroduce BQL-free fine-grained PIO/MMIO
2025-08-01 12:42 ` Igor Mammedov
@ 2025-08-01 13:19 ` Peter Xu
0 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2025-08-01 13:19 UTC (permalink / raw)
To: Igor Mammedov; +Cc: qemu-devel, pbonzini, mst, mtosatti, kraxel, peter.maydell
On Fri, Aug 01, 2025 at 02:42:26PM +0200, Igor Mammedov wrote:
> On Wed, 30 Jul 2025 17:47:52 -0400
> Peter Xu <peterx@redhat.com> wrote:
>
> > On Wed, Jul 30, 2025 at 02:39:29PM +0200, Igor Mammedov wrote:
> > > diff --git a/system/memory.c b/system/memory.c
> > > index 5646547940..9a5a262112 100644
> > > --- a/system/memory.c
> > > +++ b/system/memory.c
> > > @@ -2546,6 +2546,12 @@ void memory_region_clear_flush_coalesced(MemoryRegion *mr)
> > > }
> > > }
> > >
> > > +void memory_region_enable_lockless_io(MemoryRegion *mr)
> > > +{
> > > + mr->lockless_io = true;
>
> /*
> * reentrancy_guard has per-device scope; when enabled it
> * effectively prevents concurrent access to the device's IO
> * MemoryRegion(s) by not calling the accessor callback.
> *
> * Turn it off for devices with lock-less IO enabled, to allow
> * concurrent IO.
> * TODO: remove this when reentrancy_guard becomes per-transaction.
> */
>
> would something like this be sufficient?
Looks good to me, thanks!
--
Peter Xu
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [PATCH v2 5/6] hpet: make main counter read lock-less
2025-08-01 8:06 ` Igor Mammedov
@ 2025-08-01 13:32 ` Peter Xu
0 siblings, 0 replies; 22+ messages in thread
From: Peter Xu @ 2025-08-01 13:32 UTC (permalink / raw)
To: Igor Mammedov; +Cc: qemu-devel, pbonzini, mst, mtosatti, kraxel, peter.maydell
On Fri, Aug 01, 2025 at 10:06:45AM +0200, Igor Mammedov wrote:
> docs put qatomic_inc() in 'Sequentially consistent' category,
> hence no manual barrier.
True... I somehow read it as a _set(). Maybe it's another way to prove
memory barriers are error-prone, by making a mistake myself. :)
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 22+ messages in thread
end of thread, other threads:[~2025-08-01 13:40 UTC | newest]
Thread overview: 22+ messages
2025-07-30 12:39 [PATCH v2 0/6] Reinvent BQL-free PIO/MMIO Igor Mammedov
2025-07-30 12:39 ` [PATCH v2 1/6] memory: reintroduce BQL-free fine-grained PIO/MMIO Igor Mammedov
2025-07-30 21:47 ` Peter Xu
2025-07-31 8:15 ` Igor Mammedov
2025-08-01 12:42 ` Igor Mammedov
2025-08-01 13:19 ` Peter Xu
2025-07-30 12:39 ` [PATCH v2 2/6] acpi: mark PMTIMER as unlocked Igor Mammedov
2025-07-30 12:39 ` [PATCH v2 3/6] hpet: switch to fain-grained device locking Igor Mammedov
2025-07-30 12:39 ` [PATCH v2 4/6] hpet: move out main counter read into a separate block Igor Mammedov
2025-07-30 12:39 ` [PATCH v2 5/6] hpet: make main counter read lock-less Igor Mammedov
2025-07-30 22:15 ` Peter Xu
2025-07-31 8:32 ` Igor Mammedov
2025-07-31 14:02 ` Peter Xu
2025-08-01 8:06 ` Igor Mammedov
2025-08-01 13:32 ` Peter Xu
2025-07-30 12:39 ` [PATCH v2 6/6] kvm: i386: irqchip: take BQL only if there is an interrupt Igor Mammedov
2025-07-31 19:24 ` Peter Xu
2025-08-01 8:42 ` Igor Mammedov
2025-08-01 13:08 ` Paolo Bonzini
2025-08-01 10:26 ` Paolo Bonzini
2025-08-01 12:47 ` Igor Mammedov
2025-07-31 21:15 ` [PATCH v2 0/6] Reinvent BQL-free PIO/MMIO Philippe Mathieu-Daudé