qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] Reinvent BQL-free PIO/MMIO
@ 2025-06-20 15:14 Igor Mammedov
  2025-06-20 15:14 ` [PATCH 1/3] memory: reintroduce BQL-free fine-grained PIO/MMIO Igor Mammedov
                   ` (2 more replies)
  0 siblings, 3 replies; 15+ messages in thread
From: Igor Mammedov @ 2025-06-20 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: kraxel, mst, anisinha, elena.ufimtseva, jag.raman, pbonzini,
	peterx, david, philmd

When booting WS2025 with following CLI
 1)   -M q35,hpet=off -cpu host -enable-kvm -smp 240,sockets=4
the guest boots very slow and is sluggish after boot
or it's stuck on boot at spinning circle.

pref shows that VM is experiencing heavy BQL contention on IO path
which happens to be ACPI PM timer read access. A variation with
HPET enabled moves contention to HPET timer read access.
And it only gets worse with increasing number of VCPUs.

Series prevents large VM vCPUs contending on BQL due to PM|HPET timer
access and lets it move on with boot process.

It's basically resurecting Jan's idea [2] of BQL-free IO access,
with a twist that whitelist reads access only to minimze chances
of [3] issues. If we hit issues later on, it might be better
to actually find a root cause and perhaps fix guest side if
possible. (so I tested series with a bucn of different guests
instead, RHEL-[6..10]x64, WS2012R2, WS2016, WS2022, WS2025)

In my tests, with [1] CLI WS2025 guest wasn't able to boot within 30min
on both hosts
  * 32 core 2NUMA nodes
  * 448 cores 8NUMA nodes
With ACPI PM timer in BQL-free read mode, guest boots within approx:
 * 2min
 * 1min
respectively.

With HPET enabled boot time shrinks ~2x
 * 4m13 -> 2m21
 * 2m19 -> 1m15
respectively.

Using hv-time=on cpu option helps a lot (when it works) and
lets [1] guest boot fine in ~1-2min (this series reduces
boot time on ~10% on top of hv-time improvements).

2) 196ea13104f (memory: Add global-locking property to memory regions)
  ... de7ea885c539 (kvm: Switch to unlocked MMIO)
3) https://bugzilla.redhat.com/show_bug.cgi?id=1322713
  1beb99f787 (Revert "acpi: mark PMTIMER as unlocked")

Igor Mammedov (3):
  memory: reintroduce BQL-free fine-grained PIO/MMIO
  acpi: mark PMTIMER as unlocked
  mark HPET as unlocked

 include/system/memory.h   | 10 +++++++++-
 hw/acpi/core.c            |  2 ++
 hw/remote/vfio-user-obj.c |  2 +-
 hw/timer/hpet.c           |  2 ++
 system/memory.c           |  5 +++++
 system/memory_ldst.c.inc  | 18 +++++++++---------
 system/physmem.c          |  9 +++++----
 7 files changed, 33 insertions(+), 15 deletions(-)

--
2.43.5



^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/3] memory: reintroduce BQL-free fine-grained PIO/MMIO
  2025-06-20 15:14 [PATCH 0/3] Reinvent BQL-free PIO/MMIO Igor Mammedov
@ 2025-06-20 15:14 ` Igor Mammedov
  2025-06-20 16:53   ` Peter Xu
  2025-06-20 15:14 ` [PATCH 2/3] acpi: mark PMTIMER as unlocked Igor Mammedov
  2025-06-20 15:14 ` [PATCH 3/3] mark HPET " Igor Mammedov
  2 siblings, 1 reply; 15+ messages in thread
From: Igor Mammedov @ 2025-06-20 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: kraxel, mst, anisinha, elena.ufimtseva, jag.raman, pbonzini,
	peterx, david, philmd

This patch brings back Jan's idea [1] of BQL-free IO access,
with a twist that whitelist read access only.

(as BQL-free write access in [1] used to cause issues [2]
and still does (Windows crash) if write path is not lock protected)

However with limiting it read path only, guest boots without issues.
This will let us make read access ACPI PM/HPET timers cheaper,
and prevent guest VCPUs BQL contention in case of workload
that heavily uses the timers.

2) 196ea13104f (memory: Add global-locking property to memory regions)
   ... de7ea885c539 (kvm: Switch to unlocked MMIO)
3) https://bugzilla.redhat.com/show_bug.cgi?id=1322713
   1beb99f787 (Revert "acpi: mark PMTIMER as unlocked")

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
 include/system/memory.h   | 10 +++++++++-
 hw/remote/vfio-user-obj.c |  2 +-
 system/memory.c           |  5 +++++
 system/memory_ldst.c.inc  | 18 +++++++++---------
 system/physmem.c          |  9 +++++----
 5 files changed, 29 insertions(+), 15 deletions(-)

diff --git a/include/system/memory.h b/include/system/memory.h
index fc35a0dcad..1afabf2d94 100644
--- a/include/system/memory.h
+++ b/include/system/memory.h
@@ -775,6 +775,7 @@ struct MemoryRegion {
     bool nonvolatile;
     bool rom_device;
     bool flush_coalesced_mmio;
+    bool lockless_ro_io;
     bool unmergeable;
     uint8_t dirty_log_mask;
     bool is_iommu;
@@ -2253,6 +2254,13 @@ void memory_region_set_flush_coalesced(MemoryRegion *mr);
  */
 void memory_region_clear_flush_coalesced(MemoryRegion *mr);
 
+/**
+ * memory_region_enable_lockless_ro_io: Enable lockless (BQL) read-only acceess.
+ *
+ * Enable BQL-free readonly access for devices with fine-grained locking.
+ */
+void memory_region_enable_lockless_ro_io(MemoryRegion *mr);
+
 /**
  * memory_region_add_eventfd: Request an eventfd to be triggered when a word
  *                            is written to a location.
@@ -3002,7 +3010,7 @@ MemTxResult address_space_write_cached_slow(MemoryRegionCache *cache,
                                             hwaddr len);
 
 int memory_access_size(MemoryRegion *mr, unsigned l, hwaddr addr);
-bool prepare_mmio_access(MemoryRegion *mr);
+bool prepare_mmio_access(MemoryRegion *mr, bool read);
 
 static inline bool memory_region_supports_direct_access(MemoryRegion *mr)
 {
diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
index ea6165ebdc..936a9befd4 100644
--- a/hw/remote/vfio-user-obj.c
+++ b/hw/remote/vfio-user-obj.c
@@ -381,7 +381,7 @@ static int vfu_object_mr_rw(MemoryRegion *mr, uint8_t *buf, hwaddr offset,
          * The read/write logic used below is similar to the ones in
          * flatview_read/write_continue()
          */
-        release_lock = prepare_mmio_access(mr);
+        release_lock = prepare_mmio_access(mr, !is_write);
 
         access_size = memory_access_size(mr, size, offset);
 
diff --git a/system/memory.c b/system/memory.c
index 63b983efcd..5192195473 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -2558,6 +2558,11 @@ void memory_region_clear_flush_coalesced(MemoryRegion *mr)
     }
 }
 
+void memory_region_enable_lockless_ro_io(MemoryRegion *mr)
+{
+    mr->lockless_ro_io = true;
+}
+
 void memory_region_add_eventfd(MemoryRegion *mr,
                                hwaddr addr,
                                unsigned size,
diff --git a/system/memory_ldst.c.inc b/system/memory_ldst.c.inc
index 7f32d3d9ff..919357578c 100644
--- a/system/memory_ldst.c.inc
+++ b/system/memory_ldst.c.inc
@@ -35,7 +35,7 @@ static inline uint32_t glue(address_space_ldl_internal, SUFFIX)(ARG1_DECL,
     RCU_READ_LOCK();
     mr = TRANSLATE(addr, &addr1, &l, false, attrs);
     if (l < 4 || !memory_access_is_direct(mr, false, attrs)) {
-        release_lock |= prepare_mmio_access(mr);
+        release_lock |= prepare_mmio_access(mr, true);
 
         /* I/O case */
         r = memory_region_dispatch_read(mr, addr1, &val,
@@ -104,7 +104,7 @@ static inline uint64_t glue(address_space_ldq_internal, SUFFIX)(ARG1_DECL,
     RCU_READ_LOCK();
     mr = TRANSLATE(addr, &addr1, &l, false, attrs);
     if (l < 8 || !memory_access_is_direct(mr, false, attrs)) {
-        release_lock |= prepare_mmio_access(mr);
+        release_lock |= prepare_mmio_access(mr, true);
 
         /* I/O case */
         r = memory_region_dispatch_read(mr, addr1, &val,
@@ -171,7 +171,7 @@ uint8_t glue(address_space_ldub, SUFFIX)(ARG1_DECL,
     RCU_READ_LOCK();
     mr = TRANSLATE(addr, &addr1, &l, false, attrs);
     if (!memory_access_is_direct(mr, false, attrs)) {
-        release_lock |= prepare_mmio_access(mr);
+        release_lock |= prepare_mmio_access(mr, true);
 
         /* I/O case */
         r = memory_region_dispatch_read(mr, addr1, &val, MO_8, attrs);
@@ -208,7 +208,7 @@ static inline uint16_t glue(address_space_lduw_internal, SUFFIX)(ARG1_DECL,
     RCU_READ_LOCK();
     mr = TRANSLATE(addr, &addr1, &l, false, attrs);
     if (l < 2 || !memory_access_is_direct(mr, false, attrs)) {
-        release_lock |= prepare_mmio_access(mr);
+        release_lock |= prepare_mmio_access(mr, true);
 
         /* I/O case */
         r = memory_region_dispatch_read(mr, addr1, &val,
@@ -278,7 +278,7 @@ void glue(address_space_stl_notdirty, SUFFIX)(ARG1_DECL,
     RCU_READ_LOCK();
     mr = TRANSLATE(addr, &addr1, &l, true, attrs);
     if (l < 4 || !memory_access_is_direct(mr, true, attrs)) {
-        release_lock |= prepare_mmio_access(mr);
+        release_lock |= prepare_mmio_access(mr, false);
 
         r = memory_region_dispatch_write(mr, addr1, val, MO_32, attrs);
     } else {
@@ -315,7 +315,7 @@ static inline void glue(address_space_stl_internal, SUFFIX)(ARG1_DECL,
     RCU_READ_LOCK();
     mr = TRANSLATE(addr, &addr1, &l, true, attrs);
     if (l < 4 || !memory_access_is_direct(mr, true, attrs)) {
-        release_lock |= prepare_mmio_access(mr);
+        release_lock |= prepare_mmio_access(mr, false);
         r = memory_region_dispatch_write(mr, addr1, val,
                                          MO_32 | devend_memop(endian), attrs);
     } else {
@@ -378,7 +378,7 @@ void glue(address_space_stb, SUFFIX)(ARG1_DECL,
     RCU_READ_LOCK();
     mr = TRANSLATE(addr, &addr1, &l, true, attrs);
     if (!memory_access_is_direct(mr, true, attrs)) {
-        release_lock |= prepare_mmio_access(mr);
+        release_lock |= prepare_mmio_access(mr, false);
         r = memory_region_dispatch_write(mr, addr1, val, MO_8, attrs);
     } else {
         /* RAM case */
@@ -411,7 +411,7 @@ static inline void glue(address_space_stw_internal, SUFFIX)(ARG1_DECL,
     RCU_READ_LOCK();
     mr = TRANSLATE(addr, &addr1, &l, true, attrs);
     if (l < 2 || !memory_access_is_direct(mr, true, attrs)) {
-        release_lock |= prepare_mmio_access(mr);
+        release_lock |= prepare_mmio_access(mr, false);
         r = memory_region_dispatch_write(mr, addr1, val,
                                          MO_16 | devend_memop(endian), attrs);
     } else {
@@ -475,7 +475,7 @@ static void glue(address_space_stq_internal, SUFFIX)(ARG1_DECL,
     RCU_READ_LOCK();
     mr = TRANSLATE(addr, &addr1, &l, true, attrs);
     if (l < 8 || !memory_access_is_direct(mr, true, attrs)) {
-        release_lock |= prepare_mmio_access(mr);
+        release_lock |= prepare_mmio_access(mr, false);
         r = memory_region_dispatch_write(mr, addr1, val,
                                          MO_64 | devend_memop(endian), attrs);
     } else {
diff --git a/system/physmem.c b/system/physmem.c
index a8a9ca309e..60e330de99 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -2881,11 +2881,12 @@ int memory_access_size(MemoryRegion *mr, unsigned l, hwaddr addr)
     return l;
 }
 
-bool prepare_mmio_access(MemoryRegion *mr)
+bool prepare_mmio_access(MemoryRegion *mr, bool read)
 {
     bool release_lock = false;
 
-    if (!bql_locked()) {
+    if (!bql_locked() &&
+        !(read && mr->lockless_ro_io == true)) {
         bql_lock();
         release_lock = true;
     }
@@ -2935,7 +2936,7 @@ static MemTxResult flatview_write_continue_step(MemTxAttrs attrs,
     if (!memory_access_is_direct(mr, true, attrs)) {
         uint64_t val;
         MemTxResult result;
-        bool release_lock = prepare_mmio_access(mr);
+        bool release_lock = prepare_mmio_access(mr, false);
 
         *l = memory_access_size(mr, *l, mr_addr);
         /*
@@ -3032,7 +3033,7 @@ static MemTxResult flatview_read_continue_step(MemTxAttrs attrs, uint8_t *buf,
         /* I/O case */
         uint64_t val;
         MemTxResult result;
-        bool release_lock = prepare_mmio_access(mr);
+        bool release_lock = prepare_mmio_access(mr, true);
 
         *l = memory_access_size(mr, *l, mr_addr);
         result = memory_region_dispatch_read(mr, mr_addr, &val, size_memop(*l),
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 2/3] acpi: mark PMTIMER as unlocked
  2025-06-20 15:14 [PATCH 0/3] Reinvent BQL-free PIO/MMIO Igor Mammedov
  2025-06-20 15:14 ` [PATCH 1/3] memory: reintroduce BQL-free fine-grained PIO/MMIO Igor Mammedov
@ 2025-06-20 15:14 ` Igor Mammedov
  2025-06-20 15:14 ` [PATCH 3/3] mark HPET " Igor Mammedov
  2 siblings, 0 replies; 15+ messages in thread
From: Igor Mammedov @ 2025-06-20 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: kraxel, mst, anisinha, elena.ufimtseva, jag.raman, pbonzini,
	peterx, david, philmd

Reading QEMU_CLOCK_VIRTUAL is thread-safe.

with CLI
 -M q35,hpet=off -cpu host -enable-kvm -smp 240,sockets=4
patch makes WS2025 guest, that wasn't able to boot in 30min, to boot in 2-1min.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
 hw/acpi/core.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/acpi/core.c b/hw/acpi/core.c
index 58f8964e13..683fbdd8df 100644
--- a/hw/acpi/core.c
+++ b/hw/acpi/core.c
@@ -547,6 +547,8 @@ void acpi_pm_tmr_init(ACPIREGS *ar, acpi_update_sci_fn update_sci,
     ar->tmr.timer = timer_new_ns(QEMU_CLOCK_VIRTUAL, acpi_pm_tmr_timer, ar);
     memory_region_init_io(&ar->tmr.io, memory_region_owner(parent),
                           &acpi_pm_tmr_ops, ar, "acpi-tmr", 4);
+    memory_region_enable_lockless_ro_io(&ar->tmr.io);
+    ar->tmr.io.disable_reentrancy_guard = true;
     memory_region_add_subregion(parent, 8, &ar->tmr.io);
 }
 
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* [PATCH 3/3] mark HPET as unlocked
  2025-06-20 15:14 [PATCH 0/3] Reinvent BQL-free PIO/MMIO Igor Mammedov
  2025-06-20 15:14 ` [PATCH 1/3] memory: reintroduce BQL-free fine-grained PIO/MMIO Igor Mammedov
  2025-06-20 15:14 ` [PATCH 2/3] acpi: mark PMTIMER as unlocked Igor Mammedov
@ 2025-06-20 15:14 ` Igor Mammedov
  2025-06-20 17:01   ` Peter Maydell
  2 siblings, 1 reply; 15+ messages in thread
From: Igor Mammedov @ 2025-06-20 15:14 UTC (permalink / raw)
  To: qemu-devel
  Cc: kraxel, mst, anisinha, elena.ufimtseva, jag.raman, pbonzini,
	peterx, david, philmd

Reading QEMU_CLOCK_VIRTUAL is thread-safe.

with CLI
 -M q35,hpet=on -cpu host -enable-kvm -smp 240,sockets=4
patch makes WS2025 guest, that was able to boot in 4/2min, to boot in 2/1min.

Signed-off-by: Igor Mammedov <imammedo@redhat.com>
---
 hw/timer/hpet.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/hw/timer/hpet.c b/hw/timer/hpet.c
index 0fd1337a15..1ebd715529 100644
--- a/hw/timer/hpet.c
+++ b/hw/timer/hpet.c
@@ -681,6 +681,8 @@ static void hpet_init(Object *obj)
 
     /* HPET Area */
     memory_region_init_io(&s->iomem, obj, &hpet_ram_ops, s, "hpet", HPET_LEN);
+    memory_region_enable_lockless_ro_io(&s->iomem);
+    s->iomem.disable_reentrancy_guard = true;
     sysbus_init_mmio(sbd, &s->iomem);
 }
 
-- 
2.43.5



^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] memory: reintroduce BQL-free fine-grained PIO/MMIO
  2025-06-20 15:14 ` [PATCH 1/3] memory: reintroduce BQL-free fine-grained PIO/MMIO Igor Mammedov
@ 2025-06-20 16:53   ` Peter Xu
  2025-06-23 12:51     ` Igor Mammedov
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Xu @ 2025-06-20 16:53 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, kraxel, mst, anisinha, elena.ufimtseva, jag.raman,
	pbonzini, david, philmd

On Fri, Jun 20, 2025 at 05:14:16PM +0200, Igor Mammedov wrote:
> This patch brings back Jan's idea [1] of BQL-free IO access,
> with a twist that whitelist read access only.
> 
> (as BQL-free write access in [1] used to cause issues [2]
> and still does (Windows crash) if write path is not lock protected)

Can we add some explanation on why it would fail on lockless writes?

I saw that acpi_pm_tmr_write() is no-op, so I don't yet understand what
raced, and also why guest writes to it at all..

Thanks,

> 
> However with limiting it read path only, guest boots without issues.
> This will let us make read access ACPI PM/HPET timers cheaper,
> and prevent guest VCPUs BQL contention in case of workload
> that heavily uses the timers.
> 
> 2) 196ea13104f (memory: Add global-locking property to memory regions)
>    ... de7ea885c539 (kvm: Switch to unlocked MMIO)
> 3) https://bugzilla.redhat.com/show_bug.cgi?id=1322713
>    1beb99f787 (Revert "acpi: mark PMTIMER as unlocked")
> 
> Signed-off-by: Igor Mammedov <imammedo@redhat.com>
> ---
>  include/system/memory.h   | 10 +++++++++-
>  hw/remote/vfio-user-obj.c |  2 +-
>  system/memory.c           |  5 +++++
>  system/memory_ldst.c.inc  | 18 +++++++++---------
>  system/physmem.c          |  9 +++++----
>  5 files changed, 29 insertions(+), 15 deletions(-)
> 
> diff --git a/include/system/memory.h b/include/system/memory.h
> index fc35a0dcad..1afabf2d94 100644
> --- a/include/system/memory.h
> +++ b/include/system/memory.h
> @@ -775,6 +775,7 @@ struct MemoryRegion {
>      bool nonvolatile;
>      bool rom_device;
>      bool flush_coalesced_mmio;
> +    bool lockless_ro_io;
>      bool unmergeable;
>      uint8_t dirty_log_mask;
>      bool is_iommu;
> @@ -2253,6 +2254,13 @@ void memory_region_set_flush_coalesced(MemoryRegion *mr);
>   */
>  void memory_region_clear_flush_coalesced(MemoryRegion *mr);
>  
> +/**
> + * memory_region_enable_lockless_ro_io: Enable lockless (BQL) read-only acceess.
> + *
> + * Enable BQL-free readonly access for devices with fine-grained locking.
> + */
> +void memory_region_enable_lockless_ro_io(MemoryRegion *mr);
> +
>  /**
>   * memory_region_add_eventfd: Request an eventfd to be triggered when a word
>   *                            is written to a location.
> @@ -3002,7 +3010,7 @@ MemTxResult address_space_write_cached_slow(MemoryRegionCache *cache,
>                                              hwaddr len);
>  
>  int memory_access_size(MemoryRegion *mr, unsigned l, hwaddr addr);
> -bool prepare_mmio_access(MemoryRegion *mr);
> +bool prepare_mmio_access(MemoryRegion *mr, bool read);
>  
>  static inline bool memory_region_supports_direct_access(MemoryRegion *mr)
>  {
> diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> index ea6165ebdc..936a9befd4 100644
> --- a/hw/remote/vfio-user-obj.c
> +++ b/hw/remote/vfio-user-obj.c
> @@ -381,7 +381,7 @@ static int vfu_object_mr_rw(MemoryRegion *mr, uint8_t *buf, hwaddr offset,
>           * The read/write logic used below is similar to the ones in
>           * flatview_read/write_continue()
>           */
> -        release_lock = prepare_mmio_access(mr);
> +        release_lock = prepare_mmio_access(mr, !is_write);
>  
>          access_size = memory_access_size(mr, size, offset);
>  
> diff --git a/system/memory.c b/system/memory.c
> index 63b983efcd..5192195473 100644
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -2558,6 +2558,11 @@ void memory_region_clear_flush_coalesced(MemoryRegion *mr)
>      }
>  }
>  
> +void memory_region_enable_lockless_ro_io(MemoryRegion *mr)
> +{
> +    mr->lockless_ro_io = true;
> +}
> +
>  void memory_region_add_eventfd(MemoryRegion *mr,
>                                 hwaddr addr,
>                                 unsigned size,
> diff --git a/system/memory_ldst.c.inc b/system/memory_ldst.c.inc
> index 7f32d3d9ff..919357578c 100644
> --- a/system/memory_ldst.c.inc
> +++ b/system/memory_ldst.c.inc
> @@ -35,7 +35,7 @@ static inline uint32_t glue(address_space_ldl_internal, SUFFIX)(ARG1_DECL,
>      RCU_READ_LOCK();
>      mr = TRANSLATE(addr, &addr1, &l, false, attrs);
>      if (l < 4 || !memory_access_is_direct(mr, false, attrs)) {
> -        release_lock |= prepare_mmio_access(mr);
> +        release_lock |= prepare_mmio_access(mr, true);
>  
>          /* I/O case */
>          r = memory_region_dispatch_read(mr, addr1, &val,
> @@ -104,7 +104,7 @@ static inline uint64_t glue(address_space_ldq_internal, SUFFIX)(ARG1_DECL,
>      RCU_READ_LOCK();
>      mr = TRANSLATE(addr, &addr1, &l, false, attrs);
>      if (l < 8 || !memory_access_is_direct(mr, false, attrs)) {
> -        release_lock |= prepare_mmio_access(mr);
> +        release_lock |= prepare_mmio_access(mr, true);
>  
>          /* I/O case */
>          r = memory_region_dispatch_read(mr, addr1, &val,
> @@ -171,7 +171,7 @@ uint8_t glue(address_space_ldub, SUFFIX)(ARG1_DECL,
>      RCU_READ_LOCK();
>      mr = TRANSLATE(addr, &addr1, &l, false, attrs);
>      if (!memory_access_is_direct(mr, false, attrs)) {
> -        release_lock |= prepare_mmio_access(mr);
> +        release_lock |= prepare_mmio_access(mr, true);
>  
>          /* I/O case */
>          r = memory_region_dispatch_read(mr, addr1, &val, MO_8, attrs);
> @@ -208,7 +208,7 @@ static inline uint16_t glue(address_space_lduw_internal, SUFFIX)(ARG1_DECL,
>      RCU_READ_LOCK();
>      mr = TRANSLATE(addr, &addr1, &l, false, attrs);
>      if (l < 2 || !memory_access_is_direct(mr, false, attrs)) {
> -        release_lock |= prepare_mmio_access(mr);
> +        release_lock |= prepare_mmio_access(mr, true);
>  
>          /* I/O case */
>          r = memory_region_dispatch_read(mr, addr1, &val,
> @@ -278,7 +278,7 @@ void glue(address_space_stl_notdirty, SUFFIX)(ARG1_DECL,
>      RCU_READ_LOCK();
>      mr = TRANSLATE(addr, &addr1, &l, true, attrs);
>      if (l < 4 || !memory_access_is_direct(mr, true, attrs)) {
> -        release_lock |= prepare_mmio_access(mr);
> +        release_lock |= prepare_mmio_access(mr, false);
>  
>          r = memory_region_dispatch_write(mr, addr1, val, MO_32, attrs);
>      } else {
> @@ -315,7 +315,7 @@ static inline void glue(address_space_stl_internal, SUFFIX)(ARG1_DECL,
>      RCU_READ_LOCK();
>      mr = TRANSLATE(addr, &addr1, &l, true, attrs);
>      if (l < 4 || !memory_access_is_direct(mr, true, attrs)) {
> -        release_lock |= prepare_mmio_access(mr);
> +        release_lock |= prepare_mmio_access(mr, false);
>          r = memory_region_dispatch_write(mr, addr1, val,
>                                           MO_32 | devend_memop(endian), attrs);
>      } else {
> @@ -378,7 +378,7 @@ void glue(address_space_stb, SUFFIX)(ARG1_DECL,
>      RCU_READ_LOCK();
>      mr = TRANSLATE(addr, &addr1, &l, true, attrs);
>      if (!memory_access_is_direct(mr, true, attrs)) {
> -        release_lock |= prepare_mmio_access(mr);
> +        release_lock |= prepare_mmio_access(mr, false);
>          r = memory_region_dispatch_write(mr, addr1, val, MO_8, attrs);
>      } else {
>          /* RAM case */
> @@ -411,7 +411,7 @@ static inline void glue(address_space_stw_internal, SUFFIX)(ARG1_DECL,
>      RCU_READ_LOCK();
>      mr = TRANSLATE(addr, &addr1, &l, true, attrs);
>      if (l < 2 || !memory_access_is_direct(mr, true, attrs)) {
> -        release_lock |= prepare_mmio_access(mr);
> +        release_lock |= prepare_mmio_access(mr, false);
>          r = memory_region_dispatch_write(mr, addr1, val,
>                                           MO_16 | devend_memop(endian), attrs);
>      } else {
> @@ -475,7 +475,7 @@ static void glue(address_space_stq_internal, SUFFIX)(ARG1_DECL,
>      RCU_READ_LOCK();
>      mr = TRANSLATE(addr, &addr1, &l, true, attrs);
>      if (l < 8 || !memory_access_is_direct(mr, true, attrs)) {
> -        release_lock |= prepare_mmio_access(mr);
> +        release_lock |= prepare_mmio_access(mr, false);
>          r = memory_region_dispatch_write(mr, addr1, val,
>                                           MO_64 | devend_memop(endian), attrs);
>      } else {
> diff --git a/system/physmem.c b/system/physmem.c
> index a8a9ca309e..60e330de99 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -2881,11 +2881,12 @@ int memory_access_size(MemoryRegion *mr, unsigned l, hwaddr addr)
>      return l;
>  }
>  
> -bool prepare_mmio_access(MemoryRegion *mr)
> +bool prepare_mmio_access(MemoryRegion *mr, bool read)
>  {
>      bool release_lock = false;
>  
> -    if (!bql_locked()) {
> +    if (!bql_locked() &&
> +        !(read && mr->lockless_ro_io == true)) {
>          bql_lock();
>          release_lock = true;
>      }
> @@ -2935,7 +2936,7 @@ static MemTxResult flatview_write_continue_step(MemTxAttrs attrs,
>      if (!memory_access_is_direct(mr, true, attrs)) {
>          uint64_t val;
>          MemTxResult result;
> -        bool release_lock = prepare_mmio_access(mr);
> +        bool release_lock = prepare_mmio_access(mr, false);
>  
>          *l = memory_access_size(mr, *l, mr_addr);
>          /*
> @@ -3032,7 +3033,7 @@ static MemTxResult flatview_read_continue_step(MemTxAttrs attrs, uint8_t *buf,
>          /* I/O case */
>          uint64_t val;
>          MemTxResult result;
> -        bool release_lock = prepare_mmio_access(mr);
> +        bool release_lock = prepare_mmio_access(mr, true);
>  
>          *l = memory_access_size(mr, *l, mr_addr);
>          result = memory_region_dispatch_read(mr, mr_addr, &val, size_memop(*l),
> -- 
> 2.43.5
> 

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] mark HPET as unlocked
  2025-06-20 15:14 ` [PATCH 3/3] mark HPET " Igor Mammedov
@ 2025-06-20 17:01   ` Peter Maydell
  2025-06-24 10:39     ` Igor Mammedov
  0 siblings, 1 reply; 15+ messages in thread
From: Peter Maydell @ 2025-06-20 17:01 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, kraxel, mst, anisinha, elena.ufimtseva, jag.raman,
	pbonzini, peterx, david, philmd

On Fri, 20 Jun 2025 at 16:15, Igor Mammedov <imammedo@redhat.com> wrote:
>
> Reading QEMU_CLOCK_VIRTUAL is thread-safe.
>
> with CLI
>  -M q35,hpet=on -cpu host -enable-kvm -smp 240,sockets=4
> patch makes WS2025 guest, that was able to boot in 4/2min, to boot in 2/1min.
>
> Signed-off-by: Igor Mammedov <imammedo@redhat.com>
> ---
>  hw/timer/hpet.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/hw/timer/hpet.c b/hw/timer/hpet.c
> index 0fd1337a15..1ebd715529 100644
> --- a/hw/timer/hpet.c
> +++ b/hw/timer/hpet.c
> @@ -681,6 +681,8 @@ static void hpet_init(Object *obj)
>
>      /* HPET Area */
>      memory_region_init_io(&s->iomem, obj, &hpet_ram_ops, s, "hpet", HPET_LEN);
> +    memory_region_enable_lockless_ro_io(&s->iomem);
> +    s->iomem.disable_reentrancy_guard = true;
>      sysbus_init_mmio(sbd, &s->iomem);

Is this sequence possible?


thread A
   takes the BQL
   enters hpet_ram_write() for HPET_CFG to set ENABLE
   executes s->config = new_val (setting the ENABLE bit in it)
   context switch before it gets to the point of setting
     s->hpet_offset

thread B
   enters hpet_ram_read() for HPET_COUNTER (which it can now
    do because it doesn't need the BQL)
   hpet_enabled() returns true (it tests s->config)
   calls hpet_get_ticks which adds hpet_offset to the clock,
     but hpet_offset has not been correctly set by A yet

thanks
-- PMM


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] memory: reintroduce BQL-free fine-grained PIO/MMIO
  2025-06-20 16:53   ` Peter Xu
@ 2025-06-23 12:51     ` Igor Mammedov
  2025-06-23 13:36       ` Peter Xu
  0 siblings, 1 reply; 15+ messages in thread
From: Igor Mammedov @ 2025-06-23 12:51 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, kraxel, mst, anisinha, elena.ufimtseva, jag.raman,
	pbonzini, david, philmd

On Fri, 20 Jun 2025 12:53:06 -0400
Peter Xu <peterx@redhat.com> wrote:

> On Fri, Jun 20, 2025 at 05:14:16PM +0200, Igor Mammedov wrote:
> > This patch brings back Jan's idea [1] of BQL-free IO access,
> > with a twist that whitelist read access only.
> > 
> > (as BQL-free write access in [1] used to cause issues [2]
> > and still does (Windows crash) if write path is not lock protected)  
> 
> Can we add some explanation on why it would fail on lockless writes?
> 
> I saw that acpi_pm_tmr_write() is no-op, so I don't yet understand what
> raced, and also why guest writes to it at all..

root cause wasn't diagnosed back then, and I haven't able to
reproduce that as well. So I erred on side of caution and
implemented RO only.

Theoretically write should be fine too, but I don't have
an idea how to test that.

> 
> Thanks,
> 
> > 
> > However with limiting it read path only, guest boots without issues.
> > This will let us make read access ACPI PM/HPET timers cheaper,
> > and prevent guest VCPUs BQL contention in case of workload
> > that heavily uses the timers.
> > 
> > 2) 196ea13104f (memory: Add global-locking property to memory regions)
> >    ... de7ea885c539 (kvm: Switch to unlocked MMIO)
> > 3) https://bugzilla.redhat.com/show_bug.cgi?id=1322713
> >    1beb99f787 (Revert "acpi: mark PMTIMER as unlocked")
> > 
> > Signed-off-by: Igor Mammedov <imammedo@redhat.com>
> > ---
> >  include/system/memory.h   | 10 +++++++++-
> >  hw/remote/vfio-user-obj.c |  2 +-
> >  system/memory.c           |  5 +++++
> >  system/memory_ldst.c.inc  | 18 +++++++++---------
> >  system/physmem.c          |  9 +++++----
> >  5 files changed, 29 insertions(+), 15 deletions(-)
> > 
> > diff --git a/include/system/memory.h b/include/system/memory.h
> > index fc35a0dcad..1afabf2d94 100644
> > --- a/include/system/memory.h
> > +++ b/include/system/memory.h
> > @@ -775,6 +775,7 @@ struct MemoryRegion {
> >      bool nonvolatile;
> >      bool rom_device;
> >      bool flush_coalesced_mmio;
> > +    bool lockless_ro_io;
> >      bool unmergeable;
> >      uint8_t dirty_log_mask;
> >      bool is_iommu;
> > @@ -2253,6 +2254,13 @@ void memory_region_set_flush_coalesced(MemoryRegion *mr);
> >   */
> >  void memory_region_clear_flush_coalesced(MemoryRegion *mr);
> >  
> > +/**
> > + * memory_region_enable_lockless_ro_io: Enable lockless (BQL) read-only acceess.
> > + *
> > + * Enable BQL-free readonly access for devices with fine-grained locking.
> > + */
> > +void memory_region_enable_lockless_ro_io(MemoryRegion *mr);
> > +
> >  /**
> >   * memory_region_add_eventfd: Request an eventfd to be triggered when a word
> >   *                            is written to a location.
> > @@ -3002,7 +3010,7 @@ MemTxResult address_space_write_cached_slow(MemoryRegionCache *cache,
> >                                              hwaddr len);
> >  
> >  int memory_access_size(MemoryRegion *mr, unsigned l, hwaddr addr);
> > -bool prepare_mmio_access(MemoryRegion *mr);
> > +bool prepare_mmio_access(MemoryRegion *mr, bool read);
> >  
> >  static inline bool memory_region_supports_direct_access(MemoryRegion *mr)
> >  {
> > diff --git a/hw/remote/vfio-user-obj.c b/hw/remote/vfio-user-obj.c
> > index ea6165ebdc..936a9befd4 100644
> > --- a/hw/remote/vfio-user-obj.c
> > +++ b/hw/remote/vfio-user-obj.c
> > @@ -381,7 +381,7 @@ static int vfu_object_mr_rw(MemoryRegion *mr, uint8_t *buf, hwaddr offset,
> >           * The read/write logic used below is similar to the ones in
> >           * flatview_read/write_continue()
> >           */
> > -        release_lock = prepare_mmio_access(mr);
> > +        release_lock = prepare_mmio_access(mr, !is_write);
> >  
> >          access_size = memory_access_size(mr, size, offset);
> >  
> > diff --git a/system/memory.c b/system/memory.c
> > index 63b983efcd..5192195473 100644
> > --- a/system/memory.c
> > +++ b/system/memory.c
> > @@ -2558,6 +2558,11 @@ void memory_region_clear_flush_coalesced(MemoryRegion *mr)
> >      }
> >  }
> >  
> > +void memory_region_enable_lockless_ro_io(MemoryRegion *mr)
> > +{
> > +    mr->lockless_ro_io = true;
> > +}
> > +
> >  void memory_region_add_eventfd(MemoryRegion *mr,
> >                                 hwaddr addr,
> >                                 unsigned size,
> > diff --git a/system/memory_ldst.c.inc b/system/memory_ldst.c.inc
> > index 7f32d3d9ff..919357578c 100644
> > --- a/system/memory_ldst.c.inc
> > +++ b/system/memory_ldst.c.inc
> > @@ -35,7 +35,7 @@ static inline uint32_t glue(address_space_ldl_internal, SUFFIX)(ARG1_DECL,
> >      RCU_READ_LOCK();
> >      mr = TRANSLATE(addr, &addr1, &l, false, attrs);
> >      if (l < 4 || !memory_access_is_direct(mr, false, attrs)) {
> > -        release_lock |= prepare_mmio_access(mr);
> > +        release_lock |= prepare_mmio_access(mr, true);
> >  
> >          /* I/O case */
> >          r = memory_region_dispatch_read(mr, addr1, &val,
> > @@ -104,7 +104,7 @@ static inline uint64_t glue(address_space_ldq_internal, SUFFIX)(ARG1_DECL,
> >      RCU_READ_LOCK();
> >      mr = TRANSLATE(addr, &addr1, &l, false, attrs);
> >      if (l < 8 || !memory_access_is_direct(mr, false, attrs)) {
> > -        release_lock |= prepare_mmio_access(mr);
> > +        release_lock |= prepare_mmio_access(mr, true);
> >  
> >          /* I/O case */
> >          r = memory_region_dispatch_read(mr, addr1, &val,
> > @@ -171,7 +171,7 @@ uint8_t glue(address_space_ldub, SUFFIX)(ARG1_DECL,
> >      RCU_READ_LOCK();
> >      mr = TRANSLATE(addr, &addr1, &l, false, attrs);
> >      if (!memory_access_is_direct(mr, false, attrs)) {
> > -        release_lock |= prepare_mmio_access(mr);
> > +        release_lock |= prepare_mmio_access(mr, true);
> >  
> >          /* I/O case */
> >          r = memory_region_dispatch_read(mr, addr1, &val, MO_8, attrs);
> > @@ -208,7 +208,7 @@ static inline uint16_t glue(address_space_lduw_internal, SUFFIX)(ARG1_DECL,
> >      RCU_READ_LOCK();
> >      mr = TRANSLATE(addr, &addr1, &l, false, attrs);
> >      if (l < 2 || !memory_access_is_direct(mr, false, attrs)) {
> > -        release_lock |= prepare_mmio_access(mr);
> > +        release_lock |= prepare_mmio_access(mr, true);
> >  
> >          /* I/O case */
> >          r = memory_region_dispatch_read(mr, addr1, &val,
> > @@ -278,7 +278,7 @@ void glue(address_space_stl_notdirty, SUFFIX)(ARG1_DECL,
> >      RCU_READ_LOCK();
> >      mr = TRANSLATE(addr, &addr1, &l, true, attrs);
> >      if (l < 4 || !memory_access_is_direct(mr, true, attrs)) {
> > -        release_lock |= prepare_mmio_access(mr);
> > +        release_lock |= prepare_mmio_access(mr, false);
> >  
> >          r = memory_region_dispatch_write(mr, addr1, val, MO_32, attrs);
> >      } else {
> > @@ -315,7 +315,7 @@ static inline void glue(address_space_stl_internal, SUFFIX)(ARG1_DECL,
> >      RCU_READ_LOCK();
> >      mr = TRANSLATE(addr, &addr1, &l, true, attrs);
> >      if (l < 4 || !memory_access_is_direct(mr, true, attrs)) {
> > -        release_lock |= prepare_mmio_access(mr);
> > +        release_lock |= prepare_mmio_access(mr, false);
> >          r = memory_region_dispatch_write(mr, addr1, val,
> >                                           MO_32 | devend_memop(endian), attrs);
> >      } else {
> > @@ -378,7 +378,7 @@ void glue(address_space_stb, SUFFIX)(ARG1_DECL,
> >      RCU_READ_LOCK();
> >      mr = TRANSLATE(addr, &addr1, &l, true, attrs);
> >      if (!memory_access_is_direct(mr, true, attrs)) {
> > -        release_lock |= prepare_mmio_access(mr);
> > +        release_lock |= prepare_mmio_access(mr, false);
> >          r = memory_region_dispatch_write(mr, addr1, val, MO_8, attrs);
> >      } else {
> >          /* RAM case */
> > @@ -411,7 +411,7 @@ static inline void glue(address_space_stw_internal, SUFFIX)(ARG1_DECL,
> >      RCU_READ_LOCK();
> >      mr = TRANSLATE(addr, &addr1, &l, true, attrs);
> >      if (l < 2 || !memory_access_is_direct(mr, true, attrs)) {
> > -        release_lock |= prepare_mmio_access(mr);
> > +        release_lock |= prepare_mmio_access(mr, false);
> >          r = memory_region_dispatch_write(mr, addr1, val,
> >                                           MO_16 | devend_memop(endian), attrs);
> >      } else {
> > @@ -475,7 +475,7 @@ static void glue(address_space_stq_internal, SUFFIX)(ARG1_DECL,
> >      RCU_READ_LOCK();
> >      mr = TRANSLATE(addr, &addr1, &l, true, attrs);
> >      if (l < 8 || !memory_access_is_direct(mr, true, attrs)) {
> > -        release_lock |= prepare_mmio_access(mr);
> > +        release_lock |= prepare_mmio_access(mr, false);
> >          r = memory_region_dispatch_write(mr, addr1, val,
> >                                           MO_64 | devend_memop(endian), attrs);
> >      } else {
> > diff --git a/system/physmem.c b/system/physmem.c
> > index a8a9ca309e..60e330de99 100644
> > --- a/system/physmem.c
> > +++ b/system/physmem.c
> > @@ -2881,11 +2881,12 @@ int memory_access_size(MemoryRegion *mr, unsigned l, hwaddr addr)
> >      return l;
> >  }
> >  
> > -bool prepare_mmio_access(MemoryRegion *mr)
> > +bool prepare_mmio_access(MemoryRegion *mr, bool read)
> >  {
> >      bool release_lock = false;
> >  
> > -    if (!bql_locked()) {
> > +    if (!bql_locked() &&
> > +        !(read && mr->lockless_ro_io == true)) {
> >          bql_lock();
> >          release_lock = true;
> >      }
> > @@ -2935,7 +2936,7 @@ static MemTxResult flatview_write_continue_step(MemTxAttrs attrs,
> >      if (!memory_access_is_direct(mr, true, attrs)) {
> >          uint64_t val;
> >          MemTxResult result;
> > -        bool release_lock = prepare_mmio_access(mr);
> > +        bool release_lock = prepare_mmio_access(mr, false);
> >  
> >          *l = memory_access_size(mr, *l, mr_addr);
> >          /*
> > @@ -3032,7 +3033,7 @@ static MemTxResult flatview_read_continue_step(MemTxAttrs attrs, uint8_t *buf,
> >          /* I/O case */
> >          uint64_t val;
> >          MemTxResult result;
> > -        bool release_lock = prepare_mmio_access(mr);
> > +        bool release_lock = prepare_mmio_access(mr, true);
> >  
> >          *l = memory_access_size(mr, *l, mr_addr);
> >          result = memory_region_dispatch_read(mr, mr_addr, &val, size_memop(*l),
> > -- 
> > 2.43.5
> >   
> 



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] memory: reintroduce BQL-free fine-grained PIO/MMIO
  2025-06-23 12:51     ` Igor Mammedov
@ 2025-06-23 13:36       ` Peter Xu
  2025-06-24  7:07         ` Gerd Hoffmann
  2025-06-24 10:57         ` Igor Mammedov
  0 siblings, 2 replies; 15+ messages in thread
From: Peter Xu @ 2025-06-23 13:36 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: qemu-devel, kraxel, mst, anisinha, elena.ufimtseva, jag.raman,
	pbonzini, david, philmd

On Mon, Jun 23, 2025 at 02:51:46PM +0200, Igor Mammedov wrote:
> On Fri, 20 Jun 2025 12:53:06 -0400
> Peter Xu <peterx@redhat.com> wrote:
> 
> > On Fri, Jun 20, 2025 at 05:14:16PM +0200, Igor Mammedov wrote:
> > > This patch brings back Jan's idea [1] of BQL-free IO access,
> > > with a twist that whitelist read access only.
> > > 
> > > (as BQL-free write access in [1] used to cause issues [2]
> > > and still does (Windows crash) if write path is not lock protected)  
> > 
> > Can we add some explanation on why it would fail on lockless writes?
> > 
> > I saw that acpi_pm_tmr_write() is no-op, so I don't yet understand what
> > raced, and also why guest writes to it at all..
> 
> root cause wasn't diagnosed back then, and I haven't able to
> reproduce that as well. So I erred on side of caution and
> implemented RO only.

Ah OK, I think I got that feeling it can be reproduced as above mentioned
"still does (Windows crash) if write ...".

> 
> Theoretically write should be fine too, but I don't have
> an idea how to test that.

Then the question is how do we justify it will work this time..

If nobody can reproduce it anymore, there's indeed one way to go if we
strongly want to have the optimization, which is to apply it again and wait
for the reproducer to pop up once more.  Just like to double check is this
the case, and we have no way to reproduce?

I also wonder whether it's still a bit late because such experiment might
be better done at the start of release cycle.  Now we have roughly 3 weeks
to soft-freeze (July 15).  I had a look, last time it was pretty late when
reverting the change:

975eb6a547 (tag: v2.6.0-rc4) Update version for v2.6.0-rc4 release
1beb99f787 Revert "acpi: mark PMTIMER as unlocked"

So there's also the question of whether we should land this for this
release or next when open.

Gerd mentioned this in the relevant bz:

        Note: root cause for the initrd issue noted in comment 5 is seabios
        running into problems with ehci -> io errors -> corrupted initrd.
        Sometimes it doesn't boot at all, probably in case the io errors
        happen to hit the kernel not the initrd.

This seems to be the last piece of information we have had that is closest
to the root cause.  I sincerely wished there's still some way to move
forward, as it looks really close, but it might be that it was just too
late for 2.6 so we didn't got time to keep looking back then.

Thanks,

-- 
Peter Xu



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] memory: reintroduce BQL-free fine-grained PIO/MMIO
  2025-06-23 13:36       ` Peter Xu
@ 2025-06-24  7:07         ` Gerd Hoffmann
  2025-06-24 10:45           ` Igor Mammedov
  2025-06-24 10:57         ` Igor Mammedov
  1 sibling, 1 reply; 15+ messages in thread
From: Gerd Hoffmann @ 2025-06-24  7:07 UTC (permalink / raw)
  To: Peter Xu
  Cc: Igor Mammedov, qemu-devel, mst, anisinha, elena.ufimtseva,
	jag.raman, pbonzini, david, philmd

  Hi,

> Gerd mentioned this in the relevant bz:
> 
>         Note: root cause for the initrd issue noted in comment 5 is seabios
>         running into problems with ehci -> io errors -> corrupted initrd.
>         Sometimes it doesn't boot at all, probably in case the io errors
>         happen to hit the kernel not the initrd.
> 
> This seems to be the last piece of information we have had that is closest
> to the root cause.

seabios used to prefer pmtimer back then for timekeeping then because it
has a fixed frequency.  Doing tsc calibration can easily be /way/ off in
a virtual machine on a loaded host.

Meanwhile seabios got support for reading the tsc frequency via cpuid
(if invtsc is available) or via kvmclock.  If that works seabios will
prefer the tsc for timekeeping.

So, when trying to reproduce the failure for analysis you have to either
use an old seabios version, or turn off kvmclock + invtsc support,
otherwise seabios will not use the pmtimer in the first place.

You should have this line in the firmware log:

    Using pmtimer, ioport 0x608

HTH & take care,
  Gerd



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 3/3] mark HPET as unlocked
  2025-06-20 17:01   ` Peter Maydell
@ 2025-06-24 10:39     ` Igor Mammedov
  0 siblings, 0 replies; 15+ messages in thread
From: Igor Mammedov @ 2025-06-24 10:39 UTC (permalink / raw)
  To: Peter Maydell
  Cc: qemu-devel, kraxel, mst, anisinha, elena.ufimtseva, jag.raman,
	pbonzini, peterx, david, philmd

On Fri, 20 Jun 2025 18:01:08 +0100
Peter Maydell <peter.maydell@linaro.org> wrote:

> On Fri, 20 Jun 2025 at 16:15, Igor Mammedov <imammedo@redhat.com> wrote:
> >
> > Reading QEMU_CLOCK_VIRTUAL is thread-safe.
> >
> > with CLI
> >  -M q35,hpet=on -cpu host -enable-kvm -smp 240,sockets=4
> > patch makes WS2025 guest, that was able to boot in 4/2min, to boot in 2/1min.
> >
> > Signed-off-by: Igor Mammedov <imammedo@redhat.com>
> > ---
> >  hw/timer/hpet.c | 2 ++
> >  1 file changed, 2 insertions(+)
> >
> > diff --git a/hw/timer/hpet.c b/hw/timer/hpet.c
> > index 0fd1337a15..1ebd715529 100644
> > --- a/hw/timer/hpet.c
> > +++ b/hw/timer/hpet.c
> > @@ -681,6 +681,8 @@ static void hpet_init(Object *obj)
> >
> >      /* HPET Area */
> >      memory_region_init_io(&s->iomem, obj, &hpet_ram_ops, s, "hpet", HPET_LEN);
> > +    memory_region_enable_lockless_ro_io(&s->iomem);
> > +    s->iomem.disable_reentrancy_guard = true;
> >      sysbus_init_mmio(sbd, &s->iomem);  
> 
> Is this sequence possible?
While unlikely (what I observe from Linux/Windows guest,
they enable timer 1st and only then start readers),
but yes it still possible.

The more convoluted is the switch into disabled state, where
1:
  reader is already in enabled branch and preempted before
  reading clock:  
          case HPET_COUNTER:
            if (hpet_enabled(s)) {
               <yield>
               cur_tick = hpet_get_ticks(s);
2:
  writer saves s->hpet_counter = <now>

3:
  reader resumes and reads newer 'now'

4:
  on next counter read, reader switches to disabled branch
  and sees timer jump back to older s->hpet_counter value.
  and that shouldn't happen.

for this to work s->hpet_counter needs to catch up
the latest read timer value.

Let me try and see what could be done with it.

> 
> thread A
>    takes the BQL
>    enters hpet_ram_write() for HPET_CFG to set ENABLE
>    executes s->config = new_val (setting the ENABLE bit in it)
>    context switch before it gets to the point of setting
>      s->hpet_offset
> 
> thread B
>    enters hpet_ram_read() for HPET_COUNTER (which it can now
>     do because it doesn't need the BQL)
>    hpet_enabled() returns true (it tests s->config)
>    calls hpet_get_ticks which adds hpet_offset to the clock,
>      but hpet_offset has not been correctly set by A yet
> 
> thanks
> -- PMM
> 



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] memory: reintroduce BQL-free fine-grained PIO/MMIO
  2025-06-24  7:07         ` Gerd Hoffmann
@ 2025-06-24 10:45           ` Igor Mammedov
  2025-06-27 12:02             ` Igor Mammedov
  0 siblings, 1 reply; 15+ messages in thread
From: Igor Mammedov @ 2025-06-24 10:45 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Peter Xu, qemu-devel, mst, anisinha, elena.ufimtseva, jag.raman,
	pbonzini, david, philmd

On Tue, 24 Jun 2025 09:07:11 +0200
Gerd Hoffmann <kraxel@redhat.com> wrote:

>   Hi,
> 
> > Gerd mentioned this in the relevant bz:
> > 
> >         Note: root cause for the initrd issue noted in comment 5 is seabios
> >         running into problems with ehci -> io errors -> corrupted initrd.
> >         Sometimes it doesn't boot at all, probably in case the io errors
> >         happen to hit the kernel not the initrd.
> > 
> > This seems to be the last piece of information we have had that is closest
> > to the root cause.  
> 
> seabios used to prefer pmtimer back then for timekeeping then because it
> has a fixed frequency.  Doing tsc calibration can easily be /way/ off in
> a virtual machine on a loaded host.
> 
> Meanwhile seabios got support for reading the tsc frequency via cpuid
> (if invtsc is available) or via kvmclock.  If that works seabios will
> prefer the tsc for timekeeping.
> 
> So, when trying to reproduce the failure for analysis you have to either
> use an old seabios version, or turn off kvmclock + invtsc support,
> otherwise seabios will not use the pmtimer in the first place.

thanks for the hint (I've been trying to reproduce with current seabios),
I'll try to reproduce with the old seabios.

> 
> You should have this line in the firmware log:
> 
>     Using pmtimer, ioport 0x608
> 
> HTH & take care,
>   Gerd
> 



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] memory: reintroduce BQL-free fine-grained PIO/MMIO
  2025-06-23 13:36       ` Peter Xu
  2025-06-24  7:07         ` Gerd Hoffmann
@ 2025-06-24 10:57         ` Igor Mammedov
  1 sibling, 0 replies; 15+ messages in thread
From: Igor Mammedov @ 2025-06-24 10:57 UTC (permalink / raw)
  To: Peter Xu
  Cc: qemu-devel, kraxel, mst, anisinha, elena.ufimtseva, jag.raman,
	pbonzini, david, philmd

On Mon, 23 Jun 2025 09:36:05 -0400
Peter Xu <peterx@redhat.com> wrote:

> On Mon, Jun 23, 2025 at 02:51:46PM +0200, Igor Mammedov wrote:
> > On Fri, 20 Jun 2025 12:53:06 -0400
> > Peter Xu <peterx@redhat.com> wrote:
> >   
> > > On Fri, Jun 20, 2025 at 05:14:16PM +0200, Igor Mammedov wrote:  
> > > > This patch brings back Jan's idea [1] of BQL-free IO access,
> > > > with a twist that whitelist read access only.
> > > > 
> > > > (as BQL-free write access in [1] used to cause issues [2]
> > > > and still does (Windows crash) if write path is not lock protected)    
> > > 
> > > Can we add some explanation on why it would fail on lockless writes?
> > > 
> > > I saw that acpi_pm_tmr_write() is no-op, so I don't yet understand what
> > > raced, and also why guest writes to it at all..  
> > 
> > root cause wasn't diagnosed back then, and I haven't able to
> > reproduce that as well. So I erred on side of caution and
> > implemented RO only.  
> 
> Ah OK, I think I got that feeling it can be reproduced as above mentioned
> "still does (Windows crash) if write ...".

that is leftover from experiments with lockless split irqchip,
as we need to use it with more then 255 vCPU, and then we are back
BQL contention as every IO exit will also trigger taking BQL
for non-in-kernel irqchip.

So this series addresses unboottable Windows issue only upto
255 vCPU. If I manage to make split irqchip checks lockless,
it will be a separate series on top.

> 
> > 
> > Theoretically write should be fine too, but I don't have
> > an idea how to test that.  
> 
> Then the question is how do we justify it will work this time..
> 
> If nobody can reproduce it anymore, there's indeed one way to go if we
> strongly want to have the optimization, which is to apply it again and wait
> for the reproducer to pop up once more.  Just like to double check is this
> the case, and we have no way to reproduce?

I'd prefer to reproduce issue if possible, but if that won't workout
it might be better to try and see it explodes elsewhere.
Let's see if I could reproduce with old Seabios as per Gerd's suggestions. 
 
> I also wonder whether it's still a bit late because such experiment might
> be better done at the start of release cycle.  Now we have roughly 3 weeks
> to soft-freeze (July 15).  I had a look, last time it was pretty late when
> reverting the change:
> 
> 975eb6a547 (tag: v2.6.0-rc4) Update version for v2.6.0-rc4 release
> 1beb99f787 Revert "acpi: mark PMTIMER as unlocked"
> 
> So there's also the question of whether we should land this for this
> release or next when open.

I don't see the need to rush this, so
+1 to the next cycle.


> Gerd mentioned this in the relevant bz:
> 
>         Note: root cause for the initrd issue noted in comment 5 is seabios
>         running into problems with ehci -> io errors -> corrupted initrd.
>         Sometimes it doesn't boot at all, probably in case the io errors
>         happen to hit the kernel not the initrd.
> 
> This seems to be the last piece of information we have had that is closest
> to the root cause.  I sincerely wished there's still some way to move
> forward, as it looks really close, but it might be that it was just too
> late for 2.6 so we didn't got time to keep looking back then.
> 
> Thanks,
> 



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] memory: reintroduce BQL-free fine-grained PIO/MMIO
  2025-06-24 10:45           ` Igor Mammedov
@ 2025-06-27 12:02             ` Igor Mammedov
  2025-06-30 10:02               ` Gerd Hoffmann
  0 siblings, 1 reply; 15+ messages in thread
From: Igor Mammedov @ 2025-06-27 12:02 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Peter Xu, qemu-devel, mst, anisinha, elena.ufimtseva, jag.raman,
	pbonzini, david, philmd

On Tue, 24 Jun 2025 12:45:27 +0200
Igor Mammedov <imammedo@redhat.com> wrote:

> On Tue, 24 Jun 2025 09:07:11 +0200
> Gerd Hoffmann <kraxel@redhat.com> wrote:
> 
> >   Hi,
> >   
> > > Gerd mentioned this in the relevant bz:
> > > 
> > >         Note: root cause for the initrd issue noted in comment 5 is seabios
> > >         running into problems with ehci -> io errors -> corrupted initrd.
> > >         Sometimes it doesn't boot at all, probably in case the io errors
> > >         happen to hit the kernel not the initrd.
> > > 
> > > This seems to be the last piece of information we have had that is closest
> > > to the root cause.    
> > 
> > seabios used to prefer pmtimer back then for timekeeping then because it
> > has a fixed frequency.  Doing tsc calibration can easily be /way/ off in
> > a virtual machine on a loaded host.
> > 
> > Meanwhile seabios got support for reading the tsc frequency via cpuid
> > (if invtsc is available) or via kvmclock.  If that works seabios will
> > prefer the tsc for timekeeping.
> > 
> > So, when trying to reproduce the failure for analysis you have to either
> > use an old seabios version, or turn off kvmclock + invtsc support,
> > otherwise seabios will not use the pmtimer in the first place.  
> 
> thanks for the hint (I've been trying to reproduce with current seabios),
> I'll try to reproduce with the old seabios.

not exactly the same config but close 

qemu at offending commit 1beb99f787ba11 + exactly the same SeaBIOS
with guest GA RHEL7.3 (3.10.0-514.el7.x86_64) and making sure
seabios used pmtimer.

Running RHBZ 1322713 reproducer over several days in loop,
haven't been able to reproduce the issue.

Differences are: RHEL9 host and for guest a bit newer kernel.
the reproducer used a bit older (intermediate) one that is nowhere
to be found anymore.

As you've said in comment
https://bugzilla.redhat.com/show_bug.cgi?id=1322713#c6
it's strange that patch causes issues at all, especially with
'-smp 1' as in reproducer.

Also repeated with -smp x>1, it still worked fine.

Perhaps issue was elsewhere after all.

> 
> > 
> > You should have this line in the firmware log:
> > 
> >     Using pmtimer, ioport 0x608
> > 
> > HTH & take care,
> >   Gerd
> >   
> 



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] memory: reintroduce BQL-free fine-grained PIO/MMIO
  2025-06-27 12:02             ` Igor Mammedov
@ 2025-06-30 10:02               ` Gerd Hoffmann
  2025-07-01 14:33                 ` Igor Mammedov
  0 siblings, 1 reply; 15+ messages in thread
From: Gerd Hoffmann @ 2025-06-30 10:02 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: Peter Xu, qemu-devel, mst, anisinha, elena.ufimtseva, jag.raman,
	pbonzini, david, philmd

  Hi,

> As you've said in comment
> https://bugzilla.redhat.com/show_bug.cgi?id=1322713#c6
> it's strange that patch causes issues at all, especially with
> '-smp 1' as in reproducer.
> 
> Also repeated with -smp x>1, it still worked fine.
> 
> Perhaps issue was elsewhere after all.

Yea, looks pretty much like this.  Guess we do not need the
'make only read access lockless' part then.

take care,
  Gerd



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 1/3] memory: reintroduce BQL-free fine-grained PIO/MMIO
  2025-06-30 10:02               ` Gerd Hoffmann
@ 2025-07-01 14:33                 ` Igor Mammedov
  0 siblings, 0 replies; 15+ messages in thread
From: Igor Mammedov @ 2025-07-01 14:33 UTC (permalink / raw)
  To: Gerd Hoffmann
  Cc: Peter Xu, qemu-devel, mst, anisinha, elena.ufimtseva, jag.raman,
	pbonzini, david, philmd

On Mon, 30 Jun 2025 12:02:25 +0200
Gerd Hoffmann <kraxel@redhat.com> wrote:

>   Hi,
> 
> > As you've said in comment
> > https://bugzilla.redhat.com/show_bug.cgi?id=1322713#c6
> > it's strange that patch causes issues at all, especially with
> > '-smp 1' as in reproducer.
> > 
> > Also repeated with -smp x>1, it still worked fine.
> > 
> > Perhaps issue was elsewhere after all.  
> 
> Yea, looks pretty much like this.  Guess we do not need the
> 'make only read access lockless' part then.

Yep, I'll ditch readonly and instead add proper fine-grained locking
to PM and HPET timers instead (HPET needs it anyways as Peter pointed out,
while PM will be just a collateral)


> 
> take care,
>   Gerd
> 



^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2025-07-01 14:34 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-06-20 15:14 [PATCH 0/3] Reinvent BQL-free PIO/MMIO Igor Mammedov
2025-06-20 15:14 ` [PATCH 1/3] memory: reintroduce BQL-free fine-grained PIO/MMIO Igor Mammedov
2025-06-20 16:53   ` Peter Xu
2025-06-23 12:51     ` Igor Mammedov
2025-06-23 13:36       ` Peter Xu
2025-06-24  7:07         ` Gerd Hoffmann
2025-06-24 10:45           ` Igor Mammedov
2025-06-27 12:02             ` Igor Mammedov
2025-06-30 10:02               ` Gerd Hoffmann
2025-07-01 14:33                 ` Igor Mammedov
2025-06-24 10:57         ` Igor Mammedov
2025-06-20 15:14 ` [PATCH 2/3] acpi: mark PMTIMER as unlocked Igor Mammedov
2025-06-20 15:14 ` [PATCH 3/3] mark HPET " Igor Mammedov
2025-06-20 17:01   ` Peter Maydell
2025-06-24 10:39     ` Igor Mammedov

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).