[PATCH v2 0/2] Performant CXL type 3 non-interleaved regions

* [PATCH v2 0/2] Performant CXL type 3 non-interleaved regions
@ 2026-02-04 14:59 Alireza Sanaee via qemu development
  2026-02-04 15:00 ` [PATCH v2 1/2] hw/cxl: Use HPA in cxl_cfmws_find_device() rather than offset in window Alireza Sanaee via qemu development
                   ` (3 more replies)
  0 siblings, 4 replies; 12+ messages in thread
From: Alireza Sanaee via qemu development @ 2026-02-04 14:59 UTC (permalink / raw)
  To: qemu-devel
  Cc: linuxarm, jonathan.cameron, nifan.cxl, anisa.su887, mst, armbru,
	lizhijian, ppbonzini, peterx, philmd, david, imammedo,
	xiaoguangrong.eric, venkataravis, gourry

The CXL address-to-device decoding logic is complex because of the need
to correctly decode fine-grained interleave. The current implementation
prevents use with KVM where executed instructions may reside in that
memory and gives very slow performance even in TCG.

In many real cases non-interleaved memory configurations are useful and
for those we can use a more conventional memory region alias, allowing
similar performance to other memory in the system.

Whether this fast path is applicable can be established once the full
set of HDM decoders has been committed (in whatever order the guest
decides to commit them). As such, a check is performed on each commit /
uncommit of an HDM decoder to establish whether the alias should be added
or removed.
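
The re-evaluation described above can be illustrated with a minimal
sketch. It assumes a simplified decoder model: ToyDecoder,
toy_alias_wanted() and toy_reevaluate() are hypothetical names for
illustration only and are not part of QEMU. The real series additionally
requires every decoder on the path (fixed window, host bridge, switch) to
be non-interleaved, which this sketch leaves out.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of one HDM decoder (not a QEMU type). */
typedef struct ToyDecoder {
    bool committed;     /* decoder has been committed by the guest */
    uint8_t iw_enc;     /* encoded interleave ways; 0 means 1-way */
    bool alias_mapped;  /* whether a direct alias is currently installed */
} ToyDecoder;

/* An alias is only wanted for a committed, 1-way decoder. */
static bool toy_alias_wanted(const ToyDecoder *d)
{
    return d->committed && d->iw_enc == 0;
}

/*
 * Re-check every decoder after any commit or uncommit, because the guest
 * may program decoders in any order. Returns how many aliases changed.
 */
static size_t toy_reevaluate(ToyDecoder *decs, size_t n)
{
    size_t changed = 0;

    for (size_t i = 0; i < n; i++) {
        bool want = toy_alias_wanted(&decs[i]);

        if (want != decs[i].alias_mapped) {
            decs[i].alias_mapped = want;  /* add or remove the alias */
            changed++;
        }
    }
    return changed;
}
```

Calling toy_reevaluate() after each state change keeps the set of aliases
in step with the decoders without tracking the order of commits.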

Tested Topologies and Region Layouts
====================================

This series was validated across multiple CXL topology configurations,
covering single-device, multi-device, multi-host-bridge, and switched
fabrics. Region creation was exercised using the `cxl` userspace tool
with both non-interleaved and interleaved setups.

Decoder and memdev identifiers were discovered using:

  cxl list
  cxl list -D

Decoder IDs (e.g. decoder0.0) and memdev names (mem0, mem1) are
environment-specific. Commands below use placeholders such as
<decoder_span_both> which should be replaced with IDs from `cxl list -D`.

---------------------------------------------------------------------

Region Layout Notation
----------------------

CFMW (CXL Fixed Memory Window) is shown as a linear address space
containing regions:

  CFMW: [ R0 | R1 | R2 ]

R0, R1, R2 are regions created by `cxl create-region`.

Non-interleaved region:

  R0 (ways=1) -> entirely on one device (mem0 or mem1)
  Fast path: APPLICABLE

2-way interleaved region (g=256):

  R1 (ways=2, g=256) striped across devices:

    |mem0|mem1|mem0|mem1|mem0|mem1| ...
     256  256  256  256  256  256  bytes

  Fast path: NOT APPLICABLE
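
The striping shown above follows from simple arithmetic on the address.
The sketch below mirrors the expression used in cxl_hdm_find_target(),
but takes the already decoded granularity and way count as plain
parameters; toy_target_index() is a hypothetical helper for illustration,
not a QEMU function.

```c
#include <stdint.h>

/*
 * Index of the interleave target that serves `hpa`, given the interleave
 * granularity in bytes and the number of ways.
 */
static unsigned int toy_target_index(uint64_t hpa, uint64_t granularity,
                                     unsigned int ways)
{
    return (unsigned int)((hpa / granularity) % ways);
}
```

With ways=1 the result is always 0, so every address in the region lands
on the same device. That is the property that makes a plain memory region
alias possible.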

---------------------------------------------------------------------

1) One device, one host bridge, one fixed window
------------------------------------------------

QEMU:

  -M q35,cxl=on,cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G
  -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12
  -device cxl-rp,id=rp0,bus=cxl.0,port=0,chassis=0,slot=2
  -object memory-backend-ram,id=mem0,size=512M,share=on
  -device cxl-type3,id=dev0,bus=rp0,memdev=mem0

Topology:

  Host
    |
    +-- CXL Host Bridge (cxl.0)
         |
         +-- Root Port (rp0)
              |
              +-- Type-3 (dev0, mem0)

Regions created:

  cxl create-region ... -w 1 ... mem0   (Fast path: YES)
  cxl create-region ... -w 1 ... mem0   (Fast path: YES)

Layout:

  CFMW: [ R0 | R1 ]

  R0 -> mem0  (Fast path: YES)
  R1 -> mem0  (Fast path: YES)

---------------------------------------------------------------------

2) One host bridge, two Type-3 devices (via two root ports)
------------------------------------------------------------

QEMU:

  -M q35,cxl=on,cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G
  -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12
  -device cxl-rp,id=rp0,bus=cxl.0,port=0,chassis=0,slot=2
  -device cxl-rp,id=rp1,bus=cxl.0,port=1,chassis=0,slot=3
  -object memory-backend-ram,id=mem0,size=512M,share=on
  -object memory-backend-ram,id=mem1,size=512M,share=on
  -device cxl-type3,id=dev0,bus=rp0,memdev=mem0
  -device cxl-type3,id=dev1,bus=rp1,memdev=mem1

Topology:

  Host
    |
    +-- CXL Host Bridge (cxl.0)
         |
         +-- Root Port (rp0) -- Type-3 (dev0, mem0)
         |
         +-- Root Port (rp1) -- Type-3 (dev1, mem1)

Region patterns exercised:

2.1 All non-interleaved:
  R0 -> mem0  (Fast path: YES)
  R1 -> mem0  (Fast path: YES)
  R2 -> mem1  (Fast path: YES)
  R3 -> mem1  (Fast path: YES)

2.2 Interleaved + local:
  R0 -> mem0/mem1 interleaved  (Fast path: NO)
  R1 -> mem0                   (Fast path: YES)

2.3 Local + interleaved + local:
  R0 -> mem0                   (Fast path: YES)
  R1 -> mem0/mem1 interleaved  (Fast path: NO)
  R2 -> mem1                   (Fast path: YES)

---------------------------------------------------------------------

3) Two host bridges, one device per host bridge
------------------------------------------------

QEMU:

  -M q35,cxl=on,
     cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G,
     cxl-fmw.1.targets.0=cxl.1,cxl-fmw.1.size=4G,
     cxl-fmw.2.targets.0=cxl.0,cxl-fmw.2.size=4G
  -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12
  -device cxl-rp,id=rp0,bus=cxl.0,port=0,chassis=0,slot=2
  -object memory-backend-ram,id=mem0,size=512M,share=on
  -device cxl-type3,id=dev0,bus=rp0,memdev=mem0
  -device pxb-cxl,id=cxl.1,bus=pcie.0,bus_nr=13
  -device cxl-rp,id=rp1,bus=cxl.1,port=0,chassis=1,slot=2
  -object memory-backend-ram,id=mem1,size=512M,share=on
  -device cxl-type3,id=dev1,bus=rp1,memdev=mem1

Region patterns identical to section 2, and fast-path applicability is
identical per region mapping (non-interleaved: YES, interleaved: NO).

---------------------------------------------------------------------

4) Switch topology
------------------

QEMU:

  -M q35,cxl=on,cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G
  -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12
  -device cxl-rp,id=rp0,bus=cxl.0,port=0,chassis=0,slot=2
  -device cxl-rp,id=rp1,bus=cxl.0,port=1,chassis=0,slot=3
  -device cxl-upstream,id=us0,bus=rp0
  -device cxl-downstream,id=ds0,bus=us0,port=0,chassis=0,slot=4
  -object memory-backend-ram,id=mem0,size=512M,share=on
  -device cxl-type3,id=dev0,bus=ds0,memdev=mem0

Topology (detailed):

  Host
    |
    +-- CXL Host Bridge (cxl.0)
         |
         +-- Root Port (rp0)
         |     |
         |     +-- CXL Switch (upstream us0)
         |           |
         |           +-- Downstream Port (ds0) -- Type-3 (mem0)
         |           |
         |           +-- Downstream Port (ds1) -- Type-3 (mem1) [optional]
         +-- Root Port (rp1)
               |
               +-- More devices/switches.

Fast-path interpretation in this topology:

  If only mem0 exists:
    All regions -> Fast path: YES

  If mem0 and mem1 exist:
    Non-interleaved regions -> Fast path: YES
    Interleaved regions     -> Fast path: NO

---------------------------------------------------------------------

Summary
-------

Across all topologies, region creation, enablement, and HDM decoder
commit/uncommit flows were exercised. The fast path is enabled only when
all decoders describe a non-interleaved mapping and is removed when any
interleave configuration is introduced.

Alireza Sanaee (2):
  hw/cxl: Use HPA in cxl_cfmws_find_device() rather than offset in
    window.
  hw/cxl: Add a performant (and correct) path for the non interleaved
    cases

 hw/cxl/cxl-component-utils.c |   8 ++
 hw/cxl/cxl-host.c            | 203 +++++++++++++++++++++++++++++++++--
 hw/mem/cxl_type3.c           |   4 +
 include/hw/cxl/cxl.h         |   1 +
 include/hw/cxl/cxl_device.h  |   1 +
 5 files changed, 208 insertions(+), 9 deletions(-)

-- 
2.43.0




* [PATCH v2 1/2] hw/cxl: Use HPA in cxl_cfmws_find_device() rather than offset in window.
  2026-02-04 14:59 [PATCH v2 0/2] Performant CXL type 3 non-interleaved regions Alireza Sanaee via qemu development
@ 2026-02-04 15:00 ` Alireza Sanaee via qemu development
  2026-02-06  9:57   ` Zhijian Li (Fujitsu)
  2026-02-04 15:00 ` [PATCH v2 2/2] hw/cxl: Add a performant (and correct) path for the non interleaved cases Alireza Sanaee via qemu development
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 12+ messages in thread
From: Alireza Sanaee via qemu development @ 2026-02-04 15:00 UTC (permalink / raw)
  To: qemu-devel
  Cc: linuxarm, jonathan.cameron, nifan.cxl, anisa.su887, mst, armbru,
	lizhijian, ppbonzini, peterx, philmd, david, imammedo,
	xiaoguangrong.eric, venkataravis, gourry

This function will shortly be used to help find if there is a route to a
device, serving an HPA, under a particular fixed memory window. Rather
than having that new use case subtract the base address in the caller,
only to add it again in cxl_cfmws_find_device(), push the responsibility
for calculating the HPA to the caller.

This also reduces the inconsistency in the meaning of the hwaddr
addr parameter between this function and the calls made within it
that access the HDM decoders, which operate on HPA.
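
As a minimal sketch of why the distinction matters: the MMIO callbacks
receive an offset into the window, while decoder ranges are expressed as
HPA. ToyWindow, toy_offset_to_hpa() and toy_hpa_in_decoder() are
hypothetical names used only for illustration; ToyWindow stands in for
the single CXLFixedWindow field used here.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the one fixed-window field used below. */
typedef struct ToyWindow {
    uint64_t base;  /* HPA at which the fixed memory window starts */
} ToyWindow;

/* Convert a window-relative offset into a host physical address. */
static uint64_t toy_offset_to_hpa(const ToyWindow *fw, uint64_t offset)
{
    return fw->base + offset;
}

/* Range check against a decoder whose base and size are given as HPA. */
static bool toy_hpa_in_decoder(uint64_t hpa, uint64_t dec_base,
                               uint64_t dec_size)
{
    return hpa >= dec_base && hpa - dec_base < dec_size;
}
```

Passing the raw offset to the range check would miss the decoder, which
is why the conversion has to happen exactly once, in the caller.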

Signed-off-by: Alireza Sanaee <alireza.sanaee@huawei.com>
---
 hw/cxl/cxl-host.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/hw/cxl/cxl-host.c b/hw/cxl/cxl-host.c
index f3479b1991..9633b01abf 100644
--- a/hw/cxl/cxl-host.c
+++ b/hw/cxl/cxl-host.c
@@ -168,8 +168,6 @@ static PCIDevice *cxl_cfmws_find_device(CXLFixedWindow *fw, hwaddr addr)
     bool target_found;
     PCIDevice *rp, *d;
 
-    /* Address is relative to memory region. Convert to HPA */
-    addr += fw->base;
 
     rb_index = (addr / cxl_decode_ig(fw->enc_int_gran)) % fw->num_targets;
     hb = PCI_HOST_BRIDGE(fw->target_hbs[rb_index]->cxl_host_bridge);
@@ -254,7 +252,7 @@ static MemTxResult cxl_read_cfmws(void *opaque, hwaddr addr, uint64_t *data,
     CXLFixedWindow *fw = opaque;
     PCIDevice *d;
 
-    d = cxl_cfmws_find_device(fw, addr);
+    d = cxl_cfmws_find_device(fw, addr + fw->base, true);
     if (d == NULL) {
         *data = 0;
         /* Reads to invalid address return poison */
@@ -271,7 +269,7 @@ static MemTxResult cxl_write_cfmws(void *opaque, hwaddr addr,
     CXLFixedWindow *fw = opaque;
     PCIDevice *d;
 
-    d = cxl_cfmws_find_device(fw, addr);
+    d = cxl_cfmws_find_device(fw, addr + fw->base, true);
     if (d == NULL) {
         /* Writes to invalid address are silent */
         return MEMTX_OK;
-- 
2.43.0




* Re: [PATCH v2 1/2] hw/cxl: Use HPA in cxl_cfmws_find_device() rather than offset in window.
  2026-02-04 15:00 ` [PATCH v2 1/2] hw/cxl: Use HPA in cxl_cfmws_find_device() rather than offset in window Alireza Sanaee via qemu development
@ 2026-02-06  9:57   ` Zhijian Li (Fujitsu)
  2026-02-06 12:51     ` Jonathan Cameron via qemu development
  0 siblings, 1 reply; 12+ messages in thread
From: Zhijian Li (Fujitsu) @ 2026-02-06  9:57 UTC (permalink / raw)
  To: Alireza Sanaee, qemu-devel@nongnu.org
  Cc: linuxarm@huawei.com, jonathan.cameron@huawei.com,
	nifan.cxl@gmail.com, anisa.su887@gmail.com, mst@redhat.com,
	armbru@redhat.com, ppbonzini@redhat.com, peterx@redhat.com,
	philmd@linaro.org, david@kernel.org, imammedo@redhat.com,
	xiaoguangrong.eric@gmail.com, venkataravis@micron.com,
	gourry@gourry.net



On 04/02/2026 23:00, Alireza Sanaee wrote:
> This function will shortly be used to help find if there is a route to a
> device, serving an HPA, under a particular fixed memory window. Rather
> than having that new use case subtract the base address in the caller,
> only to add it again in cxl_cfmws_find_device(), push the responsibility
> for calculating the HPA to the caller.
> 
> This also reduces the inconsistency in the meaning of the hwaddr
> addr parameter between this function and the calls made within it
> that access the HDM decoders, which operate on HPA.
> 
> Signed-off-by: Alireza Sanaee <alireza.sanaee@huawei.com>
> ---
>   hw/cxl/cxl-host.c | 6 ++----
>   1 file changed, 2 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/cxl/cxl-host.c b/hw/cxl/cxl-host.c
> index f3479b1991..9633b01abf 100644
> --- a/hw/cxl/cxl-host.c
> +++ b/hw/cxl/cxl-host.c
> @@ -168,8 +168,6 @@ static PCIDevice *cxl_cfmws_find_device(CXLFixedWindow *fw, hwaddr addr)
>       bool target_found;
>       PCIDevice *rp, *d;
>   
> -    /* Address is relative to memory region. Convert to HPA */
> -    addr += fw->base;
>   
>       rb_index = (addr / cxl_decode_ig(fw->enc_int_gran)) % fw->num_targets;
>       hb = PCI_HOST_BRIDGE(fw->target_hbs[rb_index]->cxl_host_bridge);
> @@ -254,7 +252,7 @@ static MemTxResult cxl_read_cfmws(void *opaque, hwaddr addr, uint64_t *data,
>       CXLFixedWindow *fw = opaque;
>       PCIDevice *d;
>   
> -    d = cxl_cfmws_find_device(fw, addr);
> +    d = cxl_cfmws_find_device(fw, addr + fw->base, true);


It seems you've added an extra parameter to the calls to cxl_cfmws_find_device(), but its function signature wasn't updated.



Thanks
Zhijian



>       if (d == NULL) {
>           *data = 0;
>           /* Reads to invalid address return poison */
> @@ -271,7 +269,7 @@ static MemTxResult cxl_write_cfmws(void *opaque, hwaddr addr,
>       CXLFixedWindow *fw = opaque;
>       PCIDevice *d;
>   
> -    d = cxl_cfmws_find_device(fw, addr);
> +    d = cxl_cfmws_find_device(fw, addr + fw->base, true);
>       if (d == NULL) {
>           /* Writes to invalid address are silent */
>           return MEMTX_OK;


* Re: [PATCH v2 1/2] hw/cxl: Use HPA in cxl_cfmws_find_device() rather than offset in window.
  2026-02-06  9:57   ` Zhijian Li (Fujitsu)
@ 2026-02-06 12:51     ` Jonathan Cameron via qemu development
  0 siblings, 0 replies; 12+ messages in thread
From: Jonathan Cameron via qemu development @ 2026-02-06 12:51 UTC (permalink / raw)
  To: Zhijian Li (Fujitsu), linuxarm
  Cc: Alireza Sanaee, qemu-devel@nongnu.org,
	jonathan.cameron@huawei.com, nifan.cxl@gmail.com,
	anisa.su887@gmail.com, mst@redhat.com, armbru@redhat.com,
	ppbonzini@redhat.com, peterx@redhat.com, philmd@linaro.org,
	david@kernel.org, imammedo@redhat.com,
	xiaoguangrong.eric@gmail.com, venkataravis@micron.com,
	gourry@gourry.net

On Fri, 6 Feb 2026 09:57:15 +0000
"Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> wrote:

> On 04/02/2026 23:00, Alireza Sanaee wrote:
> > This function will shortly be used to help find if there is a route to a
> > device, serving an HPA, under a particular fixed memory window. Rather
> > than having that new use case subtract the base address in the caller,
> > only to add it again in cxl_cfmws_find_device(), push the responsibility
> > for calculating the HPA to the caller.
> > 
> > This also reduces the inconsistency in the meaning of the hwaddr
> > addr parameter between this function and the calls made within it
> > that access the HDM decoders, which operate on HPA.
> > 
> > Signed-off-by: Alireza Sanaee <alireza.sanaee@huawei.com>
> > ---
> >   hw/cxl/cxl-host.c | 6 ++----
> >   1 file changed, 2 insertions(+), 4 deletions(-)
> > 
> > diff --git a/hw/cxl/cxl-host.c b/hw/cxl/cxl-host.c
> > index f3479b1991..9633b01abf 100644
> > --- a/hw/cxl/cxl-host.c
> > +++ b/hw/cxl/cxl-host.c
> > @@ -168,8 +168,6 @@ static PCIDevice *cxl_cfmws_find_device(CXLFixedWindow *fw, hwaddr addr)
> >       bool target_found;
> >       PCIDevice *rp, *d;
> >   
> > -    /* Address is relative to memory region. Convert to HPA */
> > -    addr += fw->base;
> >   
> >       rb_index = (addr / cxl_decode_ig(fw->enc_int_gran)) % fw->num_targets;
> >       hb = PCI_HOST_BRIDGE(fw->target_hbs[rb_index]->cxl_host_bridge);
> > @@ -254,7 +252,7 @@ static MemTxResult cxl_read_cfmws(void *opaque, hwaddr addr, uint64_t *data,
> >       CXLFixedWindow *fw = opaque;
> >       PCIDevice *d;
> >   
> > -    d = cxl_cfmws_find_device(fw, addr);
> > +    d = cxl_cfmws_find_device(fw, addr + fw->base, true);  
> 
> 
> It seems you've added an extra parameter to the calls to cxl_cfmws_find_device(), but its function signature wasn't updated.

Good spot. The extra parameter should be in the next patch.

Jonathan

> 
> 
> 
> Thanks
> Zhijian
> 
> 
> 
> >       if (d == NULL) {
> >           *data = 0;
> >           /* Reads to invalid address return poison */
> > @@ -271,7 +269,7 @@ static MemTxResult cxl_write_cfmws(void *opaque, hwaddr addr,
> >       CXLFixedWindow *fw = opaque;
> >       PCIDevice *d;
> >   
> > -    d = cxl_cfmws_find_device(fw, addr);
> > +    d = cxl_cfmws_find_device(fw, addr + fw->base, true);
> >       if (d == NULL) {
> >           /* Writes to invalid address are silent */
> >           return MEMTX_OK;  




* [PATCH v2 2/2] hw/cxl: Add a performant (and correct) path for the non interleaved cases
  2026-02-04 14:59 [PATCH v2 0/2] Performant CXL type 3 non-interleaved regions Alireza Sanaee via qemu development
  2026-02-04 15:00 ` [PATCH v2 1/2] hw/cxl: Use HPA in cxl_cfmws_find_device() rather than offset in window Alireza Sanaee via qemu development
@ 2026-02-04 15:00 ` Alireza Sanaee via qemu development
  2026-02-09  3:06   ` Zhijian Li (Fujitsu)
  2026-02-05 18:07 ` [PATCH v2 0/2] Performant CXL type 3 non-interleaved regions Gregory Price
  2026-02-09  3:17 ` Gregory Price
  3 siblings, 1 reply; 12+ messages in thread
From: Alireza Sanaee via qemu development @ 2026-02-04 15:00 UTC (permalink / raw)
  To: qemu-devel
  Cc: linuxarm, jonathan.cameron, nifan.cxl, anisa.su887, mst, armbru,
	lizhijian, ppbonzini, peterx, philmd, david, imammedo,
	xiaoguangrong.eric, venkataravis, gourry

The CXL address-to-device decoding logic is complex because of the need
to correctly decode fine-grained interleave. The current implementation
prevents use with KVM where executed instructions may reside in that
memory and gives very slow performance even in TCG.

In many real cases non-interleaved memory configurations are useful and
for those we can use a more conventional memory region alias, allowing
similar performance to other memory in the system.
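
An alias needs a backing region and an offset into it. The following is
a minimal standalone sketch of how a device physical address (DPA) could
be resolved to either the volatile or the persistent backend, in the
spirit of cxl_fmws_direct_passthrough_setup() in the diff below. The
names ToyBackend and toy_pick_backend() are hypothetical, and an absent
backend is modelled here as size 0, which is an assumption of the sketch.

```c
#include <stdint.h>

typedef enum ToyBackend {
    TOY_NONE,        /* DPA is not covered by any backend */
    TOY_VOLATILE,    /* DPA falls in the volatile backend */
    TOY_PERSISTENT,  /* DPA falls in the persistent backend */
} ToyBackend;

/*
 * Volatile capacity comes first in DPA space, persistent capacity follows
 * it. On success *offset is the offset into the chosen backend.
 */
static ToyBackend toy_pick_backend(uint64_t dpa, uint64_t vmem_size,
                                   uint64_t pmem_size, uint64_t *offset)
{
    if (dpa < vmem_size) {
        *offset = dpa;
        return TOY_VOLATILE;
    }
    /* dpa >= vmem_size here, so the subtraction cannot wrap. */
    if (dpa - vmem_size < pmem_size) {
        *offset = dpa - vmem_size;
        return TOY_PERSISTENT;
    }
    return TOY_NONE;
}
```

When no backend covers the DPA the sketch reports TOY_NONE, matching the
patch's behaviour of simply not installing an alias in that case.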

Co-developed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Alireza Sanaee <alireza.sanaee@huawei.com>
---
 hw/cxl/cxl-component-utils.c |   8 ++
 hw/cxl/cxl-host.c            | 197 ++++++++++++++++++++++++++++++++++-
 hw/mem/cxl_type3.c           |   4 +
 include/hw/cxl/cxl.h         |   1 +
 include/hw/cxl/cxl_device.h  |   1 +
 5 files changed, 206 insertions(+), 5 deletions(-)

diff --git a/hw/cxl/cxl-component-utils.c b/hw/cxl/cxl-component-utils.c
index d36162e91b..d459d04b6d 100644
--- a/hw/cxl/cxl-component-utils.c
+++ b/hw/cxl/cxl-component-utils.c
@@ -142,6 +142,14 @@ static void dumb_hdm_handler(CXLComponentState *cxl_cstate, hwaddr offset,
         value = FIELD_DP32(value, CXL_HDM_DECODER0_CTRL, COMMITTED, 0);
     }
     stl_le_p((uint8_t *)cache_mem + offset, value);
+
+    if (should_commit) {
+        cfmws_update_non_interleaved(true);
+    }
+
+    if (should_uncommit) {
+        cfmws_update_non_interleaved(false);
+    }
 }
 
 static void bi_handler(CXLComponentState *cxl_cstate, hwaddr offset,
diff --git a/hw/cxl/cxl-host.c b/hw/cxl/cxl-host.c
index 9633b01abf..1fcfe01164 100644
--- a/hw/cxl/cxl-host.c
+++ b/hw/cxl/cxl-host.c
@@ -104,7 +104,7 @@ void cxl_fmws_link_targets(Error **errp)
 }
 
 static bool cxl_hdm_find_target(uint32_t *cache_mem, hwaddr addr,
-                                uint8_t *target)
+                                uint8_t *target, bool *interleaved)
 {
     int hdm_inc = R_CXL_HDM_DECODER1_BASE_LO - R_CXL_HDM_DECODER0_BASE_LO;
     unsigned int hdm_count;
@@ -138,6 +138,11 @@ static bool cxl_hdm_find_target(uint32_t *cache_mem, hwaddr addr,
         found = true;
         ig_enc = FIELD_EX32(ctrl, CXL_HDM_DECODER0_CTRL, IG);
         iw_enc = FIELD_EX32(ctrl, CXL_HDM_DECODER0_CTRL, IW);
+
+        if (interleaved) {
+            *interleaved = iw_enc != 0;
+        }
+
         target_idx = (addr / cxl_decode_ig(ig_enc)) % (1 << iw_enc);
 
         if (target_idx < 4) {
@@ -157,7 +162,8 @@ static bool cxl_hdm_find_target(uint32_t *cache_mem, hwaddr addr,
     return found;
 }
 
-static PCIDevice *cxl_cfmws_find_device(CXLFixedWindow *fw, hwaddr addr)
+static PCIDevice *cxl_cfmws_find_device(CXLFixedWindow *fw, hwaddr addr,
+                                        bool allow_interleave)
 {
     CXLComponentState *hb_cstate, *usp_cstate;
     PCIHostState *hb;
@@ -165,9 +171,12 @@ static PCIDevice *cxl_cfmws_find_device(CXLFixedWindow *fw, hwaddr addr)
     int rb_index;
     uint32_t *cache_mem;
     uint8_t target;
-    bool target_found;
+    bool target_found, interleaved;
     PCIDevice *rp, *d;
 
+    if ((fw->num_targets > 1) && !allow_interleave) {
+        return NULL;
+    }
 
     rb_index = (addr / cxl_decode_ig(fw->enc_int_gran)) % fw->num_targets;
     hb = PCI_HOST_BRIDGE(fw->target_hbs[rb_index]->cxl_host_bridge);
@@ -188,11 +197,16 @@ static PCIDevice *cxl_cfmws_find_device(CXLFixedWindow *fw, hwaddr addr)
 
         cache_mem = hb_cstate->crb.cache_mem_registers;
 
-        target_found = cxl_hdm_find_target(cache_mem, addr, &target);
+        target_found = cxl_hdm_find_target(cache_mem, addr, &target,
+                                           &interleaved);
         if (!target_found) {
             return NULL;
         }
 
+        if (interleaved && !allow_interleave) {
+            return NULL;
+        }
+
         rp = pcie_find_port_by_pn(hb->bus, target);
         if (!rp) {
             return NULL;
@@ -224,11 +238,15 @@ static PCIDevice *cxl_cfmws_find_device(CXLFixedWindow *fw, hwaddr addr)
 
     cache_mem = usp_cstate->crb.cache_mem_registers;
 
-    target_found = cxl_hdm_find_target(cache_mem, addr, &target);
+    target_found = cxl_hdm_find_target(cache_mem, addr, &target, &interleaved);
     if (!target_found) {
         return NULL;
     }
 
+    if (interleaved && !allow_interleave) {
+        return NULL;
+    }
+
     d = pcie_find_port_by_pn(&PCI_BRIDGE(d)->sec_bus, target);
     if (!d) {
         return NULL;
@@ -246,6 +264,175 @@ static PCIDevice *cxl_cfmws_find_device(CXLFixedWindow *fw, hwaddr addr)
     return d;
 }
 
+typedef struct CXLDirectPTState {
+    CXLType3Dev *ct3d;
+    hwaddr decoder_base;
+    hwaddr decoder_size;
+    hwaddr dpa_base;
+    unsigned int hdm_decoder_idx;
+    bool commit;
+} CXLDirectPTState;
+
+static void cxl_fmws_direct_passthrough_setup(CXLDirectPTState *state,
+                                              CXLFixedWindow *fw)
+{
+    CXLType3Dev *ct3d = state->ct3d;
+    MemoryRegion *mr = NULL;
+    uint64_t vmr_size = 0, pmr_size = 0, offset = 0;
+    MemoryRegion *direct_mr;
+
+    if (ct3d->hostvmem) {
+        MemoryRegion *vmr = host_memory_backend_get_memory(ct3d->hostvmem);
+
+        vmr_size = memory_region_size(vmr);
+        if (state->dpa_base < vmr_size) {
+            mr = vmr;
+            offset = state->dpa_base;
+        }
+    }
+    if (!mr && ct3d->hostpmem) {
+        MemoryRegion *pmr = host_memory_backend_get_memory(ct3d->hostpmem);
+
+        pmr_size = memory_region_size(pmr);
+        if (state->dpa_base - vmr_size < pmr_size) {
+            mr = pmr;
+            offset = state->dpa_base - vmr_size;
+        }
+    }
+    if (!mr) {
+        return;
+    }
+
+    direct_mr = &ct3d->direct_mr[state->hdm_decoder_idx];
+    if (memory_region_is_mapped(direct_mr)) {
+        return;
+    }
+
+    memory_region_init_alias(direct_mr, OBJECT(ct3d), "direct-mapping", mr,
+                             offset, state->decoder_size);
+    memory_region_add_subregion(&fw->mr,
+                                state->decoder_base - fw->base, direct_mr);
+}
+
+static void cxl_fmws_direct_passthrough_teardown(CXLDirectPTState *state,
+                                                 CXLFixedWindow *fw)
+{
+    CXLType3Dev *ct3d = state->ct3d;
+    MemoryRegion *direct_mr = &ct3d->direct_mr[state->hdm_decoder_idx];
+
+    if (memory_region_is_mapped(direct_mr)) {
+        memory_region_del_subregion(&fw->mr, direct_mr);
+    }
+}
+
+static int cxl_fmws_direct_passthrough(Object *obj, void *opaque)
+{
+    CXLDirectPTState *state = opaque;
+    CXLFixedWindow *fw;
+
+    if (!object_dynamic_cast(obj, TYPE_CXL_FMW)) {
+        return 0;
+    }
+
+    fw = CXL_FMW(obj);
+
+    /* Verify not interleaved */
+    if (!cxl_cfmws_find_device(fw, state->decoder_base, false)) {
+        return 0;
+    }
+
+    if (state->commit) {
+        cxl_fmws_direct_passthrough_setup(state, fw);
+    } else {
+        cxl_fmws_direct_passthrough_teardown(state, fw);
+    }
+
+    return 0;
+}
+
+static int update_non_interleaved(Object *obj, void *opaque)
+{
+    const int hdm_inc = R_CXL_HDM_DECODER1_BASE_LO - R_CXL_HDM_DECODER0_BASE_LO;
+    bool commit = *(bool *)opaque;
+    CXLType3Dev *ct3d;
+    uint32_t *cache_mem;
+    unsigned int hdm_count, i;
+    uint32_t cap;
+    uint64_t dpa_base = 0;
+
+    if (!object_dynamic_cast(obj, TYPE_CXL_TYPE3)) {
+        return 0;
+    }
+
+    ct3d = CXL_TYPE3(obj);
+    cache_mem = ct3d->cxl_cstate.crb.cache_mem_registers;
+    cap = ldl_le_p(cache_mem + R_CXL_HDM_DECODER_CAPABILITY);
+    hdm_count = cxl_decoder_count_dec(FIELD_EX32(cap,
+                                                 CXL_HDM_DECODER_CAPABILITY,
+                                                 DECODER_COUNT));
+    /*
+     * Walk the decoders and find any committed with iw set to 0
+     * (non interleaved).
+     */
+    for (i = 0; i < hdm_count; i++) {
+        uint64_t decoder_base, decoder_size, skip;
+        uint32_t hdm_ctrl, low, high;
+        int iw, committed;
+
+        hdm_ctrl = ldl_le_p(cache_mem + R_CXL_HDM_DECODER0_CTRL + i * hdm_inc);
+        committed = FIELD_EX32(hdm_ctrl, CXL_HDM_DECODER0_CTRL, COMMITTED);
+        if (commit ^ committed) {
+            return 0;
+        }
+
+        low = ldl_le_p(cache_mem + R_CXL_HDM_DECODER0_DPA_SKIP_LO +
+                       i * hdm_inc);
+        high = ldl_le_p(cache_mem + R_CXL_HDM_DECODER0_DPA_SKIP_HI +
+                        i * hdm_inc);
+        skip = ((uint64_t)high << 32) | (low & 0xf0000000);
+        dpa_base += skip;
+
+        low = ldl_le_p(cache_mem + R_CXL_HDM_DECODER0_SIZE_LO + i * hdm_inc);
+        high = ldl_le_p(cache_mem + R_CXL_HDM_DECODER0_SIZE_HI + i * hdm_inc);
+        decoder_size = ((uint64_t)high << 32) | (low & 0xf0000000);
+
+        low = ldl_le_p(cache_mem + R_CXL_HDM_DECODER0_BASE_LO + i * hdm_inc);
+        high = ldl_le_p(cache_mem + R_CXL_HDM_DECODER0_BASE_HI + i * hdm_inc);
+        decoder_base = ((uint64_t)high << 32) | (low & 0xf0000000);
+
+        iw = FIELD_EX32(hdm_ctrl, CXL_HDM_DECODER0_CTRL, IW);
+
+        if (iw == 0) {
+            CXLDirectPTState state = {
+                .ct3d = ct3d,
+                .decoder_base = decoder_base,
+                .decoder_size = decoder_size,
+                .dpa_base = dpa_base,
+                .hdm_decoder_idx = i,
+                .commit = commit,
+            };
+
+            object_child_foreach_recursive(object_get_root(),
+                                           cxl_fmws_direct_passthrough, &state);
+        }
+        dpa_base += decoder_size / cxl_interleave_ways_dec(iw, &error_fatal);
+    }
+
+    return 0;
+}
+
+bool cfmws_update_non_interleaved(bool commit)
+{
+    /*
+     * Walk endpoints to find committed decoders then check if they are not
+     * interleaved (but path is fully set up).
+     */
+    object_child_foreach_recursive(object_get_root(),
+                                   update_non_interleaved, &commit);
+
+    return false;
+}
+
 static MemTxResult cxl_read_cfmws(void *opaque, hwaddr addr, uint64_t *data,
                                   unsigned size, MemTxAttrs attrs)
 {
diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c
index 3f09c589ae..a95f6a4014 100644
--- a/hw/mem/cxl_type3.c
+++ b/hw/mem/cxl_type3.c
@@ -427,6 +427,8 @@ static void hdm_decoder_commit(CXLType3Dev *ct3d, int which)
     ctrl = FIELD_DP32(ctrl, CXL_HDM_DECODER0_CTRL, COMMITTED, 1);
 
     stl_le_p(cache_mem + R_CXL_HDM_DECODER0_CTRL + which * hdm_inc, ctrl);
+
+    cfmws_update_non_interleaved(true);
 }
 
 static void hdm_decoder_uncommit(CXLType3Dev *ct3d, int which)
@@ -442,6 +444,8 @@ static void hdm_decoder_uncommit(CXLType3Dev *ct3d, int which)
     ctrl = FIELD_DP32(ctrl, CXL_HDM_DECODER0_CTRL, COMMITTED, 0);
 
     stl_le_p(cache_mem + R_CXL_HDM_DECODER0_CTRL + which * hdm_inc, ctrl);
+
+    cfmws_update_non_interleaved(false);
 }
 
 static int ct3d_qmp_uncor_err_to_cxl(CxlUncorErrorType qmp_err)
diff --git a/include/hw/cxl/cxl.h b/include/hw/cxl/cxl.h
index 998f495a98..931f5680bd 100644
--- a/include/hw/cxl/cxl.h
+++ b/include/hw/cxl/cxl.h
@@ -71,4 +71,5 @@ CXLComponentState *cxl_usp_to_cstate(CXLUpstreamPort *usp);
 typedef struct CXLDownstreamPort CXLDownstreamPort;
 DECLARE_INSTANCE_CHECKER(CXLDownstreamPort, CXL_DSP, TYPE_CXL_DSP)
 
+bool cfmws_update_non_interleaved(bool commit);
 #endif
diff --git a/include/hw/cxl/cxl_device.h b/include/hw/cxl/cxl_device.h
index 393f312217..d295469301 100644
--- a/include/hw/cxl/cxl_device.h
+++ b/include/hw/cxl/cxl_device.h
@@ -712,6 +712,7 @@ struct CXLType3Dev {
     uint64_t sn;
 
     /* State */
+    MemoryRegion direct_mr[CXL_HDM_DECODER_COUNT];
     AddressSpace hostvmem_as;
     AddressSpace hostpmem_as;
     CXLComponentState cxl_cstate;
-- 
2.43.0




* Re: [PATCH v2 2/2] hw/cxl: Add a performant (and correct) path for the non interleaved cases
  2026-02-04 15:00 ` [PATCH v2 2/2] hw/cxl: Add a performant (and correct) path for the non interleaved cases Alireza Sanaee via qemu development
@ 2026-02-09  3:06   ` Zhijian Li (Fujitsu)
  2026-02-09 13:47     ` Alireza Sanaee via qemu development
  2026-02-12 14:25     ` Alireza Sanaee via qemu development
  0 siblings, 2 replies; 12+ messages in thread
From: Zhijian Li (Fujitsu) @ 2026-02-09  3:06 UTC (permalink / raw)
  To: Alireza Sanaee, qemu-devel@nongnu.org
  Cc: linuxarm@huawei.com, jonathan.cameron@huawei.com,
	nifan.cxl@gmail.com, anisa.su887@gmail.com, mst@redhat.com,
	armbru@redhat.com, ppbonzini@redhat.com, peterx@redhat.com,
	philmd@linaro.org, david@kernel.org, imammedo@redhat.com,
	xiaoguangrong.eric@gmail.com, venkataravis@micron.com,
	gourry@gourry.net



On 04/02/2026 23:00, Alireza Sanaee wrote:
> The CXL address to device decoding logic is complex because of the need
> to correctly decode fine grained interleave. The current implementation
> prevents use with KVM where executed instructions may reside in that
> memory and gives very slow performance even in TCG.


Do you mean KVM will work after this patch in the 1-way case?


> 
> In many real cases non interleaved memory configurations are useful and
> for those we can use a more conventional memory region alias allowing
> similar performance to other memory in the system.


Do you have any benchmark data to demonstrate the performance improvement?



> 
> Co-developed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Signed-off-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Signed-off-by: Alireza Sanaee <alireza.sanaee@huawei.com>
> ---
>   hw/cxl/cxl-component-utils.c |   8 ++
>   hw/cxl/cxl-host.c            | 197 ++++++++++++++++++++++++++++++++++-
>   hw/mem/cxl_type3.c           |   4 +
>   include/hw/cxl/cxl.h         |   1 +
>   include/hw/cxl/cxl_device.h  |   1 +
>   5 files changed, 206 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/cxl/cxl-component-utils.c b/hw/cxl/cxl-component-utils.c
> index d36162e91b..d459d04b6d 100644
> --- a/hw/cxl/cxl-component-utils.c
> +++ b/hw/cxl/cxl-component-utils.c
> @@ -142,6 +142,14 @@ static void dumb_hdm_handler(CXLComponentState *cxl_cstate, hwaddr offset,
>           value = FIELD_DP32(value, CXL_HDM_DECODER0_CTRL, COMMITTED, 0);
>       }
>       stl_le_p((uint8_t *)cache_mem + offset, value);
> +
> +    if (should_commit) {
> +        cfmws_update_non_interleaved(true);
> +    }
> +
> +    if (should_uncommit) {
> +        cfmws_update_non_interleaved(false);
> +    }


This might be slightly better as an if-else block, since should_commit and should_uncommit are mutually exclusive:


if (should_commit) {
    cfmws_update_non_interleaved(true);
} else if (should_uncommit) {
    cfmws_update_non_interleaved(false);
}


>   }
>   
>   static void bi_handler(CXLComponentState *cxl_cstate, hwaddr offset,
> diff --git a/hw/cxl/cxl-host.c b/hw/cxl/cxl-host.c
> index 9633b01abf..1fcfe01164 100644
> --- a/hw/cxl/cxl-host.c
> +++ b/hw/cxl/cxl-host.c
> @@ -104,7 +104,7 @@ void cxl_fmws_link_targets(Error **errp)
>   }
>   
>   static bool cxl_hdm_find_target(uint32_t *cache_mem, hwaddr addr,
> -                                uint8_t *target)
> +                                uint8_t *target, bool *interleaved)
>   {
>       int hdm_inc = R_CXL_HDM_DECODER1_BASE_LO - R_CXL_HDM_DECODER0_BASE_LO;
>       unsigned int hdm_count;
> @@ -138,6 +138,11 @@ static bool cxl_hdm_find_target(uint32_t *cache_mem, hwaddr addr,
>           found = true;
>           ig_enc = FIELD_EX32(ctrl, CXL_HDM_DECODER0_CTRL, IG);
>           iw_enc = FIELD_EX32(ctrl, CXL_HDM_DECODER0_CTRL, IW);
> +
> +        if (interleaved) {
> +            *interleaved = iw_enc != 0;
> +        }
> +
>           target_idx = (addr / cxl_decode_ig(ig_enc)) % (1 << iw_enc);
>   
>           if (target_idx < 4) {
> @@ -157,7 +162,8 @@ static bool cxl_hdm_find_target(uint32_t *cache_mem, hwaddr addr,
>       return found;

[...]

> +
> +static int update_non_interleaved(Object *obj, void *opaque)
> +{
> +    const int hdm_inc = R_CXL_HDM_DECODER1_BASE_LO - R_CXL_HDM_DECODER0_BASE_LO;
> +    bool commit = *(bool *)opaque;
> +    CXLType3Dev *ct3d;
> +    uint32_t *cache_mem;
> +    unsigned int hdm_count, i;
> +    uint32_t cap;
> +    uint64_t dpa_base = 0;
> +
> +    if (!object_dynamic_cast(obj, TYPE_CXL_TYPE3)) {
> +        return 0;
> +    }
> +
> +    ct3d = CXL_TYPE3(obj);
> +    cache_mem = ct3d->cxl_cstate.crb.cache_mem_registers;
> +    cap = ldl_le_p(cache_mem + R_CXL_HDM_DECODER_CAPABILITY);
> +    hdm_count = cxl_decoder_count_dec(FIELD_EX32(cap,
> +                                                 CXL_HDM_DECODER_CAPABILITY,
> +                                                 DECODER_COUNT));
> +    /*
> +     * Walk the decoders and find any committed with iw set to 0
> +     * (non interleaved).
> +     */
> +    for (i = 0; i < hdm_count; i++) {
> +        uint64_t decoder_base, decoder_size, skip;
> +        uint32_t hdm_ctrl, low, high;
> +        int iw, committed;
> +
> +        hdm_ctrl = ldl_le_p(cache_mem + R_CXL_HDM_DECODER0_CTRL + i * hdm_inc);
> +        committed = FIELD_EX32(hdm_ctrl, CXL_HDM_DECODER0_CTRL, COMMITTED);
> +        if (commit ^ committed) {


Is there any specific reason for using ^ here instead of '==' ?



Thanks
Zhijian


* Re: [PATCH v2 2/2] hw/cxl: Add a performant (and correct) path for the non interleaved cases
  2026-02-09  3:06   ` Zhijian Li (Fujitsu)
@ 2026-02-09 13:47     ` Alireza Sanaee via qemu development
  2026-02-12 14:25     ` Alireza Sanaee via qemu development
  1 sibling, 0 replies; 12+ messages in thread
From: Alireza Sanaee via qemu development @ 2026-02-09 13:47 UTC (permalink / raw)
  To: Zhijian Li (Fujitsu)
  Cc: qemu-devel@nongnu.org, linuxarm@huawei.com,
	jonathan.cameron@huawei.com, nifan.cxl@gmail.com,
	anisa.su887@gmail.com, mst@redhat.com, armbru@redhat.com,
	ppbonzini@redhat.com, peterx@redhat.com, philmd@linaro.org,
	david@kernel.org, imammedo@redhat.com,
	xiaoguangrong.eric@gmail.com, venkataravis@micron.com,
	gourry@gourry.net

On Mon, 9 Feb 2026 03:06:00 +0000
"Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> wrote:

> On 04/02/2026 23:00, Alireza Sanaee wrote:
> > The CXL address to device decoding logic is complex because of the need
> > to correctly decode fine grained interleave. The current implementation
> > prevents use with KVM where executed instructions may reside in that
> > memory and gives very slow performance even in TCG.  
> 
> 
> Do you mean KVM will work after this patch in 1-way case

Yes

> 
> 
> > 
> > In many real cases non interleaved memory configurations are useful and
> > for those we can use a more conventional memory region alias allowing
> > similar performance to other memory in the system.  
> 
> 
> Do you have any benchmark data to demonstrate the performance improvement?

Yes. If you set up a simple devdax application that sweeps through the region with reads and writes, the performance difference is easy to see under both KVM and TCG. I can share more if you wish.
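
To make the kind of test concrete, here is a minimal sketch of such a sweep. It is not the author's actual benchmark; the function names, the example device path, and the stride are assumptions chosen for illustration. The sweep writes one byte per stride, reads it back, and counts mismatches, and a small wrapper times it with CLOCK_MONOTONIC.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <time.h>
#include <unistd.h>

/*
 * Write one byte every `stride` bytes, then read each byte back.
 * Returns the number of locations whose contents did not match.
 */
size_t sweep_region(volatile uint8_t *base, size_t len, size_t stride)
{
    size_t off, bad = 0;

    assert(base != NULL);
    if (stride == 0) {
        stride = 1; /* avoid an endless loop on a bad argument */
    }
    for (off = 0; off < len; off += stride) {
        base[off] = (uint8_t)(off / stride);
    }
    for (off = 0; off < len; off += stride) {
        if (base[off] != (uint8_t)(off / stride)) {
            bad++;
        }
    }
    return bad;
}

/* Run one sweep and return the elapsed wall-clock time in seconds. */
double time_sweep(volatile uint8_t *base, size_t len, size_t stride,
                  size_t *bad)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    *bad = sweep_region(base, len, stride);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}

/*
 * Map `len` bytes of a devdax character device (the path is supplied by
 * the caller, e.g. a hypothetical /dev/dax0.0). Returns MAP_FAILED on error.
 */
void *map_devdax(const char *path, size_t len)
{
    int fd = open(path, O_RDWR);
    void *p;

    if (fd < 0) {
        return MAP_FAILED;
    }
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); /* the mapping stays valid after the descriptor is closed */
    return p;
}
```

A caller would map the device with map_devdax(), pass the result to time_sweep(), and compare the timings with and without the series applied. Devdax mappings usually have alignment requirements on the length (commonly 2 MiB), so that should be checked against the region actually created.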

> 
> 
> 
> > 
> > Co-developed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> > Signed-off-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> > Signed-off-by: Alireza Sanaee <alireza.sanaee@huawei.com>
> > ---
> >   hw/cxl/cxl-component-utils.c |   8 ++
> >   hw/cxl/cxl-host.c            | 197 ++++++++++++++++++++++++++++++++++-
> >   hw/mem/cxl_type3.c           |   4 +
> >   include/hw/cxl/cxl.h         |   1 +
> >   include/hw/cxl/cxl_device.h  |   1 +
> >   5 files changed, 206 insertions(+), 5 deletions(-)
> > 
> > diff --git a/hw/cxl/cxl-component-utils.c b/hw/cxl/cxl-component-utils.c
> > index d36162e91b..d459d04b6d 100644
> > --- a/hw/cxl/cxl-component-utils.c
> > +++ b/hw/cxl/cxl-component-utils.c
> > @@ -142,6 +142,14 @@ static void dumb_hdm_handler(CXLComponentState *cxl_cstate, hwaddr offset,
> >           value = FIELD_DP32(value, CXL_HDM_DECODER0_CTRL, COMMITTED, 0);
> >       }
> >       stl_le_p((uint8_t *)cache_mem + offset, value);
> > +
> > +    if (should_commit) {
> > +        cfmws_update_non_interleaved(true);
> > +    }
> > +
> > +    if (should_uncommit) {
> > +        cfmws_update_non_interleaved(false);
> > +    }  
> 
> 
> This might be slightly better as an if-else block, since should_commit and should_uncommit are mutually exclusive:
> 
> 
> if (should_commit) {
> 	cfmws_update_non_interleaved(true);
> } else if (should_uncommit) {
> 	cfmws_update_non_interleaved(false);
> }
> 
> 
> >   }
> >   
> >   static void bi_handler(CXLComponentState *cxl_cstate, hwaddr offset,
> > diff --git a/hw/cxl/cxl-host.c b/hw/cxl/cxl-host.c
> > index 9633b01abf..1fcfe01164 100644
> > --- a/hw/cxl/cxl-host.c
> > +++ b/hw/cxl/cxl-host.c
> > @@ -104,7 +104,7 @@ void cxl_fmws_link_targets(Error **errp)
> >   }
> >   
> >   static bool cxl_hdm_find_target(uint32_t *cache_mem, hwaddr addr,
> > -                                uint8_t *target)
> > +                                uint8_t *target, bool *interleaved)
> >   {
> >       int hdm_inc = R_CXL_HDM_DECODER1_BASE_LO - R_CXL_HDM_DECODER0_BASE_LO;
> >       unsigned int hdm_count;
> > @@ -138,6 +138,11 @@ static bool cxl_hdm_find_target(uint32_t *cache_mem, hwaddr addr,
> >           found = true;
> >           ig_enc = FIELD_EX32(ctrl, CXL_HDM_DECODER0_CTRL, IG);
> >           iw_enc = FIELD_EX32(ctrl, CXL_HDM_DECODER0_CTRL, IW);
> > +
> > +        if (interleaved) {
> > +            *interleaved = iw_enc != 0;
> > +        }
> > +
> >           target_idx = (addr / cxl_decode_ig(ig_enc)) % (1 << iw_enc);
> >   
> >           if (target_idx < 4) {
> > @@ -157,7 +162,8 @@ static bool cxl_hdm_find_target(uint32_t *cache_mem, hwaddr addr,
> >       return found;  
> 
> [...]
> 
> > +
> > +static int update_non_interleaved(Object *obj, void *opaque)
> > +{
> > +    const int hdm_inc = R_CXL_HDM_DECODER1_BASE_LO - R_CXL_HDM_DECODER0_BASE_LO;
> > +    bool commit = *(bool *)opaque;
> > +    CXLType3Dev *ct3d;
> > +    uint32_t *cache_mem;
> > +    unsigned int hdm_count, i;
> > +    uint32_t cap;
> > +    uint64_t dpa_base = 0;
> > +
> > +    if (!object_dynamic_cast(obj, TYPE_CXL_TYPE3)) {
> > +        return 0;
> > +    }
> > +
> > +    ct3d = CXL_TYPE3(obj);
> > +    cache_mem = ct3d->cxl_cstate.crb.cache_mem_registers;
> > +    cap = ldl_le_p(cache_mem + R_CXL_HDM_DECODER_CAPABILITY);
> > +    hdm_count = cxl_decoder_count_dec(FIELD_EX32(cap,
> > +                                                 CXL_HDM_DECODER_CAPABILITY,
> > +                                                 DECODER_COUNT));
> > +    /*
> > +     * Walk the decoders and find any committed with iw set to 0
> > +     * (non interleaved).
> > +     */
> > +    for (i = 0; i < hdm_count; i++) {
> > +        uint64_t decoder_base, decoder_size, skip;
> > +        uint32_t hdm_ctrl, low, high;
> > +        int iw, committed;
> > +
> > +        hdm_ctrl = ldl_le_p(cache_mem + R_CXL_HDM_DECODER0_CTRL + i * hdm_inc);
> > +        committed = FIELD_EX32(hdm_ctrl, CXL_HDM_DECODER0_CTRL, COMMITTED);
> > +        if (commit ^ committed) {  
> 
> 
> Is there any specific reason for using ^ here instead of '==' ?
> 
> 
> 
> Thanks
> Zhijian



* Re: [PATCH v2 2/2] hw/cxl: Add a performant (and correct) path for the non interleaved cases
  2026-02-09  3:06   ` Zhijian Li (Fujitsu)
  2026-02-09 13:47     ` Alireza Sanaee via qemu development
@ 2026-02-12 14:25     ` Alireza Sanaee via qemu development
  1 sibling, 0 replies; 12+ messages in thread
From: Alireza Sanaee via qemu development @ 2026-02-12 14:25 UTC (permalink / raw)
  To: Zhijian Li (Fujitsu)
  Cc: qemu-devel@nongnu.org, linuxarm@huawei.com,
	jonathan.cameron@huawei.com, nifan.cxl@gmail.com,
	anisa.su887@gmail.com, mst@redhat.com, armbru@redhat.com,
	ppbonzini@redhat.com, peterx@redhat.com, philmd@linaro.org,
	david@kernel.org, imammedo@redhat.com,
	xiaoguangrong.eric@gmail.com, venkataravis@micron.com,
	gourry@gourry.net

On Mon, 9 Feb 2026 03:06:00 +0000
"Zhijian Li (Fujitsu)" <lizhijian@fujitsu.com> wrote:

Hi Zhijian,

I forgot to address the other feedback you shared. This is another attempt.

> On 04/02/2026 23:00, Alireza Sanaee wrote:
> > The CXL address to device decoding logic is complex because of the need
> > to correctly decode fine grained interleave. The current implementation
> > prevents use with KVM where executed instructions may reside in that
> > memory and gives very slow performance even in TCG.  
> 
> 
> Do you mean KVM will work after this patch in 1-way case
> 

Yes.

> 
> > 
> > In many real cases non interleaved memory configurations are useful and
> > for those we can use a more conventional memory region alias allowing
> > similar performance to other memory in the system.  
> 
> 
> Do you have any benchmark data to demonstrate the performance improvement?

Yes, from simple experiments with the code snippet linked below. Compilation instructions are in the gist itself.

https://gist.github.com/sarsanaee/8791a4d32351b31228764ce7c2c4bb9c

> 
> 
> 
> > 
> > Co-developed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> > Signed-off-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> > Signed-off-by: Alireza Sanaee <alireza.sanaee@huawei.com>
> > ---
> >   hw/cxl/cxl-component-utils.c |   8 ++
> >   hw/cxl/cxl-host.c            | 197 ++++++++++++++++++++++++++++++++++-
> >   hw/mem/cxl_type3.c           |   4 +
> >   include/hw/cxl/cxl.h         |   1 +
> >   include/hw/cxl/cxl_device.h  |   1 +
> >   5 files changed, 206 insertions(+), 5 deletions(-)
> > 
> > diff --git a/hw/cxl/cxl-component-utils.c b/hw/cxl/cxl-component-utils.c
> > index d36162e91b..d459d04b6d 100644
> > --- a/hw/cxl/cxl-component-utils.c
> > +++ b/hw/cxl/cxl-component-utils.c
> > @@ -142,6 +142,14 @@ static void dumb_hdm_handler(CXLComponentState *cxl_cstate, hwaddr offset,
> >           value = FIELD_DP32(value, CXL_HDM_DECODER0_CTRL, COMMITTED, 0);
> >       }
> >       stl_le_p((uint8_t *)cache_mem + offset, value);
> > +
> > +    if (should_commit) {
> > +        cfmws_update_non_interleaved(true);
> > +    }
> > +
> > +    if (should_uncommit) {
> > +        cfmws_update_non_interleaved(false);
> > +    }  
> 
> 
> This might be slightly better as an if-else block, since should_commit and should_uncommit are mutually exclusive:
> 
> 
> if (should_commit) {
> 	cfmws_update_non_interleaved(true);
> } else if (should_uncommit) {
> 	cfmws_update_non_interleaved(false);
> }

I agree, this can be changed.

> 
> 
> >   }
> >   
> >   static void bi_handler(CXLComponentState *cxl_cstate, hwaddr offset,
> > diff --git a/hw/cxl/cxl-host.c b/hw/cxl/cxl-host.c
> > index 9633b01abf..1fcfe01164 100644
> > --- a/hw/cxl/cxl-host.c
> > +++ b/hw/cxl/cxl-host.c
> > @@ -104,7 +104,7 @@ void cxl_fmws_link_targets(Error **errp)
> >   }
> >   
> >   static bool cxl_hdm_find_target(uint32_t *cache_mem, hwaddr addr,
> > -                                uint8_t *target)
> > +                                uint8_t *target, bool *interleaved)
> >   {
> >       int hdm_inc = R_CXL_HDM_DECODER1_BASE_LO - R_CXL_HDM_DECODER0_BASE_LO;
> >       unsigned int hdm_count;
> > @@ -138,6 +138,11 @@ static bool cxl_hdm_find_target(uint32_t *cache_mem, hwaddr addr,
> >           found = true;
> >           ig_enc = FIELD_EX32(ctrl, CXL_HDM_DECODER0_CTRL, IG);
> >           iw_enc = FIELD_EX32(ctrl, CXL_HDM_DECODER0_CTRL, IW);
> > +
> > +        if (interleaved) {
> > +            *interleaved = iw_enc != 0;
> > +        }
> > +
> >           target_idx = (addr / cxl_decode_ig(ig_enc)) % (1 << iw_enc);
> >   
> >           if (target_idx < 4) {
> > @@ -157,7 +162,8 @@ static bool cxl_hdm_find_target(uint32_t *cache_mem, hwaddr addr,
> >       return found;  
> 
> [...]
> 
> > +
> > +static int update_non_interleaved(Object *obj, void *opaque)
> > +{
> > +    const int hdm_inc = R_CXL_HDM_DECODER1_BASE_LO - R_CXL_HDM_DECODER0_BASE_LO;
> > +    bool commit = *(bool *)opaque;
> > +    CXLType3Dev *ct3d;
> > +    uint32_t *cache_mem;
> > +    unsigned int hdm_count, i;
> > +    uint32_t cap;
> > +    uint64_t dpa_base = 0;
> > +
> > +    if (!object_dynamic_cast(obj, TYPE_CXL_TYPE3)) {
> > +        return 0;
> > +    }
> > +
> > +    ct3d = CXL_TYPE3(obj);
> > +    cache_mem = ct3d->cxl_cstate.crb.cache_mem_registers;
> > +    cap = ldl_le_p(cache_mem + R_CXL_HDM_DECODER_CAPABILITY);
> > +    hdm_count = cxl_decoder_count_dec(FIELD_EX32(cap,
> > +                                                 CXL_HDM_DECODER_CAPABILITY,
> > +                                                 DECODER_COUNT));
> > +    /*
> > +     * Walk the decoders and find any committed with iw set to 0
> > +     * (non interleaved).
> > +     */
> > +    for (i = 0; i < hdm_count; i++) {
> > +        uint64_t decoder_base, decoder_size, skip;
> > +        uint32_t hdm_ctrl, low, high;
> > +        int iw, committed;
> > +
> > +        hdm_ctrl = ldl_le_p(cache_mem + R_CXL_HDM_DECODER0_CTRL + i * hdm_inc);
> > +        committed = FIELD_EX32(hdm_ctrl, CXL_HDM_DECODER0_CTRL, COMMITTED);
> > +        if (commit ^ committed) {  
> 
> 
> Is there any specific reason for using ^ here instead of '==' ?

The XOR is only there to make sure the decoder state matches the operation being handled: COMMITTED set for a commit, and COMMITTED clear for an uncommit.

It is not valid to map anything, or to walk further down the for loop, if the states are anything other than those.

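
As a small standalone illustration of the operator question (a sketch, not code from the patch; the helper names are made up): assuming `committed` only ever holds 0 or 1, which should be the case if COMMITTED is a single-bit field, `commit ^ committed` is true exactly when the two values disagree. That makes it equivalent to `!=`, not to `==`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical helper mirroring the condition used in the patch. */
bool mismatch_xor(bool commit, int committed)
{
    /* The equivalence below only holds for values restricted to 0 or 1. */
    assert(committed == 0 || committed == 1);
    return commit ^ committed;
}

/* The same test spelled as a comparison. */
bool mismatch_ne(bool commit, int committed)
{
    assert(committed == 0 || committed == 1);
    return commit != (bool)committed;
}

/* Count the input combinations on which the two spellings disagree. */
int count_differences(void)
{
    int diffs = 0;

    for (int c = 0; c <= 1; c++) {
        for (int d = 0; d <= 1; d++) {
            if (mismatch_xor(c, d) != mismatch_ne(c, d)) {
                diffs++;
            }
        }
    }
    return diffs;
}
```

Since the two forms agree on all four combinations, writing the condition as `commit != committed` would keep the behaviour and may read more clearly.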
> 
> 
> 
> Thanks
> Zhijian



* Re: [PATCH v2 0/2] Performant CXL type 3 non-interleaved regions
  2026-02-04 14:59 [PATCH v2 0/2] Performant CXL type 3 non-interleaved regions Alireza Sanaee via qemu development
  2026-02-04 15:00 ` [PATCH v2 1/2] hw/cxl: Use HPA in cxl_cfmws_find_device() rather than offset in window Alireza Sanaee via qemu development
  2026-02-04 15:00 ` [PATCH v2 2/2] hw/cxl: Add a performant (and correct) path for the non interleaved cases Alireza Sanaee via qemu development
@ 2026-02-05 18:07 ` Gregory Price
  2026-02-06  9:27   ` Alireza Sanaee via qemu development
  2026-02-09  3:17 ` Gregory Price
  3 siblings, 1 reply; 12+ messages in thread
From: Gregory Price @ 2026-02-05 18:07 UTC (permalink / raw)
  To: Alireza Sanaee
  Cc: qemu-devel, linuxarm, jonathan.cameron, nifan.cxl, anisa.su887,
	mst, armbru, lizhijian, ppbonzini, peterx, philmd, david,
	imammedo, xiaoguangrong.eric, venkataravis

On Wed, Feb 04, 2026 at 02:59:59PM +0000, Alireza Sanaee via qemu development wrote:
> The CXL address to device decoding logic is complex because of the need
> to correctly decode fine grained interleave. The current implementation
> prevents use with KVM where executed instructions may reside in that
> memory and gives very slow performance even in TCG.
> 

what was the base-commit of this set? I'm having trouble trying to get
it applied.

~Gregory



* Re: [PATCH v2 0/2] Performant CXL type 3 non-interleaved regions
  2026-02-05 18:07 ` [PATCH v2 0/2] Performant CXL type 3 non-interleaved regions Gregory Price
@ 2026-02-06  9:27   ` Alireza Sanaee via qemu development
  0 siblings, 0 replies; 12+ messages in thread
From: Alireza Sanaee via qemu development @ 2026-02-06  9:27 UTC (permalink / raw)
  To: Gregory Price
  Cc: qemu-devel, linuxarm, jonathan.cameron, nifan.cxl, anisa.su887,
	mst, armbru, lizhijian, ppbonzini, peterx, philmd, david,
	imammedo, xiaoguangrong.eric, venkataravis

On Thu, 5 Feb 2026 13:07:03 -0500
Gregory Price <gourry@gourry.net> wrote:

> On Wed, Feb 04, 2026 at 02:59:59PM +0000, Alireza Sanaee via qemu development wrote:
> > The CXL address to device decoding logic is complex because of the need
> > to correctly decode fine grained interleave. The current implementation
> > prevents use with KVM where executed instructions may reside in that
> > memory and gives very slow performance even in TCG.
> >   
> 
> what was the base-commit of this set? I'm having trouble trying to get
> it applied.
> 
> ~Gregory

Hi Gregory,

It is on top of a bunch of patches that Jonathan is sending to be picked up. I'm not sure which of them have been picked up, but here is my tree, including Jonathan's patches.

https://github.com/sarsanaee/qemu-lab/commits/performant-v2-E/

Thanks,
Ali



* Re: [PATCH v2 0/2] Performant CXL type 3 non-interleaved regions
  2026-02-04 14:59 [PATCH v2 0/2] Performant CXL type 3 non-interleaved regions Alireza Sanaee via qemu development
                   ` (2 preceding siblings ...)
  2026-02-05 18:07 ` [PATCH v2 0/2] Performant CXL type 3 non-interleaved regions Gregory Price
@ 2026-02-09  3:17 ` Gregory Price
  2026-02-09 12:39   ` Jonathan Cameron via qemu development
  3 siblings, 1 reply; 12+ messages in thread
From: Gregory Price @ 2026-02-09  3:17 UTC (permalink / raw)
  To: Alireza Sanaee
  Cc: qemu-devel, linuxarm, jonathan.cameron, nifan.cxl, anisa.su887,
	mst, armbru, lizhijian, ppbonzini, peterx, philmd, david,
	imammedo, xiaoguangrong.eric, venkataravis

On Wed, Feb 04, 2026 at 02:59:59PM +0000, Alireza Sanaee wrote:
> The CXL address to device decoding logic is complex because of the need
> to correctly decode fine grained interleave. The current implementation
> prevents use with KVM where executed instructions may reside in that
> memory and gives very slow performance even in TCG.
> 
> In many real cases non interleaved memory configurations are useful and
> for those we can use a more conventional memory region alias allowing
> similar performance to other memory in the system.
> 
> Whether this fast path is applicable can be established once the full
> set of HDM decoders has been committed (in whatever order the guest
> decides to commit them). As such a check is performed on each commit /
> uncommit of HDM decoder to establish if the alias should be added or
> removed.
> 

Tested this on top of Jonathan's most recent draft branch (with the obvious
fixups mentioned here); it has been working nicely.

Tested-by: Gregory Price <gourry@gourry.net>


----

Jonathan the HACK patch was giving me issues with registers getting
plopped in the middle of a CFMW, not sure if this is the right fix or not

Fixes: 9a1b11bc03 hw/i386/pc: Add Aspeed i2c controller + MCTP with ACPI table

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index f9d0bb3b41..9c244d40a7 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -957,8 +957,8 @@ void pc_memory_init(PCMachineState *pcms,
         memory_region_init(mr, OBJECT(machine), "cxl_host_reg", cxl_size);
         memory_region_add_subregion(system_memory, cxl_base, mr);
         cxl_base = ROUND_UP(cxl_base + cxl_size, 256 * MiB);
-        pcms->i2c_base = cxl_base + cxl_size - 0x4000;
         cxl_resv_end = cxl_fmws_set_memmap(cxl_base, maxphysaddr);
+        pcms->i2c_base = cxl_resv_end + 0x1000;
         cxl_fmws_update_mmio();
     }




* Re: [PATCH v2 0/2] Performant CXL type 3 non-interleaved regions
  2026-02-09  3:17 ` Gregory Price
@ 2026-02-09 12:39   ` Jonathan Cameron via qemu development
  0 siblings, 0 replies; 12+ messages in thread
From: Jonathan Cameron via qemu development @ 2026-02-09 12:39 UTC (permalink / raw)
  To: Gregory Price, linuxarm
  Cc: Alireza Sanaee, qemu-devel, jonathan.cameron, nifan.cxl,
	anisa.su887, mst, armbru, lizhijian, ppbonzini, peterx, philmd,
	david, imammedo, xiaoguangrong.eric, venkataravis

On Sun, 8 Feb 2026 22:17:46 -0500
Gregory Price <gourry@gourry.net> wrote:

> On Wed, Feb 04, 2026 at 02:59:59PM +0000, Alireza Sanaee wrote:
> > The CXL address to device decoding logic is complex because of the need
> > to correctly decode fine grained interleave. The current implementation
> > prevents use with KVM where executed instructions may reside in that
> > memory and gives very slow performance even in TCG.
> > 
> > In many real cases non interleaved memory configurations are useful and
> > for those we can use a more conventional memory region alias allowing
> > similar performance to other memory in the system.
> > 
> > Whether this fast path is applicable can be established once the full
> > set of HDM decoders has been committed (in whatever order the guest
> > decides to commit them). As such a check is performed on each commit /
> > uncommit of HDM decoder to establish if the alias should be added or
> > removed.
> >   
> 
> Tested this on top of Jonathan's most recent draft branch, works nicely
> (with obvious fixups mentioned here). Has been working nicely.
> 
> Tested-by: Gregory Price <gourry@gourry.net>
> 
> 
> ----
> 
> Jonathan the HACK patch was giving me issues with registers getting
> plopped in the middle of a CFMW, not sure if this is the right fix or not
> 
> Fixes: 9a1b11bc03 hw/i386/pc: Add Aspeed i2c controller + MCTP with ACPI table

I'm kind of planning to drop the I2C / MCTP stuff anyway very soon as the
USB route is much less hacky and avoids need for hiding these registers somewhere.

I'll push a new draft branch soon (probably later this week) - Ideally that'll have
the DCD equivalent of this fast path as well.

The code definitely looks wrong.  If I keep the I2C stuff in my tree I'll
try and figure out what it should be doing...

Note if anyone sees this and really wants the I2C stuff then shout.
If not I'll drop it and see if anyone shouts ;)

Thanks!

Jonathan


> 
> diff --git a/hw/i386/pc.c b/hw/i386/pc.c
> index f9d0bb3b41..9c244d40a7 100644
> --- a/hw/i386/pc.c
> +++ b/hw/i386/pc.c
> @@ -957,8 +957,8 @@ void pc_memory_init(PCMachineState *pcms,
>          memory_region_init(mr, OBJECT(machine), "cxl_host_reg", cxl_size);
>          memory_region_add_subregion(system_memory, cxl_base, mr);
>          cxl_base = ROUND_UP(cxl_base + cxl_size, 256 * MiB);
> -        pcms->i2c_base = cxl_base + cxl_size - 0x4000;

At first glance I think this is misplaced and should be before the line
above that updates cxl_base.  It's still a horrible hack, as it will be
grabbing a bit of the host register space and relying on there not being
so many host bridges that we run out of space and get an overlap.
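
To illustrate the arithmetic, here is a standalone sketch (not QEMU code). The struct, the function name, and the example values are made up, and it assumes the first fixed memory window starts at the rounded-up cxl_base. It computes the I2C base both where the original line sits (after cxl_base is updated) and where it would sit if moved before the update.

```c
#include <assert.h>
#include <stdint.h>

#define MIB (1024ULL * 1024ULL)
/* Round n up to the next multiple of d (d must be nonzero). */
#define ROUND_UP_TO(n, d) ((((n) + (d) - 1) / (d)) * (d))

typedef struct {
    uint64_t reg_base;      /* start of the host register region */
    uint64_t reg_end;       /* end (exclusive) of the host register region */
    uint64_t window_base;   /* assumed start of the first fixed memory window */
    uint64_t i2c_as_posted; /* computed after cxl_base has been rounded up */
    uint64_t i2c_if_moved;  /* computed before cxl_base is rounded up */
} Layout;

Layout example_layout(uint64_t cxl_base, uint64_t cxl_size)
{
    Layout l;

    assert(cxl_size >= 0x4000); /* otherwise the subtraction would wrap */

    l.reg_base = cxl_base;
    l.reg_end = cxl_base + cxl_size;
    /* Placement if the assignment came before the cxl_base update. */
    l.i2c_if_moved = cxl_base + cxl_size - 0x4000;

    cxl_base = ROUND_UP_TO(cxl_base + cxl_size, 256 * MIB);
    l.window_base = cxl_base;
    /* Placement with the assignment after the update, as in the tree. */
    l.i2c_as_posted = cxl_base + cxl_size - 0x4000;
    return l;
}
```

With the hypothetical inputs cxl_base = 0x100000000 and cxl_size = 0x20000, the rounded-up base is 0x110000000. The original ordering yields 0x11001C000, which is above that base and so lands inside a window starting there, while the earlier placement yields 0x10001C000, within the last 16 KiB of the host register region.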

>          cxl_resv_end = cxl_fmws_set_memmap(cxl_base, maxphysaddr);
> +        pcms->i2c_base = cxl_resv_end + 0x1000;
>          cxl_fmws_update_mmio();
>      }
> 




