* [PATCH v6 0/3] hw/cxl: Add a performant (and correct) path for the non interleaved cases
@ 2026-02-26 15:20 Alireza Sanaee via qemu development
From: Alireza Sanaee via qemu development @ 2026-02-26 15:20 UTC (permalink / raw)
  To: qemu-devel, lizhijian
  Cc: anisa.su887, armbru, david, gourry, imammedo, jonathan.cameron,
	linuxarm, mst, nifan.cxl, peterx, philmd, ppbonzini, venkataravis,
	xiaoguangrong.eric

Hey everyone,

This is v6 of the performant CXL type 3 regions series:

v5 -> v6:
          - Use object_unparent() in the third commit when deleting alias regions.
          - Thanks to Gregory for the suggestion and testing.
v4 -> v5:
          - Fixed minor patch style issues such as whitespace.
v3 -> v4:
          - Changed the tear-down path, given that tear down is done
          differently from setup.
          - Dropped Gregory's Tested-by tag due to the tear-down changes.
v2 -> v3:
          - Addressed feedback from Zhijian Li. Thanks for the feedback.
v1 -> v2:
          - Mainly a rebase.

==========================================================

The CXL address-to-device decoding logic is complex because it must
correctly decode fine-grained interleave. The current implementation
cannot be used with KVM where executed instructions may reside in that
memory, and it is very slow even under TCG.

In many real cases non-interleaved memory configurations are useful, and
for those we can use a more conventional memory region alias, giving
performance similar to other memory in the system.

Whether this fast path is applicable can be established once the full
set of HDM decoders has been committed (in whatever order the guest
decides to commit them). As such, a check is performed on each commit /
uncommit of an HDM decoder to establish whether the alias should be added
or removed.
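
As a rough illustration of the check described above, the Python sketch
below re-evaluates fast-path eligibility after every decoder commit or
uncommit. It is a toy model, not the code in this series: the Decoder and
Path types, the on_decoder_change() hook and the alias_mapped flag are all
hypothetical names, and it assumes the alias is valid only when every
decoder on the path is committed with an interleave ways value of 1.

```python
from dataclasses import dataclass, field


@dataclass
class Decoder:
    """Toy HDM decoder: only the state needed for the eligibility check."""
    name: str
    ways: int = 1            # interleave ways programmed by the guest
    committed: bool = False


@dataclass
class Path:
    """Hypothetical model of the decoders between a window and one device."""
    decoders: list = field(default_factory=list)
    alias_mapped: bool = False   # stands in for the memory region alias

    def fast_path_ok(self) -> bool:
        # Eligible only once every decoder is committed and none interleaves.
        return bool(self.decoders) and all(
            d.committed and d.ways == 1 for d in self.decoders
        )

    def on_decoder_change(self) -> str:
        """Call after every commit or uncommit; returns the action taken."""
        ok = self.fast_path_ok()
        if ok and not self.alias_mapped:
            self.alias_mapped = True
            return "add alias"
        if not ok and self.alias_mapped:
            self.alias_mapped = False
            return "remove alias"
        return "no change"
```

Because the decision is recomputed from the full decoder state each time,
the order in which the guest commits the decoders does not matter in this
model: the alias appears on whichever commit completes the set.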


Performance numbers:

For a read/write test with a 4K block size, a 256M region, and 1 thread
running 100 iterations on TCG (results should be similar on KVM):

  - Non-interleaved region (fast path): 25-30 seconds.
  - Interleaved region (no fast path):  Did not finish within 10
    minutes.

Tested Topologies and Region Layouts
====================================

This series was validated across multiple CXL topology configurations,
covering single-device, multi-device, multi-host-bridge, and switched
fabrics. Region creation was exercised using the `cxl` userspace tool
with both non-interleaved and interleaved setups.

Decoder and memdev identifiers were discovered using:

  cxl list
  cxl list -D

Decoder IDs (e.g. decoder0.0) and memdev names (mem0, mem1) are
environment-specific. Commands below use placeholders such as
<decoder_span_both> which should be replaced with IDs from `cxl list -D`.

---------------------------------------------------------------------

Region Layout Notation
----------------------

CFMW (CXL Fixed Memory Window) is shown as a linear address space
containing regions:

  CFMW: [ R0 | R1 | R2 ]

R0, R1, R2 are regions created by `cxl create-region`.

Non-interleaved region:

  R0 (ways=1) -> entirely on one device (mem0 or mem1)
  Fast path: APPLICABLE

2-way interleaved region (g=256):

  R1 (ways=2, g=256) striped across devices:

    |mem0|mem1|mem0|mem1|mem0|mem1| ...
     256  256  256  256  256  256  bytes

  Fast path: NOT APPLICABLE
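
The striping shown above can be sketched with plain modulo arithmetic.
This is an illustrative Python model only, not the decode logic in hw/cxl:
the function name decode() is hypothetical, and it assumes simple modulo
interleave, ignoring any other interleave arithmetic a real decoder may
implement.

```python
def decode(offset: int, ways: int, granularity: int) -> tuple[int, int]:
    """Map an offset within a region to (target index, offset on target).

    Assumes simple modulo interleave: consecutive chunks of `granularity`
    bytes rotate across `ways` targets.
    """
    chunk = offset // granularity          # which chunk the offset falls in
    target = chunk % ways                  # which device owns that chunk
    # Chunks owned by one device are packed back to back on that device.
    dev_offset = (chunk // ways) * granularity + offset % granularity
    return target, dev_offset
```

With ways=1 the mapping is the identity, so one linear alias can cover the
whole region. With ways=2 and g=256, consecutive 256-byte chunks alternate
between targets, which is why a single alias cannot describe the region.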

---------------------------------------------------------------------

1) One device, one host bridge, one fixed window
------------------------------------------------

QEMU:

  -M q35,cxl=on,cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G
  -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12
  -device cxl-rp,id=rp0,bus=cxl.0,port=0,chassis=0,slot=2
  -object memory-backend-ram,id=mem0,size=512M,share=on
  -device cxl-type3,id=dev0,bus=rp0,memdev=mem0

Topology:

  Host
    |
    +-- CXL Host Bridge (cxl.0)
         |
         +-- Root Port (rp0)
              |
              +-- Type-3 (dev0, mem0)

Regions created:

  cxl create-region ... -w 1 ... mem0   (Fast path: YES)
  cxl create-region ... -w 1 ... mem0   (Fast path: YES)

Layout:

  CFMW: [ R0 | R1 ]

  R0 -> mem0  (Fast path: YES)
  R1 -> mem0  (Fast path: YES)

---------------------------------------------------------------------

2) One host bridge, two Type-3 devices (via two root ports)
------------------------------------------------------------

QEMU:

  -M q35,cxl=on,cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G
  -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12
  -device cxl-rp,id=rp0,bus=cxl.0,port=0,chassis=0,slot=2
  -device cxl-rp,id=rp1,bus=cxl.0,port=1,chassis=0,slot=3
  -object memory-backend-ram,id=mem0,size=512M,share=on
  -object memory-backend-ram,id=mem1,size=512M,share=on
  -device cxl-type3,id=dev0,bus=rp0,memdev=mem0
  -device cxl-type3,id=dev1,bus=rp1,memdev=mem1

Topology:

  Host
    |
    +-- CXL Host Bridge (cxl.0)
         |
         +-- Root Port (rp0) -- Type-3 (dev0, mem0)
         |
         +-- Root Port (rp1) -- Type-3 (dev1, mem1)

Region patterns exercised:

2.1 All non-interleaved:
  R0 -> mem0  (Fast path: YES)
  R1 -> mem0  (Fast path: YES)
  R2 -> mem1  (Fast path: YES)
  R3 -> mem1  (Fast path: YES)

2.2 Interleaved + local:
  R0 -> mem0/mem1 interleaved  (Fast path: NO)
  R1 -> mem0                   (Fast path: YES)

2.3 Local + interleaved + local:
  R0 -> mem0                   (Fast path: YES)
  R1 -> mem0/mem1 interleaved  (Fast path: NO)
  R2 -> mem1                   (Fast path: YES)
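
The YES/NO annotations in the patterns above follow a single rule, which
can be written down as a small Python sketch. The classify() helper and
the tuple layout are hypothetical and only restate the lists above; they
are not part of the series or of the `cxl` tool.

```python
def classify(regions):
    """Return {region name: "YES" or "NO"} for fast-path applicability.

    regions is a list of (name, targets) pairs, where targets lists the
    memdevs backing the region. A single target means no interleave.
    """
    return {name: "YES" if len(targets) == 1 else "NO"
            for name, targets in regions}


# Pattern 2.3 from the list above: local + interleaved + local.
pattern_2_3 = [
    ("R0", ["mem0"]),
    ("R1", ["mem0", "mem1"]),
    ("R2", ["mem1"]),
]
result = classify(pattern_2_3)
```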

---------------------------------------------------------------------

3) Two host bridges, one device per host bridge
------------------------------------------------

QEMU:

  -M q35,cxl=on,
     cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G,
     cxl-fmw.1.targets.0=cxl.1,cxl-fmw.1.size=4G,
     cxl-fmw.2.targets.0=cxl.0,cxl-fmw.2.size=4G
  -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12
  -device cxl-rp,id=rp0,bus=cxl.0,port=0,chassis=0,slot=2
  -object memory-backend-ram,id=mem0,size=512M,share=on
  -device cxl-type3,id=dev0,bus=rp0,memdev=mem0
  -device pxb-cxl,id=cxl.1,bus=pcie.0,bus_nr=13
  -device cxl-rp,id=rp1,bus=cxl.1,port=0,chassis=1,slot=2
  -object memory-backend-ram,id=mem1,size=512M,share=on
  -device cxl-type3,id=dev1,bus=rp1,memdev=mem1

Region patterns are identical to those in section 2, and fast-path
applicability follows the same per-region mapping (non-interleaved: YES,
interleaved: NO).

---------------------------------------------------------------------

4) Switch topology
------------------

QEMU:

  -M q35,cxl=on,cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G
  -device pxb-cxl,id=cxl.0,bus=pcie.0,bus_nr=12
  -device cxl-rp,id=rp0,bus=cxl.0,port=0,chassis=0,slot=2
  -device cxl-rp,id=rp1,bus=cxl.0,port=1,chassis=0,slot=3
  -device cxl-upstream,id=us0,bus=rp0
  -device cxl-downstream,id=ds0,bus=us0,port=0,chassis=0,slot=4
  -object memory-backend-ram,id=mem0,size=512M,share=on
  -device cxl-type3,id=dev0,bus=ds0,memdev=mem0

Topology (detailed):

  Host
    |
    +-- CXL Host Bridge (cxl.0)
         |
         +-- Root Port (rp0)
         |     |
         |     +-- CXL Switch (upstream us0)
         |           |
         |           +-- Downstream Port (ds0) -- Type-3 (mem0)
         |           |
         |           +-- Downstream Port (ds1) -- Type-3 (mem1) [optional]
         +-- Root Port (rp1)
               |
               +-- More devices/switches.

Fast-path interpretation in this topology:

  If only mem0 exists:
    All regions -> Fast path: YES

  If mem0 and mem1 exist:
    Non-interleaved regions -> Fast path: YES
    Interleaved regions     -> Fast path: NO

---------------------------------------------------------------------

Summary
-------

Across all topologies, region creation, enablement, and HDM decoder
commit/uncommit flows were exercised. The fast path is enabled only when
all decoders describe a non-interleaved mapping and is removed when any
interleave configuration is introduced.

Alireza Sanaee (3):
  hw/cxl: Use HPA in cxl_cfmws_find_device() rather than offset in
    window.
  hw/cxl: Allow cxl_cfmws_find_device() to filter on whether interleaved
    paths are accepted
  hw/cxl: Add a performant (and correct) path for the non interleaved
    cases

 hw/cxl/cxl-component-utils.c |   6 +
 hw/cxl/cxl-host.c            | 224 +++++++++++++++++++++++++++++++++--
 hw/mem/cxl_type3.c           |   4 +
 include/hw/cxl/cxl.h         |   1 +
 include/hw/cxl/cxl_device.h  |   4 +
 5 files changed, 230 insertions(+), 9 deletions(-)


base-commit: 6593154e7d65f61d8f9dbeb98224731b7137c53e
-- 
2.43.0




