* [RFC PATCH QEMU 0/3] cxl/plugins: Hotness Monitoring Unit with 'real' data.
@ 2025-01-24 17:29 Jonathan Cameron via
  2025-01-24 17:29 ` [RFC PATCH QEMU 1/3] hw/cxl: Initial CXL Hotness Monitoring Unit Emulation Jonathan Cameron via
                   ` (3 more replies)
  0 siblings, 4 replies; 13+ messages in thread
From: Jonathan Cameron via @ 2025-01-24 17:29 UTC (permalink / raw)
  To: fan.ni, linux-cxl, qemu-devel
  Cc: Alex Bennée, Alexandre Iooss, Mahmoud Mandour, Pierrick Bouvier,
      linuxarm, Niyas Sait

Hi All,

This is an RFC mainly to seek feedback on the approach used, particularly
the aspect of how to get data from a TCG plugin into a device model.
Two options that we have tried:
1. Socket over which the plugin sends data to an external server
   (as seen here)
2. Register and manage a plugin from within a device model

The external server approach keeps things loosely coupled, but at the cost
of separately maintaining that server, protocol definitions etc and some
overhead. The closely coupled solution is neater, but I suspect it might be
controversial (hence I didn't start with that :)

The code here is at best a PoC to illustrate what we have in mind.
It's not nice code at all, feature gaps, bugs and all! So whilst review is
always welcome I'm not requesting it for now.

Kernel support was posted a while back but was done against fake data
(still supported here if you don't provide the port parameter to the type3
device)
https://lore.kernel.org/linux-cxl/20241121101845.1815660-1-Jonathan.Cameron@huawei.com/
I'll post a minor update of that driver shortly to take into account a few
specification clarifications, but it should work with this even without
those.

Note there are some other patches on the tree I generated this from, so
this may not apply to upstream. Easiest is probably to test using
gitlab.com/jic23/qemu cxl-2025-01-24

Thanks to Niyas for his suggestions on how to make all this work!

Background
----------

What is the Compute eXpress Link Hotness Monitoring Unit and what is it for?
- In a tiered memory equipped server with the slow tier being attached via
  CXL, the expectation is that a given workload will benefit from putting
  data that is frequently fetched from memory in lower latency directly
  attached DRAM. Less frequently used data can be served from the CXL
  attached memory with no significant loss of performance. Any data that is
  hot enough to almost always be in cache doesn't matter as it is only
  fetched from memory occasionally.
- Working out which memory is best placed where is hard to do and in some
  workloads a dynamic problem. As such we need something we can measure to
  provide some indication of what data is in the wrong place. There are
  existing techniques to do this (page faulting, various CPU tracing
  systems, access bit scanning etc) but they all have significant overheads.
- Monitoring accesses on the CXL device provides a path to getting good
  data without those overheads. These units are known as CXL Hotness
  Monitoring Units or CHMUs. Loosely speaking they count accesses to
  granules of data (e.g. 4KiB pages). Exactly how they do that and where
  they sacrifice data accuracy is an implementation trade off.

Why do we need a model that gives real data?
- In general there is a need to develop software on top of these units to
  move data to the right place. That is hard to evaluate if we are making
  up the info on what is 'hot'.
- Need to allow for a bunch of 'impdef' solutions. Note that the CHMU in
  this patch set is an oracle - it has enough counters to count every
  access.
  That's not realistic, but it doesn't get me shouted at by our
  architecture teams for giving away any secrets. If we move forward with
  this, I'll probably implement a limited counter + full CAM solution
  (also unrealistic, but closer to real). I'd be very interested in
  contributions of other approaches (there are lots in the literature,
  under the term top-k).
- Resources will be constrained, so whilst a CHMU might in theory allow
  monitoring everything at once, that will come with a big accuracy cost.
  We need to design the algorithms that give us good data given those
  constraints.

So we need a solution to explore the design space and develop the software
to take advantage of this hardware (there are various LSF/MM proposals on
how to use this and other ways of tracking hotness).
https://lore.kernel.org/all/20250123105721.424117-1-raghavendra.kt@amd.com/
https://lore.kernel.org/lkml/Z4XUoWlU-UgRik18@gourry-fedora-PF4VCD3F/T/

QEMU plugins give us a way to do this. In particular the existing Cache
plugin can be easily modified to tell us what memory addresses missed at
the last level of emulated cache. We can then filter those for the memory
address range that maps to CXL and feed them to our counter implementation.
On the other side, each instance of the CXL type 3 device can connect to
this server and request hotness monitoring services + provide parameters
etc. Elements such as list threshold management and overflow detection etc
are in the CXL HMU QEMU device model. (A rough sketch of the wire protocol
between the plugin, the server and the device model is given at the end of
this mail.) As noted above, we have an alternative approach that closely
couples things, so the device model registers the plugin directly and there
is no server.

How to use it!
--------------

It runs a little slow, but it runs and generates somewhat plausible
outputs. I'd definitely suggest running it with the pass through
optimization patch on the CXL staging tree (and a single direct connected
device). Your mileage will vary if you try to use other parameters, or
hotness units beyond the first one (implementation far from complete!)

To run, start the server in contrib/hmu/ providing a port number to listen
on.

./chmu 4443

Then launch QEMU with something like the following.

qemu-system-aarch64 -icount shift=1 \
 -plugin ../qemu/bin/native/contrib/plugins/libcache.so,port=4443,missfilterbase=1099511627776,missfiltersize=1099511627776,dcachesize=8192,dassoc=4,dblksize=64,icachesize=8192,iassoc=4,iblksize=64,l2cachesize=32768,l2assoc=16,l2blksize=64 \
 -M virt,ras=on,nvdimm=on,gic-version=3,cxl=on,hmat=on -m 4g,maxmem=8g,slots=4 -cpu max -smp 4 \
 -kernel Image \
 -drive if=none,file=full.qcow2,format=qcow2,id=hd \
 -device pcie-root-port,id=root_port1 \
 -device virtio-blk-pci,drive=hd,x-max-bounce-buffer-size=512k \
 -nographic -no-reboot -append 'earlycon memblock=debug root=/dev/vda2 fsck.mode=skip maxcpus=4 tp_printk' \
 -monitor telnet:127.0.0.1:1234,server,nowait -bios QEMU_EFI.fd \
 -object memory-backend-ram,size=4G,id=mem0 \
 -object memory-backend-file,id=cxl-mem1,share=on,mem-path=/tmp/t3_cxl1.raw,size=1G,align=256M \
 -object memory-backend-file,id=cxl-mem2,share=on,mem-path=/tmp/t3_cxl2.raw,size=1G,align=256M \
 -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/t3_lsa1.raw,size=1M,align=1M \
 -object memory-backend-file,id=cxl-mem3,share=on,mem-path=/tmp/t3_cxl3.raw,size=1G,align=256M \
 -object memory-backend-file,id=cxl-mem4,share=on,mem-path=/tmp/t3_cxl4.raw,size=1G,align=256M \
 -object memory-backend-file,id=cxl-lsa2,share=on,mem-path=/tmp/t3_lsa2.raw,size=1M,align=1M \
 -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1,hdm_for_passthrough=true,numa_node=0 \
 -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \
 -device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,id=cxl-pmem1,lsa=cxl-lsa1,sn=3,x-speed=32,x-width=16,chmu-port=4443 \
 -machine cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=8G,cxl-fmw.0.interleave-granularity=1k \
 -numa node,nodeid=0,cpus=0-3,memdev=mem0 \
 -numa node,nodeid=1 \
 -object acpi-generic-initiator,id=bob2,pci-dev=bob,node=1 \
 -numa node,nodeid=2 \
 -object acpi-generic-port,id=bob11,pci-bus=cxl.1,node=2

In the guest, create and bind the region - this brings up the CXL memory
device so accesses go to the memory.

cd /sys/bus/cxl/devices/decoder0.0/
cat create_ram_region
echo region0 > create_ram_region
echo ram > /sys/bus/cxl/devices/decoder2.0/mode
echo ram > /sys/bus/cxl/devices/decoder3.0/mode
echo $((256 << 21)) > /sys/bus/cxl/devices/decoder2.0/dpa_size
cd /sys/bus/cxl/devices/region0/
echo 256 > interleave_granularity
echo 1 > interleave_ways
echo $((256 << 21)) > size
echo decoder2.0 > target0
echo 1 > commit
echo region0 > /sys/bus/cxl/drivers/cxl_region/bind

Finally start perf with something like:

./perf record -a -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\
hotness_threshold=635,epoch_multiplier=4,epoch_scale=4,\
range_base=0,range_size=4096/ ./stress.sh

where stress.sh is

sleep 2
numactl --membind 3 stress-ng --vm 1 --vm-bytes 1M --vm-keep -t 5s
sleep 2

See the results with
./perf report --dump-raw-trace | grep -A 200 HMU

Enjoy and have a good weekend.

Thanks,

Jonathan

Jonathan Cameron (3):
  hw/cxl: Initial CXL Hotness Monitoring Unit Emulation
  plugins: Add cache miss reporting over a socket.
  contrib: Add example hotness monitoring unit server

 include/hw/cxl/cxl.h        |   1 +
 include/hw/cxl/cxl_chmu.h   | 154 ++++++++++++
 include/hw/cxl/cxl_device.h |  13 +-
 include/hw/cxl/cxl_pci.h    |   7 +-
 contrib/hmu/hmu.c           | 312 ++++++++++++++++++++++++
 contrib/plugins/cache.c     |  75 +++++-
 hw/cxl/cxl-chmu.c           | 459 ++++++++++++++++++++++++++++++++++++
 hw/mem/cxl_type3.c          |  25 +-
 hw/cxl/meson.build          |   1 +
 9 files changed, 1035 insertions(+), 12 deletions(-)
 create mode 100644 include/hw/cxl/cxl_chmu.h
 create mode 100644 contrib/hmu/hmu.c
 create mode 100644 hw/cxl/cxl-chmu.c

--
2.43.0

^ permalink raw reply	[flat|nested] 13+ messages in thread
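For reference, the socket protocol that ties the pieces of this series
together (cache plugin -> standalone server <- CXL type 3 device model) can
be sketched as below. The struct name is purely illustrative - the code in
patches 2 and 3 just sends bare uint64_t values (host endian) over a
loopback TCP connection - but the hello values, the request framing and the
command set match the code as posted.

/*
 * On connect, each client sends a single uint64_t hello value so the
 * server knows what it is talking to:
 *   42 - provider (the cache plugin), which then streams one uint64_t
 *        physical address per reported cache miss.
 *   41 - consumer (the CXL type 3 device / CHMU model), which then issues
 *        request/response transactions.
 */
#include <stdint.h>

enum chmu_consumer_request {        /* matches contrib/hmu/hmu.c */
    QUERY_TAIL,
    QUERY_HEAD,
    SET_HEAD,
    SET_HOTLIST_SIZE,
    QUERY_HOTLIST_ENTRY,
    SIGNAL_EPOCH_END,
    SET_ENABLED,
    SET_NUMBER_GRANUALS,
    SET_HPA_BASE,
    SET_HPA_SIZE,
};

/* A consumer request is three uint64_t values on the wire... */
struct chmu_request {               /* illustrative name only */
    uint64_t instance;              /* CHMU instance within the block */
    uint64_t command;               /* enum chmu_consumer_request */
    uint64_t param;                 /* command specific parameter */
};
/* ...and every request is answered with a single uint64_t reply. */
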
* [RFC PATCH QEMU 1/3] hw/cxl: Initial CXL Hotness Monitoring Unit Emulation 2025-01-24 17:29 [RFC PATCH QEMU 0/3] cxl/plugins: Hotness Monitoring Unit with 'real' data Jonathan Cameron via @ 2025-01-24 17:29 ` Jonathan Cameron via 2025-01-24 17:29 ` [RFC PATCH QEMU 2/3] plugins: Add cache miss reporting over a socket Jonathan Cameron via ` (2 subsequent siblings) 3 siblings, 0 replies; 13+ messages in thread From: Jonathan Cameron via @ 2025-01-24 17:29 UTC (permalink / raw) To: fan.ni, linux-cxl, qemu-devel Cc: Alex Bennée, Alexandre Iooss, Mahmoud Mandour, Pierrick Bouvier, linuxarm, Niyas Sait Intended to support enabling in kernel. For now this is dumb and the data made up. That will change in the near future. Instantiates 3 instances within one CHMU with separate interrupts. Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> --- include/hw/cxl/cxl.h | 1 + include/hw/cxl/cxl_chmu.h | 154 ++++++++++++ include/hw/cxl/cxl_device.h | 13 +- include/hw/cxl/cxl_pci.h | 7 +- hw/cxl/cxl-chmu.c | 459 ++++++++++++++++++++++++++++++++++++ hw/mem/cxl_type3.c | 25 +- hw/cxl/meson.build | 1 + 7 files changed, 655 insertions(+), 5 deletions(-) diff --git a/include/hw/cxl/cxl.h b/include/hw/cxl/cxl.h index 857fa61898..bef856485f 100644 --- a/include/hw/cxl/cxl.h +++ b/include/hw/cxl/cxl.h @@ -16,6 +16,7 @@ #include "hw/pci/pci_host.h" #include "cxl_pci.h" #include "cxl_component.h" +#include "cxl_chmu.h" #include "cxl_cpmu.h" #include "cxl_device.h" diff --git a/include/hw/cxl/cxl_chmu.h b/include/hw/cxl/cxl_chmu.h new file mode 100644 index 0000000000..2de04ea605 --- /dev/null +++ b/include/hw/cxl/cxl_chmu.h @@ -0,0 +1,154 @@ +/* + * QEMU CXL Hotness Monitoring Unit + * + * Copyright (c) 2024 Huawei + * + * This work is licensed under the terms of the GNU GPL, version 2. See the + * COPYING file in the top-level directory. + */ + +#include "hw/register.h" + +#ifndef _CXL_CHMU_H_ +#define _CXL_CHMU_H_ + +/* Emulated parameters - arbitrary choices */ +#define CXL_CHMU_INSTANCES_PER_BLOCK 3 +#define CXL_HOTLIST_ENTRIES 1024 + /* 1TB - should be enough for anyone, right? */ +#define CXL_MAX_DRAM_CAPACITY 0x10000000000UL + +/* In instance address space */ +#define CXL_CHMU_HL_START (0x70 + (CXL_MAX_DRAM_CAPACITY / (0x10000000UL * 8))) +#define CXL_CHMU_INSTANCE_SIZE (CXL_CHMU_HL_START + CXL_HOTLIST_ENTRIES * 8) +#define CXL_CHMU_SIZE \ + (0x10 + CXL_CHMU_INSTANCE_SIZE * CXL_CHMU_INSTANCES_PER_BLOCK) + +/* + * Many of these registers are documented as being a multiple of 64 bits long. + * Reading then can only be done in 64 bit chunks though so specify them here + * as multiple registers. 
+ */ +REG64(CXL_CHMU_COMMON_CAP0, 0x0) + FIELD(CXL_CHMU_COMMON_CAP0, VERSION, 0, 4) + FIELD(CXL_CHMU_COMMON_CAP0, NUM_INSTANCES, 8, 8) +REG64(CXL_CHMU_COMMON_CAP1, 0x8) + FIELD(CXL_CHMU_COMMON_CAP1, INSTANCE_LENGTH, 0, 16) + +/* Per instance registers for instance 0 in CHMU main address space */ +REG64(CXL_CHMU0_CAP0, 0x10) + FIELD(CXL_CHMU0_CAP0, MSI_N, 0, 4) + FIELD(CXL_CHMU0_CAP0, OVERFLOW_INT, 4, 1) + FIELD(CXL_CHMU0_CAP0, LEVEL_INT, 5, 1) + FIELD(CXL_CHMU0_CAP0, EPOCH_TYPE, 6, 2) +#define CXL_CHMU0_CAP0_EPOCH_TYPE_GLOBAL 0 +#define CXL_CHMU0_CAP0_EPOCH_TYPE_PERCNT 1 + /* Break up the Tracked M2S Request field into flags */ + FIELD(CXL_CHMU0_CAP0, TRACKED_M2S_REQ_NONTEE_R, 8, 1) + FIELD(CXL_CHMU0_CAP0, TRACKED_M2S_REQ_NONTEE_W, 9, 1) + FIELD(CXL_CHMU0_CAP0, TRACKED_M2S_REQ_NONTEE_RW, 10, 1) + FIELD(CXL_CHMU0_CAP0, TRACKED_M2S_REQ_ALL_R, 11, 1) + FIELD(CXL_CHMU0_CAP0, TRACKED_M2S_REQ_ALL_W, 12, 1) + FIELD(CXL_CHMU0_CAP0, TRACKED_M2S_REQ_ALL_RW, 13, 1) + + FIELD(CXL_CHMU0_CAP0, MAX_EPOCH_LENGTH_SCALE, 16, 4) +#define CXL_CHMU_EPOCH_LENGTH_SCALE_100USEC 1 +#define CXL_CHMU_EPOCH_LENGTH_SCALE_1MSEC 2 +#define CXL_CHMU_EPOCH_LENGTH_SCALE_10MSEC 3 +#define CXL_CHMU_EPOCH_LENGTH_SCALE_100MSEC 4 +#define CXL_CHMU_EPOCH_LENGTH_SCALE_1SEC 5 + FIELD(CXL_CHMU0_CAP0, MAX_EPOCH_LENGTH_VAL, 20, 12) + FIELD(CXL_CHMU0_CAP0, MIN_EPOCH_LENGTH_SCALE, 32, 4) + FIELD(CXL_CHMU0_CAP0, MIN_EPOCH_LENGTH_VAL, 36, 12) + FIELD(CXL_CHMU0_CAP0, HOTLIST_SIZE, 48, 16) +REG64(CXL_CHMU0_CAP1, 0x18) + FIELD(CXL_CHMU0_CAP1, UNIT_SIZES, 0, 32) + FIELD(CXL_CHMU0_CAP1, DOWN_SAMPLING_FACTORS, 32, 16) + /* Split up Flags */ + FIELD(CXL_CHMU0_CAP1, FLAGS_EPOCH_BASED, 48, 1) + FIELD(CXL_CHMU0_CAP1, FLAGS_ALWAYS_ON, 49, 1) + FIELD(CXL_CHMU0_CAP1, FLAGS_RANDOMIZED_DOWN_SAMPLING, 50, 1) + FIELD(CXL_CHMU0_CAP1, FLAGS_OVERLAPPING_ADDRESS_RANGES, 51, 1) + FIELD(CXL_CHMU0_CAP1, FLAGS_INSERT_AFTER_CLEAR, 52, 1) +REG64(CXL_CHMU0_CAP2, 0x20) + FIELD(CXL_CHMU0_CAP2, BITMAP_REG_OFFSET, 0, 64) +REG64(CXL_CHMU0_CAP3, 0x28) + FIELD(CXL_CHMU0_CAP3, HOTLIST_REG_OFFSET, 0, 64) + +REG64(CXL_CHMU0_CONF0, 0x50) + FIELD(CXL_CHMU0_CONF0, M2S_REQ_TO_TRACK, 0, 8) + FIELD(CXL_CHMU0_CONF0, FLAGS_RANDOMIZE_DOWNSAMPLING, 8, 1) + FIELD(CXL_CHMU0_CONF0, FLAGS_INT_ON_OVERFLOW, 9, 1) + FIELD(CXL_CHMU0_CONF0, FLAGS_INT_ON_FILL_THRESH, 10, 1) + FIELD(CXL_CHMU0_CONF0, CONTROL_ENABLE, 16, 1) + FIELD(CXL_CHMU0_CONF0, CONTROL_RESET, 17, 1) + FIELD(CXL_CHMU0_CONF0, HOTNESS_THRESHOLD, 32, 32) +REG64(CXL_CHMU0_CONF1, 0x58) + FIELD(CXL_CHMU0_CONF1, UNIT_SIZE, 0, 32) + FIELD(CXL_CHMU0_CONF1, DOWN_SAMPLING_FACTOR, 32, 8) + FIELD(CXL_CHMU0_CONF1, REPORTING_MODE, 40, 8) + FIELD(CXL_CHMU0_CONF1, EPOCH_LENGTH_SCALE, 48, 4) + FIELD(CXL_CHMU0_CONF1, EPOCH_LENGTH_VAL, 52, 12) +REG64(CXL_CHMU0_CONF2, 0x60) + FIELD(CXL_CHMU0_CONF2, NOTIFICATION_THRESHOLD, 0, 16) + +REG64(CXL_CHMU0_STATUS, 0x70) + /* Break up status field into separate flags */ + FIELD(CXL_CHMU0_STATUS, STATUS_ENABLED, 0, 1) + FIELD(CXL_CHMU0_STATUS, OPERATION_IN_PROG, 16, 16) + FIELD(CXL_CHMU0_STATUS, COUNTER_WIDTH, 32, 8) + /* Break up oddly name overflow interrupt stats */ + FIELD(CXL_CHMU0_STATUS, OVERFLOW_INT, 40, 1) + FIELD(CXL_CHMU0_STATUS, LEVEL_INT, 41, 1) + +REG16(CXL_CHMU0_HEAD, 0x78) +REG16(CXL_CHMU0_TAIL, 0x7A) + +/* Provide first few of these so we can calculate the size */ +REG64(CXL_CHMU0_RANGE_CONFIG_BITMAP0, 0x80) +REG64(CXL_CHMU0_RANGE_CONFIG_BITMAP1, 0x88) + +REG64(CXL_CHMU0_HOTLIST0, CXL_CHMU_HL_START + 0x10) +REG64(CXL_CHMU0_HOTLIST1, CXL_CHMU_HL_START + 0x10) + +REG64(CXL_CHMU1_CAP0, 0x10 + 
CXL_CHMU_INSTANCE_SIZE) + +typedef struct CHMUState CHMUState; + +typedef struct CHMUInstance { + Object *private; + uint32_t hotness_thresh; + uint32_t unit_size; + uint8_t ds_factor; + uint16_t head, tail, fillthresh, op_in_prog; + uint8_t what; + + bool int_on_overflow; + bool int_on_fill_thresh; + bool overflow_set; + bool fill_thresh_set; + uint8_t msi_n; + + bool enabled; + uint64_t hotlist[CXL_HOTLIST_ENTRIES]; + QEMUTimer *timer; + uint32_t epoch_ms; + /* Hack for now */ + CHMUState *parent; +} CHMUInstance; + +typedef struct CHMUState { + CHMUInstance inst[CXL_CHMU_INSTANCES_PER_BLOCK]; + int socket; + /* Hack updated on first HDM decoder only */ + uint64_t base; + uint64_t size; + uint16_t port; +} CHMUState; +typedef struct cxl_device_state CXLDeviceState; +int cxl_chmu_register_block_init(Object *obj, + CXLDeviceState *cxl_dstte, + int id, uint8_t msi_n, + Error **errp); + +#endif /* _CXL_CHMU_H_ */ diff --git a/include/hw/cxl/cxl_device.h b/include/hw/cxl/cxl_device.h index 04c93cd753..f855cd69d9 100644 --- a/include/hw/cxl/cxl_device.h +++ b/include/hw/cxl/cxl_device.h @@ -15,6 +15,7 @@ #include "hw/register.h" #include "hw/cxl/cxl_events.h" +#include "hw/cxl/cxl_chmu.h" #include "hw/cxl/cxl_cpmu.h" /* * The following is how a CXL device's Memory Device registers are laid out. @@ -109,12 +110,20 @@ (x) * (1 << 16), \ 1 << 16) +#define CXL_NUM_CHMU_INSTANCES 1 +#define CXL_CHMU_OFFSET(x) \ + QEMU_ALIGN_UP(CXL_MEMORY_DEVICE_REGISTERS_OFFSET + \ + CXL_MEMORY_DEVICE_REGISTERS_LENGTH + \ + (1 << 16) * CXL_NUM_CPMU_INSTANCES, \ + 1 << 16) + #define CXL_MMIO_SIZE \ QEMU_ALIGN_UP(CXL_DEVICE_CAP_REG_SIZE + \ CXL_DEVICE_STATUS_REGISTERS_LENGTH + \ CXL_MAILBOX_REGISTERS_LENGTH + \ CXL_MEMORY_DEVICE_REGISTERS_LENGTH + \ - CXL_NUM_CPMU_INSTANCES * (1 << 16), \ + CXL_NUM_CPMU_INSTANCES * (1 << 16) + \ + CXL_NUM_CHMU_INSTANCES * (1 << 16), \ (1 << 16)) /* CXL r3.1 Table 8-34: Command Return Codes */ @@ -231,6 +240,7 @@ typedef struct CXLCCI { typedef struct cxl_device_state { MemoryRegion device_registers; MemoryRegion cpmu_registers[CXL_NUM_CPMU_INSTANCES]; + MemoryRegion chmu_registers[1]; /* CXL r3.1 Section 8.2.8.3: Device Status Registers */ struct { MemoryRegion device; @@ -280,6 +290,7 @@ typedef struct cxl_device_state { const struct cxl_cmd (*cxl_cmd_set)[256]; CPMUState cpmu[CXL_NUM_CPMU_INSTANCES]; + CHMUState chmu[1]; CXLEventLog event_logs[CXL_EVENT_TYPE_MAX]; } CXLDeviceState; diff --git a/include/hw/cxl/cxl_pci.h b/include/hw/cxl/cxl_pci.h index c54ed54a25..88a5e3958e 100644 --- a/include/hw/cxl/cxl_pci.h +++ b/include/hw/cxl/cxl_pci.h @@ -32,7 +32,7 @@ #define PCIE_CXL3_FLEXBUS_PORT_DVSEC_LENGTH 0x20 #define PCIE_CXL3_FLEXBUS_PORT_DVSEC_REVID 2 -#define REG_LOC_DVSEC_LENGTH 0x2c +#define REG_LOC_DVSEC_LENGTH 0x34 #define REG_LOC_DVSEC_REVID 0 enum { @@ -172,9 +172,9 @@ typedef struct CXLDVSECRegisterLocator { struct { uint32_t lo; uint32_t hi; - } reg_base[4]; + } reg_base[5]; } QEMU_PACKED CXLDVSECRegisterLocator; -QEMU_BUILD_BUG_ON(sizeof(CXLDVSECRegisterLocator) != 0x2C); +QEMU_BUILD_BUG_ON(sizeof(CXLDVSECRegisterLocator) != 0x34); /* BAR Equivalence Indicator */ #define BEI_BAR_10H 0 @@ -190,5 +190,6 @@ QEMU_BUILD_BUG_ON(sizeof(CXLDVSECRegisterLocator) != 0x2C); #define RBI_BAR_VIRT_ACL (2 << 8) #define RBI_CXL_DEVICE_REG (3 << 8) #define RBI_CXL_CPMU_REG (4 << 8) +#define RBI_CXL_CHMU_REG (5 << 8) #endif diff --git a/hw/cxl/cxl-chmu.c b/hw/cxl/cxl-chmu.c new file mode 100644 index 0000000000..5922d78ffc --- /dev/null +++ b/hw/cxl/cxl-chmu.c @@ -0,0 +1,459 @@ +/* + * CXL 
Hotness Monitoring Unit + * + * Copyright(C) 2024 Huawei + * + * This work is licensed under the terms of the GNU GPL, version 2. See the + * COPYING file in the top-level directory. + */ + +#include "qemu/osdep.h" +#include "qemu/log.h" +#include "qemu/guest-random.h" +#include "hw/cxl/cxl.h" +#include "hw/cxl/cxl_chmu.h" + +#include "hw/pci/msi.h" +#include "hw/pci/msix.h" + +#define CHMU_HOTLIST_LENGTH 1024 + +enum chmu_consumer_request { + QUERY_TAIL, + QUERY_HEAD, + SET_HEAD, + SET_HOTLIST_SIZE, + QUERY_HOTLIST_ENTRY, + SIGNAL_EPOCH_END, + SET_ENABLED, + SET_NUMBER_GRANUALS, + SET_HPA_BASE, + SET_HPA_SIZE, +}; + +static int chmu_send(CHMUState *chmu, uint64_t instance, + enum chmu_consumer_request command, + uint64_t param, uint64_t *response) +{ + uint64_t request[3] = { instance, command, param }; + uint64_t temp; + uint64_t *reply = response ?: &temp; + int rc; + + send(chmu->socket, request, sizeof(request), 0); + rc = recv(chmu->socket, reply, sizeof(*reply), 0); + if (rc < sizeof(reply)) { + return -1; + } + return 0; +} + +static uint64_t chmu_read(void *opaque, hwaddr offset, unsigned size) +{ + CHMUState *chmu = opaque; + CHMUInstance *chmui; + uint64_t val = 0; + hwaddr chmu_stride = A_CXL_CHMU1_CAP0 - A_CXL_CHMU0_CAP0; + int instance = 0; + int rc; + + if (offset >= A_CXL_CHMU0_CAP0) { + instance = (offset - A_CXL_CHMU0_CAP0) / chmu_stride; + /* + * Offset allows register defs for CHMU instance 0 to be used + * for all instances. Includes common cap. + */ + offset -= chmu_stride * instance; + } + + if (instance >= CXL_CHMU_INSTANCES_PER_BLOCK) { + return 0; + } + + chmui = &chmu->inst[instance]; + switch (offset) { + case A_CXL_CHMU_COMMON_CAP0: + val = FIELD_DP64(val, CXL_CHMU_COMMON_CAP0, VERSION, 1); + val = FIELD_DP64(val, CXL_CHMU_COMMON_CAP0, NUM_INSTANCES, + CXL_CHMU_INSTANCES_PER_BLOCK); + break; + case A_CXL_CHMU_COMMON_CAP1: + val = FIELD_DP64(val, CXL_CHMU_COMMON_CAP1, INSTANCE_LENGTH, + A_CXL_CHMU1_CAP0 - A_CXL_CHMU0_CAP0); + break; + case A_CXL_CHMU0_CAP0: + val = FIELD_DP64(val, CXL_CHMU0_CAP0, MSI_N, chmui->msi_n); + val = FIELD_DP64(val, CXL_CHMU0_CAP0, OVERFLOW_INT, 1); + val = FIELD_DP64(val, CXL_CHMU0_CAP0, LEVEL_INT, 1); + val = FIELD_DP64(val, CXL_CHMU0_CAP0, EPOCH_TYPE, + CXL_CHMU0_CAP0_EPOCH_TYPE_GLOBAL); + val = FIELD_DP64(val, CXL_CHMU0_CAP0, TRACKED_M2S_REQ_NONTEE_R, 1); + val = FIELD_DP64(val, CXL_CHMU0_CAP0, TRACKED_M2S_REQ_NONTEE_W, 1); + val = FIELD_DP64(val, CXL_CHMU0_CAP0, TRACKED_M2S_REQ_NONTEE_RW, 1); + /* No emulation of TEE modes yet so don't pretend to support them */ + val = FIELD_DP64(val, CXL_CHMU0_CAP0, MAX_EPOCH_LENGTH_SCALE, + CXL_CHMU_EPOCH_LENGTH_SCALE_1SEC); + val = FIELD_DP64(val, CXL_CHMU0_CAP0, MAX_EPOCH_LENGTH_VAL, 100); + val = FIELD_DP64(val, CXL_CHMU0_CAP0, MIN_EPOCH_LENGTH_SCALE, + CXL_CHMU_EPOCH_LENGTH_SCALE_100MSEC); + val = FIELD_DP64(val, CXL_CHMU0_CAP0, MIN_EPOCH_LENGTH_VAL, 1); + val = FIELD_DP64(val, CXL_CHMU0_CAP0, HOTLIST_SIZE, + CXL_HOTLIST_ENTRIES); + break; + case A_CXL_CHMU0_CAP1: + /* 4KiB and 8KiB only */ + val = FIELD_DP64(val, CXL_CHMU0_CAP1, UNIT_SIZES, BIT(4) | BIT(5)); + /* Only support downsamp by 32 */ + val = FIELD_DP64(val, CXL_CHMU0_CAP1, DOWN_SAMPLING_FACTORS, BIT(5)); + val = FIELD_DP64(val, CXL_CHMU0_CAP1, FLAGS_EPOCH_BASED, 1); + val = FIELD_DP64(val, CXL_CHMU0_CAP1, FLAGS_ALWAYS_ON, 0); + val = FIELD_DP64(val, CXL_CHMU0_CAP1, FLAGS_RANDOMIZED_DOWN_SAMPLING, + 1); + val = FIELD_DP64(val, CXL_CHMU0_CAP1, FLAGS_OVERLAPPING_ADDRESS_RANGES, + 1); + val = FIELD_DP64(val, CXL_CHMU0_CAP1, 
FLAGS_INSERT_AFTER_CLEAR, 0); + break; + case A_CXL_CHMU0_CAP2: + val = FIELD_DP64(val, CXL_CHMU0_CAP2, BITMAP_REG_OFFSET, + A_CXL_CHMU0_RANGE_CONFIG_BITMAP0 - A_CXL_CHMU0_CAP0); + break; + case A_CXL_CHMU0_CAP3: + val = FIELD_DP64(val, CXL_CHMU0_CAP3, HOTLIST_REG_OFFSET, + A_CXL_CHMU0_HOTLIST0 - A_CXL_CHMU0_CAP0); + break; + case A_CXL_CHMU0_STATUS: + val = FIELD_DP64(val, CXL_CHMU0_STATUS, STATUS_ENABLED, + chmui->enabled ? 1 : 0); + val = FIELD_DP64(val, CXL_CHMU0_STATUS, OPERATION_IN_PROG, + chmui->op_in_prog); + val = FIELD_DP64(val, CXL_CHMU0_STATUS, COUNTER_WIDTH, 16); + val = FIELD_DP64(val, CXL_CHMU0_STATUS, OVERFLOW_INT, + chmui->overflow_set ? 1 : 0); + val = FIELD_DP64(val, CXL_CHMU0_STATUS, LEVEL_INT, + chmui->fill_thresh_set ? 1 : 0); + break; + case A_CXL_CHMU0_TAIL: + if (chmu->socket) { + rc = chmu_send(chmu, instance, QUERY_TAIL, 0, &val); + if (rc < 0) { + printf("Failed to read tail\n"); + return 0; + } + } else { + val = chmui->tail; + } + break; + case A_CXL_CHMU0_HEAD: + if (chmu->socket) { + rc = chmu_send(chmu, instance, QUERY_HEAD, 0, &val); + if (rc < 0) { + printf("Failed to read head\n"); + return 0; + } + } else { + val = chmui->head; + } + break; + case A_CXL_CHMU0_HOTLIST0...(8 * (A_CXL_CHMU0_HOTLIST0 + + CHMU_HOTLIST_LENGTH)): + if (chmu->socket) { + rc = chmu_send(chmu, instance, QUERY_HOTLIST_ENTRY, + (offset - A_CXL_CHMU0_HOTLIST0) / 8, &val); + if (rc < 0) { + printf("Failed to read a hotlist entry\n"); + return 0; + } + } else { + val = chmui->hotlist[(offset - A_CXL_CHMU0_HOTLIST0) / 8]; + } + break; + } + return val; +} + +static void chmu_write(void *opaque, hwaddr offset, uint64_t value, + unsigned size) +{ + CHMUState *chmu = opaque; + CHMUInstance *chmui; + hwaddr chmu_stride = A_CXL_CHMU1_CAP0 - A_CXL_CHMU0_CAP0; + int instance = 0; + int i, rc; + + if (offset >= A_CXL_CHMU0_CAP0) { + instance = (offset - A_CXL_CHMU0_CAP0) / chmu_stride; + /* offset as if in chmu0 so includes the common caps */ + offset -= chmu_stride * instance; + } + if (instance >= CXL_CHMU_INSTANCES_PER_BLOCK) { + return; + } + + chmui = &chmu->inst[instance]; + + switch (offset) { + case A_CXL_CHMU0_STATUS: + /* The interrupt fields are RW12C */ + if (FIELD_EX64(value, CXL_CHMU0_STATUS, OVERFLOW_INT)) { + chmui->overflow_set = false; + } + if (FIELD_EX64(value, CXL_CHMU0_STATUS, LEVEL_INT)) { + chmui->fill_thresh_set = false; + } + break; + case A_CXL_CHMU0_RANGE_CONFIG_BITMAP0...(A_CXL_CHMU0_HOTLIST0 - 8): + /* TODO - wire this up */ + printf("Bitmap write %lx %lx\n", + offset - A_CXL_CHMU0_RANGE_CONFIG_BITMAP0, value); + break; + case A_CXL_CHMU0_CONF0: + if (FIELD_EX64(value, CXL_CHMU0_CONF0, CONTROL_ENABLE)) { + chmui->enabled = true; + timer_mod(chmui->timer, + qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + chmui->epoch_ms); + } else { + timer_del(chmui->timer); + chmui->enabled = false; + } + if (chmu->socket) { + bool enabled = FIELD_EX64(value, CXL_CHMU0_CONF0, CONTROL_ENABLE); + + if (enabled) { + rc = chmu_send(chmu, instance, SET_HPA_BASE, chmu->base, NULL); + if (rc < 0) { + printf("Failed to set base\n"); + } + rc = chmu_send(chmu, instance, SET_HPA_SIZE, chmu->size, NULL); + if (rc < 0) { + printf("Failed to set size\n"); + } + } + rc = chmu_send(chmu, instance, SET_ENABLED, enabled ? 
1 : 0, NULL); + if (rc < 0) { + printf("Failed to set enabled\n"); + } + } + + if (FIELD_EX64(value, CXL_CHMU0_CONF0, CONTROL_RESET)) { + /* TODO reset counters once implemented */ + chmui->head = 0; + chmui->tail = 0; + for (i = 0; i < CXL_HOTLIST_ENTRIES; i++) { + chmui->hotlist[i] = 0; + } + } + chmui->what = + FIELD_EX64(value, CXL_CHMU0_CONF0, M2S_REQ_TO_TRACK); + chmui->int_on_overflow = + FIELD_EX64(value, CXL_CHMU0_CONF0, FLAGS_INT_ON_OVERFLOW); + chmui->int_on_fill_thresh = + FIELD_EX64(value, CXL_CHMU0_CONF0, FLAGS_INT_ON_FILL_THRESH); + chmui->hotness_thresh = + FIELD_EX64(value, CXL_CHMU0_CONF0, HOTNESS_THRESHOLD); + break; + case A_CXL_CHMU0_CONF1: { + uint8_t scale; + uint32_t mult; + + chmui->unit_size = FIELD_EX64(value, CXL_CHMU0_CONF1, UNIT_SIZE); + chmui->ds_factor = + FIELD_EX64(value, CXL_CHMU0_CONF1, DOWN_SAMPLING_FACTOR); + + /* TODO: Sanity check value in supported range */ + scale = FIELD_EX64(value, CXL_CHMU0_CONF1, EPOCH_LENGTH_SCALE); + mult = FIELD_EX64(value, CXL_CHMU0_CONF1, EPOCH_LENGTH_VAL); + switch (scale) { + /* TODO: Implement maths, not lookup */ + case 1: /* 100usec */ + chmui->epoch_ms = mult / 10; + break; + case 2: + chmui->epoch_ms = mult; + break; + case 3: + chmui->epoch_ms = mult * 10; + break; + case 4: + chmui->epoch_ms = mult * 100; + break; + case 5: + chmui->epoch_ms = mult * 1000; + break; + default: + /* Unknown value so ignore */ + break; + } + break; + } + case A_CXL_CHMU0_CONF2: + chmui->fillthresh = FIELD_EX64(value, CXL_CHMU0_CONF2, + NOTIFICATION_THRESHOLD); + break; + case A_CXL_CHMU0_HEAD: + chmui->head = value; + if (chmu->socket) { + rc = chmu_send(chmu, instance, SET_HEAD, value, NULL); + if (rc < 0) { + printf("Failed to set head\n"); + } + } + break; + case A_CXL_CHMU0_TAIL: /* Not sure why this is writeable! */ + chmui->tail = value; + break; + } +} + +static const MemoryRegionOps chmu_ops = { + .read = chmu_read, + .write = chmu_write, + .endianness = DEVICE_LITTLE_ENDIAN, + .valid = { + .min_access_size = 1, + .max_access_size = 8, + .unaligned = false, + }, + .impl = { + .min_access_size = 4, + .max_access_size = 8, + }, +}; + +static void chmu_timer_update(void *opaque) +{ + CHMUInstance *chmui = opaque; + PCIDevice *pdev = PCI_DEVICE(chmui->private); + int i; +#define entries_to_add 167 + bool interrupt_needed = false; + bool remote = chmui->parent->socket; + + timer_del(chmui->timer); + + /* This tick is the epoch. How to handle? */ + if (remote) { + int rc; + uint64_t reply; + /* hack instance always 0! 
*/ + rc = chmu_send(chmui->parent, 0, SIGNAL_EPOCH_END, 0, &reply); + if (rc < 0) { + printf("Epoch signalling failed\n"); + } + + rc = chmu_send(chmui->parent, 0, QUERY_TAIL, 0, &reply); + if (rc < 0) { + printf("failed to read the tail\n"); + } + chmui->tail = reply; + printf("after epoch tail is %x\n", chmui->tail); + } else { /* Fake some data if we don't have a real source */ + uint8_t rand[entries_to_add]; + + qemu_guest_getrandom_nofail(rand, sizeof(rand)); + for (i = 0; i < entries_to_add; i++) { + if ((chmui->tail + 1) % CXL_HOTLIST_ENTRIES == chmui->head) { + /* Overflow occured, drop out */ + break; + } + chmui->hotlist[chmui->tail % CXL_HOTLIST_ENTRIES] = + (chmui->tail << 16) | (chmui->hotness_thresh + rand[i]); + chmui->tail++; + chmui->tail %= CXL_HOTLIST_ENTRIES; + } + } + + /* All interrupt code is kept in here whatever the data source */ + if (chmui->int_on_fill_thresh && !chmui->fill_thresh_set) { + if (((chmui->tail > chmui->head) && + (chmui->tail - chmui->head > chmui->fillthresh)) | + ((chmui->tail < chmui->head) && + (CXL_HOTLIST_ENTRIES - chmui->head + chmui->tail > + chmui->fillthresh))) { + chmui->fill_thresh_set = true; + interrupt_needed = true; + } + } + if (chmui->int_on_overflow && !chmui->overflow_set) { + if ((chmui->tail + 1) % CXL_HOTLIST_ENTRIES == chmui->head) { + chmui->overflow_set = true; + interrupt_needed = true; + } + } + + if (interrupt_needed) { + if (msix_enabled(pdev)) { + msix_notify(pdev, chmui->msi_n); + } else if (msi_enabled(pdev)) { + msi_notify(pdev, chmui->msi_n); + } + } + + timer_mod(chmui->timer, + qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL) + chmui->epoch_ms); +} + +int cxl_chmu_register_block_init(Object *obj, + CXLDeviceState *cxl_dstate, + int id, uint8_t msi_n, + Error **errp) +{ + CHMUState *chmu = &cxl_dstate->chmu[id]; + MemoryRegion *registers = &cxl_dstate->chmu_registers[id]; + g_autofree gchar *name = g_strdup_printf("chmu%d-registers", id); + struct sockaddr_in server_addr; + int i; + + memory_region_init_io(registers, obj, &chmu_ops, chmu, name, + pow2ceil(CXL_CHMU_SIZE)); + memory_region_add_subregion(&cxl_dstate->device_registers, + CXL_CHMU_OFFSET(id), registers); + + for (i = 0; i < CXL_CHMU_INSTANCES_PER_BLOCK; i++) { + CHMUInstance *chmui = &chmu->inst[i]; + + chmui->parent = chmu;/* hack */ + chmui->private = obj; + chmui->msi_n = msi_n + i; + chmui->timer = timer_new_ms(QEMU_CLOCK_VIRTUAL, chmu_timer_update, + chmui); + } + + if (chmu->port) { + uint64_t helloval = 41; + chmu->socket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP); + if (chmu->socket < 0) { + error_setg(errp, "Failed to create a socket"); + return -1; + } + + memset((char *)&server_addr, 0, sizeof(server_addr)); + server_addr.sin_family = AF_INET; + server_addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK); + server_addr.sin_port = htons(chmu->port); + if (connect(chmu->socket, (struct sockaddr *)&server_addr, + sizeof(server_addr)) < 0) { + close(chmu->socket); + error_setg(errp, "Socket connect failed"); + return -1; + } + + send(chmu->socket, &helloval, sizeof(helloval), 0); + for (i = 0; i < CXL_CHMU_INSTANCES_PER_BLOCK; i++) { + int rc; + rc = chmu_send(chmu, i, SET_HOTLIST_SIZE, + CHMU_HOTLIST_LENGTH, NULL); + if (rc) { + error_setg(errp, "Failed to set hotlist size"); + return rc; + } + + rc = chmu_send(chmu, i, SET_NUMBER_GRANUALS, + cxl_dstate->static_mem_size / 4096, NULL); + if (rc) { + error_setg(errp, "Failed to set number of granuals"); + return rc; + } + } + } + return 0; +} diff --git a/hw/mem/cxl_type3.c b/hw/mem/cxl_type3.c index 
c1004ddae8..78426758af 100644 --- a/hw/mem/cxl_type3.c +++ b/hw/mem/cxl_type3.c @@ -38,7 +38,10 @@ enum CXL_T3_MSIX_VECTOR { CXL_T3_MSIX_CPMU0, CXL_T3_MSIX_CPMU1, CXL_T3_MSIX_PCIE_DOE_COMPLIANCE, - CXL_T3_MSIX_VECTOR_NR + CXL_T3_MSIX_CHMU0_BASE, + /* One interrupt per CMUH instance in the block */ + CXL_T3_MSIX_VECTOR_NR = + CXL_T3_MSIX_CHMU0_BASE + CXL_CHMU_INSTANCES_PER_BLOCK, }; #define DWORD_BYTE 4 @@ -499,6 +502,8 @@ static void build_dvsecs(CXLType3Dev *ct3d) RBI_CXL_CPMU_REG | CXL_DEVICE_REG_BAR_IDX; regloc_dvsec->reg_base[2 + i].hi = 0; } + regloc_dvsec->reg_base[4].lo = CXL_CHMU_OFFSET(0) | RBI_CXL_CHMU_REG | + CXL_DEVICE_REG_BAR_IDX; cxl_component_create_dvsec(cxl_cstate, CXL2_TYPE3_DEVICE, REG_LOC_DVSEC_LENGTH, REG_LOC_DVSEC, REG_LOC_DVSEC_REVID, (uint8_t *)regloc_dvsec); @@ -535,6 +540,17 @@ static void hdm_decoder_commit(CXLType3Dev *ct3d, int which) ctrl = FIELD_DP32(ctrl, CXL_HDM_DECODER0_CTRL, COMMITTED, 1); stl_le_p(cache_mem + R_CXL_HDM_DECODER0_CTRL + which * hdm_inc, ctrl); + + if (which == 0) { + uint32_t low, high; + low = ldl_le_p(cache_mem + R_CXL_HDM_DECODER0_BASE_LO); + high = ldl_le_p(cache_mem + R_CXL_HDM_DECODER0_BASE_HI); + ct3d->cxl_dstate.chmu[0].base = ((uint64_t)high << 32) | (low & 0xf0000000); + + low = ldl_le_p(cache_mem + R_CXL_HDM_DECODER0_SIZE_LO); + high = ldl_le_p(cache_mem + R_CXL_HDM_DECODER0_SIZE_HI); + ct3d->cxl_dstate.chmu[0].size = ((uint64_t)high << 32) | (low & 0xf0000000); + } } static void hdm_decoder_uncommit(CXLType3Dev *ct3d, int which) @@ -1008,6 +1024,12 @@ static void ct3_realize(PCIDevice *pci_dev, Error **errp) CXL_T3_MSIX_CPMU0); cxl_cpmu_register_block_init(OBJECT(pci_dev), &ct3d->cxl_dstate, 1, CXL_T3_MSIX_CPMU1); + rc = cxl_chmu_register_block_init(OBJECT(pci_dev), &ct3d->cxl_dstate, + 0, CXL_T3_MSIX_CHMU0_BASE, errp); + if (rc) { + goto err_free_special_ops; + } + pci_register_bar(pci_dev, CXL_DEVICE_REG_BAR_IDX, PCI_BASE_ADDRESS_SPACE_MEMORY | PCI_BASE_ADDRESS_MEM_TYPE_64, @@ -1317,6 +1339,7 @@ static const Property ct3_props[] = { speed, PCIE_LINK_SPEED_32), DEFINE_PROP_PCIE_LINK_WIDTH("x-width", CXLType3Dev, width, PCIE_LINK_WIDTH_16), + DEFINE_PROP_UINT16("chmu-port", CXLType3Dev, cxl_dstate.chmu[0].port, 0), }; static uint64_t get_lsa_size(CXLType3Dev *ct3d) diff --git a/hw/cxl/meson.build b/hw/cxl/meson.build index 4db7cad267..c97e64b586 100644 --- a/hw/cxl/meson.build +++ b/hw/cxl/meson.build @@ -6,6 +6,7 @@ system_ss.add(when: 'CONFIG_CXL', 'cxl-host.c', 'cxl-cdat.c', 'cxl-events.c', + 'cxl-chmu.c', 'cxl-cpmu.c', 'switch-mailbox-cci.c', ), -- 2.43.0 ^ permalink raw reply related [flat|nested] 13+ messages in thread
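A quick worked example of how the layout macros in cxl_chmu.h above combine
may save others some arithmetic. The values below just restate the emulated
parameters from this patch (1 TiB max capacity, 1024 hotlist entries, 3
instances per block); they are emulation-side choices, not anything the
CHMU spec requires.

#include <assert.h>

#define CXL_CHMU_INSTANCES_PER_BLOCK 3
#define CXL_HOTLIST_ENTRIES 1024
#define CXL_MAX_DRAM_CAPACITY 0x10000000000UL   /* 1 TiB */
#define CXL_CHMU_HL_START (0x70 + (CXL_MAX_DRAM_CAPACITY / (0x10000000UL * 8)))
#define CXL_CHMU_INSTANCE_SIZE (CXL_CHMU_HL_START + CXL_HOTLIST_ENTRIES * 8)
#define CXL_CHMU_SIZE \
    (0x10 + CXL_CHMU_INSTANCE_SIZE * CXL_CHMU_INSTANCES_PER_BLOCK)

int main(void)
{
    /* 1 TiB / 256 MiB range granules = 4096 bits = 0x200 bytes of bitmap */
    assert(CXL_CHMU_HL_START == 0x270);
    /* the hotlist of 1024 x 8 byte entries follows the bitmap */
    assert(CXL_CHMU_INSTANCE_SIZE == 0x2270);
    /* 0x10 of common caps + 3 instances; pow2ceil() rounds this to 32 KiB */
    assert(CXL_CHMU_SIZE == 0x6760);
    return 0;
}
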
* [RFC PATCH QEMU 2/3] plugins: Add cache miss reporting over a socket. 2025-01-24 17:29 [RFC PATCH QEMU 0/3] cxl/plugins: Hotness Monitoring Unit with 'real' data Jonathan Cameron via 2025-01-24 17:29 ` [RFC PATCH QEMU 1/3] hw/cxl: Initial CXL Hotness Monitoring Unit Emulation Jonathan Cameron via @ 2025-01-24 17:29 ` Jonathan Cameron via 2025-05-20 14:16 ` Alex Bennée 2025-01-24 17:29 ` [RFC PATCH QEMU x3/3] contrib: Add example hotness monitoring unit server Jonathan Cameron via 2025-01-24 20:55 ` [RFC PATCH QEMU 0/3] cxl/plugins: Hotness Monitoring Unit with 'real' data Pierrick Bouvier 3 siblings, 1 reply; 13+ messages in thread From: Jonathan Cameron via @ 2025-01-24 17:29 UTC (permalink / raw) To: fan.ni, linux-cxl, qemu-devel Cc: Alex Bennée, Alexandre Iooss, Mahmoud Mandour, Pierrick Bouvier, linuxarm, Niyas Sait This allows an external program to act as a hotness tracker. Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> --- contrib/plugins/cache.c | 75 +++++++++++++++++++++++++++++++++++++---- 1 file changed, 68 insertions(+), 7 deletions(-) diff --git a/contrib/plugins/cache.c b/contrib/plugins/cache.c index 7baff86860..5af1e6559c 100644 --- a/contrib/plugins/cache.c +++ b/contrib/plugins/cache.c @@ -7,10 +7,17 @@ #include <inttypes.h> #include <stdio.h> +#include <unistd.h> #include <glib.h> +#include <sys/socket.h> +#include <arpa/inet.h> #include <qemu-plugin.h> +static int client_socket = -1; +static uint64_t missfilterbase; +static uint64_t missfiltersize; + #define STRTOLL(x) g_ascii_strtoll(x, NULL, 10) QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION; @@ -104,6 +111,7 @@ static Cache **l2_ucaches; static GMutex *l1_dcache_locks; static GMutex *l1_icache_locks; static GMutex *l2_ucache_locks; +static GMutex *socket_lock; static uint64_t l1_dmem_accesses; static uint64_t l1_imem_accesses; @@ -385,6 +393,21 @@ static bool access_cache(Cache *cache, uint64_t addr) return false; } +static void miss(uint64_t paddr) +{ + if (client_socket < 0) { + return; + } + + if (paddr < missfilterbase || paddr >= missfilterbase + missfiltersize) { + return; + } + + g_mutex_lock(socket_lock); + send(client_socket, &paddr, sizeof(paddr), 0); + g_mutex_unlock(socket_lock); +} + static void vcpu_mem_access(unsigned int vcpu_index, qemu_plugin_meminfo_t info, uint64_t vaddr, void *userdata) { @@ -395,9 +418,6 @@ static void vcpu_mem_access(unsigned int vcpu_index, qemu_plugin_meminfo_t info, bool hit_in_l1; hwaddr = qemu_plugin_get_hwaddr(info, vaddr); - if (hwaddr && qemu_plugin_hwaddr_is_io(hwaddr)) { - return; - } effective_addr = hwaddr ? 
qemu_plugin_hwaddr_phys_addr(hwaddr) : vaddr; cache_idx = vcpu_index % cores; @@ -412,7 +432,11 @@ static void vcpu_mem_access(unsigned int vcpu_index, qemu_plugin_meminfo_t info, l1_dcaches[cache_idx]->accesses++; g_mutex_unlock(&l1_dcache_locks[cache_idx]); - if (hit_in_l1 || !use_l2) { + if (hit_in_l1) { + return; + } + if (!use_l2) { + miss(effective_addr); /* No need to access L2 */ return; } @@ -422,6 +446,7 @@ static void vcpu_mem_access(unsigned int vcpu_index, qemu_plugin_meminfo_t info, insn = userdata; __atomic_fetch_add(&insn->l2_misses, 1, __ATOMIC_SEQ_CST); l2_ucaches[cache_idx]->misses++; + miss(effective_addr); } l2_ucaches[cache_idx]->accesses++; g_mutex_unlock(&l2_ucache_locks[cache_idx]); @@ -447,8 +472,12 @@ static void vcpu_insn_exec(unsigned int vcpu_index, void *userdata) l1_icaches[cache_idx]->accesses++; g_mutex_unlock(&l1_icache_locks[cache_idx]); - if (hit_in_l1 || !use_l2) { - /* No need to access L2 */ + if (hit_in_l1) { + return; + } + + if (!use_l2) { + miss(insn_addr); return; } @@ -739,14 +768,16 @@ QEMU_PLUGIN_EXPORT int qemu_plugin_install(qemu_plugin_id_t id, const qemu_info_t *info, int argc, char **argv) { - int i; + int i, port; int l1_iassoc, l1_iblksize, l1_icachesize; int l1_dassoc, l1_dblksize, l1_dcachesize; int l2_assoc, l2_blksize, l2_cachesize; + struct sockaddr_in server_addr; limit = 32; sys = info->system_emulation; + port = -1; l1_dassoc = 8; l1_dblksize = 64; l1_dcachesize = l1_dblksize * l1_dassoc * 32; @@ -808,11 +839,39 @@ int qemu_plugin_install(qemu_plugin_id_t id, const qemu_info_t *info, fprintf(stderr, "invalid eviction policy: %s\n", opt); return -1; } + } else if (g_strcmp0(tokens[0], "port") == 0) { + port = STRTOLL(tokens[1]); + } else if (g_strcmp0(tokens[0], "missfilterbase") == 0) { + missfilterbase = STRTOLL(tokens[1]); + } else if (g_strcmp0(tokens[0], "missfiltersize") == 0) { + missfiltersize = STRTOLL(tokens[1]); } else { fprintf(stderr, "option parsing failed: %s\n", opt); return -1; } } + if (port >= -1) { + uint64_t paddr = 42; /* hello, I'm a provider */ + client_socket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP); + if (client_socket < 0) { + printf("failed to create a socket\n"); + return -1; + } + printf("Cache miss reported on on %lx size %lx\n", + missfilterbase, missfiltersize); + memset((char *)&server_addr, 0, sizeof(server_addr)); + server_addr.sin_family = AF_INET; + server_addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK); + server_addr.sin_port = htons(port); + + if (connect(client_socket, (struct sockaddr *)&server_addr, + sizeof(server_addr)) < 0) { + close(client_socket); + return -1; + } + /* Let it know we are a data provider */ + send(client_socket, &paddr, sizeof(paddr), 0); + } policy_init(); @@ -840,6 +899,8 @@ int qemu_plugin_install(qemu_plugin_id_t id, const qemu_info_t *info, return -1; } + socket_lock = g_new0(GMutex, 1); + l1_dcache_locks = g_new0(GMutex, cores); l1_icache_locks = g_new0(GMutex, cores); l2_ucache_locks = use_l2 ? g_new0(GMutex, cores) : NULL; -- 2.43.0 ^ permalink raw reply related [flat|nested] 13+ messages in thread
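A note on the new plugin parameters above: they are parsed as plain decimal
(g_ascii_strtoll), so the values in the cover letter example,
missfilterbase=1099511627776 and missfiltersize=1099511627776, are each
simply 2^40 = 0x10000000000 = 1 TiB. With those settings the plugin only
forwards misses whose physical address falls in the 1 TiB window starting
at 1 TiB, which in that example configuration appears to correspond to the
CXL fixed memory window; the base and size presumably need adjusting for
other machine setups.
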
* Re: [RFC PATCH QEMU 2/3] plugins: Add cache miss reporting over a socket. 2025-01-24 17:29 ` [RFC PATCH QEMU 2/3] plugins: Add cache miss reporting over a socket Jonathan Cameron via @ 2025-05-20 14:16 ` Alex Bennée 0 siblings, 0 replies; 13+ messages in thread From: Alex Bennée @ 2025-05-20 14:16 UTC (permalink / raw) To: Jonathan Cameron Cc: fan.ni, linux-cxl, qemu-devel, Alexandre Iooss, Mahmoud Mandour, Pierrick Bouvier, linuxarm, Niyas Sait Jonathan Cameron <Jonathan.Cameron@huawei.com> writes: > This allows an external program to act as a hotness tracker. > > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> > --- > contrib/plugins/cache.c | 75 +++++++++++++++++++++++++++++++++++++---- > 1 file changed, 68 insertions(+), 7 deletions(-) > > diff --git a/contrib/plugins/cache.c b/contrib/plugins/cache.c > index 7baff86860..5af1e6559c 100644 > --- a/contrib/plugins/cache.c > +++ b/contrib/plugins/cache.c > @@ -7,10 +7,17 @@ > > #include <inttypes.h> > #include <stdio.h> > +#include <unistd.h> > #include <glib.h> > +#include <sys/socket.h> > +#include <arpa/inet.h> > > #include <qemu-plugin.h> > > +static int client_socket = -1; > +static uint64_t missfilterbase; > +static uint64_t missfiltersize; > + > #define STRTOLL(x) g_ascii_strtoll(x, NULL, 10) > > QEMU_PLUGIN_EXPORT int qemu_plugin_version = QEMU_PLUGIN_VERSION; > @@ -104,6 +111,7 @@ static Cache **l2_ucaches; > static GMutex *l1_dcache_locks; > static GMutex *l1_icache_locks; > static GMutex *l2_ucache_locks; > +static GMutex *socket_lock; > > static uint64_t l1_dmem_accesses; > static uint64_t l1_imem_accesses; > @@ -385,6 +393,21 @@ static bool access_cache(Cache *cache, uint64_t addr) > return false; > } > > +static void miss(uint64_t paddr) > +{ > + if (client_socket < 0) { > + return; > + } > + > + if (paddr < missfilterbase || paddr >= missfilterbase + missfiltersize) { > + return; > + } > + > + g_mutex_lock(socket_lock); > + send(client_socket, &paddr, sizeof(paddr), 0); > + g_mutex_unlock(socket_lock); > +} > + > static void vcpu_mem_access(unsigned int vcpu_index, qemu_plugin_meminfo_t info, > uint64_t vaddr, void *userdata) > { > @@ -395,9 +418,6 @@ static void vcpu_mem_access(unsigned int vcpu_index, qemu_plugin_meminfo_t info, > bool hit_in_l1; > > hwaddr = qemu_plugin_get_hwaddr(info, vaddr); > - if (hwaddr && qemu_plugin_hwaddr_is_io(hwaddr)) { > - return; > - } > > effective_addr = hwaddr ? 
qemu_plugin_hwaddr_phys_addr(hwaddr) : vaddr; > cache_idx = vcpu_index % cores; > @@ -412,7 +432,11 @@ static void vcpu_mem_access(unsigned int vcpu_index, qemu_plugin_meminfo_t info, > l1_dcaches[cache_idx]->accesses++; > g_mutex_unlock(&l1_dcache_locks[cache_idx]); > > - if (hit_in_l1 || !use_l2) { > + if (hit_in_l1) { > + return; > + } > + if (!use_l2) { > + miss(effective_addr); > /* No need to access L2 */ > return; > } > @@ -422,6 +446,7 @@ static void vcpu_mem_access(unsigned int vcpu_index, qemu_plugin_meminfo_t info, > insn = userdata; > __atomic_fetch_add(&insn->l2_misses, 1, __ATOMIC_SEQ_CST); > l2_ucaches[cache_idx]->misses++; > + miss(effective_addr); > } > l2_ucaches[cache_idx]->accesses++; > g_mutex_unlock(&l2_ucache_locks[cache_idx]); > @@ -447,8 +472,12 @@ static void vcpu_insn_exec(unsigned int vcpu_index, void *userdata) > l1_icaches[cache_idx]->accesses++; > g_mutex_unlock(&l1_icache_locks[cache_idx]); > > - if (hit_in_l1 || !use_l2) { > - /* No need to access L2 */ > + if (hit_in_l1) { > + return; > + } > + > + if (!use_l2) { > + miss(insn_addr); > return; > } > > @@ -739,14 +768,16 @@ QEMU_PLUGIN_EXPORT > int qemu_plugin_install(qemu_plugin_id_t id, const qemu_info_t *info, > int argc, char **argv) > { > - int i; > + int i, port; > int l1_iassoc, l1_iblksize, l1_icachesize; > int l1_dassoc, l1_dblksize, l1_dcachesize; > int l2_assoc, l2_blksize, l2_cachesize; > + struct sockaddr_in server_addr; > > limit = 32; > sys = info->system_emulation; > > + port = -1; > l1_dassoc = 8; > l1_dblksize = 64; > l1_dcachesize = l1_dblksize * l1_dassoc * 32; > @@ -808,11 +839,39 @@ int qemu_plugin_install(qemu_plugin_id_t id, const qemu_info_t *info, > fprintf(stderr, "invalid eviction policy: %s\n", opt); > return -1; > } > + } else if (g_strcmp0(tokens[0], "port") == 0) { > + port = STRTOLL(tokens[1]); > + } else if (g_strcmp0(tokens[0], "missfilterbase") == 0) { > + missfilterbase = STRTOLL(tokens[1]); > + } else if (g_strcmp0(tokens[0], "missfiltersize") == 0) { > + missfiltersize = STRTOLL(tokens[1]); > } else { > fprintf(stderr, "option parsing failed: %s\n", opt); > return -1; > } > } > + if (port >= -1) { > + uint64_t paddr = 42; /* hello, I'm a provider */ > + client_socket = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP); > + if (client_socket < 0) { > + printf("failed to create a socket\n"); > + return -1; > + } > + printf("Cache miss reported on on %lx size %lx\n", > + missfilterbase, missfiltersize); > + memset((char *)&server_addr, 0, sizeof(server_addr)); > + server_addr.sin_family = AF_INET; > + server_addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK); > + server_addr.sin_port = htons(port); > + > + if (connect(client_socket, (struct sockaddr *)&server_addr, > + sizeof(server_addr)) < 0) { > + close(client_socket); > + return -1; > + } > + /* Let it know we are a data provider */ > + send(client_socket, &paddr, sizeof(paddr), 0); > + } No particular objections to the patch as is. I do wonder if it would be worth exposing a chardev pipe to plugins so we could take advantage of QEMU's flexible redirection handling. But not a blocker for this. > > policy_init(); > > @@ -840,6 +899,8 @@ int qemu_plugin_install(qemu_plugin_id_t id, const qemu_info_t *info, > return -1; > } > > + socket_lock = g_new0(GMutex, 1); > + > l1_dcache_locks = g_new0(GMutex, cores); > l1_icache_locks = g_new0(GMutex, cores); > l2_ucache_locks = use_l2 ? g_new0(GMutex, cores) : NULL; -- Alex Bennée Virtualisation Tech Lead @ Linaro ^ permalink raw reply [flat|nested] 13+ messages in thread
* [RFC PATCH QEMU x3/3] contrib: Add example hotness monitoring unit server 2025-01-24 17:29 [RFC PATCH QEMU 0/3] cxl/plugins: Hotness Monitoring Unit with 'real' data Jonathan Cameron via 2025-01-24 17:29 ` [RFC PATCH QEMU 1/3] hw/cxl: Initial CXL Hotness Monitoring Unit Emulation Jonathan Cameron via 2025-01-24 17:29 ` [RFC PATCH QEMU 2/3] plugins: Add cache miss reporting over a socket Jonathan Cameron via @ 2025-01-24 17:29 ` Jonathan Cameron via 2025-01-24 20:55 ` [RFC PATCH QEMU 0/3] cxl/plugins: Hotness Monitoring Unit with 'real' data Pierrick Bouvier 3 siblings, 0 replies; 13+ messages in thread From: Jonathan Cameron via @ 2025-01-24 17:29 UTC (permalink / raw) To: fan.ni, linux-cxl, qemu-devel Cc: Alex Bennée, Alexandre Iooss, Mahmoud Mandour, Pierrick Bouvier, linuxarm, Niyas Sait This is used inconjuction with the cache plugin (with port parameter supplied) and the CXL Type 3 device with a hotness monitoring unit (chmu-port parameter supplied). It implements a very basic oracle with a counter per 4KiB page and simple loop to find large counts. The hotlist length is controlled by the QEMU device implementation. This is only responsible for the data handling events etc are a problem for the CXL HMU emulation. Note that when running this things are fairly slow. Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> --- contrib/hmu/hmu.c | 312 ++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 312 insertions(+) diff --git a/contrib/hmu/hmu.c b/contrib/hmu/hmu.c new file mode 100644 index 0000000000..aa47efd98b --- /dev/null +++ b/contrib/hmu/hmu.c @@ -0,0 +1,312 @@ +#include <stdio.h> +#include <string.h> +#include <stdlib.h> +#include <unistd.h> +#include <stdbool.h> +#include <stdint.h> +#include <pthread.h> +#include <arpa/inet.h> + +#define ID_PROVIDER 42 +#define ID_CONSUMER 41 + +/* Move to shared header */ +enum consumer_request { + QUERY_TAIL, + QUERY_HEAD, + SET_HEAD, + SET_HOTLIST_SIZE, + QUERY_HOTLIST_ENTRY, + SIGNAL_EPOCH_END, + SET_ENABLED, + SET_NUMBER_GRANUALS, + SET_HPA_BASE, + SET_HPA_SIZE, +}; + +struct tracking_instance { + uint64_t base, size; + uint16_t head, tail; + uint16_t hotlist_length; + uint64_t *hotlist; + int32_t *counters; + size_t num_counters; + bool enabled; +}; + +#define MAX_INSTANCES 16 +static int num_tracking_instances; +static struct tracking_instance *instances[MAX_INSTANCES] = {}; +/* + * Instances never removed so this only protects the index against + * parallel creations. + */ +pthread_mutex_t instances_lock; +static int register_tracker(struct tracking_instance *inst) +{ + pthread_mutex_lock(&instances_lock); + if (num_tracking_instances >= MAX_INSTANCES) { + pthread_mutex_unlock(&instances_lock); + return -1; + } + instances[num_tracking_instances++] = inst; + printf("registered %d\n", num_tracking_instances); + pthread_mutex_unlock(&instances_lock); + return 0; +} + +static void notify_tracker(struct tracking_instance *inst, uint64_t paddr) +{ + uint64_t offset; + + if (paddr < inst->base || paddr >= inst->base + inst->size) { + return; + } + /* Fixme: multiple regions */ + offset = (paddr - inst->base) / 4096; + + /* TODO - check masking */ + + if (!inst->counters) { + printf("No counter storage\n"); + return; + } + if (offset >= inst->num_counters) { + printf("out of range? 
%lx %lx\n", offset, inst->num_counters); + return; + } + inst->counters[offset]++; +} + +/* CHMU instance in QEMU */ +static void *provider_innerloop(void * _socket) +{ + int socket = *(int *)_socket; + uint64_t paddr; + int rc; + + printf("Provider connected\n"); + while (1) { + rc = read(socket, &paddr, sizeof(paddr)); + if (rc == 0) { + return NULL; + } + /* Lock not taken as instances only goes up which should be safe */ + for (int i = 0; i < num_tracking_instances; i++) + if (instances[i]->enabled) { + notify_tracker(instances[i], paddr); + } + } +} + + +/* Cache plugin hopefully squirting us some data */ +static void *consumer_innerloop(void *_socket) +{ + int socket = *(int *)_socket; + /* for now all chmu have 3 instances */ + struct tracking_instance insts[3] = {}; + /* Instance, command, parameter */ + uint64_t paddr[3]; + int rc; + + for (int i = 0; i < 3; i++) { + rc = register_tracker(&insts[i]); + if (rc) { + printf("Failed to register tracker\n"); + return NULL; + /* todo cleanup to not have partial trackers registered */ + } + } + printf("Consumer connected\n"); + + while (1) { + uint64_t reply, param; + enum consumer_request request; + + struct tracking_instance *inst; + + rc = read(socket, paddr, sizeof(paddr)); + if (rc < sizeof(paddr)) { + printf("short message %x\n", rc); + return NULL; + } + if (paddr[0] > 3) { + printf("garbage\n"); + exit(-1); + } + inst = &insts[paddr[0]]; + request = paddr[1]; + param = paddr[2]; + + switch (request) { + case QUERY_TAIL: + reply = inst->tail; + break; + case QUERY_HEAD: + reply = inst->head; + break; + case SET_HEAD: + reply = param; + inst->head = param; + break; + case SET_HOTLIST_SIZE: { + uint64_t *newlist; + reply = param; + inst->hotlist_length = param; + newlist = realloc(inst->hotlist, sizeof(*inst->hotlist) * param); + if (!newlist) { + printf("failed to allocate hotlist\n"); + break; + } + inst->hotlist = newlist; + break; + } + case QUERY_HOTLIST_ENTRY: + if (param >= inst->hotlist_length) { + printf("out of range hotlist read?\n"); + break; + } + reply = inst->hotlist[param]; + break; + case SIGNAL_EPOCH_END: { + int space; + int added = 0; + printf("into epoch end\n"); + reply = param; + + if (insts->tail > inst->head) { + space = inst->tail - inst->head; + } else { + space = inst->hotlist_length - inst->tail + + inst->head; + } + if (!inst->counters) { + printf("How did we reach end of an epoque without counters?\n"); + break; + } + for (int i = 0; i < inst->num_counters; i++) { + if (!(inst->counters[i] > 0)) { + continue; + } + inst->hotlist[inst->tail] = + (uint64_t)inst->counters[i] | ((uint64_t)i << 32); + printf("added hotlist element %lx at %u\n", + inst->hotlist[inst->tail], inst->tail); + inst->tail = (inst->tail + 1) % inst->hotlist_length; + added++; + if (added == space) { + break; + } + } + memset(inst->counters, 0, + inst->num_counters * sizeof(*inst->counters)); + + printf("End of epoch %u %u\n", inst->head, inst->tail); + /* Overflow hadnling based on fullness detection in qemu */ + break; + } + case SET_ENABLED: + reply = param; + inst->enabled = !!param; + printf("enabled? 
%d\n", inst->enabled); + break; + case SET_NUMBER_GRANUALS: { /* FIXME Should derive from granual size */ + uint32_t *newcounters; + + reply = param; + newcounters = realloc(inst->counters, + sizeof(*inst->counters) * + param); + if (!newcounters) { + printf("Failed to allocate counter storage\n"); + } + printf("allocated space for %lu counters\n", param); + inst->counters = newcounters; + inst->num_counters = param; + break; + } + case SET_HPA_BASE: + reply = param; + inst->base = param; + break; + case SET_HPA_SIZE: /* Size */ + reply = param; + inst->size = param; + break; + default: + printf("No idea yet\n"); + break; + } + write(socket, &reply, sizeof(reply)); + } +} + +int main(int argc, char **argv) +{ + int server_fd, new_socket; + struct sockaddr_in address; + int opt = 1; + int addrlen = sizeof(address); + uint64_t paddr; + unsigned short port; + + if (argc < 2) { + printf("Please provide port to listen on\n"); + return -1; + } + port = atoi(argv[1]); + + if ((server_fd = socket(AF_INET, SOCK_STREAM, IPPROTO_TCP)) == 0) { + return -1; + } + + if (setsockopt(server_fd, SOL_SOCKET, SO_REUSEADDR | SO_REUSEPORT, + &opt, sizeof(opt))) { + return -1; + } + address.sin_family = AF_INET; + address.sin_addr.s_addr = INADDR_ANY; + address.sin_port = htons(port); + + if (bind(server_fd, (struct sockaddr *)&address, sizeof(address)) < 0) { + return -1; + } + + printf("Listening on port %u\n", port); + if (listen(server_fd, 3) < 0) { + return -1; + } + + while (1) { + int rc; + pthread_t thread; + if ((new_socket = accept(server_fd, (struct sockaddr *)&address, + (socklen_t *)&addrlen)) < 0) { + exit(-1); + } + + rc = read(new_socket, &paddr, sizeof(paddr)); + if (rc == 0) { + return 0; + } + + if (paddr == ID_PROVIDER) { + if (pthread_create(&thread, NULL, provider_innerloop, + &new_socket)) { + printf("thread create fail\n"); + }; + } else if (paddr == ID_CONSUMER) { + if (pthread_create(&thread, NULL, consumer_innerloop, + &new_socket)) { + printf("thread create fail\n"); + }; + } else { + printf("No idea what this was - initial value not provider or consumer\n"); + close(new_socket); + return 0; + } + } + + return 0; +} -- 2.43.0 ^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [RFC PATCH QEMU 0/3] cxl/plugins: Hotness Monitoring Unit with 'real' data. 2025-01-24 17:29 [RFC PATCH QEMU 0/3] cxl/plugins: Hotness Monitoring Unit with 'real' data Jonathan Cameron via ` (2 preceding siblings ...) 2025-01-24 17:29 ` [RFC PATCH QEMU x3/3] contrib: Add example hotness monitoring unit server Jonathan Cameron via @ 2025-01-24 20:55 ` Pierrick Bouvier 2025-01-27 10:20 ` Jonathan Cameron via 3 siblings, 1 reply; 13+ messages in thread From: Pierrick Bouvier @ 2025-01-24 20:55 UTC (permalink / raw) To: Jonathan Cameron, fan.ni, linux-cxl, qemu-devel Cc: Alex Bennée, Alexandre Iooss, Mahmoud Mandour, linuxarm, Niyas Sait Hi Jonathan, thanks for posting this. It's a creative usage of plugins. I think that your current approach, decoupling plugins, CHMU and device model is a good thing. I'm not familiar with CXL, but one question that comes to my mind is: Is that mandatory to do this analysis during execution (vs dumping binary traces from CHMU and plugin and running an analysis post execution)? Regards, Pierrick On 1/24/25 09:29, Jonathan Cameron wrote: > Hi All, > > This is an RFC mainly to seek feedback on the approach used, particularly > the aspect of how to get data from a TCG plugin into a device model. > Two options that we have tried > 1. Socket over which the plugin sends data to an external server > (as seen here) > 2. Register and manage a plugin from within a device model > > The external server approach keeps things loosely coupled, but at the cost > of separately maintaining that server, protocol definitions etc and > some overhead. > The closely couple solution is neater, but I suspect might be controversial > (hence I didn't start with that :) > > The code here is at best a PoC to illustrate what we have in mind > It's not nice code at all, feature gaps, bugs and all! So whilst > review is always welcome I'm not requesting it for now. > > Kernel support was posted a while back but was done against fake data > (still supported here if you don't provide the port parameter to the type3 device) > https://lore.kernel.org/linux-cxl/20241121101845.1815660-1-Jonathan.Cameron@huawei.com/ > I'll post a minor update of that driver shortly to take into account > a few specification clarifications but it should work with this without > those. > > Note there are some other patches on the tree I generated this from > so this may not apply to upstream. Easiest is probably to test > using gitlab.com/jic23/qemu cxl-2025-01-24 > > Thanks to Niyas for his suggestions on how to make all this work! > > Background > ---------- > > What is the Compute eXpress Link Hotness Monitoring unit and what is it for? > - In a tiered memory equipped server with the slow tier being attached via > CXL the expectation is a given workload will benefit from putting data > that is frequently fetched from memory in lower latency directly attached > DRAM. Less frequently used data can be served from the CXL attached memory > with no significant loss of performance. Any data that is hot enough to > almost always be in cache doesn't matter as it is only fetch from memory > occasionally. > - Working out which memory is best places where is hard to do and in some > workloads a dynamic problem. As such we need something we can measure > to provide some indication of what data is in the wrong place. > There are existing techniques to do this (page faulting, various > CPU tracing systems, access bit scanning etc) but they all have significant > overheads. 
> - Monitoring accesses on the CXL device provides a path to getting good > data without those overheads. These units are known as CXL Hotness > Monitoring Units or CHMUs. Loosely speaking they count accesses to > granuals of data (e.g. 4KiB pages). Exactly how they do that and > where they sacrifice data accuracy is an implementation trade off. > > Why do we need a model that gives real data? > - In general there is a need to develop software on top of these units > to move data to the right place. Hard to evaluate that if we are making > up the info on what is 'hot'. > - Need to allow for a bunch of 'impdef' solutions. Note that CHMU > in this patch set is an oracle - it has enough counters to count > every access. That's not realistic but it doesn't get me shouted > at by our architecture teams for giving away any secrets. > If we move forward with this, I'll probably implement a limited > counter + full CAM solution (also unrealistic, but closer to real) > I'd be very interested in contributions of other approaches (there > are lots in the literature, under the term top-k) > - Resources will be constrained, so whilst a CHMU might in theory > allow monitoring everything at once, that will come with a big > accuracy cost. We need to design the algorithms that give us > good data given those constraints. > > So we need a solution to explore the design space and develop the software > to take advantage of this hardware (there are various LSF/MM proposals > on how to use this an other ways of tracking hotness). > https://lore.kernel.org/all/20250123105721.424117-1-raghavendra.kt@amd.com/ > https://lore.kernel.org/lkml/Z4XUoWlU-UgRik18@gourry-fedora-PF4VCD3F/T/ > > QEMU plugins give us a way to do this. In particular the existing > Cache plugin can be easily modified to tell use what memory addresses > missed at the last level of emulated cache. We can then filter those > for the memory address range that maps to CXL and feed them to our > counter implementation. On the other side, each instance of CXL type 3 > device can connect to this server and request hotness monitoring > services + provide parameters etc. Elements such as list threshold > management and overflow detection etc are in the CXL HMU QEMU device mode. > As noted above, we have an alternative approach that can closely couple > things, so the device model registers the plugin directly and there > is no server. > > How to use it! > -------------- > > It runs a little slow but it runs and generates somewhat plausible outputs. > I'd definitely suggest running it with the pass through optimization > patch on the CXL staging tree (and a single direct connected device). > Your millage will vary if you try to use other parameters, or > hotness units beyond the first one (implementation far from complete!) > > To run start the server in contrib/hmu/ providing a port number to listen > on. > > ./chmu 4443 > > Then launch QEMU with something like the following. 
> > qemu-system-aarch64 -icount shift=1 \ > -plugin ../qemu/bin/native/contrib/plugins/libcache.so,port=4443,missfilterbase=1099511627776,missfiltersize=1099511627776,dcachesize=8192,dassoc=4,dblksize=64,icachesize=8192,iassoc=4,iblksize=64,l2cachesize=32768,l2assoc=16,l2blksize=64 \ > -M virt,ras=on,nvdimm=on,gic-version=3,cxl=on,hmat=on -m 4g,maxmem=8g,slots=4 -cpu max -smp 4 \ > -kernel Image \ > -drive if=none,file=full.qcow2,format=qcow2,id=hd \ > -device pcie-root-port,id=root_port1 \ > -device virtio-blk-pci,drive=hd,x-max-bounce-buffer-size=512k \ > -nographic -no-reboot -append 'earlycon memblock=debug root=/dev/vda2 fsck.mode=skip maxcpus=4 tp_printk' \ > -monitor telnet:127.0.0.1:1234,server,nowait -bios QEMU_EFI.fd \ > -object memory-backend-ram,size=4G,id=mem0 \ > -object memory-backend-file,id=cxl-mem1,share=on,mem-path=/tmp/t3_cxl1.raw,size=1G,align=256M \ > -object memory-backend-file,id=cxl-mem2,share=on,mem-path=/tmp/t3_cxl2.raw,size=1G,align=256M \ > -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/t3_lsa1.raw,size=1M,align=1M \ > -object memory-backend-file,id=cxl-mem3,share=on,mem-path=/tmp/t3_cxl3.raw,size=1G,align=256M \ > -object memory-backend-file,id=cxl-mem4,share=on,mem-path=/tmp/t3_cxl4.raw,size=1G,align=256M \ > -object memory-backend-file,id=cxl-lsa2,share=on,mem-path=/tmp/t3_lsa2.raw,size=1M,align=1M \ > -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1,hdm_for_passthrough=true,numa_node=0\ > -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \ > -device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,id=cxl-pmem1,lsa=cxl-lsa1,sn=3,x-speed=32,x-width=16,chmu-port=4443 \ > -machine cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=8G,cxl-fmw.0.interleave-granularity=1k \ > -numa node,nodeid=0,cpus=0-3,memdev=mem0 \ > -numa node,nodeid=1 \ > -object acpi-generic-initiator,id=bob2,pci-dev=bob,node=1 \ > -numa node,nodeid=2 \ > -object acpi-generic-port,id=bob11,pci-bus=cxl.1,node=2 \ > > In the guest, create and bind the region - this brings up the CXL memory > device so accesses go to the memory. > > cd /sys/bus/cxl/devices/decoder0.0/ > cat create_ram_region > echo region0 > create_ram_region > echo ram > /sys/bus/cxl/devices/decoder2.0/mode > echo ram > /sys/bus/cxl/devices/decoder3.0/mode > echo $((256 << 21)) > /sys/bus/cxl/devices/decoder2.0/dpa_size > cd /sys/bus/cxl/devices/region0/ > echo 256 > interleave_granularity > echo 1 > interleave_ways > echo $((256 << 21)) > size > echo decoder2.0 > target0 > echo 1 > commit > echo region0 > /sys/bus/cxl/drivers/cxl_region/bind > > Finally start perf with something like: > > ./perf record -a -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\ > hotness_threshold=635,epoch_multiplier=4,epoch_scale=4,\ > range_base=0,range_size=4096/ ./stress.sh > > where stress.sh is > > sleep 2 > numactl --membind 3 stress-ng --vm 1 --vm-bytes 1M --vm-keep -t 5s > sleep 2 > > See the results with > ./perf report --dump-raw-trace | grep -A 200 HMU > > Enjoy and have a good weekend. > > Thanks, > > Jonathan > > Jonathan Cameron (3): > hw/cxl: Initial CXL Hotness Monitoring Unit Emulation > plugins: Add cache miss reporting over a socket. 
> contrib: Add example hotness monitoring unit server > > include/hw/cxl/cxl.h | 1 + > include/hw/cxl/cxl_chmu.h | 154 ++++++++++++ > include/hw/cxl/cxl_device.h | 13 +- > include/hw/cxl/cxl_pci.h | 7 +- > contrib/hmu/hmu.c | 312 ++++++++++++++++++++++++ > contrib/plugins/cache.c | 75 +++++- > hw/cxl/cxl-chmu.c | 459 ++++++++++++++++++++++++++++++++++++ > hw/mem/cxl_type3.c | 25 +- > hw/cxl/meson.build | 1 + > 9 files changed, 1035 insertions(+), 12 deletions(-) > create mode 100644 include/hw/cxl/cxl_chmu.h > create mode 100644 contrib/hmu/hmu.c > create mode 100644 hw/cxl/cxl-chmu.c > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH QEMU 0/3] cxl/plugins: Hotness Monitoring Unit with 'real' data. 2025-01-24 20:55 ` [RFC PATCH QEMU 0/3] cxl/plugins: Hotness Monitoring Unit with 'real' data Pierrick Bouvier @ 2025-01-27 10:20 ` Jonathan Cameron via 2025-01-28 20:04 ` Pierrick Bouvier 0 siblings, 1 reply; 13+ messages in thread From: Jonathan Cameron via @ 2025-01-27 10:20 UTC (permalink / raw) To: Pierrick Bouvier Cc: fan.ni, linux-cxl, qemu-devel, Alex Bennée, Alexandre Iooss, Mahmoud Mandour, linuxarm, Niyas Sait On Fri, 24 Jan 2025 12:55:52 -0800 Pierrick Bouvier <pierrick.bouvier@linaro.org> wrote: > Hi Jonathan, > > thanks for posting this. It's a creative usage of plugins. > > I think that your current approach, decoupling plugins, CHMU and device > model is a good thing. > > I'm not familiar with CXL, but one question that comes to my mind is: > Is that mandatory to do this analysis during execution (vs dumping > binary traces from CHMU and plugin and running an analysis post execution)? Short answer is that post run analysis isn't of much use for developing the OS software story. It works to some degree if you are designing the tracking hardware or algorithms to use that hardware capture a snapshot of hotness - dealing with lack of counters, that sort of thing. The main intent of this support is to drive live usage of the data in the OS. So it gets this hotness information and migrates more frequently accessed memory to a 'nearer'/lower latency memory node. From an OS point of view there will be two ways it uses it: 1) Offline application optimization - that aligns with your suggestion of offline analysis but would typically still need to be live because we have to do the reverse maps and work out what was allocated in particular locations. Not impossible to dump that information from QEMU + the guest OS but the usage flow would then look quite different from what makes sense on real hardware where all the data is available to the host OS directly. 2) Migration of memory. This will dynamically change the PA backing a VA whilst applications are running. The aim being to develop how that happens, we need the dynamic state. Jonathan > > Regards, > Pierrick > > On 1/24/25 09:29, Jonathan Cameron wrote: > > Hi All, > > > > This is an RFC mainly to seek feedback on the approach used, particularly > > the aspect of how to get data from a TCG plugin into a device model. > > Two options that we have tried > > 1. Socket over which the plugin sends data to an external server > > (as seen here) > > 2. Register and manage a plugin from within a device model > > > > The external server approach keeps things loosely coupled, but at the cost > > of separately maintaining that server, protocol definitions etc and > > some overhead. > > The closely couple solution is neater, but I suspect might be controversial > > (hence I didn't start with that :) > > > > The code here is at best a PoC to illustrate what we have in mind > > It's not nice code at all, feature gaps, bugs and all! So whilst > > review is always welcome I'm not requesting it for now. > > > > Kernel support was posted a while back but was done against fake data > > (still supported here if you don't provide the port parameter to the type3 device) > > https://lore.kernel.org/linux-cxl/20241121101845.1815660-1-Jonathan.Cameron@huawei.com/ > > I'll post a minor update of that driver shortly to take into account > > a few specification clarifications but it should work with this without > > those. 
> > > > Note there are some other patches on the tree I generated this from > > so this may not apply to upstream. Easiest is probably to test > > using gitlab.com/jic23/qemu cxl-2025-01-24 > > > > Thanks to Niyas for his suggestions on how to make all this work! > > > > Background > > ---------- > > > > What is the Compute eXpress Link Hotness Monitoring unit and what is it for? > > - In a tiered memory equipped server with the slow tier being attached via > > CXL the expectation is a given workload will benefit from putting data > > that is frequently fetched from memory in lower latency directly attached > > DRAM. Less frequently used data can be served from the CXL attached memory > > with no significant loss of performance. Any data that is hot enough to > > almost always be in cache doesn't matter as it is only fetch from memory > > occasionally. > > - Working out which memory is best places where is hard to do and in some > > workloads a dynamic problem. As such we need something we can measure > > to provide some indication of what data is in the wrong place. > > There are existing techniques to do this (page faulting, various > > CPU tracing systems, access bit scanning etc) but they all have significant > > overheads. > > - Monitoring accesses on the CXL device provides a path to getting good > > data without those overheads. These units are known as CXL Hotness > > Monitoring Units or CHMUs. Loosely speaking they count accesses to > > granuals of data (e.g. 4KiB pages). Exactly how they do that and > > where they sacrifice data accuracy is an implementation trade off. > > > > Why do we need a model that gives real data? > > - In general there is a need to develop software on top of these units > > to move data to the right place. Hard to evaluate that if we are making > > up the info on what is 'hot'. > > - Need to allow for a bunch of 'impdef' solutions. Note that CHMU > > in this patch set is an oracle - it has enough counters to count > > every access. That's not realistic but it doesn't get me shouted > > at by our architecture teams for giving away any secrets. > > If we move forward with this, I'll probably implement a limited > > counter + full CAM solution (also unrealistic, but closer to real) > > I'd be very interested in contributions of other approaches (there > > are lots in the literature, under the term top-k) > > - Resources will be constrained, so whilst a CHMU might in theory > > allow monitoring everything at once, that will come with a big > > accuracy cost. We need to design the algorithms that give us > > good data given those constraints. > > > > So we need a solution to explore the design space and develop the software > > to take advantage of this hardware (there are various LSF/MM proposals > > on how to use this an other ways of tracking hotness). > > https://lore.kernel.org/all/20250123105721.424117-1-raghavendra.kt@amd.com/ > > https://lore.kernel.org/lkml/Z4XUoWlU-UgRik18@gourry-fedora-PF4VCD3F/T/ > > > > QEMU plugins give us a way to do this. In particular the existing > > Cache plugin can be easily modified to tell use what memory addresses > > missed at the last level of emulated cache. We can then filter those > > for the memory address range that maps to CXL and feed them to our > > counter implementation. On the other side, each instance of CXL type 3 > > device can connect to this server and request hotness monitoring > > services + provide parameters etc. 
Elements such as list threshold > > management and overflow detection etc are in the CXL HMU QEMU device mode. > > As noted above, we have an alternative approach that can closely couple > > things, so the device model registers the plugin directly and there > > is no server. > > > > How to use it! > > -------------- > > > > It runs a little slow but it runs and generates somewhat plausible outputs. > > I'd definitely suggest running it with the pass through optimization > > patch on the CXL staging tree (and a single direct connected device). > > Your millage will vary if you try to use other parameters, or > > hotness units beyond the first one (implementation far from complete!) > > > > To run start the server in contrib/hmu/ providing a port number to listen > > on. > > > > ./chmu 4443 > > > > Then launch QEMU with something like the following. > > > > qemu-system-aarch64 -icount shift=1 \ > > -plugin ../qemu/bin/native/contrib/plugins/libcache.so,port=4443,missfilterbase=1099511627776,missfiltersize=1099511627776,dcachesize=8192,dassoc=4,dblksize=64,icachesize=8192,iassoc=4,iblksize=64,l2cachesize=32768,l2assoc=16,l2blksize=64 \ > > -M virt,ras=on,nvdimm=on,gic-version=3,cxl=on,hmat=on -m 4g,maxmem=8g,slots=4 -cpu max -smp 4 \ > > -kernel Image \ > > -drive if=none,file=full.qcow2,format=qcow2,id=hd \ > > -device pcie-root-port,id=root_port1 \ > > -device virtio-blk-pci,drive=hd,x-max-bounce-buffer-size=512k \ > > -nographic -no-reboot -append 'earlycon memblock=debug root=/dev/vda2 fsck.mode=skip maxcpus=4 tp_printk' \ > > -monitor telnet:127.0.0.1:1234,server,nowait -bios QEMU_EFI.fd \ > > -object memory-backend-ram,size=4G,id=mem0 \ > > -object memory-backend-file,id=cxl-mem1,share=on,mem-path=/tmp/t3_cxl1.raw,size=1G,align=256M \ > > -object memory-backend-file,id=cxl-mem2,share=on,mem-path=/tmp/t3_cxl2.raw,size=1G,align=256M \ > > -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/t3_lsa1.raw,size=1M,align=1M \ > > -object memory-backend-file,id=cxl-mem3,share=on,mem-path=/tmp/t3_cxl3.raw,size=1G,align=256M \ > > -object memory-backend-file,id=cxl-mem4,share=on,mem-path=/tmp/t3_cxl4.raw,size=1G,align=256M \ > > -object memory-backend-file,id=cxl-lsa2,share=on,mem-path=/tmp/t3_lsa2.raw,size=1M,align=1M \ > > -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1,hdm_for_passthrough=true,numa_node=0\ > > -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \ > > -device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,id=cxl-pmem1,lsa=cxl-lsa1,sn=3,x-speed=32,x-width=16,chmu-port=4443 \ > > -machine cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=8G,cxl-fmw.0.interleave-granularity=1k \ > > -numa node,nodeid=0,cpus=0-3,memdev=mem0 \ > > -numa node,nodeid=1 \ > > -object acpi-generic-initiator,id=bob2,pci-dev=bob,node=1 \ > > -numa node,nodeid=2 \ > > -object acpi-generic-port,id=bob11,pci-bus=cxl.1,node=2 \ > > > > In the guest, create and bind the region - this brings up the CXL memory > > device so accesses go to the memory. 
> > > > cd /sys/bus/cxl/devices/decoder0.0/ > > cat create_ram_region > > echo region0 > create_ram_region > > echo ram > /sys/bus/cxl/devices/decoder2.0/mode > > echo ram > /sys/bus/cxl/devices/decoder3.0/mode > > echo $((256 << 21)) > /sys/bus/cxl/devices/decoder2.0/dpa_size > > cd /sys/bus/cxl/devices/region0/ > > echo 256 > interleave_granularity > > echo 1 > interleave_ways > > echo $((256 << 21)) > size > > echo decoder2.0 > target0 > > echo 1 > commit > > echo region0 > /sys/bus/cxl/drivers/cxl_region/bind > > > > Finally start perf with something like: > > > > ./perf record -a -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\ > > hotness_threshold=635,epoch_multiplier=4,epoch_scale=4,\ > > range_base=0,range_size=4096/ ./stress.sh > > > > where stress.sh is > > > > sleep 2 > > numactl --membind 3 stress-ng --vm 1 --vm-bytes 1M --vm-keep -t 5s > > sleep 2 > > > > See the results with > > ./perf report --dump-raw-trace | grep -A 200 HMU > > > > Enjoy and have a good weekend. > > > > Thanks, > > > > Jonathan > > > > Jonathan Cameron (3): > > hw/cxl: Initial CXL Hotness Monitoring Unit Emulation > > plugins: Add cache miss reporting over a socket. > > contrib: Add example hotness monitoring unit server > > > > include/hw/cxl/cxl.h | 1 + > > include/hw/cxl/cxl_chmu.h | 154 ++++++++++++ > > include/hw/cxl/cxl_device.h | 13 +- > > include/hw/cxl/cxl_pci.h | 7 +- > > contrib/hmu/hmu.c | 312 ++++++++++++++++++++++++ > > contrib/plugins/cache.c | 75 +++++- > > hw/cxl/cxl-chmu.c | 459 ++++++++++++++++++++++++++++++++++++ > > hw/mem/cxl_type3.c | 25 +- > > hw/cxl/meson.build | 1 + > > 9 files changed, 1035 insertions(+), 12 deletions(-) > > create mode 100644 include/hw/cxl/cxl_chmu.h > > create mode 100644 contrib/hmu/hmu.c > > create mode 100644 hw/cxl/cxl-chmu.c > > > ^ permalink raw reply [flat|nested] 13+ messages in thread
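To make the second usage mode above concrete (dynamically changing the PA that backs a VA while the application keeps running), the sketch below does the equivalent from user space with the Linux move_pages(2) call from libnuma. It assumes node 0 is the local DRAM node and node 3 is the CXL-backed node, matching the numactl --membind 3 example elsewhere in the thread; the node numbers, the file name and driving this from a user-space agent rather than the in-kernel promotion path are illustrative assumptions, not part of the series.

/* promote.c: user-space sketch of promoting one 'hot' page to local DRAM.
 * Node numbers are assumptions: node 0 = local DRAM, node 3 = CXL memory.
 * A real flow would take the page list from the CHMU hotlist via perf. */
#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    long page_size = sysconf(_SC_PAGESIZE);
    void *hot = aligned_alloc(page_size, page_size); /* stand-in hot page */
    void *pages[1] = { hot };
    int nodes[1] = { 0 };           /* destination node: local DRAM */
    int status[1];

    memset(hot, 0xa5, page_size);   /* touch it so it has a backing page */

    /* Rebind the physical page behind this VA; the VA itself is unchanged. */
    if (move_pages(0 /* self */, 1, pages, nodes, status, MPOL_MF_MOVE)) {
        perror("move_pages");
        return 1;
    }
    printf("page now on node %d\n", status[0]);
    return 0;
}

Build with gcc -o promote promote.c -lnuma. A kernel-driven flow would do the same rebacking internally, keyed off the hotness list the CHMU reports; the user-space form here is only for illustration.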
* Re: [RFC PATCH QEMU 0/3] cxl/plugins: Hotness Monitoring Unit with 'real' data. 2025-01-27 10:20 ` Jonathan Cameron via @ 2025-01-28 20:04 ` Pierrick Bouvier 2025-01-29 10:29 ` Jonathan Cameron via 0 siblings, 1 reply; 13+ messages in thread From: Pierrick Bouvier @ 2025-01-28 20:04 UTC (permalink / raw) To: Jonathan Cameron Cc: fan.ni, linux-cxl, qemu-devel, Alex Bennée, Alexandre Iooss, Mahmoud Mandour, linuxarm, Niyas Sait On 1/27/25 02:20, Jonathan Cameron wrote: > On Fri, 24 Jan 2025 12:55:52 -0800 > Pierrick Bouvier <pierrick.bouvier@linaro.org> wrote: > >> Hi Jonathan, >> >> thanks for posting this. It's a creative usage of plugins. >> >> I think that your current approach, decoupling plugins, CHMU and device >> model is a good thing. >> >> I'm not familiar with CXL, but one question that comes to my mind is: >> Is that mandatory to do this analysis during execution (vs dumping >> binary traces from CHMU and plugin and running an analysis post execution)? > > Short answer is that post run analysis isn't of much use for developing the OS > software story. It works to some degree if you are designing the tracking > hardware or algorithms to use that hardware capture a snapshot of hotness - > dealing with lack of counters, that sort of thing. > > The main intent of this support is to drive live usage of the data in the OS. > So it gets this hotness information and migrates more frequently accessed memory > to a 'nearer'/lower latency memory node. > > From an OS point of view there will be two ways it uses it: > 1) Offline application optimization - that aligns with your suggestion of offline > analysis but would typically still need to be live because we have to do > the reverse maps and work out what was allocated in particular locations. > Not impossible to dump that information from QEMU + the guest OS but the usage > flow would then look quite different from what makes sense on real hardware > where all the data is available to the host OS directly. > 2) Migration of memory. This will dynamically change the PA backing a VA whilst > applications are running. The aim being to develop how that happens, we need > the dynamic state. > In the end, are you modeling how the real CHMU will work, or simply gathering data to help designing it (number of counters, line size, ...)? Pierrick > Jonathan > >> >> Regards, >> Pierrick >> >> On 1/24/25 09:29, Jonathan Cameron wrote: >>> Hi All, >>> >>> This is an RFC mainly to seek feedback on the approach used, particularly >>> the aspect of how to get data from a TCG plugin into a device model. >>> Two options that we have tried >>> 1. Socket over which the plugin sends data to an external server >>> (as seen here) >>> 2. Register and manage a plugin from within a device model >>> >>> The external server approach keeps things loosely coupled, but at the cost >>> of separately maintaining that server, protocol definitions etc and >>> some overhead. >>> The closely couple solution is neater, but I suspect might be controversial >>> (hence I didn't start with that :) >>> >>> The code here is at best a PoC to illustrate what we have in mind >>> It's not nice code at all, feature gaps, bugs and all! So whilst >>> review is always welcome I'm not requesting it for now. 
>>> >>> Kernel support was posted a while back but was done against fake data >>> (still supported here if you don't provide the port parameter to the type3 device) >>> https://lore.kernel.org/linux-cxl/20241121101845.1815660-1-Jonathan.Cameron@huawei.com/ >>> I'll post a minor update of that driver shortly to take into account >>> a few specification clarifications but it should work with this without >>> those. >>> >>> Note there are some other patches on the tree I generated this from >>> so this may not apply to upstream. Easiest is probably to test >>> using gitlab.com/jic23/qemu cxl-2025-01-24 >>> >>> Thanks to Niyas for his suggestions on how to make all this work! >>> >>> Background >>> ---------- >>> >>> What is the Compute eXpress Link Hotness Monitoring unit and what is it for? >>> - In a tiered memory equipped server with the slow tier being attached via >>> CXL the expectation is a given workload will benefit from putting data >>> that is frequently fetched from memory in lower latency directly attached >>> DRAM. Less frequently used data can be served from the CXL attached memory >>> with no significant loss of performance. Any data that is hot enough to >>> almost always be in cache doesn't matter as it is only fetch from memory >>> occasionally. >>> - Working out which memory is best places where is hard to do and in some >>> workloads a dynamic problem. As such we need something we can measure >>> to provide some indication of what data is in the wrong place. >>> There are existing techniques to do this (page faulting, various >>> CPU tracing systems, access bit scanning etc) but they all have significant >>> overheads. >>> - Monitoring accesses on the CXL device provides a path to getting good >>> data without those overheads. These units are known as CXL Hotness >>> Monitoring Units or CHMUs. Loosely speaking they count accesses to >>> granuals of data (e.g. 4KiB pages). Exactly how they do that and >>> where they sacrifice data accuracy is an implementation trade off. >>> >>> Why do we need a model that gives real data? >>> - In general there is a need to develop software on top of these units >>> to move data to the right place. Hard to evaluate that if we are making >>> up the info on what is 'hot'. >>> - Need to allow for a bunch of 'impdef' solutions. Note that CHMU >>> in this patch set is an oracle - it has enough counters to count >>> every access. That's not realistic but it doesn't get me shouted >>> at by our architecture teams for giving away any secrets. >>> If we move forward with this, I'll probably implement a limited >>> counter + full CAM solution (also unrealistic, but closer to real) >>> I'd be very interested in contributions of other approaches (there >>> are lots in the literature, under the term top-k) >>> - Resources will be constrained, so whilst a CHMU might in theory >>> allow monitoring everything at once, that will come with a big >>> accuracy cost. We need to design the algorithms that give us >>> good data given those constraints. >>> >>> So we need a solution to explore the design space and develop the software >>> to take advantage of this hardware (there are various LSF/MM proposals >>> on how to use this an other ways of tracking hotness). >>> https://lore.kernel.org/all/20250123105721.424117-1-raghavendra.kt@amd.com/ >>> https://lore.kernel.org/lkml/Z4XUoWlU-UgRik18@gourry-fedora-PF4VCD3F/T/ >>> >>> QEMU plugins give us a way to do this. 
In particular the existing >>> Cache plugin can be easily modified to tell use what memory addresses >>> missed at the last level of emulated cache. We can then filter those >>> for the memory address range that maps to CXL and feed them to our >>> counter implementation. On the other side, each instance of CXL type 3 >>> device can connect to this server and request hotness monitoring >>> services + provide parameters etc. Elements such as list threshold >>> management and overflow detection etc are in the CXL HMU QEMU device mode. >>> As noted above, we have an alternative approach that can closely couple >>> things, so the device model registers the plugin directly and there >>> is no server. >>> >>> How to use it! >>> -------------- >>> >>> It runs a little slow but it runs and generates somewhat plausible outputs. >>> I'd definitely suggest running it with the pass through optimization >>> patch on the CXL staging tree (and a single direct connected device). >>> Your millage will vary if you try to use other parameters, or >>> hotness units beyond the first one (implementation far from complete!) >>> >>> To run start the server in contrib/hmu/ providing a port number to listen >>> on. >>> >>> ./chmu 4443 >>> >>> Then launch QEMU with something like the following. >>> >>> qemu-system-aarch64 -icount shift=1 \ >>> -plugin ../qemu/bin/native/contrib/plugins/libcache.so,port=4443,missfilterbase=1099511627776,missfiltersize=1099511627776,dcachesize=8192,dassoc=4,dblksize=64,icachesize=8192,iassoc=4,iblksize=64,l2cachesize=32768,l2assoc=16,l2blksize=64 \ >>> -M virt,ras=on,nvdimm=on,gic-version=3,cxl=on,hmat=on -m 4g,maxmem=8g,slots=4 -cpu max -smp 4 \ >>> -kernel Image \ >>> -drive if=none,file=full.qcow2,format=qcow2,id=hd \ >>> -device pcie-root-port,id=root_port1 \ >>> -device virtio-blk-pci,drive=hd,x-max-bounce-buffer-size=512k \ >>> -nographic -no-reboot -append 'earlycon memblock=debug root=/dev/vda2 fsck.mode=skip maxcpus=4 tp_printk' \ >>> -monitor telnet:127.0.0.1:1234,server,nowait -bios QEMU_EFI.fd \ >>> -object memory-backend-ram,size=4G,id=mem0 \ >>> -object memory-backend-file,id=cxl-mem1,share=on,mem-path=/tmp/t3_cxl1.raw,size=1G,align=256M \ >>> -object memory-backend-file,id=cxl-mem2,share=on,mem-path=/tmp/t3_cxl2.raw,size=1G,align=256M \ >>> -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/t3_lsa1.raw,size=1M,align=1M \ >>> -object memory-backend-file,id=cxl-mem3,share=on,mem-path=/tmp/t3_cxl3.raw,size=1G,align=256M \ >>> -object memory-backend-file,id=cxl-mem4,share=on,mem-path=/tmp/t3_cxl4.raw,size=1G,align=256M \ >>> -object memory-backend-file,id=cxl-lsa2,share=on,mem-path=/tmp/t3_lsa2.raw,size=1M,align=1M \ >>> -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1,hdm_for_passthrough=true,numa_node=0\ >>> -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \ >>> -device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,id=cxl-pmem1,lsa=cxl-lsa1,sn=3,x-speed=32,x-width=16,chmu-port=4443 \ >>> -machine cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=8G,cxl-fmw.0.interleave-granularity=1k \ >>> -numa node,nodeid=0,cpus=0-3,memdev=mem0 \ >>> -numa node,nodeid=1 \ >>> -object acpi-generic-initiator,id=bob2,pci-dev=bob,node=1 \ >>> -numa node,nodeid=2 \ >>> -object acpi-generic-port,id=bob11,pci-bus=cxl.1,node=2 \ >>> >>> In the guest, create and bind the region - this brings up the CXL memory >>> device so accesses go to the memory. 
>>> >>> cd /sys/bus/cxl/devices/decoder0.0/ >>> cat create_ram_region >>> echo region0 > create_ram_region >>> echo ram > /sys/bus/cxl/devices/decoder2.0/mode >>> echo ram > /sys/bus/cxl/devices/decoder3.0/mode >>> echo $((256 << 21)) > /sys/bus/cxl/devices/decoder2.0/dpa_size >>> cd /sys/bus/cxl/devices/region0/ >>> echo 256 > interleave_granularity >>> echo 1 > interleave_ways >>> echo $((256 << 21)) > size >>> echo decoder2.0 > target0 >>> echo 1 > commit >>> echo region0 > /sys/bus/cxl/drivers/cxl_region/bind >>> >>> Finally start perf with something like: >>> >>> ./perf record -a -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\ >>> hotness_threshold=635,epoch_multiplier=4,epoch_scale=4,\ >>> range_base=0,range_size=4096/ ./stress.sh >>> >>> where stress.sh is >>> >>> sleep 2 >>> numactl --membind 3 stress-ng --vm 1 --vm-bytes 1M --vm-keep -t 5s >>> sleep 2 >>> >>> See the results with >>> ./perf report --dump-raw-trace | grep -A 200 HMU >>> >>> Enjoy and have a good weekend. >>> >>> Thanks, >>> >>> Jonathan >>> >>> Jonathan Cameron (3): >>> hw/cxl: Initial CXL Hotness Monitoring Unit Emulation >>> plugins: Add cache miss reporting over a socket. >>> contrib: Add example hotness monitoring unit server >>> >>> include/hw/cxl/cxl.h | 1 + >>> include/hw/cxl/cxl_chmu.h | 154 ++++++++++++ >>> include/hw/cxl/cxl_device.h | 13 +- >>> include/hw/cxl/cxl_pci.h | 7 +- >>> contrib/hmu/hmu.c | 312 ++++++++++++++++++++++++ >>> contrib/plugins/cache.c | 75 +++++- >>> hw/cxl/cxl-chmu.c | 459 ++++++++++++++++++++++++++++++++++++ >>> hw/mem/cxl_type3.c | 25 +- >>> hw/cxl/meson.build | 1 + >>> 9 files changed, 1035 insertions(+), 12 deletions(-) >>> create mode 100644 include/hw/cxl/cxl_chmu.h >>> create mode 100644 contrib/hmu/hmu.c >>> create mode 100644 hw/cxl/cxl-chmu.c >>> >> > ^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH QEMU 0/3] cxl/plugins: Hotness Monitoring Unit with 'real' data. 2025-01-28 20:04 ` Pierrick Bouvier @ 2025-01-29 10:29 ` Jonathan Cameron via 2025-01-29 22:31 ` Pierrick Bouvier 0 siblings, 1 reply; 13+ messages in thread From: Jonathan Cameron via @ 2025-01-29 10:29 UTC (permalink / raw) To: Pierrick Bouvier Cc: fan.ni, linux-cxl, qemu-devel, Alex Bennée, Alexandre Iooss, Mahmoud Mandour, linuxarm, Niyas Sait On Tue, 28 Jan 2025 12:04:19 -0800 Pierrick Bouvier <pierrick.bouvier@linaro.org> wrote: > On 1/27/25 02:20, Jonathan Cameron wrote: > > On Fri, 24 Jan 2025 12:55:52 -0800 > > Pierrick Bouvier <pierrick.bouvier@linaro.org> wrote: > > > >> Hi Jonathan, > >> > >> thanks for posting this. It's a creative usage of plugins. > >> > >> I think that your current approach, decoupling plugins, CHMU and device > >> model is a good thing. > >> > >> I'm not familiar with CXL, but one question that comes to my mind is: > >> Is that mandatory to do this analysis during execution (vs dumping > >> binary traces from CHMU and plugin and running an analysis post execution)? > > > > Short answer is that post run analysis isn't of much use for developing the OS > > software story. It works to some degree if you are designing the tracking > > hardware or algorithms to use that hardware capture a snapshot of hotness - > > dealing with lack of counters, that sort of thing. > > > > The main intent of this support is to drive live usage of the data in the OS. > > So it gets this hotness information and migrates more frequently accessed memory > > to a 'nearer'/lower latency memory node. > > > > From an OS point of view there will be two ways it uses it: > > 1) Offline application optimization - that aligns with your suggestion of offline > > analysis but would typically still need to be live because we have to do > > the reverse maps and work out what was allocated in particular locations. > > Not impossible to dump that information from QEMU + the guest OS but the usage > > flow would then look quite different from what makes sense on real hardware > > where all the data is available to the host OS directly. > > 2) Migration of memory. This will dynamically change the PA backing a VA whilst > > applications are running. The aim being to develop how that happens, we need > > the dynamic state. > > > > In the end, are you modeling how the real CHMU will work, or simply > gathering data to help designing it (number of counters, line size, ...)? This work is modeling how a real (ish) CHMU will work - particular interest being use in Linux kernel usecases. Otherwise we wouldn't share! :) For CHMU hardware design, until people reach the live algorithms in the loop stage, tracing techniques and offline analysis tend to be easier to use. A annoying corner is that the implementations in QEMU will 'probably' remain simplistic because the detailed designs of CHMUs may be considered sensitive. It's a complex space and there are some really interesting and to me surprising approaches. What we can implement should be good enough for working out the basics of a general software stack but possible it will need tuning against specific implementations. Maybe that necessity will result in more openness on the parts of various uarch / arch teams. There are some academic works on how to build these trackers, and there should be less sensitivity around those. 
This is perhaps an odd corner for QEMU because we are emulating an interface accurately but the hardware behind it intentionally does not have a specification defined implementation and the unusual bit is that implementation affects the output. We can implement a few options that are well defined though. 1) An Oracle ('infinite' counters) 2) Limited counters allocated on first touch in a given epoch (sampling period). 1 is useful for putting an upper bound on data accuracy. 2 is a typical first thing people will look at when considering a implementation. So conclusion. This is about enabling software development, not tuning a hardware design. Jonathan > > Pierrick > > > Jonathan > > > >> > >> Regards, > >> Pierrick > >> > >> On 1/24/25 09:29, Jonathan Cameron wrote: > >>> Hi All, > >>> > >>> This is an RFC mainly to seek feedback on the approach used, particularly > >>> the aspect of how to get data from a TCG plugin into a device model. > >>> Two options that we have tried > >>> 1. Socket over which the plugin sends data to an external server > >>> (as seen here) > >>> 2. Register and manage a plugin from within a device model > >>> > >>> The external server approach keeps things loosely coupled, but at the cost > >>> of separately maintaining that server, protocol definitions etc and > >>> some overhead. > >>> The closely couple solution is neater, but I suspect might be controversial > >>> (hence I didn't start with that :) > >>> > >>> The code here is at best a PoC to illustrate what we have in mind > >>> It's not nice code at all, feature gaps, bugs and all! So whilst > >>> review is always welcome I'm not requesting it for now. > >>> > >>> Kernel support was posted a while back but was done against fake data > >>> (still supported here if you don't provide the port parameter to the type3 device) > >>> https://lore.kernel.org/linux-cxl/20241121101845.1815660-1-Jonathan.Cameron@huawei.com/ > >>> I'll post a minor update of that driver shortly to take into account > >>> a few specification clarifications but it should work with this without > >>> those. > >>> > >>> Note there are some other patches on the tree I generated this from > >>> so this may not apply to upstream. Easiest is probably to test > >>> using gitlab.com/jic23/qemu cxl-2025-01-24 > >>> > >>> Thanks to Niyas for his suggestions on how to make all this work! > >>> > >>> Background > >>> ---------- > >>> > >>> What is the Compute eXpress Link Hotness Monitoring unit and what is it for? > >>> - In a tiered memory equipped server with the slow tier being attached via > >>> CXL the expectation is a given workload will benefit from putting data > >>> that is frequently fetched from memory in lower latency directly attached > >>> DRAM. Less frequently used data can be served from the CXL attached memory > >>> with no significant loss of performance. Any data that is hot enough to > >>> almost always be in cache doesn't matter as it is only fetch from memory > >>> occasionally. > >>> - Working out which memory is best places where is hard to do and in some > >>> workloads a dynamic problem. As such we need something we can measure > >>> to provide some indication of what data is in the wrong place. > >>> There are existing techniques to do this (page faulting, various > >>> CPU tracing systems, access bit scanning etc) but they all have significant > >>> overheads. > >>> - Monitoring accesses on the CXL device provides a path to getting good > >>> data without those overheads. 
These units are known as CXL Hotness > >>> Monitoring Units or CHMUs. Loosely speaking they count accesses to > >>> granuals of data (e.g. 4KiB pages). Exactly how they do that and > >>> where they sacrifice data accuracy is an implementation trade off. > >>> > >>> Why do we need a model that gives real data? > >>> - In general there is a need to develop software on top of these units > >>> to move data to the right place. Hard to evaluate that if we are making > >>> up the info on what is 'hot'. > >>> - Need to allow for a bunch of 'impdef' solutions. Note that CHMU > >>> in this patch set is an oracle - it has enough counters to count > >>> every access. That's not realistic but it doesn't get me shouted > >>> at by our architecture teams for giving away any secrets. > >>> If we move forward with this, I'll probably implement a limited > >>> counter + full CAM solution (also unrealistic, but closer to real) > >>> I'd be very interested in contributions of other approaches (there > >>> are lots in the literature, under the term top-k) > >>> - Resources will be constrained, so whilst a CHMU might in theory > >>> allow monitoring everything at once, that will come with a big > >>> accuracy cost. We need to design the algorithms that give us > >>> good data given those constraints. > >>> > >>> So we need a solution to explore the design space and develop the software > >>> to take advantage of this hardware (there are various LSF/MM proposals > >>> on how to use this an other ways of tracking hotness). > >>> https://lore.kernel.org/all/20250123105721.424117-1-raghavendra.kt@amd.com/ > >>> https://lore.kernel.org/lkml/Z4XUoWlU-UgRik18@gourry-fedora-PF4VCD3F/T/ > >>> > >>> QEMU plugins give us a way to do this. In particular the existing > >>> Cache plugin can be easily modified to tell use what memory addresses > >>> missed at the last level of emulated cache. We can then filter those > >>> for the memory address range that maps to CXL and feed them to our > >>> counter implementation. On the other side, each instance of CXL type 3 > >>> device can connect to this server and request hotness monitoring > >>> services + provide parameters etc. Elements such as list threshold > >>> management and overflow detection etc are in the CXL HMU QEMU device mode. > >>> As noted above, we have an alternative approach that can closely couple > >>> things, so the device model registers the plugin directly and there > >>> is no server. > >>> > >>> How to use it! > >>> -------------- > >>> > >>> It runs a little slow but it runs and generates somewhat plausible outputs. > >>> I'd definitely suggest running it with the pass through optimization > >>> patch on the CXL staging tree (and a single direct connected device). > >>> Your millage will vary if you try to use other parameters, or > >>> hotness units beyond the first one (implementation far from complete!) > >>> > >>> To run start the server in contrib/hmu/ providing a port number to listen > >>> on. > >>> > >>> ./chmu 4443 > >>> > >>> Then launch QEMU with something like the following. 
> >>> > >>> qemu-system-aarch64 -icount shift=1 \ > >>> -plugin ../qemu/bin/native/contrib/plugins/libcache.so,port=4443,missfilterbase=1099511627776,missfiltersize=1099511627776,dcachesize=8192,dassoc=4,dblksize=64,icachesize=8192,iassoc=4,iblksize=64,l2cachesize=32768,l2assoc=16,l2blksize=64 \ > >>> -M virt,ras=on,nvdimm=on,gic-version=3,cxl=on,hmat=on -m 4g,maxmem=8g,slots=4 -cpu max -smp 4 \ > >>> -kernel Image \ > >>> -drive if=none,file=full.qcow2,format=qcow2,id=hd \ > >>> -device pcie-root-port,id=root_port1 \ > >>> -device virtio-blk-pci,drive=hd,x-max-bounce-buffer-size=512k \ > >>> -nographic -no-reboot -append 'earlycon memblock=debug root=/dev/vda2 fsck.mode=skip maxcpus=4 tp_printk' \ > >>> -monitor telnet:127.0.0.1:1234,server,nowait -bios QEMU_EFI.fd \ > >>> -object memory-backend-ram,size=4G,id=mem0 \ > >>> -object memory-backend-file,id=cxl-mem1,share=on,mem-path=/tmp/t3_cxl1.raw,size=1G,align=256M \ > >>> -object memory-backend-file,id=cxl-mem2,share=on,mem-path=/tmp/t3_cxl2.raw,size=1G,align=256M \ > >>> -object memory-backend-file,id=cxl-lsa1,share=on,mem-path=/tmp/t3_lsa1.raw,size=1M,align=1M \ > >>> -object memory-backend-file,id=cxl-mem3,share=on,mem-path=/tmp/t3_cxl3.raw,size=1G,align=256M \ > >>> -object memory-backend-file,id=cxl-mem4,share=on,mem-path=/tmp/t3_cxl4.raw,size=1G,align=256M \ > >>> -object memory-backend-file,id=cxl-lsa2,share=on,mem-path=/tmp/t3_lsa2.raw,size=1M,align=1M \ > >>> -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1,hdm_for_passthrough=true,numa_node=0\ > >>> -device cxl-rp,port=0,bus=cxl.1,id=cxl_rp_port0,chassis=0,slot=2 \ > >>> -device cxl-type3,bus=cxl_rp_port0,volatile-memdev=cxl-mem1,id=cxl-pmem1,lsa=cxl-lsa1,sn=3,x-speed=32,x-width=16,chmu-port=4443 \ > >>> -machine cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=8G,cxl-fmw.0.interleave-granularity=1k \ > >>> -numa node,nodeid=0,cpus=0-3,memdev=mem0 \ > >>> -numa node,nodeid=1 \ > >>> -object acpi-generic-initiator,id=bob2,pci-dev=bob,node=1 \ > >>> -numa node,nodeid=2 \ > >>> -object acpi-generic-port,id=bob11,pci-bus=cxl.1,node=2 \ > >>> > >>> In the guest, create and bind the region - this brings up the CXL memory > >>> device so accesses go to the memory. > >>> > >>> cd /sys/bus/cxl/devices/decoder0.0/ > >>> cat create_ram_region > >>> echo region0 > create_ram_region > >>> echo ram > /sys/bus/cxl/devices/decoder2.0/mode > >>> echo ram > /sys/bus/cxl/devices/decoder3.0/mode > >>> echo $((256 << 21)) > /sys/bus/cxl/devices/decoder2.0/dpa_size > >>> cd /sys/bus/cxl/devices/region0/ > >>> echo 256 > interleave_granularity > >>> echo 1 > interleave_ways > >>> echo $((256 << 21)) > size > >>> echo decoder2.0 > target0 > >>> echo 1 > commit > >>> echo region0 > /sys/bus/cxl/drivers/cxl_region/bind > >>> > >>> Finally start perf with something like: > >>> > >>> ./perf record -a -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\ > >>> hotness_threshold=635,epoch_multiplier=4,epoch_scale=4,\ > >>> range_base=0,range_size=4096/ ./stress.sh > >>> > >>> where stress.sh is > >>> > >>> sleep 2 > >>> numactl --membind 3 stress-ng --vm 1 --vm-bytes 1M --vm-keep -t 5s > >>> sleep 2 > >>> > >>> See the results with > >>> ./perf report --dump-raw-trace | grep -A 200 HMU > >>> > >>> Enjoy and have a good weekend. > >>> > >>> Thanks, > >>> > >>> Jonathan > >>> > >>> Jonathan Cameron (3): > >>> hw/cxl: Initial CXL Hotness Monitoring Unit Emulation > >>> plugins: Add cache miss reporting over a socket. 
> >>> contrib: Add example hotness monitoring unit server > >>> > >>> include/hw/cxl/cxl.h | 1 + > >>> include/hw/cxl/cxl_chmu.h | 154 ++++++++++++ > >>> include/hw/cxl/cxl_device.h | 13 +- > >>> include/hw/cxl/cxl_pci.h | 7 +- > >>> contrib/hmu/hmu.c | 312 ++++++++++++++++++++++++ > >>> contrib/plugins/cache.c | 75 +++++- > >>> hw/cxl/cxl-chmu.c | 459 ++++++++++++++++++++++++++++++++++++ > >>> hw/mem/cxl_type3.c | 25 +- > >>> hw/cxl/meson.build | 1 + > >>> 9 files changed, 1035 insertions(+), 12 deletions(-) > >>> create mode 100644 include/hw/cxl/cxl_chmu.h > >>> create mode 100644 contrib/hmu/hmu.c > >>> create mode 100644 hw/cxl/cxl-chmu.c > >>> > >> > > > ^ permalink raw reply [flat|nested] 13+ messages in thread
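For reference, a minimal sketch of option 2 above: a fixed pool of counters, where a granual gets a counter allocated on its first access within the epoch and accesses that arrive after the pool is exhausted are simply dropped, which is the accuracy trade-off a non-oracle CHMU has to make. This is not the oracle implementation in this series; the pool size, the 4KiB granual shift and the linear-probe lookup are arbitrary choices for illustration, and threshold comparison plus hotlist and overflow reporting are left to the device-model side.

/* Hypothetical first-touch counter pool (option 2), not the series' oracle. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NUM_COUNTERS 1024              /* the constrained resource */

struct hot_counter {
    bool used;
    uint64_t granual;                  /* e.g. paddr >> 12 for 4KiB granuals */
    uint64_t count;
};

static struct hot_counter pool[NUM_COUNTERS];

/* Called for each qualifying access, e.g. an LLC miss in the CXL range. */
static void chmu_record_access(uint64_t paddr)
{
    uint64_t granual = paddr >> 12;
    size_t start = granual % NUM_COUNTERS;

    for (size_t i = 0; i < NUM_COUNTERS; i++) {
        struct hot_counter *c = &pool[(start + i) % NUM_COUNTERS];

        if (c->used && c->granual == granual) {
            c->count++;                /* already being tracked this epoch */
            return;
        }
        if (!c->used) {
            c->used = true;            /* first touch in this epoch */
            c->granual = granual;
            c->count = 1;
            return;
        }
    }
    /* Pool exhausted: this access is invisible to the hotness statistics. */
}

/* At the end of an epoch: report counters over the hotness threshold, then
 * reset for the next epoch (reporting itself is omitted here). */
static void chmu_end_epoch(void)
{
    memset(pool, 0, sizeof(pool));
}

Swapping this for the oracle, or for one of the top-k structures from the literature, would only change chmu_record_access(); the list threshold and overflow handling already live on the QEMU CHMU device-model side.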
* Re: [RFC PATCH QEMU 0/3] cxl/plugins: Hotness Monitoring Unit with 'real' data. 2025-01-29 10:29 ` Jonathan Cameron via @ 2025-01-29 22:31 ` Pierrick Bouvier 2025-01-30 15:52 ` Jonathan Cameron via 0 siblings, 1 reply; 13+ messages in thread From: Pierrick Bouvier @ 2025-01-29 22:31 UTC (permalink / raw) To: Jonathan Cameron Cc: fan.ni, linux-cxl, qemu-devel, Alex Bennée, Alexandre Iooss, Mahmoud Mandour, linuxarm, Niyas Sait Hi Jonathan, On 1/29/25 02:29, Jonathan Cameron wrote: > On Tue, 28 Jan 2025 12:04:19 -0800 > Pierrick Bouvier <pierrick.bouvier@linaro.org> wrote: > >> On 1/27/25 02:20, Jonathan Cameron wrote: >>> On Fri, 24 Jan 2025 12:55:52 -0800 >>> Pierrick Bouvier <pierrick.bouvier@linaro.org> wrote: >>> >>>> Hi Jonathan, >>>> >>>> thanks for posting this. It's a creative usage of plugins. >>>> >>>> I think that your current approach, decoupling plugins, CHMU and device >>>> model is a good thing. >>>> >>>> I'm not familiar with CXL, but one question that comes to my mind is: >>>> Is that mandatory to do this analysis during execution (vs dumping >>>> binary traces from CHMU and plugin and running an analysis post execution)? >>> >>> Short answer is that post run analysis isn't of much use for developing the OS >>> software story. It works to some degree if you are designing the tracking >>> hardware or algorithms to use that hardware capture a snapshot of hotness - >>> dealing with lack of counters, that sort of thing. >>> >>> The main intent of this support is to drive live usage of the data in the OS. >>> So it gets this hotness information and migrates more frequently accessed memory >>> to a 'nearer'/lower latency memory node. >>> >>> From an OS point of view there will be two ways it uses it: >>> 1) Offline application optimization - that aligns with your suggestion of offline >>> analysis but would typically still need to be live because we have to do >>> the reverse maps and work out what was allocated in particular locations. >>> Not impossible to dump that information from QEMU + the guest OS but the usage >>> flow would then look quite different from what makes sense on real hardware >>> where all the data is available to the host OS directly. >>> 2) Migration of memory. This will dynamically change the PA backing a VA whilst >>> applications are running. The aim being to develop how that happens, we need >>> the dynamic state. >>> >> >> In the end, are you modeling how the real CHMU will work, or simply >> gathering data to help designing it (number of counters, line size, ...)? > > This work is modeling how a real (ish) CHMU will work - particular interest being > use in Linux kernel usecases. Otherwise we wouldn't share! :) > > For CHMU hardware design, until people reach the live algorithms in the loop > stage, tracing techniques and offline analysis tend to be easier to use. > > A annoying corner is that the implementations in QEMU will 'probably' remain > simplistic because the detailed designs of CHMUs may be considered sensitive. > It's a complex space and there are some really interesting and to me surprising > approaches. > > What we can implement should be good enough for working out the basics of a > general software stack but possible it will need tuning against specific > implementations. Maybe that necessity will result in more openness on the > parts of various uarch / arch teams. > > There are some academic works on how to build these trackers, and there should > be less sensitivity around those. 
> > This is perhaps an odd corner for QEMU because we are emulating an interface > accurately but the hardware behind it intentionally does not have a specification > defined implementation and the unusual bit is that implementation affects > the output. We can implement a few options that are well defined though. > 1) An Oracle ('infinite' counters) > 2) Limited counters allocated on first touch in a given epoch (sampling period). > > 1 is useful for putting an upper bound on data accuracy. > 2 is a typical first thing people will look at when considering a implementation. > > So conclusion. This is about enabling software development, not tuning a hardware > design. > Ok, thanks for the clarification. Considering the approach you followed, as said before, choosing a decoupled solution is the right choice. Plugins should not allow to access internal details of QEMU implementation as a general rule. Did you think about integrating the server directly in the plugin? From what I understand, the CHMU will contact the server on a per request basis, while instrumentation will contact it for every access. Beyond communication, the biggest overhead here is to have instrumentation on all memory accesses. Another thing that comes out of my mind is to do a sampling instrumentation. However, it's not easy to do, because the current API does not allow to force a new translation of existing TB, and at translation time, you have no clue whether or not this will be a hot TB. What you can do though, is to ignore most of the accesses, and only send information every 1000 memory accesses. (Note: I have no idea if 1000 is a good threshold :)). It would allow you to skip most of the overhead related to communication and hot pages management. From what I understood about using plugins, the goal is to track hot pages. Using the cache one is a possibility, but it might be better to use contrib/plugins/hotpages.c instead, or better, a custom simple plugin, simply reporting reads/writes to the server (and implementing sampling as well). Which slow-down factor (order of magnitude) do you have with this series? > Jonathan > >> >> Pierrick >> >>> Jonathan >>> >>>> >>>> Regards, >>>> Pierrick >>>> >>>> On 1/24/25 09:29, Jonathan Cameron wrote: >>>>> Hi All, >>>>> >>>>> This is an RFC mainly to seek feedback on the approach used, particularly >>>>> the aspect of how to get data from a TCG plugin into a device model. >>>>> Two options that we have tried >>>>> 1. Socket over which the plugin sends data to an external server >>>>> (as seen here) >>>>> 2. Register and manage a plugin from within a device model >>>>> >>>>> The external server approach keeps things loosely coupled, but at the cost >>>>> of separately maintaining that server, protocol definitions etc and >>>>> some overhead. >>>>> The closely couple solution is neater, but I suspect might be controversial >>>>> (hence I didn't start with that :) >>>>> >>>>> The code here is at best a PoC to illustrate what we have in mind >>>>> It's not nice code at all, feature gaps, bugs and all! So whilst >>>>> review is always welcome I'm not requesting it for now. 
>>>>> >>>>> Kernel support was posted a while back but was done against fake data >>>>> (still supported here if you don't provide the port parameter to the type3 device) >>>>> https://lore.kernel.org/linux-cxl/20241121101845.1815660-1-Jonathan.Cameron@huawei.com/ >>>>> I'll post a minor update of that driver shortly to take into account >>>>> a few specification clarifications but it should work with this without >>>>> those. >>>>> >>>>> Note there are some other patches on the tree I generated this from >>>>> so this may not apply to upstream. Easiest is probably to test >>>>> using gitlab.com/jic23/qemu cxl-2025-01-24 >>>>> >>>>> Thanks to Niyas for his suggestions on how to make all this work! >>>>> >>>>> Background >>>>> ---------- >>>>> >>>>> What is the Compute eXpress Link Hotness Monitoring unit and what is it for? >>>>> - In a tiered memory equipped server with the slow tier being attached via >>>>> CXL the expectation is a given workload will benefit from putting data >>>>> that is frequently fetched from memory in lower latency directly attached >>>>> DRAM. Less frequently used data can be served from the CXL attached memory >>>>> with no significant loss of performance. Any data that is hot enough to >>>>> almost always be in cache doesn't matter as it is only fetch from memory >>>>> occasionally. >>>>> - Working out which memory is best places where is hard to do and in some >>>>> workloads a dynamic problem. As such we need something we can measure >>>>> to provide some indication of what data is in the wrong place. >>>>> There are existing techniques to do this (page faulting, various >>>>> CPU tracing systems, access bit scanning etc) but they all have significant >>>>> overheads. >>>>> - Monitoring accesses on the CXL device provides a path to getting good >>>>> data without those overheads. These units are known as CXL Hotness >>>>> Monitoring Units or CHMUs. Loosely speaking they count accesses to >>>>> granuals of data (e.g. 4KiB pages). Exactly how they do that and >>>>> where they sacrifice data accuracy is an implementation trade off. >>>>> >>>>> Why do we need a model that gives real data? >>>>> - In general there is a need to develop software on top of these units >>>>> to move data to the right place. Hard to evaluate that if we are making >>>>> up the info on what is 'hot'. >>>>> - Need to allow for a bunch of 'impdef' solutions. Note that CHMU >>>>> in this patch set is an oracle - it has enough counters to count >>>>> every access. That's not realistic but it doesn't get me shouted >>>>> at by our architecture teams for giving away any secrets. >>>>> If we move forward with this, I'll probably implement a limited >>>>> counter + full CAM solution (also unrealistic, but closer to real) >>>>> I'd be very interested in contributions of other approaches (there >>>>> are lots in the literature, under the term top-k) >>>>> - Resources will be constrained, so whilst a CHMU might in theory >>>>> allow monitoring everything at once, that will come with a big >>>>> accuracy cost. We need to design the algorithms that give us >>>>> good data given those constraints. >>>>> >>>>> So we need a solution to explore the design space and develop the software >>>>> to take advantage of this hardware (there are various LSF/MM proposals >>>>> on how to use this an other ways of tracking hotness). 
>>>>> [snip: remainder of quoted cover letter]

^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [RFC PATCH QEMU 0/3] cxl/plugins: Hotness Monitoring Unit with 'real' data. 2025-01-29 22:31 ` Pierrick Bouvier @ 2025-01-30 15:52 ` Jonathan Cameron via 2025-01-30 18:28 ` Pierrick Bouvier 0 siblings, 1 reply; 13+ messages in thread From: Jonathan Cameron via @ 2025-01-30 15:52 UTC (permalink / raw) To: Pierrick Bouvier Cc: fan.ni, linux-cxl, qemu-devel, Alex Bennée, Alexandre Iooss, Mahmoud Mandour, linuxarm, Niyas Sait On Wed, 29 Jan 2025 14:31:10 -0800 Pierrick Bouvier <pierrick.bouvier@linaro.org> wrote: > Hi Jonathan, > > On 1/29/25 02:29, Jonathan Cameron wrote: > > On Tue, 28 Jan 2025 12:04:19 -0800 > > Pierrick Bouvier <pierrick.bouvier@linaro.org> wrote: > > > >> On 1/27/25 02:20, Jonathan Cameron wrote: > >>> On Fri, 24 Jan 2025 12:55:52 -0800 > >>> Pierrick Bouvier <pierrick.bouvier@linaro.org> wrote: > >>> > >>>> Hi Jonathan, > >>>> > >>>> thanks for posting this. It's a creative usage of plugins. > >>>> > >>>> I think that your current approach, decoupling plugins, CHMU and device > >>>> model is a good thing. > >>>> > >>>> I'm not familiar with CXL, but one question that comes to my mind is: > >>>> Is that mandatory to do this analysis during execution (vs dumping > >>>> binary traces from CHMU and plugin and running an analysis post execution)? > >>> > >>> Short answer is that post run analysis isn't of much use for developing the OS > >>> software story. It works to some degree if you are designing the tracking > >>> hardware or algorithms to use that hardware capture a snapshot of hotness - > >>> dealing with lack of counters, that sort of thing. > >>> > >>> The main intent of this support is to drive live usage of the data in the OS. > >>> So it gets this hotness information and migrates more frequently accessed memory > >>> to a 'nearer'/lower latency memory node. > >>> > >>> From an OS point of view there will be two ways it uses it: > >>> 1) Offline application optimization - that aligns with your suggestion of offline > >>> analysis but would typically still need to be live because we have to do > >>> the reverse maps and work out what was allocated in particular locations. > >>> Not impossible to dump that information from QEMU + the guest OS but the usage > >>> flow would then look quite different from what makes sense on real hardware > >>> where all the data is available to the host OS directly. > >>> 2) Migration of memory. This will dynamically change the PA backing a VA whilst > >>> applications are running. The aim being to develop how that happens, we need > >>> the dynamic state. > >>> > >> > >> In the end, are you modeling how the real CHMU will work, or simply > >> gathering data to help designing it (number of counters, line size, ...)? > > > > This work is modeling how a real (ish) CHMU will work - particular interest being > > use in Linux kernel usecases. Otherwise we wouldn't share! :) > > > > For CHMU hardware design, until people reach the live algorithms in the loop > > stage, tracing techniques and offline analysis tend to be easier to use. > > > > A annoying corner is that the implementations in QEMU will 'probably' remain > > simplistic because the detailed designs of CHMUs may be considered sensitive. > > It's a complex space and there are some really interesting and to me surprising > > approaches. > > > > What we can implement should be good enough for working out the basics of a > > general software stack but possible it will need tuning against specific > > implementations. 
Maybe that necessity will result in more openness on the > > parts of various uarch / arch teams. > > > > There are some academic works on how to build these trackers, and there should > > be less sensitivity around those. > > > > This is perhaps an odd corner for QEMU because we are emulating an interface > > accurately but the hardware behind it intentionally does not have a specification > > defined implementation and the unusual bit is that implementation affects > > the output. We can implement a few options that are well defined though. > > 1) An Oracle ('infinite' counters) > > 2) Limited counters allocated on first touch in a given epoch (sampling period). > > > > 1 is useful for putting an upper bound on data accuracy. > > 2 is a typical first thing people will look at when considering a implementation. > > > > So conclusion. This is about enabling software development, not tuning a hardware > > design. > > > > Ok, thanks for the clarification. > > Considering the approach you followed, as said before, choosing a > decoupled solution is the right choice. Plugins should not allow to > access internal details of QEMU implementation as a general rule. > > Did you think about integrating the server directly in the plugin? From > what I understand, the CHMU will contact the server on a per request > basis, while instrumentation will contact it for every access. Definitely a possibility. I was a little nervous that would put ordering constraints on plugin load and device creation. I suppose we can connect when the device is enabled by the OS though so that should be easily avoided (if it is a problem at all!). The other thought was that we'd do some of the data handling asynchronously, but that hasn't happened yet and maybe never will + maybe we can do that in a plugin anyway. The CHMU device model is responsible for timing signals, so it will signal a regular tick (and read back the buffer state to identify overflow etc). So not quite on request but a lot lower data rate than the plugin to CHMU data flow. > > Beyond communication, the biggest overhead here is to have > instrumentation on all memory accesses. Another thing that comes out of > my mind is to do a sampling instrumentation. However, it's not easy to > do, because the current API does not allow to force a new translation of > existing TB, and at translation time, you have no clue whether or not > this will be a hot TB. Whilst we will have sampling controls (they are part of the spec, both fixed interval and pseudo random I just haven't wired them up yet), typically sampling and the noise it creates is one of the causes of bad numbers with hot page tracking. We go to the effort to capture as much data as possible in a short time so that we can move on to focusing the limited tracking resources on another range of the memory. Pushing knowledge of the sampling to the plugin is definitely an option, but we'd still want full resolution hitting the cache simulator as what we want to sample is what misses in the cache. > What you can do though, is to ignore most of the accesses, and only send > information every 1000 memory accesses. (Note: I have no idea if 1000 is > a good threshold :)). It would allow you to skip most of the overhead > related to communication and hot pages management. Definitely will implement that as an option when I wire up the rest of the control interface. 
It can only happen with the knowledge of the OS as it is another thing that will be controlled (or at least taken into account if the device implements only fixed subsampling). > > From what I understood about using plugins, the goal is to track hot > pages. Using the cache one is a possibility, but it might be better to > use contrib/plugins/hotpages.c instead, or better, a custom simple > plugin, simply reporting reads/writes to the server (and implementing > sampling as well). We need the cache part. Without that the data ends up quite different from what needs to be measured. Data that is in cache and so not fetched much from main memory as a result is (perhaps confusingly) cold. No point in moving it to faster memory and from CXL point of view we never see the accesses anyway so couldn't if we wanted to. Cache simulation is not perfect as we don't have any simulation of prefetchers etc though not sure how much that will hurt us yet. Current cache plugin is obviously simplistic but it is a reasonable starting point. Also, in some more interesting cases granual it doesn't align with pages (interleaving across multiple devices, much larger regions than pages) so we'd need to switch back to something like this later anyway. If bringing the hotness monitoring data capture into the plugin it probably makes sense to fork the code to a new plugin anyway as that will add a lot of complexity wherever we add it. > > Which slow-down factor (order of magnitude) do you have with this series? I've not measured it accurately but not too bad. Maybe 10-20x slow down over TCG (when running arm64 on x86 host). Few minutes to boot a standard distro. Jonathan > > > Jonathan > > > >> > >> Pierrick > >> > >>> Jonathan > >>> > >>>> > >>>> Regards, > >>>> Pierrick > >>>> > >>>> On 1/24/25 09:29, Jonathan Cameron wrote: > >>>>> Hi All, > >>>>> > >>>>> This is an RFC mainly to seek feedback on the approach used, particularly > >>>>> the aspect of how to get data from a TCG plugin into a device model. > >>>>> Two options that we have tried > >>>>> 1. Socket over which the plugin sends data to an external server > >>>>> (as seen here) > >>>>> 2. Register and manage a plugin from within a device model > >>>>> > >>>>> The external server approach keeps things loosely coupled, but at the cost > >>>>> of separately maintaining that server, protocol definitions etc and > >>>>> some overhead. > >>>>> The closely couple solution is neater, but I suspect might be controversial > >>>>> (hence I didn't start with that :) > >>>>> > >>>>> The code here is at best a PoC to illustrate what we have in mind > >>>>> It's not nice code at all, feature gaps, bugs and all! So whilst > >>>>> review is always welcome I'm not requesting it for now. > >>>>> > >>>>> Kernel support was posted a while back but was done against fake data > >>>>> (still supported here if you don't provide the port parameter to the type3 device) > >>>>> https://lore.kernel.org/linux-cxl/20241121101845.1815660-1-Jonathan.Cameron@huawei.com/ > >>>>> I'll post a minor update of that driver shortly to take into account > >>>>> a few specification clarifications but it should work with this without > >>>>> those. > >>>>> > >>>>> Note there are some other patches on the tree I generated this from > >>>>> so this may not apply to upstream. Easiest is probably to test > >>>>> using gitlab.com/jic23/qemu cxl-2025-01-24 > >>>>> > >>>>> Thanks to Niyas for his suggestions on how to make all this work! 
> >>>>> [snip: remainder of quoted cover letter]

^ permalink raw reply [flat|nested] 13+ messages in thread
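
As a rough illustration of the plugin-to-server flow discussed in this message, here is a minimal C sketch of the kind of last-level-miss reporting the modified cache plugin could do. The report_miss() and hmu_connect() helpers, the loopback TCP connection and the one-uint64-per-miss wire format are assumptions for illustration only; the actual hook point and protocol are whatever contrib/plugins/cache.c and contrib/hmu/hmu.c in the series define.

    /* Sketch only: filter LLC misses to the CXL window and forward them
     * to the chmu server, one 64-bit physical address per miss. */
    #include <stdint.h>
    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    static int hmu_fd = -1;
    static uint64_t miss_filter_base;   /* missfilterbase from the command line */
    static uint64_t miss_filter_size;   /* missfiltersize from the command line */

    static int hmu_connect(uint16_t port)
    {
        struct sockaddr_in addr = {
            .sin_family = AF_INET,
            .sin_port = htons(port),
            .sin_addr.s_addr = htonl(INADDR_LOOPBACK),
        };

        hmu_fd = socket(AF_INET, SOCK_STREAM, 0);
        if (hmu_fd < 0) {
            return -1;
        }
        return connect(hmu_fd, (struct sockaddr *)&addr, sizeof(addr));
    }

    /* Called by the cache model whenever an access misses the last level. */
    static void report_miss(uint64_t paddr)
    {
        if (hmu_fd < 0 ||
            paddr < miss_filter_base ||
            paddr >= miss_filter_base + miss_filter_size) {
            return;   /* not CXL backed, a real CHMU would never see it */
        }
        send(hmu_fd, &paddr, sizeof(paddr), 0);
    }

Connecting lazily, e.g. on the first reported miss rather than at plugin install, also sidesteps the ordering concern between plugin load and device creation mentioned above.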
* Re: [RFC PATCH QEMU 0/3] cxl/plugins: Hotness Monitoring Unit with 'real' data. 2025-01-30 15:52 ` Jonathan Cameron via @ 2025-01-30 18:28 ` Pierrick Bouvier 2025-01-31 11:15 ` Jonathan Cameron via 0 siblings, 1 reply; 13+ messages in thread From: Pierrick Bouvier @ 2025-01-30 18:28 UTC (permalink / raw) To: Jonathan Cameron Cc: fan.ni, linux-cxl, qemu-devel, Alex Bennée, Alexandre Iooss, Mahmoud Mandour, linuxarm, Niyas Sait On 1/30/25 07:52, Jonathan Cameron wrote: > On Wed, 29 Jan 2025 14:31:10 -0800 > Pierrick Bouvier <pierrick.bouvier@linaro.org> wrote: > >> Hi Jonathan, >> >> On 1/29/25 02:29, Jonathan Cameron wrote: >>> On Tue, 28 Jan 2025 12:04:19 -0800 >>> Pierrick Bouvier <pierrick.bouvier@linaro.org> wrote: >>> >>>> On 1/27/25 02:20, Jonathan Cameron wrote: >>>>> On Fri, 24 Jan 2025 12:55:52 -0800 >>>>> Pierrick Bouvier <pierrick.bouvier@linaro.org> wrote: >>>>> >>>>>> Hi Jonathan, >>>>>> >>>>>> thanks for posting this. It's a creative usage of plugins. >>>>>> >>>>>> I think that your current approach, decoupling plugins, CHMU and device >>>>>> model is a good thing. >>>>>> >>>>>> I'm not familiar with CXL, but one question that comes to my mind is: >>>>>> Is that mandatory to do this analysis during execution (vs dumping >>>>>> binary traces from CHMU and plugin and running an analysis post execution)? >>>>> >>>>> Short answer is that post run analysis isn't of much use for developing the OS >>>>> software story. It works to some degree if you are designing the tracking >>>>> hardware or algorithms to use that hardware capture a snapshot of hotness - >>>>> dealing with lack of counters, that sort of thing. >>>>> >>>>> The main intent of this support is to drive live usage of the data in the OS. >>>>> So it gets this hotness information and migrates more frequently accessed memory >>>>> to a 'nearer'/lower latency memory node. >>>>> >>>>> From an OS point of view there will be two ways it uses it: >>>>> 1) Offline application optimization - that aligns with your suggestion of offline >>>>> analysis but would typically still need to be live because we have to do >>>>> the reverse maps and work out what was allocated in particular locations. >>>>> Not impossible to dump that information from QEMU + the guest OS but the usage >>>>> flow would then look quite different from what makes sense on real hardware >>>>> where all the data is available to the host OS directly. >>>>> 2) Migration of memory. This will dynamically change the PA backing a VA whilst >>>>> applications are running. The aim being to develop how that happens, we need >>>>> the dynamic state. >>>>> >>>> >>>> In the end, are you modeling how the real CHMU will work, or simply >>>> gathering data to help designing it (number of counters, line size, ...)? >>> >>> This work is modeling how a real (ish) CHMU will work - particular interest being >>> use in Linux kernel usecases. Otherwise we wouldn't share! :) >>> >>> For CHMU hardware design, until people reach the live algorithms in the loop >>> stage, tracing techniques and offline analysis tend to be easier to use. >>> >>> A annoying corner is that the implementations in QEMU will 'probably' remain >>> simplistic because the detailed designs of CHMUs may be considered sensitive. >>> It's a complex space and there are some really interesting and to me surprising >>> approaches. >>> >>> What we can implement should be good enough for working out the basics of a >>> general software stack but possible it will need tuning against specific >>> implementations. 
Maybe that necessity will result in more openness on the >>> parts of various uarch / arch teams. >>> >>> There are some academic works on how to build these trackers, and there should >>> be less sensitivity around those. >>> >>> This is perhaps an odd corner for QEMU because we are emulating an interface >>> accurately but the hardware behind it intentionally does not have a specification >>> defined implementation and the unusual bit is that implementation affects >>> the output. We can implement a few options that are well defined though. >>> 1) An Oracle ('infinite' counters) >>> 2) Limited counters allocated on first touch in a given epoch (sampling period). >>> >>> 1 is useful for putting an upper bound on data accuracy. >>> 2 is a typical first thing people will look at when considering a implementation. >>> >>> So conclusion. This is about enabling software development, not tuning a hardware >>> design. >>> >> >> Ok, thanks for the clarification. >> >> Considering the approach you followed, as said before, choosing a >> decoupled solution is the right choice. Plugins should not allow to >> access internal details of QEMU implementation as a general rule. >> >> Did you think about integrating the server directly in the plugin? From >> what I understand, the CHMU will contact the server on a per request >> basis, while instrumentation will contact it for every access. > > Definitely a possibility. I was a little nervous that would put ordering > constraints on plugin load and device creation. I suppose we can connect > when the device is enabled by the OS though so that should be easily > avoided (if it is a problem at all!). The other thought was that we'd > do some of the data handling asynchronously, but that hasn't happened > yet and maybe never will + maybe we can do that in a plugin anyway. > You still need a separate thread triggered from the plugin install function for the server, and initiating the connection first time you need it seems like a good approach. However, in the plugin itself, you now have to deal with concurrency issues, since several cpus can report memory accesses at the same time, which were sequentialized through the socket before. > The CHMU device model is responsible for timing signals, so it will > signal a regular tick (and read back the buffer state to identify > overflow etc). So not quite on request but a lot lower data rate > than the plugin to CHMU data flow. > >> >> Beyond communication, the biggest overhead here is to have >> instrumentation on all memory accesses. Another thing that comes out of >> my mind is to do a sampling instrumentation. However, it's not easy to >> do, because the current API does not allow to force a new translation of >> existing TB, and at translation time, you have no clue whether or not >> this will be a hot TB. > > Whilst we will have sampling controls (they are part of the spec, both > fixed interval and pseudo random I just haven't wired them up yet), > typically sampling and the noise it creates is one of the causes of > bad numbers with hot page tracking. > > We go to the effort to capture as much data as possible in a short time > so that we can move on to focusing the limited tracking resources on another > range of the memory. Pushing knowledge of the sampling to the plugin is > definitely an option, but we'd still want full resolution hitting the > cache simulator as what we want to sample is what misses in the cache. > Sure, I was just proposing this to speed up execution. 
There is a trade-off concerning accuracy. However, I'm surprised you observed such bad numbers (no data locality in your use cases?) >> What you can do though, is to ignore most of the accesses, and only send >> information every 1000 memory accesses. (Note: I have no idea if 1000 is >> a good threshold :)). It would allow you to skip most of the overhead >> related to communication and hot pages management. > > Definitely will implement that as an option when I wire up the rest of > the control interface. It can only happen with the knowledge of the > OS as it is another thing that will be controlled (or at least taken into > account if the device implements only fixed subsampling). > >> >> From what I understood about using plugins, the goal is to track hot >> pages. Using the cache one is a possibility, but it might be better to >> use contrib/plugins/hotpages.c instead, or better, a custom simple >> plugin, simply reporting reads/writes to the server (and implementing >> sampling as well). > > We need the cache part. Without that the data ends up quite different > from what needs to be measured. Data that is in cache and so not fetched > much from main memory as a result is (perhaps confusingly) cold. > No point in moving it to faster memory and from CXL point of view > we never see the accesses anyway so couldn't if we wanted to. > Cache simulation is not perfect as we don't have any simulation of > prefetchers etc though not sure how much that will hurt us yet. > Current cache plugin is obviously simplistic but it is a > reasonable starting point. > > Also, in some more interesting cases granual it doesn't align with pages > (interleaving across multiple devices, much larger regions than pages) > so we'd need to switch back to something like this later anyway. > > If bringing the hotness monitoring data capture into the plugin it probably > makes sense to fork the code to a new plugin anyway as that will add a > lot of complexity wherever we add it. > From what I understand, the goal is to keep hot pages in a specific CXL cache, located between the cpu cache, and DRAM, and to avoid keeping the same data than in cpu cache. However, cache topology will be different for various cpus, with different levels, different sizes, and different associativity. Thus, it seems that it will be impossible to model anything, simply based on a generic "cache" algorithm. The thing that puzzles me in the approach is that we compare a cpu data cache (based on small chunks of data), and a hot page cache (based on a page size). It seems normal to hold the same data, and be able to provide the rest of the page to cpu cache when needed. >> >> Which slow-down factor (order of magnitude) do you have with this series? > > I've not measured it accurately but not too bad. Maybe 10-20x slow down over > TCG (when running arm64 on x86 host). Few minutes to boot a standard distro. > > Jonathan > >> >>> Jonathan >>> >>>> >>>> Pierrick >>>> >>>>> Jonathan >>>>> >>>>>> >>>>>> Regards, >>>>>> Pierrick >>>>>> >>>>>> On 1/24/25 09:29, Jonathan Cameron wrote: >>>>>>> Hi All, >>>>>>> >>>>>>> This is an RFC mainly to seek feedback on the approach used, particularly >>>>>>> the aspect of how to get data from a TCG plugin into a device model. >>>>>>> Two options that we have tried >>>>>>> 1. Socket over which the plugin sends data to an external server >>>>>>> (as seen here) >>>>>>> 2. 
Register and manage a plugin from within a device model
>>>>>>> [snip: remainder of quoted cover letter]

^ permalink raw reply [flat|nested] 13+ messages in thread
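
A rough sketch of the serialisation issue raised here, assuming the hotness bookkeeping were pulled into the plugin itself: several vCPUs can hit the memory callback concurrently, so something must impose the ordering the socket previously gave for free. The on_llc_miss()/hmu_record() names and the fixed 1-in-N subsampling knob are illustrative assumptions, not code from the posted series.

    /* Sketch only: serialise concurrent per-vCPU miss reports inside the
     * plugin, with optional fixed 1-in-N subsampling. */
    #include <stdint.h>
    #include <stdatomic.h>
    #include <pthread.h>

    static pthread_mutex_t hmu_lock = PTHREAD_MUTEX_INITIALIZER;
    static _Atomic uint64_t access_count;
    static uint64_t sample_interval = 1;   /* 1 = full resolution, N = 1-in-N */

    static void hmu_record(uint64_t paddr)
    {
        /* Placeholder for the per-granule hotness counter update. */
        (void)paddr;
    }

    static void on_llc_miss(uint64_t paddr)
    {
        /* Cheap atomic pre-filter so skipped accesses never take the lock. */
        if (atomic_fetch_add(&access_count, 1) % sample_interval) {
            return;
        }
        pthread_mutex_lock(&hmu_lock);
        hmu_record(paddr);
        pthread_mutex_unlock(&hmu_lock);
    }

Per-vCPU buffers merged on the CHMU tick would scale better than a single lock, but a mutex is the simplest stand-in for the ordering the socket used to provide.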
* Re: [RFC PATCH QEMU 0/3] cxl/plugins: Hotness Monitoring Unit with 'real' data. 2025-01-30 18:28 ` Pierrick Bouvier @ 2025-01-31 11:15 ` Jonathan Cameron via 0 siblings, 0 replies; 13+ messages in thread From: Jonathan Cameron via @ 2025-01-31 11:15 UTC (permalink / raw) To: Pierrick Bouvier Cc: fan.ni, linux-cxl, qemu-devel, Alex Bennée, Alexandre Iooss, Mahmoud Mandour, linuxarm, Niyas Sait On Thu, 30 Jan 2025 10:28:03 -0800 Pierrick Bouvier <pierrick.bouvier@linaro.org> wrote: > On 1/30/25 07:52, Jonathan Cameron wrote: > > On Wed, 29 Jan 2025 14:31:10 -0800 > > Pierrick Bouvier <pierrick.bouvier@linaro.org> wrote: > > > >> Hi Jonathan, > >> > >> On 1/29/25 02:29, Jonathan Cameron wrote: > >>> On Tue, 28 Jan 2025 12:04:19 -0800 > >>> Pierrick Bouvier <pierrick.bouvier@linaro.org> wrote: > >>> > >>>> On 1/27/25 02:20, Jonathan Cameron wrote: > >>>>> On Fri, 24 Jan 2025 12:55:52 -0800 > >>>>> Pierrick Bouvier <pierrick.bouvier@linaro.org> wrote: > >>>>> > >>>>>> Hi Jonathan, > >>>>>> > >>>>>> thanks for posting this. It's a creative usage of plugins. > >>>>>> > >>>>>> I think that your current approach, decoupling plugins, CHMU and device > >>>>>> model is a good thing. > >>>>>> > >>>>>> I'm not familiar with CXL, but one question that comes to my mind is: > >>>>>> Is that mandatory to do this analysis during execution (vs dumping > >>>>>> binary traces from CHMU and plugin and running an analysis post execution)? > >>>>> > >>>>> Short answer is that post run analysis isn't of much use for developing the OS > >>>>> software story. It works to some degree if you are designing the tracking > >>>>> hardware or algorithms to use that hardware capture a snapshot of hotness - > >>>>> dealing with lack of counters, that sort of thing. > >>>>> > >>>>> The main intent of this support is to drive live usage of the data in the OS. > >>>>> So it gets this hotness information and migrates more frequently accessed memory > >>>>> to a 'nearer'/lower latency memory node. > >>>>> > >>>>> From an OS point of view there will be two ways it uses it: > >>>>> 1) Offline application optimization - that aligns with your suggestion of offline > >>>>> analysis but would typically still need to be live because we have to do > >>>>> the reverse maps and work out what was allocated in particular locations. > >>>>> Not impossible to dump that information from QEMU + the guest OS but the usage > >>>>> flow would then look quite different from what makes sense on real hardware > >>>>> where all the data is available to the host OS directly. > >>>>> 2) Migration of memory. This will dynamically change the PA backing a VA whilst > >>>>> applications are running. The aim being to develop how that happens, we need > >>>>> the dynamic state. > >>>>> > >>>> > >>>> In the end, are you modeling how the real CHMU will work, or simply > >>>> gathering data to help designing it (number of counters, line size, ...)? > >>> > >>> This work is modeling how a real (ish) CHMU will work - particular interest being > >>> use in Linux kernel usecases. Otherwise we wouldn't share! :) > >>> > >>> For CHMU hardware design, until people reach the live algorithms in the loop > >>> stage, tracing techniques and offline analysis tend to be easier to use. > >>> > >>> A annoying corner is that the implementations in QEMU will 'probably' remain > >>> simplistic because the detailed designs of CHMUs may be considered sensitive. > >>> It's a complex space and there are some really interesting and to me surprising > >>> approaches. 
> >>> > >>> What we can implement should be good enough for working out the basics of a > >>> general software stack but possible it will need tuning against specific > >>> implementations. Maybe that necessity will result in more openness on the > >>> parts of various uarch / arch teams. > >>> > >>> There are some academic works on how to build these trackers, and there should > >>> be less sensitivity around those. > >>> > >>> This is perhaps an odd corner for QEMU because we are emulating an interface > >>> accurately but the hardware behind it intentionally does not have a specification > >>> defined implementation and the unusual bit is that implementation affects > >>> the output. We can implement a few options that are well defined though. > >>> 1) An Oracle ('infinite' counters) > >>> 2) Limited counters allocated on first touch in a given epoch (sampling period). > >>> > >>> 1 is useful for putting an upper bound on data accuracy. > >>> 2 is a typical first thing people will look at when considering a implementation. > >>> > >>> So conclusion. This is about enabling software development, not tuning a hardware > >>> design. > >>> > >> > >> Ok, thanks for the clarification. > >> > >> Considering the approach you followed, as said before, choosing a > >> decoupled solution is the right choice. Plugins should not allow to > >> access internal details of QEMU implementation as a general rule. > >> > >> Did you think about integrating the server directly in the plugin? From > >> what I understand, the CHMU will contact the server on a per request > >> basis, while instrumentation will contact it for every access. > > > > Definitely a possibility. I was a little nervous that would put ordering > > constraints on plugin load and device creation. I suppose we can connect > > when the device is enabled by the OS though so that should be easily > > avoided (if it is a problem at all!). The other thought was that we'd > > do some of the data handling asynchronously, but that hasn't happened > > yet and maybe never will + maybe we can do that in a plugin anyway. > > > > You still need a separate thread triggered from the plugin install > function for the server, and initiating the connection first time you > need it seems like a good approach. > > However, in the plugin itself, you now have to deal with concurrency > issues, since several cpus can report memory accesses at the same time, > which were sequentialized through the socket before. True. There is already serialization going on for shared l2 caches but needs something a little more general. > > > The CHMU device model is responsible for timing signals, so it will > > signal a regular tick (and read back the buffer state to identify > > overflow etc). So not quite on request but a lot lower data rate > > than the plugin to CHMU data flow. > > > >> > >> Beyond communication, the biggest overhead here is to have > >> instrumentation on all memory accesses. Another thing that comes out of > >> my mind is to do a sampling instrumentation. However, it's not easy to > >> do, because the current API does not allow to force a new translation of > >> existing TB, and at translation time, you have no clue whether or not > >> this will be a hot TB. > > > > Whilst we will have sampling controls (they are part of the spec, both > > fixed interval and pseudo random I just haven't wired them up yet), > > typically sampling and the noise it creates is one of the causes of > > bad numbers with hot page tracking. 
> > > > We go to the effort to capture as much data as possible in a short time > > so that we can move on to focusing the limited tracking resources on another > > range of the memory. Pushing knowledge of the sampling to the plugin is > > definitely an option, but we'd still want full resolution hitting the > > cache simulator as what we want to sample is what misses in the cache. > > > > Sure, I was just proposing this to speed up execution. There is a > trade-off concerning accuracy. However, I'm surprised you observed such > bad numbers (no data locality in your use cases?)
Not at all - in fact kind of the opposite. Mostly I've just seen how bad the data is and not looked too closely at why. I think what can happen is that a clump of accesses to a page come together as streaming across the cachelines of a page. If you subsample the data you hit maybe one or two of those rather than a much higher count. Have that happen a few times and your subsampled data badly misrepresents the actual accesses. Arguably those streaming cases might not matter to performance (prefetcher heaven) but they affect what we see when trying to match what the hardware does. So unless the hardware has been specifically told to subsample we should not do so.
> > >> What you can do though, is to ignore most of the accesses, and only send > >> information every 1000 memory accesses. (Note: I have no idea if 1000 is > >> a good threshold :)). It would allow you to skip most of the overhead > >> related to communication and hot pages management. > > > > Definitely will implement that as an option when I wire up the rest of > > the control interface. It can only happen with the knowledge of the > > OS as it is another thing that will be controlled (or at least taken into > > account if the device implements only fixed subsampling). > > > >> > >> From what I understood about using plugins, the goal is to track hot > >> pages. Using the cache one is a possibility, but it might be better to > >> use contrib/plugins/hotpages.c instead, or better, a custom simple > >> plugin, simply reporting reads/writes to the server (and implementing > >> sampling as well). > > > > We need the cache part. Without that the data ends up quite different > > from what needs to be measured. Data that is in cache and so not fetched > > much from main memory as a result is (perhaps confusingly) cold. > > No point in moving it to faster memory and from CXL point of view > > we never see the accesses anyway so couldn't if we wanted to. > > Cache simulation is not perfect as we don't have any simulation of > > prefetchers etc though not sure how much that will hurt us yet. > > Current cache plugin is obviously simplistic but it is a > > reasonable starting point. > > > > Also, in some more interesting cases granual it doesn't align with pages > > (interleaving across multiple devices, much larger regions than pages) > > so we'd need to switch back to something like this later anyway. > > > > If bringing the hotness monitoring data capture into the plugin it probably > > makes sense to fork the code to a new plugin anyway as that will add a > > lot of complexity wherever we add it. > > From what I understand, the goal is to keep hot pages in a specific CXL > cache, located between the cpu cache, and DRAM, and to avoid keeping the > same data than in cpu cache.
No. This is about tiering across CXL and other memory in the system with software migrating that memory (similar to NUMA balancing).
Typically your low latency memory that you are migrating to is direct attached DDR4/5. The high latency tier is the CXL memory. There are no extra caches in the setup this is targeting.
> However, cache topology will be different for various cpus, with > different levels, different sizes, and different associativity. > Thus, it seems that it will be impossible to model anything, simply > based on a generic "cache" algorithm.
Sure we may need to improve that cache part of the plugin. Right now it is indeed simplistic. However, the aim here is to get 'a reasonable' software stack up. Emulation won't be perfect but at least the numbers should be somewhat close to the real ones. Obviously anyone doing system design should have a suitable cycle accurate model, but that's useless for the software stack design (too slow by far!!!)
> > The thing that puzzles me in the approach is that we compare a cpu data > cache (based on small chunks of data), and a hot page cache (based on a > page size). It seems normal to hold the same data, and be able to > provide the rest of the page to cpu cache when needed.
This is the confusion above. This is not in any way about a special cache. There are other things you can do in that space, but this is just fast and slow DRAM + the normal CPU caching hierarchy with the OS choosing to migrate the pages (and map the VA to PA appropriately to point to the new PA).
Jonathan
> [snip: quoted thread and cover letter]