[RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver

* [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver
@ 2024-11-21 10:18 Jonathan Cameron
  2024-11-21 10:18 ` [RFC PATCH 1/4] cxl: Register devices for CXL Hotness Monitoring Units (CHMU) Jonathan Cameron
                   ` (7 more replies)
  0 siblings, 8 replies; 27+ messages in thread
From: Jonathan Cameron @ 2024-11-21 10:18 UTC
  To: linux-cxl, linux-mm, linux-perf-users, linux-kernel
  Cc: linuxarm, tongtiangen, Yicong Yang, Niyas Sait, ajayjoshi,
	Vandana Salve, Davidlohr Bueso, Dave Jiang, Alison Schofield,
	Ira Weiny, Dan Williams, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Gregory Price, Huang Ying

The CXL specification release 3.2 is now available under a click through at
https://computeexpresslink.org/cxl-specification/ and it brings new
shiny toys.

RFC reason
- Whilst trace capture with a particular configuration is potentially useful
  the intent is that CXL HMU units will be used to drive various forms of
  hotpage migration for memory tiering setups. This driver doesn't do this
  (yet), but rather provides data capture etc for experimentation and
  for working out how to mostly put the allocations in the right place to
  start with by tuning applications.

CXL r3.2 introduces a CXL Hotness Monitoring Unit definition. The intent
of this is to provide a way to establish which units of memory (typically
pages or larger) in CXL attached memory are hot. The implementation details
and algorithm are all implementation defined. The specification simply
describes the 'interface' which takes the form of a ring buffer of hotness
records in a PCI BAR and defined capability, configuration and status
registers.

The hardware may have constraints on what it can track, granularity etc
and on how accurately it tracks (e.g. counter exhaustion, inaccurate
trackers). Some of these constraints are discoverable from the hardware
registers, others such as loss of accuracy have no universally accepted
measures as they are typically access pattern dependent. Sadly it is
very unlikely any hardware will implement a truly precise tracker given
the large resource requirements for tracking at a useful granularity.

There are two fundamental operation modes:

* Epoch based. Counters are checked after a period of time (Epoch) and
  if over a threshold added to the hotlist.
* Always on. Counters run until a threshold is reached, after that the
  hot unit is added to the hotlist and the counter released.
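
To make the difference between the two modes concrete, here is a minimal
sketch in C of a single counter tracking one unit. It is a toy model only:
the names (unit_counter, epoch_expired, always_on_check) are hypothetical,
the reset-at-epoch-end behaviour is an assumption, and real hardware is free
to implement the counting in any way it likes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one hardware counter tracking one unit of memory. */
struct unit_counter {
	uint32_t count;
};

/* Record one (possibly downsampled) access to the tracked unit. */
static inline void unit_access(struct unit_counter *c)
{
	c->count++;
}

/*
 * Epoch mode: called once when the epoch expires. The unit is reported
 * as hot only if its count is over the threshold at that point.
 */
static inline bool epoch_expired(struct unit_counter *c, uint32_t threshold)
{
	bool hot = c->count > threshold;

	c->count = 0;	/* assumption: counting restarts every epoch */
	return hot;
}

/*
 * Always-on mode: checked after every access. The unit is reported as
 * soon as the threshold is reached and the counter is then released.
 */
static inline bool always_on_check(struct unit_counter *c, uint32_t threshold)
{
	assert(threshold > 0);
	if (c->count < threshold)
		return false;

	c->count = 0;	/* counter released for reuse */
	return true;
}
```

In the epoch model a unit that is hammered early in the epoch is only
reported when the epoch ends, whereas in the always-on model it is reported
as soon as the threshold is hit.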

Counting can be filtered on:

* Region of CXL DPA space (256MiB per bit in a bitmap).
* Type of access - Trusted and non trusted or non trusted only, R/W/RW
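
As an illustration of the range filter, the sketch below turns a
(base, count) pair expressed in 256MiB chunks into a bitmap with one bit per
chunk. The helper names and the bit ordering (bit 0 of word 0 is chunk 0)
are assumptions made for this example, not something taken from the
specification or the driver.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_SHIFT 28	/* each bitmap bit covers 256MiB of DPA space */

/* Set one bit per 256MiB chunk in [base, base + num). */
static inline void range_to_bitmap(uint64_t *bitmap, size_t nwords,
				   uint32_t base, uint32_t num)
{
	uint64_t end = (uint64_t)base + num;

	assert(end <= (uint64_t)nwords * 64);
	for (uint64_t bit = base; bit < end; bit++)
		bitmap[bit / 64] |= UINT64_C(1) << (bit % 64);
}

/* First DPA covered by a given chunk index. */
static inline uint64_t chunk_to_dpa(uint32_t chunk)
{
	return (uint64_t)chunk << CHUNK_SHIFT;
}
```

With these units, range_base=0 and range_size=1024 as used in the example
command further down would cover the first 256GiB of DPA space.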

Sampling can be modified by:

* Downsampling including potentially randomized downsampling.

The driver presented here is intended to be useful in its own right but
also to act as the first step of a possible path towards hotness monitoring
based hot page migration. Those steps might look like:

1. Gather data - drivers provide telemetry-like solutions to get that
   data. This may be enhanced, for example in this driver by providing
   the HPA rather than the DPA Unit Address. Userspace can access enough
   information to do this conversion itself, so maybe not.
2. Userspace algorithm development, possibly combined with userspace
   triggered migration by PA. Working out how to use different levels
   of constrained hardware resources will be challenging.
3. Move those algorithms into the kernel. This will require generalization
   across different hotpage trackers etc.

So far this driver just gives access to the raw data. I will probably kick
off a longer discussion on how to do the adaptive sampling needed to actually
use these units for tiering etc, sometime soon (if no one else beats
me to it).  There is a follow up topic of how to virtualize this stuff
for memory stranding cases (VM gets a fixed mixture of fast and slow
memory and should do its own tiering).

More details in the Documentation patch but typical commands are:

$perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\
 hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,\
 range_size=1024,randomized_downsampling=0,downsampling_factor=32,\
 hotness_granual=12/

$perf report --dump-raw-traces
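
For anyone wanting to open the event without the perf tool, the following
sketch shows how the named parameters of the record command would land in
the three perf_event_attr config words. The bit positions are those of the
format attributes in patch 2 of this series; the struct and function names
are hypothetical and the code does not open an event.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical holder for perf_event_attr.config, .config1 and .config2. */
struct hmu_event_cfg {
	uint64_t config, config1, config2;
};

/* Named parameters, matching the format attribute names in patch 2. */
struct hmu_params {
	uint64_t epoch_type, access_type, epoch_scale, epoch_multiplier;
	uint64_t randomized_downsampling, downsampling_factor;
	uint64_t hotness_threshold, hotness_granual;
	uint64_t range_base, range_size;
};

/* Place val in bits [hi:lo], checking that it fits in the field. */
static inline uint64_t put_field(uint64_t val, unsigned int lo, unsigned int hi)
{
	assert(lo <= hi && hi < 64);

	unsigned int width = hi - lo + 1;
	uint64_t max = width == 64 ? UINT64_MAX : (UINT64_C(1) << width) - 1;

	assert(val <= max);
	return val << lo;
}

static inline struct hmu_event_cfg hmu_pack(const struct hmu_params *p)
{
	struct hmu_event_cfg c = {
		.config = put_field(p->epoch_type, 0, 1) |
			  put_field(p->access_type, 2, 9) |
			  put_field(p->epoch_scale, 10, 13) |
			  put_field(p->epoch_multiplier, 14, 25) |
			  put_field(p->randomized_downsampling, 26, 26) |
			  put_field(p->downsampling_factor, 27, 34),
		.config1 = put_field(p->hotness_threshold, 0, 31) |
			   put_field(p->hotness_granual, 32, 63),
		.config2 = put_field(p->range_base, 0, 31) |
			   put_field(p->range_size, 32, 63),
	};

	return c;
}
```

Feeding in the values from the record command gives config=0x100011018,
config1=0xc00000400 and config2=0x40000000000.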

Example output.  With a counter_width of 16 (0x10) the least significant
16 bits (4 hex digits) are the counter value and the unit index is bits 16-63.
Here all units are over the threshold and the indexes are 0,1,2 etc.

. ... CXL_HMU data: size 33512 bytes
Header 0: units: 29c counter_width 10
Header 1 : deadbeef
0000000000000283
0000000000010364
0000000000020366
000000000003033c
0000000000040343
00000000000502ff
000000000006030d
000000000007031a

The dump is a list of hotness entries, each laid out as:
Bits[N-1:0] counter value
Bits[63:N] Unit ID (combine with unit size and DPA base + HDM decoder
  config to get to a Host Physical Address)
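
A minimal decoding sketch for one raw entry follows. The names are
hypothetical, and unit_to_dpa() assumes the unit index is relative to a
caller-supplied DPA base and that the unit size is a power of two given as
its log2 (12 for the 4KiB granule in the command above). The final
DPA to HPA step depends on the HDM decoder configuration and is not
attempted here.

```c
#include <assert.h>
#include <stdint.h>

/* One decoded hotlist entry. */
struct hot_entry {
	uint64_t unit;	/* Bits[63:N]: unit index */
	uint64_t count;	/* Bits[N-1:0]: counter value */
};

/* Split a raw 64-bit record given the counter width N from the header. */
static inline struct hot_entry decode_entry(uint64_t raw, unsigned int width)
{
	assert(width > 0 && width < 64);

	struct hot_entry e = {
		.unit = raw >> width,
		.count = raw & ((UINT64_C(1) << width) - 1),
	};

	return e;
}

/* Assumed mapping: units are numbered from dpa_base in steps of 2^gran_log2. */
static inline uint64_t unit_to_dpa(uint64_t unit, unsigned int gran_log2,
				   uint64_t dpa_base)
{
	assert(gran_log2 < 64);
	return dpa_base + (unit << gran_log2);
}
```

For the second record in the dump, 0000000000010364 with a counter width of
16, this gives unit 1 with a count of 0x364.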

Specific RFC questions.
- What should be in the header added to the aux buffer.
  Currently just the minimum is provided. Number of records
  and the counter width needed to decode them.
- Should we reset the counters when doing sampling "-F X"?
  If the frequency is higher than the epoch we never see any hot units.
  If so, when should we reset them?

Note testing has been light and on emulation only + as perf tool is
a pain to build on a stripped back VM, build testing has all been on
arm64 so far.  The driver loads on both arm64 and x86 though, so
any problems are likely in the perf tool arch specific code,
which has only been build tested (on the wrong machine).

The QEMU emulation needs some cleanup, but I should be able to post
that shortly to let people actually play with this.  There are lots
of open questions there on how 'right' we want the emulation to be
and what counting uarch to emulate.

Jonathan Cameron (4):
  cxl: Register devices for CXL Hotness Monitoring Units (CHMU)
  cxl: Hotness Monitoring Unit via a Perf AUX Buffer.
  perf: Add support for CXL Hotness Monitoring Units (CHMU)
  hwtrace: Document CXL Hotness Monitoring Unit driver

 Documentation/trace/cxl-hmu.rst     | 197 +++++++
 Documentation/trace/index.rst       |   1 +
 drivers/cxl/Kconfig                 |   6 +
 drivers/cxl/Makefile                |   3 +
 drivers/cxl/core/Makefile           |   1 +
 drivers/cxl/core/core.h             |   1 +
 drivers/cxl/core/hmu.c              |  64 ++
 drivers/cxl/core/port.c             |   2 +
 drivers/cxl/core/regs.c             |  14 +
 drivers/cxl/cxl.h                   |   5 +
 drivers/cxl/cxlpci.h                |   1 +
 drivers/cxl/hmu.c                   | 880 ++++++++++++++++++++++++++++
 drivers/cxl/hmu.h                   |  23 +
 drivers/cxl/pci.c                   |  26 +-
 tools/perf/arch/arm/util/auxtrace.c |  58 ++
 tools/perf/arch/x86/util/auxtrace.c |  76 +++
 tools/perf/util/Build               |   1 +
 tools/perf/util/auxtrace.c          |   4 +
 tools/perf/util/auxtrace.h          |   1 +
 tools/perf/util/cxl-hmu.c           | 367 ++++++++++++
 tools/perf/util/cxl-hmu.h           |  18 +
 21 files changed, 1748 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/trace/cxl-hmu.rst
 create mode 100644 drivers/cxl/core/hmu.c
 create mode 100644 drivers/cxl/hmu.c
 create mode 100644 drivers/cxl/hmu.h
 create mode 100644 tools/perf/util/cxl-hmu.c
 create mode 100644 tools/perf/util/cxl-hmu.h

-- 
2.43.0


* [RFC PATCH 1/4] cxl: Register devices for CXL Hotness Monitoring Units (CHMU)
  2024-11-21 10:18 [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver Jonathan Cameron
@ 2024-11-21 10:18 ` Jonathan Cameron
       [not found]   ` <CGME20250103052421epcas5p4a1a917ba5d367dfccec91d4522666ca0@epcas5p4.samsung.com>
                     ` (3 more replies)
  2024-11-21 10:18 ` [RFC PATCH 2/4] cxl: Hotness Monitoring Unit via a Perf AUX Buffer Jonathan Cameron
                   ` (6 subsequent siblings)
  7 siblings, 4 replies; 27+ messages in thread
From: Jonathan Cameron @ 2024-11-21 10:18 UTC
  To: linux-cxl, linux-mm, linux-perf-users, linux-kernel
  Cc: linuxarm, tongtiangen, Yicong Yang, Niyas Sait, ajayjoshi,
	Vandana Salve, Davidlohr Bueso, Dave Jiang, Alison Schofield,
	Ira Weiny, Dan Williams, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Gregory Price, Huang Ying

Basic registration using a similar approach to how the CPMUs
are registered.

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/cxl/core/Makefile |  1 +
 drivers/cxl/core/hmu.c    | 64 +++++++++++++++++++++++++++++++++++++++
 drivers/cxl/core/regs.c   | 14 +++++++++
 drivers/cxl/cxl.h         |  4 +++
 drivers/cxl/cxlpci.h      |  1 +
 drivers/cxl/hmu.h         | 23 ++++++++++++++
 drivers/cxl/pci.c         | 26 +++++++++++++++-
 7 files changed, 132 insertions(+), 1 deletion(-)

diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index 9259bcc6773c..d060abb773ae 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -12,6 +12,7 @@ cxl_core-y += memdev.o
 cxl_core-y += mbox.o
 cxl_core-y += pci.o
 cxl_core-y += hdm.o
+cxl_core-y += hmu.o
 cxl_core-y += pmu.o
 cxl_core-y += cdat.o
 cxl_core-$(CONFIG_TRACING) += trace.o
diff --git a/drivers/cxl/core/hmu.c b/drivers/cxl/core/hmu.c
new file mode 100644
index 000000000000..3ee938bb6c05
--- /dev/null
+++ b/drivers/cxl/core/hmu.c
@@ -0,0 +1,64 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2024 Huawei. All rights reserved. */
+
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/idr.h>
+#include <cxlmem.h>
+#include <hmu.h>
+#include <cxl.h>
+#include "core.h"
+
+static void cxl_hmu_release(struct device *dev)
+{
+	struct cxl_hmu *hmu = to_cxl_hmu(dev);
+
+	kfree(hmu);
+}
+
+const struct device_type cxl_hmu_type = {
+	.name = "cxl_hmu",
+	.release = cxl_hmu_release,
+};
+
+static void remove_dev(void *dev)
+{
+	device_unregister(dev);
+}
+
+int devm_cxl_hmu_add(struct device *parent, struct cxl_hmu_regs *regs,
+		     int assoc_id, int index)
+{
+	struct cxl_hmu *hmu;
+	struct device *dev;
+	int rc;
+
+	hmu = kzalloc(sizeof(*hmu), GFP_KERNEL);
+	if (!hmu)
+		return -ENOMEM;
+
+	hmu->assoc_id = assoc_id;
+	hmu->index = index;
+	hmu->base = regs->hmu;
+	dev = &hmu->dev;
+	device_initialize(dev);
+	device_set_pm_not_required(dev);
+	dev->parent = parent;
+	dev->bus = &cxl_bus_type;
+	dev->type = &cxl_hmu_type;
+	rc = dev_set_name(dev, "hmu_mem%d.%d", assoc_id, index);
+	if (rc)
+		goto err;
+
+	rc = device_add(dev);
+	if (rc)
+		goto err;
+
+	return devm_add_action_or_reset(parent, remove_dev, dev);
+
+err:
+	put_device(&hmu->dev);
+	return rc;
+}
+EXPORT_SYMBOL_NS_GPL(devm_cxl_hmu_add, CXL);
+
diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
index e1082e749c69..c12afaa6ef98 100644
--- a/drivers/cxl/core/regs.c
+++ b/drivers/cxl/core/regs.c
@@ -401,6 +401,20 @@ int cxl_map_pmu_regs(struct cxl_register_map *map, struct cxl_pmu_regs *regs)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_map_pmu_regs, CXL);
 
+int cxl_map_hmu_regs(struct cxl_register_map *map, struct cxl_hmu_regs *regs)
+{
+	struct device *dev = map->host;
+	resource_size_t phys_addr;
+
+	phys_addr = map->resource;
+	regs->hmu = devm_cxl_iomap_block(dev, phys_addr, map->max_size);
+	if (!regs->hmu)
+		return -ENOMEM;
+
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_map_hmu_regs, CXL);
+
 static int cxl_map_regblock(struct cxl_register_map *map)
 {
 	struct device *host = map->host;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 5406e3ab3d4a..8172bc1f7a8d 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -227,6 +227,9 @@ struct cxl_regs {
 	struct_group_tagged(cxl_pmu_regs, pmu_regs,
 		void __iomem *pmu;
 	);
+	struct_group_tagged(cxl_hmu_regs, hmu_regs,
+		void __iomem *hmu;
+	);
 
 	/*
 	 * RCH downstream port specific RAS register
@@ -292,6 +295,7 @@ int cxl_map_component_regs(const struct cxl_register_map *map,
 			   unsigned long map_mask);
 int cxl_map_device_regs(const struct cxl_register_map *map,
 			struct cxl_device_regs *regs);
+int cxl_map_hmu_regs(struct cxl_register_map *map, struct cxl_hmu_regs *regs);
 int cxl_map_pmu_regs(struct cxl_register_map *map, struct cxl_pmu_regs *regs);
 
 enum cxl_regloc_type;
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index 4da07727ab9c..71f5e9620137 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -67,6 +67,7 @@ enum cxl_regloc_type {
 	CXL_REGLOC_RBI_VIRT,
 	CXL_REGLOC_RBI_MEMDEV,
 	CXL_REGLOC_RBI_PMU,
+	CXL_REGLOC_RBI_HMU,
 	CXL_REGLOC_RBI_TYPES
 };
 
diff --git a/drivers/cxl/hmu.h b/drivers/cxl/hmu.h
new file mode 100644
index 000000000000..c4798ed9764b
--- /dev/null
+++ b/drivers/cxl/hmu.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright(c) 2024 Huawei
+ * CXL Specification rev 3.2 Section 8.2.8 (CHMU Register Interface)
+ */
+#ifndef CXL_HMU_H
+#define CXL_HMU_H
+#include <linux/device.h>
+
+#define CXL_HMU_REGMAP_SIZE 0xe00 /* Table 8-32 CXL 3.0 specification */
+struct cxl_hmu {
+	struct device dev;
+	void __iomem *base;
+	int assoc_id;
+	int index;
+};
+
+#define to_cxl_hmu(dev) container_of(dev, struct cxl_hmu, dev)
+struct cxl_hmu_regs;
+int devm_cxl_hmu_add(struct device *parent, struct cxl_hmu_regs *regs,
+		     int assoc_id, int idx);
+
+#endif
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 188412d45e0d..e89ea9d3f007 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -15,6 +15,7 @@
 #include "cxlmem.h"
 #include "cxlpci.h"
 #include "cxl.h"
+#include "hmu.h"
 #include "pmu.h"
 
 /**
@@ -814,7 +815,7 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	struct cxl_dev_state *cxlds;
 	struct cxl_register_map map;
 	struct cxl_memdev *cxlmd;
-	int i, rc, pmu_count;
+	int i, rc, hmu_count, pmu_count;
 	bool irq_avail;
 
 	/*
@@ -938,6 +939,29 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		}
 	}
 
+	hmu_count = cxl_count_regblock(pdev, CXL_REGLOC_RBI_HMU);
+	for (i = 0; i < hmu_count; i++) {
+		struct cxl_hmu_regs hmu_regs;
+
+		rc = cxl_find_regblock_instance(pdev, CXL_REGLOC_RBI_HMU, &map, i);
+		if (rc) {
+			dev_dbg(&pdev->dev, "Could not find HMU regblock\n");
+			break;
+		}
+
+		rc = cxl_map_hmu_regs(&map, &hmu_regs);
+		if (rc) {
+			dev_dbg(&pdev->dev, "Could not map HMU regs\n");
+			break;
+		}
+
+		rc = devm_cxl_hmu_add(cxlds->dev, &hmu_regs, cxlmd->id, i);
+		if (rc) {
+			dev_dbg(&pdev->dev, "Could not add HMU instance\n");
+			break;
+		}
+	}
+
 	rc = cxl_event_config(host_bridge, mds, irq_avail);
 	if (rc)
 		return rc;
-- 
2.43.0



* [RFC PATCH 2/4] cxl: Hotness Monitoring Unit via a Perf AUX Buffer.
  2024-11-21 10:18 [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver Jonathan Cameron
  2024-11-21 10:18 ` [RFC PATCH 1/4] cxl: Register devices for CXL Hotness Monitoring Units (CHMU) Jonathan Cameron
@ 2024-11-21 10:18 ` Jonathan Cameron
  2025-08-08  8:29   ` wangyuquan
  2024-11-21 10:18 ` [RFC PATCH 3/4] perf: Add support for CXL Hotness Monitoring Units (CHMU) Jonathan Cameron
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Jonathan Cameron @ 2024-11-21 10:18 UTC
  To: linux-cxl, linux-mm, linux-perf-users, linux-kernel
  Cc: linuxarm, tongtiangen, Yicong Yang, Niyas Sait, ajayjoshi,
	Vandana Salve, Davidlohr Bueso, Dave Jiang, Alison Schofield,
	Ira Weiny, Dan Williams, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Gregory Price, Huang Ying

There are many ways that support for the new CXL hotness monitoring unit
could be enabled.

This driver uses the existing infrastructure of perf + auxiliary buffers,
which already serves the similar activity of trace capture.

This driver is based on the existing hisi_ptt (PCI trace and tune) driver
and the CXL PMU driver.  Testing was done against QEMU emulation of
the feature but it's early days and lots more testing is needed as
this is a flexible specification with many corner cases.

The raw hotlist elements cannot be interpreted without knowing
the counter width. Unfortunately, the counter width depends, in an
implementation defined way, on the unit size used for monitoring.
As such, store this and a count of new hotlist entries in a header
that is inserted at the start of each set of records added to the auxiliary
buffer.
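
As a rough sketch of how a consumer might read that header, the code below
pulls the record count and counter width out of the first header word using
the bit positions defined in this patch (count in bits 15:0, counter width
in bits 23:16). The struct and function names are hypothetical, and the
buffer is assumed to hold host-endian 64-bit words.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HMU_HDR_WORDS 2	/* 16 bytes of header precede each set of records */

/* Decoded view of the per-set header. */
struct hmu_set_header {
	uint16_t nr_records;	/* header[0] bits 15:0 */
	uint8_t counter_width;	/* header[0] bits 23:16 */
};

static inline struct hmu_set_header
hmu_parse_header(const uint64_t hdr[HMU_HDR_WORDS])
{
	struct hmu_set_header h = {
		.nr_records = (uint16_t)(hdr[0] & 0xffff),
		.counter_width = (uint8_t)((hdr[0] >> 16) & 0xff),
	};

	/* A 64-bit record cannot carry a counter of 64 bits or more. */
	assert(h.counter_width < 64);
	return h;
}

/* Bytes taken by the header plus its 8-byte records. */
static inline size_t hmu_set_size(const struct hmu_set_header *h)
{
	return (HMU_HDR_WORDS + (size_t)h->nr_records) * sizeof(uint64_t);
}
```

The second header word currently only carries a placeholder value, so the
sketch ignores it.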

TODO: Add capabilities to expose what can be set for at least some
of these parameters.

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/cxl/Kconfig     |   6 +
 drivers/cxl/Makefile    |   3 +
 drivers/cxl/core/core.h |   1 +
 drivers/cxl/core/port.c |   2 +
 drivers/cxl/cxl.h       |   1 +
 drivers/cxl/hmu.c       | 880 ++++++++++++++++++++++++++++++++++++++++
 6 files changed, 893 insertions(+)

diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index 876469e23f7a..c420f828fe20 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -146,4 +146,10 @@ config CXL_REGION_INVALIDATION_TEST
 	  If unsure, or if this kernel is meant for production environments,
 	  say N.
 
+config CXL_HMU
+       tristate "CXL: Hotness Monitoring Unit Driver"
+       depends on PERF_EVENTS
+       help
+	 Read data out from the CXL hotness units and provide it to userspace
+	 via the perf auxbuffer framework.
 endif
diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile
index 2caa90fa4bf2..b678aa927298 100644
--- a/drivers/cxl/Makefile
+++ b/drivers/cxl/Makefile
@@ -7,15 +7,18 @@
 # - 'mem' and 'pmem' before endpoint drivers so that memdevs are
 #   immediately enabled
 # - 'pci' last, also mirrors the hardware enumeration hierarchy
+# - 'hmu' doesn't matter for now.
 obj-y += core/
 obj-$(CONFIG_CXL_PORT) += cxl_port.o
 obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o
 obj-$(CONFIG_CXL_PMEM) += cxl_pmem.o
 obj-$(CONFIG_CXL_MEM) += cxl_mem.o
 obj-$(CONFIG_CXL_PCI) += cxl_pci.o
+obj-$(CONFIG_CXL_HMU) += cxl_hmu.o
 
 cxl_port-y := port.o
 cxl_acpi-y := acpi.o
 cxl_pmem-y := pmem.o security.o
 cxl_mem-y := mem.o
 cxl_pci-y := pci.o
+cxl_hmu-y := hmu.o
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 0c62b4069ba0..88c673a4d950 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -6,6 +6,7 @@
 
 extern const struct device_type cxl_nvdimm_bridge_type;
 extern const struct device_type cxl_nvdimm_type;
+extern const struct device_type cxl_hmu_type;
 extern const struct device_type cxl_pmu_type;
 
 extern struct attribute_group cxl_base_attribute_group;
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index af92c67bc954..a91712757830 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -74,6 +74,8 @@ static int cxl_device_id(const struct device *dev)
 		return CXL_DEVICE_REGION;
 	if (dev->type == &cxl_pmu_type)
 		return CXL_DEVICE_PMU;
+	if (dev->type == &cxl_hmu_type)
+		return CXL_DEVICE_HMU;
 	return 0;
 }
 
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 8172bc1f7a8d..bd190e2baa1d 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -850,6 +850,7 @@ void cxl_driver_unregister(struct cxl_driver *cxl_drv);
 #define CXL_DEVICE_PMEM_REGION		7
 #define CXL_DEVICE_DAX_REGION		8
 #define CXL_DEVICE_PMU			9
+#define CXL_DEVICE_HMU			10
 
 #define MODULE_ALIAS_CXL(type) MODULE_ALIAS("cxl:t" __stringify(type) "*")
 #define CXL_MODALIAS_FMT "cxl:t%d"
diff --git a/drivers/cxl/hmu.c b/drivers/cxl/hmu.c
new file mode 100644
index 000000000000..9f5947afbb4b
--- /dev/null
+++ b/drivers/cxl/hmu.c
@@ -0,0 +1,880 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Driver for CXL Hotness Monitoring unit
+ *
+ * Based on hisi_ptt.c (Author Yicong Yang <yangyicong@hisilicon.com>)
+ * Copyright (c) 2022-2024 HiSilicon Technologies Co., Ltd.
+ *
+ * TODO:
+ * - Add capabilities attributes to help userspace know what can be set.
+ * - Find out if timeouts are appropriate for real hardware. Currently
+ *   assuming 0.1 seconds is enough for anything.
+ */
+#include <linux/dev_printk.h>
+#include <linux/perf_event.h>
+#include <linux/bitfield.h>
+#include <linux/spinlock.h>
+#include <linux/cleanup.h>
+#include <linux/vmalloc.h>
+#include <linux/device.h>
+#include <linux/iopoll.h>
+#include <linux/delay.h>
+#include <linux/list.h>
+#include <linux/math.h>
+#include <linux/pci.h>
+
+#include "cxlpci.h"
+#include "cxl.h"
+#include "hmu.h"
+
+#define CHMU_COMMON_CAP0_REG				0x00
+#define   CHMU_COMMON_CAP0_VER_MSK			GENMASK(3, 0)
+#define   CHMU_COMMON_CAP0_NUMINST_MSK			GENMASK(15, 8)
+#define CHMU_COMMON_CAP1_REG				0x08
+#define   CHMU_COMMON_CAP1_INSTLEN_MSK			GENMASK(15, 0)
+
+/* Register offsets within instance */
+#define CHMU_INST0_CAP0_REG				0x00
+#define   CHMU_INST0_CAP0_MSI_N_MSK			GENMASK(3, 0)
+#define   CHMU_INST0_CAP0_OVRFLW_CAP			BIT(4)
+#define   CHMU_INST0_CAP0_FILLTHRESH_CAP		BIT(5)
+#define   CHMU_INST0_CAP0_EPOCH_TYPE_MSK		GENMASK(7, 6)
+#define     CHMU_INST0_CAP0_EPOCH_TYPE_GLOBAL		0
+#define     CHMU_INST0_CAP0_EPOCH_TYPE_PERCNT		1
+#define   CHMU_INST0_CAP0_TRACK_NONTEE_R		BIT(8)
+#define   CHMU_INST0_CAP0_TRACK_NONTEE_W		BIT(9)
+#define   CHMU_INST0_CAP0_TRACK_NONTEE_RW		BIT(10)
+#define   CHMU_INST0_CAP0_TRACK_R			BIT(11)
+#define   CHMU_INST0_CAP0_TRACK_W			BIT(12)
+#define   CHMU_INST0_CAP0_TRACK_RW			BIT(13)
+/* Epoch defined as scale * multiplier */
+#define   CHMU_INST0_CAP0_EPOCH_MAX_SCALE_MSK		GENMASK(19, 16)
+#define     CHMU_EPOCH_SCALE_100US			1
+#define     CHMU_EPOCH_SCALE_1MS			2
+#define     CHMU_INST0_SCALE_10MS			3
+#define     CHMU_INST0_SCALE_100MS			4
+#define     CHMU_INST0_SCALE_1US			5
+#define   CHMU_INST0_CAP0_EPOCH_MAX_MULT_MSK		GENMASK(31, 20)
+#define   CHMU_INST0_CAP0_EPOCH_MIN_SCALE_MSK		GENMASK_ULL(35, 32)
+#define   CHMU_INST0_CAP0_EPOCH_MIN_MULT_MSK		GENMASK_ULL(47, 36)
+#define   CHMU_INST0_CAP0_HOTLIST_SIZE_MSK		GENMASK_ULL(63, 48)
+#define CHMU_INST0_CAP1_REG				0x08
+/* Power of 2 * 256 bytes */
+#define   CHMU_INST0_CAP1_UNIT_SIZE_MSK			GENMASK(31, 0)
+/* Power of 2 */
+#define   CHMU_INST0_CAP1_DOWNSAMP_MSK			GENMASK_ULL(47, 32)
+#define   CHMU_INST0_CAP1_EPOCH_SUP			BIT_ULL(48)
+#define   CHMU_INST0_CAP1_ALWAYS_ON_SUP			BIT_ULL(49)
+#define   CHMU_INST0_CAP1_RAND_DOWNSAMP_SUP		BIT_ULL(50)
+#define   CHMU_INST0_CAP1_ADDR_OVERLAP_SUP		BIT_ULL(51)
+#define   CHMU_INST0_CAP1_POSTPONED_ON_OVRFLOW_SUP	BIT_ULL(52)
+
+/*
+ * In CXL r3.2 all defined as part of single giant CAP register.
+ * Where a whole 64 bits is in one field just name after the field.
+ */
+#define CHMU_INST0_RANGE_BITMAP_OFFSET_REG		0x10
+#define CHMU_INST0_HOTLIST_OFFSET_REG			0x18
+
+#define CHMU_INST0_CFG0_REG				0x40
+#define   CHMU_INST0_CFG0_WHAT_MSK			GENMASK(7, 0)
+#define      CHMU_INST0_CFG0_WHAT_NONTEE_R		1
+#define      CHMU_INST0_CFG0_WHAT_NONTEE_W		2
+#define      CHMU_INST0_CFG0_WHAT_NONTEE_RW		3
+#define      CHMU_INST0_CFG0_WHAT_R			4
+#define      CHMU_INST0_CFG0_WHAT_W			5
+#define      CHMU_INST0_CFG0_WHAT_RW			6
+#define   CHMU_INST0_CFG0_RAND_DOWNSAMP_EN		BIT(8)
+#define   CHMU_INST0_CFG0_OVRFLW_INT_EN			BIT(9)
+#define   CHMU_INST0_CFG0_FILLTHRESH_INT_EN		BIT(10)
+#define   CHMU_INST0_CFG0_ENABLE			BIT(16)
+#define   CHMU_INST0_CFG0_RESET_COUNTERS		BIT(17)
+#define   CHMU_INST0_CFG0_HOTNESS_THRESH_MSK		GENMASK_ULL(63, 32)
+#define CHMU_INST0_CFG1_REG				0x48
+#define   CHMU_INST0_CFG1_UNIT_SIZE_MSK			GENMASK(31, 0)
+#define   CHMU_INST0_CFG1_DS_FACTOR_MSK			GENMASK(35, 32)
+#define   CHMU_INST0_CFG1_MODE_MSK			GENMASK_ULL(47, 40)
+#define   CHMU_INST0_CFG1_EPOCH_SCALE_MSK		GENMASK_ULL(51, 48)
+#define   CHMU_INST0_CFG1_EPOCH_MULT_MSK		GENMASK_ULL(63, 52)
+#define CHMU_INST0_CFG2_REG				0x50
+#define   CHMU_INST0_CFG2_FILLTHRESH_THRESHOLD_MSK	GENMASK(15, 0)
+
+#define CHMU_INST0_STATUS_REG				0x60
+#define   CHMU_INST0_STATUS_ENABLED			BIT(0)
+#define   CHMU_INST0_STATUS_OP_INPROG_MSK		GENMASK(31, 16)
+#define     CHMU_INST0_STATUS_OP_INPROG_NONE		0
+#define     CHMU_INST0_STATUS_OP_INPROG_ENABLE		1
+#define     CHMU_INST0_STATUS_OP_INPROG_DISABLE		2
+#define     CHMU_INST0_STATUS_OP_INPROG_RESET		3
+#define   CHMU_INST0_STATUS_COUNTER_WIDTH_MSK		GENMASK_ULL(39, 32)
+#define   CHMU_INST0_STATUS_OVRFLW			BIT_ULL(40)
+#define   CHMU_INST0_STATUS_FILLTHRESH			BIT_ULL(41)
+
+/* 2 byte registers */
+#define CHMU_INST0_HEAD_REG				0x68
+#define CHMU_INST0_TAIL_REG				0x6A
+
+/* CFG attribute bit mappings */
+#define CXL_HMU_ATTR_CONFIG_EPOCH_TYPE_MASK GENMASK(1, 0)
+#define CXL_HMU_ATTR_CONFIG_ACCESS_TYPE_MASK GENMASK(9, 2)
+#define CXL_HMU_ATTR_CONFIG_EPOCH_SCALE_MASK GENMASK(13, 10)
+#define CXL_HMU_ATTR_CONFIG_EPOCH_MULT_MASK GENMASK(25, 14)
+#define CXL_HMU_ATTR_CONFIG_RANDOM_DS_MASK BIT(26)
+#define CXL_HMU_ATTR_CONFIG_DS_FACTOR_MASK GENMASK_ULL(34, 27)
+
+#define CXL_HMU_ATTR_CONFIG1_HOTNESS_THRESH_MASK GENMASK(31, 0)
+#define CXL_HMU_ATTR_CONFIG1_HOTNESS_GRANUAL_MASK GENMASK_ULL(63, 32)
+
+/* In multiples of 256MiB */
+#define CXL_HMU_ATTR_CONFIG2_DPA_BASE_MASK GENMASK(31, 0)
+#define CXL_HMU_ATTR_CONFIG2_DPA_SIZE_MASK GENMASK_ULL(63, 32)
+
+/* Range bitmap registers at offset 0x10 + Range Config Bitmap offset */
+/* Hotlist registers at offset 0x10 + Hotlist Register offset */
+static int cxl_hmu_cpuhp_state_num;
+
+enum cxl_hmu_reporting_mode {
+	CHMU_MODE_EPOCH = 0,
+	CHMU_MODE_ALWAYS_ON = 1,
+};
+
+struct cxl_hmu_info {
+	struct pmu pmu;
+	struct perf_output_handle handle;
+	void __iomem *base;
+	struct hlist_node node;
+	int irq;
+	int on_cpu;
+	u32 hot_thresh;
+	u32 hot_gran; /* power of 2, 256 to 2 GiB */
+	/* For now use a range rather than a bitmap, chunks of 256MiB */
+	u32 range_base;
+	u32 range_num;
+	enum cxl_hmu_reporting_mode reporting_mode;
+	u8 m2s_requests_to_track;
+	u8 ds_factor_pow2;
+	u8 epoch_scale;
+	u16 epoch_mult;
+	bool randomized_ds;
+
+	/* Protect both the device state for RMW and the pmu state */
+	spinlock_t lock;
+};
+
+#define pmu_to_cxl_hmu(p) container_of(p, struct cxl_hmu_info, pmu)
+
+/* descriptor for the aux buffer */
+struct cxl_hmu_buf {
+	size_t length;
+	int nr_pages;
+	void *base;
+	long pos;
+};
+
+static ssize_t cpumask_show(struct device *dev, struct device_attribute *attr,
+			    char *buf)
+{
+	struct cxl_hmu_info *hmu = dev_get_drvdata(dev);
+
+	return cpumap_print_to_pagebuf(true, buf, cpumask_of(hmu->on_cpu));
+}
+static DEVICE_ATTR_RO(cpumask);
+
+static struct attribute *cxl_hmu_cpumask_attrs[] = {
+	&dev_attr_cpumask.attr,
+	NULL
+};
+
+static const struct attribute_group cxl_hmu_cpumask_attr_group = {
+	.attrs = cxl_hmu_cpumask_attrs,
+};
+
+/* Sized fields to future proof based on space in spec */
+PMU_FORMAT_ATTR(epoch_type, "config:0-1"); /* 2 bits to future proof */
+PMU_FORMAT_ATTR(access_type, "config:2-9");
+PMU_FORMAT_ATTR(epoch_scale, "config:10-13");
+PMU_FORMAT_ATTR(epoch_multiplier, "config:14-25");
+PMU_FORMAT_ATTR(randomized_downsampling, "config:26-26");
+PMU_FORMAT_ATTR(downsampling_factor, "config:27-34");
+
+PMU_FORMAT_ATTR(hotness_threshold, "config1:0-31");
+PMU_FORMAT_ATTR(hotness_granual, "config1:32-63");
+
+/* RFC this is a bitmap can we control it better? */
+PMU_FORMAT_ATTR(range_base, "config2:0-31");
+PMU_FORMAT_ATTR(range_size, "config2:32-63");
+static struct attribute *cxl_hmu_format_attrs[] = {
+	&format_attr_epoch_type.attr,
+	&format_attr_access_type.attr,
+	&format_attr_epoch_scale.attr,
+	&format_attr_epoch_multiplier.attr,
+	&format_attr_randomized_downsampling.attr,
+	&format_attr_downsampling_factor.attr,
+	&format_attr_hotness_threshold.attr,
+	&format_attr_hotness_granual.attr,
+	&format_attr_range_base.attr,
+	&format_attr_range_size.attr,
+	NULL
+};
+
+static struct attribute_group cxl_hmu_format_attr_group = {
+	.name = "format",
+	.attrs = cxl_hmu_format_attrs,
+};
+
+static const struct attribute_group *cxl_hmu_groups[] = {
+	&cxl_hmu_cpumask_attr_group,
+	&cxl_hmu_format_attr_group,
+	NULL
+};
+
+static int cxl_hmu_event_init(struct perf_event *event)
+{
+	struct cxl_hmu_info *hmu = pmu_to_cxl_hmu(event->pmu);
+	struct device *dev = event->pmu->dev;
+	u32 gran_sup;
+	u16 ds_sup;
+	u64 cap0, cap1;
+	u64 epoch_min, epoch_max, epoch;
+	u64 hotlist_offset = readq(hmu->base + CHMU_INST0_HOTLIST_OFFSET_REG);
+	u64 bitmap_offset = readq(hmu->base + CHMU_INST0_RANGE_BITMAP_OFFSET_REG);
+
+	if (event->attr.type != hmu->pmu.type)
+		return -ENOENT;
+
+	if (event->cpu < 0) {
+		dev_info(dev, "Per-task mode not supported\n");
+		return -EOPNOTSUPP;
+	}
+
+	if (event->attach_state & PERF_ATTACH_TASK)
+		return -EOPNOTSUPP;
+
+	cap0 = readq(hmu->base + CHMU_INST0_CAP0_REG);
+	cap1 = readq(hmu->base + CHMU_INST0_CAP1_REG);
+
+	switch (FIELD_GET(CXL_HMU_ATTR_CONFIG_EPOCH_TYPE_MASK,
+		event->attr.config)) {
+	case 0:
+		if (!FIELD_GET(CHMU_INST0_CAP1_EPOCH_SUP, cap1))
+			return -EOPNOTSUPP;
+		hmu->reporting_mode = CHMU_MODE_EPOCH;
+		break;
+	case 1:
+		if (!FIELD_GET(CHMU_INST0_CAP1_ALWAYS_ON_SUP, cap1))
+			return -EOPNOTSUPP;
+		hmu->reporting_mode = CHMU_MODE_ALWAYS_ON;
+		break;
+	default:
+		dev_dbg(dev, "Tried for a non existent type\n");
+		return -EINVAL;
+	}
+	hmu->randomized_ds = FIELD_GET(CXL_HMU_ATTR_CONFIG_RANDOM_DS_MASK,
+				      event->attr.config);
+	if (hmu->randomized_ds && !FIELD_GET(CHMU_INST0_CAP1_RAND_DOWNSAMP_SUP, cap1)) {
+		dev_info(dev, "Randomized downsampling not supported\n");
+		return -EOPNOTSUPP;
+	}
+
+	/* RFC: sanity check against currently defined or not? */
+	hmu->m2s_requests_to_track = FIELD_GET(CXL_HMU_ATTR_CONFIG_ACCESS_TYPE_MASK,
+					       event->attr.config);
+	if (hmu->m2s_requests_to_track < CHMU_INST0_CFG0_WHAT_NONTEE_R ||
+	    hmu->m2s_requests_to_track > CHMU_INST0_CFG0_WHAT_RW) {
+		dev_dbg(dev, "Requested a reserved type to track\n");
+		return -EINVAL;
+	}
+
+	hmu->hot_thresh = FIELD_GET(CXL_HMU_ATTR_CONFIG1_HOTNESS_THRESH_MASK,
+					   event->attr.config1);
+
+	hmu->hot_gran = FIELD_GET(CXL_HMU_ATTR_CONFIG1_HOTNESS_GRANUAL_MASK,
+					 event->attr.config1);
+
+	gran_sup = FIELD_GET(CHMU_INST0_CAP1_UNIT_SIZE_MSK, cap1);
+	/* Default to smallest granual if not specified; ffs() is 1-based */
+	if (hmu->hot_gran == 0 && gran_sup)
+		hmu->hot_gran = 8 + ffs(gran_sup) - 1;
+
+	if (hmu->hot_gran < 8) {
+		dev_dbg(dev, "Granual less than 256 bytes, not valid in CXL 3.2\n");
+		return -EINVAL;
+	}
+
+	if (!((1 << (hmu->hot_gran - 8)) & gran_sup)) {
+		dev_dbg(dev, "Granual %d not supported, supported mask %x\n",
+			hmu->hot_gran - 8, gran_sup);
+		return -EOPNOTSUPP;
+	}
+
+	ds_sup = FIELD_GET(CHMU_INST0_CAP1_DOWNSAMP_MSK, cap1);
+	hmu->ds_factor_pow2 = FIELD_GET(CXL_HMU_ATTR_CONFIG_DS_FACTOR_MASK,
+					event->attr.config);
+	if (!((1 << hmu->ds_factor_pow2) & ds_sup)) {
+		/* Special case default of 0 if not supported as smallest DS possible */
+		if (hmu->ds_factor_pow2 == 0 && ds_sup) {
+			hmu->ds_factor_pow2 = ffs(ds_sup) - 1;
+			dev_dbg(dev, "Downsampling set to default min of %d\n",
+				hmu->ds_factor_pow2);
+		} else {
+			dev_dbg(dev, "Downsampling %d not supported, supported mask %x\n",
+				hmu->ds_factor_pow2, ds_sup);
+			return -EOPNOTSUPP;
+		}
+	}
+
+	hmu->epoch_scale = FIELD_GET(CXL_HMU_ATTR_CONFIG_EPOCH_SCALE_MASK,
+				     event->attr.config);
+
+	hmu->epoch_mult = FIELD_GET(CXL_HMU_ATTR_CONFIG_EPOCH_MULT_MASK,
+				    event->attr.config);
+
+	/* Default to what? */
+	if (hmu->epoch_mult == 0 && hmu->epoch_scale == 0) {
+		hmu->epoch_scale = FIELD_GET(CHMU_INST0_CAP0_EPOCH_MIN_SCALE_MSK, cap0);
+		hmu->epoch_mult = FIELD_GET(CHMU_INST0_CAP0_EPOCH_MIN_MULT_MSK, cap0);
+	}
+	if (hmu->epoch_mult == 0)
+		return -EINVAL;
+
+	/* Units of 100ms */
+	epoch_min = int_pow(10, FIELD_GET(CHMU_INST0_CAP0_EPOCH_MIN_SCALE_MSK, cap0)) *
+		(u64)FIELD_GET(CHMU_INST0_CAP0_EPOCH_MIN_MULT_MSK, cap0);
+	epoch_max = int_pow(10, FIELD_GET(CHMU_INST0_CAP0_EPOCH_MAX_SCALE_MSK, cap0)) *
+		(u64)FIELD_GET(CHMU_INST0_CAP0_EPOCH_MAX_MULT_MSK, cap0);
+	epoch = int_pow(10, hmu->epoch_scale) * (u64)hmu->epoch_mult;
+
+	if (epoch > epoch_max || epoch < epoch_min) {
+		dev_dbg(dev, "epoch %llu out of range, max %llu min %llu\n",
+			epoch, epoch_max, epoch_min);
+		return -EINVAL;
+	}
+
+	hmu->range_base = FIELD_GET(CXL_HMU_ATTR_CONFIG2_DPA_BASE_MASK,
+				    event->attr.config2);
+	hmu->range_num = FIELD_GET(CXL_HMU_ATTR_CONFIG2_DPA_SIZE_MASK,
+				   event->attr.config2);
+
+	if (hmu->range_num == 0) {
+		/* Set a default of 'everything' */
+		hmu->range_num = (hotlist_offset - bitmap_offset) * 8;
+	}
+	/* TODO - pass in better DPA range info from parent driver */
+	if ((u64)hmu->range_base + hmu->range_num >
+	    (hotlist_offset - bitmap_offset) * 8) {
+		dev_dbg(dev, "Requested range that this HMU can't track. Can track 0x%llx, asked for 0x%x to 0x%x\n",
+			(hotlist_offset - bitmap_offset) * 8,
+			hmu->range_base, hmu->range_base + hmu->range_num);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int cxl_hmu_update_aux(struct cxl_hmu_info *hmu, bool stop)
+{
+	struct perf_output_handle *handle = &hmu->handle;
+	struct cxl_hmu_buf *buf = perf_get_aux(handle);
+	struct perf_event *event = handle->event;
+	size_t size = 0;
+	size_t tocopy, tocopy2;
+
+	u64 offset = readq(hmu->base + CHMU_INST0_HOTLIST_OFFSET_REG);
+	u16 head = readw(hmu->base + CHMU_INST0_HEAD_REG);
+	u16 tail = readw(hmu->base + CHMU_INST0_TAIL_REG);
+	u8 count_width = FIELD_GET(CHMU_INST0_STATUS_COUNTER_WIDTH_MSK,
+				   readq(hmu->base + CHMU_INST0_STATUS_REG));
+	u16 top = FIELD_GET(CHMU_INST0_CAP0_HOTLIST_SIZE_MSK,
+			    readq(hmu->base + CHMU_INST0_CAP0_REG));
+	/* 16 bytes of header - arbitrary choice! */
+#define CHMU_HEADER0_SIZE_MASK GENMASK(15, 0)
+#define CHMU_HEADER0_COUNT_WIDTH GENMASK(23, 16)
+	u64 header[2];
+
+	if (tail > head) {
+		tocopy = min_t(size_t, (tail - head) * 8,
+			       buf->length - buf->pos - sizeof(header));
+		header[0] = FIELD_PREP(CHMU_HEADER0_SIZE_MASK, tocopy / 8) |
+			    FIELD_PREP(CHMU_HEADER0_COUNT_WIDTH, count_width);
+		header[1] = 0xDEADBEEF;
+		if (tocopy) {
+			memcpy(buf->base + buf->pos, header, sizeof(header));
+			size += sizeof(header);
+			buf->pos += sizeof(header);
+			memcpy_fromio(buf->base + buf->pos,
+				      hmu->base + offset + head * 8, tocopy);
+			size += tocopy;
+			buf->pos += tocopy;
+		}
+
+	} else if (tail < head) { /* wrap around */
+		tocopy = min_t(size_t, (top - head) * 8,
+			       buf->length - buf->pos - sizeof(header));
+		tocopy2 = min_t(size_t, tail * 8,
+				buf->length - buf->pos - tocopy - sizeof(header));
+		header[0] = FIELD_PREP(CHMU_HEADER0_SIZE_MASK, (tocopy + tocopy2) / 8) |
+			    FIELD_PREP(CHMU_HEADER0_COUNT_WIDTH, count_width);
+		header[1] = 0xDEADBEEF;
+		if (tocopy) {
+			memcpy(buf->base + buf->pos, header, sizeof(header));
+			size += sizeof(header);
+			buf->pos += sizeof(header);
+			memcpy_fromio(buf->base + buf->pos,
+				      hmu->base + offset + head * 8, tocopy);
+			size += tocopy;
+			buf->pos += tocopy;
+
+		}
+
+		if (tocopy2) {
+			memcpy_fromio(buf->base + buf->pos,
+				      hmu->base + offset, tocopy2);
+			size += tocopy2;
+			buf->pos += tocopy2;
+		}
+	} /* may be no data */
+
+	perf_aux_output_end(handle, size);
+	if (buf->pos == buf->length)
+		return -EINVAL; /* FULL */
+
+	/* Update head only after the space check so the device cannot overwrite unread data */
+	writew(tail, hmu->base + CHMU_INST0_HEAD_REG);
+
+	if (!stop) {
+		buf = perf_aux_output_begin(handle, event);
+		if (!buf)
+			return -EINVAL;
+		buf->pos = handle->head % buf->length;
+	}
+	return 0;
+}
+
+static int __cxl_hmu_start(struct perf_event *event, int flags)
+{
+	struct cxl_hmu_info *hmu = pmu_to_cxl_hmu(event->pmu);
+	struct hw_perf_event *hwc = &event->hw;
+	struct device *dev = event->pmu->dev;
+	struct cxl_hmu_buf *buf;
+	int cpu = event->cpu;
+	u64 val, status, bitmap_base;
+	int ret, i;
+	u16 list_len = FIELD_GET(CHMU_INST0_CAP0_HOTLIST_SIZE_MSK,
+				 readq(hmu->base + CHMU_INST0_CAP0_REG));
+
+	hwc->state = 0;
+	status = readq(hmu->base + CHMU_INST0_STATUS_REG);
+	if (FIELD_GET(CHMU_INST0_STATUS_ENABLED, status)) {
+		dev_dbg(dev, "trace already started\n");
+		return -EBUSY;
+	}
+	/* TODO: Figure out what to do as very likely this is shared
+	 *  - Hopefully only with other HMU instances
+	 */
+	ret = irq_set_affinity(hmu->irq, cpumask_of(cpu));
+	if (ret)
+		dev_warn(dev, "failed to set affinity of HMU interrupt\n");
+
+	hmu->on_cpu = cpu;
+
+	buf = perf_aux_output_begin(&hmu->handle, event);
+	if (!buf) {
+		dev_dbg(event->pmu->dev, "aux output begin failed\n");
+		return -EINVAL;
+	}
+
+	buf->pos = hmu->handle.head % buf->length;
+	/* Reset here disrupts sampling with -F, should we avoid doing so? */
+	writeq(FIELD_PREP(CHMU_INST0_CFG0_RESET_COUNTERS, 1),
+		hmu->base + CHMU_INST0_CFG0_REG);
+
+	ret = readq_poll_timeout_atomic(hmu->base + CHMU_INST0_STATUS_REG, status,
+		(FIELD_GET(CHMU_INST0_STATUS_OP_INPROG_MSK, status) == 0),
+		10, 100000);
+	if (ret) {
+		dev_dbg(event->pmu->dev, "Reset timed out\n");
+		return ret;
+	}
+	/* Set up what is being captured */
+	/* Type of capture, granularity etc */
+
+	val = FIELD_PREP(CHMU_INST0_CFG1_UNIT_SIZE_MSK, hmu->hot_gran) |
+		FIELD_PREP(CHMU_INST0_CFG1_DS_FACTOR_MSK, hmu->ds_factor_pow2) |
+		FIELD_PREP(CHMU_INST0_CFG1_MODE_MSK, hmu->reporting_mode) |
+		FIELD_PREP(CHMU_INST0_CFG1_EPOCH_SCALE_MSK, hmu->epoch_scale) |
+		FIELD_PREP(CHMU_INST0_CFG1_EPOCH_MULT_MSK, hmu->epoch_mult);
+	writeq(val, hmu->base + CHMU_INST0_CFG1_REG);
+
+	val = 0;
+	bitmap_base = readq(hmu->base + CHMU_INST0_RANGE_BITMAP_OFFSET_REG);
+	for (i = hmu->range_base; i < hmu->range_base + hmu->range_num; i++) {
+		val |= BIT(i % 64);
+		if (i % 64 == 63) {
+			writeq(val, hmu->base + bitmap_base + (i / 64) * 8);
+			val = 0;
+		}
+	}
+	/* Potential duplicate write that doesn't matter */
+	writeq(val, hmu->base + bitmap_base + (i / 64) * 8);
+
+	/* Set notification threshold to half of buffer */
+	val = FIELD_PREP(CHMU_INST0_CFG2_FILLTHRESH_THRESHOLD_MSK,
+			 list_len / 2);
+	writeq(val, hmu->base + CHMU_INST0_CFG2_REG);
+
+	/*
+	 * RFC: Only after granule is set can the width be known - so can only check here,
+	 * or program granule size earlier just to see if it will work here.
+	 */
+	status = readq(hmu->base + CHMU_INST0_STATUS_REG);
+	if (hmu->hot_thresh >= (1ULL << (64 - FIELD_GET(CHMU_INST0_STATUS_COUNTER_WIDTH_MSK, status))))
+		return -EINVAL;
+	/* Start the unit up */
+	val = FIELD_PREP(CHMU_INST0_CFG0_WHAT_MSK, hmu->m2s_requests_to_track) |
+		FIELD_PREP(CHMU_INST0_CFG0_RAND_DOWNSAMP_EN,
+			   hmu->randomized_ds ? 1 : 0) |
+		FIELD_PREP(CHMU_INST0_CFG0_OVRFLW_INT_EN, 1) |
+		FIELD_PREP(CHMU_INST0_CFG0_FILLTHRESH_INT_EN, 1) |
+		FIELD_PREP(CHMU_INST0_CFG0_ENABLE, 1) |
+		FIELD_PREP(CHMU_INST0_CFG0_HOTNESS_THRESH_MSK, hmu->hot_thresh);
+	writeq(val, hmu->base + CHMU_INST0_CFG0_REG);
+
+	/* Poll status register for enablement to complete */
+	ret = readq_poll_timeout_atomic(hmu->base + CHMU_INST0_STATUS_REG, status,
+		(FIELD_GET(CHMU_INST0_STATUS_OP_INPROG_MSK, status) == 0),
+		10, 100000);
+	if (ret) {
+		dev_info(event->pmu->dev, "Enable timed out\n");
+		return ret;
+	}
+
+	return 0;
+}
+
+static void cxl_hmu_start(struct perf_event *event, int flags)
+{
+	struct cxl_hmu_info *hmu = pmu_to_cxl_hmu(event->pmu);
+	int ret;
+
+	guard(spinlock)(&hmu->lock);
+
+	ret = __cxl_hmu_start(event, flags);
+	if (ret)
+		event->hw.state |= PERF_HES_STOPPED;
+}
+
+static void cxl_hmu_stop(struct perf_event *event, int flags)
+{
+	struct cxl_hmu_info *hmu = pmu_to_cxl_hmu(event->pmu);
+	struct hw_perf_event *hwc = &event->hw;
+	u64 status, val;
+	int ret;
+
+	if (hwc->state & PERF_HES_STOPPED)
+		return;
+
+	guard(spinlock)(&hmu->lock);
+	status = readq(hmu->base + CHMU_INST0_STATUS_REG);
+	if (FIELD_GET(CHMU_INST0_STATUS_ENABLED, status)) {
+		/* Stop the HMU instance */
+		val = readq(hmu->base + CHMU_INST0_CFG0_REG);
+		val &= ~CHMU_INST0_CFG0_ENABLE;
+		writeq(val, hmu->base + CHMU_INST0_CFG0_REG);
+
+		ret = readq_poll_timeout_atomic(hmu->base + CHMU_INST0_STATUS_REG, status,
+			(FIELD_GET(CHMU_INST0_STATUS_OP_INPROG_MSK, status) == 0),
+			10, 100000);
+		if (ret) {
+			dev_info(event->pmu->dev, "Disable timed out\n");
+			return;
+		}
+
+		cxl_hmu_update_aux(hmu, true);
+	}
+
+}
+static void cxl_hmu_read(struct perf_event *event)
+{
+	/* Nothing to do */
+}
+
+static int cxl_hmu_add(struct perf_event *event, int flags)
+{
+	struct hw_perf_event *hwc = &event->hw;
+
+	hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
+	if (flags & PERF_EF_START) {
+		cxl_hmu_start(event, PERF_EF_RELOAD);
+		if (hwc->state & PERF_HES_STOPPED)
+			return -EINVAL;
+	}
+	return 0;
+}
+
+/*
+ * There is a lot to do in here, but using a thread is not
+ * currently possible for a perf PMU driver.
+ */
+static irqreturn_t cxl_hmu_irq(int irq, void *data)
+{
+	struct cxl_hmu_info *info = data;
+	u64 status;
+	int ret;
+
+	status = readq(info->base + CHMU_INST0_STATUS_REG);
+	if (!FIELD_GET(CHMU_INST0_STATUS_OVRFLW, status) &&
+	    !FIELD_GET(CHMU_INST0_STATUS_FILLTHRESH, status))
+		return IRQ_NONE;
+
+	ret = cxl_hmu_update_aux(info, false);
+	if (ret)
+		dev_err(info->pmu.dev, "interrupt update failed\n");
+
+	/*
+	 * They are level interrupts so should trigger on next fill
+	 * hence should be no problem with races.
+	 */
+	writeq(status, info->base + CHMU_INST0_STATUS_REG);
+
+	return IRQ_HANDLED;
+}
+
+static void cxl_hmu_del(struct perf_event *event, int flags)
+{
+	cxl_hmu_stop(event, PERF_EF_UPDATE);
+}
+
+static void *cxl_hmu_setup_aux(struct perf_event *event, void **pages,
+			int nr_pages, bool overwrite)
+{
+	int i;
+
+	if (overwrite) {
+		dev_warn(event->pmu->dev, "Overwrite mode is not supported\n");
+		return NULL;
+	}
+
+	if (nr_pages < 1)
+		return NULL;
+
+	struct cxl_hmu_buf *buf __free(kfree) =
+		kzalloc(sizeof(*buf), GFP_KERNEL);
+	if (!buf)
+		return NULL;
+
+	struct page **pagelist __free(kfree) =
+		kcalloc(nr_pages, sizeof(*pagelist), GFP_KERNEL);
+	if (!pagelist)
+		return NULL;
+
+	for (i = 0; i < nr_pages; i++)
+		pagelist[i] = virt_to_page(pages[i]);
+
+	buf->base = vmap(pagelist, nr_pages, VM_MAP, PAGE_KERNEL);
+	if (!buf->base)
+		return NULL;
+
+	buf->nr_pages = nr_pages;
+	buf->length = nr_pages * PAGE_SIZE;
+	buf->pos = 0;
+
+	return_ptr(buf);
+}
+
+static void cxl_hmu_free_aux(void *aux)
+{
+	struct cxl_hmu_buf *buf = aux;
+
+	vunmap(buf->base);
+	kfree(buf);
+}
+
+static void cxl_hmu_perf_unregister(void *_info)
+{
+	struct cxl_hmu_info *info = _info;
+
+	perf_pmu_unregister(&info->pmu);
+}
+
+static void cxl_hmu_cpuhp_remove(void *_info)
+{
+	struct cxl_hmu_info *info = _info;
+
+	cpuhp_state_remove_instance_nocalls(cxl_hmu_cpuhp_state_num,
+					    &info->node);
+}
+
+static int cxl_hmu_probe(struct device *dev)
+{
+	struct pci_dev *pdev = to_pci_dev(dev->parent);
+	struct cxl_hmu *hmu = to_cxl_hmu(dev);
+	int i, rc;
+
+	int num_inst = FIELD_GET(CHMU_COMMON_CAP0_NUMINST_MSK,
+				 readq(hmu->base + CHMU_COMMON_CAP0_REG));
+	int inst_len = FIELD_GET(CHMU_COMMON_CAP1_INSTLEN_MSK,
+				 readq(hmu->base + CHMU_COMMON_CAP1_REG));
+
+	for (i = 0; i < num_inst; i++) {
+		struct cxl_hmu_info *info;
+		char *pmu_name;
+		int msg_num;
+		u64 val;
+
+		info = devm_kzalloc(dev, sizeof(*info), GFP_KERNEL);
+		if (!info)
+			return -ENOMEM;
+
+		dev_set_drvdata(dev, info);
+		info->on_cpu = -1;
+		info->base = hmu->base + 0x10 + inst_len * i;
+
+		val = readq(info->base + CHMU_INST0_CAP0_REG);
+		msg_num = FIELD_GET(CHMU_INST0_CAP0_MSI_N_MSK, val);
+
+		/* TODO add polling support - for now require threshold */
+		if (!FIELD_GET(CHMU_INST0_CAP0_FILLTHRESH_CAP, val)) {
+			devm_kfree(dev, info);
+			continue;
+		}
+
+		spin_lock_init(&info->lock);
+
+		pmu_name = devm_kasprintf(dev, GFP_KERNEL,
+					  "cxl_hmu_mem%d.%d.%d",
+					  hmu->assoc_id, hmu->index, i);
+		if (!pmu_name)
+			return -ENOMEM;
+
+		info->pmu = (struct pmu) {
+			.name = pmu_name,
+			.parent = dev,
+			.module = THIS_MODULE,
+			.capabilities = PERF_PMU_CAP_EXCLUSIVE |
+					PERF_PMU_CAP_NO_EXCLUDE,
+			.task_ctx_nr = perf_sw_context,
+			.attr_groups = cxl_hmu_groups,
+			.event_init = cxl_hmu_event_init,
+			.setup_aux = cxl_hmu_setup_aux,
+			.free_aux = cxl_hmu_free_aux,
+			.start = cxl_hmu_start,
+			.stop = cxl_hmu_stop,
+			.add = cxl_hmu_add,
+			.del = cxl_hmu_del,
+			.read = cxl_hmu_read,
+		};
+		rc = pci_irq_vector(pdev, msg_num);
+		if (rc < 0)
+			return rc;
+		info->irq = rc;
+
+		/*
+		 * Whilst there is a 'strong' recommendation that the interrupt
+		 * should not be shared it is not a requirement.
+		 * Can we support IRQF_SHARED on a PMU?
+		 */
+		rc = devm_request_irq(dev, info->irq, cxl_hmu_irq,
+				      IRQF_NO_THREAD | IRQF_NOBALANCING,
+				      pmu_name, info);
+		if (rc)
+			return rc;
+
+		rc = cpuhp_state_add_instance(cxl_hmu_cpuhp_state_num,
+					      &info->node);
+		if (rc)
+			return rc;
+
+		rc = devm_add_action_or_reset(dev, cxl_hmu_cpuhp_remove, info);
+		if (rc)
+			return rc;
+
+		rc = perf_pmu_register(&info->pmu, info->pmu.name, -1);
+		if (rc)
+			return rc;
+
+		rc = devm_add_action_or_reset(dev, cxl_hmu_perf_unregister,
+					      info);
+		if (rc)
+			return rc;
+	}
+	return 0;
+}
+
+static struct cxl_driver cxl_hmu_driver = {
+	.name = "cxl_hmu",
+	.probe = cxl_hmu_probe,
+	.id = CXL_DEVICE_HMU,
+};
+
+static int cxl_hmu_online_cpu(unsigned int cpu, struct hlist_node *node)
+{
+	struct cxl_hmu_info *info =
+		hlist_entry_safe(node, struct cxl_hmu_info, node);
+
+	if (info->on_cpu != -1)
+		return 0;
+
+	info->on_cpu = cpu;
+
+	WARN_ON(irq_set_affinity(info->irq, cpumask_of(cpu)));
+
+	return 0;
+}
+
+static int cxl_hmu_offline_cpu(unsigned int cpu, struct hlist_node *node)
+{
+	struct cxl_hmu_info *info =
+		hlist_entry_safe(node, struct cxl_hmu_info, node);
+	unsigned int target;
+
+	if (info->on_cpu != cpu)
+		return 0;
+
+	info->on_cpu = -1;
+	target = cpumask_any_but(cpu_online_mask, cpu);
+	if (target >= nr_cpu_ids) {
+		dev_err(info->pmu.dev, "Unable to find a suitable CPU\n");
+		return 0;
+	}
+
+	perf_pmu_migrate_context(&info->pmu, cpu, target);
+	info->on_cpu = target;
+	/*
+	 * CPU HP lock is held so we should be guaranteed that this CPU hasn't
+	 * yet gone away.
+	 */
+	WARN_ON(irq_set_affinity(info->irq, cpumask_of(target)));
+	return 0;
+}
+
+static __init int cxl_hmu_init(void)
+{
+	int rc;
+
+	rc = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
+				     "AP_PERF_CXL_HMU_ONLINE",
+				     cxl_hmu_online_cpu, cxl_hmu_offline_cpu);
+	if (rc < 0)
+		return rc;
+	cxl_hmu_cpuhp_state_num = rc;
+
+	rc = cxl_driver_register(&cxl_hmu_driver);
+	if (rc)
+		cpuhp_remove_multi_state(cxl_hmu_cpuhp_state_num);
+
+	return rc;
+}
+
+static __exit void cxl_hmu_exit(void)
+{
+	cxl_driver_unregister(&cxl_hmu_driver);
+	cpuhp_remove_multi_state(cxl_hmu_cpuhp_state_num);
+}
+
+MODULE_AUTHOR("Jonathan Cameron <Jonathan.Cameron@huawei.com>");
+MODULE_DESCRIPTION("CXL Hotness Monitor Driver");
+MODULE_LICENSE("GPL");
+MODULE_IMPORT_NS(CXL);
+module_init(cxl_hmu_init);
+module_exit(cxl_hmu_exit);
+MODULE_ALIAS_CXL(CXL_DEVICE_HMU);
-- 
2.43.0
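
The head/tail handling in cxl_hmu_update_aux() above can be hard to follow
inline, so here is a minimal userspace sketch of just the index arithmetic.
It is not driver code: the function and struct names are hypothetical, and
it assumes head and tail are both slot indexes strictly below the hotlist
size ("top"), with head == tail meaning there is nothing to read.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical type: one contiguous run of unread hotlist slots. */
struct ring_span {
	size_t first;	/* index of the first slot in the run */
	size_t count;	/* number of slots in the run */
};

/*
 * Split the unread region [head, tail) of a ring with 'top' slots into
 * at most two contiguous runs. Returns how many runs were filled in.
 */
static int ring_unread_spans(uint16_t head, uint16_t tail, uint16_t top,
			     struct ring_span spans[2])
{
	/* Assumption of this sketch: both indexes are valid slot numbers. */
	assert(head < top && tail < top);

	if (tail == head)
		return 0;	/* may be no data */

	if (tail > head) {
		/* No wrap: a single run from head up to tail. */
		spans[0] = (struct ring_span){ head, (size_t)(tail - head) };
		return 1;
	}

	/* Wrapped: head to the end of the ring, then slot 0 up to tail. */
	spans[0] = (struct ring_span){ head, (size_t)(top - head) };
	if (tail == 0)
		return 1;
	spans[1] = (struct ring_span){ 0, tail };
	return 2;
}
```

Each slot is 8 bytes in the driver, so a run maps to a copy of count * 8
bytes starting at hotlist offset + first * 8. The sketch leaves out the
clamping against the remaining AUX buffer space that the driver also does.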



* [RFC PATCH 3/4] perf: Add support for CXL Hotness Monitoring Units (CHMU)
  2024-11-21 10:18 [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver Jonathan Cameron
  2024-11-21 10:18 ` [RFC PATCH 1/4] cxl: Register devices for CXL Hotness Monitoring Units (CHMU) Jonathan Cameron
  2024-11-21 10:18 ` [RFC PATCH 2/4] cxl: Hotness Monitoring Unit via a Perf AUX Buffer Jonathan Cameron
@ 2024-11-21 10:18 ` Jonathan Cameron
  2025-08-08  8:29   ` wangyuquan
  2024-11-21 10:18 ` [RFC PATCH 4/4] hwtrace: Document CXL Hotness Monitoring Unit driver Jonathan Cameron
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 27+ messages in thread
From: Jonathan Cameron @ 2024-11-21 10:18 UTC (permalink / raw)
  To: linux-cxl, linux-mm, linux-perf-users, linux-kernel
  Cc: linuxarm, tongtiangen, Yicong Yang, Niyas Sait, ajayjoshi,
	Vandana Salve, Davidlohr Bueso, Dave Jiang, Alison Schofield,
	Ira Weiny, Dan Williams, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Gregory Price, Huang Ying

Based closely on existing support for hisi_ptt.
Provides basic record and report --dump-raw-traces support.

Example output.  With a counter_width of 16 (0x10) the least significant
2 bytes (4 hex digits) are the counter value and the unit index is bits 16-63.
Here all units are over the threshold and the indexes are 0,1,2 etc.

. ... CXL_HMU data: size 33512 bytes
Header 0: units: 29c counter_width 10
Header 1 : deadbeef
0000000000000283
0000000000010364
0000000000020366
000000000003033c
0000000000040343
00000000000502ff
000000000006030d
000000000007031a
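
As a rough illustration of the record layout described above, the following
userspace C sketch splits a raw 64-bit hotlist entry into its counter and
unit index, and unpacks header 0 using the same masks as the dump code in
this patch. The helper and struct names are hypothetical and not part of
this series.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical decoded form of one hotlist entry. */
struct chmu_entry {
	uint64_t unit_index;	/* which tracked unit the entry refers to */
	uint64_t count;		/* counter value reported for that unit */
};

/* Header 0 carries the entry count in bits 0-15 ... */
static unsigned int chmu_header_units(uint64_t header0)
{
	return header0 & 0xFFFF;
}

/* ... and the counter width, in bits, in bits 16-23. */
static unsigned int chmu_header_counter_width(uint64_t header0)
{
	return (header0 >> 16) & 0xFF;
}

/* The low counter_width bits are the counter, the rest is the unit index. */
static struct chmu_entry chmu_decode_entry(uint64_t raw,
					   unsigned int counter_width)
{
	struct chmu_entry e;

	/* A width of 0 or 64 would leave no room for one of the fields. */
	assert(counter_width > 0 && counter_width < 64);

	e.count = raw & ((UINT64_C(1) << counter_width) - 1);
	e.unit_index = raw >> counter_width;
	return e;
}
```

For the second entry in the dump, 0000000000010364 with a counter width of
16, this gives unit index 1 and a counter value of 0x364.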

Note this is definitely RFC quality code. Before merging, the code
duplication that already exists, and which this code makes worse, should
be reduced.

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 tools/perf/arch/arm/util/auxtrace.c |  58 +++++
 tools/perf/arch/x86/util/auxtrace.c |  76 ++++++
 tools/perf/util/Build               |   1 +
 tools/perf/util/auxtrace.c          |   4 +
 tools/perf/util/auxtrace.h          |   1 +
 tools/perf/util/cxl-hmu.c           | 367 ++++++++++++++++++++++++++++
 tools/perf/util/cxl-hmu.h           |  18 ++
 7 files changed, 525 insertions(+)

diff --git a/tools/perf/arch/arm/util/auxtrace.c b/tools/perf/arch/arm/util/auxtrace.c
index 3b8eca0ffb17..07ff41800808 100644
--- a/tools/perf/arch/arm/util/auxtrace.c
+++ b/tools/perf/arch/arm/util/auxtrace.c
@@ -18,6 +18,7 @@
 #include "cs-etm.h"
 #include "arm-spe.h"
 #include "hisi-ptt.h"
+#include "cxl-hmu.h"
 
 static struct perf_pmu **find_all_arm_spe_pmus(int *nr_spes, int *err)
 {
@@ -99,6 +100,49 @@ static struct perf_pmu **find_all_hisi_ptt_pmus(int *nr_ptts, int *err)
 	return hisi_ptt_pmus;
 }
 
+static struct perf_pmu **find_all_cxl_hmu_pmus(int *nr_hmus, int *err)
+{
+	struct perf_pmu **cxl_hmu_pmus = NULL;
+	struct dirent *dent;
+	char path[PATH_MAX];
+	DIR *dir = NULL;
+	int idx = 0;
+
+	perf_pmu__event_source_devices_scnprintf(path, sizeof(path));
+	dir = opendir(path);
+	if (!dir) {
+		*err = -EINVAL;
+		return NULL;
+	}
+
+	while ((dent = readdir(dir))) {
+		if (strstr(dent->d_name, "cxl_hmu"))
+			(*nr_hmus)++;
+	}
+
+	if (!(*nr_hmus))
+		goto out;
+
+	cxl_hmu_pmus = zalloc(sizeof(struct perf_pmu *) * (*nr_hmus));
+	if (!cxl_hmu_pmus) {
+		*err = -ENOMEM;
+		goto out;
+	}
+
+	rewinddir(dir);
+	while ((dent = readdir(dir))) {
+		if (strstr(dent->d_name, "cxl_hmu") && idx < *nr_hmus) {
+			cxl_hmu_pmus[idx] = perf_pmus__find(dent->d_name);
+			if (cxl_hmu_pmus[idx])
+				idx++;
+		}
+	}
+
+out:
+	closedir(dir);
+	return cxl_hmu_pmus;
+}
+
 static struct perf_pmu *find_pmu_for_event(struct perf_pmu **pmus,
 					   int pmu_nr, struct evsel *evsel)
 {
@@ -121,13 +165,16 @@ struct auxtrace_record
 	struct perf_pmu	*cs_etm_pmu = NULL;
 	struct perf_pmu **arm_spe_pmus = NULL;
 	struct perf_pmu **hisi_ptt_pmus = NULL;
+	struct perf_pmu **chmu_pmus = NULL;
 	struct evsel *evsel;
 	struct perf_pmu *found_etm = NULL;
 	struct perf_pmu *found_spe = NULL;
 	struct perf_pmu *found_ptt = NULL;
+	struct perf_pmu *found_chmu = NULL;
 	int auxtrace_event_cnt = 0;
 	int nr_spes = 0;
 	int nr_ptts = 0;
+	int nr_chmus = 0;
 
 	if (!evlist)
 		return NULL;
@@ -135,6 +182,7 @@ struct auxtrace_record
 	cs_etm_pmu = perf_pmus__find(CORESIGHT_ETM_PMU_NAME);
 	arm_spe_pmus = find_all_arm_spe_pmus(&nr_spes, err);
 	hisi_ptt_pmus = find_all_hisi_ptt_pmus(&nr_ptts, err);
+	chmu_pmus = find_all_cxl_hmu_pmus(&nr_chmus, err);
 
 	evlist__for_each_entry(evlist, evsel) {
 		if (cs_etm_pmu && !found_etm)
@@ -145,10 +193,14 @@ struct auxtrace_record
 
 		if (hisi_ptt_pmus && !found_ptt)
 			found_ptt = find_pmu_for_event(hisi_ptt_pmus, nr_ptts, evsel);
+
+		if (chmu_pmus && !found_chmu)
+			found_chmu = find_pmu_for_event(chmu_pmus, nr_chmus, evsel);
 	}
 
 	free(arm_spe_pmus);
 	free(hisi_ptt_pmus);
+	free(chmu_pmus);
 
 	if (found_etm)
 		auxtrace_event_cnt++;
@@ -159,6 +211,9 @@ struct auxtrace_record
 	if (found_ptt)
 		auxtrace_event_cnt++;
 
+	if (found_chmu)
+		auxtrace_event_cnt++;
+
 	if (auxtrace_event_cnt > 1) {
 		pr_err("Concurrent AUX trace operation not currently supported\n");
 		*err = -EOPNOTSUPP;
@@ -174,6 +229,9 @@ struct auxtrace_record
 
 	if (found_ptt)
 		return hisi_ptt_recording_init(err, found_ptt);
+
+	if (found_chmu)
+		return chmu_recording_init(err, found_chmu);
 #endif
 
 	/*
diff --git a/tools/perf/arch/x86/util/auxtrace.c b/tools/perf/arch/x86/util/auxtrace.c
index 354780ff1605..30d84ce41394 100644
--- a/tools/perf/arch/x86/util/auxtrace.c
+++ b/tools/perf/arch/x86/util/auxtrace.c
@@ -4,6 +4,7 @@
  * Copyright (c) 2013-2014, Intel Corporation.
  */
 
+#include <dirent.h>
 #include <errno.h>
 #include <stdbool.h>
 
@@ -14,6 +15,7 @@
 #include "../../../util/auxtrace.h"
 #include "../../../util/intel-pt.h"
 #include "../../../util/intel-bts.h"
+#include "../../../util/cxl-hmu.h"
 #include "../../../util/evlist.h"
 
 static
@@ -51,14 +53,88 @@ struct auxtrace_record *auxtrace_record__init_intel(struct evlist *evlist,
 	return NULL;
 }
 
+static struct perf_pmu **find_all_cxl_hmu_pmus(int *nr_hmus, int *err)
+{
+	struct perf_pmu **cxl_hmu_pmus = NULL;
+	struct dirent *dent;
+	char path[PATH_MAX];
+	DIR *dir = NULL;
+	int idx = 0;
+
+	perf_pmu__event_source_devices_scnprintf(path, sizeof(path));
+	dir = opendir(path);
+	if (!dir) {
+		*err = -EINVAL;
+		return NULL;
+	}
+
+	while ((dent = readdir(dir))) {
+		if (strstr(dent->d_name, "cxl_hmu"))
+			(*nr_hmus)++;
+	}
+
+	if (!(*nr_hmus))
+		goto out;
+
+	cxl_hmu_pmus = zalloc(sizeof(struct perf_pmu *) * (*nr_hmus));
+	if (!cxl_hmu_pmus) {
+		*err = -ENOMEM;
+		goto out;
+	}
+
+	rewinddir(dir);
+	while ((dent = readdir(dir))) {
+		if (strstr(dent->d_name, "cxl_hmu") && idx < *nr_hmus) {
+			cxl_hmu_pmus[idx] = perf_pmus__find(dent->d_name);
+			if (cxl_hmu_pmus[idx])
+				idx++;
+		}
+	}
+
+out:
+	closedir(dir);
+	return cxl_hmu_pmus;
+}
+
+static struct perf_pmu *find_pmu_for_event(struct perf_pmu **pmus,
+					   int pmu_nr, struct evsel *evsel)
+{
+	int i;
+
+	if (!pmus)
+		return NULL;
+
+	for (i = 0; i < pmu_nr; i++) {
+		if (evsel->core.attr.type == pmus[i]->type)
+			return pmus[i];
+	}
+
+	return NULL;
+}
+
 struct auxtrace_record *auxtrace_record__init(struct evlist *evlist,
 					      int *err)
 {
 	char buffer[64];
 	int ret;
+	struct perf_pmu **chmu_pmus = NULL;
+	struct perf_pmu *found_chmu = NULL;
+	struct evsel *evsel;
+	int nr_chmus = 0;
 
 	*err = 0;
 
+	chmu_pmus = find_all_cxl_hmu_pmus(&nr_chmus, err);
+
+	evlist__for_each_entry(evlist, evsel) {
+		if (chmu_pmus && !found_chmu)
+			found_chmu = find_pmu_for_event(chmu_pmus, nr_chmus, evsel);
+	}
+	free(chmu_pmus);
+
+	if (found_chmu)
+		return chmu_recording_init(err, found_chmu);
+
 	ret = get_cpuid(buffer, sizeof(buffer));
 	if (ret) {
 		*err = ret;
diff --git a/tools/perf/util/Build b/tools/perf/util/Build
index dc616292b2dd..40c645fd0cb3 100644
--- a/tools/perf/util/Build
+++ b/tools/perf/util/Build
@@ -127,6 +127,7 @@ perf-util-$(CONFIG_AUXTRACE) += arm-spe.o
 perf-util-$(CONFIG_AUXTRACE) += arm-spe-decoder/
 perf-util-$(CONFIG_AUXTRACE) += hisi-ptt.o
 perf-util-$(CONFIG_AUXTRACE) += hisi-ptt-decoder/
+perf-util-$(CONFIG_AUXTRACE) += cxl-hmu.o
 perf-util-$(CONFIG_AUXTRACE) += s390-cpumsf.o
 
 ifdef CONFIG_LIBOPENCSD
diff --git a/tools/perf/util/auxtrace.c b/tools/perf/util/auxtrace.c
index ca8682966fae..0efc15732a03 100644
--- a/tools/perf/util/auxtrace.c
+++ b/tools/perf/util/auxtrace.c
@@ -53,6 +53,7 @@
 #include "intel-bts.h"
 #include "arm-spe.h"
 #include "hisi-ptt.h"
+#include "cxl-hmu.h"
 #include "s390-cpumsf.h"
 #include "util/mmap.h"
 
@@ -1333,6 +1334,9 @@ int perf_event__process_auxtrace_info(struct perf_session *session,
 	case PERF_AUXTRACE_HISI_PTT:
 		err = hisi_ptt_process_auxtrace_info(event, session);
 		break;
+	case PERF_AUXTRACE_CXL_HMU:
+		err = cxl_hmu_process_auxtrace_info(event, session);
+		break;
 	case PERF_AUXTRACE_UNKNOWN:
 	default:
 		return -EINVAL;
diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
index a1895a4f530b..8a7a5b7dc2d6 100644
--- a/tools/perf/util/auxtrace.h
+++ b/tools/perf/util/auxtrace.h
@@ -49,6 +49,7 @@ enum auxtrace_type {
 	PERF_AUXTRACE_ARM_SPE,
 	PERF_AUXTRACE_S390_CPUMSF,
 	PERF_AUXTRACE_HISI_PTT,
+	PERF_AUXTRACE_CXL_HMU,
 };
 
 enum itrace_period_type {
diff --git a/tools/perf/util/cxl-hmu.c b/tools/perf/util/cxl-hmu.c
new file mode 100644
index 000000000000..31844f16e4f9
--- /dev/null
+++ b/tools/perf/util/cxl-hmu.c
@@ -0,0 +1,367 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * CXL HMU support
+ * Copyright (c) 2024 Huawei
+ *
+ * Based on:
+ * HiSilicon PCIe Trace and Tuning (PTT) support
+ * Copyright (c) 2022 HiSilicon Technologies Co., Ltd.
+ */
+
+#include <byteswap.h>
+#include <endian.h>
+#include <errno.h>
+#include <inttypes.h>
+#include <linux/bitops.h>
+#include <linux/kernel.h>
+#include <linux/log2.h>
+#include <linux/types.h>
+#include <linux/zalloc.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include "auxtrace.h"
+#include "color.h"
+#include "debug.h"
+#include "evlist.h"
+#include "evsel.h"
+#include "cxl-hmu.h"
+#include "machine.h"
+#include "record.h"
+#include "session.h"
+#include "tool.h"
+#include "tsc.h"
+#include <internal/lib.h>
+
+#define KiB(x) ((x) * 1024)
+#define MiB(x) ((x) * 1024 * 1024)
+
+struct chmu_recording {
+	struct auxtrace_record	itr;
+	struct perf_pmu *chmu_pmu;
+	struct evlist *evlist;
+};
+
+static size_t
+chmu_info_priv_size(struct auxtrace_record *itr __maybe_unused,
+			struct evlist *evlist __maybe_unused)
+{
+	return CXL_HMU_AUXTRACE_PRIV_SIZE;
+}
+
+static int chmu_info_fill(struct auxtrace_record *itr,
+			      struct perf_session *session,
+			      struct perf_record_auxtrace_info *auxtrace_info,
+			      size_t priv_size)
+{
+	struct chmu_recording *pttr =
+			container_of(itr, struct chmu_recording, itr);
+	struct perf_pmu *chmu_pmu = pttr->chmu_pmu;
+
+	if (priv_size != CXL_HMU_AUXTRACE_PRIV_SIZE)
+		return -EINVAL;
+
+	if (!session->evlist->core.nr_mmaps)
+		return -EINVAL;
+
+	auxtrace_info->type = PERF_AUXTRACE_CXL_HMU;
+	auxtrace_info->priv[0] = chmu_pmu->type;
+
+	return 0;
+}
+
+static int chmu_set_auxtrace_mmap_page(struct record_opts *opts)
+{
+	bool privileged = perf_event_paranoid_check(-1);
+
+	if (!opts->full_auxtrace)
+		return 0;
+
+	if (opts->full_auxtrace && !opts->auxtrace_mmap_pages) {
+		if (privileged) {
+			opts->auxtrace_mmap_pages = MiB(16) / page_size;
+		} else {
+			opts->auxtrace_mmap_pages = KiB(128) / page_size;
+			if (opts->mmap_pages == UINT_MAX)
+				opts->mmap_pages = KiB(256) / page_size;
+		}
+	}
+
+	/* Validate auxtrace_mmap_pages */
+	if (opts->auxtrace_mmap_pages) {
+		size_t sz = opts->auxtrace_mmap_pages * (size_t)page_size;
+		size_t min_sz = KiB(8);
+
+		if (sz < min_sz || !is_power_of_2(sz)) {
+			pr_err("Invalid mmap size for CXL_HMU: must be at least %zuKiB and a power of 2\n",
+			       min_sz / 1024);
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static int chmu_recording_options(struct auxtrace_record *itr,
+				      struct evlist *evlist,
+				      struct record_opts *opts)
+{
+	struct chmu_recording *pttr =
+			container_of(itr, struct chmu_recording, itr);
+	struct perf_pmu *chmu_pmu = pttr->chmu_pmu;
+	struct evsel *evsel, *chmu_evsel = NULL;
+	struct evsel *tracking_evsel;
+	int err;
+
+	pttr->evlist = evlist;
+	evlist__for_each_entry(evlist, evsel) {
+		if (evsel->core.attr.type == chmu_pmu->type) {
+			if (chmu_evsel) {
+				pr_err("There may be only one cxl_hmu x event\n");
+				return -EINVAL;
+			}
+			evsel->core.attr.freq = 0;
+			evsel->core.attr.sample_period = 1;
+			evsel->needs_auxtrace_mmap = true;
+			chmu_evsel = evsel;
+			opts->full_auxtrace = true;
+		}
+	}
+
+	err = chmu_set_auxtrace_mmap_page(opts);
+	if (err)
+		return err;
+	/*
+	 * To obtain the auxtrace buffer file descriptor, the auxtrace event
+	 * must come first.
+	 */
+	evlist__to_front(evlist, chmu_evsel);
+	evsel__set_sample_bit(chmu_evsel, TIME);
+
+	/* Add dummy event to keep tracking */
+	err = parse_event(evlist, "dummy:u");
+	if (err)
+		return err;
+
+	tracking_evsel = evlist__last(evlist);
+	evlist__set_tracking_event(evlist, tracking_evsel);
+
+	tracking_evsel->core.attr.freq = 0;
+	tracking_evsel->core.attr.sample_period = 1;
+	evsel__set_sample_bit(tracking_evsel, TIME);
+
+	return 0;
+}
+
+static u64 chmu_reference(struct auxtrace_record *itr __maybe_unused)
+{
+	return rdtsc();
+}
+
+static void chmu_recording_free(struct auxtrace_record *itr)
+{
+	struct chmu_recording *pttr =
+	  container_of(itr, struct chmu_recording, itr);
+
+	free(pttr);
+}
+
+struct auxtrace_record *chmu_recording_init(int *err,
+						struct perf_pmu *chmu_pmu)
+{
+	struct chmu_recording *pttr;
+
+	if (!chmu_pmu) {
+		*err = -ENODEV;
+		return NULL;
+	}
+
+	pttr = zalloc(sizeof(*pttr));
+	if (!pttr) {
+		*err = -ENOMEM;
+		return NULL;
+	}
+
+	pttr->chmu_pmu = chmu_pmu;
+	pttr->itr.recording_options = chmu_recording_options;
+	pttr->itr.info_priv_size = chmu_info_priv_size;
+	pttr->itr.info_fill = chmu_info_fill;
+	pttr->itr.free = chmu_recording_free;
+	pttr->itr.reference = chmu_reference;
+	pttr->itr.read_finish = auxtrace_record__read_finish;
+	pttr->itr.alignment = 0;
+
+	*err = 0;
+	return &pttr->itr;
+}
+
+struct cxl_hmu {
+	struct auxtrace auxtrace;
+	u32 auxtrace_type;
+	struct perf_session *session;
+	struct machine *machine;
+	u32 pmu_type;
+};
+
+struct cxl_hmu_queue {
+	struct cxl_hmu *hmu;
+	struct auxtrace_buffer *buffer;
+};
+
+static void cxl_hmu_dump(struct cxl_hmu *hmu __maybe_unused,
+			  unsigned char *buf, size_t len)
+{
+	const char *color = PERF_COLOR_BLUE;
+	size_t pos = 0;
+	size_t packet_offset = 0, hotlist_entries_in_packet;
+
+	len = round_down(len, 8);
+	color_fprintf(stdout, color, ". ... CXL_HMU data: size %zu bytes\n",
+		      len);
+
+	while (len > 0) {
+		if (!packet_offset) {
+			hotlist_entries_in_packet = ((uint64_t *)(buf + pos))[0] & 0xFFFF;
+			color_fprintf(stdout, PERF_COLOR_BLUE,
+				      "Header 0: units: %zx counter_width %x\n",
+				      hotlist_entries_in_packet,
+				      (unsigned int)((((uint64_t *)(buf + pos))[0] >> 16) & 0xFF));
+		} else if (packet_offset == 1) {
+			color_fprintf(stdout, PERF_COLOR_BLUE,
+				      "Header 1 : %" PRIx64 "\n", ((uint64_t *)(buf + pos))[0]);
+		} else {
+			color_fprintf(stdout, PERF_COLOR_BLUE,
+				      "%016" PRIx64 "\n", ((uint64_t *)(buf + pos))[0]);
+		}
+		pos += 8;
+		len -= 8;
+		packet_offset++;
+		if (packet_offset == hotlist_entries_in_packet + 2)
+			packet_offset = 0;
+	}
+}
+
+static void cxl_hmu_dump_event(struct cxl_hmu *hmu, unsigned char *buf,
+			       size_t len)
+{
+	printf(".\n");
+
+	cxl_hmu_dump(hmu, buf, len);
+}
+
+static int cxl_hmu_process_event(struct perf_session *session __maybe_unused,
+				  union perf_event *event __maybe_unused,
+				  struct perf_sample *sample __maybe_unused,
+				  const struct perf_tool *tool __maybe_unused)
+{
+	return 0;
+}
+
+static int cxl_hmu_process_auxtrace_event(struct perf_session *session,
+					  union perf_event *event,
+					  const struct perf_tool *tool __maybe_unused)
+{
+	struct cxl_hmu *hmu = container_of(session->auxtrace, struct cxl_hmu,
+					    auxtrace);
+	int fd = perf_data__fd(session->data);
+	int size = event->auxtrace.size;
+	void *data = malloc(size);
+	off_t data_offset;
+	int err;
+
+	if (!data) {
+		printf("no data\n");
+		return -errno;
+	}
+
+	if (perf_data__is_pipe(session->data)) {
+		data_offset = 0;
+	} else {
+		data_offset = lseek(fd, 0, SEEK_CUR);
+		if (data_offset == -1) {
+			free(data);
+			printf("failed to seek\n");
+			return -errno;
+		}
+	}
+
+	err = readn(fd, data, size);
+	if (err != (ssize_t)size) {
+		free(data);
+		printf("failed to read\n");
+		return -errno;
+	}
+
+	if (dump_trace)
+		cxl_hmu_dump_event(hmu, data, size);
+
+	free(data);
+	return 0;
+}
+
+static int cxl_hmu_flush(struct perf_session *session __maybe_unused,
+			 const struct perf_tool *tool __maybe_unused)
+{
+	return 0;
+}
+
+static void cxl_hmu_free_events(struct perf_session *session __maybe_unused)
+{
+}
+
+static void cxl_hmu_free(struct perf_session *session)
+{
+	struct cxl_hmu *hmu = container_of(session->auxtrace, struct cxl_hmu,
+					    auxtrace);
+
+	session->auxtrace = NULL;
+	free(hmu);
+}
+
+static bool cxl_hmu_evsel_is_auxtrace(struct perf_session *session,
+				       struct evsel *evsel)
+{
+	struct cxl_hmu *hmu = container_of(session->auxtrace, struct cxl_hmu, auxtrace);
+
+	return evsel->core.attr.type == hmu->pmu_type;
+}
+
+static void cxl_hmu_print_info(__u64 type)
+{
+	if (!dump_trace)
+		return;
+
+	fprintf(stdout, "  PMU Type           %" PRId64 "\n", (s64) type);
+}
+
+int cxl_hmu_process_auxtrace_info(union perf_event *event,
+				   struct perf_session *session)
+{
+	struct perf_record_auxtrace_info *auxtrace_info = &event->auxtrace_info;
+	struct cxl_hmu *hmu;
+
+	if (auxtrace_info->header.size < CXL_HMU_AUXTRACE_PRIV_SIZE +
+				sizeof(struct perf_record_auxtrace_info))
+		return -EINVAL;
+
+	hmu = zalloc(sizeof(*hmu));
+	if (!hmu)
+		return -ENOMEM;
+
+	hmu->session = session;
+	hmu->machine = &session->machines.host; /* No kvm support */
+	hmu->auxtrace_type = auxtrace_info->type;
+	hmu->pmu_type = auxtrace_info->priv[0];
+
+	hmu->auxtrace.process_event = cxl_hmu_process_event;
+	hmu->auxtrace.process_auxtrace_event = cxl_hmu_process_auxtrace_event;
+	hmu->auxtrace.flush_events = cxl_hmu_flush;
+	hmu->auxtrace.free_events = cxl_hmu_free_events;
+	hmu->auxtrace.free = cxl_hmu_free;
+	hmu->auxtrace.evsel_is_auxtrace = cxl_hmu_evsel_is_auxtrace;
+	session->auxtrace = &hmu->auxtrace;
+
+	cxl_hmu_print_info(auxtrace_info->priv[0]);
+
+	return 0;
+}
diff --git a/tools/perf/util/cxl-hmu.h b/tools/perf/util/cxl-hmu.h
new file mode 100644
index 000000000000..9b4d83219711
--- /dev/null
+++ b/tools/perf/util/cxl-hmu.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * CXL Hotness Monitoring Unit Support
+ */
+
+#ifndef INCLUDE__PERF_CXL_HMU_H__
+#define INCLUDE__PERF_CXL_HMU_H__
+
+#define CXL_HMU_PMU_NAME		"cxl_hmu"
+#define CXL_HMU_AUXTRACE_PRIV_SIZE	sizeof(u64)
+
+struct auxtrace_record *chmu_recording_init(int *err,
+					       struct perf_pmu *cxl_hmu_pmu);
+
+int cxl_hmu_process_auxtrace_info(union perf_event *event,
+				   struct perf_session *session);
+
+#endif
-- 
2.43.0



* [RFC PATCH 4/4] hwtrace: Document CXL Hotness Monitoring Unit driver
  2024-11-21 10:18 [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver Jonathan Cameron
                   ` (2 preceding siblings ...)
  2024-11-21 10:18 ` [RFC PATCH 3/4] perf: Add support for CXL Hotness Monitoring Units (CHMU) Jonathan Cameron
@ 2024-11-21 10:18 ` Jonathan Cameron
       [not found]   ` <CGME20250103052702epcas5p3f7eea83ac70ba7147e0de7fb30f90a62@epcas5p3.samsung.com>
  2025-08-08  8:29   ` wangyuquan
  2024-11-21 13:47 ` [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver Jonathan Cameron
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 27+ messages in thread
From: Jonathan Cameron @ 2024-11-21 10:18 UTC (permalink / raw)
  To: linux-cxl, linux-mm, linux-perf-users, linux-kernel
  Cc: linuxarm, tongtiangen, Yicong Yang, Niyas Sait, ajayjoshi,
	Vandana Salve, Davidlohr Bueso, Dave Jiang, Alison Schofield,
	Ira Weiny, Dan Williams, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Gregory Price, Huang Ying

Add basic documentation to describe the CXL HMU and the
perf AUX buffer based interfaces.

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 Documentation/trace/cxl-hmu.rst | 197 ++++++++++++++++++++++++++++++++
 Documentation/trace/index.rst   |   1 +
 2 files changed, 198 insertions(+)

diff --git a/Documentation/trace/cxl-hmu.rst b/Documentation/trace/cxl-hmu.rst
new file mode 100644
index 000000000000..f07a50ba608c
--- /dev/null
+++ b/Documentation/trace/cxl-hmu.rst
@@ -0,0 +1,197 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+CXL Hotness Monitoring Unit Driver
+==================================
+
+CXL r3.2 introduced the CXL Hotness Monitoring Unit (CHMU). A CHMU allows
+software running on a CXL Host to identify hot memory ranges, that is those with
+higher access frequency relative to other memory ranges.
+
+A given Logical Device (presentation of a CXL memory device seen by a particular
+host) can provide 1 or more CHMU each of which supports 1 or more separately
+programmable CHMU Instances (CHMUI). These CHMUI are mostly independent with
+the exception that there can be restrictions on them tracking the same memory
+regions. The CHMUs are always completely independent.
+The naming of the units is cxl_hmu_memX.Y.Z where memX matches the naming
+of the memory device in /sys/bus/cxl/devices/, Y is the CHMU index and
+Z is the CHMUI index within the CHMU.
+
+Each CHMUI provides a ring buffer structure known as the Hot List from which the
+host can read back entries that describe the hotness of a particular region of
+memory (Hot List Units). The Hot List Unit combines a Unit Address and an access
+count for the particular address. Converting a Unit Address to a DPA requires
+multiplication by the unit size. Thus, for large unit sizes the device may
+support higher counts. It is these Hot List Units that the driver provides via
+a perf AUX buffer by copying them from PCI BAR space.
+
+The unit size at which hotness is measured is configurable for each CHMUI and
+all measurement is done in Device Physical Address space. To relate this to
+Host Physical Address space the HDM (Host-Managed Device Memory) decoder
+configuration must be taken into account to reflect the placement in a
+CXL Fixed Memory Window and any interleaving.
+
+The CHMUI can support interrupts on fills above a watermark, or on overflow
+of the hotlist.
+
+A CHMUI can support two basic modes of operation: Epoch and Always On.
+These affect what is placed on the hotlist. Note that the actual
+implementation of tracking is implementation defined and likely to be
+inherently imprecise in that the hottest pages may not be discovered due to
+resource exhaustion and the hotness counts may not represent accurately how
+hot they are. The specification allows for a very high degree of flexibility
+in implementation, important as it is likely that a number of different
+hardware implementations will be chosen to suit particular silicon and accuracy
+budgets.
+
+Operation and configuration
+===========================
+
+An example command line is::
+
+  $perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\
+  hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,\
+  range_size=1024,randomized_downsampling=0,downsampling_factor=32,\
+  hotness_granual=12/
+
+  $perf report --dump-raw-traces
+
+which will produce a list of hotlist entries, one per line with a short header
+to provide sufficient information to interpret the entries::
+
+  . ... CXL_HMU data: size 33512 bytes
+  Header 0: units: 29c counter_width 10
+  Header 1 : deadbeef
+  0000000000000283
+  0000000000010364
+  0000000000020366
+  000000000003033c
+  0000000000040343
+  00000000000502ff
+  000000000006030d
+  000000000007031a
+  ...
+
+The least significant counter_width bits (here 16, hex 10) are the counter
+value, all higher bits are the unit index.  Multiply by the unit size
+to get a Device Physical Address.
+
+The parameters are as follows:
+
+epoch_type
+----------
+
+Two values may be supported::
+
+  0 - Epoch based operation
+  1 - Always on operation
+
+
+0. Epoch Based Operation
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+An Epoch is a period of time after which a counter is assessed for hotness.
+
+The device may have a global sense of an Epoch but it may also operate them on
+a per counter, or per region of device basis. This is a function of the
+implementation and is not controllable, but is discoverable. In a global Epoch
+scheme, at the start of each Epoch all counters are zeroed / deallocated.
+Counters are then allocated in a hardware specific manner and accesses counted.
+At the completion of the Epoch the counters are compared with a configurable
+threshold and entries with a count above it are added to the hotlist. A new
+Epoch is then begun with all counters cleared.
+
+In a non-global Epoch scheme, when the Epoch of a given counter begins is not
+specified. An example might be an Epoch for a counter only starting on first
+touch to the relevant memory region.  When a local Epoch ends the counter is
+compared to the threshold and if appropriate added to the hotlist.
+
+Note, in Epoch Based Operation, the counter in the hotlist entry provides
+information on how hot the memory is as the counter for the full Epoch is
+provided.
+
+1. Always on Operation
+~~~~~~~~~~~~~~~~~~~~~~
+
+In this mode, counters may all be reset before enabling the CHMUI. Then
+counters are allocated to particular memory units via a hardware specific
+method, perhaps on first touch.  When a counter passes the configurable
+hotness threshold an entry is added to the hotlist and that counter is freed
+for reuse.
+
+In this scheme the count provided in the hotlist entry is not useful as it will
+depend only on the configured threshold.
+
+access_type
+-----------
+
+The parameter controls which accesses are counted::
+
+  1 - Non-TEE read only
+  2 - Non-TEE write only
+  3 - Non-TEE read and write
+  4 - TEE and Non-TEE read only
+  5 - TEE and Non-TEE write only
+  6 - TEE and Non-TEE read and write
+
+
+TEE here refers to a trusted execution environment, specifically one that
+results in the T bit being set in the CXL transactions.
+
+
+hotness_granual
+---------------
+
+Unit size at which tracking is performed.  Must be at least 256 bytes but
+hardware may only support some sizes. Expressed as a power of 2, e.g. 12 = 4KiB.
+
+hotness_threshold
+-----------------
+
+This is the minimum counter value that must be reached for the unit to count as
+hot and be added to the hotlist.
+
+The possible range may be dependent on the unit size as a smaller unit size
+requires more bits for the Unit Address in the hotlist entry, leaving fewer
+available for the hotness counter.
+
+epoch_multiplier and epoch_scale
+--------------------------------
+
+The length of an epoch (in epoch mode) is controlled by these two parameters
+with the decoded epoch_scale multiplied by the epoch_multiplier to give the
+overall epoch length.
+
+epoch_scale::
+
+  1 - 100 usecs
+  2 - 1 msec
+  3 - 10 msecs
+  4 - 100 msecs
+  5 - 1 second
+
+range_base and range_size
+-------------------------
+
+Expressed in terms of the unit size set via hotness_granual. Each CHMUI has a
+bitmap that controls which Device Physical Address space is tracked. Each bit
+represents 256MiB of DPA space.
+
+This interface provides a simple base and size in units of 256MiB to configure
+this bitmap. All bits in the specified range will be set.
+
+downsampling_factor
+-------------------
+
+Hardware may be incapable of counting accesses at full speed or it may be
+desirable to count over a longer period during which the counters would
+overflow.  This control allows selection of a down sampling factor expressed
+as a power of 2 between 1 and 32768.  Default is minimum supported downsampling
+factor.
+
+randomized_downsampling
+-----------------------
+
+To avoid problems with downsampling when accesses are periodic this option
+allows for an implementation defined randomization of the sampling interval,
+whilst remaining close to the specified downsampling_factor.
diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 0b300901fd75..b35ed8e9dfa9 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -36,3 +36,4 @@ Linux Tracing Technologies
    user_events
    rv/index
    hisi-ptt
+   cxl-hmu
-- 
2.43.0



* Re: [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver
  2024-11-21 10:18 [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver Jonathan Cameron
                   ` (3 preceding siblings ...)
  2024-11-21 10:18 ` [RFC PATCH 4/4] hwtrace: Document CXL Hotness Monitoring Unit driver Jonathan Cameron
@ 2024-11-21 13:47 ` Jonathan Cameron
  2024-11-21 14:24 ` Gregory Price
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 27+ messages in thread
From: Jonathan Cameron @ 2024-11-21 13:47 UTC (permalink / raw)
  To: linux-cxl, linux-mm, linux-perf-users, linux-kernel
  Cc: linuxarm, tongtiangen, Yicong Yang, Niyas Sait, ajayjoshi,
	Vandana Salve, Davidlohr Bueso, Dave Jiang, Alison Schofield,
	Ira Weiny, Dan Williams, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Gregory Price, Huang Ying

On Thu, 21 Nov 2024 10:18:41 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:

> The CXL specification release 3.2 is now available under a click through at
> https://computeexpresslink.org/cxl-specification/ and it brings new
> shiny toys.
> 
> RFC reason
> - Whilst trace capture with a particular configuration is potentially useful
>   the intent is that CXL HMU units will be used to drive various forms of
>   hotpage migration for memory tiering setups. This driver doesn't do this
>   (yet), but rather provides data capture etc for experimentation and
>   for working out how to mostly put the allocations in the right place to
>   start with by tuning applications.
> 
> CXL r3.2 introduces a CXL Hotness Monitoring Unit definition. The intent
> of this is to provide a way to establish which units of memory (typically
> pages or larger) in CXL attached memory are hot. The implementation details
> and algorithm are all implementation defined. The specification simply
> describes the 'interface' which takes the form of ring buffer of hotness
> records in a PCI BAR and defined capability, configuration and status
> registers.
> 
> The hardware may have constraints on what it can track, granularity etc
> and on how accurately it tracks (e.g. counter exhaustion, inaccurate
> trackers). Some of these constraints are discoverable from the hardware
> registers, others such as loss of accuracy have no universally accepted
> measures as they are typically access pattern dependent. Sadly it is
> very unlikely any hardware will implement a truly precise tracker given
> the large resource requirements for tracking at a useful granularity.
> 
> There are two fundamental operation modes:
> 
> * Epoch based. Counters are checked after a period of time (Epoch) and
>   if over a threshold added to the hotlist.
> * Always on. Counters run until a threshold is reached, after that the
>   hot unit is added to the hotlist and the counter released.
> 
> Counting can be filtered on:
> 
> * Region of CXL DPA space (256MiB per bit in a bitmap).
> * Type of access - Trusted and non trusted or non trusted only, R/W/RW
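
For illustration, the DPA range filter above can be sketched as a bitmap
builder. This assumes range_base and range_size are given in units of 256MiB
(as the documentation patch states) and that bit 0 of word 0 covers the first
256MiB of DPA space. The bit ordering and the helper name range_bitmap_word()
are assumptions made for the sketch, not something taken from the driver.

```c
#include <stdint.h>

/*
 * Return one 64-bit word of the range filter bitmap.
 *
 * base, size: range to track, in 256MiB units (range_base / range_size).
 * word:       index of the 64-bit bitmap word wanted.
 *
 * Hypothetical helper: every bit whose index lies in [base, base + size)
 * is set, all others are clear.
 */
static uint64_t range_bitmap_word(uint64_t base, uint64_t size, uint64_t word)
{
	uint64_t first = word * 64; /* index of bit 0 of this word */
	uint64_t value = 0;
	unsigned int i;

	for (i = 0; i < 64; i++) {
		uint64_t bit = first + i;

		/* bit - base < size avoids overflow in base + size */
		if (bit >= base && bit - base < size)
			value |= UINT64_C(1) << i;
	}
	return value;
}
```

With range_base=0 and range_size=1024 as in the example command, words 0 to 15
come out fully set, which corresponds to 1024 * 256MiB = 256GiB of DPA space.
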
> 
> Sampling can be modified by:
> 
> * Downsampling including potentially randomized downsampling.
> 
> The driver presented here is intended to be useful in its own right but
> also to act as the first step of a possible path towards hotness monitoring
> based hot page migration. Those steps might look like.
> 
> 1. Gather data - drivers provide telemetry like solutions to get that
>    data. May be enhanced, for example in this driver by providing the
>    HPA address rather than DPA Unit Address. Userspace can access enough
>    information to do this so maybe not.
> 2. Userspace algorithm development, possibly combined with userspace
>    triggered migration by PA. Working out how to use different levels
>    of constrained hardware resources will be challenging.
> 3. Move those algorithms in kernel. Will require generalization across
>    different hotpage trackers etc.
> 
> So far this driver just gives access to the raw data. I will probably kick
> of a longer discussion on how to do adaptive sampling needed to actually
> use these units for tiering etc, sometime soon (if no one one else beats
> me too it).  There is a follow up topic of how to virtualize this stuff
> for memory stranding cases (VM gets a fixed mixture of fast and slow
> memory and should do it's own tiering).
> 
> More details in the Documentation patch but typical commands are:
> 
> $perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\
>  hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,\
>  range_size=1024,randomized_downsampling=0,downsampling_factor=32,\
>  hotness_granual=12
> 
> $perf report --dump-raw-traces
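
As a worked example of the epoch parameters in that command: going by the
epoch_scale encoding in the documentation patch (4 = 100 msecs),
epoch_scale=4 with epoch_multiplier=4 gives a 400 msec epoch. A minimal
sketch of that arithmetic follows. The function names are made up for
illustration and are not part of the driver or the perf tool.

```c
#include <stdint.h>

/*
 * Base period in microseconds for each epoch_scale encoding listed in
 * the documentation patch. Returns 0 for encodings not listed there.
 */
static uint64_t epoch_scale_usecs(unsigned int epoch_scale)
{
	switch (epoch_scale) {
	case 1: return 100;      /* 100 usecs */
	case 2: return 1000;     /* 1 msec */
	case 3: return 10000;    /* 10 msecs */
	case 4: return 100000;   /* 100 msecs */
	case 5: return 1000000;  /* 1 second */
	default: return 0;
	}
}

/* Overall epoch length: decoded scale multiplied by epoch_multiplier. */
static uint64_t epoch_length_usecs(unsigned int epoch_scale,
				   unsigned int epoch_multiplier)
{
	return epoch_scale_usecs(epoch_scale) * epoch_multiplier;
}
```
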
> 
> Example output.  With a counter_width of 16 (0x10) the least significant
> 4 hex digits (16 bits) are the counter value and the unit index is bits 16-63.
> Here all units are over the threshold and the indexes are 0,1,2 etc.
> 
> . ... CXL_HMU data: size 33512 bytes
> Header 0: units: 29c counter_width 10
> Header 1 : deadbeef
> 0000000000000283
> 0000000000010364
> 0000000000020366
> 000000000003033c
> 0000000000040343
> 00000000000502ff
> 000000000006030d
> 000000000007031a
> 
> Which will produce a list of hotness entries.
> Bits[N-1:0] counter value
> Bits[63:N] Unit ID (combine with unit size and DPA base + HDM decoder
>   config to get to a Host Physical Address)
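
The bit layout above can be decoded along these lines. This is only a sketch:
the struct and function names are hypothetical, counter_width is taken from
header 0 of the dump, and granule_log2 is assumed to be the hotness_granual
value used at record time. Any DPA base offset and the HDM decoder step
needed to reach an HPA are deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* One decoded hotlist entry (illustrative field names). */
struct hot_entry {
	uint64_t count;      /* Bits[N-1:0]: counter value */
	uint64_t unit_index; /* Bits[63:N]: unit index */
	uint64_t dpa;        /* unit_index * unit size, in bytes */
};

/*
 * Split a raw 64-bit hotlist entry into counter and unit index, then
 * scale the unit index by the unit size (1 << granule_log2).
 */
static struct hot_entry decode_hot_entry(uint64_t raw,
					 unsigned int counter_width,
					 unsigned int granule_log2)
{
	struct hot_entry e;

	assert(counter_width > 0 && counter_width < 64);
	assert(granule_log2 < 64);

	e.count = raw & ((UINT64_C(1) << counter_width) - 1);
	e.unit_index = raw >> counter_width;
	/* No overflow check here: very large indexes would need one. */
	e.dpa = e.unit_index << granule_log2;
	return e;
}
```

For the second entry in the example output, 0000000000010364 with
counter_width 16 and hotness_granual=12, this gives a count of 0x364 for
unit index 1, i.e. the unit starting 0x1000 bytes into the tracked DPA space.
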
> 
> Specific RFC questions.
> - What should be in the header added to the aux buffer.
>   Currently just the minimum is provided. Number of records
>   and the counter width needed to decode them.
> - Should we reset the counters when doing sampling "-F X"
>   If the frequency is higher than the epoch we never see any hot units.
>   If so, when should we reset them?
> 
> Note testing has been light and on emulation only + as perf tool is
> a pain to build on a striped back VM,  build testing has all be on
> arm64 so far.  The driver loads though on both arm64 and x86 so
> any problems are likely in the perf tool arch specific code
> which is build tested (on wrong machine)

FWIW, runs on x86. However, it triggers a lockdep warning in
both start and stop due to the spin lock. Something to tidy up for
RFCv2.

J

> 
> The QEMU emulation needs some cleanup, but I should be able to post
> that shortly to let people actually play with this.  There are lots
> of open questions there on how 'right' we want the emulation to be
> and what counting uarch to emulate.
> 
> Jonathan Cameron (4):
>   cxl: Register devices for CXL Hotness Monitoring Units (CHMU)
>   cxl: Hotness Monitoring Unit via a Perf AUX Buffer.
>   perf: Add support for CXL Hotness Monitoring Units (CHMU)
>   hwtrace: Document CXL Hotness Monitoring Unit driver
> 
>  Documentation/trace/cxl-hmu.rst     | 197 +++++++
>  Documentation/trace/index.rst       |   1 +
>  drivers/cxl/Kconfig                 |   6 +
>  drivers/cxl/Makefile                |   3 +
>  drivers/cxl/core/Makefile           |   1 +
>  drivers/cxl/core/core.h             |   1 +
>  drivers/cxl/core/hmu.c              |  64 ++
>  drivers/cxl/core/port.c             |   2 +
>  drivers/cxl/core/regs.c             |  14 +
>  drivers/cxl/cxl.h                   |   5 +
>  drivers/cxl/cxlpci.h                |   1 +
>  drivers/cxl/hmu.c                   | 880 ++++++++++++++++++++++++++++
>  drivers/cxl/hmu.h                   |  23 +
>  drivers/cxl/pci.c                   |  26 +-
>  tools/perf/arch/arm/util/auxtrace.c |  58 ++
>  tools/perf/arch/x86/util/auxtrace.c |  76 +++
>  tools/perf/util/Build               |   1 +
>  tools/perf/util/auxtrace.c          |   4 +
>  tools/perf/util/auxtrace.h          |   1 +
>  tools/perf/util/cxl-hmu.c           | 367 ++++++++++++
>  tools/perf/util/cxl-hmu.h           |  18 +
>  21 files changed, 1748 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/trace/cxl-hmu.rst
>  create mode 100644 drivers/cxl/core/hmu.c
>  create mode 100644 drivers/cxl/hmu.c
>  create mode 100644 drivers/cxl/hmu.h
>  create mode 100644 tools/perf/util/cxl-hmu.c
>  create mode 100644 tools/perf/util/cxl-hmu.h
> 



* Re: [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver
  2024-11-21 10:18 [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver Jonathan Cameron
                   ` (4 preceding siblings ...)
  2024-11-21 13:47 ` [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver Jonathan Cameron
@ 2024-11-21 14:24 ` Gregory Price
  2024-11-21 14:58   ` Jonathan Cameron
  2024-11-27 16:34 ` Jonathan Cameron
  2025-01-24 17:40 ` Jonathan Cameron
  7 siblings, 1 reply; 27+ messages in thread
From: Gregory Price @ 2024-11-21 14:24 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, linux-mm, linux-perf-users, linux-kernel, linuxarm,
	tongtiangen, Yicong Yang, Niyas Sait, ajayjoshi, Vandana Salve,
	Davidlohr Bueso, Dave Jiang, Alison Schofield, Ira Weiny,
	Dan Williams, Alexander Shishkin, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Huang Ying

On Thu, Nov 21, 2024 at 10:18:41AM +0000, Jonathan Cameron wrote:
> The CXL specification release 3.2 is now available under a click through at
> https://computeexpresslink.org/cxl-specification/ and it brings new
> shiny toys.
> 
> RFC reason
> - Whilst trace capture with a particular configuration is potentially useful
>   the intent is that CXL HMU units will be used to drive various forms of
>   hotpage migration for memory tiering setups. This driver doesn't do this
>   (yet), but rather provides data capture etc for experimentation and
>   for working out how to mostly put the allocations in the right place to
>   start with by tuning applications.
> 
> CXL r3.2 introduces a CXL Hotness Monitoring Unit definition. The intent
> of this is to provide a way to establish which units of memory (typically
> pages or larger) in CXL attached memory are hot. The implementation details
> and algorithm are all implementation defined. The specification simply
> describes the 'interface' which takes the form of ring buffer of hotness
> records in a PCI BAR and defined capability, configuration and status
> registers.
> 
> The hardware may have constraints on what it can track, granularity etc
> and on how accurately it tracks (e.g. counter exhaustion, inaccurate
> trackers). Some of these constraints are discoverable from the hardware
> registers, others such as loss of accuracy have no universally accepted
> measures as they are typically access pattern dependent. Sadly it is
> very unlikely any hardware will implement a truly precise tracker given
> the large resource requirements for tracking at a useful granularity.
> 
> There are two fundamental operation modes:
> 
> * Epoch based. Counters are checked after a period of time (Epoch) and
>   if over a threshold added to the hotlist.
> * Always on. Counters run until a threshold is reached, after that the
>   hot unit is added to the hotlist and the counter released.
> 
> Counting can be filtered on:
> 
> * Region of CXL DPA space (256MiB per bit in a bitmap).
> * Type of access - Trusted and non trusted or non trusted only, R/W/RW
> 
> Sampling can be modified by:
> 
> * Downsampling including potentially randomized downsampling.
> 
> The driver presented here is intended to be useful in its own right but
> also to act as the first step of a possible path towards hotness monitoring
> based hot page migration. Those steps might look like.
> 
> 1. Gather data - drivers provide telemetry like solutions to get that
>    data. May be enhanced, for example in this driver by providing the
>    HPA address rather than DPA Unit Address. Userspace can access enough
>    information to do this so maybe not.
> 2. Userspace algorithm development, possibly combined with userspace
>    triggered migration by PA. Working out how to use different levels
>    of constrained hardware resources will be challenging.

FWIW this is what I was thinking about for this extension:

https://lore.kernel.org/all/20240319172609.332900-1-gregory.price@memverge.com/

At least for testing CHMU stuff. So if anyone is poking at testing such
things, they can feel free to use that for prototyping. However, I think
there is general discomfort around userspace handling HPA/DPA.

So it might look more like

echo nr_pages > /sys/.../tiering/nodeN/promote_pages

rather than handling the raw data from the CHMU to make decisions.


> 3. Move those algorithms in kernel. Will require generalization across
>    different hotpage trackers etc.
> 

In a longer discussion with Dan, we considered something a little more
abstract - like a system that monitors bandwidth and memory access stalls
and decides to promote X pages from Y device.  This carries a pretty tall
generalization cost, but it's pretty exciting to say the least.

Definitely worth a discussion for later.

>
> So far this driver just gives access to the raw data. I will probably kick
> of a longer discussion on how to do adaptive sampling needed to actually
> use these units for tiering etc, sometime soon (if no one one else beats
> me too it).  There is a follow up topic of how to virtualize this stuff
> for memory stranding cases (VM gets a fixed mixture of fast and slow
> memory and should do it's own tiering).
>

Without having looked at the patches yet, I would presume this interface
is at least gated to admin/root? (raw data is physical address info)

~Gregory


* Re: [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver
  2024-11-21 14:24 ` Gregory Price
@ 2024-11-21 14:58   ` Jonathan Cameron
  2024-11-21 15:49     ` Gregory Price
  2024-11-22 20:08     ` SeongJae Park
  0 siblings, 2 replies; 27+ messages in thread
From: Jonathan Cameron @ 2024-11-21 14:58 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-cxl, linux-mm, linux-perf-users, linux-kernel, linuxarm,
	tongtiangen, Yicong Yang, Niyas Sait, ajayjoshi, Vandana Salve,
	Davidlohr Bueso, Dave Jiang, Alison Schofield, Ira Weiny,
	Dan Williams, Alexander Shishkin, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Huang Ying

On Thu, 21 Nov 2024 09:24:43 -0500
Gregory Price <gourry@gourry.net> wrote:

> On Thu, Nov 21, 2024 at 10:18:41AM +0000, Jonathan Cameron wrote:
> > The CXL specification release 3.2 is now available under a click through at
> > https://computeexpresslink.org/cxl-specification/ and it brings new
> > shiny toys.
> > 
> > RFC reason
> > - Whilst trace capture with a particular configuration is potentially useful
> >   the intent is that CXL HMU units will be used to drive various forms of
> >   hotpage migration for memory tiering setups. This driver doesn't do this
> >   (yet), but rather provides data capture etc for experimentation and
> >   for working out how to mostly put the allocations in the right place to
> >   start with by tuning applications.
> > 
> > CXL r3.2 introduces a CXL Hotness Monitoring Unit definition. The intent
> > of this is to provide a way to establish which units of memory (typically
> > pages or larger) in CXL attached memory are hot. The implementation details
> > and algorithm are all implementation defined. The specification simply
> > describes the 'interface' which takes the form of ring buffer of hotness
> > records in a PCI BAR and defined capability, configuration and status
> > registers.
> > 
> > The hardware may have constraints on what it can track, granularity etc
> > and on how accurately it tracks (e.g. counter exhaustion, inaccurate
> > trackers). Some of these constraints are discoverable from the hardware
> > registers, others such as loss of accuracy have no universally accepted
> > measures as they are typically access pattern dependent. Sadly it is
> > very unlikely any hardware will implement a truly precise tracker given
> > the large resource requirements for tracking at a useful granularity.
> > 
> > There are two fundamental operation modes:
> > 
> > * Epoch based. Counters are checked after a period of time (Epoch) and
> >   if over a threshold added to the hotlist.
> > * Always on. Counters run until a threshold is reached, after that the
> >   hot unit is added to the hotlist and the counter released.
> > 
> > Counting can be filtered on:
> > 
> > * Region of CXL DPA space (256MiB per bit in a bitmap).
> > * Type of access - Trusted and non trusted or non trusted only, R/W/RW
> > 
> > Sampling can be modified by:
> > 
> > * Downsampling including potentially randomized downsampling.
> > 
> > The driver presented here is intended to be useful in its own right but
> > also to act as the first step of a possible path towards hotness monitoring
> > based hot page migration. Those steps might look like.
> > 
> > 1. Gather data - drivers provide telemetry like solutions to get that
> >    data. May be enhanced, for example in this driver by providing the
> >    HPA address rather than DPA Unit Address. Userspace can access enough
> >    information to do this so maybe not.
> > 2. Userspace algorithm development, possibly combined with userspace
> >    triggered migration by PA. Working out how to use different levels
> >    of constrained hardware resources will be challenging.  
> 
> FWIW this is what i was thinking about for this extension:
> 
> https://lore.kernel.org/all/20240319172609.332900-1-gregory.price@memverge.com/

Yup. I had that in mind. Forgot to actually add a link.

> 
> At least for testing CHMU stuff. So if anyone is poking at testing such
> things, they can feel free to use that for prototyping. However, I think
> there is general discomfort around userspace handling HPA/DPA.
> 
> So it might look more like
> 
> echo nr_pages > /sys/.../tiering/nodeN/promote_pages
> 
> rather than handling the raw data from the CHMU to make decisions.

Agreed, but I think we are far away from a point where we can implement that.

Just working out how to tune the hardware to grab useful data is going
to take a while to figure out, let alone doing anything much with it.

Without care you won't get a meaningful signal for what is actually
hot out of the box. Lots of reasons why including:
a) Exhaustion of tracking resources, due to looking at too large a window
   or for too long.  Will probably need some form of auto updating of
   what is being scanned (coarse to fine might work though I'm doubtful,
   scanning across small regions maybe).
b) Threshold too high, no detections.
c) Threshold too low, everything hot.
d) Wrong timescales. Hot is not a well defined thing.
e) Hardware that won't do tracking at fine enough granularity.
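
To make b) and c) concrete, a userspace experiment might start with a crude
check like the one below. It is purely a hypothetical sketch: the enum, the
function name and the max_hot_fraction cut-off are invented for illustration,
and nothing like it exists in this series.

```c
#include <stdint.h>

enum threshold_hint {
	THRESHOLD_OK,
	THRESHOLD_TOO_HIGH, /* b) no detections */
	THRESHOLD_TOO_LOW,  /* c) (nearly) everything reported hot */
	THRESHOLD_NO_DATA,  /* nothing was tracked at all */
};

/*
 * hot_units:        number of distinct units seen in the hotlist
 * tracked_units:    number of units in the monitored range
 * max_hot_fraction: hypothetical cut-off, e.g. 0.9 means "90% of the
 *                   range being hot is not a useful signal"
 */
static enum threshold_hint classify_threshold(uint64_t hot_units,
					      uint64_t tracked_units,
					      double max_hot_fraction)
{
	if (tracked_units == 0)
		return THRESHOLD_NO_DATA;
	if (hot_units == 0)
		return THRESHOLD_TOO_HIGH;
	if ((double)hot_units / (double)tracked_units >= max_hot_fraction)
		return THRESHOLD_TOO_LOW;
	return THRESHOLD_OK;
}
```

A check like this says nothing about a), d) or e); those need knowledge of
the hardware's tracking resources and of the workload's timescales.
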

> 
> 
> > 3. Move those algorithms in kernel. Will require generalization across
> >    different hotpage trackers etc.
> >   
> 
> In a longer discussion with Dan, we considered something a little more
> abstract - like a system that monitors bandwidth and memory access stalls
> and decide to promote X pages from Y device.  This carries a pretty tall
> generalization cost, but it's pretty exciting to say the least.

Agreed that ultimately we'll end up somewhere like that.
These units are just a small part of what is needed in total.

> 
> Definitely worth a discussion for later.
> 
> >
> > So far this driver just gives access to the raw data. I will probably kick
> > of a longer discussion on how to do adaptive sampling needed to actually
> > use these units for tiering etc, sometime soon (if no one one else beats
> > me too it).  There is a follow up topic of how to virtualize this stuff
> > for memory stranding cases (VM gets a fixed mixture of fast and slow
> > memory and should do it's own tiering).
> >  
> 
> Without having looked at the patches yet, I would presume this interface
> is at least gated to admin/root? (raw data is physical address info)

That's certainly the intent. It's not going upstream in this form so
I haven't actually checked yet :)  Uses similar infrastructure to ARM
SPE which can also give physical address info + a lot more than that.

Jonathan



> 
> ~Gregory
> 



* Re: [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver
  2024-11-21 14:58   ` Jonathan Cameron
@ 2024-11-21 15:49     ` Gregory Price
  2024-11-22 20:08     ` SeongJae Park
  1 sibling, 0 replies; 27+ messages in thread
From: Gregory Price @ 2024-11-21 15:49 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, linux-mm, linux-perf-users, linux-kernel, linuxarm,
	tongtiangen, Yicong Yang, Niyas Sait, ajayjoshi, Vandana Salve,
	Davidlohr Bueso, Dave Jiang, Alison Schofield, Ira Weiny,
	Dan Williams, Alexander Shishkin, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Huang Ying

On Thu, Nov 21, 2024 at 02:58:52PM +0000, Jonathan Cameron wrote:
> On Thu, 21 Nov 2024 09:24:43 -0500
> Gregory Price <gourry@gourry.net> wrote:
> 
> > On Thu, Nov 21, 2024 at 10:18:41AM +0000, Jonathan Cameron wrote:
> > > The CXL specification release 3.2 is now available under a click through at
> > > https://computeexpresslink.org/cxl-specification/ and it brings new
> > > shiny toys.
> > > 
> > > RFC reason
> > > - Whilst trace capture with a particular configuration is potentially useful
> > >   the intent is that CXL HMU units will be used to drive various forms of
> > >   hotpage migration for memory tiering setups. This driver doesn't do this
> > >   (yet), but rather provides data capture etc for experimentation and
> > >   for working out how to mostly put the allocations in the right place to
> > >   start with by tuning applications.
> > > 
> > > CXL r3.2 introduces a CXL Hotness Monitoring Unit definition. The intent
> > > of this is to provide a way to establish which units of memory (typically
> > > pages or larger) in CXL attached memory are hot. The implementation details
> > > and algorithm are all implementation defined. The specification simply
> > > describes the 'interface' which takes the form of ring buffer of hotness
> > > records in a PCI BAR and defined capability, configuration and status
> > > registers.
> > > 
> > > The hardware may have constraints on what it can track, granularity etc
> > > and on how accurately it tracks (e.g. counter exhaustion, inaccurate
> > > trackers). Some of these constraints are discoverable from the hardware
> > > registers, others such as loss of accuracy have no universally accepted
> > > measures as they are typically access pattern dependent. Sadly it is
> > > very unlikely any hardware will implement a truly precise tracker given
> > > the large resource requirements for tracking at a useful granularity.
> > > 
> > > There are two fundamental operation modes:
> > > 
> > > * Epoch based. Counters are checked after a period of time (Epoch) and
> > >   if over a threshold added to the hotlist.
> > > * Always on. Counters run until a threshold is reached, after that the
> > >   hot unit is added to the hotlist and the counter released.
> > > 
> > > Counting can be filtered on:
> > > 
> > > * Region of CXL DPA space (256MiB per bit in a bitmap).
> > > * Type of access - Trusted and non trusted or non trusted only, R/W/RW
> > > 
> > > Sampling can be modified by:
> > > 
> > > * Downsampling including potentially randomized downsampling.
> > > 
> > > The driver presented here is intended to be useful in its own right but
> > > also to act as the first step of a possible path towards hotness monitoring
> > > based hot page migration. Those steps might look like.
> > > 
> > > 1. Gather data - drivers provide telemetry like solutions to get that
> > >    data. May be enhanced, for example in this driver by providing the
> > >    HPA address rather than DPA Unit Address. Userspace can access enough
> > >    information to do this so maybe not.
> > > 2. Userspace algorithm development, possibly combined with userspace
> > >    triggered migration by PA. Working out how to use different levels
> > >    of constrained hardware resources will be challenging.  
> > 
> > FWIW this is what i was thinking about for this extension:
> > 
> > https://lore.kernel.org/all/20240319172609.332900-1-gregory.price@memverge.com/
> 
> Yup. I had that in mind. Forgot to actually add a link.
> 
> > 
> > At least for testing CHMU stuff. So if anyone is poking at testing such
> > things, they can feel free to use that for prototyping. However, I think
> > there is general discomfort around userspace handling HPA/DPA.
> > 
> > So it might look more like
> > 
> > echo nr_pages > /sys/.../tiering/nodeN/promote_pages
> > 
> > rather than handling the raw data from the CHMU to make decisions.
> 
> Agreed, but I think we are far away from a point where we can implement that.
> 
> Just working out how to tune the hardware to grab useful data is going
> to take a while to figure out, let alone doing anything much with it.
> 
> Without care you won't get a meaningful signal for what is actually
> hot out of the box. Lots of reasons why including:
> a) Exhaustion of tracking resources, due to looking at too large a window
>    or for too long.  Will probably need some form of auto updating of
>    what is being scanned (coarse to fine might work though I'm doubtful,
>    scanning across small regions maybe).
> b) Threshold too high, no detections.
> c) Threshold too low, everything hot.
> d) Wrong timescales. Hot is not a well defined thing.
> e) Hardware that won't do tracking at fine enough granularity.
> 

f) How does this even work with interleaving on larger pools :B
   It's pretend-addressing all the way down :D

Lots of conceptually complex and fun questions here.

~Gregory

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver
  2024-11-21 14:58   ` Jonathan Cameron
  2024-11-21 15:49     ` Gregory Price
@ 2024-11-22 20:08     ` SeongJae Park
  1 sibling, 0 replies; 27+ messages in thread
From: SeongJae Park @ 2024-11-22 20:08 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: SeongJae Park, Gregory Price, linux-cxl, linux-mm,
	linux-perf-users, linux-kernel, linuxarm, tongtiangen,
	Yicong Yang, Niyas Sait, ajayjoshi, Vandana Salve,
	Davidlohr Bueso, Dave Jiang, Alison Schofield, Ira Weiny,
	Dan Williams, Alexander Shishkin, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Huang Ying

On Thu, 21 Nov 2024 14:58:52 +0000 Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:

> On Thu, 21 Nov 2024 09:24:43 -0500
> Gregory Price <gourry@gourry.net> wrote:
> 
> > On Thu, Nov 21, 2024 at 10:18:41AM +0000, Jonathan Cameron wrote:
[...]
> Just working out how to tune the hardware to grab useful data is going
> to take a while to figure out, let alone doing anything much with it.
> 
> Without care you won't get a meaningful signal for what is actually
> hot out of the box. Lots of reasons why including:
> a) Exhaustion of tracking resources, due to looking at too large a window
>    or for too long.  Will probably need some form of auto updating of
>    what is being scanned (coarse to fine might work though I'm doubtful,
>    scanning across small regions maybe).
> b) Threshold too high, no detections.
> c) Threshold too low, everything hot.
> d) Wrong timescales. Hot is not a well defined thing.
> e) Hardware that won't do tracking at fine enough granularity.

Similar questions can be raised about hotness monitoring in general, including
DAMON.  I'm trying to summarize[1] rules of thumb for DAMON tuning based on my
humble experience.  Once that is done, I will try automating the tuning.

In the future, hopefully DAMON can be extended to use the CXL hotness
monitoring unit as a low-level primitive for access checks.  Then the guidance
and automation for DAMON tuning could simply be applied.

Note that I'm not saying DAMON should be the only way to use the CXL hotness
monitoring unit.  I'm saying DAMON could be one of the ways :)

[1] https://lore.kernel.org/20241108232536.73843-1-sj@kernel.org


Thanks,
SJ

[...]

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver
  2024-11-21 10:18 [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver Jonathan Cameron
                   ` (5 preceding siblings ...)
  2024-11-21 14:24 ` Gregory Price
@ 2024-11-27 16:34 ` Jonathan Cameron
  2024-12-04 12:35   ` [EXT] " Ajay Joshi
       [not found]   ` <CGME20250103053521epcas5p30cd4abba59d695664335b03ba806c56d@epcas5p3.samsung.com>
  2025-01-24 17:40 ` Jonathan Cameron
  7 siblings, 2 replies; 27+ messages in thread
From: Jonathan Cameron @ 2024-11-27 16:34 UTC (permalink / raw)
  To: linux-cxl, linux-mm, linux-perf-users, linux-kernel, linuxarm
  Cc: tongtiangen, Yicong Yang, Niyas Sait, ajayjoshi, Vandana Salve,
	Davidlohr Bueso, Dave Jiang, Alison Schofield, Ira Weiny,
	Dan Williams, Alexander Shishkin, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Gregory Price, Huang Ying

On Thu, 21 Nov 2024 10:18:41 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:

> The CXL specification release 3.2 is now available under a click through at
> https://computeexpresslink.org/cxl-specification/ and it brings new
> shiny toys.

If anyone wants to play, basic emulation is on my CXL QEMU staging tree:
https://gitlab.com/jic23/qemu/-/commit/e89b35d264c1bcc04807e7afab1254f35ffc8cb9

Branch with a few other things on top is:
https://gitlab.com/jic23/qemu/-/commits/cxl-2024-11-27

Note that this currently doesn't produce real data.  I have a plan
/ initial PoC / hack to hook that up via an addition to the QEMU cache
plugin and an external tool to emulate the hotness tracker counting
hardware. Will be a little while before I get that finished, so in
the meantime the above exercises the driver.

Jonathan
 
> 
> RFC reason
> - Whilst trace capture with a particular configuration is potentially useful
>   the intent is that CXL HMU units will be used to drive various forms of
>   hotpage migration for memory tiering setups. This driver doesn't do this
>   (yet), but rather provides data capture etc for experimentation and
>   for working out how to mostly put the allocations in the right place to
>   start with by tuning applications.
> 
> CXL r3.2 introduces a CXL Hotness Monitoring Unit definition. The intent
> of this is to provide a way to establish which units of memory (typically
> pages or larger) in CXL attached memory are hot. The implementation details
> and algorithm are all implementation defined. The specification simply
> describes the 'interface' which takes the form of ring buffer of hotness
> records in a PCI BAR and defined capability, configuration and status
> registers.
> 
> The hardware may have constraints on what it can track, granularity etc
> and on how accurately it tracks (e.g. counter exhaustion, inaccurate
> trackers). Some of these constraints are discoverable from the hardware
> registers, others such as loss of accuracy have no universally accepted
> measures as they are typically access pattern dependent. Sadly it is
> very unlikely any hardware will implement a truly precise tracker given
> the large resource requirements for tracking at a useful granularity.
> 
> There are two fundamental operation modes:
> 
> * Epoch based. Counters are checked after a period of time (Epoch) and
>   if over a threshold added to the hotlist.
> * Always on. Counters run until a threshold is reached, after that the
>   hot unit is added to the hotlist and the counter released.
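
To make the difference between the two modes concrete, here is a rough sketch
for a single tracked unit. It is not taken from the driver or the spec: the
function names are made up for illustration, "over a threshold" is assumed to
mean count >= threshold, and "released" is modelled as resetting the counter
so the same unit can be picked up again.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Always on (illustrative model): the counter runs until it reaches the
 * threshold, the unit is reported to the hotlist, and the counter is
 * released (modelled here as a reset to zero).
 * Returns how many times one unit is reported for a given access count.
 */
static unsigned int always_on_reports(uint32_t accesses, uint32_t threshold)
{
	uint32_t count = 0;
	unsigned int reports = 0;

	assert(threshold > 0);
	for (uint32_t i = 0; i < accesses; i++) {
		if (++count == threshold) {
			reports++;	/* unit added to the hotlist */
			count = 0;	/* counter released */
		}
	}
	return reports;
}

/*
 * Epoch based (illustrative model): the counter is only inspected at the
 * end of the epoch. Assumes "over a threshold" means count >= threshold.
 * Returns true if the unit lands on the hotlist for this epoch.
 */
static bool epoch_is_hot(uint32_t accesses_in_epoch, uint32_t threshold)
{
	assert(threshold > 0);
	return accesses_in_epoch >= threshold;
}
```

With a threshold of 1024, a unit touched 2500 times is reported twice in the
always-on model but at most once per epoch in the epoch model, which is one
reason the two modes need different tuning.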
> 
> Counting can be filtered on:
> 
> * Region of CXL DPA space (256MiB per bit in a bitmap).
> * Type of access - Trusted and non trusted or non trusted only, R/W/RW
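
The 256MiB-per-bit region filter is just a shift by 28. A minimal sketch of
building such a bitmap word for a DPA range follows. The helper names are
hypothetical, and it only handles the first 64 chunks (16GiB) so that the
result fits in one 64-bit word; real hardware may expose a wider bitmap.

```c
#include <assert.h>
#include <stdint.h>

#define HMU_RANGE_SHIFT 28	/* 256MiB == 1 << 28 bytes per bitmap bit */

/* Index of the bitmap bit covering a given device physical address. */
static uint64_t hmu_range_bit(uint64_t dpa)
{
	return dpa >> HMU_RANGE_SHIFT;
}

/*
 * Build a bitmap word with one bit set for every 256MiB chunk that
 * overlaps [dpa, dpa + len). Illustrative only: the range must fall
 * within the first 64 chunks.
 */
static uint64_t hmu_range_mask(uint64_t dpa, uint64_t len)
{
	uint64_t first, last, mask = 0;

	assert(len > 0);
	first = hmu_range_bit(dpa);
	last = hmu_range_bit(dpa + len - 1);
	assert(last < 64);

	for (uint64_t i = first; i <= last; i++)
		mask |= UINT64_C(1) << i;
	return mask;
}
```

A range that is not 256MiB aligned still sets every chunk it touches, so the
tracked window can be larger than the range requested.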
> 
> Sampling can be modified by:
> 
> * Downsampling including potentially randomized downsampling.
> 
> The driver presented here is intended to be useful in its own right but
> also to act as the first step of a possible path towards hotness monitoring
> based hot page migration. Those steps might look like.
> 
> 1. Gather data - drivers provide telemetry like solutions to get that
>    data. May be enhanced, for example in this driver by providing the
>    HPA address rather than DPA Unit Address. Userspace can access enough
>    information to do this so maybe not.
> 2. Userspace algorithm development, possibly combined with userspace
>    triggered migration by PA. Working out how to use different levels
>    of constrained hardware resources will be challenging.
> 3. Move those algorithms in kernel. Will require generalization across
>    different hotpage trackers etc.
> 
> So far this driver just gives access to the raw data. I will probably kick
> off a longer discussion on how to do adaptive sampling needed to actually
> use these units for tiering etc, sometime soon (if no one else beats
> me to it).  There is a follow up topic of how to virtualize this stuff
> for memory stranding cases (VM gets a fixed mixture of fast and slow
> memory and should do its own tiering).
> 
> More details in the Documentation patch but typical commands are:
> 
> $perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\
>  hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,\
>  range_size=1024,randomized_downsampling=0,downsampling_factor=32,\
>  hotness_granual=12
> 
> $perf report --dump-raw-traces
> 
> Example output.  With a counter_width of 16 (0x10) the least significant
> 16 bits (4 hex digits) are the counter value and the unit index is bits 16-63.
> Here all units are over the threshold and the indexes are 0,1,2 etc.
> 
> . ... CXL_HMU data: size 33512 bytes
> Header 0: units: 29c counter_width 10
> Header 1 : deadbeef
> 0000000000000283
> 0000000000010364
> 0000000000020366
> 000000000003033c
> 0000000000040343
> 00000000000502ff
> 000000000006030d
> 000000000007031a
> 
> Which will produce a list of hotness entries.
> Bits[N-1:0] counter value
> Bits[63:N] Unit ID (combine with unit size and DPA base + HDM decoder
>   config to get to a Host Physical Address)
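
For anyone post-processing the dump, the record layout above can be decoded
along these lines. This is a sketch, not code from the perf tool: the struct
and function names are invented, and the DPA helper assumes units are
2^granule_shift bytes and numbered from dpa_base. Getting from DPA to HPA
still needs the HDM decoder configuration, which is not attempted here.

```c
#include <assert.h>
#include <stdint.h>

/* One decoded hotlist entry. Field names are illustrative only. */
struct hmu_record {
	uint64_t unit_id;	/* Bits[63:N] */
	uint64_t count;		/* Bits[N-1:0] */
};

/* Split a raw 64-bit hotlist entry using counter_width N from the header. */
static struct hmu_record hmu_decode(uint64_t raw, unsigned int counter_width)
{
	struct hmu_record rec;

	assert(counter_width > 0 && counter_width < 64);
	rec.count = raw & ((UINT64_C(1) << counter_width) - 1);
	rec.unit_id = raw >> counter_width;
	return rec;
}

/*
 * Start DPA of a unit. Assumption: units are (1 << granule_shift) bytes
 * and unit 0 starts at dpa_base.
 */
static uint64_t hmu_unit_dpa(uint64_t unit_id, unsigned int granule_shift,
			     uint64_t dpa_base)
{
	assert(granule_shift < 64);
	return dpa_base + (unit_id << granule_shift);
}
```

Applied to the second record in the example output, 0000000000010364 with
counter_width 0x10 decodes to unit 1 with a count of 0x364.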
> 
> Specific RFC questions.
> - What should be in the header added to the aux buffer.
>   Currently just the minimum is provided. Number of records
>   and the counter width needed to decode them.
> - Should we reset the counters when doing sampling "-F X"
>   If the frequency is higher than the epoch we never see any hot units.
>   If so, when should we reset them?
> 
> Note testing has been light and on emulation only + as perf tool is
> a pain to build on a stripped back VM, build testing has all been on
> arm64 so far.  The driver loads though on both arm64 and x86 so
> any problems are likely in the perf tool arch specific code
> which is build tested (on wrong machine)
> 
> The QEMU emulation needs some cleanup, but I should be able to post
> that shortly to let people actually play with this.  There are lots
> of open questions there on how 'right' we want the emulation to be
> and what counting uarch to emulate.
> 
> Jonathan Cameron (4):
>   cxl: Register devices for CXL Hotness Monitoring Units (CHMU)
>   cxl: Hotness Monitoring Unit via a Perf AUX Buffer.
>   perf: Add support for CXL Hotness Monitoring Units (CHMU)
>   hwtrace: Document CXL Hotness Monitoring Unit driver
> 
>  Documentation/trace/cxl-hmu.rst     | 197 +++++++
>  Documentation/trace/index.rst       |   1 +
>  drivers/cxl/Kconfig                 |   6 +
>  drivers/cxl/Makefile                |   3 +
>  drivers/cxl/core/Makefile           |   1 +
>  drivers/cxl/core/core.h             |   1 +
>  drivers/cxl/core/hmu.c              |  64 ++
>  drivers/cxl/core/port.c             |   2 +
>  drivers/cxl/core/regs.c             |  14 +
>  drivers/cxl/cxl.h                   |   5 +
>  drivers/cxl/cxlpci.h                |   1 +
>  drivers/cxl/hmu.c                   | 880 ++++++++++++++++++++++++++++
>  drivers/cxl/hmu.h                   |  23 +
>  drivers/cxl/pci.c                   |  26 +-
>  tools/perf/arch/arm/util/auxtrace.c |  58 ++
>  tools/perf/arch/x86/util/auxtrace.c |  76 +++
>  tools/perf/util/Build               |   1 +
>  tools/perf/util/auxtrace.c          |   4 +
>  tools/perf/util/auxtrace.h          |   1 +
>  tools/perf/util/cxl-hmu.c           | 367 ++++++++++++
>  tools/perf/util/cxl-hmu.h           |  18 +
>  21 files changed, 1748 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/trace/cxl-hmu.rst
>  create mode 100644 drivers/cxl/core/hmu.c
>  create mode 100644 drivers/cxl/hmu.c
>  create mode 100644 drivers/cxl/hmu.h
>  create mode 100644 tools/perf/util/cxl-hmu.c
>  create mode 100644 tools/perf/util/cxl-hmu.h
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* RE: [EXT] Re: [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver
  2024-11-27 16:34 ` Jonathan Cameron
@ 2024-12-04 12:35   ` Ajay Joshi
       [not found]   ` <CGME20250103053521epcas5p30cd4abba59d695664335b03ba806c56d@epcas5p3.samsung.com>
  1 sibling, 0 replies; 27+ messages in thread
From: Ajay Joshi @ 2024-12-04 12:35 UTC (permalink / raw)
  To: Jonathan Cameron, linux-cxl@vger.kernel.org, linux-mm@kvack.org,
	linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org,
	linuxarm@huawei.com
  Cc: tongtiangen@huawei.com, Yicong Yang, Niyas Sait, Vandana Salve,
	Davidlohr Bueso, Dave Jiang, Alison Schofield, Ira Weiny,
	Dan Williams, Alexander Shishkin, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Gregory Price, Huang Ying,
	Vanshika Gupta, Vishal Tanna, Aravind Ramesh

> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> Sent: Wednesday, November 27, 2024 10:05 PM
>
>
> On Thu, 21 Nov 2024 10:18:41 +0000
> Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
>
> > The CXL specification release 3.2 is now available under a click through at
> > https://computeexpresslink.org/cxl-specification/ and it brings new
> > shiny toys.
>
> If anyone wants to play, basic emulation on my CXL QEMU staging tree
> https://gitlab.com/jic23/qemu/-/commit/e89b35d264c1bcc04807e7afab1254f35ffc8cb9

This is interesting. We will definitely try this and let you know how it goes.

>
> Branch with a few other things on top is:
> https://gitlab.com/jic23/qemu/-/commits/cxl-2024-11-27
>
> Note that this currently doesn't produce real data.  I have a plan / initial PoC /
> hack to hook that up via an addition to the QEMU cache plugin and an
> external tool to emulate the hotness tracker counting hardware. Will be a little
> while before I get that finished, so in a meantime the above exercises the
> driver.
>
> Jonathan
>
> >
> > RFC reason
> > - Whilst trace capture with a particular configuration is potentially useful
> >   the intent is that CXL HMU units will be used to drive various forms of
> >   hotpage migration for memory tiering setups. This driver doesn't do this
> >   (yet), but rather provides data capture etc for experimentation and
> >   for working out how to mostly put the allocations in the right place to
> >   start with by tuning applications.
> >
> > CXL r3.2 introduces a CXL Hotness Monitoring Unit definition. The
> > intent of this is to provide a way to establish which units of memory
> > (typically pages or larger) in CXL attached memory are hot. The
> > implementation details and algorithm are all implementation defined.
> > The specification simply describes the 'interface' which takes the
> > form of ring buffer of hotness records in a PCI BAR and defined
> > capability, configuration and status registers.
> >
> > The hardware may have constraints on what it can track, granularity
> > etc and on how accurately it tracks (e.g. counter exhaustion,
> > inaccurate trackers). Some of these constraints are discoverable from
> > the hardware registers, others such as loss of accuracy have no
> > universally accepted measures as they are typically access pattern
> > dependent. Sadly it is very unlikely any hardware will implement a
> > truly precise tracker given the large resource requirements for tracking at a
> > useful granularity.
> >
> > There are two fundamental operation modes:
> >
> > * Epoch based. Counters are checked after a period of time (Epoch) and
> >   if over a threshold added to the hotlist.
> > * Always on. Counters run until a threshold is reached, after that the
> >   hot unit is added to the hotlist and the counter released.
> >
> > Counting can be filtered on:
> >
> > * Region of CXL DPA space (256MiB per bit in a bitmap).
> > * Type of access - Trusted and non trusted or non trusted only, R/W/RW
> >
> > Sampling can be modified by:
> >
> > * Downsampling including potentially randomized downsampling.
> >
> > The driver presented here is intended to be useful in its own right
> > but also to act as the first step of a possible path towards hotness
> > monitoring based hot page migration. Those steps might look like.
> >
> > 1. Gather data - drivers provide telemetry like solutions to get that
> >    data. May be enhanced, for example in this driver by providing the
> >    HPA address rather than DPA Unit Address. Userspace can access enough
> >    information to do this so maybe not.
> > 2. Userspace algorithm development, possibly combined with userspace
> >    triggered migration by PA. Working out how to use different levels
> >    of constrained hardware resources will be challenging.
> > 3. Move those algorithms in kernel. Will require generalization across
> >    different hotpage trackers etc.
> >
> > So far this driver just gives access to the raw data. I will probably
> > kick off a longer discussion on how to do adaptive sampling needed to
> > actually use these units for tiering etc, sometime soon (if no one
> > else beats me to it).  There is a follow up topic of how to
> > virtualize this stuff for memory stranding cases (VM gets a fixed
> > mixture of fast and slow memory and should do its own tiering).
> >
> > More details in the Documentation patch but typical commands are:
> >
> > $perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\
> >  hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,\
> >  range_size=1024,randomized_downsampling=0,downsampling_factor=32,\
> >  hotness_granual=12
> >
> > $perf report --dump-raw-traces
> >
> > Example output.  With a counter_width of 16 (0x10) the least significant
> > 16 bits (4 hex digits) are the counter value and the unit index is bits 16-63.
> > Here all units are over the threshold and the indexes are 0,1,2 etc.
> >
> > . ... CXL_HMU data: size 33512 bytes
> > Header 0: units: 29c counter_width 10
> > Header 1 : deadbeef
> > 0000000000000283
> > 0000000000010364
> > 0000000000020366
> > 000000000003033c
> > 0000000000040343
> > 00000000000502ff
> > 000000000006030d
> > 000000000007031a
> >
> > Which will produce a list of hotness entries.
> > Bits[N-1:0] counter value
> > Bits[63:N] Unit ID (combine with unit size and DPA base + HDM decoder
> >   config to get to a Host Physical Address)
> >
> > Specific RFC questions.
> > - What should be in the header added to the aux buffer.
> >   Currently just the minimum is provided. Number of records
> >   and the counter width needed to decode them.
> > - Should we reset the counters when doing sampling "-F X"
> >   If the frequency is higher than the epoch we never see any hot units.
> >   If so, when should we reset them?
> >
> > Note testing has been light and on emulation only + as perf tool is a
> > pain to build on a stripped back VM, build testing has all been on
> > arm64 so far.  The driver loads though on both arm64 and x86 so any
> > problems are likely in the perf tool arch specific code which is build
> > tested (on wrong machine)
> >
> > The QEMU emulation needs some cleanup, but I should be able to post
> > that shortly to let people actually play with this.  There are lots of
> > open questions there on how 'right' we want the emulation to be and
> > what counting uarch to emulate.
> >
> > Jonathan Cameron (4):
> >   cxl: Register devices for CXL Hotness Monitoring Units (CHMU)
> >   cxl: Hotness Monitoring Unit via a Perf AUX Buffer.
> >   perf: Add support for CXL Hotness Monitoring Units (CHMU)
> >   hwtrace: Document CXL Hotness Monitoring Unit driver
> >
> >  Documentation/trace/cxl-hmu.rst     | 197 +++++++
> >  Documentation/trace/index.rst       |   1 +
> >  drivers/cxl/Kconfig                 |   6 +
> >  drivers/cxl/Makefile                |   3 +
> >  drivers/cxl/core/Makefile           |   1 +
> >  drivers/cxl/core/core.h             |   1 +
> >  drivers/cxl/core/hmu.c              |  64 ++
> >  drivers/cxl/core/port.c             |   2 +
> >  drivers/cxl/core/regs.c             |  14 +
> >  drivers/cxl/cxl.h                   |   5 +
> >  drivers/cxl/cxlpci.h                |   1 +
> >  drivers/cxl/hmu.c                   | 880 ++++++++++++++++++++++++++++
> >  drivers/cxl/hmu.h                   |  23 +
> >  drivers/cxl/pci.c                   |  26 +-
> >  tools/perf/arch/arm/util/auxtrace.c |  58 ++
> >  tools/perf/arch/x86/util/auxtrace.c |  76 +++
> >  tools/perf/util/Build               |   1 +
> >  tools/perf/util/auxtrace.c          |   4 +
> >  tools/perf/util/auxtrace.h          |   1 +
> >  tools/perf/util/cxl-hmu.c           | 367 ++++++++++++
> >  tools/perf/util/cxl-hmu.h           |  18 +
> >  21 files changed, 1748 insertions(+), 1 deletion(-)
> >  create mode 100644 Documentation/trace/cxl-hmu.rst
> >  create mode 100644 drivers/cxl/core/hmu.c
> >  create mode 100644 drivers/cxl/hmu.c
> >  create mode 100644 drivers/cxl/hmu.h
> >  create mode 100644 tools/perf/util/cxl-hmu.c
> >  create mode 100644 tools/perf/util/cxl-hmu.h
> >


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/4] cxl: Register devices for CXL Hotness Monitoring Units (CHMU)
       [not found]   ` <CGME20250103052421epcas5p4a1a917ba5d367dfccec91d4522666ca0@epcas5p4.samsung.com>
@ 2025-01-03  5:16     ` Neeraj Kumar
  2025-01-03 12:07       ` Jonathan Cameron
  0 siblings, 1 reply; 27+ messages in thread
From: Neeraj Kumar @ 2025-01-03  5:16 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, linux-mm, linux-perf-users, linux-kernel, linuxarm,
	tongtiangen, Yicong Yang, Niyas Sait, ajayjoshi, Vandana Salve,
	Davidlohr Bueso, Dave Jiang, Alison Schofield, Ira Weiny,
	Dan Williams, Alexander Shishkin, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Gregory Price, Huang Ying,
	Vishak G, Krishna Kanth Reddy, Alok Rathore, gost.dev

[-- Attachment #1: Type: text/plain, Size: 6786 bytes --]

On 21/11/24 10:18AM, Jonathan Cameron wrote:
>Basic registration using similar approach to how the CPMUs
>are registered.
>
>Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>---
> drivers/cxl/core/Makefile |  1 +
> drivers/cxl/core/hmu.c    | 64 +++++++++++++++++++++++++++++++++++++++
> drivers/cxl/core/regs.c   | 14 +++++++++
> drivers/cxl/cxl.h         |  4 +++
> drivers/cxl/cxlpci.h      |  1 +
> drivers/cxl/hmu.h         | 23 ++++++++++++++
> drivers/cxl/pci.c         | 26 +++++++++++++++-
> 7 files changed, 132 insertions(+), 1 deletion(-)
>
>diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
>index 9259bcc6773c..d060abb773ae 100644
>--- a/drivers/cxl/core/Makefile
>+++ b/drivers/cxl/core/Makefile
>@@ -12,6 +12,7 @@ cxl_core-y += memdev.o
> cxl_core-y += mbox.o
> cxl_core-y += pci.o
> cxl_core-y += hdm.o
>+cxl_core-y += hmu.o
> cxl_core-y += pmu.o
> cxl_core-y += cdat.o
> cxl_core-$(CONFIG_TRACING) += trace.o
>diff --git a/drivers/cxl/core/hmu.c b/drivers/cxl/core/hmu.c
>new file mode 100644
>index 000000000000..3ee938bb6c05
>--- /dev/null
>+++ b/drivers/cxl/core/hmu.c
>@@ -0,0 +1,64 @@
>+// SPDX-License-Identifier: GPL-2.0-only
>+/* Copyright(c) 2024 Huawei. All rights reserved. */
>+
>+#include <linux/device.h>
>+#include <linux/slab.h>
>+#include <linux/idr.h>
>+#include <cxlmem.h>
>+#include <hmu.h>
>+#include <cxl.h>
>+#include "core.h"
>+
>+static void cxl_hmu_release(struct device *dev)
>+{
>+	struct cxl_hmu *hmu = to_cxl_hmu(dev);
>+
>+	kfree(hmu);
>+}
>+
>+const struct device_type cxl_hmu_type = {
>+	.name = "cxl_hmu",
>+	.release = cxl_hmu_release,
>+};
>+
>+static void remove_dev(void *dev)
>+{
>+	device_unregister(dev);
>+}
>+
>+int devm_cxl_hmu_add(struct device *parent, struct cxl_hmu_regs *regs,
>+		     int assoc_id, int index)
>+{
>+	struct cxl_hmu *hmu;
>+	struct device *dev;
>+	int rc;
>+
>+	hmu = kzalloc(sizeof(*hmu), GFP_KERNEL);
>+	if (!hmu)
>+		return -ENOMEM;
>+
>+	hmu->assoc_id = assoc_id;
>+	hmu->index = index;
>+	hmu->base = regs->hmu;
>+	dev = &hmu->dev;
>+	device_initialize(dev);
>+	device_set_pm_not_required(dev);
>+	dev->parent = parent;
>+	dev->bus = &cxl_bus_type;
>+	dev->type = &cxl_hmu_type;
>+	rc = dev_set_name(dev, "hmu_mem%d.%d", assoc_id, index);
>+	if (rc)
>+		goto err;
>+
>+	rc = device_add(dev);
>+	if (rc)
>+		goto err;
>+
>+	return devm_add_action_or_reset(parent, remove_dev, dev);
>+
>+err:
>+	put_device(&hmu->dev);
>+	return rc;
>+}
>+EXPORT_SYMBOL_NS_GPL(devm_cxl_hmu_add, CXL);
>+
>diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
>index e1082e749c69..c12afaa6ef98 100644
>--- a/drivers/cxl/core/regs.c
>+++ b/drivers/cxl/core/regs.c
>@@ -401,6 +401,20 @@ int cxl_map_pmu_regs(struct cxl_register_map *map, struct cxl_pmu_regs *regs)
> }
> EXPORT_SYMBOL_NS_GPL(cxl_map_pmu_regs, CXL);
>
>+int cxl_map_hmu_regs(struct cxl_register_map *map, struct cxl_hmu_regs *regs)
>+{
>+	struct device *dev = map->host;
>+	resource_size_t phys_addr;
>+
>+	phys_addr = map->resource;
>+	regs->hmu = devm_cxl_iomap_block(dev, phys_addr, map->max_size);
>+	if (!regs->hmu)
>+		return -ENOMEM;
>+
>+	return 0;
>+}
>+EXPORT_SYMBOL_NS_GPL(cxl_map_hmu_regs, CXL);
>+
> static int cxl_map_regblock(struct cxl_register_map *map)
> {
> 	struct device *host = map->host;
>diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
>index 5406e3ab3d4a..8172bc1f7a8d 100644
>--- a/drivers/cxl/cxl.h
>+++ b/drivers/cxl/cxl.h
>@@ -227,6 +227,9 @@ struct cxl_regs {
> 	struct_group_tagged(cxl_pmu_regs, pmu_regs,
> 		void __iomem *pmu;
> 	);
>+	struct_group_tagged(cxl_hmu_regs, hmu_regs,
>+		void __iomem *hmu;
>+	);
>
> 	/*
> 	 * RCH downstream port specific RAS register
>@@ -292,6 +295,7 @@ int cxl_map_component_regs(const struct cxl_register_map *map,
> 			   unsigned long map_mask);
> int cxl_map_device_regs(const struct cxl_register_map *map,
> 			struct cxl_device_regs *regs);
>+int cxl_map_hmu_regs(struct cxl_register_map *map, struct cxl_hmu_regs *regs);
> int cxl_map_pmu_regs(struct cxl_register_map *map, struct cxl_pmu_regs *regs);
>
> enum cxl_regloc_type;
>diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
>index 4da07727ab9c..71f5e9620137 100644
>--- a/drivers/cxl/cxlpci.h
>+++ b/drivers/cxl/cxlpci.h
>@@ -67,6 +67,7 @@ enum cxl_regloc_type {
> 	CXL_REGLOC_RBI_VIRT,
> 	CXL_REGLOC_RBI_MEMDEV,
> 	CXL_REGLOC_RBI_PMU,
>+	CXL_REGLOC_RBI_HMU,
> 	CXL_REGLOC_RBI_TYPES
> };
>
>diff --git a/drivers/cxl/hmu.h b/drivers/cxl/hmu.h
>new file mode 100644
>index 000000000000..c4798ed9764b
>--- /dev/null
>+++ b/drivers/cxl/hmu.h
>@@ -0,0 +1,23 @@
>+/* SPDX-License-Identifier: GPL-2.0-only */
>+/*
>+ * Copyright(c) 2024 Huawei
>+ * CXL Specification rev 3.2 Section 8.2.8 (CHMU Register Interface)
>+ */
>+#ifndef CXL_HMU_H
>+#define CXL_HMU_H
>+#include <linux/device.h>

This builds without errors even with this include removed, so I think
it is not required. There is a similar include in drivers/cxl/pmu.h.

>+
>+#define CXL_HMU_REGMAP_SIZE 0xe00 /* Table 8-32 CXL 3.0 specification */

The macro CXL_HMU_REGMAP_SIZE above is not used, so it should be removed.
Its comment does not look appropriate either.

>+struct cxl_hmu {
>+	struct device dev;
>+	void __iomem *base;
>+	int assoc_id;
>+	int index;
>+};
>+
>+#define to_cxl_hmu(dev) container_of(dev, struct cxl_hmu, dev)
>+struct cxl_hmu_regs;
>+int devm_cxl_hmu_add(struct device *parent, struct cxl_hmu_regs *regs,
>+		     int assoc_id, int idx);
>+
>+#endif
>diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
>index 188412d45e0d..e89ea9d3f007 100644
>--- a/drivers/cxl/pci.c
>+++ b/drivers/cxl/pci.c
>@@ -15,6 +15,7 @@
> #include "cxlmem.h"
> #include "cxlpci.h"
> #include "cxl.h"
>+#include "hmu.h"
> #include "pmu.h"
>
> /**
>@@ -814,7 +815,7 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> 	struct cxl_dev_state *cxlds;
> 	struct cxl_register_map map;
> 	struct cxl_memdev *cxlmd;
>-	int i, rc, pmu_count;
>+	int i, rc, hmu_count, pmu_count;
> 	bool irq_avail;
>
> 	/*
>@@ -938,6 +939,29 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> 		}
> 	}
>
>+	hmu_count = cxl_count_regblock(pdev, CXL_REGLOC_RBI_HMU);
>+	for (i = 0; i < hmu_count; i++) {
>+		struct cxl_hmu_regs hmu_regs;
>+
>+		rc = cxl_find_regblock_instance(pdev, CXL_REGLOC_RBI_HMU, &map, i);
>+		if (rc) {
>+			dev_dbg(&pdev->dev, "Could not find HMU regblock\n");
>+			break;
>+		}
>+
>+		rc = cxl_map_hmu_regs(&map, &hmu_regs);
>+		if (rc) {
>+			dev_dbg(&pdev->dev, "Could not map HMU regs\n");
>+			break;
>+		}
>+
>+		rc = devm_cxl_hmu_add(cxlds->dev, &hmu_regs, cxlmd->id, i);
>+		if (rc) {
>+			dev_dbg(&pdev->dev, "Could not add HMU instance\n");
>+			break;
>+		}
>+	}
>+
> 	rc = cxl_event_config(host_bridge, mds, irq_avail);
> 	if (rc)
> 		return rc;
>-- 
>2.43.0
>




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 4/4] hwtrace: Document CXL Hotness Monitoring Unit driver
       [not found]   ` <CGME20250103052702epcas5p3f7eea83ac70ba7147e0de7fb30f90a62@epcas5p3.samsung.com>
@ 2025-01-03  5:19     ` Neeraj Kumar
  0 siblings, 0 replies; 27+ messages in thread
From: Neeraj Kumar @ 2025-01-03  5:19 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, linux-mm, linux-perf-users, linux-kernel, linuxarm,
	tongtiangen, Yicong Yang, Niyas Sait, ajayjoshi, Vandana Salve,
	Davidlohr Bueso, Dave Jiang, Alison Schofield, Ira Weiny,
	Dan Williams, Alexander Shishkin, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Gregory Price, Huang Ying,
	Vishak G, Krishna Kanth Reddy, Alok Rathore, gost.dev

[-- Attachment #1: Type: text/plain, Size: 8710 bytes --]

On 21/11/24 10:18AM, Jonathan Cameron wrote:
>Add basic documentation to describe the CXL HMU and the
>perf AUX buffer based interfaces.
>
>Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
>---
> Documentation/trace/cxl-hmu.rst | 197 ++++++++++++++++++++++++++++++++
> Documentation/trace/index.rst   |   1 +
> 2 files changed, 198 insertions(+)
>
>diff --git a/Documentation/trace/cxl-hmu.rst b/Documentation/trace/cxl-hmu.rst
>new file mode 100644
>index 000000000000..f07a50ba608c
>--- /dev/null
>+++ b/Documentation/trace/cxl-hmu.rst
>@@ -0,0 +1,197 @@
>+.. SPDX-License-Identifier: GPL-2.0
>+
>+==================================
>+CXL Hotness Monitoring Unit Driver
>+==================================
>+
>+CXL r3.2 introduced the CXL Hotness Monitoring Unit (CHMU). A CHMU allows
>+software running on a CXL Host to identify hot memory ranges, that is those with
>+higher access frequency relative to other memory ranges.
>+
>+A given Logical Device (presentation of a CXL memory device seen by a particular
>+host) can provide 1 or more CHMU each of which supports 1 or more separately
>+programmable CHMU Instances (CHMUI). These CHMUI are mostly independent with
>+the exception that there can be restrictions on them tracking the same memory
>+regions. The CHMUs are always completely independent.
>+The naming of the units is cxl_hmu_memX.Y.Z where memX matches the naming
>+of the memory device in /sys/bus/cxl/devices/, Y is the CHMU index and
>+Z is the CHMUI index within the CHMU.
>+
>+Each CHMUI provides a ring buffer structure known as the Hot List from which the
>+host can read back entries that describe the hotness of a particular region of
>+memory (Hot List Units). The Hot List Unit combines a Unit Address and an access
>+count for the particular address. Unit address to DPA requires multiplication
>+by the unit size. Thus, for large unit sizes the device may support higher
>+counts. It is these Hot List Units that the driver provides via a perf AUX
>+buffer by copying them from PCI BAR space.
>+
>+The unit size at which hotness is measured is configurable for each CHMUI and
>+all measurement is done in Device Physical Address space. To relate this to
>+Host Physical Address space the HDM (Host-Managed Device Memory) decoder
>+configuration must be taken into account to reflect the placement in a
>+CXL Fixed Memory Window and any interleaving.
>+
>+The CHMUI can support interrupts on fills above a watermark, or on overflow
>+of the hotlist.
>+
>+A CHMUI can support two different basic modes of operation. Epoch and
>+Always On. These affect what is placed on the hotlist. Note that the actual
>+implementation of tracking is implementation defined and likely to be
>+inherently imprecise in that the hottest pages may not be discovered due to
>+resource exhaustion and the hotness counts may not represent accurately how
>+hot they are. The specification allows for a very high degree of flexibility
>+in implementation, important as it is likely that a number of different
>+hardware implementations will be chosen to suit particular silicon and accuracy
>+budgets.
>+
>+Operation and configuration
>+===========================
>+
>+An example command line is::
>+
>+  $perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\
>+  hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,\
>+  range_size=1024,randomized_downsampling=0,downsampling_factor=32,\
>+  hotness_granual=12
>+
>+  $perf report --dump-raw-traces

Typo: the option is --dump-raw-trace (no trailing 's').

>+
>+which will produce a list of hotlist entries, one per line with a short header
>+to provide sufficient information to interpret the entries::
>+
>+  . ... CXL_HMU data: size 33512 bytes
>+  Header 0: units: 29c counter_width 10
>+  Header 1 : deadbeef
>+  0000000000000283
>+  0000000000010364
>+  0000000000020366
>+  000000000003033c
>+  0000000000040343
>+  00000000000502ff
>+  000000000006030d
>+  000000000007031a
>+  ...
>+
>+The least significant counter_width bits (here 16, hex 10) are the counter
>+value, all higher bits are the unit index.  Multiply by the unit size
>+to get a Device Physical Address.
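
The decode described above can be sketched in C as follows. This is only an
illustration under stated assumptions: struct hot_entry and
decode_hot_entry() are made-up names, not part of the posted driver or perf
tool, and the unit size is assumed to be passed as its log2 (the value given
to hotness_granual).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical decoded form of one 64-bit hot list record. */
struct hot_entry {
	uint64_t unit_index;	/* bits [63:counter_width] */
	uint64_t count;		/* bits [counter_width-1:0] */
	uint64_t dpa;		/* unit_index * unit size */
};

/*
 * Split a raw record into counter and unit index, then scale the unit
 * index by the unit size (1 << granual_log2) to get a DPA.
 */
static struct hot_entry decode_hot_entry(uint64_t raw,
					 unsigned int counter_width,
					 unsigned int granual_log2)
{
	struct hot_entry e;

	assert(counter_width > 0 && counter_width < 64);
	assert(granual_log2 < 64);

	e.count = raw & ((UINT64_C(1) << counter_width) - 1);
	e.unit_index = raw >> counter_width;
	e.dpa = e.unit_index << granual_log2;
	return e;
}
```

For the second record in the sample dump, decode_hot_entry(0x10364, 16, 12)
gives unit index 1, count 0x364 and DPA 0x1000, assuming hotness_granual=12
as in the example command.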
>+
>+The parameters are as follows:
>+
>+epoch_type
>+----------
>+
>+Two values may be supported::
>+
>+  0 - Epoch based operation
>+  1 - Always on operation
>+
>+
>+0. Epoch Based Operation
>+~~~~~~~~~~~~~~~~~~~~~~~~
>+
>+An Epoch is a period of time after which a counter is assessed for hotness.
>+
>+The device may have a global sense of an Epoch but it may also operate them on
>+a per counter, or per region of device basis. This is a function of the
>+implementation and is not controllable, but is discoverable. In a global Epoch
>+scheme at start of each Epoch all counters are zeroed / deallocated. Counters
>+are then allocated in a hardware specific manner and accesses counted. At the
>+completion of the Epoch the counters are compared with a threshold and entries
>+with a count above a configurable threshold are added to the hotlist. A new
>+Epoch is then begun with all counters cleared.
>+
>+In non-global Epoch scheme, when the Epoch of a given counter begins is not
>+specified. An example might be an Epoch for counter only starting on first
>+touch to the relevant memory region.  When a local Epoch ends the counter is
>+compared to the threshold and if appropriate added to the hotlist.
>+
>+Note, in Epoch Based Operation, the counter in the hotlist entry provides
>+information on how hot the memory is as the counter for the full Epoch is
>+provided.
>+
>+1. Always on Operation
>+~~~~~~~~~~~~~~~~~~~~~~
>+
>+In this mode, counters may all be reset before enabling the CHMUI. Then
>+counters are allocated to particular memory units via a hardware specific
>+method, perhaps on first touch.  When a counter passes the configurable
>+hotness threshold an entry is added to the hotlist and that counter is freed
>+for reuse.
>+
>+In this scheme the count provided in the hotlist entry is not useful as it will
>+depend only on the configured threshold.
>+
>+access_type
>+-----------
>+
>+The parameter controls which accesses are counted::
>+
>+  1 - Non-TEE read only
>+  2 - Non-TEE write only
>+  3 - Non-TEE read and write
>+  4 - TEE and Non-TEE read only
>+  5 - TEE and Non-TEE write only
>+  6 - TEE and Non-TEE read and write
>+
>+
>+TEE here refers to a trusted execution environment, specifically one that
>+results in the T bit being set in the CXL transactions.
>+
>+
>+hotness_granual
>+---------------
>+
>+Unit size at which tracking is performed.  Must be at least 256 bytes but
>+hardware may only support some sizes. Expressed as a power of 2. e.g. 12 = 4KiB.
>+
>+hotness_threshold
>+-----------------
>+
>+This is the minimum counter value that must be reached for the unit to count as
>+hot and be added to the hotlist.
>+
>+The possible range may be dependent on the unit size as a larger unit size
>+requires more bits on the hotlist entry leaving fewer available for the hotness
>+counter.
>+
>+epoch_multiplier and epoch_scale
>+--------------------------------
>+
>+The length of an epoch (in epoch mode) is controlled by these two parameters
>+with the decoded epoch_scale multiplied by the epoch_multiplier to give the
>+overall epoch length.
>+
>+epoch_scale::
>+
>+  1 - 100 usecs
>+  2 - 1 msec
>+  3 - 10 msecs
>+  4 - 100 msecs
>+  5 - 1 second
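
As a worked example of the epoch length calculation, here is a small C
sketch. The function name epoch_length_us() is made up for illustration and
the table simply mirrors the epoch_scale values listed above; it is not
taken from the driver.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Epoch length in microseconds for the documented epoch_scale encodings
 * (1 = 100 usecs ... 5 = 1 second), multiplied by epoch_multiplier.
 * Values outside the documented table are treated as a caller bug here.
 */
static uint64_t epoch_length_us(unsigned int epoch_scale,
				unsigned int epoch_multiplier)
{
	static const uint64_t scale_us[] = {
		[1] = 100,
		[2] = 1000,
		[3] = 10000,
		[4] = 100000,
		[5] = 1000000,
	};

	assert(epoch_scale >= 1 && epoch_scale <= 5);
	return scale_us[epoch_scale] * epoch_multiplier;
}
```

With the values from the example command, epoch_multiplier=4 and
epoch_scale=4, this gives 4 x 100 msecs = 400 msecs per epoch.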
>+
>+range_base and range_size
>+-------------------------
>+
>+Expressed in terms of the unit size set via hotness_granual. Each CHMUI has a
>+bitmap that controls which Device Physical Address space is tracked. Each bit
>+represents 256MiB of DPA space.
>+
>+This interface provides a simple base and size in units of 256MiB to configure
>+this bitmap. All bits in the specified range will be set.
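
A rough C sketch of turning a base and size (both in 256MiB units) into such
a bitmap is below. range_to_bitmap() is a hypothetical helper, and storing
the bitmap as an array of 64-bit words with bit 0 of word 0 covering the
first 256MiB is my assumption, not something taken from the driver or the
specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Clear the bitmap, then set one bit per 256MiB block in
 * [range_base, range_base + range_size).
 */
static void range_to_bitmap(uint64_t *bitmap, size_t nwords,
			    uint64_t range_base, uint64_t range_size)
{
	uint64_t total_bits = (uint64_t)nwords * 64;
	uint64_t bit;

	/* The requested range must fit, without the sum overflowing. */
	assert(range_size <= total_bits);
	assert(range_base <= total_bits - range_size);

	memset(bitmap, 0, nwords * sizeof(*bitmap));
	for (bit = range_base; bit < range_base + range_size; bit++)
		bitmap[bit / 64] |= UINT64_C(1) << (bit % 64);
}
```

With range_base=0 and range_size=1024 as in the example command, this sets
1024 bits, that is the first 256GiB of DPA space.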
>+
>+downsampling_factor
>+-------------------
>+
>+Hardware may be incapable of counting accesses at full speed or it may be
>+desirable to count over a longer period during which the counters would
>+overflow.  This control allows selection of a down sampling factor expressed
>+as a power of 2 between 1 and 32768.  Default is minimum supported downsampling
>+factor.
>+
>+randomized_downsampling
>+-----------------------
>+
>+To avoid problems with downsampling when accesses are periodic this option
>+allows for an implementation defined randomization of the sampling interval,
>+whilst remaining close to the specified downsampling_factor.
>diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
>index 0b300901fd75..b35ed8e9dfa9 100644
>--- a/Documentation/trace/index.rst
>+++ b/Documentation/trace/index.rst
>@@ -36,3 +36,4 @@ Linux Tracing Technologies
>    user_events
>    rv/index
>    hisi-ptt
>+   cxl-hmu
>-- 
>2.43.0
>




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver
       [not found]   ` <CGME20250103053521epcas5p30cd4abba59d695664335b03ba806c56d@epcas5p3.samsung.com>
@ 2025-01-03  5:27     ` Neeraj Kumar
  2025-01-15 13:42       ` Jonathan Cameron
  0 siblings, 1 reply; 27+ messages in thread
From: Neeraj Kumar @ 2025-01-03  5:27 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, linux-mm, linux-perf-users, linux-kernel, linuxarm,
	tongtiangen, Yicong Yang, Niyas Sait, ajayjoshi, Vandana Salve,
	Davidlohr Bueso, Dave Jiang, Alison Schofield, Ira Weiny,
	Dan Williams, Alexander Shishkin, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Gregory Price, Huang Ying,
	Vishak G, Krishna Kanth Reddy, Alok Rathore, gost.dev

[-- Attachment #1: Type: text/plain, Size: 10572 bytes --]

On 27/11/24 04:34PM, Jonathan Cameron wrote:
>On Thu, 21 Nov 2024 10:18:41 +0000
>Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
>
>> The CXL specification release 3.2 is now available under a click through at
>> https://computeexpresslink.org/cxl-specification/ and it brings new
>> shiny toys.
>
>If anyone wants to play, basic emulation on my CXL QEMU staging tree
>https://gitlab.com/jic23/qemu/-/commit/e89b35d264c1bcc04807e7afab1254f35ffc8cb9
>
>Branch with a few other things on top is:
>https://gitlab.com/jic23/qemu/-/commits/cxl-2024-11-27
>
>Note that this currently doesn't produce real data.  I have a plan
>/ initial PoC / hack to hook that up via an addition to the QEMU cache
>plugin and an external tool to emulate the hotness tracker counting
>hardware. Will be a little while before I get that finished, so in
>a meantime the above exercises the driver.
>
>Jonathan
>
>>
>> RFC reason
>> - Whilst trace capture with a particular configuration is potentially useful
>>   the intent is that CXL HMU units will be used to drive various forms of
>>   hotpage migration for memory tiering setups. This driver doesn't do this
>>   (yet), but rather provides data capture etc for experimentation and
>>   for working out how to mostly put the allocations in the right place to
>>   start with by tuning applications.
>>
>> CXL r3.2 introduces a CXL Hotness Monitoring Unit definition. The intent
>> of this is to provide a way to establish which units of memory (typically
>> pages or larger) in CXL attached memory are hot. The implementation details
>> and algorithm are all implementation defined. The specification simply
>> describes the 'interface' which takes the form of ring buffer of hotness
>> records in a PCI BAR and defined capability, configuration and status
>> registers.
>>
>> The hardware may have constraints on what it can track, granularity etc
>> and on how accurately it tracks (e.g. counter exhaustion, inaccurate
>> trackers). Some of these constraints are discoverable from the hardware
>> registers, others such as loss of accuracy have no universally accepted
>> measures as they are typically access pattern dependent. Sadly it is
>> very unlikely any hardware will implement a truly precise tracker given
>> the large resource requirements for tracking at a useful granularity.
>>
>> There are two fundamental operation modes:
>>
>> * Epoch based. Counters are checked after a period of time (Epoch) and
>>   if over a threshold added to the hotlist.
>> * Always on. Counters run until a threshold is reached, after that the
>>   hot unit is added to the hotlist and the counter released.
>>
>> Counting can be filtered on:
>>
>> * Region of CXL DPA space (256MiB per bit in a bitmap).
>> * Type of access - Trusted and non trusted or non trusted only, R/W/RW
>>
>> Sampling can be modified by:
>>
>> * Downsampling including potentially randomized downsampling.
>>
>> The driver presented here is intended to be useful in its own right but
>> also to act as the first step of a possible path towards hotness monitoring
>> based hot page migration. Those steps might look like.
>>
>> 1. Gather data - drivers provide telemetry like solutions to get that
>>    data. May be enhanced, for example in this driver by providing the
>>    HPA address rather than DPA Unit Address. Userspace can access enough
>>    information to do this so maybe not.
>> 2. Userspace algorithm development, possibly combined with userspace
>>    triggered migration by PA. Working out how to use different levels
>>    of constrained hardware resources will be challenging.
>> 3. Move those algorithms in kernel. Will require generalization across
>>    different hotpage trackers etc.
>>
>> So far this driver just gives access to the raw data. I will probably kick
>> off a longer discussion on how to do adaptive sampling needed to actually
>> use these units for tiering etc, sometime soon (if no one else beats
>> me to it).  There is a follow up topic of how to virtualize this stuff
>> for memory stranding cases (VM gets a fixed mixture of fast and slow
>> memory and should do its own tiering).
>>
>> More details in the Documentation patch but typical commands are:
>>
>> $perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\
>>  hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,\
>>  range_size=1024,randomized_downsampling=0,downsampling_factor=32,\
>>  hotness_granual=12

I am facing an issue when running perf record in an x86 emulation
environment. The steps I followed were:

1. Applied the CHMU patches to the cxl-for-6.13 branch using the b4 utility.
As no base commit is specified, a minor change was needed to get them to
apply. Compiled the kernel with CONFIG_CXL_HMU.

2. Compiled jic23/cxl-2024-11-27 for x86_64-softmmu.

3. Launched QEMU with the following CXL topology and the compiled kernel:
VM="-object memory-backend-ram,id=vmem1,share=on,size=512M \
     -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
     -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
     -device cxl-type3,bus=root_port13,volatile-memdev=vmem1,id=cxl-vmem1 \
     -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=8k"

4. Created a region and onlined this memory. Also ran the top utility on the
newly created NUMA node using numactl -m<node> top.

5. Compiled and installed the perf utility in the QEMU environment. The
cxl_hmu_mem* entries are visible in perf list:

root@QEMUCXL2030mm:~# perf list
<snip>
  cxl_hmu_mem0.0.0/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
  cxl_hmu_mem0.0.1/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
  cxl_hmu_mem0.0.2/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
  cxl_hmu_mem1.0.0/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
  cxl_hmu_mem1.0.1/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
  cxl_hmu_mem1.0.2/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
  cxl_pmu_mem0.0/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
  cxl_pmu_mem0.1/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
  cxl_pmu_mem1.0/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
  cxl_pmu_mem1.1/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
<snip>

6. Tried running the perf command given in Documentation/trace/cxl-hmu.rst:

root@QEMUCXL2030mm:/home/cxl/cxl-linux-mainline/tools/perf# perf -v
perf version 6.12.rc5.gc198a4f4a356
root@QEMUCXL2030mm:/home/cxl/cxl-linux-mainline/tools/perf# perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,range_size=1024,randomized_downsampling=0,downsampling_factor=32,hotness_granual=12
event syntax error: '..ess_granual=12'
                                  \___ Unrecognized input



Are there any steps I am missing?

Regards,
Neeraj	 

>>
>> $perf report --dump-raw-traces
>>
>> Example output.  With a counter_width of 16 (0x10) the least significant
>> 2 bytes are the counter value and the unit index is bits 16-63.
>> Here all units are over the threshold and the indexes are 0,1,2 etc.
>>
>> . ... CXL_HMU data: size 33512 bytes
>> Header 0: units: 29c counter_width 10
>> Header 1 : deadbeef
>> 0000000000000283
>> 0000000000010364
>> 0000000000020366
>> 000000000003033c
>> 0000000000040343
>> 00000000000502ff
>> 000000000006030d
>> 000000000007031a
>>
>> Which will produce a list of hotness entries.
>> Bits[N-1:0] counter value
>> Bits[63:N] Unit ID (combine with unit size and DPA base + HDM decoder
>>   config to get to a Host Physical Address)
>>
>> Specific RFC questions.
>> - What should be in the header added to the aux buffer.
>>   Currently just the minimum is provided. Number of records
>>   and the counter width needed to decode them.
>> - Should we reset the counters when doing sampling "-F X"
>>   If the frequency is higher than the epoch we never see any hot units.
>>   If so, when should we reset them?
>>
>> Note testing has been light and on emulation only + as perf tool is
>> a pain to build on a stripped back VM, build testing has all been on
>> arm64 so far.  The driver loads though on both arm64 and x86 so
>> any problems are likely in the perf tool arch specific code
>> which is build tested (on wrong machine)
>>
>> The QEMU emulation needs some cleanup, but I should be able to post
>> that shortly to let people actually play with this.  There are lots
>> of open questions there on how 'right' we want the emulation to be
>> and what counting uarch to emulate.
>>
>> Jonathan Cameron (4):
>>   cxl: Register devices for CXL Hotness Monitoring Units (CHMU)
>>   cxl: Hotness Monitoring Unit via a Perf AUX Buffer.
>>   perf: Add support for CXL Hotness Monitoring Units (CHMU)
>>   hwtrace: Document CXL Hotness Monitoring Unit driver
>>
>>  Documentation/trace/cxl-hmu.rst     | 197 +++++++
>>  Documentation/trace/index.rst       |   1 +
>>  drivers/cxl/Kconfig                 |   6 +
>>  drivers/cxl/Makefile                |   3 +
>>  drivers/cxl/core/Makefile           |   1 +
>>  drivers/cxl/core/core.h             |   1 +
>>  drivers/cxl/core/hmu.c              |  64 ++
>>  drivers/cxl/core/port.c             |   2 +
>>  drivers/cxl/core/regs.c             |  14 +
>>  drivers/cxl/cxl.h                   |   5 +
>>  drivers/cxl/cxlpci.h                |   1 +
>>  drivers/cxl/hmu.c                   | 880 ++++++++++++++++++++++++++++
>>  drivers/cxl/hmu.h                   |  23 +
>>  drivers/cxl/pci.c                   |  26 +-
>>  tools/perf/arch/arm/util/auxtrace.c |  58 ++
>>  tools/perf/arch/x86/util/auxtrace.c |  76 +++
>>  tools/perf/util/Build               |   1 +
>>  tools/perf/util/auxtrace.c          |   4 +
>>  tools/perf/util/auxtrace.h          |   1 +
>>  tools/perf/util/cxl-hmu.c           | 367 ++++++++++++
>>  tools/perf/util/cxl-hmu.h           |  18 +
>>  21 files changed, 1748 insertions(+), 1 deletion(-)
>>  create mode 100644 Documentation/trace/cxl-hmu.rst
>>  create mode 100644 drivers/cxl/core/hmu.c
>>  create mode 100644 drivers/cxl/hmu.c
>>  create mode 100644 drivers/cxl/hmu.h
>>  create mode 100644 tools/perf/util/cxl-hmu.c
>>  create mode 100644 tools/perf/util/cxl-hmu.h
>>
>




^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 1/4] cxl: Register devices for CXL Hotness Monitoring Units (CHMU)
  2025-01-03  5:16     ` Neeraj Kumar
@ 2025-01-03 12:07       ` Jonathan Cameron
  0 siblings, 0 replies; 27+ messages in thread
From: Jonathan Cameron @ 2025-01-03 12:07 UTC (permalink / raw)
  To: Neeraj Kumar
  Cc: linux-cxl, linux-mm, linux-perf-users, linux-kernel, linuxarm,
	tongtiangen, Yicong Yang, Niyas Sait, ajayjoshi, Vandana Salve,
	Davidlohr Bueso, Dave Jiang, Alison Schofield, Ira Weiny,
	Dan Williams, Alexander Shishkin, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Gregory Price, Huang Ying,
	Vishak G, Krishna Kanth Reddy, Alok Rathore, gost.dev

> >diff --git a/drivers/cxl/hmu.h b/drivers/cxl/hmu.h
> >new file mode 100644
> >index 000000000000..c4798ed9764b
> >--- /dev/null
> >+++ b/drivers/cxl/hmu.h
> >@@ -0,0 +1,23 @@
> >+/* SPDX-License-Identifier: GPL-2.0-only */
> >+/*
> >+ * Copyright(c) 2024 Huawei
> >+ * CXL Specification rev 3.2 Section 8.2.8 (CHMU Register Interface)
> >+ */
> >+#ifndef CXL_HMU_H
> >+#define CXL_HMU_H
> >+#include <linux/device.h>  
> 
> No compilation errors even by removing this header.
> I think this inclusion is not required.
> Also found similar include at drivers/cxl/pmu.h

The kernel generally follows the include-what-you-use principle to avoid
future issues due to reorganization of headers etc.

Here the struct device definition is needed below (struct cxl_hmu embeds
one), so this header should be included.  If there are other cases that do
not do this, they should be fixed (there are ongoing efforts to clean this
up btw by adding the missing includes).
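
To illustrate why the full definition matters, here is a userspace sketch
with made-up type names (fake_device, fake_regs, fake_hmu); it is not kernel
code. A forward declaration is enough for a pointer member, but a struct
embedded by value needs its complete type to be visible.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for a type that would normally come from its own header. */
struct fake_device {
	int id;
};

struct fake_regs;	/* forward declaration: sufficient for pointers only */

struct fake_hmu {
	struct fake_device dev;		/* by value: needs the full definition */
	struct fake_regs *regs;		/* pointer: forward declaration is enough */
	int index;
};

/* Same idea as container_of(): recover the outer struct from a member. */
#define to_fake_hmu(d) \
	((struct fake_hmu *)((char *)(d) - offsetof(struct fake_hmu, dev)))

static int fake_hmu_index(struct fake_device *d)
{
	assert(d != NULL);
	return to_fake_hmu(d)->index;
}
```

If the definition of struct fake_device were dropped and only reached via
some other header by luck, struct fake_hmu would stop compiling as soon as
that other header changed.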

> 
> >+
> >+#define CXL_HMU_REGMAP_SIZE 0xe00 /* Table 8-32 CXL 3.0 specification */  
> 
> Above Macro CXL_HMU_REGMAP_SIZE is not used, So we should remove it.
> Its comment is also not appropriate

Not sure about the comment being inappropriate, but sure, this define can
go away (and the comment with it).

Thanks for taking a look.

Jonathan



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver
  2025-01-03  5:27     ` Neeraj Kumar
@ 2025-01-15 13:42       ` Jonathan Cameron
  2025-06-19  3:59         ` Yuquan Wang
  0 siblings, 1 reply; 27+ messages in thread
From: Jonathan Cameron @ 2025-01-15 13:42 UTC (permalink / raw)
  To: Neeraj Kumar
  Cc: linux-cxl, linux-mm, linux-perf-users, linux-kernel, linuxarm,
	tongtiangen, Yicong Yang, Niyas Sait, ajayjoshi, Vandana Salve,
	Davidlohr Bueso, Dave Jiang, Alison Schofield, Ira Weiny,
	Dan Williams, Alexander Shishkin, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Gregory Price, Huang Ying,
	Vishak G, Krishna Kanth Reddy, Alok Rathore, gost.dev

On Fri, 3 Jan 2025 10:57:22 +0530
Neeraj Kumar <s.neeraj@samsung.com> wrote:

> On 27/11/24 04:34PM, Jonathan Cameron wrote:
> >On Thu, 21 Nov 2024 10:18:41 +0000
> >Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> >  
> >> The CXL specification release 3.2 is now available under a click through at
> >> https://computeexpresslink.org/cxl-specification/ and it brings new
> >> shiny toys.  
> >
> >If anyone wants to play, basic emulation on my CXL QEMU staging tree
> >https://gitlab.com/jic23/qemu/-/commit/e89b35d264c1bcc04807e7afab1254f35ffc8cb9
> >
> >Branch with a few other things on top is:
> >https://gitlab.com/jic23/qemu/-/commits/cxl-2024-11-27
> >
> >Note that this currently doesn't produce real data.  I have a plan
> >/ initial PoC / hack to hook that up via an addition to the QEMU cache
> >plugin and an external tool to emulate the hotness tracker counting
> >hardware. Will be a little while before I get that finished, so in
> >a meantime the above exercises the driver.
> >
> >Jonathan
> >  
> >>
> >> RFC reason
> >> - Whilst trace capture with a particular configuration is potentially useful
> >>   the intent is that CXL HMU units will be used to drive various forms of
> >>   hotpage migration for memory tiering setups. This driver doesn't do this
> >>   (yet), but rather provides data capture etc for experimentation and
> >>   for working out how to mostly put the allocations in the right place to
> >>   start with by tuning applications.
> >>
> >> CXL r3.2 introduces a CXL Hotness Monitoring Unit definition. The intent
> >> of this is to provide a way to establish which units of memory (typically
> >> pages or larger) in CXL attached memory are hot. The implementation details
> >> and algorithm are all implementation defined. The specification simply
> >> describes the 'interface' which takes the form of ring buffer of hotness
> >> records in a PCI BAR and defined capability, configuration and status
> >> registers.
> >>
> >> The hardware may have constraints on what it can track, granularity etc
> >> and on how accurately it tracks (e.g. counter exhaustion, inaccurate
> >> trackers). Some of these constraints are discoverable from the hardware
> >> registers, others such as loss of accuracy have no universally accepted
> >> measures as they are typically access pattern dependent. Sadly it is
> >> very unlikely any hardware will implement a truly precise tracker given
> >> the large resource requirements for tracking at a useful granularity.
> >>
> >> There are two fundamental operation modes:
> >>
> >> * Epoch based. Counters are checked after a period of time (Epoch) and
> >>   if over a threshold added to the hotlist.
> >> * Always on. Counters run until a threshold is reached, after that the
> >>   hot unit is added to the hotlist and the counter released.
> >>
> >> Counting can be filtered on:
> >>
> >> * Region of CXL DPA space (256MiB per bit in a bitmap).
> >> * Type of access - Trusted and non trusted or non trusted only, R/W/RW
> >>
> >> Sampling can be modified by:
> >>
> >> * Downsampling including potentially randomized downsampling.
> >>
> >> The driver presented here is intended to be useful in its own right but
> >> also to act as the first step of a possible path towards hotness monitoring
> >> based hot page migration. Those steps might look like.
> >>
> >> 1. Gather data - drivers provide telemetry like solutions to get that
> >>    data. May be enhanced, for example in this driver by providing the
> >>    HPA address rather than DPA Unit Address. Userspace can access enough
> >>    information to do this so maybe not.
> >> 2. Userspace algorithm development, possibly combined with userspace
> >>    triggered migration by PA. Working out how to use different levels
> >>    of constrained hardware resources will be challenging.
> >> 3. Move those algorithms in kernel. Will require generalization across
> >>    different hotpage trackers etc.
> >>
> >> So far this driver just gives access to the raw data. I will probably kick
> >> off a longer discussion on how to do adaptive sampling needed to actually
> >> use these units for tiering etc, sometime soon (if no one else beats
> >> me to it).  There is a follow up topic of how to virtualize this stuff
> >> for memory stranding cases (VM gets a fixed mixture of fast and slow
> >> memory and should do its own tiering).
> >>
> >> More details in the Documentation patch but typical commands are:
> >>
> >> $perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\
> >>  hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,\
> >>  range_size=1024,randomized_downsampling=0,downsampling_factor=32,\
> >>  hotness_granual=12  
> 
> Facing issue while executing perf record on x86 emulation environment using following steps
> 
> 1. Tried applying CHMU Patch on branch cxl-for-6.13 using b4 utility. As
> base commit is not specified, with minor change able to apply patch.
> Compiled kernel with CONFIG_CXL_HMU
> 
> 2. Compiled jic23/cxl-2024-11-27 for x86_64-softmmu
> 
> 3. Launched Qemu with following CXL topology along with compiled kernel
> VM="-object memory-backend-ram,id=vmem1,share=on,size=512M \
>      -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
>      -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
>      -device cxl-type3,bus=root_port13,volatile-memdev=vmem1,id=cxl-vmem1 \
>      -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=8k"
> 
> 4. Created region and onlined this memory. Also run top utility on the newly created 
> numa node using numactl -m<node> top
> 
> 5. Compiled and installed perf utility in qemu environment, and able to
> see cxl_hmu_mem* entries in perf list
> 
> root@QEMUCXL2030mm:~# perf list
> <snip>
>   cxl_hmu_mem0.0.0/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
>   cxl_hmu_mem0.0.1/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
>   cxl_hmu_mem0.0.2/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
>   cxl_hmu_mem1.0.0/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
>   cxl_hmu_mem1.0.1/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
>   cxl_hmu_mem1.0.2/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
>   cxl_pmu_mem0.0/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
>   cxl_pmu_mem0.1/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
>   cxl_pmu_mem1.0/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
>   cxl_pmu_mem1.1/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
> <snip>
> 
> 6. Tried running perf command mentioned in Documentation/trace/cxl-hmu.rst
> 
> root@QEMUCXL2030mm:/home/cxl/cxl-linux-mainline/tools/perf# perf -v
> perf version 6.12.rc5.gc198a4f4a356
> root@QEMUCXL2030mm:/home/cxl/cxl-linux-mainline/tools/perf# perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,range_size=1024,randomized_downsampling=0,downsampling_factor=32,hotness_granual=12
> event syntax error: '..ess_granual=12'
>                                   \___ Unrecognized input

This is probably my mistake when cutting and pasting the example from a terminal.
The event needs a trailing / to close its term list, and perf record needs
something to run.

 perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,range_size=1024,randomized_downsampling=0,downsampling_factor=32,hotness_granual=12/ -- sleep 10
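
For anyone scripting this, a minimal Python sketch that assembles the same
command so the closing / cannot be forgotten. The helper name
build_hmu_record_argv is made up for illustration; the PMU name and parameter
values are simply the ones from the example above.

```python
import shlex


def build_hmu_record_argv(pmu, params, workload):
    """Return an argv list for `perf record` with a closed event term list.

    Hypothetical helper, not part of perf or the driver.
    """
    terms = ",".join(f"{key}={value}" for key, value in params.items())
    # The trailing '/' closes the term list; the failing command lacked it.
    event = f"{pmu}/{terms}/"
    # Everything after '--' is the workload that perf record runs.
    return ["perf", "record", "-a", "-e", event, "--", *workload]


params = {
    "epoch_type": 0,
    "access_type": 6,
    "hotness_threshold": 1024,
    "epoch_multiplier": 4,
    "epoch_scale": 4,
    "range_base": 0,
    "range_size": 1024,
    "randomized_downsampling": 0,
    "downsampling_factor": 32,
    "hotness_granual": 12,
}

argv = build_hmu_record_argv("cxl_hmu_mem0.0.0", params, ["sleep", "10"])
print(shlex.join(argv))
```

The argv list can be passed to subprocess.run() as is; shlex.join() is only
used here to print it in a form that can be pasted into a shell.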

Jonathan

> 
> 
> 
> Are there any steps i am missing?
> 
> Regards,
> Neeraj	 
> 
> >>
> >> $perf report --dump-raw-traces
> >>
> >> Example output.  With a counter_width of 16 (0x10) the least significant
> >> 2 bytes (4 hex digits) are the counter value and the unit index is bits 16-63.
> >> Here all units are over the threshold and the indexes are 0,1,2 etc.
> >>
> >> . ... CXL_HMU data: size 33512 bytes
> >> Header 0: units: 29c counter_width 10
> >> Header 1 : deadbeef
> >> 0000000000000283
> >> 0000000000010364
> >> 0000000000020366
> >> 000000000003033c
> >> 0000000000040343
> >> 00000000000502ff
> >> 000000000006030d
> >> 000000000007031a
> >>
> >> Which will produce a list of hotness entries.
> >> Bits[N-1:0] counter value
> >> Bits[63:N] Unit ID (combine with unit size and DPA base + HDM decoder
> >>   config to get to a Host Physical Address)
> >>
> >> Specific RFC questions.
> >> - What should be in the header added to the aux buffer.
> >>   Currently just the minimum is provided. Number of records
> >>   and the counter width needed to decode them.
> >> - Should we reset the counters when doing sampling "-F X"
> >>   If the frequency is higher than the epoch we never see any hot units.
> >>   If so, when should we reset them?
> >>
> >> Note testing has been light and on emulation only. As the perf tool is
> >> a pain to build on a stripped back VM, build testing has all been on
> >> arm64 so far.  The driver loads on both arm64 and x86 though, so
> >> any problems are likely in the perf tool arch specific code,
> >> which is build tested (on the wrong machine).
> >>
> >> The QEMU emulation needs some cleanup, but I should be able to post
> >> that shortly to let people actually play with this.  There are lots
> >> of open questions there on how 'right' we want the emulation to be
> >> and what counting uarch to emulate.
> >>
> >> Jonathan Cameron (4):
> >>   cxl: Register devices for CXL Hotness Monitoring Units (CHMU)
> >>   cxl: Hotness Monitoring Unit via a Perf AUX Buffer.
> >>   perf: Add support for CXL Hotness Monitoring Units (CHMU)
> >>   hwtrace: Document CXL Hotness Monitoring Unit driver
> >>
> >>  Documentation/trace/cxl-hmu.rst     | 197 +++++++
> >>  Documentation/trace/index.rst       |   1 +
> >>  drivers/cxl/Kconfig                 |   6 +
> >>  drivers/cxl/Makefile                |   3 +
> >>  drivers/cxl/core/Makefile           |   1 +
> >>  drivers/cxl/core/core.h             |   1 +
> >>  drivers/cxl/core/hmu.c              |  64 ++
> >>  drivers/cxl/core/port.c             |   2 +
> >>  drivers/cxl/core/regs.c             |  14 +
> >>  drivers/cxl/cxl.h                   |   5 +
> >>  drivers/cxl/cxlpci.h                |   1 +
> >>  drivers/cxl/hmu.c                   | 880 ++++++++++++++++++++++++++++
> >>  drivers/cxl/hmu.h                   |  23 +
> >>  drivers/cxl/pci.c                   |  26 +-
> >>  tools/perf/arch/arm/util/auxtrace.c |  58 ++
> >>  tools/perf/arch/x86/util/auxtrace.c |  76 +++
> >>  tools/perf/util/Build               |   1 +
> >>  tools/perf/util/auxtrace.c          |   4 +
> >>  tools/perf/util/auxtrace.h          |   1 +
> >>  tools/perf/util/cxl-hmu.c           | 367 ++++++++++++
> >>  tools/perf/util/cxl-hmu.h           |  18 +
> >>  21 files changed, 1748 insertions(+), 1 deletion(-)
> >>  create mode 100644 Documentation/trace/cxl-hmu.rst
> >>  create mode 100644 drivers/cxl/core/hmu.c
> >>  create mode 100644 drivers/cxl/hmu.c
> >>  create mode 100644 drivers/cxl/hmu.h
> >>  create mode 100644 tools/perf/util/cxl-hmu.c
> >>  create mode 100644 tools/perf/util/cxl-hmu.h
> >>  
> >  
> 



* Re: [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver
  2024-11-21 10:18 [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver Jonathan Cameron
                   ` (6 preceding siblings ...)
  2024-11-27 16:34 ` Jonathan Cameron
@ 2025-01-24 17:40 ` Jonathan Cameron
  7 siblings, 0 replies; 27+ messages in thread
From: Jonathan Cameron @ 2025-01-24 17:40 UTC (permalink / raw)
  To: linux-cxl, linux-mm, linux-perf-users, linux-kernel, linxuarm
  Cc: tongtiangen, Yicong Yang, Niyas Sait, ajayjoshi, Vandana Salve,
	Davidlohr Bueso, Dave Jiang, Alison Schofield, Ira Weiny,
	Dan Williams, Alexander Shishkin, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Gregory Price, Huang Ying

On Thu, 21 Nov 2024 10:18:41 +0000
Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:

> The CXL specification release 3.2 is now available under a click through at
> https://computeexpresslink.org/cxl-specification/ and it brings new
> shiny toys.

PoC of a QEMU plugin based approach to getting real data:
https://lore.kernel.org/qemu-devel/20250124172905.84099-1-Jonathan.Cameron@huawei.com/

Also available on gitlab.com/jic23/qemu cxl-2025-01-24

I have a minor revision of this driver to post, following a spec clarification on
the units of one parameter. I will do that next week, but for now the PoC ignores
that parameter anyway :)

Hopefully a cleaned up version of the above will provide us with a useful
platform for algorithm and framework development. I've posted it early as
there is a fundamental question for the QEMU maintainers on how they
would prefer the plugin interaction to be done. Once that's resolved it's
just a question of wiring up more parameters and providing additional
options for the actual counting implementation. For now it's an oracle
with 32-bit counters for every page.  Hardware folk tell me that might be
a 'little too expensive'!

It's not even that slow :)

Jonathan
> 
> RFC reason
> - Whilst trace capture with a particular configuration is potentially useful
>   the intent is that CXL HMU units will be used to drive various forms of
>   hotpage migration for memory tiering setups. This driver doesn't do this
>   (yet), but rather provides data capture etc for experimentation and
>   for working out how to mostly put the allocations in the right place to
>   start with by tuning applications.
> 
> CXL r3.2 introduces a CXL Hotness Monitoring Unit definition. The intent
> of this is to provide a way to establish which units of memory (typically
> pages or larger) in CXL attached memory are hot. The implementation details
> and algorithm are all implementation defined. The specification simply
> describes the 'interface' which takes the form of ring buffer of hotness
> records in a PCI BAR and defined capability, configuration and status
> registers.
> 
> The hardware may have constraints on what it can track, granularity etc
> and on how accurately it tracks (e.g. counter exhaustion, inaccurate
> trackers). Some of these constraints are discoverable from the hardware
> registers, others such as loss of accuracy have no universally accepted
> measures as they are typically access pattern dependent. Sadly it is
> very unlikely any hardware will implement a truly precise tracker given
> the large resource requirements for tracking at a useful granularity.
> 
> There are two fundamental operation modes:
> 
> * Epoch based. Counters are checked after a period of time (Epoch) and
>   if over a threshold added to the hotlist.
> * Always on. Counters run until a threshold is reached, after that the
>   hot unit is added to the hotlist and the counter released.
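
To make the difference between the two modes concrete, here is a toy Python
model. It is only a sketch under loud assumptions: unlimited exact counters
(which real hardware is very unlikely to have), no downsampling, and a unit
counted as hot once its counter reaches the threshold. The function names are
made up for illustration and have nothing to do with the driver.

```python
from collections import Counter


def epoch_hotlist(accesses_per_epoch, threshold):
    """Epoch based: count for a whole epoch, then report the hot units.

    accesses_per_epoch is a list of epochs, each a list of unit IDs touched.
    Counters start from zero in every epoch (assumption of this model).
    """
    hotlists = []
    for epoch in accesses_per_epoch:
        counts = Counter(epoch)
        hotlists.append(sorted(u for u, c in counts.items() if c >= threshold))
    return hotlists


def always_on_hotlist(accesses, threshold):
    """Always on: report a unit as soon as its counter reaches the threshold,
    then release the counter so it can be reused."""
    counts = Counter()
    hotlist = []
    for unit in accesses:
        counts[unit] += 1
        if counts[unit] >= threshold:
            hotlist.append(unit)
            del counts[unit]  # counter released
    return hotlist


trace = [0, 1, 0, 1, 0, 1, 1, 2]
# Same trace, split into two epochs of four accesses each.
print(epoch_hotlist([trace[:4], trace[4:]], threshold=3))
print(always_on_hotlist(trace, threshold=3))
```

With this trace neither unit reaches the threshold inside a single epoch, so
the epoch based model reports nothing, while the always on model reports
units 0 and 1 because their counters carry on across the epoch boundary.
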
> 
> Counting can be filtered on:
> 
> * Region of CXL DPA space (256MiB per bit in a bitmap).
> * Type of access - Trusted and non trusted or non trusted only, R/W/RW
> 
> Sampling can be modified by:
> 
> * Downsampling including potentially randomized downsampling.
> 
> The driver presented here is intended to be useful in its own right but
> also to act as the first step of a possible path towards hotness monitoring
> based hot page migration. Those steps might look like.
> 
> 1. Gather data - drivers provide telemetry like solutions to get that
>    data. May be enhanced, for example in this driver by providing the
>    HPA address rather than DPA Unit Address. Userspace can access enough
>    information to do this so maybe not.
> 2. Userspace algorithm development, possibly combined with userspace
>    triggered migration by PA. Working out how to use different levels
>    of constrained hardware resources will be challenging.
> 3. Move those algorithms in kernel. Will require generalization across
>    different hotpage trackers etc.
> 
> So far this driver just gives access to the raw data. I will probably kick
> off a longer discussion on how to do the adaptive sampling needed to actually
> use these units for tiering etc, sometime soon (if no one else beats
> me to it).  There is a follow up topic of how to virtualize this stuff
> for memory stranding cases (VM gets a fixed mixture of fast and slow
> memory and should do its own tiering).
> 
> More details in the Documentation patch but typical commands are:
> 
> $perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\
>  hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,\
>  range_size=1024,randomized_downsampling=0,downsampling_factor=32,\
>  hotness_granual=12
> 
> $perf report --dump-raw-traces
> 
> Example output.  With a counter_width of 16 (0x10) the least significant
> 2 bytes (4 hex digits) are the counter value and the unit index is bits 16-63.
> Here all units are over the threshold and the indexes are 0,1,2 etc.
> 
> . ... CXL_HMU data: size 33512 bytes
> Header 0: units: 29c counter_width 10
> Header 1 : deadbeef
> 0000000000000283
> 0000000000010364
> 0000000000020366
> 000000000003033c
> 0000000000040343
> 00000000000502ff
> 000000000006030d
> 000000000007031a
> 
> Which will produce a list of hotness entries.
> Bits[N-1:0] counter value
> Bits[63:N] Unit ID (combine with unit size and DPA base + HDM decoder
>   config to get to a Host Physical Address)
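
A minimal Python sketch of decoding the records in the dump above, splitting
each 64-bit value at counter_width. The function names are made up for
illustration. The DPA step uses placeholder values for unit size and DPA base
(both are assumptions, not read from any device), and the HDM decoder step
needed to reach an HPA is left out.

```python
def decode_record(record, counter_width):
    """Split a 64-bit hotness record into (unit_id, counter).

    Bits[N-1:0] are the counter, Bits[63:N] the unit ID, N = counter_width.
    """
    counter = record & ((1 << counter_width) - 1)
    unit_id = record >> counter_width
    return unit_id, counter


def unit_to_dpa(unit_id, unit_size, dpa_base):
    """Start of the unit in device physical address space.

    unit_size and dpa_base must come from the HMU configuration; the values
    used below are placeholders only.
    """
    return dpa_base + unit_id * unit_size


# First three records from the example dump, counter_width 0x10.
dump = ["0000000000000283", "0000000000010364", "0000000000020366"]
for line in dump:
    unit_id, counter = decode_record(int(line, 16), 0x10)
    dpa = unit_to_dpa(unit_id, unit_size=4096, dpa_base=0)  # placeholders
    print(f"unit {unit_id} counter {counter:#x} dpa {dpa:#x}")
```
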
> 
> Specific RFC questions.
> - What should be in the header added to the aux buffer.
>   Currently just the minimum is provided. Number of records
>   and the counter width needed to decode them.
> - Should we reset the counters when doing sampling "-F X"
>   If the frequency is higher than the epoch we never see any hot units.
>   If so, when should we reset them?
> 
> Note testing has been light and on emulation only. As the perf tool is
> a pain to build on a stripped back VM, build testing has all been on
> arm64 so far.  The driver loads on both arm64 and x86 though, so
> any problems are likely in the perf tool arch specific code,
> which is build tested (on the wrong machine).
> 
> The QEMU emulation needs some cleanup, but I should be able to post
> that shortly to let people actually play with this.  There are lots
> of open questions there on how 'right' we want the emulation to be
> and what counting uarch to emulate.
> 
> Jonathan Cameron (4):
>   cxl: Register devices for CXL Hotness Monitoring Units (CHMU)
>   cxl: Hotness Monitoring Unit via a Perf AUX Buffer.
>   perf: Add support for CXL Hotness Monitoring Units (CHMU)
>   hwtrace: Document CXL Hotness Monitoring Unit driver
> 
>  Documentation/trace/cxl-hmu.rst     | 197 +++++++
>  Documentation/trace/index.rst       |   1 +
>  drivers/cxl/Kconfig                 |   6 +
>  drivers/cxl/Makefile                |   3 +
>  drivers/cxl/core/Makefile           |   1 +
>  drivers/cxl/core/core.h             |   1 +
>  drivers/cxl/core/hmu.c              |  64 ++
>  drivers/cxl/core/port.c             |   2 +
>  drivers/cxl/core/regs.c             |  14 +
>  drivers/cxl/cxl.h                   |   5 +
>  drivers/cxl/cxlpci.h                |   1 +
>  drivers/cxl/hmu.c                   | 880 ++++++++++++++++++++++++++++
>  drivers/cxl/hmu.h                   |  23 +
>  drivers/cxl/pci.c                   |  26 +-
>  tools/perf/arch/arm/util/auxtrace.c |  58 ++
>  tools/perf/arch/x86/util/auxtrace.c |  76 +++
>  tools/perf/util/Build               |   1 +
>  tools/perf/util/auxtrace.c          |   4 +
>  tools/perf/util/auxtrace.h          |   1 +
>  tools/perf/util/cxl-hmu.c           | 367 ++++++++++++
>  tools/perf/util/cxl-hmu.h           |  18 +
>  21 files changed, 1748 insertions(+), 1 deletion(-)
>  create mode 100644 Documentation/trace/cxl-hmu.rst
>  create mode 100644 drivers/cxl/core/hmu.c
>  create mode 100644 drivers/cxl/hmu.c
>  create mode 100644 drivers/cxl/hmu.h
>  create mode 100644 tools/perf/util/cxl-hmu.c
>  create mode 100644 tools/perf/util/cxl-hmu.h
> 



* Re: [RFC PATCH 1/4] cxl: Register devices for CXL Hotness Monitoring Units (CHMU)
  2024-11-21 10:18 ` [RFC PATCH 1/4] cxl: Register devices for CXL Hotness Monitoring Units (CHMU) Jonathan Cameron
       [not found]   ` <CGME20250103052421epcas5p4a1a917ba5d367dfccec91d4522666ca0@epcas5p4.samsung.com>
@ 2025-06-19  1:47   ` Yuquan Wang
  2025-06-19 10:11     ` Jonathan Cameron
  2025-08-08  8:29   ` wangyuquan
  2025-08-08  8:45   ` Yuquan Wang
  3 siblings, 1 reply; 27+ messages in thread
From: Yuquan Wang @ 2025-06-19  1:47 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: linux-cxl, linux-mm, linux-perf-users, linux-kernel, linuxarm,
	tongtiangen, Yicong Yang, Niyas Sait, ajayjoshi, Vandana Salve,
	Davidlohr Bueso, Dave Jiang, Alison Schofield, Ira Weiny,
	Dan Williams, Alexander Shishkin, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Gregory Price, Huang Ying

On Thu, Nov 21, 2024 at 10:18:42AM +0000, Jonathan Cameron wrote:
> Basic registration using similar approach to how the CPMUs
> are registered.
> 
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
>  drivers/cxl/core/Makefile |  1 +
>  drivers/cxl/core/hmu.c    | 64 +++++++++++++++++++++++++++++++++++++++
>  drivers/cxl/core/regs.c   | 14 +++++++++
>  drivers/cxl/cxl.h         |  4 +++
>  drivers/cxl/cxlpci.h      |  1 +
>  drivers/cxl/hmu.h         | 23 ++++++++++++++
>  drivers/cxl/pci.c         | 26 +++++++++++++++-
>  7 files changed, 132 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> index 9259bcc6773c..d060abb773ae 100644
> --- a/drivers/cxl/core/Makefile
> +++ b/drivers/cxl/core/Makefile
> @@ -12,6 +12,7 @@ cxl_core-y += memdev.o
>  cxl_core-y += mbox.o
>  cxl_core-y += pci.o
>  cxl_core-y += hdm.o
> +cxl_core-y += hmu.o
>  cxl_core-y += pmu.o
>  cxl_core-y += cdat.o
>  cxl_core-$(CONFIG_TRACING) += trace.o
> diff --git a/drivers/cxl/core/hmu.c b/drivers/cxl/core/hmu.c
> new file mode 100644
> index 000000000000..3ee938bb6c05
> --- /dev/null
> +++ b/drivers/cxl/core/hmu.c
> @@ -0,0 +1,64 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright(c) 2024 Huawei. All rights reserved. */
> +
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <linux/idr.h>
> +#include <cxlmem.h>
> +#include <hmu.h>
> +#include <cxl.h>
> +#include "core.h"
> +
> +static void cxl_hmu_release(struct device *dev)
> +{
> +	struct cxl_hmu *hmu = to_cxl_hmu(dev);
> +
> +	kfree(hmu);
> +}
> +
> +const struct device_type cxl_hmu_type = {
> +	.name = "cxl_hmu",
> +	.release = cxl_hmu_release,
> +};
> +
> +static void remove_dev(void *dev)
> +{
> +	device_unregister(dev);
> +}
> +
> +int devm_cxl_hmu_add(struct device *parent, struct cxl_hmu_regs *regs,
> +		     int assoc_id, int index)
> +{
> +	struct cxl_hmu *hmu;
> +	struct device *dev;
> +	int rc;
> +
> +	hmu = kzalloc(sizeof(*hmu), GFP_KERNEL);
> +	if (!hmu)
> +		return -ENOMEM;
> +
> +	hmu->assoc_id = assoc_id;
> +	hmu->index = index;
> +	hmu->base = regs->hmu;
> +	dev = &hmu->dev;
> +	device_initialize(dev);
> +	device_set_pm_not_required(dev);
> +	dev->parent = parent;
> +	dev->bus = &cxl_bus_type;
> +	dev->type = &cxl_hmu_type;
> +	rc = dev_set_name(dev, "hmu_mem%d.%d", assoc_id, index);
> +	if (rc)
> +		goto err;
> +
> +	rc = device_add(dev);
> +	if (rc)
> +		goto err;
> +
> +	return devm_add_action_or_reset(parent, remove_dev, dev);
> +
> +err:
> +	put_device(&hmu->dev);
> +	return rc;
> +}
> +EXPORT_SYMBOL_NS_GPL(devm_cxl_hmu_add, CXL);
> +
> diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
> index e1082e749c69..c12afaa6ef98 100644
> --- a/drivers/cxl/core/regs.c
> +++ b/drivers/cxl/core/regs.c
> @@ -401,6 +401,20 @@ int cxl_map_pmu_regs(struct cxl_register_map *map, struct cxl_pmu_regs *regs)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_map_pmu_regs, CXL);
>  
> +int cxl_map_hmu_regs(struct cxl_register_map *map, struct cxl_hmu_regs *regs)
> +{
> +	struct device *dev = map->host;
> +	resource_size_t phys_addr;
> +
> +	phys_addr = map->resource;
> +	regs->hmu = devm_cxl_iomap_block(dev, phys_addr, map->max_size);

I applied the CHMU patch on a 6.15.0 kernel and tried to boot the virt machine
with one CXL root port and one device (jic23/cxl-2025-06-10). dmesg then shows
"Failed to request region 0x10210000-0x1023ffff". I guess this happens because
'map->max_size' (0x30000) is too large and part of that range has already been
claimed by the CPMU regs. After changing the size to 0x10000, hmu_mem0.0 is
created as normal.

> +	if (!regs->hmu)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_map_hmu_regs, CXL);
> +
>  static int cxl_map_regblock(struct cxl_register_map *map)
>  {
>  	struct device *host = map->host;
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 5406e3ab3d4a..8172bc1f7a8d 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -227,6 +227,9 @@ struct cxl_regs {
>  	struct_group_tagged(cxl_pmu_regs, pmu_regs,
>  		void __iomem *pmu;
>  	);
> +	struct_group_tagged(cxl_hmu_regs, hmu_regs,
> +		void __iomem *hmu;
> +	);
>  
>  	/*
>  	 * RCH downstream port specific RAS register
> @@ -292,6 +295,7 @@ int cxl_map_component_regs(const struct cxl_register_map *map,
>  			   unsigned long map_mask);
>  int cxl_map_device_regs(const struct cxl_register_map *map,
>  			struct cxl_device_regs *regs);
> +int cxl_map_hmu_regs(struct cxl_register_map *map, struct cxl_hmu_regs *regs);
>  int cxl_map_pmu_regs(struct cxl_register_map *map, struct cxl_pmu_regs *regs);
>  
>  enum cxl_regloc_type;
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 4da07727ab9c..71f5e9620137 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -67,6 +67,7 @@ enum cxl_regloc_type {
>  	CXL_REGLOC_RBI_VIRT,
>  	CXL_REGLOC_RBI_MEMDEV,
>  	CXL_REGLOC_RBI_PMU,
> +	CXL_REGLOC_RBI_HMU,
>  	CXL_REGLOC_RBI_TYPES
>  };
>  
> diff --git a/drivers/cxl/hmu.h b/drivers/cxl/hmu.h
> new file mode 100644
> index 000000000000..c4798ed9764b
> --- /dev/null
> +++ b/drivers/cxl/hmu.h
> @@ -0,0 +1,23 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright(c) 2024 Huawei
> + * CXL Specification rev 3.2 Setion 8.2.8 (CHMU Register Interface)
> + */
> +#ifndef CXL_HMU_H
> +#define CXL_HMU_H
> +#include <linux/device.h>
> +
> +#define CXL_HMU_REGMAP_SIZE 0xe00 /* Table 8-32 CXL 3.0 specification */
> +struct cxl_hmu {
> +	struct device dev;
> +	void __iomem *base;
> +	int assoc_id;
> +	int index;
> +};
> +
> +#define to_cxl_hmu(dev) container_of(dev, struct cxl_hmu, dev)
> +struct cxl_hmu_regs;
> +int devm_cxl_hmu_add(struct device *parent, struct cxl_hmu_regs *regs,
> +		     int assoc_id, int idx);
> +
> +#endif
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 188412d45e0d..e89ea9d3f007 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -15,6 +15,7 @@
>  #include "cxlmem.h"
>  #include "cxlpci.h"
>  #include "cxl.h"
> +#include "hmu.h"
>  #include "pmu.h"
>  
>  /**
> @@ -814,7 +815,7 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	struct cxl_dev_state *cxlds;
>  	struct cxl_register_map map;
>  	struct cxl_memdev *cxlmd;
> -	int i, rc, pmu_count;
> +	int i, rc, hmu_count, pmu_count;
>  	bool irq_avail;
>  
>  	/*
> @@ -938,6 +939,29 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  		}
>  	}
>  
> +	hmu_count = cxl_count_regblock(pdev, CXL_REGLOC_RBI_HMU);
> +	for (i = 0; i < hmu_count; i++) {
> +		struct cxl_hmu_regs hmu_regs;
> +
> +		rc = cxl_find_regblock_instance(pdev, CXL_REGLOC_RBI_HMU, &map, i);
> +		if (rc) {
> +			dev_dbg(&pdev->dev, "Could not find HMU regblock\n");
> +			break;
> +		}
> +
> +		rc = cxl_map_hmu_regs(&map, &hmu_regs);
> +		if (rc) {
> +			dev_dbg(&pdev->dev, "Could not map HMU regs\n");
> +			break;
> +		}
> +
> +		rc = devm_cxl_hmu_add(cxlds->dev, &hmu_regs, cxlmd->id, i);
> +		if (rc) {
> +			dev_dbg(&pdev->dev, "Could not add HMU instance\n");
> +			break;
> +		}
> +	}
> +
>  	rc = cxl_event_config(host_bridge, mds, irq_avail);
>  	if (rc)
>  		return rc;
> -- 
> 2.43.0
> 



* Re: [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver
  2025-01-15 13:42       ` Jonathan Cameron
@ 2025-06-19  3:59         ` Yuquan Wang
  2025-06-19 10:49           ` Jonathan Cameron
  0 siblings, 1 reply; 27+ messages in thread
From: Yuquan Wang @ 2025-06-19  3:59 UTC (permalink / raw)
  To: Jonathan Cameron
  Cc: Neeraj Kumar, linux-cxl, linux-mm, linux-perf-users, linux-kernel,
	linuxarm, tongtiangen, Yicong Yang, Niyas Sait, ajayjoshi,
	Vandana Salve, Davidlohr Bueso, Dave Jiang, Alison Schofield,
	Ira Weiny, Dan Williams, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Gregory Price, Huang Ying, Vishak G, Krishna Kanth Reddy,
	Alok Rathore, gost.dev

On Wed, Jan 15, 2025 at 01:42:07PM +0000, Jonathan Cameron wrote:
> On Fri, 3 Jan 2025 10:57:22 +0530
> Neeraj Kumar <s.neeraj@samsung.com> wrote:
> 
> > On 27/11/24 04:34PM, Jonathan Cameron wrote:
> > >On Thu, 21 Nov 2024 10:18:41 +0000
> > >Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > >  
> > >> The CXL specification release 3.2 is now available under a click through at
> > >> https://computeexpresslink.org/cxl-specification/ and it brings new
> > >> shiny toys.  
> > >
> > >If anyone wants to play, basic emulation on my CXL QEMU staging tree
> > >https://gitlab.com/jic23/qemu/-/commit/e89b35d264c1bcc04807e7afab1254f35ffc8cb9
> > >
> > >Branch with a few other things on top is:
> > >https://gitlab.com/jic23/qemu/-/commits/cxl-2024-11-27
> > >
> > >Note that this currently doesn't produce real data.  I have a plan
> > >/ initial PoC / hack to hook that up via an addition to the QEMU cache
> > >plugin and an external tool to emulate the hotness tracker counting
> > >hardware. Will be a little while before I get that finished, so in
> > >a meantime the above exercises the driver.
> > >
> > >Jonathan
> > >  
> > >>
> > >> RFC reason
> > >> - Whilst trace capture with a particular configuration is potentially useful
> > >>   the intent is that CXL HMU units will be used to drive various forms of
> > >>   hotpage migration for memory tiering setups. This driver doesn't do this
> > >>   (yet), but rather provides data capture etc for experimentation and
> > >>   for working out how to mostly put the allocations in the right place to
> > >>   start with by tuning applications.
> > >>
> > >> CXL r3.2 introduces a CXL Hotness Monitoring Unit definition. The intent
> > >> of this is to provide a way to establish which units of memory (typically
> > >> pages or larger) in CXL attached memory are hot. The implementation details
> > >> and algorithm are all implementation defined. The specification simply
> > >> describes the 'interface' which takes the form of ring buffer of hotness
> > >> records in a PCI BAR and defined capability, configuration and status
> > >> registers.
> > >>
> > >> The hardware may have constraints on what it can track, granularity etc
> > >> and on how accurately it tracks (e.g. counter exhaustion, inaccurate
> > >> trackers). Some of these constraints are discoverable from the hardware
> > >> registers, others such as loss of accuracy have no universally accepted
> > >> measures as they are typically access pattern dependent. Sadly it is
> > >> very unlikely any hardware will implement a truly precise tracker given
> > >> the large resource requirements for tracking at a useful granularity.
> > >>
> > >> There are two fundamental operation modes:
> > >>
> > >> * Epoch based. Counters are checked after a period of time (Epoch) and
> > >>   if over a threshold added to the hotlist.
> > >> * Always on. Counters run until a threshold is reached, after that the
> > >>   hot unit is added to the hotlist and the counter released.
> > >>
> > >> Counting can be filtered on:
> > >>
> > >> * Region of CXL DPA space (256MiB per bit in a bitmap).
> > >> * Type of access - Trusted and non trusted or non trusted only, R/W/RW
> > >>
> > >> Sampling can be modified by:
> > >>
> > >> * Downsampling including potentially randomized downsampling.
> > >>
> > >> The driver presented here is intended to be useful in its own right but
> > >> also to act as the first step of a possible path towards hotness monitoring
> > >> based hot page migration. Those steps might look like.
> > >>
> > >> 1. Gather data - drivers provide telemetry like solutions to get that
> > >>    data. May be enhanced, for example in this driver by providing the
> > >>    HPA address rather than DPA Unit Address. Userspace can access enough
> > >>    information to do this so maybe not.
> > >> 2. Userspace algorithm development, possibly combined with userspace
> > >>    triggered migration by PA. Working out how to use different levels
> > >>    of constrained hardware resources will be challenging.
> > >> 3. Move those algorithms in kernel. Will require generalization across
> > >>    different hotpage trackers etc.
> > >>
> > >> So far this driver just gives access to the raw data. I will probably kick
> > >> off a longer discussion on how to do the adaptive sampling needed to actually
> > >> use these units for tiering etc, sometime soon (if no one else beats
> > >> me to it).  There is a follow up topic of how to virtualize this stuff
> > >> for memory stranding cases (VM gets a fixed mixture of fast and slow
> > >> memory and should do its own tiering).
> > >>
> > >> More details in the Documentation patch but typical commands are:
> > >>
> > >> $perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\
> > >>  hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,\
> > >>  range_size=1024,randomized_downsampling=0,downsampling_factor=32,\
> > >>  hotness_granual=12  
> > 
> > Facing issue while executing perf record on x86 emulation environment using following steps
> > 
> > 1. Tried applying CHMU Patch on branch cxl-for-6.13 using b4 utility. As
> > base commit is not specified, with minor change able to apply patch.
> > Compiled kernel with CONFIG_CXL_HMU
> > 
> > 2. Compiled jic23/cxl-2024-11-27 for x86_64-softmmu
> > 
> > 3. Launched Qemu with following CXL topology along with compiled kernel
> > VM="-object memory-backend-ram,id=vmem1,share=on,size=512M \
> >      -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> >      -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> >      -device cxl-type3,bus=root_port13,volatile-memdev=vmem1,id=cxl-vmem1 \
> >      -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=8k"
> > 
> > 4. Created region and onlined this memory. Also run top utility on the newly created 
> > numa node using numactl -m<node> top
> > 
> > 5. Compiled and installed perf utility in qemu environment, and able to
> > see cxl_hmu_mem* entries in perf list
> > 
> > root@QEMUCXL2030mm:~# perf list
> > <snip>
> >   cxl_hmu_mem0.0.0/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
> >   cxl_hmu_mem0.0.1/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
> >   cxl_hmu_mem0.0.2/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
> >   cxl_hmu_mem1.0.0/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
> >   cxl_hmu_mem1.0.1/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
> >   cxl_hmu_mem1.0.2/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
> >   cxl_pmu_mem0.0/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
> >   cxl_pmu_mem0.1/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
> >   cxl_pmu_mem1.0/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
> >   cxl_pmu_mem1.1/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
> > <snip>
> > 
> > 6. Tried running perf command mentioned in Documentation/trace/cxl-hmu.rst
> > 
> > root@QEMUCXL2030mm:/home/cxl/cxl-linux-mainline/tools/perf# perf -v
> > perf version 6.12.rc5.gc198a4f4a356
> > root@QEMUCXL2030mm:/home/cxl/cxl-linux-mainline/tools/perf# perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,range_size=1024,randomized_downsampling=0,downsampling_factor=32,hotness_granual=12
> > event syntax error: '..ess_granual=12'
> >                                   \___ Unrecognized input
> 
> This is probably my mistake when cutting and pasting the example from a terminal.
> Add a trailing / and something to run.
> 
>  perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,range_size=1024,randomized_downsampling=0,downsampling_factor=32,hotness_granual=12/ -- sleep 10

Hi Jonathan,

I tried this new command but perf reports an error.

With the HMU iomap block size changed as described in [1], my steps were as follows:

step1: Create cxl region and online the numa node

root@ubuntu-jammy-arm64:~/tools# numactl -H
available: 3 nodes (0-2)
node 0 cpus: 0 1
node 0 size: 1972 MB
node 0 free: 1694 MB
node 1 cpus: 2 3
node 1 size: 1942 MB
node 1 free: 1690 MB
node 2 cpus:
node 2 size: 256 MB
node 2 free: 256 MB
node distances:
node   0   1   2
  0:  10  20  20
  1:  20  10  20
  2:  20  20  10

step2: Run 'ls' bound to this numa node

root@ubuntu-jammy-arm64:~/tools# numactl -m 2 ls
build   perf.data  ndctl  perf     

root@ubuntu-jammy-arm64:~/tools# numastat 
                           node0           node1           node2
numa_hit                  109323          141170              77
numa_miss                      0               0               0
numa_foreign                   0               0               0
interleave_hit               519             591               0
local_node                108810          139591               0
other_node                   513            1579              77

step3: Use perf tool

root@ubuntu-jammy-arm64:~/tools# perf -v
perf version 6.15.rc5.g2c3e6f60f5cf

root@ubuntu-jammy-arm64:~/tools# perf list | grep -i hmu
  cxl_hmu_mem0.0.0/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw event descriptor]

root@ubuntu-jammy-arm64:~/tools# perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoperf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,range_size=1024,randomized_downsampling=0,downsampling_factor=32,hotness_granual=12/ -- sleep 10

Error:
cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,range_size=1024,randomized_downsampling=0,downsampling_factor=32,hotness_granual=12/H: PMU Hardware doesn't support sampling/overflow-interrupts. Try 'perf stat'

[1]:https://lore.kernel.org/linux-cxl/aFNsFI5OKrD0CWR3@phytium.com.cn/T/#u

Is something wrong with the CHMU interrupts?

> 
> Jonathan
> 
> > 
> > 
> > 
> > Are there any steps i am missing?
> > 
> > Regards,
> > Neeraj	 
> > 
> > >>
> > >> $perf report --dump-raw-traces
> > >>
> > >> Example output.  With a counter_width of 16 (0x10) the least significant
> > >> 2 bytes (4 hex digits) are the counter value and the unit index is bits 16-63.
> > >> Here all units are over the threshold and the indexes are 0,1,2 etc.
> > >>
> > >> . ... CXL_HMU data: size 33512 bytes
> > >> Header 0: units: 29c counter_width 10
> > >> Header 1 : deadbeef
> > >> 0000000000000283
> > >> 0000000000010364
> > >> 0000000000020366
> > >> 000000000003033c
> > >> 0000000000040343
> > >> 00000000000502ff
> > >> 000000000006030d
> > >> 000000000007031a
> > >>
> > >> Which will produce a list of hotness entries.
> > >> Bits[N-1:0] counter value
> > >> Bits[63:N] Unit ID (combine with unit size and DPA base + HDM decoder
> > >>   config to get to a Host Physical Address)
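
A minimal sketch of decoding one record under the layout described above,
assuming N is the counter_width from the header. The function names are made
up for illustration and are not taken from the perf tool patch.

```c
#include <assert.h>
#include <stdint.h>

/* Bits[N-1:0] of a hotness record: the counter value. */
uint64_t chmu_record_count(uint64_t record, unsigned int counter_width)
{
	assert(counter_width > 0 && counter_width < 64);
	return record & ((UINT64_C(1) << counter_width) - 1);
}

/* Bits[63:N] of a hotness record: the unit index. */
uint64_t chmu_record_unit(uint64_t record, unsigned int counter_width)
{
	assert(counter_width > 0 && counter_width < 64);
	return record >> counter_width;
}
```

For the second record in the example output, 0000000000010364 with
counter_width 0x10, this gives unit 1 and a counter value of 0x364.
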
> > >>
> > >> Specific RFC questions.
> > >> - What should be in the header added to the aux buffer.
> > >>   Currently just the minimum is provided. Number of records
> > >>   and the counter width needed to decode them.
> > >> - Should we reset the counters when doing sampling "-F X"
> > >>   If the frequency is higher than the epoch we never see any hot units.
> > >>   If so, when should we reset them?
> > >>
> > >> Note testing has been light and on emulation only + as perf tool is
> > >> a pain to build on a stripped back VM, build testing has all been on
> > >> arm64 so far.  The driver loads though on both arm64 and x86 so
> > >> any problems are likely in the perf tool arch specific code
> > >> which is build tested (on wrong machine)
> > >>
> > >> The QEMU emulation needs some cleanup, but I should be able to post
> > >> that shortly to let people actually play with this.  There are lots
> > >> of open questions there on how 'right' we want the emulation to be
> > >> and what counting uarch to emulate.
> > >>
> > >> Jonathan Cameron (4):
> > >>   cxl: Register devices for CXL Hotness Monitoring Units (CHMU)
> > >>   cxl: Hotness Monitoring Unit via a Perf AUX Buffer.
> > >>   perf: Add support for CXL Hotness Monitoring Units (CHMU)
> > >>   hwtrace: Document CXL Hotness Monitoring Unit driver
> > >>
> > >>  Documentation/trace/cxl-hmu.rst     | 197 +++++++
> > >>  Documentation/trace/index.rst       |   1 +
> > >>  drivers/cxl/Kconfig                 |   6 +
> > >>  drivers/cxl/Makefile                |   3 +
> > >>  drivers/cxl/core/Makefile           |   1 +
> > >>  drivers/cxl/core/core.h             |   1 +
> > >>  drivers/cxl/core/hmu.c              |  64 ++
> > >>  drivers/cxl/core/port.c             |   2 +
> > >>  drivers/cxl/core/regs.c             |  14 +
> > >>  drivers/cxl/cxl.h                   |   5 +
> > >>  drivers/cxl/cxlpci.h                |   1 +
> > >>  drivers/cxl/hmu.c                   | 880 ++++++++++++++++++++++++++++
> > >>  drivers/cxl/hmu.h                   |  23 +
> > >>  drivers/cxl/pci.c                   |  26 +-
> > >>  tools/perf/arch/arm/util/auxtrace.c |  58 ++
> > >>  tools/perf/arch/x86/util/auxtrace.c |  76 +++
> > >>  tools/perf/util/Build               |   1 +
> > >>  tools/perf/util/auxtrace.c          |   4 +
> > >>  tools/perf/util/auxtrace.h          |   1 +
> > >>  tools/perf/util/cxl-hmu.c           | 367 ++++++++++++
> > >>  tools/perf/util/cxl-hmu.h           |  18 +
> > >>  21 files changed, 1748 insertions(+), 1 deletion(-)
> > >>  create mode 100644 Documentation/trace/cxl-hmu.rst
> > >>  create mode 100644 drivers/cxl/core/hmu.c
> > >>  create mode 100644 drivers/cxl/hmu.c
> > >>  create mode 100644 drivers/cxl/hmu.h
> > >>  create mode 100644 tools/perf/util/cxl-hmu.c
> > >>  create mode 100644 tools/perf/util/cxl-hmu.h
> > >>  
> > >  
> > 
> 



* Re: [RFC PATCH 1/4] cxl: Register devices for CXL Hotness Monitoring Units (CHMU)
  2025-06-19  1:47   ` Yuquan Wang
@ 2025-06-19 10:11     ` Jonathan Cameron
  0 siblings, 0 replies; 27+ messages in thread
From: Jonathan Cameron @ 2025-06-19 10:11 UTC (permalink / raw)
  To: Yuquan Wang
  Cc: linux-cxl, linux-mm, linux-perf-users, linux-kernel, linuxarm,
	tongtiangen, Yicong Yang, Niyas Sait, ajayjoshi, Vandana Salve,
	Davidlohr Bueso, Dave Jiang, Alison Schofield, Ira Weiny,
	Dan Williams, Alexander Shishkin, Peter Zijlstra, Ingo Molnar,
	Arnaldo Carvalho de Melo, Mark Rutland, Gregory Price, Huang Ying

On Thu, 19 Jun 2025 09:47:00 +0800
Yuquan Wang <wangyuquan1236@phytium.com.cn> wrote:

> On Thu, Nov 21, 2024 at 10:18:42AM +0000, Jonathan Cameron wrote:
> > Basic registration using similar approach to how the CPMUs
> > are registered.
> > 
> > Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> > ---
> >  drivers/cxl/core/Makefile |  1 +
> >  drivers/cxl/core/hmu.c    | 64 +++++++++++++++++++++++++++++++++++++++
> >  drivers/cxl/core/regs.c   | 14 +++++++++
> >  drivers/cxl/cxl.h         |  4 +++
> >  drivers/cxl/cxlpci.h      |  1 +
> >  drivers/cxl/hmu.h         | 23 ++++++++++++++
> >  drivers/cxl/pci.c         | 26 +++++++++++++++-
> >  7 files changed, 132 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> > index 9259bcc6773c..d060abb773ae 100644
> > --- a/drivers/cxl/core/Makefile
> > +++ b/drivers/cxl/core/Makefile
> > @@ -12,6 +12,7 @@ cxl_core-y += memdev.o
> >  cxl_core-y += mbox.o
> >  cxl_core-y += pci.o
> >  cxl_core-y += hdm.o
> > +cxl_core-y += hmu.o
> >  cxl_core-y += pmu.o
> >  cxl_core-y += cdat.o
> >  cxl_core-$(CONFIG_TRACING) += trace.o
> > diff --git a/drivers/cxl/core/hmu.c b/drivers/cxl/core/hmu.c
> > new file mode 100644
> > index 000000000000..3ee938bb6c05
> > --- /dev/null
> > +++ b/drivers/cxl/core/hmu.c
> > @@ -0,0 +1,64 @@
> > +// SPDX-License-Identifier: GPL-2.0-only
> > +/* Copyright(c) 2024 Huawei. All rights reserved. */
> > +
> > +#include <linux/device.h>
> > +#include <linux/slab.h>
> > +#include <linux/idr.h>
> > +#include <cxlmem.h>
> > +#include <hmu.h>
> > +#include <cxl.h>
> > +#include "core.h"
> > +
> > +static void cxl_hmu_release(struct device *dev)
> > +{
> > +	struct cxl_hmu *hmu = to_cxl_hmu(dev);
> > +
> > +	kfree(hmu);
> > +}
> > +
> > +const struct device_type cxl_hmu_type = {
> > +	.name = "cxl_hmu",
> > +	.release = cxl_hmu_release,
> > +};
> > +
> > +static void remove_dev(void *dev)
> > +{
> > +	device_unregister(dev);
> > +}
> > +
> > +int devm_cxl_hmu_add(struct device *parent, struct cxl_hmu_regs *regs,
> > +		     int assoc_id, int index)
> > +{
> > +	struct cxl_hmu *hmu;
> > +	struct device *dev;
> > +	int rc;
> > +
> > +	hmu = kzalloc(sizeof(*hmu), GFP_KERNEL);
> > +	if (!hmu)
> > +		return -ENOMEM;
> > +
> > +	hmu->assoc_id = assoc_id;
> > +	hmu->index = index;
> > +	hmu->base = regs->hmu;
> > +	dev = &hmu->dev;
> > +	device_initialize(dev);
> > +	device_set_pm_not_required(dev);
> > +	dev->parent = parent;
> > +	dev->bus = &cxl_bus_type;
> > +	dev->type = &cxl_hmu_type;
> > +	rc = dev_set_name(dev, "hmu_mem%d.%d", assoc_id, index);
> > +	if (rc)
> > +		goto err;
> > +
> > +	rc = device_add(dev);
> > +	if (rc)
> > +		goto err;
> > +
> > +	return devm_add_action_or_reset(parent, remove_dev, dev);
> > +
> > +err:
> > +	put_device(&hmu->dev);
> > +	return rc;
> > +}
> > +EXPORT_SYMBOL_NS_GPL(devm_cxl_hmu_add, CXL);
> > +
> > diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
> > index e1082e749c69..c12afaa6ef98 100644
> > --- a/drivers/cxl/core/regs.c
> > +++ b/drivers/cxl/core/regs.c
> > @@ -401,6 +401,20 @@ int cxl_map_pmu_regs(struct cxl_register_map *map, struct cxl_pmu_regs *regs)
> >  }
> >  EXPORT_SYMBOL_NS_GPL(cxl_map_pmu_regs, CXL);
> >  
> > +int cxl_map_hmu_regs(struct cxl_register_map *map, struct cxl_hmu_regs *regs)
> > +{
> > +	struct device *dev = map->host;
> > +	resource_size_t phys_addr;
> > +
> > +	phys_addr = map->resource;
> > +	regs->hmu = devm_cxl_iomap_block(dev, phys_addr, map->max_size);  
> I applied CHMU patch on 6.15.0 kernel and I tried to boot the virt with
> one cxl root port and one device (jic23/cxl-2025-06-10), then the dmesg shows
> "Failed to request region 0x10210000-0x1023ffff". I guess it is caused by the
> 'map->max_size'(0x30000) is large and the resource has been allocated by CPMU regs.
> I tried to change it to 0x10000, the hmu_mem0.0 could be created as normal.

Ah. I was meaning to post an updated version of this series just to fix the
bug you've hit here but forgot to do so! Sorry about that.

I need to figure out if there is a more elegant way to do this but in the meantime
here is what I'm carrying on top of the posted series. It's a fairly horrible
bit of layering as we need the generic code to poke around inside a temporary
mapping just to get the size that it should iomap.

From fadffb32ed302dfb6dec4497214e1d3b39450f8b Mon Sep 17 00:00:00 2001
From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Date: Thu, 17 Apr 2025 11:02:07 +0100
Subject: [PATCH] Fix up sizing of CHMU issue

---
 drivers/cxl/core/regs.c | 16 ++++++++++++++--
 drivers/cxl/cxlpci.h    |  7 +++++++
 drivers/cxl/hmu.c       |  6 ------
 3 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
index 925870c4e494..5ae31696bf8b 100644
--- a/drivers/cxl/core/regs.c
+++ b/drivers/cxl/core/regs.c
@@ -416,10 +416,22 @@ EXPORT_SYMBOL_NS_GPL(cxl_map_pmu_regs, "CXL");
 int cxl_map_hmu_regs(struct cxl_register_map *map, struct cxl_hmu_regs *regs)
 {
 	struct device *dev = map->host;
-	resource_size_t phys_addr;
+	u64 __iomem *poke;
+	u64 common_cap[2];
+	resource_size_t phys_addr, phys_size;
 
 	phys_addr = map->resource;
-	regs->hmu = devm_cxl_iomap_block(dev, phys_addr, map->max_size);
+	/* Finding out the size of a CHMU means poking around inside */
+	poke = ioremap(phys_addr, sizeof(common_cap));
+	if (!poke) {
+		return -ENOMEM;
+	}
+	common_cap[0] = le64_to_cpu(readq(poke));
+	common_cap[1] = le64_to_cpu(readq(poke + 1));
+	iounmap(poke);
+	phys_size = FIELD_GET(CHMU_COMMON_CAP0_NUMINST_MSK, common_cap[0]) *
+		FIELD_GET(CHMU_COMMON_CAP1_INSTLEN_MSK, common_cap[1]) + 0x10;
+	regs->hmu = devm_cxl_iomap_block(dev, phys_addr, phys_size);
 	if (!regs->hmu)
 		return -ENOMEM;
 
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index f7b902eab288..a91407292aea 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -77,6 +77,13 @@ enum cxl_regloc_type {
 	CXL_REGLOC_RBI_TYPES
 };
 
+/* A few CHMU registers are needed to establish size */
+#define CHMU_COMMON_CAP0_REG				0x00
+#define   CHMU_COMMON_CAP0_VER_MSK			GENMASK(3, 0)
+#define   CHMU_COMMON_CAP0_NUMINST_MSK			GENMASK(15, 8)
+#define CHMU_COMMON_CAP1_REG				0x08
+#define   CHMU_COMMON_CAP1_INSTLEN_MSK			GENMASK(15, 0)
+
 /*
  * Table Access DOE, CDAT Read Entry Response
  *
diff --git a/drivers/cxl/hmu.c b/drivers/cxl/hmu.c
index 1a7a0f60a6ad..a1953e8750c8 100644
--- a/drivers/cxl/hmu.c
+++ b/drivers/cxl/hmu.c
@@ -27,12 +27,6 @@
 #include "cxl.h"
 #include "hmu.h"
 
-#define CHMU_COMMON_CAP0_REG				0x00
-#define   CHMU_COMMON_CAP0_VER_MSK			GENMASK(3, 0)
-#define   CHMU_COMMON_CAP0_NUMINST_MSK			GENMASK(15, 8)
-#define CHMU_COMMON_CAP1_REG				0x08
-#define   CHMU_COMMON_CAP1_INSTLEN_MSK			GENMASK(15, 0)
-
 /* Register offsets within instance */
 #define CHMU_INST0_CAP0_REG				0x00
 #define   CHMU_INST0_CAP0_MSI_N_MSK			GENMASK(3, 0)
-- 
2.48.1
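
The size calculation in the patch above can be sanity checked in userspace.
The sketch below is only an illustration: the field positions are copied from
the CHMU_COMMON_CAP* masks in the patch, the 0x10 is taken to be the two 64-bit
common capability registers (an assumption based on the common_cap[2] read),
and the function names and register values are made up.

```c
#include <assert.h>
#include <stdint.h>

/* Field positions copied from the CHMU_COMMON_CAP* masks in the patch. */
#define NUMINST_SHIFT		8
#define NUMINST_MASK		0xffULL		/* CAP0 bits 15:8 */
#define INSTLEN_MASK		0xffffULL	/* CAP1 bits 15:0 */
/* Assumption: 0x10 covers the two 64-bit common capability registers. */
#define COMMON_CAP_SIZE		0x10ULL

/* Hypothetical helper: number of bytes to iomap for one CHMU block. */
uint64_t chmu_block_size(uint64_t cap0, uint64_t cap1)
{
	uint64_t num_inst = (cap0 >> NUMINST_SHIFT) & NUMINST_MASK;
	uint64_t inst_len = cap1 & INSTLEN_MASK;

	return num_inst * inst_len + COMMON_CAP_SIZE;
}

/* Made-up register values: version 1, 3 instances, 0x70 bytes each. */
void chmu_block_size_example(void)
{
	assert(chmu_block_size(0x0301, 0x70) == 0x160);
}
```

With those made-up values the block is 0x160 bytes, far smaller than the
0x30000 max_size that collided with the CPMU registers.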



> > +	if (!regs->hmu)
> > +		return -ENOMEM;
> > +
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_NS_GPL(cxl_map_hmu_regs, CXL);
> > +
> >  static int cxl_map_regblock(struct cxl_register_map *map)
> >  {
> >  	struct device *host = map->host;
> > diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> > index 5406e3ab3d4a..8172bc1f7a8d 100644
> > --- a/drivers/cxl/cxl.h
> > +++ b/drivers/cxl/cxl.h
> > @@ -227,6 +227,9 @@ struct cxl_regs {
> >  	struct_group_tagged(cxl_pmu_regs, pmu_regs,
> >  		void __iomem *pmu;
> >  	);
> > +	struct_group_tagged(cxl_hmu_regs, hmu_regs,
> > +		void __iomem *hmu;
> > +	);
> >  
> >  	/*
> >  	 * RCH downstream port specific RAS register
> > @@ -292,6 +295,7 @@ int cxl_map_component_regs(const struct cxl_register_map *map,
> >  			   unsigned long map_mask);
> >  int cxl_map_device_regs(const struct cxl_register_map *map,
> >  			struct cxl_device_regs *regs);
> > +int cxl_map_hmu_regs(struct cxl_register_map *map, struct cxl_hmu_regs *regs);
> >  int cxl_map_pmu_regs(struct cxl_register_map *map, struct cxl_pmu_regs *regs);
> >  
> >  enum cxl_regloc_type;
> > diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> > index 4da07727ab9c..71f5e9620137 100644
> > --- a/drivers/cxl/cxlpci.h
> > +++ b/drivers/cxl/cxlpci.h
> > @@ -67,6 +67,7 @@ enum cxl_regloc_type {
> >  	CXL_REGLOC_RBI_VIRT,
> >  	CXL_REGLOC_RBI_MEMDEV,
> >  	CXL_REGLOC_RBI_PMU,
> > +	CXL_REGLOC_RBI_HMU,
> >  	CXL_REGLOC_RBI_TYPES
> >  };
> >  
> > diff --git a/drivers/cxl/hmu.h b/drivers/cxl/hmu.h
> > new file mode 100644
> > index 000000000000..c4798ed9764b
> > --- /dev/null
> > +++ b/drivers/cxl/hmu.h
> > @@ -0,0 +1,23 @@
> > +/* SPDX-License-Identifier: GPL-2.0-only */
> > +/*
> > + * Copyright(c) 2024 Huawei
> > + * CXL Specification rev 3.2 Section 8.2.8 (CHMU Register Interface)
> > + */
> > +#ifndef CXL_HMU_H
> > +#define CXL_HMU_H
> > +#include <linux/device.h>
> > +
> > +#define CXL_HMU_REGMAP_SIZE 0xe00 /* Table 8-32 CXL 3.0 specification */
> > +struct cxl_hmu {
> > +	struct device dev;
> > +	void __iomem *base;
> > +	int assoc_id;
> > +	int index;
> > +};
> > +
> > +#define to_cxl_hmu(dev) container_of(dev, struct cxl_hmu, dev)
> > +struct cxl_hmu_regs;
> > +int devm_cxl_hmu_add(struct device *parent, struct cxl_hmu_regs *regs,
> > +		     int assoc_id, int idx);
> > +
> > +#endif
> > diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> > index 188412d45e0d..e89ea9d3f007 100644
> > --- a/drivers/cxl/pci.c
> > +++ b/drivers/cxl/pci.c
> > @@ -15,6 +15,7 @@
> >  #include "cxlmem.h"
> >  #include "cxlpci.h"
> >  #include "cxl.h"
> > +#include "hmu.h"
> >  #include "pmu.h"
> >  
> >  /**
> > @@ -814,7 +815,7 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >  	struct cxl_dev_state *cxlds;
> >  	struct cxl_register_map map;
> >  	struct cxl_memdev *cxlmd;
> > -	int i, rc, pmu_count;
> > +	int i, rc, hmu_count, pmu_count;
> >  	bool irq_avail;
> >  
> >  	/*
> > @@ -938,6 +939,29 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
> >  		}
> >  	}
> >  
> > +	hmu_count = cxl_count_regblock(pdev, CXL_REGLOC_RBI_HMU);
> > +	for (i = 0; i < hmu_count; i++) {
> > +		struct cxl_hmu_regs hmu_regs;
> > +
> > +		rc = cxl_find_regblock_instance(pdev, CXL_REGLOC_RBI_HMU, &map, i);
> > +		if (rc) {
> > +			dev_dbg(&pdev->dev, "Could not find HMU regblock\n");
> > +			break;
> > +		}
> > +
> > +		rc = cxl_map_hmu_regs(&map, &hmu_regs);
> > +		if (rc) {
> > +			dev_dbg(&pdev->dev, "Could not map HMU regs\n");
> > +			break;
> > +		}
> > +
> > +		rc = devm_cxl_hmu_add(cxlds->dev, &hmu_regs, cxlmd->id, i);
> > +		if (rc) {
> > +			dev_dbg(&pdev->dev, "Could not add HMU instance\n");
> > +			break;
> > +		}
> > +	}
> > +
> >  	rc = cxl_event_config(host_bridge, mds, irq_avail);
> >  	if (rc)
> >  		return rc;
> > -- 
> > 2.43.0
> >   
> 
> 



* Re: [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver
  2025-06-19  3:59         ` Yuquan Wang
@ 2025-06-19 10:49           ` Jonathan Cameron
  0 siblings, 0 replies; 27+ messages in thread
From: Jonathan Cameron @ 2025-06-19 10:49 UTC (permalink / raw)
  To: Yuquan Wang
  Cc: Neeraj Kumar, linux-cxl, linux-mm, linux-perf-users, linux-kernel,
	linuxarm, tongtiangen, Yicong Yang, Niyas Sait, ajayjoshi,
	Vandana Salve, Davidlohr Bueso, Dave Jiang, Alison Schofield,
	Ira Weiny, Dan Williams, Alexander Shishkin, Peter Zijlstra,
	Ingo Molnar, Arnaldo Carvalho de Melo, Mark Rutland,
	Gregory Price, Huang Ying, Vishak G, Krishna Kanth Reddy,
	Alok Rathore, gost.dev

On Thu, 19 Jun 2025 11:59:28 +0800
Yuquan Wang <wangyuquan1236@phytium.com.cn> wrote:

> On Wed, Jan 15, 2025 at 01:42:07PM +0000, Jonathan Cameron wrote:
> > On Fri, 3 Jan 2025 10:57:22 +0530
> > Neeraj Kumar <s.neeraj@samsung.com> wrote:
> >   
> > > On 27/11/24 04:34PM, Jonathan Cameron wrote:  
> > > >On Thu, 21 Nov 2024 10:18:41 +0000
> > > >Jonathan Cameron <Jonathan.Cameron@huawei.com> wrote:
> > > >    
> > > >> The CXL specification release 3.2 is now available under a click through at
> > > >> https://computeexpresslink.org/cxl-specification/ and it brings new
> > > >> shiny toys.    
> > > >
> > > >If anyone wants to play, basic emulation on my CXL QEMU staging tree
> > > >https://gitlab.com/jic23/qemu/-/commit/e89b35d264c1bcc04807e7afab1254f35ffc8cb9
> > > >
> > > >Branch with a few other things on top is:
> > > >https://gitlab.com/jic23/qemu/-/commits/cxl-2024-11-27
> > > >
> > > >Note that this currently doesn't produce real data.  I have a plan
> > > >/ initial PoC / hack to hook that up via an addition to the QEMU cache
> > > >plugin and an external tool to emulate the hotness tracker counting
> > > >hardware. Will be a little while before I get that finished, so in
> > > > >the meantime the above exercises the driver.
> > > >
> > > >Jonathan
> > > >    
> > > >>
> > > >> RFC reason
> > > >> - Whilst trace capture with a particular configuration is potentially useful
> > > >>   the intent is that CXL HMU units will be used to drive various forms of
> > > >>   hotpage migration for memory tiering setups. This driver doesn't do this
> > > >>   (yet), but rather provides data capture etc for experimentation and
> > > >>   for working out how to mostly put the allocations in the right place to
> > > >>   start with by tuning applications.
> > > >>
> > > >> CXL r3.2 introduces a CXL Hotness Monitoring Unit definition. The intent
> > > >> of this is to provide a way to establish which units of memory (typically
> > > >> pages or larger) in CXL attached memory are hot. The implementation details
> > > >> and algorithm are all implementation defined. The specification simply
> > > >> describes the 'interface' which takes the form of ring buffer of hotness
> > > >> records in a PCI BAR and defined capability, configuration and status
> > > >> registers.
> > > >>
> > > >> The hardware may have constraints on what it can track, granularity etc
> > > >> and on how accurately it tracks (e.g. counter exhaustion, inaccurate
> > > >> trackers). Some of these constraints are discoverable from the hardware
> > > >> registers, others such as loss of accuracy have no universally accepted
> > > >> measures as they are typically access pattern dependent. Sadly it is
> > > >> very unlikely any hardware will implement a truly precise tracker given
> > > >> the large resource requirements for tracking at a useful granularity.
> > > >>
> > > >> There are two fundamental operation modes:
> > > >>
> > > >> * Epoch based. Counters are checked after a period of time (Epoch) and
> > > >>   if over a threshold added to the hotlist.
> > > >> * Always on. Counters run until a threshold is reached, after that the
> > > >>   hot unit is added to the hotlist and the counter released.
> > > >>
> > > >> Counting can be filtered on:
> > > >>
> > > >> * Region of CXL DPA space (256MiB per bit in a bitmap).
> > > >> * Type of access - Trusted and non trusted or non trusted only, R/W/RW
> > > >>
> > > >> Sampling can be modified by:
> > > >>
> > > >> * Downsampling including potentially randomized downsampling.
> > > >>
> > > >> The driver presented here is intended to be useful in its own right but
> > > >> also to act as the first step of a possible path towards hotness monitoring
> > > >> based hot page migration. Those steps might look like.
> > > >>
> > > >> 1. Gather data - drivers provide telemetry like solutions to get that
> > > >>    data. May be enhanced, for example in this driver by providing the
> > > >>    HPA address rather than DPA Unit Address. Userspace can access enough
> > > >>    information to do this so maybe not.
> > > >> 2. Userspace algorithm development, possibly combined with userspace
> > > >>    triggered migration by PA. Working out how to use different levels
> > > >>    of constrained hardware resources will be challenging.
> > > >> 3. Move those algorithms in kernel. Will require generalization across
> > > >>    different hotpage trackers etc.
> > > >>
> > > > >> So far this driver just gives access to the raw data. I will probably kick
> > > > >> off a longer discussion on how to do the adaptive sampling needed to actually
> > > > >> use these units for tiering etc, sometime soon (if no one else beats
> > > > >> me to it).  There is a follow up topic of how to virtualize this stuff
> > > > >> for memory stranding cases (VM gets a fixed mixture of fast and slow
> > > > >> memory and should do its own tiering).
> > > >>
> > > >> More details in the Documentation patch but typical commands are:
> > > >>
> > > >> $perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\
> > > >>  hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,\
> > > >>  range_size=1024,randomized_downsampling=0,downsampling_factor=32,\
> > > >>  hotness_granual=12    
> > > 
> > > Facing issue while executing perf record on x86 emulation environment using following steps
> > > 
> > > 1. Tried applying CHMU Patch on branch cxl-for-6.13 using b4 utility. As
> > > base commit is not specified, with minor change able to apply patch.
> > > Compiled kernel with CONFIG_CXL_HMU
> > > 
> > > 2. Compiled jic23/cxl-2024-11-27 for x86_64-softmmu
> > > 
> > > 3. Launched Qemu with following CXL topology along with compiled kernel
> > > VM="-object memory-backend-ram,id=vmem1,share=on,size=512M \
> > >      -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
> > >      -device cxl-rp,port=0,bus=cxl.1,id=root_port13,chassis=0,slot=2 \
> > >      -device cxl-type3,bus=root_port13,volatile-memdev=vmem1,id=cxl-vmem1 \
> > >      -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=8k"
> > > 
> > > 4. Created region and onlined this memory. Also run top utility on the newly created 
> > > numa node using numactl -m<node> top
> > > 
> > > 5. Compiled and installed perf utility in qemu environment, and able to
> > > see cxl_hmu_mem* entries in perf list
> > > 
> > > root@QEMUCXL2030mm:~# perf list
> > > <snip>
> > >   cxl_hmu_mem0.0.0/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
> > >   cxl_hmu_mem0.0.1/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
> > >   cxl_hmu_mem0.0.2/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
> > >   cxl_hmu_mem1.0.0/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
> > >   cxl_hmu_mem1.0.1/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
> > >   cxl_hmu_mem1.0.2/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw ev>
> > >   cxl_pmu_mem0.0/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
> > >   cxl_pmu_mem0.1/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
> > >   cxl_pmu_mem1.0/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
> > >   cxl_pmu_mem1.1/vid=0..0xffff,edge,mask=0..0xffffffff,.../modifier[Raw event descriptor]
> > > <snip>
> > > 
> > > 6. Tried running perf command mentioned in Documentation/trace/cxl-hmu.rst
> > > 
> > > root@QEMUCXL2030mm:/home/cxl/cxl-linux-mainline/tools/perf# perf -v
> > > perf version 6.12.rc5.gc198a4f4a356
> > > root@QEMUCXL2030mm:/home/cxl/cxl-linux-mainline/tools/perf# perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,range_size=1024,randomized_downsampling=0,downsampling_factor=32,hotness_granual=12
> > > event syntax error: '..ess_granual=12'
> > >                                   \___ Unrecognized input  
> > 
> > This is probably my mistake when cutting and pasting the example from a terminal.
> > Add a trailing / and something to run.
> > 
> >  perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,range_size=1024,randomized_downsampling=0,downsampling_factor=32,hotness_granual=12/ -- sleep 10  
> 
> Hi Jonathan,
> 
> I tried to use this new command but perf shows error. 
> 
> Based on the change of hmu iomap_block size[1], my steps are like below:
> 
> step1: Create cxl region and online the numa node
> 
> root@ubuntu-jammy-arm64:~/tools# numactl -H
> available: 3 nodes (0-2)
> node 0 cpus: 0 1
> node 0 size: 1972 MB
> node 0 free: 1694 MB
> node 1 cpus: 2 3
> node 1 size: 1942 MB
> node 1 free: 1690 MB
> node 2 cpus:
> node 2 size: 256 MB
> node 2 free: 256 MB
> node distances:
> node   0   1   2
>   0:  10  20  20
>   1:  20  10  20
>   2:  20  20  10
> 
> step2: Bind this numa node to run 'ls'
> 
> root@ubuntu-jammy-arm64:~/tools# numactl -m 2 ls
> build   perf.data  ndctl  perf     
> 
> root@ubuntu-jammy-arm64:~/tools# numastat 
>                            node0           node1           node2
> numa_hit                  109323          141170              77
> numa_miss                      0               0               0
> numa_foreign                   0               0               0
> interleave_hit               519             591               0
> local_node                108810          139591               0
> other_node                   513            1579              77
> 
> step3: Use perf tool
> 
> root@ubuntu-jammy-arm64:~/tools# perf -v
> perf version 6.15.rc5.g2c3e6f60f5cf
> 
> root@ubuntu-jammy-arm64:~/tools# perf list | grep -i hmu
>   cxl_hmu_mem0.0.0/hotness_granual=0..0xffffffff,hotness_threshold=0..0xffffffff,downsampling_factor=0..255,.../modifier[Raw event descriptor]
> 
> root@ubuntu-jammy-arm64:~/tools# perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoperf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,range_size=1024,randomized_downsampling=0,downsampling_factor=32,hotness_granual=12/ -- sleep 10
> 
This cut and paste seems to contain halves of two commands.

I'll take the second one, which I'm guessing is what you actually ran.

perf record -a  -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,range_size=1024,randomized_downsampling=0,downsampling_factor=32,hotness_granual=12/ -- sleep 10

> Error:
> cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,range_size=1024,randomized_downsampling=0,downsampling_factor=32,hotness_granual=12/H: PMU Hardware doesn't support sampling/overflow-interrupts. Try 'perf stat'
> 
> [1]:https://lore.kernel.org/linux-cxl/aFNsFI5OKrD0CWR3@phytium.com.cn/T/#u
> 
> Is something wrong on the CHMU interrupts?

That error message is annoyingly meaningless and misleading. I think the perf
core assumes that any 'not supported' error means sampling is unsupported,
rather than that something else on the command line might not be supported.

It's failing because we don't support that downsampling_factor. The format
allows it (hence, I think, the range shown in the perf list output above) but
the emulation doesn't.

So for now just don't specify downsampling_factor on the command line. There
seems to be a bug in the emulation around setting it to 1. The emulation has

val = FIELD_DP64(val, CXL_CHMU0_CAP1, DOWN_SAMPLING_FACTORS, BIT(1))

which should have been BIT(0) to indicate no downsampling (the comment above it is wrong too).


The spec is rather confusing on how this is supposed to work.
We have a bitmap that has 16 bits, each of which represents a power of 2, but
then the Down-sampling factor field of the CHMU configuration register says
bits 99:96 are 'one of the 16 possible values'. I assume that means a bit
offset into the bitmap.
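
Under that reading (bit n of the 16-bit capability bitmap means a down-sampling
factor of 2^n is supported, and the 4-bit configuration field holds n), the
check a driver or emulation would make can be sketched as below. This only
illustrates the interpretation, with made-up helper names; it is not code from
the driver or from QEMU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed: the 4-bit config field is a bit offset into the 16-bit bitmap. */
bool chmu_downsampling_supported(uint16_t cap_bitmap, unsigned int field)
{
	assert(field < 16);
	return (cap_bitmap >> field) & 1;
}

/* Assumed: field value n selects a factor of 2^n (n = 0: no down-sampling). */
unsigned int chmu_downsampling_factor(unsigned int field)
{
	assert(field < 16);
	return 1u << field;
}
```

With the capability advertising only BIT(1), as in the emulation line quoted
above, a field value of 0 (no down-sampling) is rejected and only 1 (a factor
of 2) is accepted. With BIT(0) it is the other way round.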

I'll reply to the qemu thread to point out this bug in the capability.

Jonathan



> 
> > 
> > Jonathan
> >   
> > > 
> > > 
> > > 
> > > Are there any steps i am missing?
> > > 
> > > Regards,
> > > Neeraj	 
> > >   
> > > >>
> > > >> $perf report --dump-raw-traces
> > > >>
> > > > >> Example output.  With a counter_width of 16 (0x10) the least significant
> > > > >> 2 bytes (4 hex digits) are the counter value and the unit index is bits 16-63.
> > > >> Here all units are over the threshold and the indexes are 0,1,2 etc.
> > > >>
> > > >> . ... CXL_HMU data: size 33512 bytes
> > > >> Header 0: units: 29c counter_width 10
> > > >> Header 1 : deadbeef
> > > >> 0000000000000283
> > > >> 0000000000010364
> > > >> 0000000000020366
> > > >> 000000000003033c
> > > >> 0000000000040343
> > > >> 00000000000502ff
> > > >> 000000000006030d
> > > >> 000000000007031a
> > > >>
> > > >> Which will produce a list of hotness entries.
> > > >> Bits[N-1:0] counter value
> > > >> Bits[63:N] Unit ID (combine with unit size and DPA base + HDM decoder
> > > >>   config to get to a Host Physical Address)
> > > >>
> > > >> Specific RFC questions.
> > > >> - What should be in the header added to the aux buffer.
> > > >>   Currently just the minimum is provided. Number of records
> > > >>   and the counter width needed to decode them.
> > > >> - Should we reset the counters when doing sampling "-F X"
> > > >>   If the frequency is higher than the epoch we never see any hot units.
> > > >>   If so, when should we reset them?
> > > >>
> > > >> Note testing has been light and on emulation only + as perf tool is
> > > > >> a pain to build on a stripped back VM, build testing has all been on
> > > >> arm64 so far.  The driver loads though on both arm64 and x86 so
> > > >> any problems are likely in the perf tool arch specific code
> > > >> which is build tested (on wrong machine)
> > > >>
> > > >> The QEMU emulation needs some cleanup, but I should be able to post
> > > >> that shortly to let people actually play with this.  There are lots
> > > >> of open questions there on how 'right' we want the emulation to be
> > > >> and what counting uarch to emulate.
> > > >>
> > > >> Jonathan Cameron (4):
> > > >>   cxl: Register devices for CXL Hotness Monitoring Units (CHMU)
> > > >>   cxl: Hotness Monitoring Unit via a Perf AUX Buffer.
> > > >>   perf: Add support for CXL Hotness Monitoring Units (CHMU)
> > > >>   hwtrace: Document CXL Hotness Monitoring Unit driver
> > > >>
> > > >>  Documentation/trace/cxl-hmu.rst     | 197 +++++++
> > > >>  Documentation/trace/index.rst       |   1 +
> > > >>  drivers/cxl/Kconfig                 |   6 +
> > > >>  drivers/cxl/Makefile                |   3 +
> > > >>  drivers/cxl/core/Makefile           |   1 +
> > > >>  drivers/cxl/core/core.h             |   1 +
> > > >>  drivers/cxl/core/hmu.c              |  64 ++
> > > >>  drivers/cxl/core/port.c             |   2 +
> > > >>  drivers/cxl/core/regs.c             |  14 +
> > > >>  drivers/cxl/cxl.h                   |   5 +
> > > >>  drivers/cxl/cxlpci.h                |   1 +
> > > >>  drivers/cxl/hmu.c                   | 880 ++++++++++++++++++++++++++++
> > > >>  drivers/cxl/hmu.h                   |  23 +
> > > >>  drivers/cxl/pci.c                   |  26 +-
> > > >>  tools/perf/arch/arm/util/auxtrace.c |  58 ++
> > > >>  tools/perf/arch/x86/util/auxtrace.c |  76 +++
> > > >>  tools/perf/util/Build               |   1 +
> > > >>  tools/perf/util/auxtrace.c          |   4 +
> > > >>  tools/perf/util/auxtrace.h          |   1 +
> > > >>  tools/perf/util/cxl-hmu.c           | 367 ++++++++++++
> > > >>  tools/perf/util/cxl-hmu.h           |  18 +
> > > >>  21 files changed, 1748 insertions(+), 1 deletion(-)
> > > >>  create mode 100644 Documentation/trace/cxl-hmu.rst
> > > >>  create mode 100644 drivers/cxl/core/hmu.c
> > > >>  create mode 100644 drivers/cxl/hmu.c
> > > >>  create mode 100644 drivers/cxl/hmu.h
> > > >>  create mode 100644 tools/perf/util/cxl-hmu.c
> > > >>  create mode 100644 tools/perf/util/cxl-hmu.h
> > > >>    
> > > >    
> > >   
> >   
> 
> 
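
For anyone decoding the dump by hand, here is a minimal userspace sketch of
the record layout quoted above (Bits[N-1:0] counter value, Bits[63:N] unit ID).
The names decode_record(), struct hot_record and unit_to_dpa() are made up
for illustration and are not part of the patches. The unit to DPA step
assumes units are numbered contiguously from the base of the tracked range,
which is an assumption rather than something stated in this thread.

```c
#include <assert.h>
#include <stdint.h>

/* One decoded hotness record (hypothetical type, for illustration only). */
struct hot_record {
	uint64_t unit;  /* Bits[63:N]: unit index */
	uint64_t count; /* Bits[N-1:0]: counter value */
};

/* Split a raw 64-bit hotlist entry; counter_width is N from the header. */
static struct hot_record decode_record(uint64_t raw, unsigned int counter_width)
{
	struct hot_record rec;

	/* A shift by 64 is undefined in C, so only accept N in 1..63. */
	assert(counter_width > 0 && counter_width < 64);
	rec.count = raw & ((UINT64_C(1) << counter_width) - 1);
	rec.unit = raw >> counter_width;
	return rec;
}

/*
 * Assumption: unit 0 starts at dpa_base and every unit covers
 * 2^unit_shift bytes. Getting from the DPA to a Host Physical Address
 * additionally needs the HDM decoder configuration, not handled here.
 */
static uint64_t unit_to_dpa(uint64_t dpa_base, uint64_t unit,
			    unsigned int unit_shift)
{
	return dpa_base + (unit << unit_shift);
}
```

With the example output, decode_record(0x0000000000010364, 16) gives unit 1
with a counter value of 0x364.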



* [RFC PATCH 1/4] cxl: Register devices for CXL Hotness Monitoring Units (CHMU)
  2024-11-21 10:18 ` [RFC PATCH 1/4] cxl: Register devices for CXL Hotness Monitoring Units (CHMU) Jonathan Cameron
       [not found]   ` <CGME20250103052421epcas5p4a1a917ba5d367dfccec91d4522666ca0@epcas5p4.samsung.com>
  2025-06-19  1:47   ` Yuquan Wang
@ 2025-08-08  8:29   ` wangyuquan
  2025-08-08  8:45   ` Yuquan Wang
  3 siblings, 0 replies; 27+ messages in thread
From: wangyuquan @ 2025-08-08  8:29 UTC (permalink / raw)
  To: tangtao1634, linux-cxl, linux-mm, linux-perf-users, linux-kernel
  Cc: Jonathan Cameron, linuxarm, tongtiangen, Yicong Yang, Niyas Sait,
	ajayjoshi, Vandana Salve, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Ira Weiny, Dan Williams, Alexander Shishkin,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Gregory Price, Huang Ying

From: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Basic registration using a similar approach to how the CPMUs
are registered.

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 drivers/cxl/core/Makefile |  1 +
 drivers/cxl/core/hmu.c    | 64 +++++++++++++++++++++++++++++++++++++++
 drivers/cxl/core/regs.c   | 14 +++++++++
 drivers/cxl/cxl.h         |  4 +++
 drivers/cxl/cxlpci.h      |  1 +
 drivers/cxl/hmu.h         | 23 ++++++++++++++
 drivers/cxl/pci.c         | 26 +++++++++++++++-
 7 files changed, 132 insertions(+), 1 deletion(-)

diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
index 9259bcc6773c..d060abb773ae 100644
--- a/drivers/cxl/core/Makefile
+++ b/drivers/cxl/core/Makefile
@@ -12,6 +12,7 @@ cxl_core-y += memdev.o
 cxl_core-y += mbox.o
 cxl_core-y += pci.o
 cxl_core-y += hdm.o
+cxl_core-y += hmu.o
 cxl_core-y += pmu.o
 cxl_core-y += cdat.o
 cxl_core-$(CONFIG_TRACING) += trace.o
diff --git a/drivers/cxl/core/hmu.c b/drivers/cxl/core/hmu.c
new file mode 100644
index 000000000000..3ee938bb6c05
--- /dev/null
+++ b/drivers/cxl/core/hmu.c
@@ -0,0 +1,64 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/* Copyright(c) 2024 Huawei. All rights reserved. */
+
+#include <linux/device.h>
+#include <linux/slab.h>
+#include <linux/idr.h>
+#include <cxlmem.h>
+#include <hmu.h>
+#include <cxl.h>
+#include "core.h"
+
+static void cxl_hmu_release(struct device *dev)
+{
+	struct cxl_hmu *hmu = to_cxl_hmu(dev);
+
+	kfree(hmu);
+}
+
+const struct device_type cxl_hmu_type = {
+	.name = "cxl_hmu",
+	.release = cxl_hmu_release,
+};
+
+static void remove_dev(void *dev)
+{
+	device_unregister(dev);
+}
+
+int devm_cxl_hmu_add(struct device *parent, struct cxl_hmu_regs *regs,
+		     int assoc_id, int index)
+{
+	struct cxl_hmu *hmu;
+	struct device *dev;
+	int rc;
+
+	hmu = kzalloc(sizeof(*hmu), GFP_KERNEL);
+	if (!hmu)
+		return -ENOMEM;
+
+	hmu->assoc_id = assoc_id;
+	hmu->index = index;
+	hmu->base = regs->hmu;
+	dev = &hmu->dev;
+	device_initialize(dev);
+	device_set_pm_not_required(dev);
+	dev->parent = parent;
+	dev->bus = &cxl_bus_type;
+	dev->type = &cxl_hmu_type;
+	rc = dev_set_name(dev, "hmu_mem%d.%d", assoc_id, index);
+	if (rc)
+		goto err;
+
+	rc = device_add(dev);
+	if (rc)
+		goto err;
+
+	return devm_add_action_or_reset(parent, remove_dev, dev);
+
+err:
+	put_device(&hmu->dev);
+	return rc;
+}
+EXPORT_SYMBOL_NS_GPL(devm_cxl_hmu_add, CXL);
+
diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
index e1082e749c69..c12afaa6ef98 100644
--- a/drivers/cxl/core/regs.c
+++ b/drivers/cxl/core/regs.c
@@ -401,6 +401,20 @@ int cxl_map_pmu_regs(struct cxl_register_map *map, struct cxl_pmu_regs *regs)
 }
 EXPORT_SYMBOL_NS_GPL(cxl_map_pmu_regs, CXL);
 
+int cxl_map_hmu_regs(struct cxl_register_map *map, struct cxl_hmu_regs *regs)
+{
+	struct device *dev = map->host;
+	resource_size_t phys_addr;
+
+	phys_addr = map->resource;
+	regs->hmu = devm_cxl_iomap_block(dev, phys_addr, map->max_size);
+	if (!regs->hmu)
+		return -ENOMEM;
+
+	return 0;
+}
+EXPORT_SYMBOL_NS_GPL(cxl_map_hmu_regs, CXL);
+
 static int cxl_map_regblock(struct cxl_register_map *map)
 {
 	struct device *host = map->host;
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 5406e3ab3d4a..8172bc1f7a8d 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -227,6 +227,9 @@ struct cxl_regs {
 	struct_group_tagged(cxl_pmu_regs, pmu_regs,
 		void __iomem *pmu;
 	);
+	struct_group_tagged(cxl_hmu_regs, hmu_regs,
+		void __iomem *hmu;
+	);
 
 	/*
 	 * RCH downstream port specific RAS register
@@ -292,6 +295,7 @@ int cxl_map_component_regs(const struct cxl_register_map *map,
 			   unsigned long map_mask);
 int cxl_map_device_regs(const struct cxl_register_map *map,
 			struct cxl_device_regs *regs);
+int cxl_map_hmu_regs(struct cxl_register_map *map, struct cxl_hmu_regs *regs);
 int cxl_map_pmu_regs(struct cxl_register_map *map, struct cxl_pmu_regs *regs);
 
 enum cxl_regloc_type;
diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
index 4da07727ab9c..71f5e9620137 100644
--- a/drivers/cxl/cxlpci.h
+++ b/drivers/cxl/cxlpci.h
@@ -67,6 +67,7 @@ enum cxl_regloc_type {
 	CXL_REGLOC_RBI_VIRT,
 	CXL_REGLOC_RBI_MEMDEV,
 	CXL_REGLOC_RBI_PMU,
+	CXL_REGLOC_RBI_HMU,
 	CXL_REGLOC_RBI_TYPES
 };
 
diff --git a/drivers/cxl/hmu.h b/drivers/cxl/hmu.h
new file mode 100644
index 000000000000..c4798ed9764b
--- /dev/null
+++ b/drivers/cxl/hmu.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright(c) 2024 Huawei
+ * CXL Specification rev 3.2 Section 8.2.8 (CHMU Register Interface)
+ */
+#ifndef CXL_HMU_H
+#define CXL_HMU_H
+#include <linux/device.h>
+
+#define CXL_HMU_REGMAP_SIZE 0xe00 /* Table 8-32 CXL 3.0 specification */
+struct cxl_hmu {
+	struct device dev;
+	void __iomem *base;
+	int assoc_id;
+	int index;
+};
+
+#define to_cxl_hmu(dev) container_of(dev, struct cxl_hmu, dev)
+struct cxl_hmu_regs;
+int devm_cxl_hmu_add(struct device *parent, struct cxl_hmu_regs *regs,
+		     int assoc_id, int idx);
+
+#endif
diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
index 188412d45e0d..e89ea9d3f007 100644
--- a/drivers/cxl/pci.c
+++ b/drivers/cxl/pci.c
@@ -15,6 +15,7 @@
 #include "cxlmem.h"
 #include "cxlpci.h"
 #include "cxl.h"
+#include "hmu.h"
 #include "pmu.h"
 
 /**
@@ -814,7 +815,7 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 	struct cxl_dev_state *cxlds;
 	struct cxl_register_map map;
 	struct cxl_memdev *cxlmd;
-	int i, rc, pmu_count;
+	int i, rc, hmu_count, pmu_count;
 	bool irq_avail;
 
 	/*
@@ -938,6 +939,29 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
 		}
 	}
 
+	hmu_count = cxl_count_regblock(pdev, CXL_REGLOC_RBI_HMU);
+	for (i = 0; i < hmu_count; i++) {
+		struct cxl_hmu_regs hmu_regs;
+
+		rc = cxl_find_regblock_instance(pdev, CXL_REGLOC_RBI_HMU, &map, i);
+		if (rc) {
+			dev_dbg(&pdev->dev, "Could not find HMU regblock\n");
+			break;
+		}
+
+		rc = cxl_map_hmu_regs(&map, &hmu_regs);
+		if (rc) {
+			dev_dbg(&pdev->dev, "Could not map HMU regs\n");
+			break;
+		}
+
+		rc = devm_cxl_hmu_add(cxlds->dev, &hmu_regs, cxlmd->id, i);
+		if (rc) {
+			dev_dbg(&pdev->dev, "Could not add HMU instance\n");
+			break;
+		}
+	}
+
 	rc = cxl_event_config(host_bridge, mds, irq_avail);
 	if (rc)
 		return rc;



* [RFC PATCH 2/4] cxl: Hotness Monitoring Unit via a Perf AUX Buffer.
  2024-11-21 10:18 ` [RFC PATCH 2/4] cxl: Hotness Monitoring Unit via a Perf AUX Buffer Jonathan Cameron
@ 2025-08-08  8:29   ` wangyuquan
  0 siblings, 0 replies; 27+ messages in thread
From: wangyuquan @ 2025-08-08  8:29 UTC (permalink / raw)
  To: tangtao1634, linux-cxl, linux-mm, linux-perf-users, linux-kernel
  Cc: Jonathan Cameron, linuxarm, tongtiangen, Yicong Yang, Niyas Sait,
	ajayjoshi, Vandana Salve, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Ira Weiny, Dan Williams, Alexander Shishkin,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Gregory Price, Huang Ying

From: Jonathan Cameron <Jonathan.Cameron@huawei.com>

There are many ways that support for the new CXL hotness monitoring unit
could be enabled.

The existing infrastructure of perf + auxiliary buffers is used for the
similar activity of trace capture.

This driver is based on the existing hisi_ptt (PCI trace and tune) driver
and the CXL PMU driver.  Testing was done against QEMU emulation of
the feature but it's early days and lots more testing is needed as
this is a flexible specification with many corner cases.

The raw hotlist elements cannot be interpreted without knowing
the counter width. This is unfortunately dependent in an implementation
defined way on the unit size used for monitoring.
As such, store this and a count of new hotlist entries in a header
that is inserted at the start of each set of records added to the auxiliary
buffer.

TODO: Add capabilities to expose what can be set for at least some
of these parameters.

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
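Not part of the commit: for experimenting without perf tool support, the
sketch below packs the perf_event_attr config, config1 and config2 values
using the bit ranges from the PMU_FORMAT_ATTR() entries in this patch.
struct chmu_params, field_prep() and the chmu_config*() helpers are
hypothetical names used only in this example.

```c
#include <assert.h>
#include <stdint.h>

/* Place val into bits [hi:lo] of a 64-bit word; val must fit the field. */
static uint64_t field_prep(unsigned int hi, unsigned int lo, uint64_t val)
{
	uint64_t mask;

	assert(hi < 64 && lo <= hi);
	mask = (hi - lo == 63) ? UINT64_MAX
			       : (UINT64_C(1) << (hi - lo + 1)) - 1;
	assert((val & ~mask) == 0);
	return val << lo;
}

/* Hypothetical bundle of the user-settable parameters. */
struct chmu_params {
	uint64_t epoch_type;        /* 0 = epoch based, 1 = always on */
	uint64_t access_type;       /* 1..6, see CHMU_INST0_CFG0_WHAT_* */
	uint64_t epoch_scale;       /* e.g. 2 for CHMU_EPOCH_SCALE_1MS */
	uint64_t epoch_multiplier;
	uint64_t randomized_ds;     /* 0 or 1 */
	uint64_t ds_factor;         /* log2 of the downsampling factor */
	uint64_t hotness_threshold;
	uint64_t hotness_granual;   /* log2 of unit size in bytes, >= 8 */
	uint64_t range_base;        /* in 256MiB chunks */
	uint64_t range_size;        /* in 256MiB chunks, 0 = everything */
};

/* config: bits 0-1, 2-9, 10-13, 14-25, 26 and 27-34 */
static uint64_t chmu_config(const struct chmu_params *p)
{
	return field_prep(1, 0, p->epoch_type) |
	       field_prep(9, 2, p->access_type) |
	       field_prep(13, 10, p->epoch_scale) |
	       field_prep(25, 14, p->epoch_multiplier) |
	       field_prep(26, 26, p->randomized_ds) |
	       field_prep(34, 27, p->ds_factor);
}

/* config1: threshold in bits 0-31, granual in bits 32-63 */
static uint64_t chmu_config1(const struct chmu_params *p)
{
	return field_prep(31, 0, p->hotness_threshold) |
	       field_prep(63, 32, p->hotness_granual);
}

/* config2: range base in bits 0-31, range size in bits 32-63 */
static uint64_t chmu_config2(const struct chmu_params *p)
{
	return field_prep(31, 0, p->range_base) |
	       field_prep(63, 32, p->range_size);
}
```

With access_type 6 (read/write), epoch_scale 2 and epoch_multiplier 10,
chmu_config() returns 0x28818. The same fields can normally be given by
name through the format attributes rather than as raw numbers.
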
 drivers/cxl/Kconfig     |   6 +
 drivers/cxl/Makefile    |   3 +
 drivers/cxl/core/core.h |   1 +
 drivers/cxl/core/port.c |   2 +
 drivers/cxl/cxl.h       |   1 +
 drivers/cxl/hmu.c       | 880 ++++++++++++++++++++++++++++++++++++++++
 6 files changed, 893 insertions(+)
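
Not part of the commit: a rough userspace sketch of how a consumer might
walk one chunk of the aux buffer, given the header described in the commit
message. It assumes the layout currently written by cxl_hmu_update_aux()
(header word 0 bits 15:0 record count, bits 23:16 counter width, header
word 1 the 0xDEADBEEF placeholder) and host-endian data. The names
chmu_parse_chunk() and struct chmu_chunk are hypothetical, and the header
format itself is still an open RFC question.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHMU_HDR_WORDS   2                        /* two u64 words, 16 bytes */
#define CHMU_HDR1_MARKER UINT64_C(0xDEADBEEF)     /* placeholder in this RFC */

/* Hypothetical view of one header + records chunk. */
struct chmu_chunk {
	unsigned int nr_records;    /* header[0] bits 15:0 */
	unsigned int counter_width; /* header[0] bits 23:16 */
	const uint64_t *records;    /* nr_records raw 64-bit entries */
};

/*
 * Parse one chunk starting at words[0]. Returns the number of 64-bit
 * words consumed, or 0 if the marker is wrong or the chunk is truncated.
 */
static size_t chmu_parse_chunk(const uint64_t *words, size_t nr_words,
			       struct chmu_chunk *out)
{
	if (nr_words < CHMU_HDR_WORDS)
		return 0;
	if (words[1] != CHMU_HDR1_MARKER)
		return 0;

	out->nr_records = words[0] & 0xffff;
	out->counter_width = (words[0] >> 16) & 0xff;
	if (out->nr_records > nr_words - CHMU_HDR_WORDS)
		return 0;

	out->records = words + CHMU_HDR_WORDS;
	return CHMU_HDR_WORDS + out->nr_records;
}
```

A caller would loop, advancing by the returned word count, until the
function returns 0 or the buffer is exhausted.
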

diff --git a/drivers/cxl/Kconfig b/drivers/cxl/Kconfig
index 876469e23f7a..c420f828fe20 100644
--- a/drivers/cxl/Kconfig
+++ b/drivers/cxl/Kconfig
@@ -146,4 +146,10 @@ config CXL_REGION_INVALIDATION_TEST
 	  If unsure, or if this kernel is meant for production environments,
 	  say N.
 
+config CXL_HMU
+       tristate "CXL: Hotness Monitoring Unit Driver"
+       depends on PERF_EVENTS
+       help
+	 Read data out from the CXL hotness units and provide it to userspace
+	 via the perf auxbuffer framework.
 endif
diff --git a/drivers/cxl/Makefile b/drivers/cxl/Makefile
index 2caa90fa4bf2..b678aa927298 100644
--- a/drivers/cxl/Makefile
+++ b/drivers/cxl/Makefile
@@ -7,15 +7,18 @@
 # - 'mem' and 'pmem' before endpoint drivers so that memdevs are
 #   immediately enabled
 # - 'pci' last, also mirrors the hardware enumeration hierarchy
+# - 'hmu' doesn't matter for now.
 obj-y += core/
 obj-$(CONFIG_CXL_PORT) += cxl_port.o
 obj-$(CONFIG_CXL_ACPI) += cxl_acpi.o
 obj-$(CONFIG_CXL_PMEM) += cxl_pmem.o
 obj-$(CONFIG_CXL_MEM) += cxl_mem.o
 obj-$(CONFIG_CXL_PCI) += cxl_pci.o
+obj-$(CONFIG_CXL_HMU) += cxl_hmu.o
 
 cxl_port-y := port.o
 cxl_acpi-y := acpi.o
 cxl_pmem-y := pmem.o security.o
 cxl_mem-y := mem.o
 cxl_pci-y := pci.o
+cxl_hmu-y := hmu.o
diff --git a/drivers/cxl/core/core.h b/drivers/cxl/core/core.h
index 0c62b4069ba0..88c673a4d950 100644
--- a/drivers/cxl/core/core.h
+++ b/drivers/cxl/core/core.h
@@ -6,6 +6,7 @@
 
 extern const struct device_type cxl_nvdimm_bridge_type;
 extern const struct device_type cxl_nvdimm_type;
+extern const struct device_type cxl_hmu_type;
 extern const struct device_type cxl_pmu_type;
 
 extern struct attribute_group cxl_base_attribute_group;
diff --git a/drivers/cxl/core/port.c b/drivers/cxl/core/port.c
index af92c67bc954..a91712757830 100644
--- a/drivers/cxl/core/port.c
+++ b/drivers/cxl/core/port.c
@@ -74,6 +74,8 @@ static int cxl_device_id(const struct device *dev)
 		return CXL_DEVICE_REGION;
 	if (dev->type == &cxl_pmu_type)
 		return CXL_DEVICE_PMU;
+	if (dev->type == &cxl_hmu_type)
+		return CXL_DEVICE_HMU;
 	return 0;
 }
 
diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
index 8172bc1f7a8d..bd190e2baa1d 100644
--- a/drivers/cxl/cxl.h
+++ b/drivers/cxl/cxl.h
@@ -850,6 +850,7 @@ void cxl_driver_unregister(struct cxl_driver *cxl_drv);
 #define CXL_DEVICE_PMEM_REGION		7
 #define CXL_DEVICE_DAX_REGION		8
 #define CXL_DEVICE_PMU			9
+#define CXL_DEVICE_HMU			10
 
 #define MODULE_ALIAS_CXL(type) MODULE_ALIAS("cxl:t" __stringify(type) "*")
 #define CXL_MODALIAS_FMT "cxl:t%d"
diff --git a/drivers/cxl/hmu.c b/drivers/cxl/hmu.c
new file mode 100644
index 000000000000..9f5947afbb4b
--- /dev/null
+++ b/drivers/cxl/hmu.c
@@ -0,0 +1,880 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Driver for CXL Hotness Monitoring unit
+ *
+ * Based on hisi_ptt.c (Author Yicong Yang <yangyicong@hisilicon.com>)
+ * Copyright (c) 2022-2024 HiSilicon Technologies Co., Ltd.
+ *
+ * TODO:
+ * - Add capabilities attributes to help userspace know what can be set.
+ * - Find out if timeouts are appropriate for real hardware. Currently
+ *   assuming 0.1 seconds is enough for anything.
+ */
+#include <linux/dev_printk.h>
+#include <linux/perf_event.h>
+#include <linux/bitfield.h>
+#include <linux/spinlock.h>
+#include <linux/cleanup.h>
+#include <linux/vmalloc.h>
+#include <linux/device.h>
+#include <linux/iopoll.h>
+#include <linux/delay.h>
+#include <linux/list.h>
+#include <linux/math.h>
+#include <linux/pci.h>
+
+#include "cxlpci.h"
+#include "cxl.h"
+#include "hmu.h"
+
+#define CHMU_COMMON_CAP0_REG				0x00
+#define   CHMU_COMMON_CAP0_VER_MSK			GENMASK(3, 0)
+#define   CHMU_COMMON_CAP0_NUMINST_MSK			GENMASK(15, 8)
+#define CHMU_COMMON_CAP1_REG				0x08
+#define   CHMU_COMMON_CAP1_INSTLEN_MSK			GENMASK(15, 0)
+
+/* Register offsets within instance */
+#define CHMU_INST0_CAP0_REG				0x00
+#define   CHMU_INST0_CAP0_MSI_N_MSK			GENMASK(3, 0)
+#define   CHMU_INST0_CAP0_OVRFLW_CAP			BIT(4)
+#define   CHMU_INST0_CAP0_FILLTHRESH_CAP		BIT(5)
+#define   CHMU_INST0_CAP0_EPOCH_TYPE_MSK		GENMASK(7, 6)
+#define     CHMU_INST0_CAP0_EPOCH_TYPE_GLOBAL		0
+#define     CHMU_INST0_CAP0_EPOCH_TYPE_PERCNT		1
+#define   CHMU_INST0_CAP0_TRACK_NONTEE_R		BIT(8)
+#define   CHMU_INST0_CAP0_TRACK_NONTEE_W		BIT(9)
+#define   CHMU_INST0_CAP0_TRACK_NONTEE_RW		BIT(10)
+#define   CHMU_INST0_CAP0_TRACK_R			BIT(11)
+#define   CHMU_INST0_CAP0_TRACK_W			BIT(12)
+#define   CHMU_INST0_CAP0_TRACK_RW			BIT(13)
+/* Epoch defined as scale * multiplier */
+#define   CHMU_INST0_CAP0_EPOCH_MAX_SCALE_MSK		GENMASK(19, 16)
+#define     CHMU_EPOCH_SCALE_100US			1
+#define     CHMU_EPOCH_SCALE_1MS			2
+#define     CHMU_INST0_SCALE_10MS			3
+#define     CHMU_INST0_SCALE_100MS			4
+#define     CHMU_INST0_SCALE_1US			5
+#define   CHMU_INST0_CAP0_EPOCH_MAX_MULT_MSK		GENMASK(31, 20)
+#define   CHMU_INST0_CAP0_EPOCH_MIN_SCALE_MSK		GENMASK_ULL(35, 32)
+#define   CHMU_INST0_CAP0_EPOCH_MIN_MULT_MSK		GENMASK_ULL(47, 36)
+#define   CHMU_INST0_CAP0_HOTLIST_SIZE_MSK		GENMASK_ULL(63, 48)
+#define CHMU_INST0_CAP1_REG				0x08
+/* Power of 2 * 256 bytes */
+#define   CHMU_INST0_CAP1_UNIT_SIZE_MSK			GENMASK(31, 0)
+/* Power of 2 */
+#define   CHMU_INST0_CAP1_DOWNSAMP_MSK			GENMASK_ULL(47, 32)
+#define   CHMU_INST0_CAP1_EPOCH_SUP			BIT_ULL(48)
+#define   CHMU_INST0_CAP1_ALWAYS_ON_SUP			BIT_ULL(49)
+#define   CHMU_INST0_CAP1_RAND_DOWNSAMP_SUP		BIT_ULL(50)
+#define   CHMU_INST0_CAP1_ADDR_OVERLAP_SUP		BIT_ULL(51)
+#define   CHMU_INST0_CAP1_POSTPONED_ON_OVRFLOW_SUP	BIT_ULL(52)
+
+/*
+ * In CXL r3.2 all defined as part of single giant CAP register.
+ * Where a whole 64 bits is in one field just name after the field.
+ */
+#define CHMU_INST0_RANGE_BITMAP_OFFSET_REG		0x10
+#define CHMU_INST0_HOTLIST_OFFSET_REG			0x18
+
+#define CHMU_INST0_CFG0_REG				0x40
+#define   CHMU_INST0_CFG0_WHAT_MSK			GENMASK(7, 0)
+#define      CHMU_INST0_CFG0_WHAT_NONTEE_R		1
+#define      CHMU_INST0_CFG0_WHAT_NONTEE_W		2
+#define      CHMU_INST0_CFG0_WHAT_NONTEE_RW		3
+#define      CHMU_INST0_CFG0_WHAT_R			4
+#define      CHMU_INST0_CFG0_WHAT_W			5
+#define      CHMU_INST0_CFG0_WHAT_RW			6
+#define   CHMU_INST0_CFG0_RAND_DOWNSAMP_EN		BIT(8)
+#define   CHMU_INST0_CFG0_OVRFLW_INT_EN			BIT(9)
+#define   CHMU_INST0_CFG0_FILLTHRESH_INT_EN		BIT(10)
+#define   CHMU_INST0_CFG0_ENABLE			BIT(16)
+#define   CHMU_INST0_CFG0_RESET_COUNTERS		BIT(17)
+#define   CHMU_INST0_CFG0_HOTNESS_THRESH_MSK		GENMASK_ULL(63, 32)
+#define CHMU_INST0_CFG1_REG				0x48
+#define   CHMU_INST0_CFG1_UNIT_SIZE_MSK			GENMASK(31, 0)
+#define   CHMU_INST0_CFG1_DS_FACTOR_MSK			GENMASK(35, 32)
+#define   CHMU_INST0_CFG1_MODE_MSK			GENMASK_ULL(47, 40)
+#define   CHMU_INST0_CFG1_EPOCH_SCALE_MSK		GENMASK_ULL(51, 48)
+#define   CHMU_INST0_CFG1_EPOCH_MULT_MSK		GENMASK_ULL(63, 52)
+#define CHMU_INST0_CFG2_REG				0x50
+#define   CHMU_INST0_CFG2_FILLTHRESH_THRESHOLD_MSK	GENMASK(15, 0)
+
+#define CHMU_INST0_STATUS_REG				0x60
+#define   CHMU_INST0_STATUS_ENABLED			BIT(0)
+#define   CHMU_INST0_STATUS_OP_INPROG_MSK		GENMASK(31, 16)
+#define     CHMU_INST0_STATUS_OP_INPROG_NONE		0
+#define     CHMU_INST0_STATUS_OP_INPROG_ENABLE		1
+#define     CHMU_INST0_STATUS_OP_INPROG_DISABLE		2
+#define     CHMU_INST0_STATUS_OP_INPROG_RESET		3
+#define   CHMU_INST0_STATUS_COUNTER_WIDTH_MSK		GENMASK_ULL(39, 32)
+#define   CHMU_INST0_STATUS_OVRFLW			BIT_ULL(40)
+#define   CHMU_INST0_STATUS_FILLTHRESH			BIT_ULL(41)
+
+/* 2 byte registers */
+#define CHMU_INST0_HEAD_REG				0x68
+#define CHMU_INST0_TAIL_REG				0x6A
+
+/* CFG attribute bit mappings */
+#define CXL_HMU_ATTR_CONFIG_EPOCH_TYPE_MASK GENMASK(1, 0)
+#define CXL_HMU_ATTR_CONFIG_ACCESS_TYPE_MASK GENMASK(9, 2)
+#define CXL_HMU_ATTR_CONFIG_EPOCH_SCALE_MASK GENMASK(13, 10)
+#define CXL_HMU_ATTR_CONFIG_EPOCH_MULT_MASK GENMASK(25, 14)
+#define CXL_HMU_ATTR_CONFIG_RANDOM_DS_MASK BIT(26)
+#define CXL_HMU_ATTR_CONFIG_DS_FACTOR_MASK GENMASK_ULL(34, 27)
+
+#define CXL_HMU_ATTR_CONFIG1_HOTNESS_THRESH_MASK GENMASK(31, 0)
+#define CXL_HMU_ATTR_CONFIG1_HOTNESS_GRANUAL_MASK GENMASK_ULL(63, 32)
+
+/* In multiples of 256MiB */
+#define CXL_HMU_ATTR_CONFIG2_DPA_BASE_MASK GENMASK(31, 0)
+#define CXL_HMU_ATTR_CONFIG2_DPA_SIZE_MASK GENMASK_ULL(63, 32)
+
+/* Range bitmap registers at offset 0x10 + Range Config Bitmap offset */
+/* Hotlist registers at offset 0x10 + Hotlist Register offset */
+static int cxl_hmu_cpuhp_state_num;
+
+enum cxl_hmu_reporting_mode {
+	CHMU_MODE_EPOCH = 0,
+	CHMU_MODE_ALWAYS_ON = 1,
+};
+
+struct cxl_hmu_info {
+	struct pmu pmu;
+	struct perf_output_handle handle;
+	void __iomem *base;
+	struct hlist_node node;
+	int irq;
+	int on_cpu;
+	u32 hot_thresh;
+	u32 hot_gran; /* power of 2, 256 to 2 GiB */
+	/* For now use a range rather than a bitmap, chunks of 256MiB */
+	u32 range_base;
+	u32 range_num;
+	enum cxl_hmu_reporting_mode reporting_mode;
+	u8 m2s_requests_to_track;
+	u8 ds_factor_pow2;
+	u8 epoch_scale;
+	u16 epoch_mult;
+	bool randomized_ds;
+
+	/* Protect both the device state for RMW and the pmu state */
+	spinlock_t lock;
+};
+
+#define pmu_to_cxl_hmu(p) container_of(p, struct cxl_hmu_info, pmu)
+
+/* descriptor for the aux buffer */
+struct cxl_hmu_buf {
+	size_t length;
+	int nr_pages;
+	void *base;
+	long pos;
+};
+
+static ssize_t cpumask_show(struct device *dev, struct device_attribute *attr,
+			    char *buf)
+{
+	struct cxl_hmu_info *hmu = dev_get_drvdata(dev);
+
+	return cpumap_print_to_pagebuf(true, buf, cpumask_of(hmu->on_cpu));
+}
+static DEVICE_ATTR_RO(cpumask);
+
+static struct attribute *cxl_hmu_cpumask_attrs[] = {
+	&dev_attr_cpumask.attr,
+	NULL
+};
+
+static const struct attribute_group cxl_hmu_cpumask_attr_group = {
+	.attrs = cxl_hmu_cpumask_attrs,
+};
+
+/* Sized fields to future proof based on space in spec */
+PMU_FORMAT_ATTR(epoch_type, "config:0-1"); /* 2 bits to future proof */
+PMU_FORMAT_ATTR(access_type, "config:2-9");
+PMU_FORMAT_ATTR(epoch_scale, "config:10-13");
+PMU_FORMAT_ATTR(epoch_multiplier, "config:14-25");
+PMU_FORMAT_ATTR(randomized_downsampling, "config:26-26");
+PMU_FORMAT_ATTR(downsampling_factor, "config:27-34");
+
+PMU_FORMAT_ATTR(hotness_threshold, "config1:0-31");
+PMU_FORMAT_ATTR(hotness_granual, "config1:32-63");
+
+/* RFC this is a bitmap can we control it better? */
+PMU_FORMAT_ATTR(range_base, "config2:0-31");
+PMU_FORMAT_ATTR(range_size, "config2:32-63");
+static struct attribute *cxl_hmu_format_attrs[] = {
+	&format_attr_epoch_type.attr,
+	&format_attr_access_type.attr,
+	&format_attr_epoch_scale.attr,
+	&format_attr_epoch_multiplier.attr,
+	&format_attr_randomized_downsampling.attr,
+	&format_attr_downsampling_factor.attr,
+	&format_attr_hotness_threshold.attr,
+	&format_attr_hotness_granual.attr,
+	&format_attr_range_base.attr,
+	&format_attr_range_size.attr,
+	NULL
+};
+
+static struct attribute_group cxl_hmu_format_attr_group = {
+	.name = "format",
+	.attrs = cxl_hmu_format_attrs,
+};
+
+static const struct attribute_group *cxl_hmu_groups[] = {
+	&cxl_hmu_cpumask_attr_group,
+	&cxl_hmu_format_attr_group,
+	NULL
+};
+
+static int cxl_hmu_event_init(struct perf_event *event)
+{
+	struct cxl_hmu_info *hmu = pmu_to_cxl_hmu(event->pmu);
+	struct device *dev = event->pmu->dev;
+	u32 gran_sup;
+	u16 ds_sup;
+	u64 cap0, cap1;
+	u64 epoch_min, epoch_max, epoch;
+	u64 hotlist_offset = readq(hmu->base + CHMU_INST0_HOTLIST_OFFSET_REG);
+	u64 bitmap_offset = readq(hmu->base + CHMU_INST0_RANGE_BITMAP_OFFSET_REG);
+
+	if (event->attr.type != hmu->pmu.type)
+		return -ENOENT;
+
+	if (event->cpu < 0) {
+		dev_info(dev, "Per-task mode not supported\n");
+		return -EOPNOTSUPP;
+	}
+
+	if (event->attach_state & PERF_ATTACH_TASK)
+		return -EOPNOTSUPP;
+
+	cap0 = readq(hmu->base + CHMU_INST0_CAP0_REG);
+	cap1 = readq(hmu->base + CHMU_INST0_CAP1_REG);
+
+	switch (FIELD_GET(CXL_HMU_ATTR_CONFIG_EPOCH_TYPE_MASK,
+		event->attr.config)) {
+	case 0:
+		if (!FIELD_GET(CHMU_INST0_CAP1_EPOCH_SUP, cap1))
+			return -EOPNOTSUPP;
+		hmu->reporting_mode = CHMU_MODE_EPOCH;
+		break;
+	case 1:
+		if (!FIELD_GET(CHMU_INST0_CAP1_ALWAYS_ON_SUP, cap1))
+			return -EOPNOTSUPP;
+		hmu->reporting_mode = CHMU_MODE_ALWAYS_ON;
+		break;
+	default:
+		dev_dbg(dev, "Tried for a non existent type\n");
+		return -EINVAL;
+	}
+	hmu->randomized_ds = FIELD_GET(CXL_HMU_ATTR_CONFIG_RANDOM_DS_MASK,
+				      event->attr.config);
+	if (hmu->randomized_ds && !FIELD_GET(CHMU_INST0_CAP1_RAND_DOWNSAMP_SUP, cap1)) {
+		dev_info(dev, "Randomized downsampling not supported\n");
+		return -EOPNOTSUPP;
+	}
+
+	/* RFC: sanity check against currently defined or not? */
+	hmu->m2s_requests_to_track = FIELD_GET(CXL_HMU_ATTR_CONFIG_ACCESS_TYPE_MASK,
+					       event->attr.config);
+	if (hmu->m2s_requests_to_track < CHMU_INST0_CFG0_WHAT_NONTEE_R ||
+	    hmu->m2s_requests_to_track > CHMU_INST0_CFG0_WHAT_RW) {
+		dev_dbg(dev, "Requested a reserved type to track\n");
+		return -EINVAL;
+	}
+
+	hmu->hot_thresh = FIELD_GET(CXL_HMU_ATTR_CONFIG1_HOTNESS_THRESH_MASK,
+					   event->attr.config1);
+
+	hmu->hot_gran = FIELD_GET(CXL_HMU_ATTR_CONFIG1_HOTNESS_GRANUAL_MASK,
+					 event->attr.config1);
+
+	gran_sup = FIELD_GET(CHMU_INST0_CAP1_UNIT_SIZE_MSK, cap1);
+	/* Default to smallest granule if not specified; ffs() is 1-based */
+	if (hmu->hot_gran == 0 && gran_sup)
+		hmu->hot_gran = 8 + ffs(gran_sup) - 1;
+
+	if (hmu->hot_gran < 8) {
+		dev_dbg(dev, "Granule less than 256 bytes, not valid in CXL 3.2\n");
+		return -EINVAL;
+	}
+
+	if (!((1 << (hmu->hot_gran - 8)) & gran_sup)) {
+		dev_dbg(dev, "Granule %d not supported, supported mask %x\n",
+			hmu->hot_gran - 8, gran_sup);
+		return -EOPNOTSUPP;
+	}
+
+	ds_sup = FIELD_GET(CHMU_INST0_CAP1_DOWNSAMP_MSK, cap1);
+	hmu->ds_factor_pow2 = FIELD_GET(CXL_HMU_ATTR_CONFIG_DS_FACTOR_MASK,
+					event->attr.config);
+	if (!((1 << hmu->ds_factor_pow2) & ds_sup)) {
+		/* Special case: default of 0 not supported, so use smallest DS possible */
+		if (hmu->ds_factor_pow2 == 0 && ds_sup) {
+			hmu->ds_factor_pow2 = ffs(ds_sup) - 1;
+			dev_dbg(dev, "Downsampling set to default min of %d\n",
+				hmu->ds_factor_pow2);
+		} else {
+			dev_dbg(dev, "Downsampling %d not supported, supported mask %x\n",
+				hmu->ds_factor_pow2, ds_sup);
+			return -EOPNOTSUPP;
+		}
+	}
+
+	hmu->epoch_scale = FIELD_GET(CXL_HMU_ATTR_CONFIG_EPOCH_SCALE_MASK,
+				     event->attr.config);
+
+	hmu->epoch_mult = FIELD_GET(CXL_HMU_ATTR_CONFIG_EPOCH_MULT_MASK,
+				    event->attr.config);
+
+	/* Default to what? */
+	if (hmu->epoch_mult == 0 && hmu->epoch_scale == 0) {
+		hmu->epoch_scale = FIELD_GET(CHMU_INST0_CAP0_EPOCH_MIN_SCALE_MSK, cap0);
+		hmu->epoch_mult = FIELD_GET(CHMU_INST0_CAP0_EPOCH_MIN_MULT_MSK, cap0);
+	}
+	if (hmu->epoch_mult == 0)
+		return -EINVAL;
+
+	/* Units of 10us: scale encoding 1 is 100us, 4 is 100ms */
+	epoch_min = int_pow(10, FIELD_GET(CHMU_INST0_CAP0_EPOCH_MIN_SCALE_MSK, cap0)) *
+		(u64)FIELD_GET(CHMU_INST0_CAP0_EPOCH_MIN_MULT_MSK, cap0);
+	epoch_max = int_pow(10, FIELD_GET(CHMU_INST0_CAP0_EPOCH_MAX_SCALE_MSK, cap0)) *
+		(u64)FIELD_GET(CHMU_INST0_CAP0_EPOCH_MAX_MULT_MSK, cap0);
+	epoch = int_pow(10, hmu->epoch_scale) * (u64)hmu->epoch_mult;
+
+	if (epoch > epoch_max || epoch < epoch_min) {
+		dev_dbg(dev, "out of range %llu %llu %llu\n",
+			epoch, epoch_max, epoch_min);
+		return -EINVAL;
+	}
+
+	hmu->range_base = FIELD_GET(CXL_HMU_ATTR_CONFIG2_DPA_BASE_MASK,
+				    event->attr.config2);
+	hmu->range_num = FIELD_GET(CXL_HMU_ATTR_CONFIG2_DPA_SIZE_MASK,
+				   event->attr.config2);
+
+	if (hmu->range_num == 0) {
+		/* Set a default of 'everything' */
+		hmu->range_num = (hotlist_offset - bitmap_offset) * 8;
+	}
+	/* TODO - pass in better DPA range info from parent driver */
+	if ((u64)hmu->range_base + hmu->range_num >
+	    (hotlist_offset - bitmap_offset) * 8) {
+		dev_dbg(dev, "Requested range that this HMU can't track. Can track 0x%llx, asked for 0x%x to 0x%x\n",
+			(hotlist_offset - bitmap_offset) * 8,
+			hmu->range_base, hmu->range_base + hmu->range_num);
+		return -EINVAL;
+	}
+
+	return 0;
+}
+
+static int cxl_hmu_update_aux(struct cxl_hmu_info *hmu, bool stop)
+{
+	struct perf_output_handle *handle = &hmu->handle;
+	struct cxl_hmu_buf *buf = perf_get_aux(handle);
+	struct perf_event *event = handle->event;
+	size_t size = 0;
+	size_t tocopy, tocopy2;
+
+	u64 offset = readq(hmu->base + CHMU_INST0_HOTLIST_OFFSET_REG);
+	u16 head = readw(hmu->base + CHMU_INST0_HEAD_REG);
+	u16 tail = readw(hmu->base + CHMU_INST0_TAIL_REG);
+	u8 count_width = FIELD_GET(CHMU_INST0_STATUS_COUNTER_WIDTH_MSK,
+				   readq(hmu->base + CHMU_INST0_STATUS_REG));
+	u16 top = FIELD_GET(CHMU_INST0_CAP0_HOTLIST_SIZE_MSK,
+			    readq(hmu->base + CHMU_INST0_CAP0_REG));
+	/* 16 bytes of header - arbitrary choice! */
+#define CHMU_HEADER0_SIZE_MASK GENMASK(15, 0)
+#define CHMU_HEADER0_COUNT_WIDTH GENMASK(23, 16)
+	u64 header[2];
+
+	if (tail > head) {
+		tocopy = min_t(size_t, (tail - head) * 8,
+			       buf->length - buf->pos - sizeof(header));
+		header[0] = FIELD_PREP(CHMU_HEADER0_SIZE_MASK, tocopy / 8) |
+			    FIELD_PREP(CHMU_HEADER0_COUNT_WIDTH, count_width);
+		header[1] = 0xDEADBEEF;
+		if (tocopy) {
+			memcpy(buf->base + buf->pos, header, sizeof(header));
+			size += sizeof(header);
+			buf->pos += sizeof(header);
+			memcpy_fromio(buf->base + buf->pos,
+				      hmu->base + offset + head * 8, tocopy);
+			size += tocopy;
+			buf->pos += tocopy;
+		}
+
+	} else if (tail < head) { /* wrap around */
+		tocopy = min_t(size_t, (top - head) * 8,
+			       buf->length - buf->pos - sizeof(header));
+		tocopy2 = min_t(size_t, tail * 8,
+				buf->length - buf->pos - tocopy - sizeof(header));
+		header[0] = FIELD_PREP(CHMU_HEADER0_SIZE_MASK, (tocopy + tocopy2) / 8) |
+			    FIELD_PREP(CHMU_HEADER0_COUNT_WIDTH, count_width);
+		header[1] = 0xDEADBEEF;
+		if (tocopy) {
+			memcpy(buf->base + buf->pos, header, sizeof(header));
+			size += sizeof(header);
+			buf->pos += sizeof(header);
+			memcpy_fromio(buf->base + buf->pos,
+				      hmu->base + offset + head * 8, tocopy);
+			size += tocopy;
+			buf->pos += tocopy;
+
+		}
+
+		if (tocopy2) {
+			memcpy_fromio(buf->base + buf->pos,
+				      hmu->base + offset, tocopy2);
+			size += tocopy2;
+			buf->pos += tocopy2;
+		}
+	} /* may be no data */
+
+	perf_aux_output_end(handle, size);
+	if (buf->pos == buf->length)
+		return -EINVAL; /* FULL */
+
+	/* Do this after the space check, so the buffer on device will not overwrite */
+	writew(tail, hmu->base + CHMU_INST0_HEAD_REG);
+
+	if (!stop) {
+		buf = perf_aux_output_begin(handle, event);
+		if (!buf)
+			return -EINVAL;
+		buf->pos = handle->head % buf->length;
+	}
+	return 0;
+}
+
+static int __cxl_hmu_start(struct perf_event *event, int flags)
+{
+	struct cxl_hmu_info *hmu = pmu_to_cxl_hmu(event->pmu);
+	struct hw_perf_event *hwc = &event->hw;
+	struct device *dev = event->pmu->dev;
+	struct cxl_hmu_buf *buf;
+	int cpu = event->cpu;
+	u64 val, status, bitmap_base;
+	int ret, i;
+	u16 list_len = FIELD_GET(CHMU_INST0_CAP0_HOTLIST_SIZE_MSK,
+				 readq(hmu->base + CHMU_INST0_CAP0_REG));
+
+	hwc->state = 0;
+	status = readq(hmu->base + CHMU_INST0_STATUS_REG);
+	if (FIELD_GET(CHMU_INST0_STATUS_ENABLED, status)) {
+		dev_dbg(dev, "trace already started\n");
+		return -EBUSY;
+	}
+	/* TODO: Figure out what to do as very likely this is shared
+	 *  - Hopefully only with other HMU instances
+	 */
+	ret = irq_set_affinity(hmu->irq, cpumask_of(cpu));
+	if (ret)
+		dev_warn(dev, "failed to set affinity of HMU interrupt\n");
+
+	hmu->on_cpu = cpu;
+
+	buf = perf_aux_output_begin(&hmu->handle, event);
+	if (!buf) {
+		dev_dbg(event->pmu->dev, "aux output begin failed\n");
+		return -EINVAL;
+	}
+
+	buf->pos = hmu->handle.head % buf->length;
+	/* Reset here disrupts sampling with -F, should we avoid doing so? */
+	writeq(FIELD_PREP(CHMU_INST0_CFG0_RESET_COUNTERS, 1),
+		hmu->base + CHMU_INST0_CFG0_REG);
+
+	ret = readq_poll_timeout_atomic(hmu->base + CHMU_INST0_STATUS_REG, status,
+		(FIELD_GET(CHMU_INST0_STATUS_OP_INPROG_MSK, status) == 0),
+		10, 100000);
+	if (ret) {
+		dev_dbg(event->pmu->dev, "Reset timed out\n");
+		return ret;
+	}
+	/* Setup what is being captured */
+	/* Type of capture, granularity etc */
+
+	val = FIELD_PREP(CHMU_INST0_CFG1_UNIT_SIZE_MSK, hmu->hot_gran) |
+		FIELD_PREP(CHMU_INST0_CFG1_DS_FACTOR_MSK, hmu->ds_factor_pow2) |
+		FIELD_PREP(CHMU_INST0_CFG1_MODE_MSK, hmu->reporting_mode) |
+		FIELD_PREP(CHMU_INST0_CFG1_EPOCH_SCALE_MSK, hmu->epoch_scale) |
+		FIELD_PREP(CHMU_INST0_CFG1_EPOCH_MULT_MSK, hmu->epoch_mult);
+	writeq(val, hmu->base + CHMU_INST0_CFG1_REG);
+
+	val = 0;
+	bitmap_base = readq(hmu->base + CHMU_INST0_RANGE_BITMAP_OFFSET_REG);
+	for (i = hmu->range_base; i < hmu->range_base + hmu->range_num; i++) {
+		val |= BIT(i % 64);
+		if (i % 64 == 63) {
+			writeq(val, hmu->base + bitmap_base + (i / 64) * 8);
+			val = 0;
+		}
+	}
+	/* Potential duplicate write that doesn't matter */
+	writeq(val, hmu->base + bitmap_base + (i / 64) * 8);
+
+	/* Set notification threshold to half of buffer */
+	val = FIELD_PREP(CHMU_INST0_CFG2_FILLTHRESH_THRESHOLD_MSK,
+			 list_len / 2);
+	writeq(val, hmu->base + CHMU_INST0_CFG2_REG);
+
+	/*
+	 * RFC: Only after granule is set can the width be known - so can only check here,
+	 * or program granule size earlier just to see if it will work here.
+	 */
+	status = readq(hmu->base + CHMU_INST0_STATUS_REG);
+	if (hmu->hot_thresh >= (1ULL << (64 - FIELD_GET(CHMU_INST0_STATUS_COUNTER_WIDTH_MSK, status))))
+		return -EINVAL;
+	/* Start the unit up */
+	val = FIELD_PREP(CHMU_INST0_CFG0_WHAT_MSK, hmu->m2s_requests_to_track) |
+		FIELD_PREP(CHMU_INST0_CFG0_RAND_DOWNSAMP_EN,
+			   hmu->randomized_ds ? 1 : 0) |
+		FIELD_PREP(CHMU_INST0_CFG0_OVRFLW_INT_EN, 1) |
+		FIELD_PREP(CHMU_INST0_CFG0_FILLTHRESH_INT_EN, 1) |
+		FIELD_PREP(CHMU_INST0_CFG0_ENABLE, 1) |
+		FIELD_PREP(CHMU_INST0_CFG0_HOTNESS_THRESH_MSK, hmu->hot_thresh);
+	writeq(val, hmu->base + CHMU_INST0_CFG0_REG);
+
+	/* Poll status register for enablement to complete */
+	ret = readq_poll_timeout_atomic(hmu->base + CHMU_INST0_STATUS_REG, status,
+		(FIELD_GET(CHMU_INST0_STATUS_OP_INPROG_MSK, status) == 0),
+		10, 100000);
+	if (ret) {
+		dev_info(event->pmu->dev, "Enable timed out\n");
+		return ret;
+	}
+
+	return 0;
+}
+
+static void cxl_hmu_start(struct perf_event *event, int flags)
+{
+	struct cxl_hmu_info *hmu = pmu_to_cxl_hmu(event->pmu);
+	int ret;
+
+	guard(spinlock)(&hmu->lock);
+
+	ret = __cxl_hmu_start(event, flags);
+	if (ret)
+		event->hw.state |= PERF_HES_STOPPED;
+}
+
+static void cxl_hmu_stop(struct perf_event *event, int flags)
+{
+	struct cxl_hmu_info *hmu = pmu_to_cxl_hmu(event->pmu);
+	struct hw_perf_event *hwc = &event->hw;
+	u64 status, val;
+	int ret;
+
+	if (hwc->state & PERF_HES_STOPPED)
+		return;
+
+	guard(spinlock)(&hmu->lock);
+	status = readq(hmu->base + CHMU_INST0_STATUS_REG);
+	if (FIELD_GET(CHMU_INST0_STATUS_ENABLED, status)) {
+		/* Stop the HMU instance */
+		val = readq(hmu->base + CHMU_INST0_CFG0_REG);
+		val &= ~CHMU_INST0_CFG0_ENABLE;
+		writeq(val, hmu->base + CHMU_INST0_CFG0_REG);
+
+		ret = readq_poll_timeout_atomic(hmu->base + CHMU_INST0_STATUS_REG, status,
+			(FIELD_GET(CHMU_INST0_STATUS_OP_INPROG_MSK, status) == 0),
+			10, 100000);
+		if (ret) {
+			dev_info(event->pmu->dev, "Disable timed out\n");
+			return;
+		}
+
+		cxl_hmu_update_aux(hmu, true);
+	}
+}
+
+static void cxl_hmu_read(struct perf_event *event)
+{
+	/* Nothing to do */
+}
+
+static int cxl_hmu_add(struct perf_event *event, int flags)
+{
+	struct hw_perf_event *hwc = &event->hw;
+
+	hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
+	if (flags & PERF_EF_START) {
+		cxl_hmu_start(event, PERF_EF_RELOAD);
+		if (hwc->state & PERF_HES_STOPPED)
+			return -EINVAL;
+	}
+	return 0;
+}
+
+/*
+ * There is a lot to do in here, but using a thread is not
+ * currently possible for a perf PMU driver.
+ */
+static irqreturn_t cxl_hmu_irq(int irq, void *data)
+{
+	struct cxl_hmu_info *info = data;
+	u64 status;
+	int ret;
+
+	status = readq(info->base + CHMU_INST0_STATUS_REG);
+	if (!FIELD_GET(CHMU_INST0_STATUS_OVRFLW, status) &&
+	    !FIELD_GET(CHMU_INST0_STATUS_FILLTHRESH, status))
+		return IRQ_NONE;
+
+	ret = cxl_hmu_update_aux(info, false);
+	if (ret)
+		dev_err(info->pmu.dev, "interrupt update failed\n");
+
+	/*
+	 * They are level interrupts so should trigger on next fill
+	 * hence should be no problem with races.
+	 */
+	writeq(status, info->base + CHMU_INST0_STATUS_REG);
+
+	return IRQ_HANDLED;
+}
+
+static void cxl_hmu_del(struct perf_event *event, int flags)
+{
+	cxl_hmu_stop(event, PERF_EF_UPDATE);
+}
+
+static void *cxl_hmu_setup_aux(struct perf_event *event, void **pages,
+			int nr_pages, bool overwrite)
+{
+	int i;
+
+	if (overwrite) {
+		dev_warn(event->pmu->dev, "Overwrite mode is not supported\n");
+		return NULL;
+	}
+
+	if (nr_pages < 1)
+		return NULL;
+
+	struct cxl_hmu_buf *buf __free(kfree) =
+		kzalloc(sizeof(*buf), GFP_KERNEL);
+	if (!buf)
+		return NULL;
+
+	struct page **pagelist __free(kfree) =
+		kcalloc(nr_pages, sizeof(*pagelist), GFP_KERNEL);
+	if (!pagelist)
+		return NULL;
+
+	for (i = 0; i < nr_pages; i++)
+		pagelist[i] = virt_to_page(pages[i]);
+
+	buf->base = vmap(pagelist, nr_pages, VM_MAP, PAGE_KERNEL);
+	if (!buf->base)
+		return NULL;
+
+	buf->nr_pages = nr_pages;
+	buf->length = nr_pages * PAGE_SIZE;
+	buf->pos = 0;
+
+	return_ptr(buf);
+}
+
+static void cxl_hmu_free_aux(void *aux)
+{
+	struct cxl_hmu_buf *buf = aux;
+
+	vunmap(buf->base);
+	kfree(buf);
+}
+
+static void cxl_hmu_perf_unregister(void *_info)
+{
+	struct cxl_hmu_info *info = _info;
+
+	perf_pmu_unregister(&info->pmu);
+}
+
+static void cxl_hmu_cpuhp_remove(void *_info)
+{
+	struct cxl_hmu_info *info = _info;
+
+	cpuhp_state_remove_instance_nocalls(cxl_hmu_cpuhp_state_num,
+					    &info->node);
+}
+
+static int cxl_hmu_probe(struct device *dev)
+{
+	struct pci_dev *pdev = to_pci_dev(dev->parent);
+	struct cxl_hmu *hmu = to_cxl_hmu(dev);
+	int i, rc;
+
+	int num_inst = FIELD_GET(CHMU_COMMON_CAP0_NUMINST_MSK,
+				 readq(hmu->base + CHMU_COMMON_CAP0_REG));
+	int inst_len = FIELD_GET(CHMU_COMMON_CAP1_INSTLEN_MSK,
+				 readq(hmu->base + CHMU_COMMON_CAP1_REG));
+
+	for (i = 0; i < num_inst; i++) {
+		struct cxl_hmu_info *info;
+		char *pmu_name;
+		int msg_num;
+		u64 val;
+
+		info = devm_kzalloc(dev, sizeof(*info), GFP_KERNEL);
+		if (!info)
+			return -ENOMEM;
+
+		dev_set_drvdata(dev, info);
+		info->on_cpu = -1;
+		info->base = hmu->base + 0x10 + inst_len * i;
+
+		val = readq(info->base + CHMU_INST0_CAP0_REG);
+		msg_num = FIELD_GET(CHMU_INST0_CAP0_MSI_N_MSK, val);
+
+		/* TODO add polling support - for now require threshold */
+		if (!FIELD_GET(CHMU_INST0_CAP0_FILLTHRESH_CAP, val)) {
+			devm_kfree(dev, info);
+			continue;
+		}
+
+		spin_lock_init(&info->lock);
+
+		pmu_name = devm_kasprintf(dev, GFP_KERNEL,
+					  "cxl_hmu_mem%d.%d.%d",
+					  hmu->assoc_id, hmu->index, i);
+		if (!pmu_name)
+			return -ENOMEM;
+
+		info->pmu = (struct pmu) {
+			.name = pmu_name,
+			.parent = dev,
+			.module = THIS_MODULE,
+			.capabilities = PERF_PMU_CAP_EXCLUSIVE |
+					PERF_PMU_CAP_NO_EXCLUDE,
+			.task_ctx_nr = perf_sw_context,
+			.attr_groups = cxl_hmu_groups,
+			.event_init = cxl_hmu_event_init,
+			.setup_aux = cxl_hmu_setup_aux,
+			.free_aux = cxl_hmu_free_aux,
+			.start = cxl_hmu_start,
+			.stop = cxl_hmu_stop,
+			.add = cxl_hmu_add,
+			.del = cxl_hmu_del,
+			.read = cxl_hmu_read,
+		};
+		rc = pci_irq_vector(pdev, msg_num);
+		if (rc < 0)
+			return rc;
+		info->irq = rc;
+
+		/*
+		 * Whilst there is a 'strong' recommendation that the interrupt
+		 * should not be shared it is not a requirement.
+		 * Can we support IRQF_SHARED on a PMU?
+		 */
+		rc = devm_request_irq(dev, info->irq, cxl_hmu_irq,
+				      IRQF_NO_THREAD | IRQF_NOBALANCING,
+				      pmu_name, info);
+		if (rc)
+			return rc;
+
+		rc = cpuhp_state_add_instance(cxl_hmu_cpuhp_state_num,
+					      &info->node);
+		if (rc)
+			return rc;
+
+		rc = devm_add_action_or_reset(dev, cxl_hmu_cpuhp_remove, info);
+		if (rc)
+			return rc;
+
+		rc = perf_pmu_register(&info->pmu, info->pmu.name, -1);
+		if (rc)
+			return rc;
+
+		rc = devm_add_action_or_reset(dev, cxl_hmu_perf_unregister,
+					      info);
+		if (rc)
+			return rc;
+	}
+	return 0;
+}
+
+static struct cxl_driver cxl_hmu_driver = {
+	.name = "cxl_hmu",
+	.probe = cxl_hmu_probe,
+	.id = CXL_DEVICE_HMU,
+};
+
+static int cxl_hmu_online_cpu(unsigned int cpu, struct hlist_node *node)
+{
+	struct cxl_hmu_info *info =
+		hlist_entry_safe(node, struct cxl_hmu_info, node);
+
+	if (info->on_cpu != -1)
+		return 0;
+
+	info->on_cpu = cpu;
+
+	WARN_ON(irq_set_affinity(info->irq, cpumask_of(cpu)));
+
+	return 0;
+}
+
+static int cxl_hmu_offline_cpu(unsigned int cpu, struct hlist_node *node)
+{
+	struct cxl_hmu_info *info =
+		hlist_entry_safe(node, struct cxl_hmu_info, node);
+	unsigned int target;
+
+	if (info->on_cpu != cpu)
+		return 0;
+
+	info->on_cpu = -1;
+	target = cpumask_any_but(cpu_online_mask, cpu);
+	if (target >= nr_cpu_ids) {
+		dev_err(info->pmu.dev, "Unable to find a suitable CPU\n");
+		return 0;
+	}
+
+	perf_pmu_migrate_context(&info->pmu, cpu, target);
+	info->on_cpu = target;
+	/*
+	 * CPU HP lock is held so we should be guaranteed that this CPU hasn't
+	 * yet gone away.
+	 */
+	WARN_ON(irq_set_affinity(info->irq, cpumask_of(target)));
+	return 0;
+}
+
+static __init int cxl_hmu_init(void)
+{
+	int rc;
+
+	rc = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
+				     "AP_PERF_CXL_HMU_ONLINE",
+				     cxl_hmu_online_cpu, cxl_hmu_offline_cpu);
+	if (rc < 0)
+		return rc;
+	cxl_hmu_cpuhp_state_num = rc;
+
+	rc = cxl_driver_register(&cxl_hmu_driver);
+	if (rc)
+		cpuhp_remove_multi_state(cxl_hmu_cpuhp_state_num);
+
+	return rc;
+}
+
+static __exit void cxl_hmu_exit(void)
+{
+	cxl_driver_unregister(&cxl_hmu_driver);
+	cpuhp_remove_multi_state(cxl_hmu_cpuhp_state_num);
+}
+
+MODULE_AUTHOR("Jonathan Cameron <Jonathan.Cameron@huawei.com>");
+MODULE_DESCRIPTION("CXL Hotness Monitor Driver");
+MODULE_LICENSE("GPL");
+MODULE_IMPORT_NS(CXL);
+module_init(cxl_hmu_init);
+module_exit(cxl_hmu_exit);
+MODULE_ALIAS_CXL(CXL_DEVICE_HMU);


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC PATCH 3/4] perf: Add support for CXL Hotness Monitoring Units (CHMU)
  2024-11-21 10:18 ` [RFC PATCH 3/4] perf: Add support for CXL Hotness Monitoring Units (CHMU) Jonathan Cameron
@ 2025-08-08  8:29   ` wangyuquan
  0 siblings, 0 replies; 27+ messages in thread
From: wangyuquan @ 2025-08-08  8:29 UTC (permalink / raw)
  To: tangtao1634, linux-cxl, linux-mm, linux-perf-users, linux-kernel
  Cc: Jonathan Cameron, linuxarm, tongtiangen, Yicong Yang, Niyas Sait,
	ajayjoshi, Vandana Salve, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Ira Weiny, Dan Williams, Alexander Shishkin,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Gregory Price, Huang Ying

From: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Based closely on existing support for hisi_ptt.
Provides basic record and report --dump-raw-traces support.

Example output.  With a counter_width of 16 (0x10) the least significant
2 bytes (4 hex digits) are the counter value and the unit index is bits 16-63.
Here all units are over the threshold and the indexes are 0,1,2 etc.

. ... CXL_HMU data: size 33512 bytes
Header 0: units: 29c counter_width 10
Header 1 : deadbeef
0000000000000283
0000000000010364
0000000000020366
000000000003033c
0000000000040343
00000000000502ff
000000000006030d
000000000007031a

Note this is definitely RFC quality code. Before merging, the code
duplication that already exists, and that this code makes worse, should be
reduced.

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 tools/perf/arch/arm/util/auxtrace.c |  58 +++++
 tools/perf/arch/x86/util/auxtrace.c |  76 ++++++
 tools/perf/util/Build               |   1 +
 tools/perf/util/auxtrace.c          |   4 +
 tools/perf/util/auxtrace.h          |   1 +
 tools/perf/util/cxl-hmu.c           | 367 ++++++++++++++++++++++++++++
 tools/perf/util/cxl-hmu.h           |  18 ++
 7 files changed, 525 insertions(+)

diff --git a/tools/perf/arch/arm/util/auxtrace.c b/tools/perf/arch/arm/util/auxtrace.c
index 3b8eca0ffb17..07ff41800808 100644
--- a/tools/perf/arch/arm/util/auxtrace.c
+++ b/tools/perf/arch/arm/util/auxtrace.c
@@ -18,6 +18,7 @@
 #include "cs-etm.h"
 #include "arm-spe.h"
 #include "hisi-ptt.h"
+#include "cxl-hmu.h"
 
 static struct perf_pmu **find_all_arm_spe_pmus(int *nr_spes, int *err)
 {
@@ -99,6 +100,49 @@ static struct perf_pmu **find_all_hisi_ptt_pmus(int *nr_ptts, int *err)
 	return hisi_ptt_pmus;
 }
 
+static struct perf_pmu **find_all_cxl_hmu_pmus(int *nr_hmus, int *err)
+{
+	struct perf_pmu **cxl_hmu_pmus = NULL;
+	struct dirent *dent;
+	char path[PATH_MAX];
+	DIR *dir = NULL;
+	int idx = 0;
+
+	perf_pmu__event_source_devices_scnprintf(path, sizeof(path));
+	dir = opendir(path);
+	if (!dir) {
+		*err = -EINVAL;
+		return NULL;
+	}
+
+	while ((dent = readdir(dir))) {
+		if (strstr(dent->d_name, "cxl_hmu"))
+			(*nr_hmus)++;
+	}
+
+	if (!(*nr_hmus))
+		goto out;
+
+	cxl_hmu_pmus = zalloc(sizeof(struct perf_pmu *) * (*nr_hmus));
+	if (!cxl_hmu_pmus) {
+		*err = -ENOMEM;
+		goto out;
+	}
+
+	rewinddir(dir);
+	while ((dent = readdir(dir))) {
+		if (strstr(dent->d_name, "cxl_hmu") && idx < *nr_hmus) {
+			cxl_hmu_pmus[idx] = perf_pmus__find(dent->d_name);
+			if (cxl_hmu_pmus[idx])
+				idx++;
+		}
+	}
+
+out:
+	closedir(dir);
+	return cxl_hmu_pmus;
+}
+
 static struct perf_pmu *find_pmu_for_event(struct perf_pmu **pmus,
 					   int pmu_nr, struct evsel *evsel)
 {
@@ -121,13 +165,16 @@ struct auxtrace_record
 	struct perf_pmu	*cs_etm_pmu = NULL;
 	struct perf_pmu **arm_spe_pmus = NULL;
 	struct perf_pmu **hisi_ptt_pmus = NULL;
+	struct perf_pmu **chmu_pmus = NULL;
 	struct evsel *evsel;
 	struct perf_pmu *found_etm = NULL;
 	struct perf_pmu *found_spe = NULL;
 	struct perf_pmu *found_ptt = NULL;
+	struct perf_pmu *found_chmu = NULL;
 	int auxtrace_event_cnt = 0;
 	int nr_spes = 0;
 	int nr_ptts = 0;
+	int nr_chmus = 0;
 
 	if (!evlist)
 		return NULL;
@@ -135,6 +182,7 @@ struct auxtrace_record
 	cs_etm_pmu = perf_pmus__find(CORESIGHT_ETM_PMU_NAME);
 	arm_spe_pmus = find_all_arm_spe_pmus(&nr_spes, err);
 	hisi_ptt_pmus = find_all_hisi_ptt_pmus(&nr_ptts, err);
+	chmu_pmus = find_all_cxl_hmu_pmus(&nr_chmus, err);
 
 	evlist__for_each_entry(evlist, evsel) {
 		if (cs_etm_pmu && !found_etm)
@@ -145,10 +193,14 @@ struct auxtrace_record
 
 		if (hisi_ptt_pmus && !found_ptt)
 			found_ptt = find_pmu_for_event(hisi_ptt_pmus, nr_ptts, evsel);
+
+		if (chmu_pmus && !found_chmu)
+			found_chmu = find_pmu_for_event(chmu_pmus, nr_chmus, evsel);
 	}
 
 	free(arm_spe_pmus);
 	free(hisi_ptt_pmus);
+	free(chmu_pmus);
 
 	if (found_etm)
 		auxtrace_event_cnt++;
@@ -159,6 +211,9 @@ struct auxtrace_record
 	if (found_ptt)
 		auxtrace_event_cnt++;
 
+	if (found_chmu)
+		auxtrace_event_cnt++;
+
 	if (auxtrace_event_cnt > 1) {
 		pr_err("Concurrent AUX trace operation not currently supported\n");
 		*err = -EOPNOTSUPP;
@@ -174,6 +229,9 @@ struct auxtrace_record
 
 	if (found_ptt)
 		return hisi_ptt_recording_init(err, found_ptt);
+
+	if (found_chmu)
+		return chmu_recording_init(err, found_chmu);
 #endif
 
 	/*
diff --git a/tools/perf/arch/x86/util/auxtrace.c b/tools/perf/arch/x86/util/auxtrace.c
index 354780ff1605..30d84ce41394 100644
--- a/tools/perf/arch/x86/util/auxtrace.c
+++ b/tools/perf/arch/x86/util/auxtrace.c
@@ -4,6 +4,7 @@
  * Copyright (c) 2013-2014, Intel Corporation.
  */
 
+#include <dirent.h>
 #include <errno.h>
 #include <stdbool.h>
 
@@ -14,6 +15,7 @@
 #include "../../../util/auxtrace.h"
 #include "../../../util/intel-pt.h"
 #include "../../../util/intel-bts.h"
+#include "../../../util/cxl-hmu.h"
 #include "../../../util/evlist.h"
 
 static
@@ -51,14 +53,88 @@ struct auxtrace_record *auxtrace_record__init_intel(struct evlist *evlist,
 	return NULL;
 }
 
+static struct perf_pmu **find_all_cxl_hmu_pmus(int *nr_hmus, int *err)
+{
+	struct perf_pmu **cxl_hmu_pmus = NULL;
+	struct dirent *dent;
+	char path[PATH_MAX];
+	DIR *dir = NULL;
+	int idx = 0;
+
+	perf_pmu__event_source_devices_scnprintf(path, sizeof(path));
+	dir = opendir(path);
+	if (!dir) {
+		*err = -EINVAL;
+		return NULL;
+	}
+
+	while ((dent = readdir(dir))) {
+		if (strstr(dent->d_name, "cxl_hmu"))
+			(*nr_hmus)++;
+	}
+
+	if (!(*nr_hmus))
+		goto out;
+
+	cxl_hmu_pmus = zalloc(sizeof(struct perf_pmu *) * (*nr_hmus));
+	if (!cxl_hmu_pmus) {
+		*err = -ENOMEM;
+		goto out;
+	}
+
+	rewinddir(dir);
+	while ((dent = readdir(dir))) {
+		if (strstr(dent->d_name, "cxl_hmu") && idx < *nr_hmus) {
+			cxl_hmu_pmus[idx] = perf_pmus__find(dent->d_name);
+			if (cxl_hmu_pmus[idx])
+				idx++;
+		}
+	}
+
+out:
+	closedir(dir);
+	return cxl_hmu_pmus;
+}
+
+static struct perf_pmu *find_pmu_for_event(struct perf_pmu **pmus,
+					   int pmu_nr, struct evsel *evsel)
+{
+	int i;
+
+	if (!pmus)
+		return NULL;
+
+	for (i = 0; i < pmu_nr; i++) {
+		if (evsel->core.attr.type == pmus[i]->type)
+			return pmus[i];
+	}
+
+	return NULL;
+}
+
 struct auxtrace_record *auxtrace_record__init(struct evlist *evlist,
 					      int *err)
 {
 	char buffer[64];
 	int ret;
+	struct perf_pmu **chmu_pmus = NULL;
+	struct perf_pmu *found_chmu = NULL;
+	struct evsel *evsel;
+	int nr_chmus = 0;
 
 	*err = 0;
 
+	chmu_pmus = find_all_cxl_hmu_pmus(&nr_chmus, err);
+
+	evlist__for_each_entry(evlist, evsel) {
+		if (chmu_pmus && !found_chmu)
+			found_chmu = find_pmu_for_event(chmu_pmus, nr_chmus, evsel);
+	}
+	free(chmu_pmus);
+
+	if (found_chmu)
+		return chmu_recording_init(err, found_chmu);
+
 	ret = get_cpuid(buffer, sizeof(buffer));
 	if (ret) {
 		*err = ret;
diff --git a/tools/perf/util/Build b/tools/perf/util/Build
index dc616292b2dd..40c645fd0cb3 100644
--- a/tools/perf/util/Build
+++ b/tools/perf/util/Build
@@ -127,6 +127,7 @@ perf-util-$(CONFIG_AUXTRACE) += arm-spe.o
 perf-util-$(CONFIG_AUXTRACE) += arm-spe-decoder/
 perf-util-$(CONFIG_AUXTRACE) += hisi-ptt.o
 perf-util-$(CONFIG_AUXTRACE) += hisi-ptt-decoder/
+perf-util-$(CONFIG_AUXTRACE) += cxl-hmu.o
 perf-util-$(CONFIG_AUXTRACE) += s390-cpumsf.o
 
 ifdef CONFIG_LIBOPENCSD
diff --git a/tools/perf/util/auxtrace.c b/tools/perf/util/auxtrace.c
index ca8682966fae..0efc15732a03 100644
--- a/tools/perf/util/auxtrace.c
+++ b/tools/perf/util/auxtrace.c
@@ -53,6 +53,7 @@
 #include "intel-bts.h"
 #include "arm-spe.h"
 #include "hisi-ptt.h"
+#include "cxl-hmu.h"
 #include "s390-cpumsf.h"
 #include "util/mmap.h"
 
@@ -1333,6 +1334,9 @@ int perf_event__process_auxtrace_info(struct perf_session *session,
 	case PERF_AUXTRACE_HISI_PTT:
 		err = hisi_ptt_process_auxtrace_info(event, session);
 		break;
+	case PERF_AUXTRACE_CXL_HMU:
+		err = cxl_hmu_process_auxtrace_info(event, session);
+		break;
 	case PERF_AUXTRACE_UNKNOWN:
 	default:
 		return -EINVAL;
diff --git a/tools/perf/util/auxtrace.h b/tools/perf/util/auxtrace.h
index a1895a4f530b..8a7a5b7dc2d6 100644
--- a/tools/perf/util/auxtrace.h
+++ b/tools/perf/util/auxtrace.h
@@ -49,6 +49,7 @@ enum auxtrace_type {
 	PERF_AUXTRACE_ARM_SPE,
 	PERF_AUXTRACE_S390_CPUMSF,
 	PERF_AUXTRACE_HISI_PTT,
+	PERF_AUXTRACE_CXL_HMU,
 };
 
 enum itrace_period_type {
diff --git a/tools/perf/util/cxl-hmu.c b/tools/perf/util/cxl-hmu.c
new file mode 100644
index 000000000000..31844f16e4f9
--- /dev/null
+++ b/tools/perf/util/cxl-hmu.c
@@ -0,0 +1,367 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * CXL HMU support
+ * Copyright (c) 2024 Huawei
+ *
+ * Based on:
+ * HiSilicon PCIe Trace and Tuning (PTT) support
+ * Copyright (c) 2022 HiSilicon Technologies Co., Ltd.
+ */
+
+#include <byteswap.h>
+#include <endian.h>
+#include <errno.h>
+#include <inttypes.h>
+#include <linux/bitops.h>
+#include <linux/kernel.h>
+#include <linux/log2.h>
+#include <linux/types.h>
+#include <linux/zalloc.h>
+#include <stdlib.h>
+#include <unistd.h>
+
+#include "auxtrace.h"
+#include "color.h"
+#include "debug.h"
+#include "evlist.h"
+#include "evsel.h"
+#include "cxl-hmu.h"
+#include "machine.h"
+#include "record.h"
+#include "session.h"
+#include "tool.h"
+#include "tsc.h"
+#include <internal/lib.h>
+
+#define KiB(x) ((x) * 1024)
+#define MiB(x) ((x) * 1024 * 1024)
+
+struct chmu_recording {
+	struct auxtrace_record	itr;
+	struct perf_pmu *chmu_pmu;
+	struct evlist *evlist;
+};
+
+static size_t
+chmu_info_priv_size(struct auxtrace_record *itr __maybe_unused,
+			struct evlist *evlist __maybe_unused)
+{
+	return CXL_HMU_AUXTRACE_PRIV_SIZE;
+}
+
+static int chmu_info_fill(struct auxtrace_record *itr,
+			      struct perf_session *session,
+			      struct perf_record_auxtrace_info *auxtrace_info,
+			      size_t priv_size)
+{
+	struct chmu_recording *pttr =
+			container_of(itr, struct chmu_recording, itr);
+	struct perf_pmu *chmu_pmu = pttr->chmu_pmu;
+
+	if (priv_size != CXL_HMU_AUXTRACE_PRIV_SIZE)
+		return -EINVAL;
+
+	if (!session->evlist->core.nr_mmaps)
+		return -EINVAL;
+
+	auxtrace_info->type = PERF_AUXTRACE_CXL_HMU;
+	auxtrace_info->priv[0] = chmu_pmu->type;
+
+	return 0;
+}
+
+static int chmu_set_auxtrace_mmap_page(struct record_opts *opts)
+{
+	bool privileged = perf_event_paranoid_check(-1);
+
+	if (!opts->full_auxtrace)
+		return 0;
+
+	if (opts->full_auxtrace && !opts->auxtrace_mmap_pages) {
+		if (privileged) {
+			opts->auxtrace_mmap_pages = MiB(16) / page_size;
+		} else {
+			opts->auxtrace_mmap_pages = KiB(128) / page_size;
+			if (opts->mmap_pages == UINT_MAX)
+				opts->mmap_pages = KiB(256) / page_size;
+		}
+	}
+
+	/* Validate auxtrace_mmap_pages */
+	if (opts->auxtrace_mmap_pages) {
+		size_t sz = opts->auxtrace_mmap_pages * (size_t)page_size;
+		size_t min_sz = KiB(8);
+
+		if (sz < min_sz || !is_power_of_2(sz)) {
+			pr_err("Invalid mmap size for CXL_HMU: must be at least %zuKiB and a power of 2\n",
+			       min_sz / 1024);
+			return -EINVAL;
+		}
+	}
+
+	return 0;
+}
+
+static int chmu_recording_options(struct auxtrace_record *itr,
+				      struct evlist *evlist,
+				      struct record_opts *opts)
+{
+	struct chmu_recording *pttr =
+			container_of(itr, struct chmu_recording, itr);
+	struct perf_pmu *chmu_pmu = pttr->chmu_pmu;
+	struct evsel *evsel, *chmu_evsel = NULL;
+	struct evsel *tracking_evsel;
+	int err;
+
+	pttr->evlist = evlist;
+	evlist__for_each_entry(evlist, evsel) {
+		if (evsel->core.attr.type == chmu_pmu->type) {
+			if (chmu_evsel) {
+				pr_err("There may be only one cxl_hmu x event\n");
+				return -EINVAL;
+			}
+			evsel->core.attr.freq = 0;
+			evsel->core.attr.sample_period = 1;
+			evsel->needs_auxtrace_mmap = true;
+			chmu_evsel = evsel;
+			opts->full_auxtrace = true;
+		}
+	}
+
+	err = chmu_set_auxtrace_mmap_page(opts);
+	if (err)
+		return err;
+	/*
+	 * To obtain the auxtrace buffer file descriptor, the auxtrace event
+	 * must come first.
+	 */
+	evlist__to_front(evlist, chmu_evsel);
+	evsel__set_sample_bit(chmu_evsel, TIME);
+
+	/* Add dummy event to keep tracking */
+	err = parse_event(evlist, "dummy:u");
+	if (err)
+		return err;
+
+	tracking_evsel = evlist__last(evlist);
+	evlist__set_tracking_event(evlist, tracking_evsel);
+
+	tracking_evsel->core.attr.freq = 0;
+	tracking_evsel->core.attr.sample_period = 1;
+	evsel__set_sample_bit(tracking_evsel, TIME);
+
+	return 0;
+}
+
+static u64 chmu_reference(struct auxtrace_record *itr __maybe_unused)
+{
+	return rdtsc();
+}
+
+static void chmu_recording_free(struct auxtrace_record *itr)
+{
+	struct chmu_recording *pttr =
+	  container_of(itr, struct chmu_recording, itr);
+
+	free(pttr);
+}
+
+struct auxtrace_record *chmu_recording_init(int *err,
+						struct perf_pmu *chmu_pmu)
+{
+	struct chmu_recording *pttr;
+
+	if (!chmu_pmu) {
+		*err = -ENODEV;
+		return NULL;
+	}
+
+	pttr = zalloc(sizeof(*pttr));
+	if (!pttr) {
+		*err = -ENOMEM;
+		return NULL;
+	}
+
+	pttr->chmu_pmu = chmu_pmu;
+	pttr->itr.recording_options = chmu_recording_options;
+	pttr->itr.info_priv_size = chmu_info_priv_size;
+	pttr->itr.info_fill = chmu_info_fill;
+	pttr->itr.free = chmu_recording_free;
+	pttr->itr.reference = chmu_reference;
+	pttr->itr.read_finish = auxtrace_record__read_finish;
+	pttr->itr.alignment = 0;
+
+	*err = 0;
+	return &pttr->itr;
+}
+
+struct cxl_hmu {
+	struct auxtrace auxtrace;
+	u32 auxtrace_type;
+	struct perf_session *session;
+	struct machine *machine;
+	u32 pmu_type;
+};
+
+struct cxl_hmu_queue {
+	struct cxl_hmu *hmu;
+	struct auxtrace_buffer *buffer;
+};
+
+static void cxl_hmu_dump(struct cxl_hmu *hmu __maybe_unused,
+			  unsigned char *buf, size_t len)
+{
+	const char *color = PERF_COLOR_BLUE;
+	size_t pos = 0;
+	size_t packet_offset = 0, hotlist_entries_in_packet;
+
+	len = round_down(len, 8);
+	color_fprintf(stdout, color, ". ... CXL_HMU data: size %zu bytes\n",
+		      len);
+
+	while (len > 0) {
+		if (!packet_offset) {
+			hotlist_entries_in_packet = ((uint64_t *)(buf + pos))[0] & 0xFFFF;
+			color_fprintf(stdout, PERF_COLOR_BLUE,
+				      "Header 0: units: %zx counter_width %" PRIx64 "\n",
+				      hotlist_entries_in_packet,
+				      (((uint64_t *)(buf + pos))[0] >> 16) & 0xFF);
+		} else if (packet_offset == 1) {
+			color_fprintf(stdout, PERF_COLOR_BLUE,
+				      "Header 1 : %lx\n", ((uint64_t *)(buf + pos))[0]);
+		} else {
+			color_fprintf(stdout, PERF_COLOR_BLUE,
+				      "%016lx\n", ((uint64_t *)(buf + pos))[0]);
+		}
+		pos += 8;
+		len -= 8;
+		packet_offset++;
+		if (packet_offset == hotlist_entries_in_packet + 2)
+			packet_offset = 0;
+	}
+}
+
+static void cxl_hmu_dump_event(struct cxl_hmu *hmu, unsigned char *buf,
+			       size_t len)
+{
+	printf(".\n");
+
+	cxl_hmu_dump(hmu, buf, len);
+}
+
+static int cxl_hmu_process_event(struct perf_session *session __maybe_unused,
+				  union perf_event *event __maybe_unused,
+				  struct perf_sample *sample __maybe_unused,
+				  const struct perf_tool *tool __maybe_unused)
+{
+	return 0;
+}
+
+static int cxl_hmu_process_auxtrace_event(struct perf_session *session,
+					  union perf_event *event,
+					  const struct perf_tool *tool __maybe_unused)
+{
+	struct cxl_hmu *hmu = container_of(session->auxtrace, struct cxl_hmu,
+					    auxtrace);
+	int fd = perf_data__fd(session->data);
+	int size = event->auxtrace.size;
+	void *data = malloc(size);
+	off_t data_offset;
+	int err;
+
+	if (!data) {
+		printf("no data\n");
+		return -errno;
+	}
+
+	if (perf_data__is_pipe(session->data)) {
+		data_offset = 0;
+	} else {
+		data_offset = lseek(fd, 0, SEEK_CUR);
+		if (data_offset == -1) {
+			free(data);
+			printf("failed to seek\n");
+			return -errno;
+		}
+	}
+
+	err = readn(fd, data, size);
+	if (err != (ssize_t)size) {
+		free(data);
+		printf("failed to read\n");
+		return -errno;
+	}
+
+	if (dump_trace)
+		cxl_hmu_dump_event(hmu, data, size);
+
+	free(data);
+	return 0;
+}
+
+static int cxl_hmu_flush(struct perf_session *session __maybe_unused,
+			 const struct perf_tool *tool __maybe_unused)
+{
+	return 0;
+}
+
+static void cxl_hmu_free_events(struct perf_session *session __maybe_unused)
+{
+}
+
+static void cxl_hmu_free(struct perf_session *session)
+{
+	struct cxl_hmu *hmu = container_of(session->auxtrace, struct cxl_hmu,
+					    auxtrace);
+
+	session->auxtrace = NULL;
+	free(hmu);
+}
+
+static bool cxl_hmu_evsel_is_auxtrace(struct perf_session *session,
+				       struct evsel *evsel)
+{
+	struct cxl_hmu *hmu = container_of(session->auxtrace, struct cxl_hmu, auxtrace);
+
+	return evsel->core.attr.type == hmu->pmu_type;
+}
+
+static void cxl_hmu_print_info(__u64 type)
+{
+	if (!dump_trace)
+		return;
+
+	fprintf(stdout, "  PMU Type           %" PRId64 "\n", (s64) type);
+}
+
+int cxl_hmu_process_auxtrace_info(union perf_event *event,
+				   struct perf_session *session)
+{
+	struct perf_record_auxtrace_info *auxtrace_info = &event->auxtrace_info;
+	struct cxl_hmu *hmu;
+
+	if (auxtrace_info->header.size < CXL_HMU_AUXTRACE_PRIV_SIZE +
+				sizeof(struct perf_record_auxtrace_info))
+		return -EINVAL;
+
+	hmu = zalloc(sizeof(*hmu));
+	if (!hmu)
+		return -ENOMEM;
+
+	hmu->session = session;
+	hmu->machine = &session->machines.host; /* No kvm support */
+	hmu->auxtrace_type = auxtrace_info->type;
+	hmu->pmu_type = auxtrace_info->priv[0];
+
+	hmu->auxtrace.process_event = cxl_hmu_process_event;
+	hmu->auxtrace.process_auxtrace_event = cxl_hmu_process_auxtrace_event;
+	hmu->auxtrace.flush_events = cxl_hmu_flush;
+	hmu->auxtrace.free_events = cxl_hmu_free_events;
+	hmu->auxtrace.free = cxl_hmu_free;
+	hmu->auxtrace.evsel_is_auxtrace = cxl_hmu_evsel_is_auxtrace;
+	session->auxtrace = &hmu->auxtrace;
+
+	cxl_hmu_print_info(auxtrace_info->priv[0]);
+
+	return 0;
+}
diff --git a/tools/perf/util/cxl-hmu.h b/tools/perf/util/cxl-hmu.h
new file mode 100644
index 000000000000..9b4d83219711
--- /dev/null
+++ b/tools/perf/util/cxl-hmu.h
@@ -0,0 +1,18 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * CXL Hotness Monitoring Unit Support
+ */
+
+#ifndef INCLUDE__PERF_CXL_HMU_H__
+#define INCLUDE__PERF_CXL_HMU_H__
+
+#define CXL_HMU_PMU_NAME		"cxl_hmu"
+#define CXL_HMU_AUXTRACE_PRIV_SIZE	sizeof(u64)
+
+struct auxtrace_record *chmu_recording_init(int *err,
+					       struct perf_pmu *cxl_hmu_pmu);
+
+int cxl_hmu_process_auxtrace_info(union perf_event *event,
+				   struct perf_session *session);
+
+#endif


^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [RFC PATCH 4/4] hwtrace: Document CXL Hotness Monitoring Unit driver
  2024-11-21 10:18 ` [RFC PATCH 4/4] hwtrace: Document CXL Hotness Monitoring Unit driver Jonathan Cameron
       [not found]   ` <CGME20250103052702epcas5p3f7eea83ac70ba7147e0de7fb30f90a62@epcas5p3.samsung.com>
@ 2025-08-08  8:29   ` wangyuquan
  1 sibling, 0 replies; 27+ messages in thread
From: wangyuquan @ 2025-08-08  8:29 UTC (permalink / raw)
  To: tangtao1634, linux-cxl, linux-mm, linux-perf-users, linux-kernel
  Cc: Jonathan Cameron, linuxarm, tongtiangen, Yicong Yang, Niyas Sait,
	ajayjoshi, Vandana Salve, Davidlohr Bueso, Dave Jiang,
	Alison Schofield, Ira Weiny, Dan Williams, Alexander Shishkin,
	Peter Zijlstra, Ingo Molnar, Arnaldo Carvalho de Melo,
	Mark Rutland, Gregory Price, Huang Ying

From: Jonathan Cameron <Jonathan.Cameron@huawei.com>

Add basic documentation to describe the CXL HMU and the
perf AUX buffer based interfaces.

Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
---
 Documentation/trace/cxl-hmu.rst | 197 ++++++++++++++++++++++++++++++++
 Documentation/trace/index.rst   |   1 +
 2 files changed, 198 insertions(+)

diff --git a/Documentation/trace/cxl-hmu.rst b/Documentation/trace/cxl-hmu.rst
new file mode 100644
index 000000000000..f07a50ba608c
--- /dev/null
+++ b/Documentation/trace/cxl-hmu.rst
@@ -0,0 +1,197 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==================================
+CXL Hotness Monitoring Unit Driver
+==================================
+
+CXL r3.2 introduced the CXL Hotness Monitoring Unit (CHMU). A CHMU allows
+software running on a CXL Host to identify hot memory ranges, that is those with
+higher access frequency relative to other memory ranges.
+
+A given Logical Device (presentation of a CXL memory device seen by a particular
+host) can provide 1 or more CHMU each of which supports 1 or more separately
+programmable CHMU Instances (CHMUI). These CHMUI are mostly independent with
+the exception that there can be restrictions on them tracking the same memory
+regions. The CHMUs are always completely independent.
+The naming of the units is cxl_hmu_memX.Y.Z where memX matches the naming
+of the memory device in /sys/bus/cxl/devices/, Y is the CHMU index and
+Z is the CHMUI index within the CHMU.
+
+Each CHMUI provides a ring buffer structure known as the Hot List from which the
+host can read back entries that describe the hotness of a particular region of
+memory (Hot List Units). A Hot List Unit combines a Unit Address and an access
+count for that address. Converting a Unit Address to a Device Physical Address
+(DPA) requires multiplying by the unit size. Thus, for large unit sizes the
+device may support higher counts. It is these Hot List Units that the driver
+provides via a perf AUX buffer by copying them from PCI BAR space.
+
+The unit size at which hotness is measured is configurable for each CHMUI and
+all measurement is done in Device Physical Address space. To relate this to
+Host Physical Address space the HDM (Host-Managed Device Memory) decoder
+configuration must be taken into account to reflect the placement in a
+CXL Fixed Memory Window and any interleaving.
+
+The CHMUI can support interrupts on fills above a watermark, or on overflow
+of the hotlist.
+
+A CHMUI can support two different basic modes of operation: Epoch and
+Always On. These affect what is placed on the hotlist. Note that the actual
+implementation of tracking is implementation defined and likely to be
+inherently imprecise in that the hottest pages may not be discovered due to
+resource exhaustion and the hotness counts may not represent accurately how
+hot they are. The specification allows for a very high degree of flexibility
+in implementation, important as it is likely that a number of different
+hardware implementations will be chosen to suit particular silicon and accuracy
+budgets.
+
+Operation and configuration
+===========================
+
+An example command line is::
+
+  $ perf record -a -e cxl_hmu_mem0.0.0/epoch_type=0,access_type=6,\
+  hotness_threshold=1024,epoch_multiplier=4,epoch_scale=4,range_base=0,\
+  range_size=1024,randomized_downsampling=0,downsampling_factor=32,\
+  hotness_granual=12/
+
+  $ perf report --dump-raw-traces
+
+which will produce a list of hotlist entries, one per line with a short header
+to provide sufficient information to interpret the entries::
+
+  . ... CXL_HMU data: size 33512 bytes
+  Header 0: units: 29c counter_width 10
+  Header 1 : deadbeef
+  0000000000000283
+  0000000000010364
+  0000000000020366
+  000000000003033c
+  0000000000040343
+  00000000000502ff
+  000000000006030d
+  000000000007031a
+  ...
+
+The least significant counter_width bits (here 16, hex 10) are the counter
+value, all higher bits are the unit index.  Multiply by the unit size
+to get a Device Physical Address.
+
+The parameters are as follows:
+
+epoch_type
+----------
+
+Two values may be supported::
+
+  0 - Epoch based operation
+  1 - Always on operation
+
+
+0. Epoch Based Operation
+~~~~~~~~~~~~~~~~~~~~~~~~
+
+An Epoch is a period of time after which a counter is assessed for hotness.
+
+The device may have a global sense of an Epoch but it may also operate them on
+a per-counter or per-region-of-device basis. This is a function of the
+implementation and is not controllable, but is discoverable. In a global Epoch
+scheme, at the start of each Epoch all counters are zeroed / deallocated.
+Counters are then allocated in a hardware-specific manner and accesses counted.
+At the completion of the Epoch the counters are compared with the configurable
+threshold and entries whose count reaches it are added to the hotlist. A new
+Epoch is then begun with all counters cleared.
+
+In a non-global Epoch scheme, when the Epoch of a given counter begins is not
+specified. An example might be an Epoch for a counter only starting on first
+touch to the relevant memory region.  When a local Epoch ends the counter is
+compared to the threshold and, if appropriate, an entry is added to the hotlist.
+
+Note, in Epoch Based Operation, the counter in the hotlist entry provides
+information on how hot the memory is as the counter for the full Epoch is
+provided.
+
+1. Always on Operation
+~~~~~~~~~~~~~~~~~~~~~~
+
+In this mode, counters may all be reset before enabling the CHMUI. Then
+counters are allocated to particular memory units via a hardware-specific
+method, perhaps on first touch.  When a counter passes the configurable
+hotness threshold an entry is added to the hotlist and that counter is freed
+for reuse.
+
+In this scheme the count provided in the hotlist entry is not useful as it will
+depend only on the configured threshold.
+
+access_type
+-----------
+
+The parameter controls which accesses are counted::
+
+  1 - Non-TEE read only
+  2 - Non-TEE write only
+  3 - Non-TEE read and write
+  4 - TEE and Non-TEE read only
+  5 - TEE and Non-TEE write only
+  6 - TEE and Non-TEE read and write
+
+
+TEE here refers to a trusted execution environment, specifically one that
+results in the T bit being set in the CXL transactions.
+
+
+hotness_granual
+---------------
+
+Unit size at which tracking is performed.  Must be at least 256 bytes but
+hardware may only support some sizes. Expressed as a power of 2, e.g. 12 = 4KiB.
+
+hotness_threshold
+-----------------
+
+This is the minimum counter value that must be reached for the unit to count as
+hot and be added to the hotlist.
+
+The possible range may be dependent on the unit size as a smaller unit size
+requires more bits for the unit address in the hotlist entry, leaving fewer
+available for the hotness counter.
+
+epoch_multiplier and epoch_scale
+--------------------------------
+
+The length of an epoch (in epoch mode) is controlled by these two parameters
+with the decoded epoch_scale multiplied by the epoch_multiplier to give the
+overall epoch length.
+
+epoch_scale::
+
+  1 - 100 usecs
+  2 - 1 msec
+  3 - 10 msecs
+  4 - 100 msecs
+  5 - 1 second
+
+range_base and range_size
+-------------------------
+
+Expressed in terms of the unit size set via hotness_granual. Each CHMUI has a
+bitmap that controls which Device Physical Address space is tracked. Each bit
+represents 256MiB of DPA space.
+
+This interface provides a simple base and size in units of 256MiB to configure
+this bitmap. All bits in the specified range will be set.
+
+downsampling_factor
+-------------------
+
+Hardware may be incapable of counting accesses at full speed or it may be
+desirable to count over a longer period during which the counters would
+overflow.  This control allows selection of a downsampling factor expressed
+as a power of 2 between 1 and 32768.  The default is the minimum supported
+downsampling factor.
+
+randomized_downsampling
+-----------------------
+
+To avoid problems with downsampling when accesses are periodic this option
+allows for an implementation defined randomization of the sampling interval,
+whilst remaining close to the specified downsampling_factor.
diff --git a/Documentation/trace/index.rst b/Documentation/trace/index.rst
index 0b300901fd75..b35ed8e9dfa9 100644
--- a/Documentation/trace/index.rst
+++ b/Documentation/trace/index.rst
@@ -36,3 +36,4 @@ Linux Tracing Technologies
    user_events
    rv/index
    hisi-ptt
+   cxl-hmu

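A note for readers of the documentation above: the raw hotlist entries in the example dump can be split into a unit index and a count with a few lines of C. This is only a minimal sketch. It assumes exactly what the text states: the low counter_width bits hold the count, all higher bits hold the unit index, and hotness_granual is the log2 of the unit size. The names hotlist_entry, decode_entry and unit_index_to_dpa are made up for illustration and are not part of the driver or of perf.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical decoded form of one 64-bit hotlist entry (not a driver type). */
struct hotlist_entry {
	uint64_t unit_index;	/* all bits above counter_width */
	uint64_t count;		/* least significant counter_width bits */
};

/* Split a raw entry using the counter_width reported in "Header 0". */
static inline struct hotlist_entry decode_entry(uint64_t raw,
						unsigned int counter_width)
{
	struct hotlist_entry e;

	/* A width of 0 or >= 64 would make the shifts below meaningless. */
	assert(counter_width > 0 && counter_width < 64);

	e.count = raw & ((UINT64_C(1) << counter_width) - 1);
	e.unit_index = raw >> counter_width;
	return e;
}

/*
 * Multiply the unit index by the unit size to get a Device Physical Address.
 * The unit size is assumed to be 1 << hotness_granual (e.g. 12 -> 4KiB).
 */
static inline uint64_t unit_index_to_dpa(uint64_t unit_index,
					 unsigned int hotness_granual)
{
	assert(hotness_granual < 64);
	return unit_index << hotness_granual;
}
```

With counter_width 16 and hotness_granual=12, the second entry of the example dump, 0000000000010364, decodes to unit index 1 with count 0x364, which corresponds to DPA 0x1000. Translating that DPA to a Host Physical Address still requires the HDM decoder configuration and is not covered here.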

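Two of the configuration parameters documented above involve a small amount of arithmetic, sketched here in C under stated assumptions. The epoch length is taken to be the decoded epoch_scale multiplied by epoch_multiplier, using the epoch_scale table from the documentation. The range bitmap helper assumes that base and size are counted in 256MiB bits, as the range section says, and that the bitmap is held as an array of 64-bit words with bit 0 in the least significant position; that word layout is an assumption of this sketch, not something the patch specifies. Both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Epoch length in microseconds for the documented epoch_scale encodings
 * (1 = 100 usecs ... 5 = 1 second). Returns 0 for an encoding that is not
 * in the documented table.
 */
static inline uint64_t epoch_length_usecs(unsigned int epoch_scale,
					  unsigned int epoch_multiplier)
{
	static const uint64_t scale_usecs[] = {
		[1] = 100,
		[2] = 1000,
		[3] = 10000,
		[4] = 100000,
		[5] = 1000000,
	};

	if (epoch_scale < 1 || epoch_scale > 5)
		return 0;

	return scale_usecs[epoch_scale] * epoch_multiplier;
}

/*
 * Set 'size' consecutive bits starting at bit 'base'. Each bit stands for
 * 256MiB of DPA space. Returns 0 on success, or -1 without touching the
 * bitmap if the range does not fit in 'nwords' 64-bit words.
 */
static inline int range_bitmap_set(uint64_t *bitmap, size_t nwords,
				   uint64_t base, uint64_t size)
{
	uint64_t nbits = (uint64_t)nwords * 64;
	uint64_t i;

	assert(bitmap);

	if (size > nbits || base > nbits - size)
		return -1;

	for (i = 0; i < size; i++) {
		uint64_t bit = base + i;

		bitmap[bit / 64] |= UINT64_C(1) << (bit % 64);
	}
	return 0;
}
```

For the example command line, epoch_scale=4 and epoch_multiplier=4 give an epoch of 400000 usecs, i.e. 400 msecs.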

* Re: [RFC PATCH 1/4] cxl: Register devices for CXL Hotness Monitoring Units (CHMU)
  2024-11-21 10:18 ` [RFC PATCH 1/4] cxl: Register devices for CXL Hotness Monitoring Units (CHMU) Jonathan Cameron
                     ` (2 preceding siblings ...)
  2025-08-08  8:29   ` wangyuquan
@ 2025-08-08  8:45   ` Yuquan Wang
  3 siblings, 0 replies; 27+ messages in thread
From: Yuquan Wang @ 2025-08-08  8:45 UTC (permalink / raw)
  To: tangtao1634, linux-cxl, linux-mm, linux-perf-users, linux-kernel
  Cc: Jonathan Cameron

Sorry for the noise! I was testing sending more than 10 emails through my SMTP server.


> -----Original Message-----
> From: wangyuquan <wangyuquan1236@phytium.com.cn>
> Sent: 2025-08-08 16:29:53 (Friday)
> To: tangtao1634@phytium.com.cn, linux-cxl@vger.kernel.org, linux-mm@kvack.org, linux-perf-users@vger.kernel.org, linux-kernel@vger.kernel.org
> Cc: "Jonathan Cameron" <Jonathan.Cameron@huawei.com>, linuxarm@huawei.com, tongtiangen@huawei.com, "Yicong Yang" <yangyicong@huawei.com>, "Niyas Sait" <niyas.sait@huawei.com>, ajayjoshi@micron.com, "Vandana Salve" <vsalve@micron.com>, "Davidlohr Bueso" <dave@stgolabs.net>, "Dave Jiang" <dave.jiang@intel.com>, "Alison Schofield" <alison.schofield@intel.com>, "Ira Weiny" <ira.weiny@intel.com>, "Dan Williams" <dan.j.williams@intel.com>, "Alexander Shishkin" <alexander.shishkin@linux.intel.com>, "Peter Zijlstra" <peterz@infradead.org>, "Ingo Molnar" <mingo@redhat.com>, "Arnaldo Carvalho de Melo" <acme@kernel.org>, "Mark Rutland" <mark.rutland@arm.com>, "Gregory Price" <gourry@gourry.net>, "Huang Ying" <ying.huang@intel.com>
> Subject: [RFC PATCH 1/4] cxl: Register devices for CXL Hotness Monitoring Units (CHMU)
> 
> From: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> 
> Basic registration using similar approach to how the CPMUs
> are registered.
> 
> Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
> ---
>  drivers/cxl/core/Makefile |  1 +
>  drivers/cxl/core/hmu.c    | 64 +++++++++++++++++++++++++++++++++++++++
>  drivers/cxl/core/regs.c   | 14 +++++++++
>  drivers/cxl/cxl.h         |  4 +++
>  drivers/cxl/cxlpci.h      |  1 +
>  drivers/cxl/hmu.h         | 23 ++++++++++++++
>  drivers/cxl/pci.c         | 26 +++++++++++++++-
>  7 files changed, 132 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cxl/core/Makefile b/drivers/cxl/core/Makefile
> index 9259bcc6773c..d060abb773ae 100644
> --- a/drivers/cxl/core/Makefile
> +++ b/drivers/cxl/core/Makefile
> @@ -12,6 +12,7 @@ cxl_core-y += memdev.o
>  cxl_core-y += mbox.o
>  cxl_core-y += pci.o
>  cxl_core-y += hdm.o
> +cxl_core-y += hmu.o
>  cxl_core-y += pmu.o
>  cxl_core-y += cdat.o
>  cxl_core-$(CONFIG_TRACING) += trace.o
> diff --git a/drivers/cxl/core/hmu.c b/drivers/cxl/core/hmu.c
> new file mode 100644
> index 000000000000..3ee938bb6c05
> --- /dev/null
> +++ b/drivers/cxl/core/hmu.c
> @@ -0,0 +1,64 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/* Copyright(c) 2024 Huawei. All rights reserved. */
> +
> +#include <linux/device.h>
> +#include <linux/slab.h>
> +#include <linux/idr.h>
> +#include <cxlmem.h>
> +#include <hmu.h>
> +#include <cxl.h>
> +#include "core.h"
> +
> +static void cxl_hmu_release(struct device *dev)
> +{
> +	struct cxl_hmu *hmu = to_cxl_hmu(dev);
> +
> +	kfree(hmu);
> +}
> +
> +const struct device_type cxl_hmu_type = {
> +	.name = "cxl_hmu",
> +	.release = cxl_hmu_release,
> +};
> +
> +static void remove_dev(void *dev)
> +{
> +	device_unregister(dev);
> +}
> +
> +int devm_cxl_hmu_add(struct device *parent, struct cxl_hmu_regs *regs,
> +		     int assoc_id, int index)
> +{
> +	struct cxl_hmu *hmu;
> +	struct device *dev;
> +	int rc;
> +
> +	hmu = kzalloc(sizeof(*hmu), GFP_KERNEL);
> +	if (!hmu)
> +		return -ENOMEM;
> +
> +	hmu->assoc_id = assoc_id;
> +	hmu->index = index;
> +	hmu->base = regs->hmu;
> +	dev = &hmu->dev;
> +	device_initialize(dev);
> +	device_set_pm_not_required(dev);
> +	dev->parent = parent;
> +	dev->bus = &cxl_bus_type;
> +	dev->type = &cxl_hmu_type;
> +	rc = dev_set_name(dev, "hmu_mem%d.%d", assoc_id, index);
> +	if (rc)
> +		goto err;
> +
> +	rc = device_add(dev);
> +	if (rc)
> +		goto err;
> +
> +	return devm_add_action_or_reset(parent, remove_dev, dev);
> +
> +err:
> +	put_device(&hmu->dev);
> +	return rc;
> +}
> +EXPORT_SYMBOL_NS_GPL(devm_cxl_hmu_add, CXL);
> +
> diff --git a/drivers/cxl/core/regs.c b/drivers/cxl/core/regs.c
> index e1082e749c69..c12afaa6ef98 100644
> --- a/drivers/cxl/core/regs.c
> +++ b/drivers/cxl/core/regs.c
> @@ -401,6 +401,20 @@ int cxl_map_pmu_regs(struct cxl_register_map *map, struct cxl_pmu_regs *regs)
>  }
>  EXPORT_SYMBOL_NS_GPL(cxl_map_pmu_regs, CXL);
>  
> +int cxl_map_hmu_regs(struct cxl_register_map *map, struct cxl_hmu_regs *regs)
> +{
> +	struct device *dev = map->host;
> +	resource_size_t phys_addr;
> +
> +	phys_addr = map->resource;
> +	regs->hmu = devm_cxl_iomap_block(dev, phys_addr, map->max_size);
> +	if (!regs->hmu)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_NS_GPL(cxl_map_hmu_regs, CXL);
> +
>  static int cxl_map_regblock(struct cxl_register_map *map)
>  {
>  	struct device *host = map->host;
> diff --git a/drivers/cxl/cxl.h b/drivers/cxl/cxl.h
> index 5406e3ab3d4a..8172bc1f7a8d 100644
> --- a/drivers/cxl/cxl.h
> +++ b/drivers/cxl/cxl.h
> @@ -227,6 +227,9 @@ struct cxl_regs {
>  	struct_group_tagged(cxl_pmu_regs, pmu_regs,
>  		void __iomem *pmu;
>  	);
> +	struct_group_tagged(cxl_hmu_regs, hmu_regs,
> +		void __iomem *hmu;
> +	);
>  
>  	/*
>  	 * RCH downstream port specific RAS register
> @@ -292,6 +295,7 @@ int cxl_map_component_regs(const struct cxl_register_map *map,
>  			   unsigned long map_mask);
>  int cxl_map_device_regs(const struct cxl_register_map *map,
>  			struct cxl_device_regs *regs);
> +int cxl_map_hmu_regs(struct cxl_register_map *map, struct cxl_hmu_regs *regs);
>  int cxl_map_pmu_regs(struct cxl_register_map *map, struct cxl_pmu_regs *regs);
>  
>  enum cxl_regloc_type;
> diff --git a/drivers/cxl/cxlpci.h b/drivers/cxl/cxlpci.h
> index 4da07727ab9c..71f5e9620137 100644
> --- a/drivers/cxl/cxlpci.h
> +++ b/drivers/cxl/cxlpci.h
> @@ -67,6 +67,7 @@ enum cxl_regloc_type {
>  	CXL_REGLOC_RBI_VIRT,
>  	CXL_REGLOC_RBI_MEMDEV,
>  	CXL_REGLOC_RBI_PMU,
> +	CXL_REGLOC_RBI_HMU,
>  	CXL_REGLOC_RBI_TYPES
>  };
>  
> diff --git a/drivers/cxl/hmu.h b/drivers/cxl/hmu.h
> new file mode 100644
> index 000000000000..c4798ed9764b
> --- /dev/null
> +++ b/drivers/cxl/hmu.h
> @@ -0,0 +1,23 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +/*
> + * Copyright(c) 2024 Huawei
> + * CXL Specification rev 3.2 Section 8.2.8 (CHMU Register Interface)
> + */
> +#ifndef CXL_HMU_H
> +#define CXL_HMU_H
> +#include <linux/device.h>
> +
> +#define CXL_HMU_REGMAP_SIZE 0xe00 /* Table 8-32 CXL 3.0 specification */
> +struct cxl_hmu {
> +	struct device dev;
> +	void __iomem *base;
> +	int assoc_id;
> +	int index;
> +};
> +
> +#define to_cxl_hmu(dev) container_of(dev, struct cxl_hmu, dev)
> +struct cxl_hmu_regs;
> +int devm_cxl_hmu_add(struct device *parent, struct cxl_hmu_regs *regs,
> +		     int assoc_id, int idx);
> +
> +#endif
> diff --git a/drivers/cxl/pci.c b/drivers/cxl/pci.c
> index 188412d45e0d..e89ea9d3f007 100644
> --- a/drivers/cxl/pci.c
> +++ b/drivers/cxl/pci.c
> @@ -15,6 +15,7 @@
>  #include "cxlmem.h"
>  #include "cxlpci.h"
>  #include "cxl.h"
> +#include "hmu.h"
>  #include "pmu.h"
>  
>  /**
> @@ -814,7 +815,7 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  	struct cxl_dev_state *cxlds;
>  	struct cxl_register_map map;
>  	struct cxl_memdev *cxlmd;
> -	int i, rc, pmu_count;
> +	int i, rc, hmu_count, pmu_count;
>  	bool irq_avail;
>  
>  	/*
> @@ -938,6 +939,29 @@ static int cxl_pci_probe(struct pci_dev *pdev, const struct pci_device_id *id)
>  		}
>  	}
>  
> +	hmu_count = cxl_count_regblock(pdev, CXL_REGLOC_RBI_HMU);
> +	for (i = 0; i < hmu_count; i++) {
> +		struct cxl_hmu_regs hmu_regs;
> +
> +		rc = cxl_find_regblock_instance(pdev, CXL_REGLOC_RBI_HMU, &map, i);
> +		if (rc) {
> +			dev_dbg(&pdev->dev, "Could not find HMU regblock\n");
> +			break;
> +		}
> +
> +		rc = cxl_map_hmu_regs(&map, &hmu_regs);
> +		if (rc) {
> +			dev_dbg(&pdev->dev, "Could not map HMU regs\n");
> +			break;
> +		}
> +
> +		rc = devm_cxl_hmu_add(cxlds->dev, &hmu_regs, cxlmd->id, i);
> +		if (rc) {
> +			dev_dbg(&pdev->dev, "Could not add HMU instance\n");
> +			break;
> +		}
> +	}
> +
>  	rc = cxl_event_config(host_bridge, mds, irq_avail);
>  	if (rc)
>  		return rc;


Information Security Notice: The information contained in this mail is solely property of the sender's organization. This mail communication is confidential. Recipients named above are obligated to maintain secrecy and are not permitted to disclose the contents of this communication to others.


Thread overview: 27+ messages
2024-11-21 10:18 [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver Jonathan Cameron
2024-11-21 10:18 ` [RFC PATCH 1/4] cxl: Register devices for CXL Hotness Monitoring Units (CHMU) Jonathan Cameron
     [not found]   ` <CGME20250103052421epcas5p4a1a917ba5d367dfccec91d4522666ca0@epcas5p4.samsung.com>
2025-01-03  5:16     ` Neeraj Kumar
2025-01-03 12:07       ` Jonathan Cameron
2025-06-19  1:47   ` Yuquan Wang
2025-06-19 10:11     ` Jonathan Cameron
2025-08-08  8:29   ` wangyuquan
2025-08-08  8:45   ` Yuquan Wang
2024-11-21 10:18 ` [RFC PATCH 2/4] cxl: Hotness Monitoring Unit via a Perf AUX Buffer Jonathan Cameron
2025-08-08  8:29   ` wangyuquan
2024-11-21 10:18 ` [RFC PATCH 3/4] perf: Add support for CXL Hotness Monitoring Units (CHMU) Jonathan Cameron
2025-08-08  8:29   ` wangyuquan
2024-11-21 10:18 ` [RFC PATCH 4/4] hwtrace: Document CXL Hotness Monitoring Unit driver Jonathan Cameron
     [not found]   ` <CGME20250103052702epcas5p3f7eea83ac70ba7147e0de7fb30f90a62@epcas5p3.samsung.com>
2025-01-03  5:19     ` Neeraj Kumar
2025-08-08  8:29   ` wangyuquan
2024-11-21 13:47 ` [RFC PATCH 0/4] CXL Hotness Monitoring Unit perf driver Jonathan Cameron
2024-11-21 14:24 ` Gregory Price
2024-11-21 14:58   ` Jonathan Cameron
2024-11-21 15:49     ` Gregory Price
2024-11-22 20:08     ` SeongJae Park
2024-11-27 16:34 ` Jonathan Cameron
2024-12-04 12:35   ` [EXT] " Ajay Joshi
     [not found]   ` <CGME20250103053521epcas5p30cd4abba59d695664335b03ba806c56d@epcas5p3.samsung.com>
2025-01-03  5:27     ` Neeraj Kumar
2025-01-15 13:42       ` Jonathan Cameron
2025-06-19  3:59         ` Yuquan Wang
2025-06-19 10:49           ` Jonathan Cameron
2025-01-24 17:40 ` Jonathan Cameron
