* [PATCH v2 1/8] perf/arm_cspmu: nvidia: Rename doc to Tegra241
2026-02-18 14:58 [PATCH v2 0/8] perf: add NVIDIA Tegra410 Uncore PMU support Besar Wicaksono
@ 2026-02-18 14:58 ` Besar Wicaksono
2026-02-18 14:58 ` [PATCH v2 2/8] perf/arm_cspmu: nvidia: Add Tegra410 UCF PMU Besar Wicaksono
` (6 subsequent siblings)
7 siblings, 0 replies; 18+ messages in thread
From: Besar Wicaksono @ 2026-02-18 14:58 UTC (permalink / raw)
To: will, suzuki.poulose, robin.murphy, ilkka
Cc: linux-arm-kernel, linux-kernel, linux-tegra, mark.rutland,
treding, jonathanh, vsethi, rwiley, sdonthineni, skelley, ywan,
mochs, nirmoyd, Besar Wicaksono
The documentation in nvidia-pmu.rst describes PMUs specific
to the NVIDIA Tegra241 SoC. Rename the file after this specific
SoC to better distinguish it from other NVIDIA SoCs.
Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
---
Documentation/admin-guide/perf/index.rst | 2 +-
.../perf/{nvidia-pmu.rst => nvidia-tegra241-pmu.rst} | 8 ++++----
2 files changed, 5 insertions(+), 5 deletions(-)
rename Documentation/admin-guide/perf/{nvidia-pmu.rst => nvidia-tegra241-pmu.rst} (98%)
diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst
index 47d9a3df6329..c407bb44b08e 100644
--- a/Documentation/admin-guide/perf/index.rst
+++ b/Documentation/admin-guide/perf/index.rst
@@ -24,7 +24,7 @@ Performance monitor support
thunderx2-pmu
alibaba_pmu
dwc_pcie_pmu
- nvidia-pmu
+ nvidia-tegra241-pmu
meson-ddr-pmu
cxl
ampere_cspmu
diff --git a/Documentation/admin-guide/perf/nvidia-pmu.rst b/Documentation/admin-guide/perf/nvidia-tegra241-pmu.rst
similarity index 98%
rename from Documentation/admin-guide/perf/nvidia-pmu.rst
rename to Documentation/admin-guide/perf/nvidia-tegra241-pmu.rst
index f538ef67e0e8..fad5bc4cee6c 100644
--- a/Documentation/admin-guide/perf/nvidia-pmu.rst
+++ b/Documentation/admin-guide/perf/nvidia-tegra241-pmu.rst
@@ -1,8 +1,8 @@
-=========================================================
-NVIDIA Tegra SoC Uncore Performance Monitoring Unit (PMU)
-=========================================================
+============================================================
+NVIDIA Tegra241 SoC Uncore Performance Monitoring Unit (PMU)
+============================================================
-The NVIDIA Tegra SoC includes various system PMUs to measure key performance
+The NVIDIA Tegra241 SoC includes various system PMUs to measure key performance
metrics like memory bandwidth, latency, and utilization:
* Scalable Coherency Fabric (SCF)
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* [PATCH v2 2/8] perf/arm_cspmu: nvidia: Add Tegra410 UCF PMU
2026-02-18 14:58 [PATCH v2 0/8] perf: add NVIDIA Tegra410 Uncore PMU support Besar Wicaksono
2026-02-18 14:58 ` [PATCH v2 1/8] perf/arm_cspmu: nvidia: Rename doc to Tegra241 Besar Wicaksono
@ 2026-02-18 14:58 ` Besar Wicaksono
2026-02-19 9:43 ` Jonathan Cameron
2026-02-18 14:58 ` [PATCH v2 3/8] perf/arm_cspmu: Add arm_cspmu_acpi_dev_get Besar Wicaksono
` (5 subsequent siblings)
7 siblings, 1 reply; 18+ messages in thread
From: Besar Wicaksono @ 2026-02-18 14:58 UTC (permalink / raw)
To: will, suzuki.poulose, robin.murphy, ilkka
Cc: linux-arm-kernel, linux-kernel, linux-tegra, mark.rutland,
treding, jonathanh, vsethi, rwiley, sdonthineni, skelley, ywan,
mochs, nirmoyd, Besar Wicaksono
The Unified Coherence Fabric (UCF) contains the last level cache
and cache coherent interconnect in the Tegra410 SoC. The PMU in
this device can be used to capture events related to accesses to
the last level cache and memory from different sources.
Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
---
Documentation/admin-guide/perf/index.rst | 1 +
.../admin-guide/perf/nvidia-tegra410-pmu.rst | 106 ++++++++++++++++++
drivers/perf/arm_cspmu/nvidia_cspmu.c | 90 ++++++++++++++-
3 files changed, 196 insertions(+), 1 deletion(-)
create mode 100644 Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
diff --git a/Documentation/admin-guide/perf/index.rst b/Documentation/admin-guide/perf/index.rst
index c407bb44b08e..aa12708ddb96 100644
--- a/Documentation/admin-guide/perf/index.rst
+++ b/Documentation/admin-guide/perf/index.rst
@@ -25,6 +25,7 @@ Performance monitor support
alibaba_pmu
dwc_pcie_pmu
nvidia-tegra241-pmu
+ nvidia-tegra410-pmu
meson-ddr-pmu
cxl
ampere_cspmu
diff --git a/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst b/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
new file mode 100644
index 000000000000..7b7ba5700ca1
--- /dev/null
+++ b/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
@@ -0,0 +1,106 @@
+============================================================
+NVIDIA Tegra410 SoC Uncore Performance Monitoring Unit (PMU)
+============================================================
+
+The NVIDIA Tegra410 SoC includes various system PMUs to measure key performance
+metrics like memory bandwidth, latency, and utilization:
+
+* Unified Coherence Fabric (UCF)
+
+PMU Driver
+----------
+
+The PMU driver describes the available events and configuration of each PMU
+in sysfs. See the sections below for the sysfs path of each PMU. Like other
+uncore PMU drivers, the driver provides a "cpumask" sysfs attribute to show
+the CPU id used to handle the PMU events. There is also an "associated_cpus"
+sysfs attribute, which contains a list of CPUs associated with the PMU instance.
+
+UCF PMU
+-------
+
+The Unified Coherence Fabric (UCF) in the NVIDIA Tegra410 SoC serves as a
+distributed last level cache for CPU memory and CXL memory, and as a cache
+coherent interconnect supporting hardware coherence across multiple coherently
+caching agents, including:
+
+ * CPU clusters
+ * GPU
+ * PCIe Ordering Controller Unit (OCU)
+ * Other IO-coherent requesters
+
+The events and configuration options of this PMU device are described in sysfs,
+see /sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>.
+
+Some of the events available in this PMU can be used to measure bandwidth and
+utilization:
+
+ * slc_access_rd: count the number of read requests to SLC.
+ * slc_access_wr: count the number of write requests to SLC.
+ * slc_bytes_rd: count the number of bytes transferred by slc_access_rd.
+ * slc_bytes_wr: count the number of bytes transferred by slc_access_wr.
+ * mem_access_rd: count the number of read requests to local or remote memory.
+ * mem_access_wr: count the number of write requests to local or remote memory.
+ * mem_bytes_rd: count the number of bytes transferred by mem_access_rd.
+ * mem_bytes_wr: count the number of bytes transferred by mem_access_wr.
+ * cycles: count the UCF cycles.
+
+The average bandwidth is calculated as::
+
+ AVG_SLC_READ_BANDWIDTH_IN_GBPS = SLC_BYTES_RD / ELAPSED_TIME_IN_NS
+ AVG_SLC_WRITE_BANDWIDTH_IN_GBPS = SLC_BYTES_WR / ELAPSED_TIME_IN_NS
+ AVG_MEM_READ_BANDWIDTH_IN_GBPS = MEM_BYTES_RD / ELAPSED_TIME_IN_NS
+ AVG_MEM_WRITE_BANDWIDTH_IN_GBPS = MEM_BYTES_WR / ELAPSED_TIME_IN_NS
+
+The average request rate is calculated as::
+
+ AVG_SLC_READ_REQUEST_RATE = SLC_ACCESS_RD / CYCLES
+ AVG_SLC_WRITE_REQUEST_RATE = SLC_ACCESS_WR / CYCLES
+ AVG_MEM_READ_REQUEST_RATE = MEM_ACCESS_RD / CYCLES
+ AVG_MEM_WRITE_REQUEST_RATE = MEM_ACCESS_WR / CYCLES
+
+More details about the other available events can be found in the Tegra410
+SoC technical reference manual.
+
+The events can be filtered based on source or destination. The source filter
+indicates the traffic initiator to the SLC, e.g. local CPU, non-CPU device, or
+remote socket. The destination filter specifies the destination memory type,
+e.g. local system memory (CMEM), local GPU memory (GMEM), or remote memory. The
+local/remote classification of the destination filter is based on the home
+socket of the address, not where the data actually resides. The available
+filters are described in
+/sys/bus/event_source/devices/nvidia_ucf_pmu_<socket-id>/format/.
+
+The list of UCF PMU event filters:
+
+* Source filter:
+
+ * src_loc_cpu: if set, count events from local CPU
+ * src_loc_noncpu: if set, count events from local non-CPU device
+ * src_rem: if set, count events from CPU, GPU, and PCIe devices of the remote socket
+
+* Destination filter:
+
+ * dst_loc_cmem: if set, count events to local system memory (CMEM) address
+ * dst_loc_gmem: if set, count events to local GPU memory (GMEM) address
+ * dst_loc_other: if set, count events to local CXL memory address
+ * dst_rem: if set, count events to CPU, GPU, and CXL memory address of remote socket
+
+If the source is not specified, the PMU will count events from all sources. If
+the destination is not specified, the PMU will count events to all destinations.
+
+Example usage:
+
+* Count event id 0x0 in socket 0 from all sources and to all destinations::
+
+ perf stat -a -e nvidia_ucf_pmu_0/event=0x0/
+
+* Count event id 0x0 in socket 0 with source filter = local CPU and destination
+ filter = local system memory (CMEM)::
+
+ perf stat -a -e nvidia_ucf_pmu_0/event=0x0,src_loc_cpu=0x1,dst_loc_cmem=0x1/
+
+* Count event id 0x0 in socket 1 with source filter = local non-CPU device and
+ destination filter = remote memory::
+
+ perf stat -a -e nvidia_ucf_pmu_1/event=0x0,src_loc_noncpu=0x1,dst_rem=0x1/
diff --git a/drivers/perf/arm_cspmu/nvidia_cspmu.c b/drivers/perf/arm_cspmu/nvidia_cspmu.c
index e06a06d3407b..c67667097a3c 100644
--- a/drivers/perf/arm_cspmu/nvidia_cspmu.c
+++ b/drivers/perf/arm_cspmu/nvidia_cspmu.c
@@ -1,6 +1,6 @@
// SPDX-License-Identifier: GPL-2.0
/*
- * Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
*
*/
@@ -21,6 +21,13 @@
#define NV_CNVL_PORT_COUNT 4ULL
#define NV_CNVL_FILTER_ID_MASK GENMASK_ULL(NV_CNVL_PORT_COUNT - 1, 0)
+#define NV_UCF_SRC_COUNT 3ULL
+#define NV_UCF_DST_COUNT 4ULL
+#define NV_UCF_FILTER_ID_MASK GENMASK_ULL(11, 0)
+#define NV_UCF_FILTER_SRC GENMASK_ULL(2, 0)
+#define NV_UCF_FILTER_DST GENMASK_ULL(11, 8)
+#define NV_UCF_FILTER_DEFAULT (NV_UCF_FILTER_SRC | NV_UCF_FILTER_DST)
+
#define NV_GENERIC_FILTER_ID_MASK GENMASK_ULL(31, 0)
#define NV_PRODID_MASK (PMIIDR_PRODUCTID | PMIIDR_VARIANT | PMIIDR_REVISION)
@@ -124,6 +131,37 @@ static struct attribute *mcf_pmu_event_attrs[] = {
NULL,
};
+static struct attribute *ucf_pmu_event_attrs[] = {
+ ARM_CSPMU_EVENT_ATTR(bus_cycles, 0x1D),
+
+ ARM_CSPMU_EVENT_ATTR(slc_allocate, 0xF0),
+ ARM_CSPMU_EVENT_ATTR(slc_wb, 0xF3),
+ ARM_CSPMU_EVENT_ATTR(slc_refill_rd, 0x109),
+ ARM_CSPMU_EVENT_ATTR(slc_refill_wr, 0x10A),
+ ARM_CSPMU_EVENT_ATTR(slc_hit_rd, 0x119),
+
+ ARM_CSPMU_EVENT_ATTR(slc_access_dataless, 0x183),
+ ARM_CSPMU_EVENT_ATTR(slc_access_atomic, 0x184),
+
+ ARM_CSPMU_EVENT_ATTR(slc_access, 0xF2),
+ ARM_CSPMU_EVENT_ATTR(slc_access_rd, 0x111),
+ ARM_CSPMU_EVENT_ATTR(slc_access_wr, 0x112),
+ ARM_CSPMU_EVENT_ATTR(slc_bytes_rd, 0x113),
+ ARM_CSPMU_EVENT_ATTR(slc_bytes_wr, 0x114),
+
+ ARM_CSPMU_EVENT_ATTR(mem_access_rd, 0x121),
+ ARM_CSPMU_EVENT_ATTR(mem_access_wr, 0x122),
+ ARM_CSPMU_EVENT_ATTR(mem_bytes_rd, 0x123),
+ ARM_CSPMU_EVENT_ATTR(mem_bytes_wr, 0x124),
+
+ ARM_CSPMU_EVENT_ATTR(local_snoop, 0x180),
+ ARM_CSPMU_EVENT_ATTR(ext_snp_access, 0x181),
+ ARM_CSPMU_EVENT_ATTR(ext_snp_evict, 0x182),
+
+ ARM_CSPMU_EVENT_ATTR(cycles, ARM_CSPMU_EVT_CYCLES_DEFAULT),
+ NULL,
+};
+
static struct attribute *generic_pmu_event_attrs[] = {
ARM_CSPMU_EVENT_ATTR(cycles, ARM_CSPMU_EVT_CYCLES_DEFAULT),
NULL,
@@ -152,6 +190,18 @@ static struct attribute *cnvlink_pmu_format_attrs[] = {
NULL,
};
+static struct attribute *ucf_pmu_format_attrs[] = {
+ ARM_CSPMU_FORMAT_EVENT_ATTR,
+ ARM_CSPMU_FORMAT_ATTR(src_loc_noncpu, "config1:0"),
+ ARM_CSPMU_FORMAT_ATTR(src_loc_cpu, "config1:1"),
+ ARM_CSPMU_FORMAT_ATTR(src_rem, "config1:2"),
+ ARM_CSPMU_FORMAT_ATTR(dst_loc_cmem, "config1:8"),
+ ARM_CSPMU_FORMAT_ATTR(dst_loc_gmem, "config1:9"),
+ ARM_CSPMU_FORMAT_ATTR(dst_loc_other, "config1:10"),
+ ARM_CSPMU_FORMAT_ATTR(dst_rem, "config1:11"),
+ NULL,
+};
+
static struct attribute *generic_pmu_format_attrs[] = {
ARM_CSPMU_FORMAT_EVENT_ATTR,
ARM_CSPMU_FORMAT_FILTER_ATTR,
@@ -236,6 +286,27 @@ static void nv_cspmu_set_cc_filter(struct arm_cspmu *cspmu,
writel(filter, cspmu->base0 + PMCCFILTR);
}
+static u32 ucf_pmu_event_filter(const struct perf_event *event)
+{
+ u32 ret, filter, src, dst;
+
+ filter = nv_cspmu_event_filter(event);
+
+ /* Monitor all sources if none is selected. */
+ src = FIELD_GET(NV_UCF_FILTER_SRC, filter);
+ if (src == 0)
+ src = GENMASK_ULL(NV_UCF_SRC_COUNT - 1, 0);
+
+ /* Monitor all destinations if none is selected. */
+ dst = FIELD_GET(NV_UCF_FILTER_DST, filter);
+ if (dst == 0)
+ dst = GENMASK_ULL(NV_UCF_DST_COUNT - 1, 0);
+
+ ret = FIELD_PREP(NV_UCF_FILTER_SRC, src);
+ ret |= FIELD_PREP(NV_UCF_FILTER_DST, dst);
+
+ return ret;
+}
enum nv_cspmu_name_fmt {
NAME_FMT_GENERIC,
@@ -342,6 +413,23 @@ static const struct nv_cspmu_match nv_cspmu_match[] = {
.init_data = NULL
},
},
+ {
+ .prodid = 0x2CF20000,
+ .prodid_mask = NV_PRODID_MASK,
+ .name_pattern = "nvidia_ucf_pmu_%u",
+ .name_fmt = NAME_FMT_SOCKET,
+ .template_ctx = {
+ .event_attr = ucf_pmu_event_attrs,
+ .format_attr = ucf_pmu_format_attrs,
+ .filter_mask = NV_UCF_FILTER_ID_MASK,
+ .filter_default_val = NV_UCF_FILTER_DEFAULT,
+ .filter2_mask = 0x0,
+ .filter2_default_val = 0x0,
+ .get_filter = ucf_pmu_event_filter,
+ .get_filter2 = NULL,
+ .init_data = NULL
+ },
+ },
{
.prodid = 0,
.prodid_mask = 0,
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v2 2/8] perf/arm_cspmu: nvidia: Add Tegra410 UCF PMU
2026-02-18 14:58 ` [PATCH v2 2/8] perf/arm_cspmu: nvidia: Add Tegra410 UCF PMU Besar Wicaksono
@ 2026-02-19 9:43 ` Jonathan Cameron
2026-03-05 22:33 ` Besar Wicaksono
0 siblings, 1 reply; 18+ messages in thread
From: Jonathan Cameron @ 2026-02-19 9:43 UTC (permalink / raw)
To: Besar Wicaksono
Cc: will, suzuki.poulose, robin.murphy, ilkka, linux-arm-kernel,
linux-kernel, linux-tegra, mark.rutland, treding, jonathanh,
vsethi, rwiley, sdonthineni, skelley, ywan, mochs, nirmoyd
On Wed, 18 Feb 2026 14:58:03 +0000
Besar Wicaksono <bwicaksono@nvidia.com> wrote:
> The Unified Coherence Fabric (UCF) contains last level cache
> and cache coherent interconnect in Tegra410 SOC. The PMU in
> this device can be used to capture events related to access
> to the last level cache and memory from different sources.
>
> Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
Trivial stuff inline...
> diff --git a/drivers/perf/arm_cspmu/nvidia_cspmu.c b/drivers/perf/arm_cspmu/nvidia_cspmu.c
> index e06a06d3407b..c67667097a3c 100644
> --- a/drivers/perf/arm_cspmu/nvidia_cspmu.c
> +++ b/drivers/perf/arm_cspmu/nvidia_cspmu.c
> @@ -1,6 +1,6 @@
> // SPDX-License-Identifier: GPL-2.0
> /*
> - * Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
> + * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
> *
> */
>
> @@ -21,6 +21,13 @@
> #define NV_CNVL_PORT_COUNT 4ULL
> #define NV_CNVL_FILTER_ID_MASK GENMASK_ULL(NV_CNVL_PORT_COUNT - 1, 0)
>
> +#define NV_UCF_SRC_COUNT 3ULL
> +#define NV_UCF_DST_COUNT 4ULL
> +#define NV_UCF_FILTER_ID_MASK GENMASK_ULL(11, 0)
> +#define NV_UCF_FILTER_SRC GENMASK_ULL(2, 0)
> +#define NV_UCF_FILTER_DST GENMASK_ULL(11, 8)
> +#define NV_UCF_FILTER_DEFAULT (NV_UCF_FILTER_SRC | NV_UCF_FILTER_DST)
> +
> #define NV_GENERIC_FILTER_ID_MASK GENMASK_ULL(31, 0)
>
> #define NV_PRODID_MASK (PMIIDR_PRODUCTID | PMIIDR_VARIANT | PMIIDR_REVISION)
> @@ -124,6 +131,37 @@ static struct attribute *mcf_pmu_event_attrs[] = {
> NULL,
> };
>
> +static struct attribute *ucf_pmu_event_attrs[] = {
> + ARM_CSPMU_EVENT_ATTR(bus_cycles, 0x1D),
> +
> + ARM_CSPMU_EVENT_ATTR(slc_allocate, 0xF0),
> + ARM_CSPMU_EVENT_ATTR(slc_wb, 0xF3),
> + ARM_CSPMU_EVENT_ATTR(slc_refill_rd, 0x109),
> + ARM_CSPMU_EVENT_ATTR(slc_refill_wr, 0x10A),
> + ARM_CSPMU_EVENT_ATTR(slc_hit_rd, 0x119),
> +
> + ARM_CSPMU_EVENT_ATTR(slc_access_dataless, 0x183),
> + ARM_CSPMU_EVENT_ATTR(slc_access_atomic, 0x184),
> +
> + ARM_CSPMU_EVENT_ATTR(slc_access, 0xF2),
> + ARM_CSPMU_EVENT_ATTR(slc_access_rd, 0x111),
> + ARM_CSPMU_EVENT_ATTR(slc_access_wr, 0x112),
> + ARM_CSPMU_EVENT_ATTR(slc_bytes_rd, 0x113),
> + ARM_CSPMU_EVENT_ATTR(slc_bytes_wr, 0x114),
> +
> + ARM_CSPMU_EVENT_ATTR(mem_access_rd, 0x121),
> + ARM_CSPMU_EVENT_ATTR(mem_access_wr, 0x122),
> + ARM_CSPMU_EVENT_ATTR(mem_bytes_rd, 0x123),
> + ARM_CSPMU_EVENT_ATTR(mem_bytes_wr, 0x124),
> +
> + ARM_CSPMU_EVENT_ATTR(local_snoop, 0x180),
> + ARM_CSPMU_EVENT_ATTR(ext_snp_access, 0x181),
> + ARM_CSPMU_EVENT_ATTR(ext_snp_evict, 0x182),
> +
> + ARM_CSPMU_EVENT_ATTR(cycles, ARM_CSPMU_EVT_CYCLES_DEFAULT),
> + NULL,
Whilst it's locally consistent, in general commas after NULL terminators
make little sense. The whole point of that terminator
is that nothing will ever come after it...
I wouldn't have commented but...
> +};
> enum nv_cspmu_name_fmt {
> NAME_FMT_GENERIC,
> @@ -342,6 +413,23 @@ static const struct nv_cspmu_match nv_cspmu_match[] = {
> .init_data = NULL
> },
> },
> + {
> + .prodid = 0x2CF20000,
> + .prodid_mask = NV_PRODID_MASK,
> + .name_pattern = "nvidia_ucf_pmu_%u",
> + .name_fmt = NAME_FMT_SOCKET,
> + .template_ctx = {
> + .event_attr = ucf_pmu_event_attrs,
> + .format_attr = ucf_pmu_format_attrs,
> + .filter_mask = NV_UCF_FILTER_ID_MASK,
> + .filter_default_val = NV_UCF_FILTER_DEFAULT,
> + .filter2_mask = 0x0,
> + .filter2_default_val = 0x0,
> + .get_filter = ucf_pmu_event_filter,
> + .get_filter2 = NULL,
> + .init_data = NULL
Also locally consistent but generally considered a bad thing to do.
It is certainly possible that in future the template_ctx will gain another field
so the lack of a trailing comma here will then create unnecessary noise.
For this reason trailing commas are normally used in structure initialization.
Jonathan
> + },
> + },
> {
> .prodid = 0,
> .prodid_mask = 0,
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: [PATCH v2 2/8] perf/arm_cspmu: nvidia: Add Tegra410 UCF PMU
2026-02-19 9:43 ` Jonathan Cameron
@ 2026-03-05 22:33 ` Besar Wicaksono
0 siblings, 0 replies; 18+ messages in thread
From: Besar Wicaksono @ 2026-03-05 22:33 UTC (permalink / raw)
To: Jonathan Cameron
Cc: will@kernel.org, suzuki.poulose@arm.com, robin.murphy@arm.com,
ilkka@os.amperecomputing.com,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-tegra@vger.kernel.org,
mark.rutland@arm.com, Thierry Reding, Jon Hunter, Vikram Sethi,
Rich Wiley, Shanker Donthineni, Sean Kelley, Yifei Wan, Matt Ochs,
Nirmoy Das
> -----Original Message-----
> From: Jonathan Cameron <jonathan.cameron@huawei.com>
> Sent: Thursday, February 19, 2026 3:43 AM
> To: Besar Wicaksono <bwicaksono@nvidia.com>
> Cc: will@kernel.org; suzuki.poulose@arm.com; robin.murphy@arm.com;
> ilkka@os.amperecomputing.com; linux-arm-kernel@lists.infradead.org; linux-
> kernel@vger.kernel.org; linux-tegra@vger.kernel.org; mark.rutland@arm.com;
> Thierry Reding <treding@nvidia.com>; Jon Hunter <jonathanh@nvidia.com>;
> Vikram Sethi <vsethi@nvidia.com>; Rich Wiley <rwiley@nvidia.com>; Shanker
> Donthineni <sdonthineni@nvidia.com>; Sean Kelley <skelley@nvidia.com>;
> Yifei Wan <ywan@nvidia.com>; Matt Ochs <mochs@nvidia.com>; Nirmoy Das
> <nirmoyd@nvidia.com>
> Subject: Re: [PATCH v2 2/8] perf/arm_cspmu: nvidia: Add Tegra410 UCF PMU
>
> On Wed, 18 Feb 2026 14:58:03 +0000
> Besar Wicaksono <bwicaksono@nvidia.com> wrote:
>
> > The Unified Coherence Fabric (UCF) contains last level cache
> > and cache coherent interconnect in Tegra410 SOC. The PMU in
> > this device can be used to capture events related to access
> > to the last level cache and memory from different sources.
> >
> > Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> > Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
> Trivial stuff inline...
>
>
Thanks for the feedback Jonathan. I will fix them on V3.
Regards,
Besar
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v2 3/8] perf/arm_cspmu: Add arm_cspmu_acpi_dev_get
2026-02-18 14:58 [PATCH v2 0/8] perf: add NVIDIA Tegra410 Uncore PMU support Besar Wicaksono
2026-02-18 14:58 ` [PATCH v2 1/8] perf/arm_cspmu: nvidia: Rename doc to Tegra241 Besar Wicaksono
2026-02-18 14:58 ` [PATCH v2 2/8] perf/arm_cspmu: nvidia: Add Tegra410 UCF PMU Besar Wicaksono
@ 2026-02-18 14:58 ` Besar Wicaksono
2026-02-19 9:40 ` Jonathan Cameron
2026-02-18 14:58 ` [PATCH v2 4/8] perf/arm_cspmu: nvidia: Add Tegra410 PCIE PMU Besar Wicaksono
` (4 subsequent siblings)
7 siblings, 1 reply; 18+ messages in thread
From: Besar Wicaksono @ 2026-02-18 14:58 UTC (permalink / raw)
To: will, suzuki.poulose, robin.murphy, ilkka
Cc: linux-arm-kernel, linux-kernel, linux-tegra, mark.rutland,
treding, jonathanh, vsethi, rwiley, sdonthineni, skelley, ywan,
mochs, nirmoyd, Besar Wicaksono
Add an interface to get the ACPI device associated with the
PMU. This ACPI device may contain additional properties
beyond the standard ones.
Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
---
drivers/perf/arm_cspmu/arm_cspmu.c | 22 +++++++++++++++++++++-
drivers/perf/arm_cspmu/arm_cspmu.h | 17 ++++++++++++++++-
2 files changed, 37 insertions(+), 2 deletions(-)
diff --git a/drivers/perf/arm_cspmu/arm_cspmu.c b/drivers/perf/arm_cspmu/arm_cspmu.c
index 34430b68f602..ab2479c048bb 100644
--- a/drivers/perf/arm_cspmu/arm_cspmu.c
+++ b/drivers/perf/arm_cspmu/arm_cspmu.c
@@ -16,7 +16,7 @@
* The user should refer to the vendor technical documentation to get details
* about the supported events.
*
- * Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
*
*/
@@ -1132,6 +1132,26 @@ static int arm_cspmu_acpi_get_cpus(struct arm_cspmu *cspmu)
return 0;
}
+
+struct acpi_device *arm_cspmu_acpi_dev_get(const struct arm_cspmu *cspmu)
+{
+ char hid[16];
+ char uid[16];
+ const struct acpi_apmt_node *apmt_node;
+
+ apmt_node = arm_cspmu_apmt_node(cspmu->dev);
+ if (!apmt_node || apmt_node->type != ACPI_APMT_NODE_TYPE_ACPI)
+ return NULL;
+
+ memset(hid, 0, sizeof(hid));
+ memset(uid, 0, sizeof(uid));
+
+ memcpy(hid, &apmt_node->inst_primary, sizeof(apmt_node->inst_primary));
+ snprintf(uid, sizeof(uid), "%u", apmt_node->inst_secondary);
+
+ return acpi_dev_get_first_match_dev(hid, uid, -1);
+}
+EXPORT_SYMBOL_GPL(arm_cspmu_acpi_dev_get);
#else
static int arm_cspmu_acpi_get_cpus(struct arm_cspmu *cspmu)
{
diff --git a/drivers/perf/arm_cspmu/arm_cspmu.h b/drivers/perf/arm_cspmu/arm_cspmu.h
index cd65a58dbd88..320096673200 100644
--- a/drivers/perf/arm_cspmu/arm_cspmu.h
+++ b/drivers/perf/arm_cspmu/arm_cspmu.h
@@ -1,13 +1,14 @@
/* SPDX-License-Identifier: GPL-2.0
*
* ARM CoreSight Architecture PMU driver.
- * Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
*
*/
#ifndef __ARM_CSPMU_H__
#define __ARM_CSPMU_H__
+#include <linux/acpi.h>
#include <linux/bitfield.h>
#include <linux/cpumask.h>
#include <linux/device.h>
@@ -255,4 +256,18 @@ int arm_cspmu_impl_register(const struct arm_cspmu_impl_match *impl_match);
/* Unregister vendor backend. */
void arm_cspmu_impl_unregister(const struct arm_cspmu_impl_match *impl_match);
+#if defined(CONFIG_ACPI)
+/**
+ * Get ACPI device associated with the PMU.
+ * The caller is responsible for calling acpi_dev_put() on the returned device.
+ */
+struct acpi_device *arm_cspmu_acpi_dev_get(const struct arm_cspmu *cspmu);
+#else
+static inline struct acpi_device *
+arm_cspmu_acpi_dev_get(const struct arm_cspmu *cspmu)
+{
+ return NULL;
+}
+#endif
+
#endif /* __ARM_CSPMU_H__ */
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v2 3/8] perf/arm_cspmu: Add arm_cspmu_acpi_dev_get
2026-02-18 14:58 ` [PATCH v2 3/8] perf/arm_cspmu: Add arm_cspmu_acpi_dev_get Besar Wicaksono
@ 2026-02-19 9:40 ` Jonathan Cameron
2026-03-05 22:39 ` Besar Wicaksono
0 siblings, 1 reply; 18+ messages in thread
From: Jonathan Cameron @ 2026-02-19 9:40 UTC (permalink / raw)
To: Besar Wicaksono
Cc: will, suzuki.poulose, robin.murphy, ilkka, linux-arm-kernel,
linux-kernel, linux-tegra, mark.rutland, treding, jonathanh,
vsethi, rwiley, sdonthineni, skelley, ywan, mochs, nirmoyd
On Wed, 18 Feb 2026 14:58:04 +0000
Besar Wicaksono <bwicaksono@nvidia.com> wrote:
> Add interface to get ACPI device associated with the
> PMU. This ACPI device may contain additional properties
> not covered by the standard properties.
>
> Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
Hi Besar,
A drive by review as I was curious.
A few comments inline.
> ---
> drivers/perf/arm_cspmu/arm_cspmu.c | 22 +++++++++++++++++++++-
> drivers/perf/arm_cspmu/arm_cspmu.h | 17 ++++++++++++++++-
> 2 files changed, 37 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/perf/arm_cspmu/arm_cspmu.c b/drivers/perf/arm_cspmu/arm_cspmu.c
> index 34430b68f602..ab2479c048bb 100644
> --- a/drivers/perf/arm_cspmu/arm_cspmu.c
> +++ b/drivers/perf/arm_cspmu/arm_cspmu.c
> @@ -16,7 +16,7 @@
> * The user should refer to the vendor technical documentation to get details
> * about the supported events.
> *
> - * Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
> + * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
> *
> */
>
> @@ -1132,6 +1132,26 @@ static int arm_cspmu_acpi_get_cpus(struct arm_cspmu *cspmu)
>
> return 0;
> }
> +
> +struct acpi_device *arm_cspmu_acpi_dev_get(const struct arm_cspmu *cspmu)
> +{
> + char hid[16];
> + char uid[16];
Might as well do
char hid[16] = { };
char uid[16] = { };
and drop the memsets below.
> + const struct acpi_apmt_node *apmt_node;
> +
> + apmt_node = arm_cspmu_apmt_node(cspmu->dev);
> + if (!apmt_node || apmt_node->type != ACPI_APMT_NODE_TYPE_ACPI)
> + return NULL;
> +
> + memset(hid, 0, sizeof(hid));
> + memset(uid, 0, sizeof(uid));
> +
> + memcpy(hid, &apmt_node->inst_primary, sizeof(apmt_node->inst_primary));
> + snprintf(uid, sizeof(uid), "%u", apmt_node->inst_secondary);
> +
> + return acpi_dev_get_first_match_dev(hid, uid, -1);
> +}
> +EXPORT_SYMBOL_GPL(arm_cspmu_acpi_dev_get);
> #else
> static int arm_cspmu_acpi_get_cpus(struct arm_cspmu *cspmu)
> {
> diff --git a/drivers/perf/arm_cspmu/arm_cspmu.h b/drivers/perf/arm_cspmu/arm_cspmu.h
> index cd65a58dbd88..320096673200 100644
> --- a/drivers/perf/arm_cspmu/arm_cspmu.h
> +++ b/drivers/perf/arm_cspmu/arm_cspmu.h
> @@ -1,13 +1,14 @@
> /* SPDX-License-Identifier: GPL-2.0
> *
> * ARM CoreSight Architecture PMU driver.
> - * Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
> + * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
> *
> */
>
> #ifndef __ARM_CSPMU_H__
> #define __ARM_CSPMU_H__
>
> +#include <linux/acpi.h>
> #include <linux/bitfield.h>
> #include <linux/cpumask.h>
> #include <linux/device.h>
> @@ -255,4 +256,18 @@ int arm_cspmu_impl_register(const struct arm_cspmu_impl_match *impl_match);
> /* Unregister vendor backend. */
> void arm_cspmu_impl_unregister(const struct arm_cspmu_impl_match *impl_match);
>
> +#if defined(CONFIG_ACPI)
This isn't the same gate as used for whether the function is built. I think that's
#if defined(CONFIG_ACPI) && defined(CONFIG_ARM64)
Whilst it might work to have them different today I think this is a little more
fragile than would be ideal.
The ARM64 bit seems to be there to allow COMPILE_TEST for
ARM_CORESIGHT_PMU_ARCH_SYSTEM_PMU and to me that smells like a stub or Kconfig
dependency missing.
> +/**
> + * Get ACPI device associated with the PMU.
> + * The caller is responsible for calling acpi_dev_put() on the returned device.
> + */
> +struct acpi_device *arm_cspmu_acpi_dev_get(const struct arm_cspmu *cspmu);
> +#else
> +static inline struct acpi_device *
> +arm_cspmu_acpi_dev_get(const struct arm_cspmu *cspmu)
> +{
> + return NULL;
> +}
> +#endif
> +
> #endif /* __ARM_CSPMU_H__ */
^ permalink raw reply [flat|nested] 18+ messages in thread
* RE: [PATCH v2 3/8] perf/arm_cspmu: Add arm_cspmu_acpi_dev_get
2026-02-19 9:40 ` Jonathan Cameron
@ 2026-03-05 22:39 ` Besar Wicaksono
0 siblings, 0 replies; 18+ messages in thread
From: Besar Wicaksono @ 2026-03-05 22:39 UTC (permalink / raw)
To: Jonathan Cameron
Cc: will@kernel.org, suzuki.poulose@arm.com, robin.murphy@arm.com,
ilkka@os.amperecomputing.com,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-tegra@vger.kernel.org,
mark.rutland@arm.com, Thierry Reding, Jon Hunter, Vikram Sethi,
Rich Wiley, Shanker Donthineni, Sean Kelley, Yifei Wan, Matt Ochs,
Nirmoy Das
> -----Original Message-----
> From: Jonathan Cameron <jonathan.cameron@huawei.com>
> Sent: Thursday, February 19, 2026 3:40 AM
> To: Besar Wicaksono <bwicaksono@nvidia.com>
> Cc: will@kernel.org; suzuki.poulose@arm.com; robin.murphy@arm.com;
> ilkka@os.amperecomputing.com; linux-arm-kernel@lists.infradead.org; linux-
> kernel@vger.kernel.org; linux-tegra@vger.kernel.org; mark.rutland@arm.com;
> Thierry Reding <treding@nvidia.com>; Jon Hunter <jonathanh@nvidia.com>;
> Vikram Sethi <vsethi@nvidia.com>; Rich Wiley <rwiley@nvidia.com>; Shanker
> Donthineni <sdonthineni@nvidia.com>; Sean Kelley <skelley@nvidia.com>;
> Yifei Wan <ywan@nvidia.com>; Matt Ochs <mochs@nvidia.com>; Nirmoy Das
> <nirmoyd@nvidia.com>
> Subject: Re: [PATCH v2 3/8] perf/arm_cspmu: Add arm_cspmu_acpi_dev_get
>
> On Wed, 18 Feb 2026 14:58:04 +0000
> Besar Wicaksono <bwicaksono@nvidia.com> wrote:
>
> > Add interface to get ACPI device associated with the
> > PMU. This ACPI device may contain additional properties
> > not covered by the standard properties.
> >
> > Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> > Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
> Hi Besar,
>
> A drive by review as I was curious.
>
> A few comments inline.
> > ---
> > drivers/perf/arm_cspmu/arm_cspmu.c | 22 +++++++++++++++++++++-
> > drivers/perf/arm_cspmu/arm_cspmu.h | 17 ++++++++++++++++-
> > 2 files changed, 37 insertions(+), 2 deletions(-)
> >
> > diff --git a/drivers/perf/arm_cspmu/arm_cspmu.c
> b/drivers/perf/arm_cspmu/arm_cspmu.c
> > index 34430b68f602..ab2479c048bb 100644
> > --- a/drivers/perf/arm_cspmu/arm_cspmu.c
> > +++ b/drivers/perf/arm_cspmu/arm_cspmu.c
> > @@ -16,7 +16,7 @@
> > * The user should refer to the vendor technical documentation to get details
> > * about the supported events.
> > *
> > - * Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights
> reserved.
> > + * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights
> reserved.
> > *
> > */
> >
> > @@ -1132,6 +1132,26 @@ static int arm_cspmu_acpi_get_cpus(struct
> arm_cspmu *cspmu)
> >
> > return 0;
> > }
> > +
> > +struct acpi_device *arm_cspmu_acpi_dev_get(const struct arm_cspmu
> *cspmu)
> > +{
> > + char hid[16];
> > + char uid[16];
>
> Might as well do
> char hid[16] = { };
> char uid[16] = { };
>
Sure, will do on V3.
> and drop the memsets below.
>
> > + const struct acpi_apmt_node *apmt_node;
> > +
> > + apmt_node = arm_cspmu_apmt_node(cspmu->dev);
> > + if (!apmt_node || apmt_node->type != ACPI_APMT_NODE_TYPE_ACPI)
> > + return NULL;
> > +
> > + memset(hid, 0, sizeof(hid));
> > + memset(uid, 0, sizeof(uid));
> > +
> > + memcpy(hid, &apmt_node->inst_primary, sizeof(apmt_node-
> >inst_primary));
> > + snprintf(uid, sizeof(uid), "%u", apmt_node->inst_secondary);
> > +
> > + return acpi_dev_get_first_match_dev(hid, uid, -1);
> > +}
> > +EXPORT_SYMBOL_GPL(arm_cspmu_acpi_dev_get);
> > #else
> > static int arm_cspmu_acpi_get_cpus(struct arm_cspmu *cspmu)
> > {
> > diff --git a/drivers/perf/arm_cspmu/arm_cspmu.h
> b/drivers/perf/arm_cspmu/arm_cspmu.h
> > index cd65a58dbd88..320096673200 100644
> > --- a/drivers/perf/arm_cspmu/arm_cspmu.h
> > +++ b/drivers/perf/arm_cspmu/arm_cspmu.h
> > @@ -1,13 +1,14 @@
> > /* SPDX-License-Identifier: GPL-2.0
> > *
> > * ARM CoreSight Architecture PMU driver.
> > - * Copyright (c) 2022-2023, NVIDIA CORPORATION & AFFILIATES. All rights
> reserved.
> > + * Copyright (c) 2022-2026, NVIDIA CORPORATION & AFFILIATES. All rights
> reserved.
> > *
> > */
> >
> > #ifndef __ARM_CSPMU_H__
> > #define __ARM_CSPMU_H__
> >
> > +#include <linux/acpi.h>
> > #include <linux/bitfield.h>
> > #include <linux/cpumask.h>
> > #include <linux/device.h>
> > @@ -255,4 +256,18 @@ int arm_cspmu_impl_register(const struct
> arm_cspmu_impl_match *impl_match);
> > /* Unregister vendor backend. */
> > void arm_cspmu_impl_unregister(const struct arm_cspmu_impl_match
> *impl_match);
> >
> > +#if defined(CONFIG_ACPI)
> This isn't the same gate as used for whether the function is built. I think that's
> #if defined(CONFIG_ACPI) && defined(CONFIG_ARM64)
>
> Whilst it might work to have them different today I think this is a little more
> fragile than would be ideal.
>
> The ARM64 bit seems to be there to allow COMPILE_TEST for
> ARM_CORESIGHT_PMU_ARCH_SYSTEM_PMU and to me that smells like a
> stub or Kconfig dependency missing.

Thanks for spotting. Will fix it on V3.

Regards,
Besar
>
> > +/**
> > + * Get ACPI device associated with the PMU.
> > + * The caller is responsible for calling acpi_dev_put() on the returned device.
> > + */
> > +struct acpi_device *arm_cspmu_acpi_dev_get(const struct arm_cspmu
> *cspmu);
> > +#else
> > +static inline struct acpi_device *
> > +arm_cspmu_acpi_dev_get(const struct arm_cspmu *cspmu)
> > +{
> > + return NULL;
> > +}
> > +#endif
> > +
> > #endif /* __ARM_CSPMU_H__ */
* [PATCH v2 4/8] perf/arm_cspmu: nvidia: Add Tegra410 PCIE PMU
2026-02-18 14:58 [PATCH v2 0/8] perf: add NVIDIA Tegra410 Uncore PMU support Besar Wicaksono
` (2 preceding siblings ...)
2026-02-18 14:58 ` [PATCH v2 3/8] perf/arm_cspmu: Add arm_cspmu_acpi_dev_get Besar Wicaksono
@ 2026-02-18 14:58 ` Besar Wicaksono
2026-02-19 10:06 ` Jonathan Cameron
2026-02-18 14:58 ` [PATCH v2 5/8] perf/arm_cspmu: nvidia: Add Tegra410 PCIE-TGT PMU Besar Wicaksono
` (3 subsequent siblings)
7 siblings, 1 reply; 18+ messages in thread
From: Besar Wicaksono @ 2026-02-18 14:58 UTC (permalink / raw)
To: will, suzuki.poulose, robin.murphy, ilkka
Cc: linux-arm-kernel, linux-kernel, linux-tegra, mark.rutland,
treding, jonathanh, vsethi, rwiley, sdonthineni, skelley, ywan,
mochs, nirmoyd, Besar Wicaksono
Add PCIE PMU support for the Tegra410 SoC. This PMU is instantiated
in each root complex in the SoC and can capture traffic from
PCIE devices to various memory types. The PMU can filter traffic
based on the originating root port or BDF and the target memory
type (CPU DRAM, GPU memory, CXL memory, or remote memory).
Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
---
.../admin-guide/perf/nvidia-tegra410-pmu.rst | 162 ++++++++++++++
drivers/perf/arm_cspmu/nvidia_cspmu.c | 211 +++++++++++++++++-
2 files changed, 368 insertions(+), 5 deletions(-)
diff --git a/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst b/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
index 7b7ba5700ca1..8528685ddb61 100644
--- a/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
+++ b/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
@@ -6,6 +6,7 @@ The NVIDIA Tegra410 SoC includes various system PMUs to measure key performance
metrics like memory bandwidth, latency, and utilization:
* Unified Coherence Fabric (UCF)
+* PCIE
PMU Driver
----------
@@ -104,3 +105,164 @@ Example usage:
destination filter = remote memory::
perf stat -a -e nvidia_ucf_pmu_1/event=0x0,src_loc_noncpu=0x1,dst_rem=0x1/
+
+PCIE PMU
+--------
+
+This PMU monitors all read/write traffic from the root port(s) or a particular
+BDF in a PCIE root complex (RC) to local or remote memory. There is one PMU per
+PCIE RC in the SoC. Each RC can have up to 16 lanes that can be bifurcated into
+up to 8 root ports. The traffic from each root port can be filtered using an RP
+or a BDF filter. For example, specifying "src_rp_mask=0xFF" means the PMU
+counter will capture traffic from all RPs. See below for more details.
+
+The events and configuration options of this PMU device are described in sysfs,
+see /sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>.
+
+The events in this PMU can be used to measure bandwidth, utilization, and
+latency:
+
+ * rd_req: count the number of read requests by PCIE devices.
+ * wr_req: count the number of write requests by PCIE devices.
+ * rd_bytes: count the number of bytes transferred by rd_req.
+ * wr_bytes: count the number of bytes transferred by wr_req.
+ * rd_cum_outs: accumulate the number of outstanding rd_req each cycle.
+ * cycles: count the PCIE cycles.
+
+The average bandwidth is calculated as::
+
+ AVG_RD_BANDWIDTH_IN_GBPS = RD_BYTES / ELAPSED_TIME_IN_NS
+ AVG_WR_BANDWIDTH_IN_GBPS = WR_BYTES / ELAPSED_TIME_IN_NS
+
+The average request rate is calculated as::
+
+ AVG_RD_REQUEST_RATE = RD_REQ / CYCLES
+ AVG_WR_REQUEST_RATE = WR_REQ / CYCLES
+
+
+The average latency is calculated as::
+
+ FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
+ AVG_LATENCY_IN_CYCLES = RD_CUM_OUTS / RD_REQ
+ AVERAGE_LATENCY_IN_NS = AVG_LATENCY_IN_CYCLES / FREQ_IN_GHZ
+
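As a rough illustration of the three derivations above, the arithmetic can be sketched in C (helper names here are invented for illustration, not part of the driver or of the perf tooling):

```c
#include <assert.h>
#include <stdint.h>

/* Bandwidth: bytes per nanosecond is numerically equal to GB/s. */
static double avg_bw_gbps(uint64_t bytes, uint64_t elapsed_ns)
{
	return (double)bytes / (double)elapsed_ns;
}

/* Requests issued per PMU cycle. */
static double avg_req_rate(uint64_t reqs, uint64_t cycles)
{
	return (double)reqs / (double)cycles;
}

/* Average read latency via the cumulative-outstanding counter. */
static double avg_latency_ns(uint64_t rd_cum_outs, uint64_t rd_req,
			     uint64_t cycles, uint64_t elapsed_ns)
{
	double freq_ghz = (double)cycles / (double)elapsed_ns;
	double avg_latency_cycles = (double)rd_cum_outs / (double)rd_req;

	return avg_latency_cycles / freq_ghz;
}
```

E.g. with a 1 GHz PMU clock over a 1 s window, 2000 reads accumulating 400000 outstanding-cycles give an average read latency of 200 ns.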
+The PMU events can be filtered based on the traffic source and destination.
+The source filter indicates the PCIE devices that will be monitored. The
+destination filter specifies the destination memory type, e.g. local system
+memory (CMEM), local GPU memory (GMEM), or remote memory. The local/remote
+classification of the destination filter is based on the home socket of the
+address, not where the data actually resides. These filters can be found in
+/sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>/format/.
+
+The list of event filters:
+
+* Source filter:
+
+ * src_rp_mask: bitmask of root ports that will be monitored. Each bit in this
+ bitmask represents the RP index in the RC. If a bit is set, all devices under
+ the associated RP will be monitored. E.g. "src_rp_mask=0xF" will monitor
+ devices in root ports 0 to 3.
+ * src_bdf: the BDF that will be monitored. This is a 16-bit value that
+ follows the formula (bus << 8) + (device << 3) + (function). For example, the
+ value of BDF 27:01.1 is 0x2709.
+ * src_bdf_en: enable the BDF filter. If this is set, the BDF filter value in
+ "src_bdf" is used to filter the traffic.
+
+ Note that Root-Port and BDF filters are mutually exclusive and that the PMU in
+ each RC supports only one BDF filter shared by all counters. If the BDF filter
+ is enabled, its value is applied to all events.
+
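The src_bdf packing above is the standard PCI bus/device/function encoding; a minimal sketch (the helper name is invented, not part of the driver):

```c
#include <assert.h>
#include <stdint.h>

/* Pack bus/device/function into the 16-bit src_bdf value per the formula
 * above: (bus << 8) + (device << 3) + (function). */
static uint16_t pcie_src_bdf(uint8_t bus, uint8_t dev, uint8_t fn)
{
	/* device is a 5-bit field, function a 3-bit field */
	return (uint16_t)(((uint16_t)bus << 8) | ((dev & 0x1f) << 3) | (fn & 0x7));
}
```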
+* Destination filter:
+
+ * dst_loc_cmem: if set, count events to local system memory (CMEM) address
+ * dst_loc_gmem: if set, count events to local GPU memory (GMEM) address
+ * dst_loc_pcie_p2p: if set, count events to local PCIE peer address
+ * dst_loc_pcie_cxl: if set, count events to local CXL memory address
+ * dst_rem: if set, count events to remote memory address
+
+If the source filter is not specified, the PMU will count events from all root
+ports. If the destination filter is not specified, the PMU will count events
+to all destinations.
+
+Example usage:
+
+* Count event id 0x0 from root port 0 of PCIE RC-0 on socket 0 targeting all
+ destinations::
+
+ perf stat -a -e nvidia_pcie_pmu_0_rc_0/event=0x0,src_rp_mask=0x1/
+
+* Count event id 0x1 from root port 0 and 1 of PCIE RC-1 on socket 0 and
+ targeting just local CMEM of socket 0::
+
+ perf stat -a -e nvidia_pcie_pmu_0_rc_1/event=0x1,src_rp_mask=0x3,dst_loc_cmem=0x1/
+
+* Count event id 0x2 from root port 0 of PCIE RC-2 on socket 1 targeting all
+ destinations::
+
+ perf stat -a -e nvidia_pcie_pmu_1_rc_2/event=0x2,src_rp_mask=0x1/
+
+* Count event id 0x3 from root port 0 and 1 of PCIE RC-3 on socket 1 and
+ targeting just local CMEM of socket 1::
+
+ perf stat -a -e nvidia_pcie_pmu_1_rc_3/event=0x3,src_rp_mask=0x3,dst_loc_cmem=0x1/
+
+* Count event id 0x4 from BDF 01:01.0 of PCIE RC-4 on socket 0 targeting all
+ destinations::
+
+ perf stat -a -e nvidia_pcie_pmu_0_rc_4/event=0x4,src_bdf=0x0108,src_bdf_en=0x1/
+
+Mapping the RC# to the lspci segment number can be non-trivial; hence a new NVIDIA
+Designated Vendor-Specific Capability (DVSEC) register is added to the PCIE config space
+of each RP. This DVSEC has vendor ID "10de" and DVSEC ID "0x4". The DVSEC register
+contains the following information to map PCIE devices under the RP back to their RC#:
+
+ - Bus# (byte 0xc) : bus number as reported by the lspci output
+ - Segment# (byte 0xd) : segment number as reported by the lspci output
+ - RP# (byte 0xe) : port number as reported by LnkCap attribute from lspci for a device with Root Port capability
+ - RC# (byte 0xf): root complex number associated with the RP
+ - Socket# (byte 0x10): socket number associated with the RP
+
+Example script for mapping lspci BDF to RC# and socket#::
+
+ #!/bin/bash
+ while read bdf rest; do
+ dvsec4_reg=$(lspci -vv -s $bdf | awk '
+ /Designated Vendor-Specific: Vendor=10de ID=0004/ {
+ match($0, /\[([0-9a-fA-F]+)/, arr);
+ print "0x" arr[1];
+ exit
+ }
+ ')
+ if [ -n "$dvsec4_reg" ]; then
+ bus=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xc))).b)
+ segment=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xd))).b)
+ rp=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xe))).b)
+ rc=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xf))).b)
+ socket=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0x10))).b)
+ echo "$bdf: Bus=$bus, Segment=$segment, RP=$rp, RC=$rc, Socket=$socket"
+ fi
+ done < <(lspci -d 10de:)
+
+Example output::
+
+ 0001:00:00.0: Bus=00, Segment=01, RP=00, RC=00, Socket=00
+ 0002:80:00.0: Bus=80, Segment=02, RP=01, RC=01, Socket=00
+ 0002:a0:00.0: Bus=a0, Segment=02, RP=02, RC=01, Socket=00
+ 0002:c0:00.0: Bus=c0, Segment=02, RP=03, RC=01, Socket=00
+ 0002:e0:00.0: Bus=e0, Segment=02, RP=04, RC=01, Socket=00
+ 0003:00:00.0: Bus=00, Segment=03, RP=00, RC=02, Socket=00
+ 0004:00:00.0: Bus=00, Segment=04, RP=00, RC=03, Socket=00
+ 0005:00:00.0: Bus=00, Segment=05, RP=00, RC=04, Socket=00
+ 0005:40:00.0: Bus=40, Segment=05, RP=01, RC=04, Socket=00
+ 0005:c0:00.0: Bus=c0, Segment=05, RP=02, RC=04, Socket=00
+ 0006:00:00.0: Bus=00, Segment=06, RP=00, RC=05, Socket=00
+ 0009:00:00.0: Bus=00, Segment=09, RP=00, RC=00, Socket=01
+ 000a:80:00.0: Bus=80, Segment=0a, RP=01, RC=01, Socket=01
+ 000a:a0:00.0: Bus=a0, Segment=0a, RP=02, RC=01, Socket=01
+ 000a:e0:00.0: Bus=e0, Segment=0a, RP=03, RC=01, Socket=01
+ 000b:00:00.0: Bus=00, Segment=0b, RP=00, RC=02, Socket=01
+ 000c:00:00.0: Bus=00, Segment=0c, RP=00, RC=03, Socket=01
+ 000d:00:00.0: Bus=00, Segment=0d, RP=00, RC=04, Socket=01
+ 000d:40:00.0: Bus=40, Segment=0d, RP=01, RC=04, Socket=01
+ 000d:c0:00.0: Bus=c0, Segment=0d, RP=02, RC=04, Socket=01
+ 000e:00:00.0: Bus=00, Segment=0e, RP=00, RC=05, Socket=01
diff --git a/drivers/perf/arm_cspmu/nvidia_cspmu.c b/drivers/perf/arm_cspmu/nvidia_cspmu.c
index c67667097a3c..42f11f37bddf 100644
--- a/drivers/perf/arm_cspmu/nvidia_cspmu.c
+++ b/drivers/perf/arm_cspmu/nvidia_cspmu.c
@@ -8,6 +8,7 @@
#include <linux/io.h>
#include <linux/module.h>
+#include <linux/property.h>
#include <linux/topology.h>
#include "arm_cspmu.h"
@@ -28,6 +29,19 @@
#define NV_UCF_FILTER_DST GENMASK_ULL(11, 8)
#define NV_UCF_FILTER_DEFAULT (NV_UCF_FILTER_SRC | NV_UCF_FILTER_DST)
+#define NV_PCIE_V2_PORT_COUNT 8ULL
+#define NV_PCIE_V2_FILTER_ID_MASK GENMASK_ULL(24, 0)
+#define NV_PCIE_V2_FILTER_PORT GENMASK_ULL(NV_PCIE_V2_PORT_COUNT - 1, 0)
+#define NV_PCIE_V2_FILTER_BDF_VAL GENMASK_ULL(23, NV_PCIE_V2_PORT_COUNT)
+#define NV_PCIE_V2_FILTER_BDF_EN BIT(24)
+#define NV_PCIE_V2_FILTER_BDF_VAL_EN GENMASK_ULL(24, NV_PCIE_V2_PORT_COUNT)
+#define NV_PCIE_V2_FILTER_DEFAULT NV_PCIE_V2_FILTER_PORT
+
+#define NV_PCIE_V2_DST_COUNT 5ULL
+#define NV_PCIE_V2_FILTER2_ID_MASK GENMASK_ULL(4, 0)
+#define NV_PCIE_V2_FILTER2_DST GENMASK_ULL(NV_PCIE_V2_DST_COUNT - 1, 0)
+#define NV_PCIE_V2_FILTER2_DEFAULT NV_PCIE_V2_FILTER2_DST
+
#define NV_GENERIC_FILTER_ID_MASK GENMASK_ULL(31, 0)
#define NV_PRODID_MASK (PMIIDR_PRODUCTID | PMIIDR_VARIANT | PMIIDR_REVISION)
@@ -162,6 +176,16 @@ static struct attribute *ucf_pmu_event_attrs[] = {
NULL,
};
+static struct attribute *pcie_v2_pmu_event_attrs[] = {
+ ARM_CSPMU_EVENT_ATTR(rd_bytes, 0x0),
+ ARM_CSPMU_EVENT_ATTR(wr_bytes, 0x1),
+ ARM_CSPMU_EVENT_ATTR(rd_req, 0x2),
+ ARM_CSPMU_EVENT_ATTR(wr_req, 0x3),
+ ARM_CSPMU_EVENT_ATTR(rd_cum_outs, 0x4),
+ ARM_CSPMU_EVENT_ATTR(cycles, ARM_CSPMU_EVT_CYCLES_DEFAULT),
+ NULL,
+};
+
static struct attribute *generic_pmu_event_attrs[] = {
ARM_CSPMU_EVENT_ATTR(cycles, ARM_CSPMU_EVT_CYCLES_DEFAULT),
NULL,
@@ -202,6 +226,19 @@ static struct attribute *ucf_pmu_format_attrs[] = {
NULL,
};
+static struct attribute *pcie_v2_pmu_format_attrs[] = {
+ ARM_CSPMU_FORMAT_EVENT_ATTR,
+ ARM_CSPMU_FORMAT_ATTR(src_rp_mask, "config1:0-7"),
+ ARM_CSPMU_FORMAT_ATTR(src_bdf, "config1:8-23"),
+ ARM_CSPMU_FORMAT_ATTR(src_bdf_en, "config1:24"),
+ ARM_CSPMU_FORMAT_ATTR(dst_loc_cmem, "config2:0"),
+ ARM_CSPMU_FORMAT_ATTR(dst_loc_gmem, "config2:1"),
+ ARM_CSPMU_FORMAT_ATTR(dst_loc_pcie_p2p, "config2:2"),
+ ARM_CSPMU_FORMAT_ATTR(dst_loc_pcie_cxl, "config2:3"),
+ ARM_CSPMU_FORMAT_ATTR(dst_rem, "config2:4"),
+ NULL,
+};
+
static struct attribute *generic_pmu_format_attrs[] = {
ARM_CSPMU_FORMAT_EVENT_ATTR,
ARM_CSPMU_FORMAT_FILTER_ATTR,
@@ -233,6 +270,32 @@ nv_cspmu_get_name(const struct arm_cspmu *cspmu)
return ctx->name;
}
+#if defined(CONFIG_ACPI)
+static int nv_cspmu_get_inst_id(const struct arm_cspmu *cspmu, u32 *id)
+{
+ struct fwnode_handle *fwnode;
+ struct acpi_device *adev;
+ int ret;
+
+ adev = arm_cspmu_acpi_dev_get(cspmu);
+ if (!adev)
+ return -ENODEV;
+
+ fwnode = acpi_fwnode_handle(adev);
+ ret = fwnode_property_read_u32(fwnode, "instance_id", id);
+ if (ret)
+ dev_err(cspmu->dev, "Failed to get instance ID\n");
+
+ acpi_dev_put(adev);
+ return ret;
+}
+#else
+static int nv_cspmu_get_inst_id(const struct arm_cspmu *cspmu, u32 *id)
+{
+ return -EINVAL;
+}
+#endif
+
static u32 nv_cspmu_event_filter(const struct perf_event *event)
{
const struct nv_cspmu_ctx *ctx =
@@ -278,6 +341,20 @@ static void nv_cspmu_set_ev_filter(struct arm_cspmu *cspmu,
}
}
+static void nv_cspmu_reset_ev_filter(struct arm_cspmu *cspmu,
+ const struct perf_event *event)
+{
+ const struct nv_cspmu_ctx *ctx =
+ to_nv_cspmu_ctx(to_arm_cspmu(event->pmu));
+ const u32 offset = 4 * event->hw.idx;
+
+ if (ctx->get_filter)
+ writel(0, cspmu->base0 + PMEVFILTR + offset);
+
+ if (ctx->get_filter2)
+ writel(0, cspmu->base0 + PMEVFILT2R + offset);
+}
+
static void nv_cspmu_set_cc_filter(struct arm_cspmu *cspmu,
const struct perf_event *event)
{
@@ -308,9 +385,103 @@ static u32 ucf_pmu_event_filter(const struct perf_event *event)
return ret;
}
+static u32 pcie_v2_pmu_bdf_val_en(u32 filter)
+{
+ const u32 bdf_en = FIELD_GET(NV_PCIE_V2_FILTER_BDF_EN, filter);
+
+ /* Returns both BDF value and enable bit if BDF filtering is enabled. */
+ if (bdf_en)
+ return FIELD_GET(NV_PCIE_V2_FILTER_BDF_VAL_EN, filter);
+
+ /* Ignore the BDF value if BDF filter is not enabled. */
+ return 0;
+}
+
+static u32 pcie_v2_pmu_event_filter(const struct perf_event *event)
+{
+ u32 filter, lead_filter, lead_bdf;
+ struct perf_event *leader;
+ const struct nv_cspmu_ctx *ctx =
+ to_nv_cspmu_ctx(to_arm_cspmu(event->pmu));
+
+ filter = event->attr.config1 & ctx->filter_mask;
+ if (filter != 0)
+ return filter;
+
+ leader = event->group_leader;
+
+ /* Use leader's filter value if its BDF filtering is enabled. */
+ if (event != leader) {
+ lead_filter = pcie_v2_pmu_event_filter(leader);
+ lead_bdf = pcie_v2_pmu_bdf_val_en(lead_filter);
+ if (lead_bdf != 0)
+ return lead_filter;
+ }
+
+ /* Otherwise, return default filter value. */
+ return ctx->filter_default_val;
+}
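The group-leader fallback implemented by pcie_v2_pmu_event_filter() above can be restated in freestanding userspace C, using the field layout from the NV_PCIE_V2_FILTER_* defines (bits [7:0] RP mask, [23:8] BDF value, bit 24 BDF enable); the struct and macro names here are invented and nothing is shared with the driver:

```c
#include <assert.h>
#include <stdint.h>

#define FILTER_ID_MASK     0x01ffffffu	/* bits [24:0] */
#define FILTER_BDF_EN      0x01000000u	/* bit 24 */
#define FILTER_BDF_VAL_EN  0x01ffff00u	/* bits [24:8] */
#define FILTER_DEFAULT     0x000000ffu	/* all root ports */

struct model_event {
	uint32_t config1;
	const struct model_event *leader;	/* points to self for a leader */
};

/* Return BDF value + enable bit only when BDF filtering is enabled. */
static uint32_t bdf_val_en(uint32_t filter)
{
	return (filter & FILTER_BDF_EN) ? (filter & FILTER_BDF_VAL_EN) : 0;
}

static uint32_t event_filter(const struct model_event *ev)
{
	uint32_t filter = ev->config1 & FILTER_ID_MASK;

	if (filter)
		return filter;

	/* A sibling with no filter of its own inherits the leader's BDF filter. */
	if (ev->leader != ev) {
		uint32_t lead = event_filter(ev->leader);

		if (bdf_val_en(lead))
			return lead;
	}

	return FILTER_DEFAULT;
}
```

So a sibling opened with an empty config1 picks up the leader's BDF filter when one is enabled, and otherwise falls back to "all root ports".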
+
+static int pcie_v2_pmu_validate_event(struct arm_cspmu *cspmu,
+ struct perf_event *new_ev)
+{
+ /*
+ * Make sure the events are using same BDF filter since the PCIE-SRC PMU
+ * only supports one common BDF filter setting for all of the counters.
+ */
+
+ int idx;
+ u32 new_filter, new_rp, new_bdf, new_lead_filter, new_lead_bdf;
+ struct perf_event *leader, *new_leader;
+
+ if (cspmu->impl.ops.is_cycle_counter_event(new_ev))
+ return 0;
+
+ new_leader = new_ev->group_leader;
+
+ new_filter = pcie_v2_pmu_event_filter(new_ev);
+ new_lead_filter = pcie_v2_pmu_event_filter(new_leader);
+
+ new_bdf = pcie_v2_pmu_bdf_val_en(new_filter);
+ new_lead_bdf = pcie_v2_pmu_bdf_val_en(new_lead_filter);
+
+ new_rp = FIELD_GET(NV_PCIE_V2_FILTER_PORT, new_filter);
+
+ if (new_rp != 0 && new_bdf != 0) {
+ dev_err(cspmu->dev,
+ "RP and BDF filtering are mutually exclusive\n");
+ return -EINVAL;
+ }
+
+ if (new_bdf != new_lead_bdf) {
+ dev_err(cspmu->dev,
+ "sibling and leader BDF value should be equal\n");
+ return -EINVAL;
+ }
+
+ /* Compare BDF filter on existing events. */
+ idx = find_first_bit(cspmu->hw_events.used_ctrs,
+ cspmu->cycle_counter_logical_idx);
+
+ if (idx != cspmu->cycle_counter_logical_idx) {
+ leader = cspmu->hw_events.events[idx]->group_leader;
+
+ const u32 lead_filter = pcie_v2_pmu_event_filter(leader);
+ const u32 lead_bdf = pcie_v2_pmu_bdf_val_en(lead_filter);
+
+ if (new_lead_bdf != lead_bdf) {
+ dev_err(cspmu->dev, "only one BDF value is supported\n");
+ return -EINVAL;
+ }
+ }
+
+ return 0;
+}
+
enum nv_cspmu_name_fmt {
NAME_FMT_GENERIC,
- NAME_FMT_SOCKET
+ NAME_FMT_SOCKET,
+ NAME_FMT_SOCKET_INST
};
struct nv_cspmu_match {
@@ -430,6 +601,27 @@ static const struct nv_cspmu_match nv_cspmu_match[] = {
.init_data = NULL
},
},
+ {
+ .prodid = 0x10301000,
+ .prodid_mask = NV_PRODID_MASK,
+ .name_pattern = "nvidia_pcie_pmu_%u_rc_%u",
+ .name_fmt = NAME_FMT_SOCKET_INST,
+ .template_ctx = {
+ .event_attr = pcie_v2_pmu_event_attrs,
+ .format_attr = pcie_v2_pmu_format_attrs,
+ .filter_mask = NV_PCIE_V2_FILTER_ID_MASK,
+ .filter_default_val = NV_PCIE_V2_FILTER_DEFAULT,
+ .filter2_mask = NV_PCIE_V2_FILTER2_ID_MASK,
+ .filter2_default_val = NV_PCIE_V2_FILTER2_DEFAULT,
+ .get_filter = pcie_v2_pmu_event_filter,
+ .get_filter2 = nv_cspmu_event_filter2,
+ .init_data = NULL
+ },
+ .ops = {
+ .validate_event = pcie_v2_pmu_validate_event,
+ .reset_ev_filter = nv_cspmu_reset_ev_filter,
+ }
+ },
{
.prodid = 0,
.prodid_mask = 0,
@@ -453,7 +645,7 @@ static const struct nv_cspmu_match nv_cspmu_match[] = {
static char *nv_cspmu_format_name(const struct arm_cspmu *cspmu,
const struct nv_cspmu_match *match)
{
- char *name;
+ char *name = NULL;
struct device *dev = cspmu->dev;
static atomic_t pmu_generic_idx = {0};
@@ -467,13 +659,20 @@ static char *nv_cspmu_format_name(const struct arm_cspmu *cspmu,
socket);
break;
}
+ case NAME_FMT_SOCKET_INST: {
+ const int cpu = cpumask_first(&cspmu->associated_cpus);
+ const int socket = cpu_to_node(cpu);
+ u32 inst_id;
+
+ if (!nv_cspmu_get_inst_id(cspmu, &inst_id))
+ name = devm_kasprintf(dev, GFP_KERNEL,
+ match->name_pattern, socket, inst_id);
+ break;
+ }
case NAME_FMT_GENERIC:
name = devm_kasprintf(dev, GFP_KERNEL, match->name_pattern,
atomic_fetch_inc(&pmu_generic_idx));
break;
- default:
- name = NULL;
- break;
}
return name;
@@ -514,8 +713,10 @@ static int nv_cspmu_init_ops(struct arm_cspmu *cspmu)
cspmu->impl.ctx = ctx;
/* NVIDIA specific callbacks. */
+ SET_OP(validate_event, impl_ops, match, NULL);
SET_OP(set_cc_filter, impl_ops, match, nv_cspmu_set_cc_filter);
SET_OP(set_ev_filter, impl_ops, match, nv_cspmu_set_ev_filter);
+ SET_OP(reset_ev_filter, impl_ops, match, NULL);
SET_OP(get_event_attrs, impl_ops, match, nv_cspmu_get_event_attrs);
SET_OP(get_format_attrs, impl_ops, match, nv_cspmu_get_format_attrs);
SET_OP(get_name, impl_ops, match, nv_cspmu_get_name);
--
2.43.0
* Re: [PATCH v2 4/8] perf/arm_cspmu: nvidia: Add Tegra410 PCIE PMU
2026-02-18 14:58 ` [PATCH v2 4/8] perf/arm_cspmu: nvidia: Add Tegra410 PCIE PMU Besar Wicaksono
@ 2026-02-19 10:06 ` Jonathan Cameron
2026-03-05 23:59 ` Besar Wicaksono
0 siblings, 1 reply; 18+ messages in thread
From: Jonathan Cameron @ 2026-02-19 10:06 UTC (permalink / raw)
To: Besar Wicaksono
Cc: will, suzuki.poulose, robin.murphy, ilkka, linux-arm-kernel,
linux-kernel, linux-tegra, mark.rutland, treding, jonathanh,
vsethi, rwiley, sdonthineni, skelley, ywan, mochs, nirmoyd,
Bjorn Helgaas, linux-pci, Yushan Wang, shiju.jose
On Wed, 18 Feb 2026 14:58:05 +0000
Besar Wicaksono <bwicaksono@nvidia.com> wrote:
> Add PCIE PMU support for the Tegra410 SoC. This PMU is instantiated
> in each root complex in the SoC and can capture traffic from
> PCIE devices to various memory types. The PMU can filter traffic
> based on the originating root port or BDF and the target memory
> type (CPU DRAM, GPU memory, CXL memory, or remote memory).
>
> Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
Given I've added a bunch of +CC I've left all your patch in place rather
than cropping to just what I've commented on.
Great to see another PCIe related PMU, but this is certainly showing
the diversity in what such things are!
I've expressed a few times that it would be really nice if a standard
PCI-centric definition would come from the PCI-SIG (similar to the one
that CXL has) but what you have here is, I think, monitoring certain
types of accesses closer to the CPU interconnect side of the RC than
such a spec would cover. As mentioned below I've +CC various people who
will be interested in this. Please keep them cc'd on v3.
> ---
> .../admin-guide/perf/nvidia-tegra410-pmu.rst | 162 ++++++++++++++
> drivers/perf/arm_cspmu/nvidia_cspmu.c | 211 +++++++++++++++++-
> 2 files changed, 368 insertions(+), 5 deletions(-)
>
> diff --git a/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst b/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
> index 7b7ba5700ca1..8528685ddb61 100644
> --- a/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
> +++ b/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
> @@ -6,6 +6,7 @@ The NVIDIA Tegra410 SoC includes various system PMUs to measure key performance
> metrics like memory bandwidth, latency, and utilization:
>
> * Unified Coherence Fabric (UCF)
> +* PCIE
It's interesting to see what people put in their PCIe related PMUs.
Seems we are getting a bit of a split into those focused on the SoC side of the host bridge
and those focused on the PCI protocol stuff (so counting TLPs, FLITs, retries, etc.).
I don't suppose it matters that much, but maybe we need to think about some suitable
terminology.
I've +CC linux-pci and Bjorn as those are the folk who are most likely to comment
on generalization aspects of PCIe PMUs.
>
> PMU Driver
> ----------
> @@ -104,3 +105,164 @@ Example usage:
> destination filter = remote memory::
>
> perf stat -a -e nvidia_ucf_pmu_1/event=0x0,src_loc_noncpu=0x1,dst_rem=0x1/
> +
> +PCIE PMU
> +--------
> +
> +This PMU monitors all read/write traffic from the root port(s) or a particular
> +BDF in a PCIE root complex (RC) to local or remote memory. There is one PMU per
> +PCIE RC in the SoC. Each RC can have up to 16 lanes that can be bifurcated into
> +up to 8 root ports. The traffic from each root port can be filtered using an RP
> +or a BDF filter. For example, specifying "src_rp_mask=0xFF" means the PMU
> +counter will capture traffic from all RPs. See below for more details.
> +
> +The events and configuration options of this PMU device are described in sysfs,
> +see /sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>.
> +
> +The events in this PMU can be used to measure bandwidth, utilization, and
> +latency:
> +
> + * rd_req: count the number of read requests by PCIE devices.
> + * wr_req: count the number of write requests by PCIE devices.
> + * rd_bytes: count the number of bytes transferred by rd_req.
> + * wr_bytes: count the number of bytes transferred by wr_req.
> + * rd_cum_outs: accumulate the number of outstanding rd_req each cycle.
> + * cycles: count the PCIE cycles.
This maybe needs a tighter definition. There are too many types of cycles
involved in PCIe IPs.
Would also be good to see how this driver fits with the efforts for
a generic perf iostat
https://lore.kernel.org/all/20260126123514.3238425-1-wangyushan12@huawei.com/
(Added wangyushan and shiju to +CC)
> +
> +The average bandwidth is calculated as::
> +
> + AVG_RD_BANDWIDTH_IN_GBPS = RD_BYTES / ELAPSED_TIME_IN_NS
> + AVG_WR_BANDWIDTH_IN_GBPS = WR_BYTES / ELAPSED_TIME_IN_NS
> +
> +The average request rate is calculated as::
> +
> + AVG_RD_REQUEST_RATE = RD_REQ / CYCLES
> + AVG_WR_REQUEST_RATE = WR_REQ / CYCLES
> +
> +
> +The average latency is calculated as::
> +
> + FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
> + AVG_LATENCY_IN_CYCLES = RD_CUM_OUTS / RD_REQ
> + AVERAGE_LATENCY_IN_NS = AVG_LATENCY_IN_CYCLES / FREQ_IN_GHZ
> +
> +The PMU events can be filtered based on the traffic source and destination.
> +The source filter indicates the PCIE devices that will be monitored. The
> +destination filter specifies the destination memory type, e.g. local system
> +memory (CMEM), local GPU memory (GMEM), or remote memory. The local/remote
> +classification of the destination filter is based on the home socket of the
> +address, not where the data actually resides. These filters can be found in
> +/sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>/format/.
> +
> +The list of event filters:
> +
> +* Source filter:
> +
> + * src_rp_mask: bitmask of root ports that will be monitored. Each bit in this
> + bitmask represents the RP index in the RC. If a bit is set, all devices under
> + the associated RP will be monitored. E.g. "src_rp_mask=0xF" will monitor
> + devices in root ports 0 to 3.
> + * src_bdf: the BDF that will be monitored. This is a 16-bit value that
> + follows the formula (bus << 8) + (device << 3) + (function). For example, the
> + value of BDF 27:01.1 is 0x2709.
> + * src_bdf_en: enable the BDF filter. If this is set, the BDF filter value in
> + "src_bdf" is used to filter the traffic.
> +
> + Note that Root-Port and BDF filters are mutually exclusive and that the PMU in
> + each RC supports only one BDF filter shared by all counters. If the BDF filter
> + is enabled, its value is applied to all events.
> +
> +* Destination filter:
> +
> + * dst_loc_cmem: if set, count events to local system memory (CMEM) address
> + * dst_loc_gmem: if set, count events to local GPU memory (GMEM) address
> + * dst_loc_pcie_p2p: if set, count events to local PCIE peer address
> + * dst_loc_pcie_cxl: if set, count events to local CXL memory address
> + * dst_rem: if set, count events to remote memory address
> +
> +If the source filter is not specified, the PMU will count events from all root
> +ports. If the destination filter is not specified, the PMU will count events
> +to all destinations.
> +
> +Example usage:
> +
> +* Count event id 0x0 from root port 0 of PCIE RC-0 on socket 0 targeting all
> + destinations::
> +
> + perf stat -a -e nvidia_pcie_pmu_0_rc_0/event=0x0,src_rp_mask=0x1/
> +
> +* Count event id 0x1 from root port 0 and 1 of PCIE RC-1 on socket 0 and
> + targeting just local CMEM of socket 0::
> +
> + perf stat -a -e nvidia_pcie_pmu_0_rc_1/event=0x1,src_rp_mask=0x3,dst_loc_cmem=0x1/
> +
> +* Count event id 0x2 from root port 0 of PCIE RC-2 on socket 1 targeting all
> + destinations::
> +
> + perf stat -a -e nvidia_pcie_pmu_1_rc_2/event=0x2,src_rp_mask=0x1/
> +
> +* Count event id 0x3 from root port 0 and 1 of PCIE RC-3 on socket 1 and
> + targeting just local CMEM of socket 1::
> +
> + perf stat -a -e nvidia_pcie_pmu_1_rc_3/event=0x3,src_rp_mask=0x3,dst_loc_cmem=0x1/
> +
> +* Count event id 0x4 from BDF 01:01.0 of PCIE RC-4 on socket 0 targeting all
> + destinations::
> +
> + perf stat -a -e nvidia_pcie_pmu_0_rc_4/event=0x4,src_bdf=0x0108,src_bdf_en=0x1/
> +
> +Mapping the RC# to the lspci segment number can be non-trivial; hence a new NVIDIA
> +Designated Vendor-Specific Capability (DVSEC) register is added to the PCIE config space
> +of each RP. This DVSEC has vendor ID "10de" and DVSEC ID "0x4". The DVSEC register
> +contains the following information to map PCIE devices under the RP back to their RC#:
> +
> + - Bus# (byte 0xc) : bus number as reported by the lspci output
> + - Segment# (byte 0xd) : segment number as reported by the lspci output
> + - RP# (byte 0xe) : port number as reported by LnkCap attribute from lspci for a device with Root Port capability
> + - RC# (byte 0xf): root complex number associated with the RP
> + - Socket# (byte 0x10): socket number associated with the RP
> +
> +Example script for mapping lspci BDF to RC# and socket#::
> +
> + #!/bin/bash
> + while read bdf rest; do
> + dvsec4_reg=$(lspci -vv -s $bdf | awk '
> + /Designated Vendor-Specific: Vendor=10de ID=0004/ {
> + match($0, /\[([0-9a-fA-F]+)/, arr);
> + print "0x" arr[1];
> + exit
> + }
> + ')
> + if [ -n "$dvsec4_reg" ]; then
> + bus=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xc))).b)
> + segment=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xd))).b)
> + rp=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xe))).b)
> + rc=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xf))).b)
> + socket=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0x10))).b)
> + echo "$bdf: Bus=$bus, Segment=$segment, RP=$rp, RC=$rc, Socket=$socket"
> + fi
> + done < <(lspci -d 10de:)
> +
> +Example output::
> +
> + 0001:00:00.0: Bus=00, Segment=01, RP=00, RC=00, Socket=00
> + 0002:80:00.0: Bus=80, Segment=02, RP=01, RC=01, Socket=00
> + 0002:a0:00.0: Bus=a0, Segment=02, RP=02, RC=01, Socket=00
> + 0002:c0:00.0: Bus=c0, Segment=02, RP=03, RC=01, Socket=00
> + 0002:e0:00.0: Bus=e0, Segment=02, RP=04, RC=01, Socket=00
> + 0003:00:00.0: Bus=00, Segment=03, RP=00, RC=02, Socket=00
> + 0004:00:00.0: Bus=00, Segment=04, RP=00, RC=03, Socket=00
> + 0005:00:00.0: Bus=00, Segment=05, RP=00, RC=04, Socket=00
> + 0005:40:00.0: Bus=40, Segment=05, RP=01, RC=04, Socket=00
> + 0005:c0:00.0: Bus=c0, Segment=05, RP=02, RC=04, Socket=00
> + 0006:00:00.0: Bus=00, Segment=06, RP=00, RC=05, Socket=00
> + 0009:00:00.0: Bus=00, Segment=09, RP=00, RC=00, Socket=01
> + 000a:80:00.0: Bus=80, Segment=0a, RP=01, RC=01, Socket=01
> + 000a:a0:00.0: Bus=a0, Segment=0a, RP=02, RC=01, Socket=01
> + 000a:e0:00.0: Bus=e0, Segment=0a, RP=03, RC=01, Socket=01
> + 000b:00:00.0: Bus=00, Segment=0b, RP=00, RC=02, Socket=01
> + 000c:00:00.0: Bus=00, Segment=0c, RP=00, RC=03, Socket=01
> + 000d:00:00.0: Bus=00, Segment=0d, RP=00, RC=04, Socket=01
> + 000d:40:00.0: Bus=40, Segment=0d, RP=01, RC=04, Socket=01
> + 000d:c0:00.0: Bus=c0, Segment=0d, RP=02, RC=04, Socket=01
> + 000e:00:00.0: Bus=00, Segment=0e, RP=00, RC=05, Socket=01
> diff --git a/drivers/perf/arm_cspmu/nvidia_cspmu.c b/drivers/perf/arm_cspmu/nvidia_cspmu.c
> index c67667097a3c..42f11f37bddf 100644
> --- a/drivers/perf/arm_cspmu/nvidia_cspmu.c
> +++ b/drivers/perf/arm_cspmu/nvidia_cspmu.c
> static struct attribute *generic_pmu_format_attrs[] = {
> ARM_CSPMU_FORMAT_EVENT_ATTR,
> ARM_CSPMU_FORMAT_FILTER_ATTR,
> @@ -233,6 +270,32 @@ nv_cspmu_get_name(const struct arm_cspmu *cspmu)
> return ctx->name;
> }
>
> +#if defined(CONFIG_ACPI)
> +static int nv_cspmu_get_inst_id(const struct arm_cspmu *cspmu, u32 *id)
> +{
> + struct fwnode_handle *fwnode;
> + struct acpi_device *adev;
> + int ret;
> +
> + adev = arm_cspmu_acpi_dev_get(cspmu);
Not necessarily related to your patch but it would be really nice to get
clean stubs etc in place so that we can expose this code to the compiler
and then use if (IS_ENABLED(CONFIG_ACPI)) etc to provide the fallbacks.
Makes for both easier to read code and better compiler coverage.
> + if (!adev)
> + return -ENODEV;
> +
> + fwnode = acpi_fwnode_handle(adev);
> + ret = fwnode_property_read_u32(fwnode, "instance_id", id);
> + if (ret)
> + dev_err(cspmu->dev, "Failed to get instance ID\n");
> +
> + acpi_dev_put(adev);
Not necessarily a thing for this series, but would be nice to have a
DEFINE_FREE(acpi_dev_put, struct acpi_device *, if (!IS_ERR_OR_NULL(_T)) acpi_dev_put(_T));
> + return ret;
> +}
> +#else
> +static int nv_cspmu_get_inst_id(const struct arm_cspmu *cspmu, u32 *id)
> +{
> + return -EINVAL;
> +}
> +#endif
> +
> +static int pcie_v2_pmu_validate_event(struct arm_cspmu *cspmu,
> + struct perf_event *new_ev)
> +{
> + /*
> + * Make sure the events are using same BDF filter since the PCIE-SRC PMU
> + * only supports one common BDF filter setting for all of the counters.
> + */
> +
> + int idx;
> + u32 new_filter, new_rp, new_bdf, new_lead_filter, new_lead_bdf;
> + struct perf_event *leader, *new_leader;
> +
> + if (cspmu->impl.ops.is_cycle_counter_event(new_ev))
> + return 0;
> +
> + new_leader = new_ev->group_leader;
> +
> + new_filter = pcie_v2_pmu_event_filter(new_ev);
> + new_lead_filter = pcie_v2_pmu_event_filter(new_leader);
> +
> + new_bdf = pcie_v2_pmu_bdf_val_en(new_filter);
> + new_lead_bdf = pcie_v2_pmu_bdf_val_en(new_lead_filter);
> +
> + new_rp = FIELD_GET(NV_PCIE_V2_FILTER_PORT, new_filter);
> +
> + if (new_rp != 0 && new_bdf != 0) {
> + dev_err(cspmu->dev,
> + "RP and BDF filtering are mutually exclusive\n");
> + return -EINVAL;
> + }
> +
> + if (new_bdf != new_lead_bdf) {
> + dev_err(cspmu->dev,
> + "sibling and leader BDF value should be equal\n");
> + return -EINVAL;
> + }
> +
> + /* Compare BDF filter on existing events. */
> + idx = find_first_bit(cspmu->hw_events.used_ctrs,
> + cspmu->cycle_counter_logical_idx);
> +
> + if (idx != cspmu->cycle_counter_logical_idx) {
> + leader = cspmu->hw_events.events[idx]->group_leader;
> +
> + const u32 lead_filter = pcie_v2_pmu_event_filter(leader);
> + const u32 lead_bdf = pcie_v2_pmu_bdf_val_en(lead_filter);
The kernel coding standards (not necessarily written down) only commonly allow
for declarations that aren't at the top of scope when using the cleanup.h magic
(so guards, __free() and stuff like that). So here I'd pull the declaration
of leader into this scope as well.
> +
> + if (new_lead_bdf != lead_bdf) {
> + dev_err(cspmu->dev, "only one BDF value is supported\n");
> + return -EINVAL;
> + }
> + }
> +
> + return 0;
> +}
> +
> enum nv_cspmu_name_fmt {
> NAME_FMT_GENERIC,
> - NAME_FMT_SOCKET
> + NAME_FMT_SOCKET,
> + NAME_FMT_SOCKET_INST
Add the trailing comma just to avoid the extra line change like the one you just
made. The only exception to this is if the enum has a terminating entry for
counting purposes.
> };
>
> struct nv_cspmu_match {
> @@ -430,6 +601,27 @@ static const struct nv_cspmu_match nv_cspmu_match[] = {
> .init_data = NULL
> },
> },
> + {
> + .prodid = 0x10301000,
> + .prodid_mask = NV_PRODID_MASK,
> + .name_pattern = "nvidia_pcie_pmu_%u_rc_%u",
> + .name_fmt = NAME_FMT_SOCKET_INST,
> + .template_ctx = {
> + .event_attr = pcie_v2_pmu_event_attrs,
> + .format_attr = pcie_v2_pmu_format_attrs,
> + .filter_mask = NV_PCIE_V2_FILTER_ID_MASK,
> + .filter_default_val = NV_PCIE_V2_FILTER_DEFAULT,
> + .filter2_mask = NV_PCIE_V2_FILTER2_ID_MASK,
> + .filter2_default_val = NV_PCIE_V2_FILTER2_DEFAULT,
> + .get_filter = pcie_v2_pmu_event_filter,
> + .get_filter2 = nv_cspmu_event_filter2,
> + .init_data = NULL
A side note that I didn't put in the previous similar case.
If a NULL is an 'obvious' default, it is also acceptable to not set
it at all and rely on the C spec to ensure it is set to NULL.
> + },
> + .ops = {
> + .validate_event = pcie_v2_pmu_validate_event,
> + .reset_ev_filter = nv_cspmu_reset_ev_filter,
> + }
> + },
> {
> .prodid = 0,
> .prodid_mask = 0,
> @@ -453,7 +645,7 @@ static const struct nv_cspmu_match nv_cspmu_match[] = {
> static char *nv_cspmu_format_name(const struct arm_cspmu *cspmu,
> const struct nv_cspmu_match *match)
> {
> - char *name;
> + char *name = NULL;
> struct device *dev = cspmu->dev;
>
> static atomic_t pmu_generic_idx = {0};
> @@ -467,13 +659,20 @@ static char *nv_cspmu_format_name(const struct arm_cspmu *cspmu,
> socket);
> break;
> }
> + case NAME_FMT_SOCKET_INST: {
> + const int cpu = cpumask_first(&cspmu->associated_cpus);
> + const int socket = cpu_to_node(cpu);
> + u32 inst_id;
> +
> + if (!nv_cspmu_get_inst_id(cspmu, &inst_id))
> + name = devm_kasprintf(dev, GFP_KERNEL,
> + match->name_pattern, socket, inst_id);
> + break;
> + }
> case NAME_FMT_GENERIC:
> name = devm_kasprintf(dev, GFP_KERNEL, match->name_pattern,
> atomic_fetch_inc(&pmu_generic_idx));
> break;
> - default:
Why this change? To me it doesn't add any particular clarity and is
unrelated to the rest of the patch.
> - name = NULL;
> - break;
> }
>
> return name;
> @@ -514,8 +713,10 @@ static int nv_cspmu_init_ops(struct arm_cspmu *cspmu)
> cspmu->impl.ctx = ctx;
>
> /* NVIDIA specific callbacks. */
> + SET_OP(validate_event, impl_ops, match, NULL);
> SET_OP(set_cc_filter, impl_ops, match, nv_cspmu_set_cc_filter);
> SET_OP(set_ev_filter, impl_ops, match, nv_cspmu_set_ev_filter);
> + SET_OP(reset_ev_filter, impl_ops, match, NULL);
> SET_OP(get_event_attrs, impl_ops, match, nv_cspmu_get_event_attrs);
> SET_OP(get_format_attrs, impl_ops, match, nv_cspmu_get_format_attrs);
> SET_OP(get_name, impl_ops, match, nv_cspmu_get_name);
^ permalink raw reply	[flat|nested] 18+ messages in thread
* RE: [PATCH v2 4/8] perf/arm_cspmu: nvidia: Add Tegra410 PCIE PMU
2026-02-19 10:06 ` Jonathan Cameron
@ 2026-03-05 23:59 ` Besar Wicaksono
0 siblings, 0 replies; 18+ messages in thread
From: Besar Wicaksono @ 2026-03-05 23:59 UTC (permalink / raw)
To: Jonathan Cameron
Cc: will@kernel.org, suzuki.poulose@arm.com, robin.murphy@arm.com,
ilkka@os.amperecomputing.com,
linux-arm-kernel@lists.infradead.org,
linux-kernel@vger.kernel.org, linux-tegra@vger.kernel.org,
mark.rutland@arm.com, Thierry Reding, Jon Hunter, Vikram Sethi,
Rich Wiley, Shanker Donthineni, Sean Kelley, Yifei Wan, Matt Ochs,
Nirmoy Das, Bjorn Helgaas, linux-pci@vger.kernel.org, Yushan Wang,
shiju.jose@huawei.com
Hi Jonathan,
Thanks for your suggestions, please see my comments inline.
> -----Original Message-----
> From: Jonathan Cameron <jonathan.cameron@huawei.com>
> Sent: Thursday, February 19, 2026 4:07 AM
> To: Besar Wicaksono <bwicaksono@nvidia.com>
> Cc: will@kernel.org; suzuki.poulose@arm.com; robin.murphy@arm.com;
> ilkka@os.amperecomputing.com; linux-arm-kernel@lists.infradead.org;
> linux-kernel@vger.kernel.org; linux-tegra@vger.kernel.org; mark.rutland@arm.com;
> Thierry Reding <treding@nvidia.com>; Jon Hunter <jonathanh@nvidia.com>;
> Vikram Sethi <vsethi@nvidia.com>; Rich Wiley <rwiley@nvidia.com>; Shanker
> Donthineni <sdonthineni@nvidia.com>; Sean Kelley <skelley@nvidia.com>;
> Yifei Wan <ywan@nvidia.com>; Matt Ochs <mochs@nvidia.com>; Nirmoy Das
> <nirmoyd@nvidia.com>; Bjorn Helgaas <bhelgaas@google.com>;
> linux-pci@vger.kernel.org; Yushan Wang <wangyushan12@huawei.com>;
> shiju.jose@huawei.com
> Subject: Re: [PATCH v2 4/8] perf/arm_cspmu: nvidia: Add Tegra410 PCIE PMU
>
> External email: Use caution opening links or attachments
>
>
> On Wed, 18 Feb 2026 14:58:05 +0000
> Besar Wicaksono <bwicaksono@nvidia.com> wrote:
>
> > Adds PCIE PMU support in Tegra410 SOC. This PMU is instanced
> > in each root complex in the SOC and can capture traffic from
> > PCIE device to various memory types. This PMU can filter traffic
> > based on the originating root port or BDF and the target memory
> > types (CPU DRAM, GPU Memory, CXL Memory, or remote Memory).
> >
> > Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> > Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
>
> Given I've added a bunch of +CC I've left all your patch in place rather
> than cropping to just what I've commented on.
>
> Great to see another PCIe related PMU, but this is certainly showing
> the diversity in what such things are!
>
> I've expressed a few times that it would be really nice if a standard
> PCI-centric definition would come from the PCI-SIG (similar to the one
> that CXL has) but what you have here is, I think, monitoring certain
> types of accesses closer to the CPU interconnect side of the RC than
> such a spec would cover. As mentioned below I've +CC various people who
> will be interested in this. Please keep them cc'd on v3.
>
That is correct, this PMU is more on the SOC fabric side connecting the
PCIE RC and the memory subsystem.
> > ---
> > .../admin-guide/perf/nvidia-tegra410-pmu.rst | 162 ++++++++++++++
> > drivers/perf/arm_cspmu/nvidia_cspmu.c | 211 +++++++++++++++++-
> > 2 files changed, 368 insertions(+), 5 deletions(-)
> >
> > diff --git a/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst b/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
> > index 7b7ba5700ca1..8528685ddb61 100644
> > --- a/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
> > +++ b/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
> > @@ -6,6 +6,7 @@ The NVIDIA Tegra410 SoC includes various system PMUs to measure key performance
> > metrics like memory bandwidth, latency, and utilization:
> >
> > * Unified Coherence Fabric (UCF)
> > +* PCIE
>
> It's interesting to see what people put in their PCIe related PMUs.
> Seems we are getting a bit of a split into those focused on the SoC side
> of the host bridge and those focused on the PCI protocol stuff (so
> counting TLPs, FLITs, Retries etc).
>
> I don't suppose it matters that much, but maybe we need to think about
> some suitable terminology..
>
> I've +CC linux-pci and Bjorn as those are the folk who are most likely
> to comment on generalization aspects of PCIe PMUs.
> >
> > PMU Driver
> > ----------
> > @@ -104,3 +105,164 @@ Example usage:
> > destination filter = remote memory::
> >
> > + perf stat -a -e nvidia_ucf_pmu_1/event=0x0,src_loc_noncpu=0x1,dst_rem=0x1/
> > +
> > +PCIE PMU
> > +--------
> > +
> > +This PMU monitors all read/write traffic from the root port(s) or a particular
> > +BDF in a PCIE root complex (RC) to local or remote memory. There is one PMU per
> > +PCIE RC in the SoC. Each RC can have up to 16 lanes that can be bifurcated into
> > +up to 8 root ports. The traffic from each root port can be filtered using RP or
> > +BDF filter. For example, specifying "src_rp_mask=0xFF" means the PMU counter will
> > +capture traffic from all RPs. Please see below for more details.
> > +
> > +The events and configuration options of this PMU device are described in sysfs,
> > +see /sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>.
> > +
> > +The events in this PMU can be used to measure bandwidth, utilization, and
> > +latency:
> > +
> > + * rd_req: count the number of read requests by PCIE device.
> > + * wr_req: count the number of write requests by PCIE device.
> > + * rd_bytes: count the number of bytes transferred by rd_req.
> > + * wr_bytes: count the number of bytes transferred by wr_req.
> > + * rd_cum_outs: count outstanding rd_req each cycle.
> > + * cycles: counts the PCIE cycles.
>
> This maybe needs a tighter definition. Too many types of cycle
> involved in PCIe IPs.
>
Yeah, this is supposed to be the clock cycles of the SOC fabric.
I will fix it on V3.
> Would also be good to see how this driver fits with the efforts for
> a generic perf iostat
> https://lore.kernel.org/all/20260126123514.3238425-1-wangyushan12@huawei.com/
>
> (Added wangyushan and shiju to +CC)
>
> > +
> > +The average bandwidth is calculated as::
> > +
> > + AVG_RD_BANDWIDTH_IN_GBPS = RD_BYTES / ELAPSED_TIME_IN_NS
> > + AVG_WR_BANDWIDTH_IN_GBPS = WR_BYTES / ELAPSED_TIME_IN_NS
> > +
> > +The average request rate is calculated as::
> > +
> > + AVG_RD_REQUEST_RATE = RD_REQ / CYCLES
> > + AVG_WR_REQUEST_RATE = WR_REQ / CYCLES
> > +
> > +
> > +The average latency is calculated as::
> > +
> > + FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
> > + AVG_LATENCY_IN_CYCLES = RD_CUM_OUTS / RD_REQ
> > + AVERAGE_LATENCY_IN_NS = AVG_LATENCY_IN_CYCLES / FREQ_IN_GHZ
> > +
> > +The PMU events can be filtered based on the traffic source and destination.
> > +The source filter indicates the PCIE devices that will be monitored. The
> > +destination filter specifies the destination memory type, e.g. local system
> > +memory (CMEM), local GPU memory (GMEM), or remote memory. The local/remote
> > +classification of the destination filter is based on the home socket of the
> > +address, not where the data actually resides. These filters can be found in
> > +/sys/bus/event_source/devices/nvidia_pcie_pmu_<socket-id>_rc_<pcie-rc-id>/format/.
> > +
> > +The list of event filters:
> > +
> > +* Source filter:
> > +
> > + * src_rp_mask: bitmask of root ports that will be monitored. Each bit in this
> > + bitmask represents the RP index in the RC. If the bit is set, all devices under
> > + the associated RP will be monitored. E.g "src_rp_mask=0xF" will monitor
> > + devices in root port 0 to 3.
> > + * src_bdf: the BDF that will be monitored. This is a 16-bit value that
> > + follows formula: (bus << 8) + (device << 3) + (function). For example, the
> > + value of BDF 27:01.1 is 0x2781.
> > + * src_bdf_en: enable the BDF filter. If this is set, the BDF filter value in
> > + "src_bdf" is used to filter the traffic.
> > +
> > + Note that Root-Port and BDF filters are mutually exclusive and the PMU in
> > + each RC can only have one BDF filter for the whole counters. If BDF filter
> > + is enabled, the BDF filter value will be applied to all events.
> > +
> > +* Destination filter:
> > +
> > + * dst_loc_cmem: if set, count events to local system memory (CMEM) address
> > + * dst_loc_gmem: if set, count events to local GPU memory (GMEM) address
> > + * dst_loc_pcie_p2p: if set, count events to local PCIE peer address
> > + * dst_loc_pcie_cxl: if set, count events to local CXL memory address
> > + * dst_rem: if set, count events to remote memory address
> > +
> > +If the source filter is not specified, the PMU will count events from all root
> > +ports. If the destination filter is not specified, the PMU will count events
> > +to all destinations.
> > +
> > +Example usage:
> > +
> > +* Count event id 0x0 from root port 0 of PCIE RC-0 on socket 0 targeting all
> > + destinations::
> > +
> > + perf stat -a -e nvidia_pcie_pmu_0_rc_0/event=0x0,src_rp_mask=0x1/
> > +
> > +* Count event id 0x1 from root port 0 and 1 of PCIE RC-1 on socket 0 and
> > + targeting just local CMEM of socket 0::
> > +
> > + perf stat -a -e nvidia_pcie_pmu_0_rc_1/event=0x1,src_rp_mask=0x3,dst_loc_cmem=0x1/
> > +
> > +* Count event id 0x2 from root port 0 of PCIE RC-2 on socket 1 targeting all
> > + destinations::
> > +
> > + perf stat -a -e nvidia_pcie_pmu_1_rc_2/event=0x2,src_rp_mask=0x1/
> > +
> > +* Count event id 0x3 from root port 0 and 1 of PCIE RC-3 on socket 1 and
> > + targeting just local CMEM of socket 1::
> > +
> > + perf stat -a -e nvidia_pcie_pmu_1_rc_3/event=0x3,src_rp_mask=0x3,dst_loc_cmem=0x1/
> > +
> > +* Count event id 0x4 from BDF 01:01.0 of PCIE RC-4 on socket 0 targeting all
> > + destinations::
> > +
> > + perf stat -a -e nvidia_pcie_pmu_0_rc_4/event=0x4,src_bdf=0x0180,src_bdf_en=0x1/
> > +
> > +Mapping the RC# to lspci segment number can be non-trivial; hence a new NVIDIA
> > +Designated Vendor Specific Capability (DVSEC) register is added into the PCIE config space
> > +for each RP. This DVSEC has vendor id "10de" and DVSEC id of "0x4". The DVSEC register
> > +contains the following information to map PCIE devices under the RP back to its RC# :
> > +
> > + - Bus# (byte 0xc) : bus number as reported by the lspci output
> > + - Segment# (byte 0xd) : segment number as reported by the lspci output
> > + - RP# (byte 0xe) : port number as reported by LnkCap attribute from lspci for a device with Root Port capability
> > + - RC# (byte 0xf): root complex number associated with the RP
> > + - Socket# (byte 0x10): socket number associated with the RP
> > +
> > +Example script for mapping lspci BDF to RC# and socket#::
> > +
> > + #!/bin/bash
> > + while read bdf rest; do
> > + dvsec4_reg=$(lspci -vv -s $bdf | awk '
> > + /Designated Vendor-Specific: Vendor=10de ID=0004/ {
> > + match($0, /\[([0-9a-fA-F]+)/, arr);
> > + print "0x" arr[1];
> > + exit
> > + }
> > + ')
> > + if [ -n "$dvsec4_reg" ]; then
> > + bus=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xc))).b)
> > + segment=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xd))).b)
> > + rp=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xe))).b)
> > + rc=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0xf))).b)
> > + socket=$(setpci -s $bdf $(printf '0x%x' $((${dvsec4_reg} + 0x10))).b)
> > + echo "$bdf: Bus=$bus, Segment=$segment, RP=$rp, RC=$rc, Socket=$socket"
> > + fi
> > + done < <(lspci -d 10de:)
> > +
> > +Example output::
> > +
> > + 0001:00:00.0: Bus=00, Segment=01, RP=00, RC=00, Socket=00
> > + 0002:80:00.0: Bus=80, Segment=02, RP=01, RC=01, Socket=00
> > + 0002:a0:00.0: Bus=a0, Segment=02, RP=02, RC=01, Socket=00
> > + 0002:c0:00.0: Bus=c0, Segment=02, RP=03, RC=01, Socket=00
> > + 0002:e0:00.0: Bus=e0, Segment=02, RP=04, RC=01, Socket=00
> > + 0003:00:00.0: Bus=00, Segment=03, RP=00, RC=02, Socket=00
> > + 0004:00:00.0: Bus=00, Segment=04, RP=00, RC=03, Socket=00
> > + 0005:00:00.0: Bus=00, Segment=05, RP=00, RC=04, Socket=00
> > + 0005:40:00.0: Bus=40, Segment=05, RP=01, RC=04, Socket=00
> > + 0005:c0:00.0: Bus=c0, Segment=05, RP=02, RC=04, Socket=00
> > + 0006:00:00.0: Bus=00, Segment=06, RP=00, RC=05, Socket=00
> > + 0009:00:00.0: Bus=00, Segment=09, RP=00, RC=00, Socket=01
> > + 000a:80:00.0: Bus=80, Segment=0a, RP=01, RC=01, Socket=01
> > + 000a:a0:00.0: Bus=a0, Segment=0a, RP=02, RC=01, Socket=01
> > + 000a:e0:00.0: Bus=e0, Segment=0a, RP=03, RC=01, Socket=01
> > + 000b:00:00.0: Bus=00, Segment=0b, RP=00, RC=02, Socket=01
> > + 000c:00:00.0: Bus=00, Segment=0c, RP=00, RC=03, Socket=01
> > + 000d:00:00.0: Bus=00, Segment=0d, RP=00, RC=04, Socket=01
> > + 000d:40:00.0: Bus=40, Segment=0d, RP=01, RC=04, Socket=01
> > + 000d:c0:00.0: Bus=c0, Segment=0d, RP=02, RC=04, Socket=01
> > + 000e:00:00.0: Bus=00, Segment=0e, RP=00, RC=05, Socket=01
> > diff --git a/drivers/perf/arm_cspmu/nvidia_cspmu.c b/drivers/perf/arm_cspmu/nvidia_cspmu.c
> > index c67667097a3c..42f11f37bddf 100644
> > --- a/drivers/perf/arm_cspmu/nvidia_cspmu.c
> > +++ b/drivers/perf/arm_cspmu/nvidia_cspmu.c
>
> > static struct attribute *generic_pmu_format_attrs[] = {
> > ARM_CSPMU_FORMAT_EVENT_ATTR,
> > ARM_CSPMU_FORMAT_FILTER_ATTR,
> > @@ -233,6 +270,32 @@ nv_cspmu_get_name(const struct arm_cspmu *cspmu)
> > return ctx->name;
> > }
> >
> > +#if defined(CONFIG_ACPI)
> > +static int nv_cspmu_get_inst_id(const struct arm_cspmu *cspmu, u32 *id)
> > +{
> > + struct fwnode_handle *fwnode;
> > + struct acpi_device *adev;
> > + int ret;
> > +
> > + adev = arm_cspmu_acpi_dev_get(cspmu);
> Not necessarily related to your patch, but it would be really nice to get
> clean stubs etc. in place so that we can expose this code to the compiler
> but then use if (IS_ENABLED(CONFIG_ACPI)) etc. to provide the fallbacks.
>
> Makes for both easier-to-read code and better compiler coverage.
>
> > + if (!adev)
> > + return -ENODEV;
> > +
> > + fwnode = acpi_fwnode_handle(adev);
> > + ret = fwnode_property_read_u32(fwnode, "instance_id", id);
> > + if (ret)
> > + dev_err(cspmu->dev, "Failed to get instance ID\n");
> > +
> > + acpi_dev_put(adev);
>
> Not necessarily a thing for this series, but would be nice to have a
> DEFINE_FREE(acpi_dev_put, struct acpi_device *, if (!IS_ERR_OR_NULL(_T)) acpi_dev_put(_T));
>
> > + return ret;
> > +}
> > +#else
> > +static int nv_cspmu_get_inst_id(const struct arm_cspmu *cspmu, u32 *id)
> > +{
> > + return -EINVAL;
> > +}
> > +#endif
>
> > +
> > +static int pcie_v2_pmu_validate_event(struct arm_cspmu *cspmu,
> > + struct perf_event *new_ev)
> > +{
> > + /*
> > + * Make sure the events are using same BDF filter since the PCIE-SRC PMU
> > + * only supports one common BDF filter setting for all of the counters.
> > + */
> > +
> > + int idx;
> > + u32 new_filter, new_rp, new_bdf, new_lead_filter, new_lead_bdf;
> > + struct perf_event *leader, *new_leader;
> > +
> > + if (cspmu->impl.ops.is_cycle_counter_event(new_ev))
> > + return 0;
> > +
> > + new_leader = new_ev->group_leader;
> > +
> > + new_filter = pcie_v2_pmu_event_filter(new_ev);
> > + new_lead_filter = pcie_v2_pmu_event_filter(new_leader);
> > +
> > + new_bdf = pcie_v2_pmu_bdf_val_en(new_filter);
> > + new_lead_bdf = pcie_v2_pmu_bdf_val_en(new_lead_filter);
> > +
> > + new_rp = FIELD_GET(NV_PCIE_V2_FILTER_PORT, new_filter);
> > +
> > + if (new_rp != 0 && new_bdf != 0) {
> > + dev_err(cspmu->dev,
> > + "RP and BDF filtering are mutually exclusive\n");
> > + return -EINVAL;
> > + }
> > +
> > + if (new_bdf != new_lead_bdf) {
> > + dev_err(cspmu->dev,
> > + "sibling and leader BDF value should be equal\n");
> > + return -EINVAL;
> > + }
> > +
> > + /* Compare BDF filter on existing events. */
> > + idx = find_first_bit(cspmu->hw_events.used_ctrs,
> > + cspmu->cycle_counter_logical_idx);
> > +
> > + if (idx != cspmu->cycle_counter_logical_idx) {
> > + leader = cspmu->hw_events.events[idx]->group_leader;
> > +
> > + const u32 lead_filter = pcie_v2_pmu_event_filter(leader);
> > + const u32 lead_bdf = pcie_v2_pmu_bdf_val_en(lead_filter);
>
> The kernel coding standards (not necessarily written down) only commonly allow
> for declarations that aren't at the top of scope when using the cleanup.h magic
> (so guards, __free() and stuff like that). So here I'd pull the declaration
> of leader into this scope as well.
>
>
Sure, will do on V3.
> > +
> > + if (new_lead_bdf != lead_bdf) {
> > + dev_err(cspmu->dev, "only one BDF value is supported\n");
> > + return -EINVAL;
> > + }
> > + }
> > +
> > + return 0;
> > +}
> > +
> > enum nv_cspmu_name_fmt {
> > NAME_FMT_GENERIC,
> > - NAME_FMT_SOCKET
> > + NAME_FMT_SOCKET,
> > + NAME_FMT_SOCKET_INST
>
> Add the trailing comma just to avoid the extra line change like the one you just
> made. The only exception to this is if the enum has a terminating entry for
> counting purposes.
>
Sure, will do on V3.
> > };
> >
> > struct nv_cspmu_match {
> > @@ -430,6 +601,27 @@ static const struct nv_cspmu_match nv_cspmu_match[] = {
> > .init_data = NULL
> > },
> > },
> > + {
> > + .prodid = 0x10301000,
> > + .prodid_mask = NV_PRODID_MASK,
> > + .name_pattern = "nvidia_pcie_pmu_%u_rc_%u",
> > + .name_fmt = NAME_FMT_SOCKET_INST,
> > + .template_ctx = {
> > + .event_attr = pcie_v2_pmu_event_attrs,
> > + .format_attr = pcie_v2_pmu_format_attrs,
> > + .filter_mask = NV_PCIE_V2_FILTER_ID_MASK,
> > + .filter_default_val = NV_PCIE_V2_FILTER_DEFAULT,
> > + .filter2_mask = NV_PCIE_V2_FILTER2_ID_MASK,
> > + .filter2_default_val = NV_PCIE_V2_FILTER2_DEFAULT,
> > + .get_filter = pcie_v2_pmu_event_filter,
> > + .get_filter2 = nv_cspmu_event_filter2,
> > + .init_data = NULL
>
> A side note that I didn't put in the previous similar case.
> If a NULL is an 'obvious' default, it is also acceptable to not set
> it at all and rely on the C spec to ensure it is set to NULL.
>
> > + },
> > + .ops = {
> > + .validate_event = pcie_v2_pmu_validate_event,
> > + .reset_ev_filter = nv_cspmu_reset_ev_filter,
> > + }
> > + },
> > {
> > .prodid = 0,
> > .prodid_mask = 0,
> > @@ -453,7 +645,7 @@ static const struct nv_cspmu_match nv_cspmu_match[] = {
> > static char *nv_cspmu_format_name(const struct arm_cspmu *cspmu,
> > const struct nv_cspmu_match *match)
> > {
> > - char *name;
> > + char *name = NULL;
> > struct device *dev = cspmu->dev;
> >
> > static atomic_t pmu_generic_idx = {0};
> > @@ -467,13 +659,20 @@ static char *nv_cspmu_format_name(const struct arm_cspmu *cspmu,
> > socket);
> > break;
> > }
> > + case NAME_FMT_SOCKET_INST: {
> > + const int cpu = cpumask_first(&cspmu->associated_cpus);
> > + const int socket = cpu_to_node(cpu);
> > + u32 inst_id;
> > +
> > + if (!nv_cspmu_get_inst_id(cspmu, &inst_id))
> > + name = devm_kasprintf(dev, GFP_KERNEL,
> > + match->name_pattern, socket, inst_id);
> > + break;
> > + }
> > case NAME_FMT_GENERIC:
> > name = devm_kasprintf(dev, GFP_KERNEL, match->name_pattern,
> > atomic_fetch_inc(&pmu_generic_idx));
> > break;
> > - default:
>
> Why this change? To me it doesn't add any particular clarity and is
> unrelated to the rest of the patch.
>
I changed the name initialization to NULL, so the default case handling is no longer
needed.
Regards,
Besar
> > - name = NULL;
> > - break;
> > }
> >
> > return name;
> > @@ -514,8 +713,10 @@ static int nv_cspmu_init_ops(struct arm_cspmu *cspmu)
> > cspmu->impl.ctx = ctx;
> >
> > /* NVIDIA specific callbacks. */
> > + SET_OP(validate_event, impl_ops, match, NULL);
> > SET_OP(set_cc_filter, impl_ops, match, nv_cspmu_set_cc_filter);
> > SET_OP(set_ev_filter, impl_ops, match, nv_cspmu_set_ev_filter);
> > + SET_OP(reset_ev_filter, impl_ops, match, NULL);
> > SET_OP(get_event_attrs, impl_ops, match, nv_cspmu_get_event_attrs);
> > SET_OP(get_format_attrs, impl_ops, match, nv_cspmu_get_format_attrs);
> > SET_OP(get_name, impl_ops, match, nv_cspmu_get_name);
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v2 5/8] perf/arm_cspmu: nvidia: Add Tegra410 PCIE-TGT PMU
2026-02-18 14:58 [PATCH v2 0/8] perf: add NVIDIA Tegra410 Uncore PMU support Besar Wicaksono
` (3 preceding siblings ...)
2026-02-18 14:58 ` [PATCH v2 4/8] perf/arm_cspmu: nvidia: Add Tegra410 PCIE PMU Besar Wicaksono
@ 2026-02-18 14:58 ` Besar Wicaksono
2026-02-19 10:10 ` Jonathan Cameron
2026-02-18 14:58 ` [PATCH v2 6/8] perf: add NVIDIA Tegra410 CPU Memory Latency PMU Besar Wicaksono
` (2 subsequent siblings)
7 siblings, 1 reply; 18+ messages in thread
From: Besar Wicaksono @ 2026-02-18 14:58 UTC (permalink / raw)
To: will, suzuki.poulose, robin.murphy, ilkka
Cc: linux-arm-kernel, linux-kernel, linux-tegra, mark.rutland,
treding, jonathanh, vsethi, rwiley, sdonthineni, skelley, ywan,
mochs, nirmoyd, Besar Wicaksono
Adds PCIE-TGT PMU support in Tegra410 SOC. This PMU is
instanced in each root complex in the SOC and it captures
traffic originating from any source towards PCIE BAR and CXL
HDM range. The traffic can be filtered based on the
destination root port or target address range.
Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
---
.../admin-guide/perf/nvidia-tegra410-pmu.rst | 76 +++++
drivers/perf/arm_cspmu/nvidia_cspmu.c | 323 ++++++++++++++++++
2 files changed, 399 insertions(+)
diff --git a/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst b/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
index 8528685ddb61..07dc447eead7 100644
--- a/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
+++ b/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
@@ -7,6 +7,7 @@ metrics like memory bandwidth, latency, and utilization:
* Unified Coherence Fabric (UCF)
* PCIE
+* PCIE-TGT
PMU Driver
----------
@@ -211,6 +212,11 @@ Example usage:
perf stat -a -e nvidia_pcie_pmu_0_rc_4/event=0x4,src_bdf=0x0180,src_bdf_en=0x1/
+.. _NVIDIA_T410_PCIE_PMU_RC_Mapping_Section:
+
+Mapping the RC# to lspci segment number
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
Mapping the RC# to lspci segment number can be non-trivial; hence a new NVIDIA
Designated Vendor Specific Capability (DVSEC) register is added into the PCIE config space
for each RP. This DVSEC has vendor id "10de" and DVSEC id of "0x4". The DVSEC register
@@ -266,3 +272,73 @@ Example output::
000d:40:00.0: Bus=40, Segment=0d, RP=01, RC=04, Socket=01
000d:c0:00.0: Bus=c0, Segment=0d, RP=02, RC=04, Socket=01
000e:00:00.0: Bus=00, Segment=0e, RP=00, RC=05, Socket=01
+
+PCIE-TGT PMU
+------------
+
+The PCIE-TGT PMU monitors traffic targeting PCIE BAR and CXL HDM ranges.
+There is one PCIE-TGT PMU per PCIE root complex (RC) in the SoC. Each RC in
+Tegra410 SoC can have up to 16 lanes that can be bifurcated into up to 8 root
+ports (RP). The PMU provides RP filter to count PCIE BAR traffic to each RP and
+address filter to count access to PCIE BAR or CXL HDM ranges. The details
+of the filters are described in the following sections.
+
+Mapping the RC# to lspci segment number is similar to the PCIE PMU.
+Please see :ref:`NVIDIA_T410_PCIE_PMU_RC_Mapping_Section` for more info.
+
+The events and configuration options of this PMU device are available in sysfs,
+see /sys/bus/event_source/devices/nvidia_pcie_tgt_pmu_<socket-id>_rc_<pcie-rc-id>.
+
+The events in this PMU can be used to measure bandwidth and utilization:
+
+ * rd_req: count the number of read requests to PCIE.
+ * wr_req: count the number of write requests to PCIE.
+ * rd_bytes: count the number of bytes transferred by rd_req.
+ * wr_bytes: count the number of bytes transferred by wr_req.
+ * cycles: counts the PCIE cycles.
+
+The average bandwidth is calculated as::
+
+ AVG_RD_BANDWIDTH_IN_GBPS = RD_BYTES / ELAPSED_TIME_IN_NS
+ AVG_WR_BANDWIDTH_IN_GBPS = WR_BYTES / ELAPSED_TIME_IN_NS
+
+The average request rate is calculated as::
+
+ AVG_RD_REQUEST_RATE = RD_REQ / CYCLES
+ AVG_WR_REQUEST_RATE = WR_REQ / CYCLES
+
+The PMU events can be filtered based on the destination root port or target
+address range. Filtering based on RP is only available for PCIE BAR traffic.
+Address filter works for both PCIE BAR and CXL HDM ranges. These filters can be
+found in sysfs, see
+/sys/bus/event_source/devices/nvidia_pcie_tgt_pmu_<socket-id>_rc_<pcie-rc-id>/format/.
+
+Destination filter settings:
+
+* dst_rp_mask: bitmask to select the root port(s) to monitor. E.g. "dst_rp_mask=0xFF"
+ corresponds to all root ports (from 0 to 7) in the PCIE RC. Note that this filter is
+ only available for PCIE BAR traffic.
+* dst_addr_base: BAR or CXL HDM filter base address.
+* dst_addr_mask: BAR or CXL HDM filter address mask.
+* dst_addr_en: enable BAR or CXL HDM address range filter. If this is set, the
+ address range specified by "dst_addr_base" and "dst_addr_mask" will be used to filter
+ the PCIE BAR and CXL HDM traffic address. The PMU uses the following comparison
+ to determine if the traffic destination address falls within the filter range::
+
+ (txn's addr & dst_addr_mask) == (dst_addr_base & dst_addr_mask)
+
+ If the comparison succeeds, then the event will be counted.
+
+If the destination filter is not specified, the RP filter will be configured by default
+to count PCIE BAR traffic to all root ports.
+
+Example usage:
+
+* Count event id 0x0 to root port 0 and 1 of PCIE RC-0 on socket 0::
+
+ perf stat -a -e nvidia_pcie_tgt_pmu_0_rc_0/event=0x0,dst_rp_mask=0x3/
+
+* Count event id 0x1 for accesses to PCIE BAR or CXL HDM address range
+ 0x10000 to 0x100FF on socket 0's PCIE RC-1::
+
+ perf stat -a -e nvidia_pcie_tgt_pmu_0_rc_1/event=0x1,dst_addr_base=0x10000,dst_addr_mask=0xFFF00,dst_addr_en=0x1/
diff --git a/drivers/perf/arm_cspmu/nvidia_cspmu.c b/drivers/perf/arm_cspmu/nvidia_cspmu.c
index 42f11f37bddf..25c408b56dc8 100644
--- a/drivers/perf/arm_cspmu/nvidia_cspmu.c
+++ b/drivers/perf/arm_cspmu/nvidia_cspmu.c
@@ -42,6 +42,24 @@
#define NV_PCIE_V2_FILTER2_DST GENMASK_ULL(NV_PCIE_V2_DST_COUNT - 1, 0)
#define NV_PCIE_V2_FILTER2_DEFAULT NV_PCIE_V2_FILTER2_DST
+#define NV_PCIE_TGT_PORT_COUNT 8ULL
+#define NV_PCIE_TGT_EV_TYPE_CC 0x4
+#define NV_PCIE_TGT_EV_TYPE_COUNT 3ULL
+#define NV_PCIE_TGT_EV_TYPE_MASK GENMASK_ULL(NV_PCIE_TGT_EV_TYPE_COUNT - 1, 0)
+#define NV_PCIE_TGT_FILTER2_MASK GENMASK_ULL(NV_PCIE_TGT_PORT_COUNT, 0)
+#define NV_PCIE_TGT_FILTER2_PORT GENMASK_ULL(NV_PCIE_TGT_PORT_COUNT - 1, 0)
+#define NV_PCIE_TGT_FILTER2_ADDR_EN BIT(NV_PCIE_TGT_PORT_COUNT)
+#define NV_PCIE_TGT_FILTER2_ADDR GENMASK_ULL(15, NV_PCIE_TGT_PORT_COUNT)
+#define NV_PCIE_TGT_FILTER2_DEFAULT NV_PCIE_TGT_FILTER2_PORT
+
+#define NV_PCIE_TGT_ADDR_COUNT 8ULL
+#define NV_PCIE_TGT_ADDR_STRIDE 20
+#define NV_PCIE_TGT_ADDR_CTRL 0xD38
+#define NV_PCIE_TGT_ADDR_BASE_LO 0xD3C
+#define NV_PCIE_TGT_ADDR_BASE_HI 0xD40
+#define NV_PCIE_TGT_ADDR_MASK_LO 0xD44
+#define NV_PCIE_TGT_ADDR_MASK_HI 0xD48
+
#define NV_GENERIC_FILTER_ID_MASK GENMASK_ULL(31, 0)
#define NV_PRODID_MASK (PMIIDR_PRODUCTID | PMIIDR_VARIANT | PMIIDR_REVISION)
@@ -186,6 +204,15 @@ static struct attribute *pcie_v2_pmu_event_attrs[] = {
NULL,
};
+static struct attribute *pcie_tgt_pmu_event_attrs[] = {
+ ARM_CSPMU_EVENT_ATTR(rd_bytes, 0x0),
+ ARM_CSPMU_EVENT_ATTR(wr_bytes, 0x1),
+ ARM_CSPMU_EVENT_ATTR(rd_req, 0x2),
+ ARM_CSPMU_EVENT_ATTR(wr_req, 0x3),
+ ARM_CSPMU_EVENT_ATTR(cycles, NV_PCIE_TGT_EV_TYPE_CC),
+ NULL,
+};
+
static struct attribute *generic_pmu_event_attrs[] = {
ARM_CSPMU_EVENT_ATTR(cycles, ARM_CSPMU_EVT_CYCLES_DEFAULT),
NULL,
@@ -239,6 +266,15 @@ static struct attribute *pcie_v2_pmu_format_attrs[] = {
NULL,
};
+static struct attribute *pcie_tgt_pmu_format_attrs[] = {
+ ARM_CSPMU_FORMAT_ATTR(event, "config:0-2"),
+ ARM_CSPMU_FORMAT_ATTR(dst_rp_mask, "config:3-10"),
+ ARM_CSPMU_FORMAT_ATTR(dst_addr_en, "config:11"),
+ ARM_CSPMU_FORMAT_ATTR(dst_addr_base, "config1:0-63"),
+ ARM_CSPMU_FORMAT_ATTR(dst_addr_mask, "config2:0-63"),
+ NULL,
+};
+
static struct attribute *generic_pmu_format_attrs[] = {
ARM_CSPMU_FORMAT_EVENT_ATTR,
ARM_CSPMU_FORMAT_FILTER_ATTR,
@@ -478,6 +514,267 @@ static int pcie_v2_pmu_validate_event(struct arm_cspmu *cspmu,
return 0;
}
+struct pcie_tgt_addr_filter {
+ u32 refcount;
+ u64 base;
+ u64 mask;
+};
+
+struct pcie_tgt_data {
+ struct pcie_tgt_addr_filter addr_filter[NV_PCIE_TGT_ADDR_COUNT];
+ void __iomem *addr_filter_reg;
+};
+
+#if defined(CONFIG_ACPI)
+static int pcie_tgt_init_data(struct arm_cspmu *cspmu)
+{
+ int ret;
+ struct acpi_device *adev;
+ struct pcie_tgt_data *data;
+ struct list_head resource_list;
+ struct resource_entry *rentry;
+ struct nv_cspmu_ctx *ctx = to_nv_cspmu_ctx(cspmu);
+ struct device *dev = cspmu->dev;
+
+ data = devm_kzalloc(dev, sizeof(struct pcie_tgt_data), GFP_KERNEL);
+ if (!data)
+ return -ENOMEM;
+
+ adev = arm_cspmu_acpi_dev_get(cspmu);
+ if (!adev) {
+ dev_err(dev, "failed to get associated PCIE-TGT device\n");
+ return -ENODEV;
+ }
+
+ INIT_LIST_HEAD(&resource_list);
+ ret = acpi_dev_get_memory_resources(adev, &resource_list);
+ if (ret < 0) {
+ dev_err(dev, "failed to get PCIE-TGT device memory resources\n");
+ acpi_dev_put(adev);
+ return ret;
+ }
+
+ rentry = list_first_entry_or_null(
+ &resource_list, struct resource_entry, node);
+ if (rentry) {
+ data->addr_filter_reg = devm_ioremap_resource(dev, rentry->res);
+ ret = 0;
+ }
+
+ if (IS_ERR(data->addr_filter_reg)) {
+ dev_err(dev, "failed to get address filter resource\n");
+ ret = PTR_ERR(data->addr_filter_reg);
+ }
+
+ acpi_dev_free_resource_list(&resource_list);
+ acpi_dev_put(adev);
+
+ ctx->data = data;
+
+ return ret;
+}
+#else
+static int pcie_tgt_init_data(struct arm_cspmu *cspmu)
+{
+ return -ENODEV;
+}
+#endif
+
+static struct pcie_tgt_data *pcie_tgt_get_data(struct arm_cspmu *cspmu)
+{
+ struct nv_cspmu_ctx *ctx = to_nv_cspmu_ctx(cspmu);
+
+ return ctx->data;
+}
+
+/* Find the first available address filter slot. */
+static int pcie_tgt_find_addr_idx(struct arm_cspmu *cspmu, u64 base, u64 mask,
+ bool is_reset)
+{
+ int i;
+ struct pcie_tgt_data *data = pcie_tgt_get_data(cspmu);
+
+ for (i = 0; i < NV_PCIE_TGT_ADDR_COUNT; i++) {
+ if (!is_reset && data->addr_filter[i].refcount == 0)
+ return i;
+
+ if (data->addr_filter[i].base == base &&
+ data->addr_filter[i].mask == mask)
+ return i;
+ }
+
+ return -ENODEV;
+}
+
+static u32 pcie_tgt_pmu_event_filter(const struct perf_event *event)
+{
+ u32 filter;
+
+ filter = (event->attr.config >> NV_PCIE_TGT_EV_TYPE_COUNT) &
+ NV_PCIE_TGT_FILTER2_MASK;
+
+ return filter;
+}
+
+static bool pcie_tgt_pmu_addr_en(const struct perf_event *event)
+{
+ u32 filter = pcie_tgt_pmu_event_filter(event);
+
+ return FIELD_GET(NV_PCIE_TGT_FILTER2_ADDR_EN, filter) != 0;
+}
+
+static u32 pcie_tgt_pmu_port_filter(const struct perf_event *event)
+{
+ u32 filter = pcie_tgt_pmu_event_filter(event);
+
+ return FIELD_GET(NV_PCIE_TGT_FILTER2_PORT, filter);
+}
+
+static u64 pcie_tgt_pmu_dst_addr_base(const struct perf_event *event)
+{
+ return event->attr.config1;
+}
+
+static u64 pcie_tgt_pmu_dst_addr_mask(const struct perf_event *event)
+{
+ return event->attr.config2;
+}
+
+static int pcie_tgt_pmu_validate_event(struct arm_cspmu *cspmu,
+ struct perf_event *new_ev)
+{
+ u64 base, mask;
+ int idx;
+
+ if (!pcie_tgt_pmu_addr_en(new_ev))
+ return 0;
+
+ /* Make sure there is a slot available for the address filter. */
+ base = pcie_tgt_pmu_dst_addr_base(new_ev);
+ mask = pcie_tgt_pmu_dst_addr_mask(new_ev);
+ idx = pcie_tgt_find_addr_idx(cspmu, base, mask, false);
+ if (idx < 0)
+ return -EINVAL;
+
+ return 0;
+}
+
+static void pcie_tgt_pmu_config_addr_filter(struct arm_cspmu *cspmu,
+ bool en, u64 base, u64 mask, int idx)
+{
+ struct pcie_tgt_data *data;
+ struct pcie_tgt_addr_filter *filter;
+ void __iomem *filter_reg;
+
+ data = pcie_tgt_get_data(cspmu);
+ filter = &data->addr_filter[idx];
+ filter_reg = data->addr_filter_reg + (idx * NV_PCIE_TGT_ADDR_STRIDE);
+
+ if (en) {
+ filter->refcount++;
+ if (filter->refcount == 1) {
+ filter->base = base;
+ filter->mask = mask;
+
+ writel(lower_32_bits(base), filter_reg + NV_PCIE_TGT_ADDR_BASE_LO);
+ writel(upper_32_bits(base), filter_reg + NV_PCIE_TGT_ADDR_BASE_HI);
+ writel(lower_32_bits(mask), filter_reg + NV_PCIE_TGT_ADDR_MASK_LO);
+ writel(upper_32_bits(mask), filter_reg + NV_PCIE_TGT_ADDR_MASK_HI);
+ writel(1, filter_reg + NV_PCIE_TGT_ADDR_CTRL);
+ }
+ } else {
+ filter->refcount--;
+ if (filter->refcount == 0) {
+ writel(0, filter_reg + NV_PCIE_TGT_ADDR_CTRL);
+ writel(0, filter_reg + NV_PCIE_TGT_ADDR_BASE_LO);
+ writel(0, filter_reg + NV_PCIE_TGT_ADDR_BASE_HI);
+ writel(0, filter_reg + NV_PCIE_TGT_ADDR_MASK_LO);
+ writel(0, filter_reg + NV_PCIE_TGT_ADDR_MASK_HI);
+
+ filter->base = 0;
+ filter->mask = 0;
+ }
+ }
+}
+
+static void pcie_tgt_pmu_set_ev_filter(struct arm_cspmu *cspmu,
+ const struct perf_event *event)
+{
+ bool addr_filter_en;
+ int idx;
+ u32 filter2_val, filter2_offset, port_filter;
+ u64 base, mask;
+
+ filter2_val = 0;
+ filter2_offset = PMEVFILT2R + (4 * event->hw.idx);
+
+ addr_filter_en = pcie_tgt_pmu_addr_en(event);
+ if (addr_filter_en) {
+ base = pcie_tgt_pmu_dst_addr_base(event);
+ mask = pcie_tgt_pmu_dst_addr_mask(event);
+ idx = pcie_tgt_find_addr_idx(cspmu, base, mask, false);
+
+ if (idx < 0) {
+ dev_err(cspmu->dev,
+ "Unable to find a slot for address filtering\n");
+ writel(0, cspmu->base0 + filter2_offset);
+ return;
+ }
+
+ /* Configure address range filter registers.*/
+ pcie_tgt_pmu_config_addr_filter(cspmu, true, base, mask, idx);
+
+ /* Config the counter to use the selected address filter slot. */
+ filter2_val |= FIELD_PREP(NV_PCIE_TGT_FILTER2_ADDR, 1U << idx);
+ }
+
+ port_filter = pcie_tgt_pmu_port_filter(event);
+
+ /* Monitor all ports if no filter is selected. */
+ if (!addr_filter_en && port_filter == 0)
+ port_filter = NV_PCIE_TGT_FILTER2_PORT;
+
+ filter2_val |= FIELD_PREP(NV_PCIE_TGT_FILTER2_PORT, port_filter);
+
+ writel(filter2_val, cspmu->base0 + filter2_offset);
+}
+
+static void pcie_tgt_pmu_reset_ev_filter(struct arm_cspmu *cspmu,
+ const struct perf_event *event)
+{
+ bool addr_filter_en;
+ u64 base, mask;
+ int idx;
+
+ addr_filter_en = pcie_tgt_pmu_addr_en(event);
+ if (!addr_filter_en)
+ return;
+
+ base = pcie_tgt_pmu_dst_addr_base(event);
+ mask = pcie_tgt_pmu_dst_addr_mask(event);
+ idx = pcie_tgt_find_addr_idx(cspmu, base, mask, true);
+
+ if (idx < 0) {
+ dev_err(cspmu->dev,
+ "Unable to find the address filter slot to reset\n");
+ return;
+ }
+
+ pcie_tgt_pmu_config_addr_filter(cspmu, false, base, mask, idx);
+}
+
+static u32 pcie_tgt_pmu_event_type(const struct perf_event *event)
+{
+ return event->attr.config & NV_PCIE_TGT_EV_TYPE_MASK;
+}
+
+static bool pcie_tgt_pmu_is_cycle_counter_event(const struct perf_event *event)
+{
+ u32 event_type = pcie_tgt_pmu_event_type(event);
+
+ return event_type == NV_PCIE_TGT_EV_TYPE_CC;
+}
+
enum nv_cspmu_name_fmt {
NAME_FMT_GENERIC,
NAME_FMT_SOCKET,
@@ -622,6 +919,30 @@ static const struct nv_cspmu_match nv_cspmu_match[] = {
.reset_ev_filter = nv_cspmu_reset_ev_filter,
}
},
+ {
+ .prodid = 0x10700000,
+ .prodid_mask = NV_PRODID_MASK,
+ .name_pattern = "nvidia_pcie_tgt_pmu_%u_rc_%u",
+ .name_fmt = NAME_FMT_SOCKET_INST,
+ .template_ctx = {
+ .event_attr = pcie_tgt_pmu_event_attrs,
+ .format_attr = pcie_tgt_pmu_format_attrs,
+ .filter_mask = 0x0,
+ .filter_default_val = 0x0,
+ .filter2_mask = NV_PCIE_TGT_FILTER2_MASK,
+ .filter2_default_val = NV_PCIE_TGT_FILTER2_DEFAULT,
+ .get_filter = NULL,
+ .get_filter2 = NULL,
+ .init_data = pcie_tgt_init_data
+ },
+ .ops = {
+ .is_cycle_counter_event = pcie_tgt_pmu_is_cycle_counter_event,
+ .event_type = pcie_tgt_pmu_event_type,
+ .validate_event = pcie_tgt_pmu_validate_event,
+ .set_ev_filter = pcie_tgt_pmu_set_ev_filter,
+ .reset_ev_filter = pcie_tgt_pmu_reset_ev_filter,
+ }
+ },
{
.prodid = 0,
.prodid_mask = 0,
@@ -714,6 +1035,8 @@ static int nv_cspmu_init_ops(struct arm_cspmu *cspmu)
/* NVIDIA specific callbacks. */
SET_OP(validate_event, impl_ops, match, NULL);
+ SET_OP(event_type, impl_ops, match, NULL);
+ SET_OP(is_cycle_counter_event, impl_ops, match, NULL);
SET_OP(set_cc_filter, impl_ops, match, nv_cspmu_set_cc_filter);
SET_OP(set_ev_filter, impl_ops, match, nv_cspmu_set_ev_filter);
SET_OP(reset_ev_filter, impl_ops, match, NULL);
--
2.43.0
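[Editor's note: the patch's refcounted address-filter slot scheme — events that
share a (base, mask) pair reuse one hardware slot, and the slot is cleared when
the last user goes away — can be modelled in Python. This is a simplified
sketch mirroring pcie_tgt_find_addr_idx() and
pcie_tgt_pmu_config_addr_filter(), not the driver code itself.]

```python
NUM_SLOTS = 8  # NV_PCIE_TGT_ADDR_COUNT in the patch

class AddrFilterSlots:
    """Model of the slot lookup and refcounted programming logic."""
    def __init__(self):
        self.slots = [{"refcount": 0, "base": 0, "mask": 0}
                      for _ in range(NUM_SLOTS)]

    def find(self, base, mask, is_reset=False):
        for i, s in enumerate(self.slots):
            # A free slot is acceptable only when allocating.
            if not is_reset and s["refcount"] == 0:
                return i
            if s["base"] == base and s["mask"] == mask:
                return i
        return -1  # no slot available

    def get(self, base, mask):
        i = self.find(base, mask)
        if i >= 0:
            s = self.slots[i]
            s["refcount"] += 1
            if s["refcount"] == 1:          # first user: program the
                s["base"], s["mask"] = base, mask  # filter registers
        return i

    def put(self, base, mask):
        i = self.find(base, mask, is_reset=True)
        if i >= 0:
            s = self.slots[i]
            s["refcount"] -= 1
            if s["refcount"] == 0:          # last user: clear the
                s["base"] = s["mask"] = 0   # filter registers

f = AddrFilterSlots()
assert f.get(0x10000, 0xFFF00) == 0   # first event takes slot 0
assert f.get(0x10000, 0xFFF00) == 0   # same pair shares the slot
assert f.slots[0]["refcount"] == 2
```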
^ permalink raw reply related [flat|nested] 18+ messages in thread

* Re: [PATCH v2 5/8] perf/arm_cspmu: nvidia: Add Tegra410 PCIE-TGT PMU
2026-02-18 14:58 ` [PATCH v2 5/8] perf/arm_cspmu: nvidia: Add Tegra410 PCIE-TGT PMU Besar Wicaksono
@ 2026-02-19 10:10 ` Jonathan Cameron
0 siblings, 0 replies; 18+ messages in thread
From: Jonathan Cameron @ 2026-02-19 10:10 UTC (permalink / raw)
To: Besar Wicaksono
Cc: will, suzuki.poulose, robin.murphy, ilkka, linux-arm-kernel,
linux-kernel, linux-tegra, mark.rutland, treding, jonathanh,
vsethi, rwiley, sdonthineni, skelley, ywan, mochs, nirmoyd,
Bjorn Helgaas, linux-pci, Yushan Wang, shiju.jose
On Wed, 18 Feb 2026 14:58:06 +0000
Besar Wicaksono <bwicaksono@nvidia.com> wrote:
> Add PCIE-TGT PMU support to the Tegra410 SoC. This PMU is
> instantiated in each root complex in the SoC and captures
> traffic originating from any source towards the PCIE BAR and
> CXL HDM ranges. The traffic can be filtered based on the
> destination root port or target address range.
>
> Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
+CC same group as on previous.
No additional comments from me, I just left the content for those
I +CC.
J
> ---
> .../admin-guide/perf/nvidia-tegra410-pmu.rst | 76 +++++
> drivers/perf/arm_cspmu/nvidia_cspmu.c | 323 ++++++++++++++++++
> 2 files changed, 399 insertions(+)
>
> diff --git a/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst b/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
> index 8528685ddb61..07dc447eead7 100644
> --- a/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
> +++ b/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
> @@ -7,6 +7,7 @@ metrics like memory bandwidth, latency, and utilization:
>
> * Unified Coherence Fabric (UCF)
> * PCIE
> +* PCIE-TGT
>
> PMU Driver
> ----------
> @@ -211,6 +212,11 @@ Example usage:
>
> perf stat -a -e nvidia_pcie_pmu_0_rc_4/event=0x4,src_bdf=0x0180,src_bdf_en=0x1/
>
> +.. _NVIDIA_T410_PCIE_PMU_RC_Mapping_Section:
> +
> +Mapping the RC# to lspci segment number
> +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> +
> Mapping the RC# to lspci segment number can be non-trivial; hence a new NVIDIA
> Designated Vendor Specific Capability (DVSEC) register is added into the PCIE config space
> for each RP. This DVSEC has vendor id "10de" and DVSEC id of "0x4". The DVSEC register
> @@ -266,3 +272,73 @@ Example output::
> 000d:40:00.0: Bus=40, Segment=0d, RP=01, RC=04, Socket=01
> 000d:c0:00.0: Bus=c0, Segment=0d, RP=02, RC=04, Socket=01
> 000e:00:00.0: Bus=00, Segment=0e, RP=00, RC=05, Socket=01
> +
> +PCIE-TGT PMU
> +------------
> +
> +The PCIE-TGT PMU monitors traffic targeting PCIE BAR and CXL HDM ranges.
> +There is one PCIE-TGT PMU per PCIE root complex (RC) in the SoC. Each RC in
> +Tegra410 SoC can have up to 16 lanes that can be bifurcated into up to 8 root
> +ports (RP). The PMU provides RP filter to count PCIE BAR traffic to each RP and
> +address filter to count access to PCIE BAR or CXL HDM ranges. The details
> +of the filters are described in the following sections.
> +
> +Mapping the RC# to lspci segment number is similar to the PCIE PMU.
> +Please see :ref:`NVIDIA_T410_PCIE_PMU_RC_Mapping_Section` for more info.
> +
> +The events and configuration options of this PMU device are available in sysfs,
> +see /sys/bus/event_source/devices/nvidia_pcie_tgt_pmu_<socket-id>_rc_<pcie-rc-id>.
> +
> +The events in this PMU can be used to measure bandwidth and utilization:
> +
> + * rd_req: count the number of read requests to PCIE.
> + * wr_req: count the number of write requests to PCIE.
> + * rd_bytes: count the number of bytes transferred by rd_req.
> + * wr_bytes: count the number of bytes transferred by wr_req.
> + * cycles: counts the PCIE cycles.
> +
> +The average bandwidth is calculated as::
> +
> + AVG_RD_BANDWIDTH_IN_GBPS = RD_BYTES / ELAPSED_TIME_IN_NS
> + AVG_WR_BANDWIDTH_IN_GBPS = WR_BYTES / ELAPSED_TIME_IN_NS
> +
> +The average request rate is calculated as::
> +
> + AVG_RD_REQUEST_RATE = RD_REQ / CYCLES
> + AVG_WR_REQUEST_RATE = WR_REQ / CYCLES
> +
> +The PMU events can be filtered based on the destination root port or target
> +address range. Filtering based on RP is only available for PCIE BAR traffic.
> +Address filter works for both PCIE BAR and CXL HDM ranges. These filters can be
> +found in sysfs, see
> +/sys/bus/event_source/devices/nvidia_pcie_tgt_pmu_<socket-id>_rc_<pcie-rc-id>/format/.
> +
> +Destination filter settings:
> +
> +* dst_rp_mask: bitmask to select the root port(s) to monitor. E.g. "dst_rp_mask=0xFF"
> + corresponds to all root ports (from 0 to 7) in the PCIE RC. Note that this filter is
> + only available for PCIE BAR traffic.
> +* dst_addr_base: BAR or CXL HDM filter base address.
> +* dst_addr_mask: BAR or CXL HDM filter address mask.
> +* dst_addr_en: enable BAR or CXL HDM address range filter. If this is set, the
> + address range specified by "dst_addr_base" and "dst_addr_mask" will be used to filter
> + the PCIE BAR and CXL HDM traffic address. The PMU uses the following comparison
> + to determine if the traffic destination address falls within the filter range::
> +
> + (txn's addr & dst_addr_mask) == (dst_addr_base & dst_addr_mask)
> +
> + If the comparison succeeds, then the event will be counted.
> +
> +If the destination filter is not specified, the RP filter will be configured by default
> +to count PCIE BAR traffic to all root ports.
> +
> +Example usage:
> +
> +* Count event id 0x0 to root port 0 and 1 of PCIE RC-0 on socket 0::
> +
> + perf stat -a -e nvidia_pcie_tgt_pmu_0_rc_0/event=0x0,dst_rp_mask=0x3/
> +
> +* Count event id 0x1 for accesses to PCIE BAR or CXL HDM address range
> + 0x10000 to 0x100FF on socket 0's PCIE RC-1::
> +
> + perf stat -a -e nvidia_pcie_tgt_pmu_0_rc_1/event=0x1,dst_addr_base=0x10000,dst_addr_mask=0xFFF00,dst_addr_en=0x1/
> diff --git a/drivers/perf/arm_cspmu/nvidia_cspmu.c b/drivers/perf/arm_cspmu/nvidia_cspmu.c
> index 42f11f37bddf..25c408b56dc8 100644
> --- a/drivers/perf/arm_cspmu/nvidia_cspmu.c
> +++ b/drivers/perf/arm_cspmu/nvidia_cspmu.c
> @@ -42,6 +42,24 @@
> #define NV_PCIE_V2_FILTER2_DST GENMASK_ULL(NV_PCIE_V2_DST_COUNT - 1, 0)
> #define NV_PCIE_V2_FILTER2_DEFAULT NV_PCIE_V2_FILTER2_DST
>
> +#define NV_PCIE_TGT_PORT_COUNT 8ULL
> +#define NV_PCIE_TGT_EV_TYPE_CC 0x4
> +#define NV_PCIE_TGT_EV_TYPE_COUNT 3ULL
> +#define NV_PCIE_TGT_EV_TYPE_MASK GENMASK_ULL(NV_PCIE_TGT_EV_TYPE_COUNT - 1, 0)
> +#define NV_PCIE_TGT_FILTER2_MASK GENMASK_ULL(NV_PCIE_TGT_PORT_COUNT, 0)
> +#define NV_PCIE_TGT_FILTER2_PORT GENMASK_ULL(NV_PCIE_TGT_PORT_COUNT - 1, 0)
> +#define NV_PCIE_TGT_FILTER2_ADDR_EN BIT(NV_PCIE_TGT_PORT_COUNT)
> +#define NV_PCIE_TGT_FILTER2_ADDR GENMASK_ULL(15, NV_PCIE_TGT_PORT_COUNT)
> +#define NV_PCIE_TGT_FILTER2_DEFAULT NV_PCIE_TGT_FILTER2_PORT
> +
> +#define NV_PCIE_TGT_ADDR_COUNT 8ULL
> +#define NV_PCIE_TGT_ADDR_STRIDE 20
> +#define NV_PCIE_TGT_ADDR_CTRL 0xD38
> +#define NV_PCIE_TGT_ADDR_BASE_LO 0xD3C
> +#define NV_PCIE_TGT_ADDR_BASE_HI 0xD40
> +#define NV_PCIE_TGT_ADDR_MASK_LO 0xD44
> +#define NV_PCIE_TGT_ADDR_MASK_HI 0xD48
> +
> #define NV_GENERIC_FILTER_ID_MASK GENMASK_ULL(31, 0)
>
> #define NV_PRODID_MASK (PMIIDR_PRODUCTID | PMIIDR_VARIANT | PMIIDR_REVISION)
> @@ -186,6 +204,15 @@ static struct attribute *pcie_v2_pmu_event_attrs[] = {
> NULL,
> };
>
> +static struct attribute *pcie_tgt_pmu_event_attrs[] = {
> + ARM_CSPMU_EVENT_ATTR(rd_bytes, 0x0),
> + ARM_CSPMU_EVENT_ATTR(wr_bytes, 0x1),
> + ARM_CSPMU_EVENT_ATTR(rd_req, 0x2),
> + ARM_CSPMU_EVENT_ATTR(wr_req, 0x3),
> + ARM_CSPMU_EVENT_ATTR(cycles, NV_PCIE_TGT_EV_TYPE_CC),
> + NULL,
> +};
> +
> static struct attribute *generic_pmu_event_attrs[] = {
> ARM_CSPMU_EVENT_ATTR(cycles, ARM_CSPMU_EVT_CYCLES_DEFAULT),
> NULL,
> @@ -239,6 +266,15 @@ static struct attribute *pcie_v2_pmu_format_attrs[] = {
> NULL,
> };
>
> +static struct attribute *pcie_tgt_pmu_format_attrs[] = {
> + ARM_CSPMU_FORMAT_ATTR(event, "config:0-2"),
> + ARM_CSPMU_FORMAT_ATTR(dst_rp_mask, "config:3-10"),
> + ARM_CSPMU_FORMAT_ATTR(dst_addr_en, "config:11"),
> + ARM_CSPMU_FORMAT_ATTR(dst_addr_base, "config1:0-63"),
> + ARM_CSPMU_FORMAT_ATTR(dst_addr_mask, "config2:0-63"),
> + NULL,
> +};
> +
> static struct attribute *generic_pmu_format_attrs[] = {
> ARM_CSPMU_FORMAT_EVENT_ATTR,
> ARM_CSPMU_FORMAT_FILTER_ATTR,
> @@ -478,6 +514,267 @@ static int pcie_v2_pmu_validate_event(struct arm_cspmu *cspmu,
> return 0;
> }
>
> +struct pcie_tgt_addr_filter {
> + u32 refcount;
> + u64 base;
> + u64 mask;
> +};
> +
> +struct pcie_tgt_data {
> + struct pcie_tgt_addr_filter addr_filter[NV_PCIE_TGT_ADDR_COUNT];
> + void __iomem *addr_filter_reg;
> +};
> +
> +#if defined(CONFIG_ACPI)
> +static int pcie_tgt_init_data(struct arm_cspmu *cspmu)
> +{
> + int ret;
> + struct acpi_device *adev;
> + struct pcie_tgt_data *data;
> + struct list_head resource_list;
> + struct resource_entry *rentry;
> + struct nv_cspmu_ctx *ctx = to_nv_cspmu_ctx(cspmu);
> + struct device *dev = cspmu->dev;
> +
> + data = devm_kzalloc(dev, sizeof(struct pcie_tgt_data), GFP_KERNEL);
> + if (!data)
> + return -ENOMEM;
> +
> + adev = arm_cspmu_acpi_dev_get(cspmu);
> + if (!adev) {
> + dev_err(dev, "failed to get associated PCIE-TGT device\n");
> + return -ENODEV;
> + }
> +
> + INIT_LIST_HEAD(&resource_list);
> + ret = acpi_dev_get_memory_resources(adev, &resource_list);
> + if (ret < 0) {
> + dev_err(dev, "failed to get PCIE-TGT device memory resources\n");
> + acpi_dev_put(adev);
> + return ret;
> + }
> +
> + rentry = list_first_entry_or_null(
> + &resource_list, struct resource_entry, node);
> + if (rentry) {
> + data->addr_filter_reg = devm_ioremap_resource(dev, rentry->res);
> + ret = 0;
> + }
> +
> + if (IS_ERR(data->addr_filter_reg)) {
> + dev_err(dev, "failed to get address filter resource\n");
> + ret = PTR_ERR(data->addr_filter_reg);
> + }
> +
> + acpi_dev_free_resource_list(&resource_list);
> + acpi_dev_put(adev);
> +
> + ctx->data = data;
> +
> + return ret;
> +}
> +#else
> +static int pcie_tgt_init_data(struct arm_cspmu *cspmu)
> +{
> + return -ENODEV;
> +}
> +#endif
> +
> +static struct pcie_tgt_data *pcie_tgt_get_data(struct arm_cspmu *cspmu)
> +{
> + struct nv_cspmu_ctx *ctx = to_nv_cspmu_ctx(cspmu);
> +
> + return ctx->data;
> +}
> +
> +/* Find the first available address filter slot. */
> +static int pcie_tgt_find_addr_idx(struct arm_cspmu *cspmu, u64 base, u64 mask,
> + bool is_reset)
> +{
> + int i;
> + struct pcie_tgt_data *data = pcie_tgt_get_data(cspmu);
> +
> + for (i = 0; i < NV_PCIE_TGT_ADDR_COUNT; i++) {
> + if (!is_reset && data->addr_filter[i].refcount == 0)
> + return i;
> +
> + if (data->addr_filter[i].base == base &&
> + data->addr_filter[i].mask == mask)
> + return i;
> + }
> +
> + return -ENODEV;
> +}
> +
> +static u32 pcie_tgt_pmu_event_filter(const struct perf_event *event)
> +{
> + u32 filter;
> +
> + filter = (event->attr.config >> NV_PCIE_TGT_EV_TYPE_COUNT) &
> + NV_PCIE_TGT_FILTER2_MASK;
> +
> + return filter;
> +}
> +
> +static bool pcie_tgt_pmu_addr_en(const struct perf_event *event)
> +{
> + u32 filter = pcie_tgt_pmu_event_filter(event);
> +
> + return FIELD_GET(NV_PCIE_TGT_FILTER2_ADDR_EN, filter) != 0;
> +}
> +
> +static u32 pcie_tgt_pmu_port_filter(const struct perf_event *event)
> +{
> + u32 filter = pcie_tgt_pmu_event_filter(event);
> +
> + return FIELD_GET(NV_PCIE_TGT_FILTER2_PORT, filter);
> +}
> +
> +static u64 pcie_tgt_pmu_dst_addr_base(const struct perf_event *event)
> +{
> + return event->attr.config1;
> +}
> +
> +static u64 pcie_tgt_pmu_dst_addr_mask(const struct perf_event *event)
> +{
> + return event->attr.config2;
> +}
> +
> +static int pcie_tgt_pmu_validate_event(struct arm_cspmu *cspmu,
> + struct perf_event *new_ev)
> +{
> + u64 base, mask;
> + int idx;
> +
> + if (!pcie_tgt_pmu_addr_en(new_ev))
> + return 0;
> +
> + /* Make sure there is a slot available for the address filter. */
> + base = pcie_tgt_pmu_dst_addr_base(new_ev);
> + mask = pcie_tgt_pmu_dst_addr_mask(new_ev);
> + idx = pcie_tgt_find_addr_idx(cspmu, base, mask, false);
> + if (idx < 0)
> + return -EINVAL;
> +
> + return 0;
> +}
> +
> +static void pcie_tgt_pmu_config_addr_filter(struct arm_cspmu *cspmu,
> + bool en, u64 base, u64 mask, int idx)
> +{
> + struct pcie_tgt_data *data;
> + struct pcie_tgt_addr_filter *filter;
> + void __iomem *filter_reg;
> +
> + data = pcie_tgt_get_data(cspmu);
> + filter = &data->addr_filter[idx];
> + filter_reg = data->addr_filter_reg + (idx * NV_PCIE_TGT_ADDR_STRIDE);
> +
> + if (en) {
> + filter->refcount++;
> + if (filter->refcount == 1) {
> + filter->base = base;
> + filter->mask = mask;
> +
> + writel(lower_32_bits(base), filter_reg + NV_PCIE_TGT_ADDR_BASE_LO);
> + writel(upper_32_bits(base), filter_reg + NV_PCIE_TGT_ADDR_BASE_HI);
> + writel(lower_32_bits(mask), filter_reg + NV_PCIE_TGT_ADDR_MASK_LO);
> + writel(upper_32_bits(mask), filter_reg + NV_PCIE_TGT_ADDR_MASK_HI);
> + writel(1, filter_reg + NV_PCIE_TGT_ADDR_CTRL);
> + }
> + } else {
> + filter->refcount--;
> + if (filter->refcount == 0) {
> + writel(0, filter_reg + NV_PCIE_TGT_ADDR_CTRL);
> + writel(0, filter_reg + NV_PCIE_TGT_ADDR_BASE_LO);
> + writel(0, filter_reg + NV_PCIE_TGT_ADDR_BASE_HI);
> + writel(0, filter_reg + NV_PCIE_TGT_ADDR_MASK_LO);
> + writel(0, filter_reg + NV_PCIE_TGT_ADDR_MASK_HI);
> +
> + filter->base = 0;
> + filter->mask = 0;
> + }
> + }
> +}
> +
> +static void pcie_tgt_pmu_set_ev_filter(struct arm_cspmu *cspmu,
> + const struct perf_event *event)
> +{
> + bool addr_filter_en;
> + int idx;
> + u32 filter2_val, filter2_offset, port_filter;
> + u64 base, mask;
> +
> + filter2_val = 0;
> + filter2_offset = PMEVFILT2R + (4 * event->hw.idx);
> +
> + addr_filter_en = pcie_tgt_pmu_addr_en(event);
> + if (addr_filter_en) {
> + base = pcie_tgt_pmu_dst_addr_base(event);
> + mask = pcie_tgt_pmu_dst_addr_mask(event);
> + idx = pcie_tgt_find_addr_idx(cspmu, base, mask, false);
> +
> + if (idx < 0) {
> + dev_err(cspmu->dev,
> + "Unable to find a slot for address filtering\n");
> + writel(0, cspmu->base0 + filter2_offset);
> + return;
> + }
> +
> + /* Configure address range filter registers.*/
> + pcie_tgt_pmu_config_addr_filter(cspmu, true, base, mask, idx);
> +
> + /* Config the counter to use the selected address filter slot. */
> + filter2_val |= FIELD_PREP(NV_PCIE_TGT_FILTER2_ADDR, 1U << idx);
> + }
> +
> + port_filter = pcie_tgt_pmu_port_filter(event);
> +
> + /* Monitor all ports if no filter is selected. */
> + if (!addr_filter_en && port_filter == 0)
> + port_filter = NV_PCIE_TGT_FILTER2_PORT;
> +
> + filter2_val |= FIELD_PREP(NV_PCIE_TGT_FILTER2_PORT, port_filter);
> +
> + writel(filter2_val, cspmu->base0 + filter2_offset);
> +}
> +
> +static void pcie_tgt_pmu_reset_ev_filter(struct arm_cspmu *cspmu,
> + const struct perf_event *event)
> +{
> + bool addr_filter_en;
> + u64 base, mask;
> + int idx;
> +
> + addr_filter_en = pcie_tgt_pmu_addr_en(event);
> + if (!addr_filter_en)
> + return;
> +
> + base = pcie_tgt_pmu_dst_addr_base(event);
> + mask = pcie_tgt_pmu_dst_addr_mask(event);
> + idx = pcie_tgt_find_addr_idx(cspmu, base, mask, true);
> +
> + if (idx < 0) {
> + dev_err(cspmu->dev,
> + "Unable to find the address filter slot to reset\n");
> + return;
> + }
> +
> + pcie_tgt_pmu_config_addr_filter(cspmu, false, base, mask, idx);
> +}
> +
> +static u32 pcie_tgt_pmu_event_type(const struct perf_event *event)
> +{
> + return event->attr.config & NV_PCIE_TGT_EV_TYPE_MASK;
> +}
> +
> +static bool pcie_tgt_pmu_is_cycle_counter_event(const struct perf_event *event)
> +{
> + u32 event_type = pcie_tgt_pmu_event_type(event);
> +
> + return event_type == NV_PCIE_TGT_EV_TYPE_CC;
> +}
> +
> enum nv_cspmu_name_fmt {
> NAME_FMT_GENERIC,
> NAME_FMT_SOCKET,
> @@ -622,6 +919,30 @@ static const struct nv_cspmu_match nv_cspmu_match[] = {
> .reset_ev_filter = nv_cspmu_reset_ev_filter,
> }
> },
> + {
> + .prodid = 0x10700000,
> + .prodid_mask = NV_PRODID_MASK,
> + .name_pattern = "nvidia_pcie_tgt_pmu_%u_rc_%u",
> + .name_fmt = NAME_FMT_SOCKET_INST,
> + .template_ctx = {
> + .event_attr = pcie_tgt_pmu_event_attrs,
> + .format_attr = pcie_tgt_pmu_format_attrs,
> + .filter_mask = 0x0,
> + .filter_default_val = 0x0,
> + .filter2_mask = NV_PCIE_TGT_FILTER2_MASK,
> + .filter2_default_val = NV_PCIE_TGT_FILTER2_DEFAULT,
> + .get_filter = NULL,
> + .get_filter2 = NULL,
> + .init_data = pcie_tgt_init_data
> + },
> + .ops = {
> + .is_cycle_counter_event = pcie_tgt_pmu_is_cycle_counter_event,
> + .event_type = pcie_tgt_pmu_event_type,
> + .validate_event = pcie_tgt_pmu_validate_event,
> + .set_ev_filter = pcie_tgt_pmu_set_ev_filter,
> + .reset_ev_filter = pcie_tgt_pmu_reset_ev_filter,
> + }
> + },
> {
> .prodid = 0,
> .prodid_mask = 0,
> @@ -714,6 +1035,8 @@ static int nv_cspmu_init_ops(struct arm_cspmu *cspmu)
>
> /* NVIDIA specific callbacks. */
> SET_OP(validate_event, impl_ops, match, NULL);
> + SET_OP(event_type, impl_ops, match, NULL);
> + SET_OP(is_cycle_counter_event, impl_ops, match, NULL);
> SET_OP(set_cc_filter, impl_ops, match, nv_cspmu_set_cc_filter);
> SET_OP(set_ev_filter, impl_ops, match, nv_cspmu_set_ev_filter);
> SET_OP(reset_ev_filter, impl_ops, match, NULL);
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v2 6/8] perf: add NVIDIA Tegra410 CPU Memory Latency PMU
2026-02-18 14:58 [PATCH v2 0/8] perf: add NVIDIA Tegra410 Uncore PMU support Besar Wicaksono
` (4 preceding siblings ...)
2026-02-18 14:58 ` [PATCH v2 5/8] perf/arm_cspmu: nvidia: Add Tegra410 PCIE-TGT PMU Besar Wicaksono
@ 2026-02-18 14:58 ` Besar Wicaksono
2026-02-19 10:28 ` Jonathan Cameron
2026-02-18 14:58 ` [PATCH v2 7/8] perf: add NVIDIA Tegra410 C2C PMU Besar Wicaksono
2026-02-18 14:58 ` [PATCH v2 8/8] arm64: defconfig: Enable NVIDIA TEGRA410 PMU Besar Wicaksono
7 siblings, 1 reply; 18+ messages in thread
From: Besar Wicaksono @ 2026-02-18 14:58 UTC (permalink / raw)
To: will, suzuki.poulose, robin.murphy, ilkka
Cc: linux-arm-kernel, linux-kernel, linux-tegra, mark.rutland,
treding, jonathanh, vsethi, rwiley, sdonthineni, skelley, ywan,
mochs, nirmoyd, Besar Wicaksono
Add CPU Memory (CMEM) Latency PMU support to the Tegra410 SoC.
The PMU is used to measure latency from the edge of the
Unified Coherence Fabric to the local system DRAM.
Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
---
.../admin-guide/perf/nvidia-tegra410-pmu.rst | 25 +
drivers/perf/Kconfig | 7 +
drivers/perf/Makefile | 1 +
drivers/perf/nvidia_t410_cmem_latency_pmu.c | 727 ++++++++++++++++++
4 files changed, 760 insertions(+)
create mode 100644 drivers/perf/nvidia_t410_cmem_latency_pmu.c
diff --git a/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst b/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
index 07dc447eead7..c8fbc289d12c 100644
--- a/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
+++ b/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
@@ -8,6 +8,7 @@ metrics like memory bandwidth, latency, and utilization:
* Unified Coherence Fabric (UCF)
* PCIE
* PCIE-TGT
+* CPU Memory (CMEM) Latency
PMU Driver
----------
@@ -342,3 +343,27 @@ Example usage:
0x10000 to 0x100FF on socket 0's PCIE RC-1::
perf stat -a -e nvidia_pcie_tgt_pmu_0_rc_1/event=0x1,dst_addr_base=0x10000,dst_addr_mask=0xFFF00,dst_addr_en=0x1/
+
+CPU Memory (CMEM) Latency PMU
+-----------------------------
+
+This PMU monitors latency events of memory read requests from the edge of the
+Unified Coherence Fabric (UCF) to local CPU DRAM:
+
+ * RD_REQ counters: count read requests (32B per request).
+ * RD_CUM_OUTS counters: accumulated outstanding request counters, which
+ track how many cycles the read requests are in flight.
+ * CYCLES counter: counts the number of elapsed cycles.
+
+The average latency is calculated as::
+
+ FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
+ AVG_LATENCY_IN_CYCLES = RD_CUM_OUTS / RD_REQ
+ AVERAGE_LATENCY_IN_NS = AVG_LATENCY_IN_CYCLES / FREQ_IN_GHZ
+
+The events and configuration options of this PMU device are described in sysfs,
+see /sys/bus/event_source/devices/nvidia_cmem_latency_pmu_<socket-id>.
+
+Example usage::
+
+ perf stat -a -e '{nvidia_cmem_latency_pmu_0/rd_req/,nvidia_cmem_latency_pmu_0/rd_cum_outs/,nvidia_cmem_latency_pmu_0/cycles/}'
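[Editor's note: the latency derivation above can be sketched as a small helper
— an illustrative calculation with made-up counter values, not part of the
patch.]

```python
def cmem_latency_ns(rd_cum_outs, rd_req, cycles, elapsed_time_ns):
    """Average CMEM read latency per the documented formulas."""
    freq_ghz = cycles / elapsed_time_ns         # FREQ_IN_GHZ
    avg_latency_cycles = rd_cum_outs / rd_req   # AVG_LATENCY_IN_CYCLES
    return avg_latency_cycles / freq_ghz        # AVERAGE_LATENCY_IN_NS

# Hypothetical sample: a 2 GHz clock (2e9 cycles over 1e9 ns) and an
# average of 200 cycles in flight per request gives 100 ns latency.
ns = cmem_latency_ns(rd_cum_outs=2_000_000, rd_req=10_000,
                     cycles=2_000_000_000, elapsed_time_ns=1_000_000_000)
assert ns == 100.0
```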
diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index 638321fc9800..26e86067d8f9 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -311,4 +311,11 @@ config MARVELL_PEM_PMU
Enable support for PCIe Interface performance monitoring
on Marvell platform.
+config NVIDIA_TEGRA410_CMEM_LATENCY_PMU
+ tristate "NVIDIA Tegra410 CPU Memory Latency PMU"
+ depends on ARM64 && ACPI
+ help
+ Enable perf support for CPU memory latency counter monitoring on
+ the NVIDIA Tegra410 SoC.
+
endmenu
diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
index ea52711a87e3..4aa6aad393c2 100644
--- a/drivers/perf/Makefile
+++ b/drivers/perf/Makefile
@@ -35,3 +35,4 @@ obj-$(CONFIG_DWC_PCIE_PMU) += dwc_pcie_pmu.o
obj-$(CONFIG_ARM_CORESIGHT_PMU_ARCH_SYSTEM_PMU) += arm_cspmu/
obj-$(CONFIG_MESON_DDR_PMU) += amlogic/
obj-$(CONFIG_CXL_PMU) += cxl_pmu.o
+obj-$(CONFIG_NVIDIA_TEGRA410_CMEM_LATENCY_PMU) += nvidia_t410_cmem_latency_pmu.o
diff --git a/drivers/perf/nvidia_t410_cmem_latency_pmu.c b/drivers/perf/nvidia_t410_cmem_latency_pmu.c
new file mode 100644
index 000000000000..9b466581c8fc
--- /dev/null
+++ b/drivers/perf/nvidia_t410_cmem_latency_pmu.c
@@ -0,0 +1,727 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * NVIDIA Tegra410 CPU Memory (CMEM) Latency PMU driver.
+ *
+ * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ */
+
+#include <linux/acpi.h>
+#include <linux/bitops.h>
+#include <linux/cpumask.h>
+#include <linux/device.h>
+#include <linux/interrupt.h>
+#include <linux/io.h>
+#include <linux/module.h>
+#include <linux/perf_event.h>
+#include <linux/platform_device.h>
+
+#define NUM_INSTANCES 14
+#define BCAST(pmu) pmu->base[NUM_INSTANCES]
+
+/* Register offsets. */
+#define CG_CTRL 0x800
+#define CTRL 0x808
+#define STATUS 0x810
+#define CYCLE_CNTR 0x818
+#define MC0_REQ_CNTR 0x820
+#define MC0_AOR_CNTR 0x830
+#define MC1_REQ_CNTR 0x838
+#define MC1_AOR_CNTR 0x848
+#define MC2_REQ_CNTR 0x850
+#define MC2_AOR_CNTR 0x860
+
+/* CTRL values. */
+#define CTRL_DISABLE 0x0ULL
+#define CTRL_ENABLE 0x1ULL
+#define CTRL_CLR 0x2ULL
+
+/* CG_CTRL values. */
+#define CG_CTRL_DISABLE 0x0ULL
+#define CG_CTRL_ENABLE 0x1ULL
+
+/* STATUS register field. */
+#define STATUS_CYCLE_OVF BIT(0)
+#define STATUS_MC0_AOR_OVF BIT(1)
+#define STATUS_MC0_REQ_OVF BIT(3)
+#define STATUS_MC1_AOR_OVF BIT(4)
+#define STATUS_MC1_REQ_OVF BIT(6)
+#define STATUS_MC2_AOR_OVF BIT(7)
+#define STATUS_MC2_REQ_OVF BIT(9)
+
+/* Events. */
+#define EVENT_CYCLES 0x0
+#define EVENT_REQ 0x1
+#define EVENT_AOR 0x2
+
+#define NUM_EVENTS 0x3
+#define MASK_EVENT 0x3
+#define MAX_ACTIVE_EVENTS 32
+
+#define ACTIVE_CPU_MASK 0x0
+#define ASSOCIATED_CPU_MASK 0x1
+
+static unsigned long cmem_lat_pmu_cpuhp_state;
+
+struct cmem_lat_pmu_hw_events {
+ struct perf_event *events[MAX_ACTIVE_EVENTS];
+ DECLARE_BITMAP(used_ctrs, MAX_ACTIVE_EVENTS);
+};
+
+struct cmem_lat_pmu {
+ struct pmu pmu;
+ struct device *dev;
+ const char *name;
+ const char *identifier;
+ void __iomem *base[NUM_INSTANCES + 1];
+ cpumask_t associated_cpus;
+ cpumask_t active_cpu;
+ struct hlist_node node;
+ struct cmem_lat_pmu_hw_events hw_events;
+};
+
+#define to_cmem_lat_pmu(p) \
+ container_of(p, struct cmem_lat_pmu, pmu)
+
+
+/* Get event type from perf_event. */
+static inline u32 get_event_type(struct perf_event *event)
+{
+ return (event->attr.config) & MASK_EVENT;
+}
+
+/* PMU operations. */
+static int cmem_lat_pmu_get_event_idx(struct cmem_lat_pmu_hw_events *hw_events,
+ struct perf_event *event)
+{
+ unsigned int idx;
+
+ idx = find_first_zero_bit(hw_events->used_ctrs, MAX_ACTIVE_EVENTS);
+ if (idx >= MAX_ACTIVE_EVENTS)
+ return -EAGAIN;
+
+ set_bit(idx, hw_events->used_ctrs);
+
+ return idx;
+}
+
+static bool cmem_lat_pmu_validate_event(struct pmu *pmu,
+ struct cmem_lat_pmu_hw_events *hw_events,
+ struct perf_event *event)
+{
+ if (is_software_event(event))
+ return true;
+
+ /* Reject groups spanning multiple HW PMUs. */
+ if (event->pmu != pmu)
+ return false;
+
+ return (cmem_lat_pmu_get_event_idx(hw_events, event) >= 0);
+}
+
+/*
+ * Make sure the group of events can be scheduled at once
+ * on the PMU.
+ */
+static bool cmem_lat_pmu_validate_group(struct perf_event *event)
+{
+ struct perf_event *sibling, *leader = event->group_leader;
+ struct cmem_lat_pmu_hw_events fake_hw_events;
+
+ if (event->group_leader == event)
+ return true;
+
+ memset(&fake_hw_events, 0, sizeof(fake_hw_events));
+
+ if (!cmem_lat_pmu_validate_event(event->pmu, &fake_hw_events, leader))
+ return false;
+
+ for_each_sibling_event(sibling, leader) {
+ if (!cmem_lat_pmu_validate_event(event->pmu, &fake_hw_events,
+ sibling))
+ return false;
+ }
+
+ return cmem_lat_pmu_validate_event(event->pmu, &fake_hw_events, event);
+}
+
+static int cmem_lat_pmu_event_init(struct perf_event *event)
+{
+ struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(event->pmu);
+ struct hw_perf_event *hwc = &event->hw;
+ u32 event_type = get_event_type(event);
+
+ if (event->attr.type != event->pmu->type ||
+ event_type >= NUM_EVENTS)
+ return -ENOENT;
+
+ /*
+ * Following other "uncore" PMUs, we do not support sampling mode or
+ * attach to a task (per-process mode).
+ */
+ if (is_sampling_event(event)) {
+ dev_dbg(cmem_lat_pmu->pmu.dev,
+ "Can't support sampling events\n");
+ return -EOPNOTSUPP;
+ }
+
+ if (event->cpu < 0 || event->attach_state & PERF_ATTACH_TASK) {
+ dev_dbg(cmem_lat_pmu->pmu.dev,
+ "Can't support per-task counters\n");
+ return -EINVAL;
+ }
+
+ /*
+ * Make sure the CPU assignment is on one of the CPUs associated with
+ * this PMU.
+ */
+ if (!cpumask_test_cpu(event->cpu, &cmem_lat_pmu->associated_cpus)) {
+ dev_dbg(cmem_lat_pmu->pmu.dev,
+ "Requested cpu is not associated with the PMU\n");
+ return -EINVAL;
+ }
+
+ /* Enforce the current active CPU to handle the events in this PMU. */
+ event->cpu = cpumask_first(&cmem_lat_pmu->active_cpu);
+ if (event->cpu >= nr_cpu_ids)
+ return -EINVAL;
+
+ if (!cmem_lat_pmu_validate_group(event))
+ return -EINVAL;
+
+ hwc->idx = -1;
+ hwc->config = event_type;
+
+ return 0;
+}
+
+static u64 cmem_lat_pmu_read_status(struct cmem_lat_pmu *cmem_lat_pmu,
+ unsigned int inst)
+{
+ return readq(cmem_lat_pmu->base[inst] + STATUS);
+}
+
+static u64 cmem_lat_pmu_read_cycle_counter(struct perf_event *event)
+{
+ const unsigned int instance = 0;
+ u64 status;
+ struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(event->pmu);
+ struct device *dev = cmem_lat_pmu->dev;
+
+ /*
+ * Use the reading from first instance since all instances are
+ * identical.
+ */
+ status = cmem_lat_pmu_read_status(cmem_lat_pmu, instance);
+ if (status & STATUS_CYCLE_OVF)
+ dev_warn(dev, "Cycle counter overflow\n");
+
+ return readq(cmem_lat_pmu->base[instance] + CYCLE_CNTR);
+}
+
+static u64 cmem_lat_pmu_read_req_counter(struct perf_event *event)
+{
+ unsigned int i;
+ u64 status, val = 0;
+ struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(event->pmu);
+ struct device *dev = cmem_lat_pmu->dev;
+
+ /* Sum up the counts from all instances. */
+ for (i = 0; i < NUM_INSTANCES; i++) {
+ status = cmem_lat_pmu_read_status(cmem_lat_pmu, i);
+ if (status & STATUS_MC0_REQ_OVF)
+ dev_warn(dev, "MC0 request counter overflow\n");
+ if (status & STATUS_MC1_REQ_OVF)
+ dev_warn(dev, "MC1 request counter overflow\n");
+ if (status & STATUS_MC2_REQ_OVF)
+ dev_warn(dev, "MC2 request counter overflow\n");
+
+ val += readq(cmem_lat_pmu->base[i] + MC0_REQ_CNTR);
+ val += readq(cmem_lat_pmu->base[i] + MC1_REQ_CNTR);
+ val += readq(cmem_lat_pmu->base[i] + MC2_REQ_CNTR);
+ }
+
+ return val;
+}
+
+static u64 cmem_lat_pmu_read_aor_counter(struct perf_event *event)
+{
+ unsigned int i;
+ u64 status, val = 0;
+ struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(event->pmu);
+ struct device *dev = cmem_lat_pmu->dev;
+
+ /* Sum up the counts from all instances. */
+ for (i = 0; i < NUM_INSTANCES; i++) {
+ status = cmem_lat_pmu_read_status(cmem_lat_pmu, i);
+ if (status & STATUS_MC0_AOR_OVF)
+ dev_warn(dev, "MC0 AOR counter overflow\n");
+ if (status & STATUS_MC1_AOR_OVF)
+ dev_warn(dev, "MC1 AOR counter overflow\n");
+ if (status & STATUS_MC2_AOR_OVF)
+ dev_warn(dev, "MC2 AOR counter overflow\n");
+
+ val += readq(cmem_lat_pmu->base[i] + MC0_AOR_CNTR);
+ val += readq(cmem_lat_pmu->base[i] + MC1_AOR_CNTR);
+ val += readq(cmem_lat_pmu->base[i] + MC2_AOR_CNTR);
+ }
+
+ return val;
+}
+
+static u64 (*read_counter_fn[NUM_EVENTS])(struct perf_event *) = {
+ [EVENT_CYCLES] = cmem_lat_pmu_read_cycle_counter,
+ [EVENT_REQ] = cmem_lat_pmu_read_req_counter,
+ [EVENT_AOR] = cmem_lat_pmu_read_aor_counter,
+};
+
+static void cmem_lat_pmu_event_update(struct perf_event *event)
+{
+ u32 event_type;
+ u64 prev, now;
+ struct hw_perf_event *hwc = &event->hw;
+
+ if (hwc->state & PERF_HES_STOPPED)
+ return;
+
+ event_type = hwc->config;
+
+ do {
+ prev = local64_read(&hwc->prev_count);
+ now = read_counter_fn[event_type](event);
+ } while (local64_cmpxchg(&hwc->prev_count, prev, now) != prev);
+
+ local64_add(now - prev, &event->count);
+
+ hwc->state |= PERF_HES_UPTODATE;
+}
+
+static void cmem_lat_pmu_start(struct perf_event *event, int pmu_flags)
+{
+ event->hw.state = 0;
+}
+
+static void cmem_lat_pmu_stop(struct perf_event *event, int pmu_flags)
+{
+ event->hw.state |= PERF_HES_STOPPED;
+}
+
+static int cmem_lat_pmu_add(struct perf_event *event, int flags)
+{
+ struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(event->pmu);
+ struct cmem_lat_pmu_hw_events *hw_events = &cmem_lat_pmu->hw_events;
+ struct hw_perf_event *hwc = &event->hw;
+ int idx;
+
+ if (WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(),
+ &cmem_lat_pmu->associated_cpus)))
+ return -ENOENT;
+
+ idx = cmem_lat_pmu_get_event_idx(hw_events, event);
+ if (idx < 0)
+ return idx;
+
+ hw_events->events[idx] = event;
+ hwc->idx = idx;
+ hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
+
+ if (flags & PERF_EF_START)
+ cmem_lat_pmu_start(event, PERF_EF_RELOAD);
+
+ /* Propagate changes to the userspace mapping. */
+ perf_event_update_userpage(event);
+
+ return 0;
+}
+
+static void cmem_lat_pmu_del(struct perf_event *event, int flags)
+{
+ struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(event->pmu);
+ struct cmem_lat_pmu_hw_events *hw_events = &cmem_lat_pmu->hw_events;
+ struct hw_perf_event *hwc = &event->hw;
+ int idx = hwc->idx;
+
+ cmem_lat_pmu_stop(event, PERF_EF_UPDATE);
+
+ hw_events->events[idx] = NULL;
+
+ clear_bit(idx, hw_events->used_ctrs);
+
+ perf_event_update_userpage(event);
+}
+
+static void cmem_lat_pmu_read(struct perf_event *event)
+{
+ cmem_lat_pmu_event_update(event);
+}
+
+static inline void cmem_lat_pmu_cg_ctrl(struct cmem_lat_pmu *cmem_lat_pmu, u64 val)
+{
+ writeq(val, BCAST(cmem_lat_pmu) + CG_CTRL);
+}
+
+static inline void cmem_lat_pmu_ctrl(struct cmem_lat_pmu *cmem_lat_pmu, u64 val)
+{
+ writeq(val, BCAST(cmem_lat_pmu) + CTRL);
+}
+
+static void cmem_lat_pmu_enable(struct pmu *pmu)
+{
+ bool disabled;
+ struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(pmu);
+
+ disabled = bitmap_empty(
+ cmem_lat_pmu->hw_events.used_ctrs, MAX_ACTIVE_EVENTS);
+
+ if (disabled)
+ return;
+
+ /* Enable all the counters. */
+ cmem_lat_pmu_cg_ctrl(cmem_lat_pmu, CG_CTRL_ENABLE);
+ cmem_lat_pmu_ctrl(cmem_lat_pmu, CTRL_ENABLE);
+}
+
+static void cmem_lat_pmu_disable(struct pmu *pmu)
+{
+ int idx;
+ struct perf_event *event;
+ struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(pmu);
+
+ /* Disable all the counters. */
+ cmem_lat_pmu_ctrl(cmem_lat_pmu, CTRL_DISABLE);
+
+ /*
+ * The counters will start from 0 again on restart.
+ * Update the events immediately to avoid losing the counts.
+ */
+ for_each_set_bit(
+ idx, cmem_lat_pmu->hw_events.used_ctrs, MAX_ACTIVE_EVENTS) {
+ event = cmem_lat_pmu->hw_events.events[idx];
+
+ if (!event)
+ continue;
+
+ cmem_lat_pmu_event_update(event);
+
+ local64_set(&event->hw.prev_count, 0ULL);
+ }
+
+ cmem_lat_pmu_ctrl(cmem_lat_pmu, CTRL_CLR);
+ cmem_lat_pmu_cg_ctrl(cmem_lat_pmu, CG_CTRL_DISABLE);
+}
+
+/* PMU identifier attribute. */
+
+static ssize_t cmem_lat_pmu_identifier_show(struct device *dev,
+ struct device_attribute *attr,
+ char *page)
+{
+ struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(dev_get_drvdata(dev));
+
+ return sysfs_emit(page, "%s\n", cmem_lat_pmu->identifier);
+}
+
+static struct device_attribute cmem_lat_pmu_identifier_attr =
+ __ATTR(identifier, 0444, cmem_lat_pmu_identifier_show, NULL);
+
+static struct attribute *cmem_lat_pmu_identifier_attrs[] = {
+ &cmem_lat_pmu_identifier_attr.attr,
+ NULL,
+};
+
+static struct attribute_group cmem_lat_pmu_identifier_attr_group = {
+ .attrs = cmem_lat_pmu_identifier_attrs,
+};
+
+/* Format attributes. */
+
+#define NV_PMU_EXT_ATTR(_name, _func, _config) \
+ (&((struct dev_ext_attribute[]){ \
+ { \
+ .attr = __ATTR(_name, 0444, _func, NULL), \
+ .var = (void *)_config \
+ } \
+ })[0].attr.attr)
+
+static struct attribute *cmem_lat_pmu_formats[] = {
+ NV_PMU_EXT_ATTR(event, device_show_string, "config:0-1"),
+ NULL,
+};
+
+static const struct attribute_group cmem_lat_pmu_format_group = {
+ .name = "format",
+ .attrs = cmem_lat_pmu_formats,
+};
+
+/* Event attributes. */
+
+static ssize_t cmem_lat_pmu_sysfs_event_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct perf_pmu_events_attr *pmu_attr;
+
+ pmu_attr = container_of(attr, typeof(*pmu_attr), attr);
+ return sysfs_emit(buf, "event=0x%llx\n", pmu_attr->id);
+}
+
+#define NV_PMU_EVENT_ATTR(_name, _config) \
+ PMU_EVENT_ATTR_ID(_name, cmem_lat_pmu_sysfs_event_show, _config)
+
+static struct attribute *cmem_lat_pmu_events[] = {
+ NV_PMU_EVENT_ATTR(cycles, EVENT_CYCLES),
+ NV_PMU_EVENT_ATTR(rd_req, EVENT_REQ),
+ NV_PMU_EVENT_ATTR(rd_cum_outs, EVENT_AOR),
+ NULL
+};
+
+static const struct attribute_group cmem_lat_pmu_events_group = {
+ .name = "events",
+ .attrs = cmem_lat_pmu_events,
+};
+
+/* Cpumask attributes. */
+
+static ssize_t cmem_lat_pmu_cpumask_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct pmu *pmu = dev_get_drvdata(dev);
+ struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(pmu);
+ struct dev_ext_attribute *eattr =
+ container_of(attr, struct dev_ext_attribute, attr);
+ unsigned long mask_id = (unsigned long)eattr->var;
+ const cpumask_t *cpumask;
+
+ switch (mask_id) {
+ case ACTIVE_CPU_MASK:
+ cpumask = &cmem_lat_pmu->active_cpu;
+ break;
+ case ASSOCIATED_CPU_MASK:
+ cpumask = &cmem_lat_pmu->associated_cpus;
+ break;
+ default:
+ return 0;
+ }
+ return cpumap_print_to_pagebuf(true, buf, cpumask);
+}
+
+#define NV_PMU_CPUMASK_ATTR(_name, _config) \
+ NV_PMU_EXT_ATTR(_name, cmem_lat_pmu_cpumask_show, \
+ (unsigned long)_config)
+
+static struct attribute *cmem_lat_pmu_cpumask_attrs[] = {
+ NV_PMU_CPUMASK_ATTR(cpumask, ACTIVE_CPU_MASK),
+ NV_PMU_CPUMASK_ATTR(associated_cpus, ASSOCIATED_CPU_MASK),
+ NULL,
+};
+
+static const struct attribute_group cmem_lat_pmu_cpumask_attr_group = {
+ .attrs = cmem_lat_pmu_cpumask_attrs,
+};
+
+/* Per PMU device attribute groups. */
+
+static const struct attribute_group *cmem_lat_pmu_attr_groups[] = {
+ &cmem_lat_pmu_identifier_attr_group,
+ &cmem_lat_pmu_format_group,
+ &cmem_lat_pmu_events_group,
+ &cmem_lat_pmu_cpumask_attr_group,
+ NULL,
+};
+
+static int cmem_lat_pmu_cpu_online(unsigned int cpu, struct hlist_node *node)
+{
+ struct cmem_lat_pmu *cmem_lat_pmu =
+ hlist_entry_safe(node, struct cmem_lat_pmu, node);
+
+ if (!cpumask_test_cpu(cpu, &cmem_lat_pmu->associated_cpus))
+ return 0;
+
+ /* If the PMU is already managed, there is nothing to do */
+ if (!cpumask_empty(&cmem_lat_pmu->active_cpu))
+ return 0;
+
+ /* Use this CPU for event counting */
+ cpumask_set_cpu(cpu, &cmem_lat_pmu->active_cpu);
+
+ return 0;
+}
+
+static int cmem_lat_pmu_cpu_teardown(unsigned int cpu, struct hlist_node *node)
+{
+ unsigned int dst;
+
+ struct cmem_lat_pmu *cmem_lat_pmu =
+ hlist_entry_safe(node, struct cmem_lat_pmu, node);
+
+ /* Nothing to do if this CPU doesn't own the PMU */
+ if (!cpumask_test_and_clear_cpu(cpu, &cmem_lat_pmu->active_cpu))
+ return 0;
+
+ /* Choose a new CPU to migrate ownership of the PMU to */
+ dst = cpumask_any_and_but(&cmem_lat_pmu->associated_cpus,
+ cpu_online_mask, cpu);
+ if (dst >= nr_cpu_ids)
+ return 0;
+
+ /* Use this CPU for event counting */
+ perf_pmu_migrate_context(&cmem_lat_pmu->pmu, cpu, dst);
+ cpumask_set_cpu(dst, &cmem_lat_pmu->active_cpu);
+
+ return 0;
+}
+
+static int cmem_lat_pmu_get_cpus(struct cmem_lat_pmu *cmem_lat_pmu,
+ unsigned int socket)
+{
+ int ret = 0, cpu;
+
+ for_each_possible_cpu(cpu) {
+ if (cpu_to_node(cpu) == socket)
+ cpumask_set_cpu(cpu, &cmem_lat_pmu->associated_cpus);
+ }
+
+ if (cpumask_empty(&cmem_lat_pmu->associated_cpus)) {
+ dev_dbg(cmem_lat_pmu->dev,
+ "No cpu associated with PMU socket-%u\n", socket);
+ ret = -ENODEV;
+ }
+
+ return ret;
+}
+
+static int cmem_lat_pmu_probe(struct platform_device *pdev)
+{
+ struct device *dev = &pdev->dev;
+ struct acpi_device *acpi_dev;
+ struct cmem_lat_pmu *cmem_lat_pmu;
+ char *name, *uid_str;
+ int ret, i;
+ u32 socket;
+
+ acpi_dev = ACPI_COMPANION(dev);
+ if (!acpi_dev)
+ return -ENODEV;
+
+ uid_str = acpi_device_uid(acpi_dev);
+ if (!uid_str)
+ return -ENODEV;
+
+ ret = kstrtou32(uid_str, 0, &socket);
+ if (ret)
+ return ret;
+
+ cmem_lat_pmu = devm_kzalloc(dev, sizeof(*cmem_lat_pmu), GFP_KERNEL);
+ name = devm_kasprintf(dev, GFP_KERNEL, "nvidia_cmem_latency_pmu_%u", socket);
+ if (!cmem_lat_pmu || !name)
+ return -ENOMEM;
+
+ cmem_lat_pmu->dev = dev;
+ cmem_lat_pmu->name = name;
+ cmem_lat_pmu->identifier = acpi_device_hid(acpi_dev);
+ platform_set_drvdata(pdev, cmem_lat_pmu);
+
+ cmem_lat_pmu->pmu = (struct pmu) {
+ .parent = &pdev->dev,
+ .task_ctx_nr = perf_invalid_context,
+ .pmu_enable = cmem_lat_pmu_enable,
+ .pmu_disable = cmem_lat_pmu_disable,
+ .event_init = cmem_lat_pmu_event_init,
+ .add = cmem_lat_pmu_add,
+ .del = cmem_lat_pmu_del,
+ .start = cmem_lat_pmu_start,
+ .stop = cmem_lat_pmu_stop,
+ .read = cmem_lat_pmu_read,
+ .attr_groups = cmem_lat_pmu_attr_groups,
+ .capabilities = PERF_PMU_CAP_NO_EXCLUDE |
+ PERF_PMU_CAP_NO_INTERRUPT,
+ };
+
+ /* Map the address of all the instances plus one for the broadcast. */
+ for (i = 0; i < NUM_INSTANCES + 1; i++) {
+ cmem_lat_pmu->base[i] = devm_platform_ioremap_resource(pdev, i);
+ if (IS_ERR(cmem_lat_pmu->base[i])) {
+ dev_err(dev, "Failed to map address for instance %d\n", i);
+ return PTR_ERR(cmem_lat_pmu->base[i]);
+ }
+ }
+
+ ret = cmem_lat_pmu_get_cpus(cmem_lat_pmu, socket);
+ if (ret)
+ return ret;
+
+ ret = cpuhp_state_add_instance(cmem_lat_pmu_cpuhp_state,
+ &cmem_lat_pmu->node);
+ if (ret) {
+ dev_err(&pdev->dev, "Error %d registering hotplug\n", ret);
+ return ret;
+ }
+
+ cmem_lat_pmu_cg_ctrl(cmem_lat_pmu, CG_CTRL_ENABLE);
+ cmem_lat_pmu_ctrl(cmem_lat_pmu, CTRL_CLR);
+ cmem_lat_pmu_cg_ctrl(cmem_lat_pmu, CG_CTRL_DISABLE);
+
+ ret = perf_pmu_register(&cmem_lat_pmu->pmu, name, -1);
+ if (ret) {
+ dev_err(&pdev->dev, "Failed to register PMU: %d\n", ret);
+ cpuhp_state_remove_instance(cmem_lat_pmu_cpuhp_state,
+ &cmem_lat_pmu->node);
+ return ret;
+ }
+
+ dev_dbg(&pdev->dev, "Registered %s PMU\n", name);
+
+ return 0;
+}
+
+static void cmem_lat_pmu_device_remove(struct platform_device *pdev)
+{
+ struct cmem_lat_pmu *cmem_lat_pmu = platform_get_drvdata(pdev);
+
+ perf_pmu_unregister(&cmem_lat_pmu->pmu);
+ cpuhp_state_remove_instance(cmem_lat_pmu_cpuhp_state,
+ &cmem_lat_pmu->node);
+}
+
+static const struct acpi_device_id cmem_lat_pmu_acpi_match[] = {
+ { "NVDA2021", },
+ { }
+};
+MODULE_DEVICE_TABLE(acpi, cmem_lat_pmu_acpi_match);
+
+static struct platform_driver cmem_lat_pmu_driver = {
+ .driver = {
+ .name = "nvidia-t410-cmem-latency-pmu",
+ .acpi_match_table = ACPI_PTR(cmem_lat_pmu_acpi_match),
+ .suppress_bind_attrs = true,
+ },
+ .probe = cmem_lat_pmu_probe,
+ .remove = cmem_lat_pmu_device_remove,
+};
+
+static int __init cmem_lat_pmu_init(void)
+{
+ int ret;
+
+ ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
+ "perf/nvidia/cmem_latency:online",
+ cmem_lat_pmu_cpu_online,
+ cmem_lat_pmu_cpu_teardown);
+ if (ret < 0)
+ return ret;
+
+ cmem_lat_pmu_cpuhp_state = ret;
+
+ return platform_driver_register(&cmem_lat_pmu_driver);
+}
+
+static void __exit cmem_lat_pmu_exit(void)
+{
+ platform_driver_unregister(&cmem_lat_pmu_driver);
+ cpuhp_remove_multi_state(cmem_lat_pmu_cpuhp_state);
+}
+
+module_init(cmem_lat_pmu_init);
+module_exit(cmem_lat_pmu_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("NVIDIA Tegra410 CPU Memory Latency PMU driver");
+MODULE_AUTHOR("Besar Wicaksono <bwicaksono@nvidia.com>");
--
2.43.0
^ permalink raw reply related [flat|nested] 18+ messages in thread
* Re: [PATCH v2 6/8] perf: add NVIDIA Tegra410 CPU Memory Latency PMU
2026-02-18 14:58 ` [PATCH v2 6/8] perf: add NVIDIA Tegra410 CPU Memory Latency PMU Besar Wicaksono
@ 2026-02-19 10:28 ` Jonathan Cameron
0 siblings, 0 replies; 18+ messages in thread
From: Jonathan Cameron @ 2026-02-19 10:28 UTC (permalink / raw)
To: Besar Wicaksono
Cc: will, suzuki.poulose, robin.murphy, ilkka, linux-arm-kernel,
linux-kernel, linux-tegra, mark.rutland, treding, jonathanh,
vsethi, rwiley, sdonthineni, skelley, ywan, mochs, nirmoyd
On Wed, 18 Feb 2026 14:58:07 +0000
Besar Wicaksono <bwicaksono@nvidia.com> wrote:
> Add CPU Memory (CMEM) Latency PMU support to the Tegra410 SoC.
> The PMU is used to measure latency between the edge of the
> Unified Coherence Fabric and the local system DRAM.
>
> Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
Hi Besar
Various fairly superficial things inline. The problem with reviewing
uncore drivers is that I've always forgotten the details of the interactions
with the perf core. There aren't enough of them coming through to keep it all
in my head. Sadly I don't have time today to page all that info back in :(
Jonathan
> diff --git a/drivers/perf/nvidia_t410_cmem_latency_pmu.c b/drivers/perf/nvidia_t410_cmem_latency_pmu.c
> new file mode 100644
> index 000000000000..9b466581c8fc
> --- /dev/null
> +++ b/drivers/perf/nvidia_t410_cmem_latency_pmu.c
> @@ -0,0 +1,727 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * NVIDIA Tegra410 CPU Memory (CMEM) Latency PMU driver.
> + *
> + * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
> + */
> +
> +#include <linux/acpi.h>
> +#include <linux/bitops.h>
> +#include <linux/cpumask.h>
> +#include <linux/device.h>
> +#include <linux/interrupt.h>
> +#include <linux/io.h>
> +#include <linux/module.h>
> +#include <linux/perf_event.h>
> +#include <linux/platform_device.h>
> +#define NUM_INSTANCES 14
> +#define BCAST(pmu) pmu->base[NUM_INSTANCES]
As below. This is a weird macro. I'd just split the
base addresses up into a separate broadcast_base pointer
and an array of the individual bases.
> +
> +/* Register offsets. */
> +#define CG_CTRL 0x800
> +#define CTRL 0x808
> +#define STATUS 0x810
> +#define CYCLE_CNTR 0x818
> +#define MC0_REQ_CNTR 0x820
> +#define MC0_AOR_CNTR 0x830
> +#define MC1_REQ_CNTR 0x838
> +#define MC1_AOR_CNTR 0x848
> +#define MC2_REQ_CNTR 0x850
> +#define MC2_AOR_CNTR 0x860
> +
> +/* CTRL values. */
> +#define CTRL_DISABLE 0x0ULL
> +#define CTRL_ENABLE 0x1ULL
> +#define CTRL_CLR 0x2ULL
> +
> +/* CG_CTRL values. */
> +#define CG_CTRL_DISABLE 0x0ULL
> +#define CG_CTRL_ENABLE 0x1ULL
> +
> +/* STATUS register field. */
> +#define STATUS_CYCLE_OVF BIT(0)
> +#define STATUS_MC0_AOR_OVF BIT(1)
> +#define STATUS_MC0_REQ_OVF BIT(3)
> +#define STATUS_MC1_AOR_OVF BIT(4)
> +#define STATUS_MC1_REQ_OVF BIT(6)
> +#define STATUS_MC2_AOR_OVF BIT(7)
> +#define STATUS_MC2_REQ_OVF BIT(9)
> +
> +/* Events. */
> +#define EVENT_CYCLES 0x0
> +#define EVENT_REQ 0x1
> +#define EVENT_AOR 0x2
> +
> +#define NUM_EVENTS 0x3
> +#define MASK_EVENT 0x3
> +#define MAX_ACTIVE_EVENTS 32
> +
> +#define ACTIVE_CPU_MASK 0x0
Some of these are very generic names. To avoid a future clash with something in a
header, I'd prefix them with something related to this driver.
> +#define ASSOCIATED_CPU_MASK 0x1
> +
> +static unsigned long cmem_lat_pmu_cpuhp_state;
> +
> +struct cmem_lat_pmu_hw_events {
> + struct perf_event *events[MAX_ACTIVE_EVENTS];
> + DECLARE_BITMAP(used_ctrs, MAX_ACTIVE_EVENTS);
> +};
> +
> +struct cmem_lat_pmu {
> + struct pmu pmu;
> + struct device *dev;
> + const char *name;
> + const char *identifier;
> + void __iomem *base[NUM_INSTANCES + 1];
As below. There are two types of things in this array (hence the +1).
I'd just split it into an array of NUM_INSTANCES and a separate
pointer for the last one.
> + cpumask_t associated_cpus;
> + cpumask_t active_cpu;
> + struct hlist_node node;
> + struct cmem_lat_pmu_hw_events hw_events;
> +};
> +static bool cmem_lat_pmu_validate_event(struct pmu *pmu,
> + struct cmem_lat_pmu_hw_events *hw_events,
> + struct perf_event *event)
> +{
> + if (is_software_event(event))
> + return true;
> +
> + /* Reject groups spanning multiple HW PMUs. */
> + if (event->pmu != pmu)
> + return false;
> +
> + return (cmem_lat_pmu_get_event_idx(hw_events, event) >= 0);
I'd be tempted to use
int ret;
...
ret = cmem_lat_pmu_get_event_idx(hw_events, event);
if (ret < 0)
return false;
return true;
As that makes it more obvious that the final check is on the validity of the idx.
> +}
> +
> +/*
> + * Make sure the group of events can be scheduled at once
> + * on the PMU.
Wrap to 80 chars.
> + */
> +static bool cmem_lat_pmu_validate_group(struct perf_event *event)
> +
> +static int cmem_lat_pmu_event_init(struct perf_event *event)
> +{
> + struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(event->pmu);
> + struct hw_perf_event *hwc = &event->hw;
> + u32 event_type = get_event_type(event);
> +
> + if (event->attr.type != event->pmu->type ||
> + event_type >= NUM_EVENTS)
> + return -ENOENT;
> +
> + /*
> + * Following other "uncore" PMUs, we do not support sampling mode or
> + * attach to a task (per-process mode).
Perhaps nicer to say why all uncore PMUs do this rather than that this one
is doing the same as the others...
Basically it's that they are system-wide, so it's not clear what either
sampling or task attachment would actually mean.
> + */
> + if (is_sampling_event(event)) {
> + dev_dbg(cmem_lat_pmu->pmu.dev,
> + "Can't support sampling events\n");
> + return -EOPNOTSUPP;
> + }
> +
> +static void cmem_lat_pmu_read(struct perf_event *event)
> +{
> + cmem_lat_pmu_event_update(event);
> +}
> +
> +static inline void cmem_lat_pmu_cg_ctrl(struct cmem_lat_pmu *cmem_lat_pmu, u64 val)
> +{
> + writeq(val, BCAST(cmem_lat_pmu) + CG_CTRL);
The BCAST macro is odd enough that I'd just write what it does inline here, so
it's clear it's just the last element. I'm not entirely sure why you put it at
the end of that array though. Why not just have a separate element in the
struct?
> +}
> +
> +static inline void cmem_lat_pmu_ctrl(struct cmem_lat_pmu *cmem_lat_pmu, u64 val)
> +{
> + writeq(val, BCAST(cmem_lat_pmu) + CTRL);
> +}
> +
> +static void cmem_lat_pmu_enable(struct pmu *pmu)
> +{
> + bool disabled;
> + struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(pmu);
> +
> + disabled = bitmap_empty(
> + cmem_lat_pmu->hw_events.used_ctrs, MAX_ACTIVE_EVENTS);
This is unusual formatting. Much better to have the parameters up a line
and, if you go to a second line, start under the first character after the '('.
> +
> + if (disabled)
> + return;
> +
> + /* Enable all the counters. */
> + cmem_lat_pmu_cg_ctrl(cmem_lat_pmu, CG_CTRL_ENABLE);
> + cmem_lat_pmu_ctrl(cmem_lat_pmu, CTRL_ENABLE);
> +}
> +
> +static void cmem_lat_pmu_disable(struct pmu *pmu)
> +{
> + int idx;
> + struct perf_event *event;
> + struct cmem_lat_pmu *cmem_lat_pmu = to_cmem_lat_pmu(pmu);
> +
> + /* Disable all the counters. */
> + cmem_lat_pmu_ctrl(cmem_lat_pmu, CTRL_DISABLE);
> +
> + /*
> + * The counters will start from 0 again on restart.
> + * Update the events immediately to avoid losing the counts.
> + */
> + for_each_set_bit(
> + idx, cmem_lat_pmu->hw_events.used_ctrs, MAX_ACTIVE_EVENTS) {
Very unusual formatting for a for loop. Move at least some, maybe all, of the
parameters up a line. The last thing we want is for them to be indented the
same as the body of the loop.
Probably also drag the declaration of event in here to make it clearer what
scope that local variable has.
> + event = cmem_lat_pmu->hw_events.events[idx];
> +
> + if (!event)
> + continue;
> +
> + cmem_lat_pmu_event_update(event);
> +
> + local64_set(&event->hw.prev_count, 0ULL);
> + }
> +
> + cmem_lat_pmu_ctrl(cmem_lat_pmu, CTRL_CLR);
> + cmem_lat_pmu_cg_ctrl(cmem_lat_pmu, CG_CTRL_DISABLE);
> +}
> +static struct attribute *cmem_lat_pmu_formats[] = {
> + NV_PMU_EXT_ATTR(event, device_show_string, "config:0-1"),
> + NULL,
As below.
> +};
> +
> +#define NV_PMU_CPUMASK_ATTR(_name, _config) \
> + NV_PMU_EXT_ATTR(_name, cmem_lat_pmu_cpumask_show, \
> + (unsigned long)_config)
> +
> +static struct attribute *cmem_lat_pmu_cpumask_attrs[] = {
> + NV_PMU_CPUMASK_ATTR(cpumask, ACTIVE_CPU_MASK),
> + NV_PMU_CPUMASK_ATTR(associated_cpus, ASSOCIATED_CPU_MASK),
> + NULL,
As below.
> +};
> +
> +static const struct attribute_group cmem_lat_pmu_cpumask_attr_group = {
> + .attrs = cmem_lat_pmu_cpumask_attrs,
> +};
> +
> +/* Per PMU device attribute groups. */
> +
> +static const struct attribute_group *cmem_lat_pmu_attr_groups[] = {
> + &cmem_lat_pmu_identifier_attr_group,
> + &cmem_lat_pmu_format_group,
> + &cmem_lat_pmu_events_group,
> + &cmem_lat_pmu_cpumask_attr_group,
> + NULL,
New code, so no need to copy any local style. Hence drop that trailing comma :)
> +};
> +
> +static int cmem_lat_pmu_get_cpus(struct cmem_lat_pmu *cmem_lat_pmu,
> + unsigned int socket)
> +{
> + int ret = 0, cpu;
> +
> + for_each_possible_cpu(cpu) {
> + if (cpu_to_node(cpu) == socket)
> + cpumask_set_cpu(cpu, &cmem_lat_pmu->associated_cpus);
> + }
> +
> + if (cpumask_empty(&cmem_lat_pmu->associated_cpus)) {
> + dev_dbg(cmem_lat_pmu->dev,
> + "No cpu associated with PMU socket-%u\n", socket);
> + ret = -ENODEV;
return -ENODEV;
Saves the reviewer reading on, and...
> + }
> +
> + return ret;
return 0;
so they know that getting here always indicates success.
> +}
> +
> +static const struct acpi_device_id cmem_lat_pmu_acpi_match[] = {
> + { "NVDA2021", },
The trailing comma after the string doesn't add anything, so I'd drop it.
> + { }
> +};
> +MODULE_DEVICE_TABLE(acpi, cmem_lat_pmu_acpi_match);
^ permalink raw reply [flat|nested] 18+ messages in thread
* [PATCH v2 7/8] perf: add NVIDIA Tegra410 C2C PMU
2026-02-18 14:58 [PATCH v2 0/8] perf: add NVIDIA Tegra410 Uncore PMU support Besar Wicaksono
` (5 preceding siblings ...)
2026-02-18 14:58 ` [PATCH v2 6/8] perf: add NVIDIA Tegra410 CPU Memory Latency PMU Besar Wicaksono
@ 2026-02-18 14:58 ` Besar Wicaksono
2026-02-19 10:55 ` Jonathan Cameron
2026-02-18 14:58 ` [PATCH v2 8/8] arm64: defconfig: Enable NVIDIA TEGRA410 PMU Besar Wicaksono
7 siblings, 1 reply; 18+ messages in thread
From: Besar Wicaksono @ 2026-02-18 14:58 UTC (permalink / raw)
To: will, suzuki.poulose, robin.murphy, ilkka
Cc: linux-arm-kernel, linux-kernel, linux-tegra, mark.rutland,
treding, jonathanh, vsethi, rwiley, sdonthineni, skelley, ywan,
mochs, nirmoyd, Besar Wicaksono
Add NVIDIA C2C PMU support to the Tegra410 SoC. This PMU is
used to measure memory latency between the SoC and device
memory, e.g. GPU Memory (GMEM), CXL memory, or memory on a
remote Tegra410 SoC.
Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
---
.../admin-guide/perf/nvidia-tegra410-pmu.rst | 151 +++
drivers/perf/Kconfig | 7 +
drivers/perf/Makefile | 1 +
drivers/perf/nvidia_t410_c2c_pmu.c | 1062 +++++++++++++++++
4 files changed, 1221 insertions(+)
create mode 100644 drivers/perf/nvidia_t410_c2c_pmu.c
diff --git a/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst b/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
index c8fbc289d12c..678cb3df228e 100644
--- a/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
+++ b/Documentation/admin-guide/perf/nvidia-tegra410-pmu.rst
@@ -9,6 +9,9 @@ metrics like memory bandwidth, latency, and utilization:
* PCIE
* PCIE-TGT
* CPU Memory (CMEM) Latency
+* NVLink-C2C
+* NV-CLink
+* NV-DLink
PMU Driver
----------
@@ -367,3 +370,151 @@ see /sys/bus/event_source/devices/nvidia_cmem_latency_pmu_<socket-id>.
Example usage::
perf stat -a -e '{nvidia_cmem_latency_pmu_0/rd_req/,nvidia_cmem_latency_pmu_0/rd_cum_outs/,nvidia_cmem_latency_pmu_0/cycles/}'
+
+NVLink-C2C PMU
+--------------
+
+This PMU monitors latency events of memory read/write requests that pass through
+the NVIDIA Chip-to-Chip (C2C) interface. Bandwidth events are not available
+in this PMU, unlike the C2C PMU in Grace (Tegra241 SoC).
+
+The events and configuration options of this PMU device are available in sysfs,
+see /sys/bus/event_source/devices/nvidia_nvlink_c2c_pmu_<socket-id>.
+
+The list of events:
+
+ * IN_RD_CUM_OUTS: accumulated outstanding request (in cycles) of incoming read requests.
+ * IN_RD_REQ: the number of incoming read requests.
+ * IN_WR_CUM_OUTS: accumulated outstanding request (in cycles) of incoming write requests.
+ * IN_WR_REQ: the number of incoming write requests.
+ * OUT_RD_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing read requests.
+ * OUT_RD_REQ: the number of outgoing read requests.
+ * OUT_WR_CUM_OUTS: accumulated outstanding request (in cycles) of outgoing write requests.
+ * OUT_WR_REQ: the number of outgoing write requests.
+ * CYCLES: NVLink-C2C interface cycle counts.
+
+The incoming events count the reads/writes from the remote device to the SoC.
+The outgoing events count the reads/writes from the SoC to the remote device.
+
+The sysfs /sys/bus/event_source/devices/nvidia_nvlink_c2c_pmu_<socket-id>/peer
+contains the information about the connected device.
+
+When the C2C interface is connected to GPU(s), the user can use the
+"gpu_mask" parameter to filter traffic to/from specific GPU(s). Each bit
+represents a GPU index, e.g. "gpu_mask=0x1" corresponds to GPU 0 and
+"gpu_mask=0x3" to GPU 0 and 1. The PMU monitors all GPUs by default when
+"gpu_mask" is not specified.
+
+When connected to another SoC, only the read events are available.
+
+The events can be used to calculate the average latency of the read/write requests::
+
+ C2C_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
+
+ IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
+ IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ
+
+ IN_WR_AVG_LATENCY_IN_CYCLES = IN_WR_CUM_OUTS / IN_WR_REQ
+ IN_WR_AVG_LATENCY_IN_NS = IN_WR_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ
+
+ OUT_RD_AVG_LATENCY_IN_CYCLES = OUT_RD_CUM_OUTS / OUT_RD_REQ
+ OUT_RD_AVG_LATENCY_IN_NS = OUT_RD_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ
+
+ OUT_WR_AVG_LATENCY_IN_CYCLES = OUT_WR_CUM_OUTS / OUT_WR_REQ
+ OUT_WR_AVG_LATENCY_IN_NS = OUT_WR_AVG_LATENCY_IN_CYCLES / C2C_FREQ_IN_GHZ
+
+Example usage:
+
+ * Count incoming traffic from all GPUs connected via NVLink-C2C::
+
+ perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_req/
+
+ * Count cumulative outstanding incoming reads from GPU 0 connected via NVLink-C2C::
+
+ perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_cum_outs,gpu_mask=0x1/
+
+ * Count cumulative outstanding incoming reads from GPU 1 connected via NVLink-C2C::
+
+ perf stat -a -e nvidia_nvlink_c2c_pmu_0/in_rd_cum_outs,gpu_mask=0x2/
+
+ * Count outgoing traffic to all GPUs connected via NVLink-C2C::
+
+ perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_req/
+
+ * Count cumulative outstanding outgoing reads to GPU 0 connected via NVLink-C2C::
+
+ perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_cum_outs,gpu_mask=0x1/
+
+ * Count cumulative outstanding outgoing reads to GPU 1 connected via NVLink-C2C::
+
+ perf stat -a -e nvidia_nvlink_c2c_pmu_0/out_rd_cum_outs,gpu_mask=0x2/
+
+NV-CLink PMU
+------------
+
+This PMU monitors latency events of memory read requests that pass through
+the NV-CLink interface. Bandwidth events are not available in this PMU.
+In the Tegra410 SoC, the NV-CLink interface connects to another Tegra410
+SoC, and this PMU counts read traffic only.
+
+The events and configuration options of this PMU device are available in sysfs;
+see /sys/bus/event_source/devices/nvidia_nvclink_pmu_<socket-id>.
+
+The list of events:
+
+ * IN_RD_CUM_OUTS: accumulated outstanding requests (in cycles) of incoming reads.
+ * IN_RD_REQ: the number of incoming read requests.
+ * OUT_RD_CUM_OUTS: accumulated outstanding requests (in cycles) of outgoing reads.
+ * OUT_RD_REQ: the number of outgoing read requests.
+ * CYCLES: NV-CLink interface cycle count.
+
+The incoming events count reads from a remote device to the SoC.
+The outgoing events count reads from the SoC to a remote device.
+
+The events can be used to calculate the average latency of the read requests::
+
+ CLINK_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
+
+ IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
+ IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / CLINK_FREQ_IN_GHZ
+
+ OUT_RD_AVG_LATENCY_IN_CYCLES = OUT_RD_CUM_OUTS / OUT_RD_REQ
+ OUT_RD_AVG_LATENCY_IN_NS = OUT_RD_AVG_LATENCY_IN_CYCLES / CLINK_FREQ_IN_GHZ
+
+Example usage:
+
+ * Count incoming read traffic from the remote SoC connected via NV-CLink::
+
+ perf stat -a -e nvidia_nvclink_pmu_0/in_rd_req/
+
+ * Count outgoing read traffic to the remote SoC connected via NV-CLink::
+
+ perf stat -a -e nvidia_nvclink_pmu_0/out_rd_req/
+
+NV-DLink PMU
+------------
+
+This PMU monitors latency events of memory read requests that pass through
+the NV-DLink interface. Bandwidth events are not available in this PMU.
+In the Tegra410 SoC, this PMU counts CXL memory read traffic only.
+
+The events and configuration options of this PMU device are available in sysfs;
+see /sys/bus/event_source/devices/nvidia_nvdlink_pmu_<socket-id>.
+
+The list of events:
+
+ * IN_RD_CUM_OUTS: accumulated outstanding read requests (in cycles) to CXL memory.
+ * IN_RD_REQ: the number of read requests to CXL memory.
+ * CYCLES: NV-DLink interface cycle count.
+
+The events can be used to calculate the average latency of the read requests::
+
+ DLINK_FREQ_IN_GHZ = CYCLES / ELAPSED_TIME_IN_NS
+
+ IN_RD_AVG_LATENCY_IN_CYCLES = IN_RD_CUM_OUTS / IN_RD_REQ
+ IN_RD_AVG_LATENCY_IN_NS = IN_RD_AVG_LATENCY_IN_CYCLES / DLINK_FREQ_IN_GHZ
+
+Example usage:
+
+ * Count read events to CXL memory::
+
+ perf stat -a -e '{nvidia_nvdlink_pmu_0/in_rd_req/,nvidia_nvdlink_pmu_0/in_rd_cum_outs/}'
diff --git a/drivers/perf/Kconfig b/drivers/perf/Kconfig
index 26e86067d8f9..ab90932fc2d0 100644
--- a/drivers/perf/Kconfig
+++ b/drivers/perf/Kconfig
@@ -318,4 +318,11 @@ config NVIDIA_TEGRA410_CMEM_LATENCY_PMU
Enable perf support for CPU memory latency counters monitoring on
NVIDIA Tegra410 SoC.
+config NVIDIA_TEGRA410_C2C_PMU
+ tristate "NVIDIA Tegra410 C2C PMU"
+ depends on ARM64 && ACPI
+ help
+	  Enable perf support for counters on the C2C interface of the NVIDIA
+	  Tegra410 SoC.
+
endmenu
diff --git a/drivers/perf/Makefile b/drivers/perf/Makefile
index 4aa6aad393c2..eb8a022dad9a 100644
--- a/drivers/perf/Makefile
+++ b/drivers/perf/Makefile
@@ -36,3 +36,4 @@ obj-$(CONFIG_ARM_CORESIGHT_PMU_ARCH_SYSTEM_PMU) += arm_cspmu/
obj-$(CONFIG_MESON_DDR_PMU) += amlogic/
obj-$(CONFIG_CXL_PMU) += cxl_pmu.o
obj-$(CONFIG_NVIDIA_TEGRA410_CMEM_LATENCY_PMU) += nvidia_t410_cmem_latency_pmu.o
+obj-$(CONFIG_NVIDIA_TEGRA410_C2C_PMU) += nvidia_t410_c2c_pmu.o
diff --git a/drivers/perf/nvidia_t410_c2c_pmu.c b/drivers/perf/nvidia_t410_c2c_pmu.c
new file mode 100644
index 000000000000..a3891c94dcde
--- /dev/null
+++ b/drivers/perf/nvidia_t410_c2c_pmu.c
@@ -0,0 +1,1062 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * NVIDIA Tegra410 C2C PMU driver.
+ *
+ * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+ */
+
+#include <linux/acpi.h>
+#include <linux/bitops.h>
+#include <linux/cpumask.h>
+#include <linux/device.h>
+#include <linux/interrupt.h>
+#include <linux/io.h>
+#include <linux/module.h>
+#include <linux/perf_event.h>
+#include <linux/platform_device.h>
+#include <linux/property.h>
+
+/* The C2C interface types in Tegra410. */
+#define C2C_TYPE_NVLINK 0x0
+#define C2C_TYPE_NVCLINK 0x1
+#define C2C_TYPE_NVDLINK 0x2
+#define C2C_TYPE_COUNT 0x3
+
+/* The type of the peer device connected to the C2C interface. */
+#define C2C_PEER_TYPE_CPU 0x0
+#define C2C_PEER_TYPE_GPU 0x1
+#define C2C_PEER_TYPE_CXLMEM 0x2
+#define C2C_PEER_TYPE_COUNT 0x3
+
+/* The number of peer devices that can be connected to the C2C interface. */
+#define C2C_NR_PEER_CPU 0x1
+#define C2C_NR_PEER_GPU 0x2
+#define C2C_NR_PEER_CXLMEM 0x1
+#define C2C_NR_PEER_MAX 0x2
+
+/* Number of instances on each interface. */
+#define C2C_NR_INST_NVLINK 14
+#define C2C_NR_INST_NVCLINK 12
+#define C2C_NR_INST_NVDLINK 16
+#define C2C_NR_INST_MAX 16
+
+/* Register offsets. */
+#define C2C_CTRL 0x864
+#define C2C_IN_STATUS 0x868
+#define C2C_CYCLE_CNTR 0x86c
+#define C2C_IN_RD_CUM_OUTS_CNTR 0x874
+#define C2C_IN_RD_REQ_CNTR 0x87c
+#define C2C_IN_WR_CUM_OUTS_CNTR 0x884
+#define C2C_IN_WR_REQ_CNTR 0x88c
+#define C2C_OUT_STATUS 0x890
+#define C2C_OUT_RD_CUM_OUTS_CNTR 0x898
+#define C2C_OUT_RD_REQ_CNTR 0x8a0
+#define C2C_OUT_WR_CUM_OUTS_CNTR 0x8a8
+#define C2C_OUT_WR_REQ_CNTR 0x8b0
+
+/* C2C_IN_STATUS register field. */
+#define C2C_IN_STATUS_CYCLE_OVF BIT(0)
+#define C2C_IN_STATUS_IN_RD_CUM_OUTS_OVF BIT(1)
+#define C2C_IN_STATUS_IN_RD_REQ_OVF BIT(2)
+#define C2C_IN_STATUS_IN_WR_CUM_OUTS_OVF BIT(3)
+#define C2C_IN_STATUS_IN_WR_REQ_OVF BIT(4)
+
+/* C2C_OUT_STATUS register field. */
+#define C2C_OUT_STATUS_OUT_RD_CUM_OUTS_OVF BIT(0)
+#define C2C_OUT_STATUS_OUT_RD_REQ_OVF BIT(1)
+#define C2C_OUT_STATUS_OUT_WR_CUM_OUTS_OVF BIT(2)
+#define C2C_OUT_STATUS_OUT_WR_REQ_OVF BIT(3)
+
+/* Events. */
+#define C2C_EVENT_CYCLES 0x0
+#define C2C_EVENT_IN_RD_CUM_OUTS 0x1
+#define C2C_EVENT_IN_RD_REQ 0x2
+#define C2C_EVENT_IN_WR_CUM_OUTS 0x3
+#define C2C_EVENT_IN_WR_REQ 0x4
+#define C2C_EVENT_OUT_RD_CUM_OUTS 0x5
+#define C2C_EVENT_OUT_RD_REQ 0x6
+#define C2C_EVENT_OUT_WR_CUM_OUTS 0x7
+#define C2C_EVENT_OUT_WR_REQ 0x8
+
+#define C2C_NUM_EVENTS 0x9
+#define C2C_MASK_EVENT 0xFF
+#define C2C_MAX_ACTIVE_EVENTS 32
+
+#define C2C_ACTIVE_CPU_MASK 0x0
+#define C2C_ASSOCIATED_CPU_MASK 0x1
+
+/*
+ * Maximum poll count for reading counter value using high-low-high sequence.
+ */
+#define HILOHI_MAX_POLL 1000
+
+static unsigned long nv_c2c_pmu_cpuhp_state;
+
+/* PMU descriptor. */
+
+/* Tracks the events assigned to the PMU for a given logical index. */
+struct nv_c2c_pmu_hw_events {
+ /* The events that are active. */
+ struct perf_event *events[C2C_MAX_ACTIVE_EVENTS];
+
+ /*
+ * Each bit indicates a logical counter is being used (or not) for an
+ * event.
+ */
+ DECLARE_BITMAP(used_ctrs, C2C_MAX_ACTIVE_EVENTS);
+};
+
+struct nv_c2c_pmu {
+ struct pmu pmu;
+ struct device *dev;
+ struct acpi_device *acpi_dev;
+
+ const char *name;
+ const char *identifier;
+
+ unsigned int c2c_type;
+ unsigned int peer_type;
+ unsigned int socket;
+ unsigned int nr_inst;
+ unsigned int nr_peer;
+ unsigned long peer_insts[C2C_NR_PEER_MAX][BITS_TO_LONGS(C2C_NR_INST_MAX)];
+ u32 filter_default;
+
+ struct nv_c2c_pmu_hw_events hw_events;
+
+ cpumask_t associated_cpus;
+ cpumask_t active_cpu;
+
+ struct hlist_node cpuhp_node;
+
+ struct attribute **formats;
+ const struct attribute_group *attr_groups[6];
+
+ void __iomem *base_broadcast;
+ void __iomem *base[C2C_NR_INST_MAX];
+};
+
+#define to_c2c_pmu(p) (container_of(p, struct nv_c2c_pmu, pmu))
+
+/* Get event type from perf_event. */
+static inline u32 get_event_type(struct perf_event *event)
+{
+ return (event->attr.config) & C2C_MASK_EVENT;
+}
+
+static inline u32 get_filter_mask(struct perf_event *event)
+{
+ u32 filter;
+ struct nv_c2c_pmu *c2c_pmu = to_c2c_pmu(event->pmu);
+
+ filter = ((u32)event->attr.config1) & c2c_pmu->filter_default;
+ if (filter == 0)
+ filter = c2c_pmu->filter_default;
+
+ return filter;
+}
+
+/* PMU operations. */
+
+static int nv_c2c_pmu_get_event_idx(struct nv_c2c_pmu_hw_events *hw_events,
+ struct perf_event *event)
+{
+ u32 idx;
+
+ idx = find_first_zero_bit(hw_events->used_ctrs, C2C_MAX_ACTIVE_EVENTS);
+ if (idx >= C2C_MAX_ACTIVE_EVENTS)
+ return -EAGAIN;
+
+ set_bit(idx, hw_events->used_ctrs);
+
+ return idx;
+}
+
+static bool
+nv_c2c_pmu_validate_event(struct pmu *pmu,
+ struct nv_c2c_pmu_hw_events *hw_events,
+ struct perf_event *event)
+{
+ if (is_software_event(event))
+ return true;
+
+ /* Reject groups spanning multiple HW PMUs. */
+ if (event->pmu != pmu)
+ return false;
+
+ return nv_c2c_pmu_get_event_idx(hw_events, event) >= 0;
+}
+
+/*
+ * Make sure the group of events can be scheduled at once
+ * on the PMU.
+ */
+static bool nv_c2c_pmu_validate_group(struct perf_event *event)
+{
+ struct perf_event *sibling, *leader = event->group_leader;
+ struct nv_c2c_pmu_hw_events fake_hw_events;
+
+ if (event->group_leader == event)
+ return true;
+
+ memset(&fake_hw_events, 0, sizeof(fake_hw_events));
+
+ if (!nv_c2c_pmu_validate_event(event->pmu, &fake_hw_events, leader))
+ return false;
+
+ for_each_sibling_event(sibling, leader) {
+ if (!nv_c2c_pmu_validate_event(event->pmu, &fake_hw_events,
+ sibling))
+ return false;
+ }
+
+ return nv_c2c_pmu_validate_event(event->pmu, &fake_hw_events, event);
+}
+
+static int nv_c2c_pmu_event_init(struct perf_event *event)
+{
+ struct nv_c2c_pmu *c2c_pmu = to_c2c_pmu(event->pmu);
+ struct hw_perf_event *hwc = &event->hw;
+ u32 event_type = get_event_type(event);
+
+ if (event->attr.type != event->pmu->type ||
+ event_type >= C2C_NUM_EVENTS)
+ return -ENOENT;
+
+ /*
+ * Following other "uncore" PMUs, we do not support sampling mode or
+ * attach to a task (per-process mode).
+ */
+ if (is_sampling_event(event)) {
+ dev_dbg(c2c_pmu->pmu.dev, "Can't support sampling events\n");
+ return -EOPNOTSUPP;
+ }
+
+ if (event->cpu < 0 || event->attach_state & PERF_ATTACH_TASK) {
+ dev_dbg(c2c_pmu->pmu.dev, "Can't support per-task counters\n");
+ return -EINVAL;
+ }
+
+ /*
+ * Make sure the CPU assignment is on one of the CPUs associated with
+ * this PMU.
+ */
+ if (!cpumask_test_cpu(event->cpu, &c2c_pmu->associated_cpus)) {
+ dev_dbg(c2c_pmu->pmu.dev,
+ "Requested cpu is not associated with the PMU\n");
+ return -EINVAL;
+ }
+
+ /* Enforce the current active CPU to handle the events in this PMU. */
+ event->cpu = cpumask_first(&c2c_pmu->active_cpu);
+ if (event->cpu >= nr_cpu_ids)
+ return -EINVAL;
+
+ if (!nv_c2c_pmu_validate_group(event))
+ return -EINVAL;
+
+ hwc->idx = -1;
+ hwc->config = event_type;
+
+ return 0;
+}
+
+/*
+ * Read 64-bit register as a pair of 32-bit registers using hi-lo-hi sequence.
+ */
+static u64 read_reg64_hilohi(const void __iomem *addr, u32 max_poll_count)
+{
+ u32 val_lo, val_hi;
+ u64 val;
+
+ /* Use high-low-high sequence to avoid tearing */
+ do {
+ if (max_poll_count-- == 0) {
+			pr_err("NV C2C PMU: timeout in high-low-high read sequence\n");
+ return 0;
+ }
+
+ val_hi = readl(addr + 4);
+ val_lo = readl(addr);
+ } while (val_hi != readl(addr + 4));
+
+ val = (((u64)val_hi << 32) | val_lo);
+
+ return val;
+}
+
+static void nv_c2c_pmu_check_status(struct nv_c2c_pmu *c2c_pmu, u32 instance)
+{
+ u32 in_status, out_status;
+
+ in_status = readl(c2c_pmu->base[instance] + C2C_IN_STATUS);
+ out_status = readl(c2c_pmu->base[instance] + C2C_OUT_STATUS);
+
+ if (in_status || out_status)
+ dev_warn(c2c_pmu->dev,
+ "C2C PMU overflow in: 0x%x, out: 0x%x\n",
+ in_status, out_status);
+}
+
+static u32 nv_c2c_ctr_offset[C2C_NUM_EVENTS] = {
+ [C2C_EVENT_CYCLES] = C2C_CYCLE_CNTR,
+ [C2C_EVENT_IN_RD_CUM_OUTS] = C2C_IN_RD_CUM_OUTS_CNTR,
+ [C2C_EVENT_IN_RD_REQ] = C2C_IN_RD_REQ_CNTR,
+ [C2C_EVENT_IN_WR_CUM_OUTS] = C2C_IN_WR_CUM_OUTS_CNTR,
+ [C2C_EVENT_IN_WR_REQ] = C2C_IN_WR_REQ_CNTR,
+ [C2C_EVENT_OUT_RD_CUM_OUTS] = C2C_OUT_RD_CUM_OUTS_CNTR,
+ [C2C_EVENT_OUT_RD_REQ] = C2C_OUT_RD_REQ_CNTR,
+ [C2C_EVENT_OUT_WR_CUM_OUTS] = C2C_OUT_WR_CUM_OUTS_CNTR,
+ [C2C_EVENT_OUT_WR_REQ] = C2C_OUT_WR_REQ_CNTR,
+};
+
+static u64 nv_c2c_pmu_read_counter(struct perf_event *event)
+{
+ u32 ctr_id, ctr_offset, filter_mask, filter_idx, inst_idx;
+ unsigned long *inst_mask;
+ DECLARE_BITMAP(filter_bitmap, C2C_NR_PEER_MAX);
+ struct nv_c2c_pmu *c2c_pmu = to_c2c_pmu(event->pmu);
+ u64 val = 0;
+
+ filter_mask = get_filter_mask(event);
+ bitmap_from_arr32(filter_bitmap, &filter_mask, c2c_pmu->nr_peer);
+
+ ctr_id = event->hw.config;
+ ctr_offset = nv_c2c_ctr_offset[ctr_id];
+
+ for_each_set_bit(filter_idx, filter_bitmap, c2c_pmu->nr_peer) {
+ inst_mask = c2c_pmu->peer_insts[filter_idx];
+ for_each_set_bit(inst_idx, inst_mask, c2c_pmu->nr_inst) {
+ nv_c2c_pmu_check_status(c2c_pmu, inst_idx);
+
+			/*
+			 * All instances share the same clock and the driver
+			 * always enables every instance, so the cycle count
+			 * can be read from any one of them.
+			 */
+ if (ctr_id == C2C_EVENT_CYCLES)
+ return read_reg64_hilohi(
+ c2c_pmu->base[inst_idx] + ctr_offset,
+ HILOHI_MAX_POLL);
+
+ /*
+ * For other events, sum up the counts from all instances.
+ */
+ val += read_reg64_hilohi(
+ c2c_pmu->base[inst_idx] + ctr_offset,
+ HILOHI_MAX_POLL);
+ }
+ }
+
+ return val;
+}
+
+static void nv_c2c_pmu_event_update(struct perf_event *event)
+{
+ struct hw_perf_event *hwc = &event->hw;
+ u64 prev, now;
+
+ do {
+ prev = local64_read(&hwc->prev_count);
+ now = nv_c2c_pmu_read_counter(event);
+ } while (local64_cmpxchg(&hwc->prev_count, prev, now) != prev);
+
+ local64_add(now - prev, &event->count);
+}
+
+static void nv_c2c_pmu_start(struct perf_event *event, int pmu_flags)
+{
+ event->hw.state = 0;
+}
+
+static void nv_c2c_pmu_stop(struct perf_event *event, int pmu_flags)
+{
+ event->hw.state |= PERF_HES_STOPPED;
+}
+
+static int nv_c2c_pmu_add(struct perf_event *event, int flags)
+{
+ struct nv_c2c_pmu *c2c_pmu = to_c2c_pmu(event->pmu);
+ struct nv_c2c_pmu_hw_events *hw_events = &c2c_pmu->hw_events;
+ struct hw_perf_event *hwc = &event->hw;
+ int idx;
+
+ if (WARN_ON_ONCE(!cpumask_test_cpu(smp_processor_id(),
+ &c2c_pmu->associated_cpus)))
+ return -ENOENT;
+
+ idx = nv_c2c_pmu_get_event_idx(hw_events, event);
+ if (idx < 0)
+ return idx;
+
+ hw_events->events[idx] = event;
+ hwc->idx = idx;
+ hwc->state = PERF_HES_STOPPED | PERF_HES_UPTODATE;
+
+ if (flags & PERF_EF_START)
+ nv_c2c_pmu_start(event, PERF_EF_RELOAD);
+
+ /* Propagate changes to the userspace mapping. */
+ perf_event_update_userpage(event);
+
+ return 0;
+}
+
+static void nv_c2c_pmu_del(struct perf_event *event, int flags)
+{
+ struct nv_c2c_pmu *c2c_pmu = to_c2c_pmu(event->pmu);
+ struct nv_c2c_pmu_hw_events *hw_events = &c2c_pmu->hw_events;
+ struct hw_perf_event *hwc = &event->hw;
+ int idx = hwc->idx;
+
+ nv_c2c_pmu_stop(event, PERF_EF_UPDATE);
+
+ hw_events->events[idx] = NULL;
+
+ clear_bit(idx, hw_events->used_ctrs);
+
+ perf_event_update_userpage(event);
+}
+
+static void nv_c2c_pmu_read(struct perf_event *event)
+{
+ nv_c2c_pmu_event_update(event);
+}
+
+static void nv_c2c_pmu_enable(struct pmu *pmu)
+{
+ void __iomem *bcast;
+ struct nv_c2c_pmu *c2c_pmu = to_c2c_pmu(pmu);
+
+	/* Nothing to do if no event is active. */
+ if (bitmap_empty(c2c_pmu->hw_events.used_ctrs, C2C_MAX_ACTIVE_EVENTS))
+ return;
+
+ /* Enable all the counters. */
+ bcast = c2c_pmu->base_broadcast;
+ writel(0x1UL, bcast + C2C_CTRL);
+}
+
+static void nv_c2c_pmu_disable(struct pmu *pmu)
+{
+ unsigned int idx;
+ void __iomem *bcast;
+ struct perf_event *event;
+ struct nv_c2c_pmu *c2c_pmu = to_c2c_pmu(pmu);
+
+ /* Disable all the counters. */
+ bcast = c2c_pmu->base_broadcast;
+ writel(0x0UL, bcast + C2C_CTRL);
+
+ /*
+ * The counters will start from 0 again on restart.
+ * Update the events immediately to avoid losing the counts.
+ */
+ for_each_set_bit(idx, c2c_pmu->hw_events.used_ctrs,
+ C2C_MAX_ACTIVE_EVENTS) {
+ event = c2c_pmu->hw_events.events[idx];
+
+ if (!event)
+ continue;
+
+ nv_c2c_pmu_event_update(event);
+
+ local64_set(&event->hw.prev_count, 0ULL);
+ }
+}
+
+/* PMU identifier attribute. */
+
+static ssize_t nv_c2c_pmu_identifier_show(struct device *dev,
+ struct device_attribute *attr,
+ char *page)
+{
+ struct nv_c2c_pmu *c2c_pmu = to_c2c_pmu(dev_get_drvdata(dev));
+
+ return sysfs_emit(page, "%s\n", c2c_pmu->identifier);
+}
+
+static struct device_attribute nv_c2c_pmu_identifier_attr =
+ __ATTR(identifier, 0444, nv_c2c_pmu_identifier_show, NULL);
+
+static struct attribute *nv_c2c_pmu_identifier_attrs[] = {
+ &nv_c2c_pmu_identifier_attr.attr,
+ NULL,
+};
+
+static struct attribute_group nv_c2c_pmu_identifier_attr_group = {
+ .attrs = nv_c2c_pmu_identifier_attrs,
+};
+
+/* Peer attribute. */
+
+static ssize_t nv_c2c_pmu_peer_show(struct device *dev,
+ struct device_attribute *attr,
+ char *page)
+{
+ const char *peer_type[C2C_PEER_TYPE_COUNT] = {
+ [C2C_PEER_TYPE_CPU] = "cpu",
+ [C2C_PEER_TYPE_GPU] = "gpu",
+ [C2C_PEER_TYPE_CXLMEM] = "cxlmem",
+ };
+
+ struct nv_c2c_pmu *c2c_pmu = to_c2c_pmu(dev_get_drvdata(dev));
+ return sysfs_emit(page, "nr_%s=%u\n", peer_type[c2c_pmu->peer_type],
+ c2c_pmu->nr_peer);
+}
+
+static struct device_attribute nv_c2c_pmu_peer_attr =
+ __ATTR(peer, 0444, nv_c2c_pmu_peer_show, NULL);
+
+static struct attribute *nv_c2c_pmu_peer_attrs[] = {
+ &nv_c2c_pmu_peer_attr.attr,
+ NULL,
+};
+
+static struct attribute_group nv_c2c_pmu_peer_attr_group = {
+ .attrs = nv_c2c_pmu_peer_attrs,
+};
+
+/* Format attributes. */
+
+#define NV_C2C_PMU_EXT_ATTR(_name, _func, _config) \
+ (&((struct dev_ext_attribute[]){ \
+ { \
+ .attr = __ATTR(_name, 0444, _func, NULL), \
+ .var = (void *)_config \
+ } \
+ })[0].attr.attr)
+
+#define NV_C2C_PMU_FORMAT_ATTR(_name, _config) \
+ NV_C2C_PMU_EXT_ATTR(_name, device_show_string, _config)
+
+#define NV_C2C_PMU_FORMAT_EVENT_ATTR \
+ NV_C2C_PMU_FORMAT_ATTR(event, "config:0-3")
+
+static struct attribute *nv_c2c_nvlink_pmu_formats[] = {
+ NV_C2C_PMU_FORMAT_EVENT_ATTR,
+ NV_C2C_PMU_FORMAT_ATTR(gpu_mask, "config1:0-1"),
+ NULL,
+};
+
+static struct attribute *nv_c2c_pmu_formats[] = {
+ NV_C2C_PMU_FORMAT_EVENT_ATTR,
+ NULL,
+};
+
+static struct attribute_group *
+nv_c2c_pmu_alloc_format_attr_group(struct nv_c2c_pmu *c2c_pmu)
+{
+ struct attribute_group *format_group;
+ struct device *dev = c2c_pmu->dev;
+
+ format_group =
+ devm_kzalloc(dev, sizeof(struct attribute_group), GFP_KERNEL);
+ if (!format_group)
+ return NULL;
+
+ format_group->name = "format";
+ format_group->attrs = c2c_pmu->formats;
+
+ return format_group;
+}
+
+/* Event attributes. */
+
+static ssize_t nv_c2c_pmu_sysfs_event_show(struct device *dev,
+ struct device_attribute *attr,
+ char *buf)
+{
+ struct perf_pmu_events_attr *pmu_attr;
+
+ pmu_attr = container_of(attr, typeof(*pmu_attr), attr);
+ return sysfs_emit(buf, "event=0x%llx\n", pmu_attr->id);
+}
+
+#define NV_C2C_PMU_EVENT_ATTR(_name, _config) \
+ PMU_EVENT_ATTR_ID(_name, nv_c2c_pmu_sysfs_event_show, _config)
+
+static struct attribute *nv_c2c_pmu_events[] = {
+ NV_C2C_PMU_EVENT_ATTR(cycles, C2C_EVENT_CYCLES),
+ NV_C2C_PMU_EVENT_ATTR(in_rd_cum_outs, C2C_EVENT_IN_RD_CUM_OUTS),
+ NV_C2C_PMU_EVENT_ATTR(in_rd_req, C2C_EVENT_IN_RD_REQ),
+ NV_C2C_PMU_EVENT_ATTR(in_wr_cum_outs, C2C_EVENT_IN_WR_CUM_OUTS),
+ NV_C2C_PMU_EVENT_ATTR(in_wr_req, C2C_EVENT_IN_WR_REQ),
+ NV_C2C_PMU_EVENT_ATTR(out_rd_cum_outs, C2C_EVENT_OUT_RD_CUM_OUTS),
+ NV_C2C_PMU_EVENT_ATTR(out_rd_req, C2C_EVENT_OUT_RD_REQ),
+ NV_C2C_PMU_EVENT_ATTR(out_wr_cum_outs, C2C_EVENT_OUT_WR_CUM_OUTS),
+ NV_C2C_PMU_EVENT_ATTR(out_wr_req, C2C_EVENT_OUT_WR_REQ),
+ NULL
+};
+
+static umode_t
+nv_c2c_pmu_event_attr_is_visible(struct kobject *kobj, struct attribute *attr,
+ int unused)
+{
+ struct device *dev = kobj_to_dev(kobj);
+ struct nv_c2c_pmu *c2c_pmu = to_c2c_pmu(dev_get_drvdata(dev));
+ struct perf_pmu_events_attr *eattr;
+
+ eattr = container_of(attr, typeof(*eattr), attr.attr);
+
+ if (c2c_pmu->c2c_type == C2C_TYPE_NVDLINK) {
+ /* Only incoming reads are available. */
+ switch (eattr->id) {
+ case C2C_EVENT_IN_WR_CUM_OUTS:
+ case C2C_EVENT_IN_WR_REQ:
+ case C2C_EVENT_OUT_RD_CUM_OUTS:
+ case C2C_EVENT_OUT_RD_REQ:
+ case C2C_EVENT_OUT_WR_CUM_OUTS:
+ case C2C_EVENT_OUT_WR_REQ:
+ return 0;
+ default:
+ return attr->mode;
+ }
+ } else {
+ /* Hide the write events if C2C connected to another SoC. */
+ if (c2c_pmu->peer_type == C2C_PEER_TYPE_CPU) {
+ switch (eattr->id) {
+ case C2C_EVENT_IN_WR_CUM_OUTS:
+ case C2C_EVENT_IN_WR_REQ:
+ case C2C_EVENT_OUT_WR_CUM_OUTS:
+ case C2C_EVENT_OUT_WR_REQ:
+ return 0;
+ default:
+ return attr->mode;
+ }
+ }
+ }
+
+ return attr->mode;
+}
+
+static const struct attribute_group nv_c2c_pmu_events_group = {
+ .name = "events",
+ .attrs = nv_c2c_pmu_events,
+ .is_visible = nv_c2c_pmu_event_attr_is_visible,
+};
+
+/* Cpumask attributes. */
+
+static ssize_t nv_c2c_pmu_cpumask_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct pmu *pmu = dev_get_drvdata(dev);
+ struct nv_c2c_pmu *c2c_pmu = to_c2c_pmu(pmu);
+ struct dev_ext_attribute *eattr =
+ container_of(attr, struct dev_ext_attribute, attr);
+ unsigned long mask_id = (unsigned long)eattr->var;
+ const cpumask_t *cpumask;
+
+ switch (mask_id) {
+ case C2C_ACTIVE_CPU_MASK:
+ cpumask = &c2c_pmu->active_cpu;
+ break;
+ case C2C_ASSOCIATED_CPU_MASK:
+ cpumask = &c2c_pmu->associated_cpus;
+ break;
+ default:
+ return 0;
+ }
+ return cpumap_print_to_pagebuf(true, buf, cpumask);
+}
+
+#define NV_C2C_PMU_CPUMASK_ATTR(_name, _config) \
+ NV_C2C_PMU_EXT_ATTR(_name, nv_c2c_pmu_cpumask_show, \
+ (unsigned long)_config)
+
+static struct attribute *nv_c2c_pmu_cpumask_attrs[] = {
+ NV_C2C_PMU_CPUMASK_ATTR(cpumask, C2C_ACTIVE_CPU_MASK),
+ NV_C2C_PMU_CPUMASK_ATTR(associated_cpus, C2C_ASSOCIATED_CPU_MASK),
+ NULL,
+};
+
+static const struct attribute_group nv_c2c_pmu_cpumask_attr_group = {
+ .attrs = nv_c2c_pmu_cpumask_attrs,
+};
+
+/* Per PMU device attribute groups. */
+
+static int nv_c2c_pmu_alloc_attr_groups(struct nv_c2c_pmu *c2c_pmu)
+{
+ const struct attribute_group **attr_groups = c2c_pmu->attr_groups;
+
+ attr_groups[0] = nv_c2c_pmu_alloc_format_attr_group(c2c_pmu);
+ attr_groups[1] = &nv_c2c_pmu_events_group;
+ attr_groups[2] = &nv_c2c_pmu_cpumask_attr_group;
+ attr_groups[3] = &nv_c2c_pmu_identifier_attr_group;
+ attr_groups[4] = &nv_c2c_pmu_peer_attr_group;
+
+ if (!attr_groups[0])
+ return -ENOMEM;
+
+ return 0;
+}
+
+static int nv_c2c_pmu_online_cpu(unsigned int cpu, struct hlist_node *node)
+{
+ struct nv_c2c_pmu *c2c_pmu =
+ hlist_entry_safe(node, struct nv_c2c_pmu, cpuhp_node);
+
+ if (!cpumask_test_cpu(cpu, &c2c_pmu->associated_cpus))
+ return 0;
+
+ /* If the PMU is already managed, there is nothing to do */
+ if (!cpumask_empty(&c2c_pmu->active_cpu))
+ return 0;
+
+ /* Use this CPU for event counting */
+ cpumask_set_cpu(cpu, &c2c_pmu->active_cpu);
+
+ return 0;
+}
+
+static int nv_c2c_pmu_cpu_teardown(unsigned int cpu, struct hlist_node *node)
+{
+ unsigned int dst;
+
+ struct nv_c2c_pmu *c2c_pmu =
+ hlist_entry_safe(node, struct nv_c2c_pmu, cpuhp_node);
+
+ /* Nothing to do if this CPU doesn't own the PMU */
+ if (!cpumask_test_and_clear_cpu(cpu, &c2c_pmu->active_cpu))
+ return 0;
+
+ /* Choose a new CPU to migrate ownership of the PMU to */
+ dst = cpumask_any_and_but(&c2c_pmu->associated_cpus,
+ cpu_online_mask, cpu);
+ if (dst >= nr_cpu_ids)
+ return 0;
+
+ /* Use this CPU for event counting */
+ perf_pmu_migrate_context(&c2c_pmu->pmu, cpu, dst);
+ cpumask_set_cpu(dst, &c2c_pmu->active_cpu);
+
+ return 0;
+}
+
+static int nv_c2c_pmu_get_cpus(struct nv_c2c_pmu *c2c_pmu)
+{
+ int ret = 0, socket = c2c_pmu->socket, cpu;
+
+ for_each_possible_cpu(cpu) {
+ if (cpu_to_node(cpu) == socket)
+ cpumask_set_cpu(cpu, &c2c_pmu->associated_cpus);
+ }
+
+ if (cpumask_empty(&c2c_pmu->associated_cpus)) {
+ dev_dbg(c2c_pmu->dev,
+ "No cpu associated with C2C PMU socket-%u\n", socket);
+ ret = -ENODEV;
+ }
+
+ return ret;
+}
+
+static int nv_c2c_pmu_init_socket(struct nv_c2c_pmu *c2c_pmu)
+{
+ const char *uid_str;
+	u32 socket;
+	int ret;
+
+ uid_str = acpi_device_uid(c2c_pmu->acpi_dev);
+ if (!uid_str) {
+ ret = -ENODEV;
+ goto fail;
+ }
+
+ ret = kstrtou32(uid_str, 0, &socket);
+ if (ret)
+ goto fail;
+
+ c2c_pmu->socket = socket;
+ return 0;
+
+fail:
+ dev_err(c2c_pmu->dev, "Failed to initialize socket\n");
+ return ret;
+}
+
+static int nv_c2c_pmu_init_id(struct nv_c2c_pmu *c2c_pmu)
+{
+ const char *name_fmt[C2C_TYPE_COUNT] = {
+ [C2C_TYPE_NVLINK] = "nvidia_nvlink_c2c_pmu_%u",
+ [C2C_TYPE_NVCLINK] = "nvidia_nvclink_pmu_%u",
+ [C2C_TYPE_NVDLINK] = "nvidia_nvdlink_pmu_%u",
+ };
+
+ char *name;
+ int ret;
+
+ name = devm_kasprintf(c2c_pmu->dev, GFP_KERNEL,
+ name_fmt[c2c_pmu->c2c_type], c2c_pmu->socket);
+ if (!name) {
+ ret = -ENOMEM;
+ goto fail;
+ }
+
+ c2c_pmu->name = name;
+
+ c2c_pmu->identifier = acpi_device_hid(c2c_pmu->acpi_dev);
+
+ return 0;
+
+fail:
+ dev_err(c2c_pmu->dev, "Failed to initialize name\n");
+ return ret;
+}
+
+static int nv_c2c_pmu_init_filter(struct nv_c2c_pmu *c2c_pmu)
+{
+ u32 cpu_en = 0;
+ struct device *dev = c2c_pmu->dev;
+
+ if (c2c_pmu->c2c_type == C2C_TYPE_NVDLINK) {
+ c2c_pmu->peer_type = C2C_PEER_TYPE_CXLMEM;
+
+ c2c_pmu->nr_inst = C2C_NR_INST_NVDLINK;
+ c2c_pmu->peer_insts[0][0] = (1UL << c2c_pmu->nr_inst) - 1;
+
+ c2c_pmu->nr_peer = C2C_NR_PEER_CXLMEM;
+ c2c_pmu->filter_default = (1 << c2c_pmu->nr_peer) - 1;
+
+ c2c_pmu->formats = nv_c2c_pmu_formats;
+
+ return 0;
+ }
+
+ c2c_pmu->nr_inst = (c2c_pmu->c2c_type == C2C_TYPE_NVLINK) ?
+ C2C_NR_INST_NVLINK : C2C_NR_INST_NVCLINK;
+
+ if (device_property_read_u32(dev, "cpu_en_mask", &cpu_en))
+ dev_dbg(dev, "no cpu_en_mask property\n");
+
+ if (cpu_en) {
+ c2c_pmu->peer_type = C2C_PEER_TYPE_CPU;
+
+ /* Fill peer_insts bitmap with instances connected to peer CPU. */
+ bitmap_from_arr32(c2c_pmu->peer_insts[0], &cpu_en,
+ c2c_pmu->nr_inst);
+
+ c2c_pmu->nr_peer = 1;
+ c2c_pmu->formats = nv_c2c_pmu_formats;
+ } else {
+ u32 i;
+ const char *props[C2C_NR_PEER_MAX] = {
+ "gpu0_en_mask", "gpu1_en_mask"
+ };
+
+ for (i = 0; i < C2C_NR_PEER_MAX; i++) {
+ u32 gpu_en = 0;
+
+ if (device_property_read_u32(dev, props[i], &gpu_en))
+ dev_dbg(dev, "no %s property\n", props[i]);
+
+ if (gpu_en) {
+ /* Fill peer_insts bitmap with instances connected to peer GPU. */
+ bitmap_from_arr32(c2c_pmu->peer_insts[i], &gpu_en,
+ c2c_pmu->nr_inst);
+
+ c2c_pmu->nr_peer++;
+ }
+ }
+
+ if (c2c_pmu->nr_peer == 0) {
+ dev_err(dev, "No GPU is enabled\n");
+ return -EINVAL;
+ }
+
+ c2c_pmu->peer_type = C2C_PEER_TYPE_GPU;
+ c2c_pmu->formats = nv_c2c_nvlink_pmu_formats;
+ }
+
+ c2c_pmu->filter_default = (1 << c2c_pmu->nr_peer) - 1;
+
+ return 0;
+}
+
+static void *nv_c2c_pmu_init_pmu(struct platform_device *pdev)
+{
+ int ret;
+ struct nv_c2c_pmu *c2c_pmu;
+ struct acpi_device *acpi_dev;
+ struct device *dev = &pdev->dev;
+
+ acpi_dev = ACPI_COMPANION(dev);
+ if (!acpi_dev)
+ return ERR_PTR(-ENODEV);
+
+ c2c_pmu = devm_kzalloc(dev, sizeof(*c2c_pmu), GFP_KERNEL);
+ if (!c2c_pmu)
+ return ERR_PTR(-ENOMEM);
+
+ c2c_pmu->dev = dev;
+ c2c_pmu->acpi_dev = acpi_dev;
+ c2c_pmu->c2c_type = (unsigned int)(unsigned long)device_get_match_data(dev);
+ platform_set_drvdata(pdev, c2c_pmu);
+
+ ret = nv_c2c_pmu_init_socket(c2c_pmu);
+ if (ret)
+ goto done;
+
+ ret = nv_c2c_pmu_init_id(c2c_pmu);
+ if (ret)
+ goto done;
+
+ ret = nv_c2c_pmu_init_filter(c2c_pmu);
+ if (ret)
+ goto done;
+
+done:
+ if (ret)
+ return ERR_PTR(ret);
+
+ return c2c_pmu;
+}
+
+static int nv_c2c_pmu_init_mmio(struct nv_c2c_pmu *c2c_pmu)
+{
+ int i;
+ struct device *dev = c2c_pmu->dev;
+ struct platform_device *pdev = to_platform_device(dev);
+
+ /* Map the address of all the instances. */
+ for (i = 0; i < c2c_pmu->nr_inst; i++) {
+ c2c_pmu->base[i] = devm_platform_ioremap_resource(pdev, i);
+ if (IS_ERR(c2c_pmu->base[i])) {
+ dev_err(dev, "Failed map address for instance %d\n", i);
+ return PTR_ERR(c2c_pmu->base[i]);
+ }
+ }
+
+ /* Map broadcast address. */
+ c2c_pmu->base_broadcast = devm_platform_ioremap_resource(pdev,
+ c2c_pmu->nr_inst);
+ if (IS_ERR(c2c_pmu->base_broadcast)) {
+ dev_err(dev, "Failed map broadcast address\n");
+ return PTR_ERR(c2c_pmu->base_broadcast);
+ }
+
+ return 0;
+}
+
+static int nv_c2c_pmu_register_pmu(struct nv_c2c_pmu *c2c_pmu)
+{
+ int ret;
+
+ ret = cpuhp_state_add_instance(nv_c2c_pmu_cpuhp_state,
+ &c2c_pmu->cpuhp_node);
+ if (ret) {
+ dev_err(c2c_pmu->dev, "Error %d registering hotplug\n", ret);
+ return ret;
+ }
+
+ c2c_pmu->pmu = (struct pmu) {
+ .parent = c2c_pmu->dev,
+ .task_ctx_nr = perf_invalid_context,
+ .pmu_enable = nv_c2c_pmu_enable,
+ .pmu_disable = nv_c2c_pmu_disable,
+ .event_init = nv_c2c_pmu_event_init,
+ .add = nv_c2c_pmu_add,
+ .del = nv_c2c_pmu_del,
+ .start = nv_c2c_pmu_start,
+ .stop = nv_c2c_pmu_stop,
+ .read = nv_c2c_pmu_read,
+ .attr_groups = c2c_pmu->attr_groups,
+ .capabilities = PERF_PMU_CAP_NO_EXCLUDE |
+ PERF_PMU_CAP_NO_INTERRUPT,
+ };
+
+ ret = perf_pmu_register(&c2c_pmu->pmu, c2c_pmu->name, -1);
+ if (ret) {
+ dev_err(c2c_pmu->dev, "Failed to register C2C PMU: %d\n", ret);
+ cpuhp_state_remove_instance(nv_c2c_pmu_cpuhp_state,
+ &c2c_pmu->cpuhp_node);
+ return ret;
+ }
+
+ return 0;
+}
+
+static int nv_c2c_pmu_probe(struct platform_device *pdev)
+{
+ int ret;
+ struct nv_c2c_pmu *c2c_pmu;
+
+ c2c_pmu = nv_c2c_pmu_init_pmu(pdev);
+ if (IS_ERR(c2c_pmu))
+ return PTR_ERR(c2c_pmu);
+
+ ret = nv_c2c_pmu_init_mmio(c2c_pmu);
+ if (ret)
+ return ret;
+
+ ret = nv_c2c_pmu_get_cpus(c2c_pmu);
+ if (ret)
+ return ret;
+
+ ret = nv_c2c_pmu_alloc_attr_groups(c2c_pmu);
+ if (ret)
+ return ret;
+
+ ret = nv_c2c_pmu_register_pmu(c2c_pmu);
+ if (ret)
+ return ret;
+
+ dev_dbg(c2c_pmu->dev, "Registered %s PMU\n", c2c_pmu->name);
+
+ return 0;
+}
+
+static void nv_c2c_pmu_device_remove(struct platform_device *pdev)
+{
+ struct nv_c2c_pmu *c2c_pmu = platform_get_drvdata(pdev);
+
+ perf_pmu_unregister(&c2c_pmu->pmu);
+ cpuhp_state_remove_instance(nv_c2c_pmu_cpuhp_state, &c2c_pmu->cpuhp_node);
+}
+
+static const struct acpi_device_id nv_c2c_pmu_acpi_match[] = {
+ { "NVDA2023", (kernel_ulong_t)C2C_TYPE_NVLINK },
+ { "NVDA2022", (kernel_ulong_t)C2C_TYPE_NVCLINK },
+ { "NVDA2020", (kernel_ulong_t)C2C_TYPE_NVDLINK },
+ { }
+};
+MODULE_DEVICE_TABLE(acpi, nv_c2c_pmu_acpi_match);
+
+static struct platform_driver nv_c2c_pmu_driver = {
+ .driver = {
+ .name = "nvidia-t410-c2c-pmu",
+ .acpi_match_table = ACPI_PTR(nv_c2c_pmu_acpi_match),
+ .suppress_bind_attrs = true,
+ },
+ .probe = nv_c2c_pmu_probe,
+ .remove = nv_c2c_pmu_device_remove,
+};
+
+static int __init nv_c2c_pmu_init(void)
+{
+ int ret;
+
+ ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
+ "perf/nvidia/c2c:online",
+ nv_c2c_pmu_online_cpu,
+ nv_c2c_pmu_cpu_teardown);
+ if (ret < 0)
+ return ret;
+
+ nv_c2c_pmu_cpuhp_state = ret;
+ return platform_driver_register(&nv_c2c_pmu_driver);
+}
+
+static void __exit nv_c2c_pmu_exit(void)
+{
+ platform_driver_unregister(&nv_c2c_pmu_driver);
+ cpuhp_remove_multi_state(nv_c2c_pmu_cpuhp_state);
+}
+
+module_init(nv_c2c_pmu_init);
+module_exit(nv_c2c_pmu_exit);
+
+MODULE_LICENSE("GPL");
+MODULE_DESCRIPTION("NVIDIA Tegra410 C2C PMU driver");
+MODULE_AUTHOR("Besar Wicaksono <bwicaksono@nvidia.com>");
--
2.43.0
* Re: [PATCH v2 7/8] perf: add NVIDIA Tegra410 C2C PMU
2026-02-18 14:58 ` [PATCH v2 7/8] perf: add NVIDIA Tegra410 C2C PMU Besar Wicaksono
@ 2026-02-19 10:55 ` Jonathan Cameron
0 siblings, 0 replies; 18+ messages in thread
From: Jonathan Cameron @ 2026-02-19 10:55 UTC (permalink / raw)
To: Besar Wicaksono
Cc: will, suzuki.poulose, robin.murphy, ilkka, linux-arm-kernel,
linux-kernel, linux-tegra, mark.rutland, treding, jonathanh,
vsethi, rwiley, sdonthineni, skelley, ywan, mochs, nirmoyd
On Wed, 18 Feb 2026 14:58:08 +0000
Besar Wicaksono <bwicaksono@nvidia.com> wrote:
> Add NVIDIA C2C PMU support for the Tegra410 SoC. This PMU is
> used to measure memory latency between the SoC and device
> memory, e.g. GPU memory (GMEM), CXL memory, or memory on a
> remote Tegra410 SoC.
>
> Reviewed-by: Ilkka Koskinen <ilkka@os.amperecomputing.com>
> Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
Hi Besar.
Another nice looking driver.
My main comment here is around attribute management. In general I'd avoid doing
dynamic assignment of attribute arrays when there are only a couple of options.
Much cleaner and easier to read in those cases if you just have multiple
static const arrays to pick between.
That's made easier if you move to a struct for each of the supported types
(and get rid of the enum currently used as that makes it too easy to end up
with a mix of data in the struct vs code using the enum).
Jonathan
> diff --git a/drivers/perf/nvidia_t410_c2c_pmu.c b/drivers/perf/nvidia_t410_c2c_pmu.c
> new file mode 100644
> index 000000000000..a3891c94dcde
> --- /dev/null
> +++ b/drivers/perf/nvidia_t410_c2c_pmu.c
> @@ -0,0 +1,1062 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * NVIDIA Tegra410 C2C PMU driver.
> + *
> + * Copyright (c) 2026, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
> + */
> +
> +#include <linux/acpi.h>
> +#include <linux/bitops.h>
> +#include <linux/cpumask.h>
> +#include <linux/device.h>
> +#include <linux/interrupt.h>
> +#include <linux/io.h>
> +#include <linux/module.h>
> +#include <linux/perf_event.h>
> +#include <linux/platform_device.h>
> +#include <linux/property.h>
> +
> +struct nv_c2c_pmu {
> + struct pmu pmu;
> + struct device *dev;
> + struct acpi_device *acpi_dev;
> +
> + const char *name;
> + const char *identifier;
> +
> + unsigned int c2c_type;
> + unsigned int peer_type;
> + unsigned int socket;
> + unsigned int nr_inst;
> + unsigned int nr_peer;
> + unsigned long peer_insts[C2C_NR_PEER_MAX][BITS_TO_LONGS(C2C_NR_INST_MAX)];
Pity there isn't a DECLARE_BITMAP_ARRAY()
macro. I guess this isn't that common.
> + u32 filter_default;
> +
> + struct nv_c2c_pmu_hw_events hw_events;
> +
> + cpumask_t associated_cpus;
> + cpumask_t active_cpu;
> +
> + struct hlist_node cpuhp_node;
> +
> + struct attribute **formats;
> + const struct attribute_group *attr_groups[6];
As below. I'd push this into a type specific structure to remove all the dynamic
code to fill it in.
> +
> + void __iomem *base_broadcast;
Ah. Good. This matches what I suggested for previous.
> + void __iomem *base[C2C_NR_INST_MAX];
> +};
> +/*
> + * Read 64-bit register as a pair of 32-bit registers using hi-lo-hi sequence.
> + */
> +static u64 read_reg64_hilohi(const void __iomem *addr, u32 max_poll_count)
> +{
> + u32 val_lo, val_hi;
> + u64 val;
> +
> + /* Use high-low-high sequence to avoid tearing */
> + do {
> + if (max_poll_count-- == 0) {
> + pr_err("NV C2C PMU: timeout hi-low-high sequence\n");
> + return 0;
> + }
> +
> + val_hi = readl(addr + 4);
> + val_lo = readl(addr);
> + } while (val_hi != readl(addr + 4));
I wonder if it's worth adding a non-tearing variant to include/linux/io-64-nonatomic-hi-lo.h.
I feel like I see this open coded often enough that it might be nice to replace it
once and for all with a generic version.
The implementation would be pretty much what you have here.
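For reference, a rough userspace sketch of what such a generic helper could look like (the name hi_lo_hi_readq and the readl() stand-in are hypothetical, purely to illustrate the retry loop; it assumes a little-endian host with the low word at offset 0):

```c
#include <stdint.h>

/* Userspace stand-in for readl(): a plain 32-bit read. */
static uint32_t readl(const volatile void *addr)
{
	return *(const volatile uint32_t *)addr;
}

/*
 * Re-read the high word until it is stable, so a concurrent carry
 * from the low word into the high word is never observed torn.
 */
static uint64_t hi_lo_hi_readq(const volatile void *addr)
{
	const volatile uint32_t *p = addr;
	uint32_t hi, lo;

	do {
		hi = readl(p + 1);
		lo = readl(p);
	} while (hi != readl(p + 1));

	return ((uint64_t)hi << 32) | lo;
}
```

In the driver this would be called exactly where read_reg64_hilohi() is today, with the timeout policy layered on top if still wanted.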
> +
> + val = (((u64)val_hi << 32) | val_lo);
> +
> + return val;
> +}
> +
> +static umode_t
> +nv_c2c_pmu_event_attr_is_visible(struct kobject *kobj, struct attribute *attr,
> + int unused)
> +{
> + struct device *dev = kobj_to_dev(kobj);
> + struct nv_c2c_pmu *c2c_pmu = to_c2c_pmu(dev_get_drvdata(dev));
> + struct perf_pmu_events_attr *eattr;
> +
> + eattr = container_of(attr, typeof(*eattr), attr.attr);
> +
> + if (c2c_pmu->c2c_type == C2C_TYPE_NVDLINK) {
> + /* Only incoming reads are available. */
> + switch (eattr->id) {
> + case C2C_EVENT_IN_WR_CUM_OUTS:
> + case C2C_EVENT_IN_WR_REQ:
> + case C2C_EVENT_OUT_RD_CUM_OUTS:
> + case C2C_EVENT_OUT_RD_REQ:
> + case C2C_EVENT_OUT_WR_CUM_OUTS:
> + case C2C_EVENT_OUT_WR_REQ:
> + return 0;
> + default:
> + return attr->mode;
Given suggestion below to use separate attribute_groups[] for each
of the 3 types, I'd do separate event attribute groups for each as well.
That will cover this case as const data.
> + }
> + } else {
> + /* Hide the write events if C2C connected to another SoC. */
> + if (c2c_pmu->peer_type == C2C_PEER_TYPE_CPU) {
And only the two types where this is relevant will use the is_visible.
> + switch (eattr->id) {
> + case C2C_EVENT_IN_WR_CUM_OUTS:
> + case C2C_EVENT_IN_WR_REQ:
> + case C2C_EVENT_OUT_WR_CUM_OUTS:
> + case C2C_EVENT_OUT_WR_REQ:
> + return 0;
> + default:
> + return attr->mode;
> + }
> + }
> + }
> +
> + return attr->mode;
> +}
> +
> +static const struct attribute_group nv_c2c_pmu_events_group = {
> + .name = "events",
> + .attrs = nv_c2c_pmu_events,
> + .is_visible = nv_c2c_pmu_event_attr_is_visible,
> +};
> +
> +/* Per PMU device attribute groups. */
> +
> +static int nv_c2c_pmu_alloc_attr_groups(struct nv_c2c_pmu *c2c_pmu)
> +{
> + const struct attribute_group **attr_groups = c2c_pmu->attr_groups;
> +
> + attr_groups[0] = nv_c2c_pmu_alloc_format_attr_group(c2c_pmu);
> + attr_groups[1] = &nv_c2c_pmu_events_group;
> + attr_groups[2] = &nv_c2c_pmu_cpumask_attr_group;
> + attr_groups[3] = &nv_c2c_pmu_identifier_attr_group;
> + attr_groups[4] = &nv_c2c_pmu_peer_attr_group;
> +
> + if (!attr_groups[0])
> + return -ENOMEM;
This seems unnecessary code complexity to avoid having a couple of structures
that duplicate some elements. If you have a choice between sets of static
const data and more code, go the data route (as long as it isn't a huge
amount more data!)
So have a
const struct attribute_group *xxxx[] = {
};
for each of the 3 types.
Alternative is put all the format group attributes in one group and then use
is_visible() to select what is relevant to each type, but I think that will
be more complex than just replicating the array 3 times.
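As a concrete sketch of the static-data shape (the struct here is a minimal userspace stand-in for the kernel's struct attribute_group, and all the group names are hypothetical):

```c
#include <stddef.h>

/* Minimal stand-in for the kernel's struct attribute_group. */
struct attribute_group {
	const char *name;
};

static const struct attribute_group format_nvlink_group = { .name = "format" };
static const struct attribute_group events_group = { .name = "events" };
static const struct attribute_group cpumask_group = { .name = "cpumask" };

/*
 * One static const NULL-terminated list per C2C type, selected once at
 * probe time instead of filling a per-instance array dynamically.
 */
static const struct attribute_group *nvlink_attr_groups[] = {
	&format_nvlink_group,
	&events_group,
	&cpumask_group,
	NULL,
};
```

The probe path then just assigns pmu.attr_groups from the list matching the device type, and the dynamic allocation helper goes away entirely.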
> +
> + return 0;
> +}
> +static int nv_c2c_pmu_get_cpus(struct nv_c2c_pmu *c2c_pmu)
> +{
> + int ret = 0, socket = c2c_pmu->socket, cpu;
> +
> + for_each_possible_cpu(cpu) {
> + if (cpu_to_node(cpu) == socket)
> + cpumask_set_cpu(cpu, &c2c_pmu->associated_cpus);
> + }
> +
> + if (cpumask_empty(&c2c_pmu->associated_cpus)) {
> + dev_dbg(c2c_pmu->dev,
> + "No cpu associated with C2C PMU socket-%u\n", socket);
> + ret = -ENODEV;
> + }
> +
> + return ret;
Direct returns are often more readable. I think that applies here
as we can then make it clear where the good exit paths are with return 0.
> +}
> +
> +static int nv_c2c_pmu_init_socket(struct nv_c2c_pmu *c2c_pmu)
> +{
> + const char *uid_str;
> + int ret, socket;
> +
> + uid_str = acpi_device_uid(c2c_pmu->acpi_dev);
> + if (!uid_str) {
> + ret = -ENODEV;
> + goto fail;
> + }
> +
> + ret = kstrtou32(uid_str, 0, &socket);
> + if (ret)
> + goto fail;
> +
> + c2c_pmu->socket = socket;
> + return 0;
> +
> +fail:
> + dev_err(c2c_pmu->dev, "Failed to initialize socket\n");
I'd return above, with a more specific error message given there are two
different things that can go wrong.
> + return ret;
> +}
> +
> +static int nv_c2c_pmu_init_id(struct nv_c2c_pmu *c2c_pmu)
> +{
> + const char *name_fmt[C2C_TYPE_COUNT] = {
> + [C2C_TYPE_NVLINK] = "nvidia_nvlink_c2c_pmu_%u",
> + [C2C_TYPE_NVCLINK] = "nvidia_nvclink_pmu_%u",
> + [C2C_TYPE_NVDLINK] = "nvidia_nvdlink_pmu_%u",
> + };
> +
> + char *name;
> + int ret;
> +
> + name = devm_kasprintf(c2c_pmu->dev, GFP_KERNEL,
> + name_fmt[c2c_pmu->c2c_type], c2c_pmu->socket);
> + if (!name) {
> + ret = -ENOMEM;
> + goto fail;
> + }
> +
> + c2c_pmu->name = name;
> +
> + c2c_pmu->identifier = acpi_device_hid(c2c_pmu->acpi_dev);
> +
> + return 0;
> +
> +fail:
> + dev_err(c2c_pmu->dev, "Failed to initialize name\n");
Why the goto? Just error out above.
For any calls like this that are only made from probe() use
return dev_err_probe(c2c_pmu->dev, ret, "...\n");
However, the general view is that allocation failures make enough noise
on their own that we never print additional error messages for them. So
just drop the print here.
There are a few people cleaning the kernel tree up to remove exactly
this case. Not sure anyone got to perf yet though.
> + return ret;
> +}
> +
> +static int nv_c2c_pmu_init_filter(struct nv_c2c_pmu *c2c_pmu)
> +{
> + u32 cpu_en = 0;
> + struct device *dev = c2c_pmu->dev;
> +
> + if (c2c_pmu->c2c_type == C2C_TYPE_NVDLINK) {
> + c2c_pmu->peer_type = C2C_PEER_TYPE_CXLMEM;
> +
> + c2c_pmu->nr_inst = C2C_NR_INST_NVDLINK;
> + c2c_pmu->peer_insts[0][0] = (1UL << c2c_pmu->nr_inst) - 1;
> +
> + c2c_pmu->nr_peer = C2C_NR_PEER_CXLMEM;
> + c2c_pmu->filter_default = (1 << c2c_pmu->nr_peer) - 1;
> +
> + c2c_pmu->formats = nv_c2c_pmu_formats;
> +
> + return 0;
> + }
> +
> + c2c_pmu->nr_inst = (c2c_pmu->c2c_type == C2C_TYPE_NVLINK) ?
> + C2C_NR_INST_NVLINK : C2C_NR_INST_NVCLINK;
> +
> + if (device_property_read_u32(dev, "cpu_en_mask", &cpu_en))
> + dev_dbg(dev, "no cpu_en_mask property\n");
> +
> + if (cpu_en) {
> + c2c_pmu->peer_type = C2C_PEER_TYPE_CPU;
> +
> + /* Fill peer_insts bitmap with instances connected to peer CPU. */
> + bitmap_from_arr32(c2c_pmu->peer_insts[0], &cpu_en,
> + c2c_pmu->nr_inst);
> +
> + c2c_pmu->nr_peer = 1;
> + c2c_pmu->formats = nv_c2c_pmu_formats;
> + } else {
> + u32 i;
> + const char *props[C2C_NR_PEER_MAX] = {
> + "gpu0_en_mask", "gpu1_en_mask"
> + };
> +
> + for (i = 0; i < C2C_NR_PEER_MAX; i++) {
> + u32 gpu_en = 0;
> +
> + if (device_property_read_u32(dev, props[i], &gpu_en))
> + dev_dbg(dev, "no %s property\n", props[i]);
> +
> + if (gpu_en) {
> + /* Fill peer_insts bitmap with instances connected to peer GPU. */
> + bitmap_from_arr32(c2c_pmu->peer_insts[i], &gpu_en,
> + c2c_pmu->nr_inst);
> +
> + c2c_pmu->nr_peer++;
> + }
> + }
> +
> + if (c2c_pmu->nr_peer == 0) {
> + dev_err(dev, "No GPU is enabled\n");
> + return -EINVAL;
> + }
> +
> + c2c_pmu->peer_type = C2C_PEER_TYPE_GPU;
> + c2c_pmu->formats = nv_c2c_nvlink_pmu_formats;
> + }
> +
> + c2c_pmu->filter_default = (1 << c2c_pmu->nr_peer) - 1;
> +
> + return 0;
> +}
> +
> +static void *nv_c2c_pmu_init_pmu(struct platform_device *pdev)
> +{
> + int ret;
> + struct nv_c2c_pmu *c2c_pmu;
> + struct acpi_device *acpi_dev;
> + struct device *dev = &pdev->dev;
> +
> + acpi_dev = ACPI_COMPANION(dev);
> + if (!acpi_dev)
> + return ERR_PTR(-ENODEV);
> +
> + c2c_pmu = devm_kzalloc(dev, sizeof(*c2c_pmu), GFP_KERNEL);
> + if (!c2c_pmu)
> + return ERR_PTR(-ENOMEM);
> +
> + c2c_pmu->dev = dev;
> + c2c_pmu->acpi_dev = acpi_dev;
> + c2c_pmu->c2c_type = (unsigned int)(unsigned long)device_get_match_data(dev);
As below, I'd make this a pointer to a struct with all the type specific
info in the struct.
> + platform_set_drvdata(pdev, c2c_pmu);
> +
> + ret = nv_c2c_pmu_init_socket(c2c_pmu);
> + if (ret)
> + goto done;
> +
> + ret = nv_c2c_pmu_init_id(c2c_pmu);
> + if (ret)
> + goto done;
> +
> + ret = nv_c2c_pmu_init_filter(c2c_pmu);
> + if (ret)
> + goto done;
> +
> +done:
Why not just return ERR_PTR() above and drop this goto?
That will be easier to read, as it makes clear that on any error you just
return. With the label here it looks like there might be good paths that
use it. That briefly confused me :)
> + if (ret)
> + return ERR_PTR(ret);
> +
> + return c2c_pmu;
> +}
> +
> +static const struct acpi_device_id nv_c2c_pmu_acpi_match[] = {
> + { "NVDA2023", (kernel_ulong_t)C2C_TYPE_NVLINK },
Speaking from long experience of maintaining stuff that uses this pattern,
it's much cleaner to have a per device type struct and put a pointer
to that here. The enum approach tends to lead to lots of switch statements
of scattered data about the uniqueness of each type.
Up to you though, but I'd suggest this will bite you.
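A rough sketch of that pattern (struct name, fields, and values here are hypothetical placeholders, and the lookup is a userspace stand-in for what device_get_match_data() would return from the ACPI table's driver_data):

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/*
 * Per-type descriptor: all type-specific data lives in one const
 * struct, so there is no enum left to switch on elsewhere.
 */
struct nv_c2c_type_info {
	const char *name_fmt;
	unsigned int nr_inst;		/* placeholder instance count */
	bool has_write_events;
};

static const struct nv_c2c_type_info nvlink_info = {
	.name_fmt = "nvidia_nvlink_c2c_pmu_%u",
	.nr_inst = 4,
	.has_write_events = true,
};

static const struct nv_c2c_type_info nvdlink_info = {
	.name_fmt = "nvidia_nvdlink_pmu_%u",
	.nr_inst = 2,
	.has_write_events = false,
};

/* Stand-in for the HID -> driver_data mapping in the ACPI match table. */
struct match_entry {
	const char *hid;
	const struct nv_c2c_type_info *info;
};

static const struct match_entry match_table[] = {
	{ "NVDA2023", &nvlink_info },
	{ "NVDA2020", &nvdlink_info },
};

static const struct nv_c2c_type_info *lookup_type(const char *hid)
{
	for (size_t i = 0; i < sizeof(match_table) / sizeof(match_table[0]); i++) {
		if (strcmp(match_table[i].hid, hid) == 0)
			return match_table[i].info;
	}
	return NULL;
}
```

The driver would then store the returned pointer in nv_c2c_pmu and read fields off it directly, rather than switching on c2c_type in each helper.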
> + { "NVDA2022", (kernel_ulong_t)C2C_TYPE_NVCLINK },
> + { "NVDA2020", (kernel_ulong_t)C2C_TYPE_NVDLINK },
> + { }
> +};
> +MODULE_DEVICE_TABLE(acpi, nv_c2c_pmu_acpi_match);
> +
> +static struct platform_driver nv_c2c_pmu_driver = {
> + .driver = {
> + .name = "nvidia-t410-c2c-pmu",
> + .acpi_match_table = ACPI_PTR(nv_c2c_pmu_acpi_match),
ACPI_PTR() mostly causes annoying issues with __maybe_unused
being used to save a trivial amount of data for !CONFIG_ACPI.
My advice would be to not use it. It's not necessary at all.
> + .suppress_bind_attrs = true,
> + },
> + .probe = nv_c2c_pmu_probe,
> + .remove = nv_c2c_pmu_device_remove,
> +};
> +
> +static int __init nv_c2c_pmu_init(void)
> +{
> + int ret;
> +
> + ret = cpuhp_setup_state_multi(CPUHP_AP_ONLINE_DYN,
> + "perf/nvidia/c2c:online",
> + nv_c2c_pmu_online_cpu,
> + nv_c2c_pmu_cpu_teardown);
> + if (ret < 0)
> + return ret;
> +
> + nv_c2c_pmu_cpuhp_state = ret;
> + return platform_driver_register(&nv_c2c_pmu_driver);
> +}
> +
> +static void __exit nv_c2c_pmu_exit(void)
> +{
> + platform_driver_unregister(&nv_c2c_pmu_driver);
> + cpuhp_remove_multi_state(nv_c2c_pmu_cpuhp_state);
> +}
> +
> +module_init(nv_c2c_pmu_init);
> +module_exit(nv_c2c_pmu_exit);
> +
> +MODULE_LICENSE("GPL");
> +MODULE_DESCRIPTION("NVIDIA Tegra410 C2C PMU driver");
> +MODULE_AUTHOR("Besar Wicaksono <bwicaksono@nvidia.com>");
* [PATCH v2 8/8] arm64: defconfig: Enable NVIDIA TEGRA410 PMU
2026-02-18 14:58 [PATCH v2 0/8] perf: add NVIDIA Tegra410 Uncore PMU support Besar Wicaksono
` (6 preceding siblings ...)
2026-02-18 14:58 ` [PATCH v2 7/8] perf: add NVIDIA Tegra410 C2C PMU Besar Wicaksono
@ 2026-02-18 14:58 ` Besar Wicaksono
7 siblings, 0 replies; 18+ messages in thread
From: Besar Wicaksono @ 2026-02-18 14:58 UTC (permalink / raw)
To: will, suzuki.poulose, robin.murphy, ilkka
Cc: linux-arm-kernel, linux-kernel, linux-tegra, mark.rutland,
treding, jonathanh, vsethi, rwiley, sdonthineni, skelley, ywan,
mochs, nirmoyd, Besar Wicaksono
Enable the drivers for the NVIDIA Tegra410 CMEM Latency and C2C PMUs
present in the Tegra410 SoC, which is an ACPI-based ARM64 platform.
Signed-off-by: Besar Wicaksono <bwicaksono@nvidia.com>
---
arch/arm64/configs/defconfig | 2 ++
1 file changed, 2 insertions(+)
diff --git a/arch/arm64/configs/defconfig b/arch/arm64/configs/defconfig
index 45288ec9eaf7..3d0e438cb997 100644
--- a/arch/arm64/configs/defconfig
+++ b/arch/arm64/configs/defconfig
@@ -1723,6 +1723,8 @@ CONFIG_ARM_DMC620_PMU=m
CONFIG_HISI_PMU=y
CONFIG_ARM_CORESIGHT_PMU_ARCH_SYSTEM_PMU=m
CONFIG_NVIDIA_CORESIGHT_PMU_ARCH_SYSTEM_PMU=m
+CONFIG_NVIDIA_TEGRA410_CMEM_LATENCY_PMU=m
+CONFIG_NVIDIA_TEGRA410_C2C_PMU=m
CONFIG_MESON_DDR_PMU=m
CONFIG_NVMEM_LAYOUT_SL28_VPD=m
CONFIG_NVMEM_IMX_OCOTP=y
--
2.43.0