* [PATCH v11 0/3] ACPI: Add support for ACPI RAS2 feature table
@ 2025-08-12 14:26 shiju.jose
2025-08-12 14:26 ` [PATCH v11 1/3] mm: Add support to retrieve physical address range of memory from the node ID shiju.jose
` (3 more replies)
0 siblings, 4 replies; 13+ messages in thread
From: shiju.jose @ 2025-08-12 14:26 UTC (permalink / raw)
To: rafael, bp, akpm, rppt, dferguson, linux-edac, linux-acpi,
linux-mm, linux-doc, tony.luck, lenb, leo.duran, Yazen.Ghannam,
mchehab
Cc: jonathan.cameron, linuxarm, rientjes, jiaqiyan, Jon.Grimm,
dave.hansen, naoya.horiguchi, james.morse, jthoughton,
somasundaram.a, erdemaktas, pgonda, duenwen, gthelen, wschwartz,
wbs, nifan.cxl, tanxiaofei, prime.zeng, roberto.sassu,
kangkang.shen, wanghuiqiang, shiju.jose
From: Shiju Jose <shiju.jose@huawei.com>
1. In numa_memblks, add support to retrieve the physical address range of
memory in a NUMA domain, which is required for the ACPI RAS2 memory
features.
2. Add support for the ACPI RAS2 feature table (RAS2) defined in the
ACPI 6.5 specification, section 5.2.21, and for the RAS2 HW-based memory
scrubbing feature.
ACPI RAS2 patches were part of the EDAC series [1].
The code is based on linux.git v6.17-rc1 [2].
1. https://lore.kernel.org/linux-cxl/20250212143654.1893-1-shiju.jose@huawei.com/
2. https://github.com/torvalds/linux.git
Changes
=======
v10 -> v11:
1. Simplified code by removing the workarounds previously added to support
the non-compliant case of a single PCC channel shared across all proximity
domains (no longer required).
https://lore.kernel.org/all/f5b28977-0b80-4c39-929b-cf02ab1efb97@os.amperecomputing.com/
2. Fix for the comments from Borislav (Thanks).
https://lore.kernel.org/all/20250811152805.GQaJoMBecC4DSDtTAu@fat_crate.local/
3. Rebase to v6.17-rc1.
v9 -> v10:
1. Use pcc_chan->shmem instead of
acpi_os_ioremap(pcc_chan->shmem_base_addr,...), since the PCC driver
already ioremaps the shared memory region internally into pcc_chan->shmem.
2. Changes required for the Ampere Computing system, which uses a single
PCC channel for RAS2 memory features across all NUMA domains. Based on the
requirements from Daniel on v9
https://lore.kernel.org/all/547ed8fb-d6b7-4b6b-a38b-bf13223971b1@os.amperecomputing.com/
and discussion with Jonathan.
2.1 Add a node_to_range lookup facility to numa_memblks. This is used to retrieve the
lowest physically contiguous memory range of the memory associated with a NUMA domain.
2.2. Set the requested address range to the memory region's base address and size
when sending the RAS2 command GET_PATROL_PARAMETERS
in the functions ras2_update_patrol_scrub_params_cache() and
ras2_get_patrol_scrub_running().
2.3. Split struct ras2_mem_ctx into struct ras2_mem_ctx_hdr and struct ras2_pxm_domain
to support both a single PCC channel used for RAS2 scrubbers across all NUMA
domains and a PCC channel per RAS2 scrub instance, given that the ACPI spec
defines a single memory scrubber per NUMA domain.
2.4. The EDAC feature sysfs folder for RAS2 changed from "acpi_ras_memX" to "acpi_ras_mem_idX",
because memory scrub instances across all NUMA domains would all be presented under
"acpi_ras_mem_id0" when a system uses a single PCC channel for RAS2 scrubbers across
all NUMA domains.
2.5. Removed Acked-by: Rafael from patch [2], because of the several changes above since v9.
v8 -> v9:
1. Added the following changes for feedback from Yazen.
1.1 In the ras2_check_pcc_chan() function:
- Moved the u32 variables to the same line.
- Updated the error log for readw_relaxed_poll_timeout().
- Added an error log for the if (status & PCC_STATUS_ERROR) error condition.
- Removed an impossible condition check.
1.2. Added guard for ras2_pc_list_lock in ras2_get_pcc_subspace().
2. Rebased to linux.git v6.16-rc2 [2].
v7 -> v8:
1. Rebased to linux.git v6.16-rc1 [2].
v6 -> v7:
1. Fix for the issue reported by Daniel:
in ras2_check_pcc_chan(), read, clear and check the RAS2 set_cap_status outside
the if (status & PCC_STATUS_ERROR) check.
https://lore.kernel.org/all/51bcb52c-4132-4daf-8903-29b121c485a1@os.amperecomputing.com/
v5 -> v6:
1. Fix for the issue reported by Daniel: start scrubbing with the correct addr and size
after the firmware returns an INVALID DATA error for a scrub request with an invalid addr or size.
https://lore.kernel.org/all/8cdf7885-31b3-4308-8a7c-f4e427486429@os.amperecomputing.com/
v4 -> v5:
1. Fix for the build warnings reported by kernel test robot.
https://patchwork.kernel.org/project/linux-edac/patch/20250423163511.1412-3-shiju.jose@huawei.com/
2. Removed the patch "ACPI: ACPI 6.5: RAS2: Rename RAS2 table structure and field names"
from the series, as it was merged into linux-pm.git, branch linux-next.
3. Rebased to the ras.git edac-for-next branch merged with the linux-pm.git linux-next branch.
v3 -> v4:
1. Changes for feedback from Yazen on v3.
https://lore.kernel.org/all/20250415210504.GA854098@yaz-khff2.amd.com/
v2 -> v3:
1. Renamed RAS2 table structure and field names in
include/acpi/actbl2.h, limited to only those necessary
for the RAS2 scrub feature.
2. Changes for feedback from Jonathan on v2.
3. Daniel reported a known behaviour: reading back the 'size' attribute after
setting it returns 0 before scrubbing is started via the 'addr' attribute.
Changes added to fix this.
4. Daniel reported that the firmware cannot update the status of demand scrubbing
via the 'Actual Address Range (OUTPUT)'; thus a workaround was added in the
kernel to update the sysfs 'addr' attribute with the status of demand
scrubbing.
5. Optimized logic in ras2_check_pcc_chan() function
(patch - ACPI:RAS2: Add ACPI RAS2 driver).
6. Added a PCC channel lock to struct ras2_pcc_subspace and changed the
lock in ras2_mem_ctx to a pointer to the PCC channel lock, to ensure that
writes to the PCC subspace shared memory are protected from race conditions.
v1 -> v2:
1. Changes for feedback from Borislav.
- Shortened ACPI RAS2 structure and variable names.
- Shortened some of the other variables in the RAS2 drivers.
- Fixed a few CamelCase names.
2. Changes for feedback from Yazen.
- Added newlines after a number of '}' and return statements.
- Changed the return type of ras2_add_aux_device() to 'int'.
- Deleted a duplicate acpi_get_table("RAS2",...) call in ras2_acpi_parse_table().
- Added "FW_WARN" to a few error logs in ras2_acpi_parse_table().
- Renamed ras2_acpi_init() to acpi_ras2_init() and modified acpi_init() to
call it.
- Moved scrub-related variables of struct ras2_mem_ctx from patch
"ACPI:RAS2: Add ACPI RAS2 driver" to "ras: mem: Add memory ACPI RAS2 driver".
Shiju Jose (3):
mm: Add support to retrieve physical address range of memory from the
node ID
ACPI:RAS2: Add ACPI RAS2 driver
ras: mem: Add memory ACPI RAS2 driver
Documentation/edac/scrub.rst | 73 ++++++
drivers/acpi/Kconfig | 12 +
drivers/acpi/Makefile | 1 +
drivers/acpi/bus.c | 3 +
drivers/acpi/ras2.c | 385 +++++++++++++++++++++++++++++++
drivers/ras/Kconfig | 11 +
drivers/ras/Makefile | 1 +
drivers/ras/acpi_ras2.c | 424 +++++++++++++++++++++++++++++++++++
include/acpi/ras2.h | 77 +++++++
include/linux/numa.h | 9 +
include/linux/numa_memblks.h | 2 +
mm/numa.c | 10 +
mm/numa_memblks.c | 37 +++
13 files changed, 1045 insertions(+)
create mode 100644 drivers/acpi/ras2.c
create mode 100644 drivers/ras/acpi_ras2.c
create mode 100644 include/acpi/ras2.h
--
2.43.0
^ permalink raw reply [flat|nested] 13+ messages in thread
* [PATCH v11 1/3] mm: Add support to retrieve physical address range of memory from the node ID
2025-08-12 14:26 [PATCH v11 0/3] ACPI: Add support for ACPI RAS2 feature table shiju.jose
@ 2025-08-12 14:26 ` shiju.jose
2025-08-19 16:54 ` Jonathan Cameron
2025-08-12 14:26 ` [PATCH v11 2/3] ACPI:RAS2: Add ACPI RAS2 driver shiju.jose
` (2 subsequent siblings)
3 siblings, 1 reply; 13+ messages in thread
From: shiju.jose @ 2025-08-12 14:26 UTC (permalink / raw)
To: rafael, bp, akpm, rppt, dferguson, linux-edac, linux-acpi,
linux-mm, linux-doc, tony.luck, lenb, leo.duran, Yazen.Ghannam,
mchehab
Cc: jonathan.cameron, linuxarm, rientjes, jiaqiyan, Jon.Grimm,
dave.hansen, naoya.horiguchi, james.morse, jthoughton,
somasundaram.a, erdemaktas, pgonda, duenwen, gthelen, wschwartz,
wbs, nifan.cxl, tanxiaofei, prime.zeng, roberto.sassu,
kangkang.shen, wanghuiqiang, shiju.jose
From: Shiju Jose <shiju.jose@huawei.com>
In the numa_memblks, a lookup facility is required to retrieve the
physical address range of memory in a NUMA node. ACPI RAS2 memory
features are among the use cases.
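A minimal usage sketch of the new helper (illustrative only; the example_*
function name below is hypothetical and not part of this series):

#include <linux/numa.h>

/* Look up the lowest contiguous physical range backing NUMA node 'nid'. */
static int example_nid_range(int nid, u64 *base, u64 *size)
{
	u64 start = 0, end = 0;
	int rc;

	rc = nid_get_mem_physaddr_range(nid, &start, &end);
	if (rc)		/* -EINVAL: invalid nid, -ENODEV: no memblk for nid */
		return rc;

	*base = start;
	*size = end - start;	/* 'end' is the last address in the range + 1 */
	return 0;
}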
Suggested-by: Jonathan Cameron <jonathan.cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
include/linux/numa.h | 9 +++++++++
include/linux/numa_memblks.h | 2 ++
mm/numa.c | 10 ++++++++++
mm/numa_memblks.c | 37 ++++++++++++++++++++++++++++++++++++
4 files changed, 58 insertions(+)
diff --git a/include/linux/numa.h b/include/linux/numa.h
index e6baaf6051bc..1d1aabebd26b 100644
--- a/include/linux/numa.h
+++ b/include/linux/numa.h
@@ -41,6 +41,10 @@ int memory_add_physaddr_to_nid(u64 start);
int phys_to_target_node(u64 start);
#endif
+#ifndef nid_get_mem_physaddr_range
+int nid_get_mem_physaddr_range(int nid, u64 *start, u64 *end);
+#endif
+
int numa_fill_memblks(u64 start, u64 end);
#else /* !CONFIG_NUMA */
@@ -63,6 +67,11 @@ static inline int phys_to_target_node(u64 start)
return 0;
}
+static inline int nid_get_mem_physaddr_range(int nid, u64 *start, u64 *end)
+{
+ return 0;
+}
+
static inline void alloc_offline_node_data(int nid) {}
#endif
diff --git a/include/linux/numa_memblks.h b/include/linux/numa_memblks.h
index 991076cba7c5..7b32d96d0134 100644
--- a/include/linux/numa_memblks.h
+++ b/include/linux/numa_memblks.h
@@ -55,6 +55,8 @@ extern int phys_to_target_node(u64 start);
#define phys_to_target_node phys_to_target_node
extern int memory_add_physaddr_to_nid(u64 start);
#define memory_add_physaddr_to_nid memory_add_physaddr_to_nid
+extern int nid_get_mem_physaddr_range(int nid, u64 *start, u64 *end);
+#define nid_get_mem_physaddr_range nid_get_mem_physaddr_range
#endif /* CONFIG_NUMA_KEEP_MEMINFO */
#endif /* CONFIG_NUMA_MEMBLKS */
diff --git a/mm/numa.c b/mm/numa.c
index 7d5e06fe5bd4..5335af1fefee 100644
--- a/mm/numa.c
+++ b/mm/numa.c
@@ -59,3 +59,13 @@ int phys_to_target_node(u64 start)
}
EXPORT_SYMBOL_GPL(phys_to_target_node);
#endif
+
+#ifndef nid_get_mem_physaddr_range
+int nid_get_mem_physaddr_range(int nid, u64 *start, u64 *end)
+{
+ pr_info_once("Unknown target phys addr range for node=%d\n", nid);
+
+ return 0;
+}
+EXPORT_SYMBOL_GPL(nid_get_mem_physaddr_range);
+#endif
diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c
index 541a99c4071a..e1e56b7a3499 100644
--- a/mm/numa_memblks.c
+++ b/mm/numa_memblks.c
@@ -590,4 +590,41 @@ int memory_add_physaddr_to_nid(u64 start)
}
EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
+/**
+ * nid_get_mem_physaddr_range - Get the physical address range
+ * of the memblk in the NUMA node.
+ * @nid: NUMA node ID of the memblk
+ * @start: Start address of the memblk
+ * @end: End address of the memblk
+ *
+ * Find the lowest contiguous physical memory address range of the memblk
+ * in the NUMA node with the given nid and return the start and end
+ * addresses.
+ *
+ * RETURNS:
+ * 0 on success, -errno on failure.
+ */
+int nid_get_mem_physaddr_range(int nid, u64 *start, u64 *end)
+{
+ struct numa_meminfo *mi = &numa_meminfo;
+ int i;
+
+ if (!numa_valid_node(nid))
+ return -EINVAL;
+
+ for (i = 0; i < mi->nr_blks; i++) {
+ if (mi->blk[i].nid == nid) {
+ *start = mi->blk[i].start;
+ /*
+ * Assumption: mi->blk[i].end is the last address
+ * in the range + 1.
+ */
+ *end = mi->blk[i].end;
+ return 0;
+ }
+ }
+
+ return -ENODEV;
+}
+EXPORT_SYMBOL_GPL(nid_get_mem_physaddr_range);
#endif /* CONFIG_NUMA_KEEP_MEMINFO */
--
2.43.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v11 2/3] ACPI:RAS2: Add ACPI RAS2 driver
2025-08-12 14:26 [PATCH v11 0/3] ACPI: Add support for ACPI RAS2 feature table shiju.jose
2025-08-12 14:26 ` [PATCH v11 1/3] mm: Add support to retrieve physical address range of memory from the node ID shiju.jose
@ 2025-08-12 14:26 ` shiju.jose
2025-08-12 14:26 ` [PATCH v11 3/3] ras: mem: Add memory " shiju.jose
2025-08-19 20:12 ` [PATCH v11 0/3] ACPI: Add support for ACPI RAS2 feature table Daniel Ferguson
3 siblings, 0 replies; 13+ messages in thread
From: shiju.jose @ 2025-08-12 14:26 UTC (permalink / raw)
To: rafael, bp, akpm, rppt, dferguson, linux-edac, linux-acpi,
linux-mm, linux-doc, tony.luck, lenb, leo.duran, Yazen.Ghannam,
mchehab
Cc: jonathan.cameron, linuxarm, rientjes, jiaqiyan, Jon.Grimm,
dave.hansen, naoya.horiguchi, james.morse, jthoughton,
somasundaram.a, erdemaktas, pgonda, duenwen, gthelen, wschwartz,
wbs, nifan.cxl, tanxiaofei, prime.zeng, roberto.sassu,
kangkang.shen, wanghuiqiang, shiju.jose
From: Shiju Jose <shiju.jose@huawei.com>
Add support for ACPI RAS2 feature table (RAS2) defined in the
ACPI 6.5 Specification, section 5.2.21.
The driver defines a RAS2 init routine, which extracts the RAS2 table and
adds an auxiliary device for each memory feature; each device binds to the
RAS2 memory driver.
The driver uses a PCC mailbox to communicate with the ACPI-compliant
hardware and adds OSPM interfaces to send RAS2 commands.
According to the ACPI specification rev 6.5, section 5.2.21.1.1
RAS2 Platform Communication Channel Descriptor, “RAS2 supports multiple
PCC channels, where a channel is dedicated to a given component
instance.” Thus, the RAS2 driver has been implemented to support only
systems that comply with the specification, i.e. a dedicated PCC channel
per system component instance for communication.
ACPI specification rev 6.5, section 5.2.21.1.1, Table 5.80 (RAS2 Platform
Communication Channel Descriptor format) defines the Instance field as the
identifier for the system component instance that the RAS feature is
associated with.
Section 5.2.21.2.1 Hardware-based Memory Scrubbing describes the feature as
follows:
The platform can use this feature to expose controls and capabilities
associated with hardware-based memory scrub engines. Modern scalable
platforms have complex memory systems with a multitude of memory
controllers that are in turn associated with NUMA domains. It is also
common for RAS errors related to memory to be associated with NUMA
domains, where the NUMA domain functions as a FRU identifier. This
feature thus provides memory scrubbing at a NUMA domain granularity.
The following are supported:
1. Independent memory scrubbing controls for each NUMA domain, identified
using its proximity domain.
2. Provision for background (patrol) scrubbing of the entire memory
system, as well as on-demand scrubbing for a specific region of memory.
Thus, the RAS2 driver requires the lowest physically contiguous memory range
of the memory associated with a NUMA domain when communicating with the
firmware for memory-related features such as scrubbing. The driver uses
the component instance ID, as defined in Table 5.80, to look up the
corresponding entry in numa_memblks. This allows it to retrieve the lowest
physically contiguous memory range for the NUMA node associated with the
memory and store it in struct ras2_mem_ctx.
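A minimal sketch of that lookup path (illustrative only; the example_*
helper below is hypothetical):

#include <linux/acpi.h>
#include <linux/numa.h>

/*
 * Map a RAS2 PCC descriptor's Instance field (a proximity domain) to the
 * lowest contiguous physical range of its NUMA node, as done before the
 * range is cached in struct ras2_mem_ctx.
 */
static int example_resolve_instance(u32 pxm_inst, u64 *base, u64 *size)
{
	u64 start = 0, end = 0;
	int nid, rc;

	nid = pxm_to_node(pxm_inst);
	rc = nid_get_mem_physaddr_range(nid, &start, &end);
	if (rc < 0 || (!start && !end))
		return rc ? rc : -ENODEV;

	*base = start;
	*size = end - start;
	return 0;
}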
Co-developed-by: A Somasundaram <somasundaram.a@hpe.com>
Signed-off-by: A Somasundaram <somasundaram.a@hpe.com>
Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Daniel Ferguson <danielf@os.amperecomputing.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
drivers/acpi/Kconfig | 12 ++
drivers/acpi/Makefile | 1 +
drivers/acpi/bus.c | 3 +
drivers/acpi/ras2.c | 385 ++++++++++++++++++++++++++++++++++++++++++
include/acpi/ras2.h | 63 +++++++
5 files changed, 464 insertions(+)
create mode 100644 drivers/acpi/ras2.c
create mode 100644 include/acpi/ras2.h
diff --git a/drivers/acpi/Kconfig b/drivers/acpi/Kconfig
index b594780a57d7..db21bf5a39c7 100644
--- a/drivers/acpi/Kconfig
+++ b/drivers/acpi/Kconfig
@@ -293,6 +293,18 @@ config ACPI_CPPC_LIB
If your platform does not support CPPC in firmware,
leave this option disabled.
+config ACPI_RAS2
+ bool "ACPI RAS2 driver"
+ select AUXILIARY_BUS
+ select MAILBOX
+ select PCC
+ select NUMA_KEEP_MEMINFO if NUMA_MEMBLKS
+ help
+ The driver adds support for the ACPI RAS2 feature table (extracts the
+ RAS2 table from the OS system table) and OSPM interfaces to send RAS2
+ commands via a PCC mailbox subspace. The driver adds an auxiliary device
+ for each RAS2 memory feature, which binds to the RAS2 memory driver.
+
config ACPI_PROCESSOR
tristate "Processor"
depends on X86 || ARM64 || LOONGARCH || RISCV
diff --git a/drivers/acpi/Makefile b/drivers/acpi/Makefile
index d1b0affb844f..abfec6745724 100644
--- a/drivers/acpi/Makefile
+++ b/drivers/acpi/Makefile
@@ -105,6 +105,7 @@ obj-$(CONFIG_ACPI_EC_DEBUGFS) += ec_sys.o
obj-$(CONFIG_ACPI_BGRT) += bgrt.o
obj-$(CONFIG_ACPI_CPPC_LIB) += cppc_acpi.o
obj-$(CONFIG_ACPI_SPCR_TABLE) += spcr.o
+obj-$(CONFIG_ACPI_RAS2) += ras2.o
obj-$(CONFIG_ACPI_DEBUGGER_USER) += acpi_dbg.o
obj-$(CONFIG_ACPI_PPTT) += pptt.o
obj-$(CONFIG_ACPI_PFRUT) += pfr_update.o pfr_telemetry.o
diff --git a/drivers/acpi/bus.c b/drivers/acpi/bus.c
index a984ccd4a2a0..b02ceb2837c6 100644
--- a/drivers/acpi/bus.c
+++ b/drivers/acpi/bus.c
@@ -31,6 +31,7 @@
#include <acpi/apei.h>
#include <linux/suspend.h>
#include <linux/prmt.h>
+#include <acpi/ras2.h>
#include "internal.h"
@@ -1474,6 +1475,8 @@ static int __init acpi_init(void)
acpi_debugger_init();
acpi_setup_sb_notify_handler();
acpi_viot_init();
+ acpi_ras2_init();
+
return 0;
}
diff --git a/drivers/acpi/ras2.c b/drivers/acpi/ras2.c
new file mode 100644
index 000000000000..e58c969a5d75
--- /dev/null
+++ b/drivers/acpi/ras2.c
@@ -0,0 +1,385 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Implementation of ACPI RAS2 driver.
+ *
+ * Copyright (c) 2024-2025 HiSilicon Limited.
+ *
+ * Support for RAS2 - ACPI 6.5 Specification, section 5.2.21
+ *
+ * Driver contains ACPI RAS2 init, which extracts the ACPI RAS2 table and
+ * get the PCC channel subspace for communicating with the ACPI compliant
+ * HW platform which supports ACPI RAS2. Driver adds auxiliary devices
+ * for each RAS2 memory feature which binds to the memory ACPI RAS2 driver.
+ */
+
+#define pr_fmt(fmt) "ACPI RAS2: " fmt
+
+#include <linux/delay.h>
+#include <linux/export.h>
+#include <linux/iopoll.h>
+#include <linux/ktime.h>
+#include <acpi/pcc.h>
+#include <acpi/ras2.h>
+
+/**
+ * struct ras2_pcc_subspace - Data structure for PCC communication
+ * @mbox_client: struct mbox_client object
+ * @pcc_chan: Pointer to struct pcc_mbox_chan
+ * @comm_addr: Pointer to RAS2 PCC shared memory region
+ * @elem: List for registered RAS2 PCC channel subspaces
+ * @pcc_lock: PCC lock to provide mutually exclusive access
+ * to PCC channel subspace
+ * @deadline_us: Poll PCC status register timeout in micro secs
+ * for PCC command complete
+ * @pcc_mpar: Maximum Periodic Access Rate(MPAR) for PCC channel
+ * @pcc_mrtt: Minimum Request Turnaround Time(MRTT) in micro secs
+ * OS must wait after completion of a PCC command before
+ * issue next command
+ * @last_cmd_cmpl_time: completion time of last PCC command
+ * @last_mpar_reset: Time of last MPAR count reset
+ * @mpar_count: MPAR count
+ * @pcc_id: Identifier of the RAS2 platform communication channel
+ * @last_cmd: Last PCC command
+ * @pcc_chnl_acq: Status of PCC channel acquired
+ */
+struct ras2_pcc_subspace {
+ struct mbox_client mbox_client;
+ struct pcc_mbox_chan *pcc_chan;
+ struct acpi_ras2_shmem __iomem *comm_addr;
+ struct list_head elem;
+ struct mutex pcc_lock;
+ unsigned int deadline_us;
+ unsigned int pcc_mpar;
+ unsigned int pcc_mrtt;
+ ktime_t last_cmd_cmpl_time;
+ ktime_t last_mpar_reset;
+ int mpar_count;
+ int pcc_id;
+ u16 last_cmd;
+ bool pcc_chnl_acq;
+};
+
+/*
+ * Arbitrary retries for PCC commands because the remote processor
+ * could be much slower to reply. Keeping it high enough to cover
+ * emulators where the processors run painfully slow.
+ */
+#define RAS2_NUM_RETRIES 600ULL
+
+#define RAS2_FEAT_TYPE_MEMORY 0x00
+
+static int ras2_report_cap_error(u32 cap_status)
+{
+ switch (cap_status) {
+ case ACPI_RAS2_NOT_VALID:
+ case ACPI_RAS2_NOT_SUPPORTED:
+ return -EPERM;
+ case ACPI_RAS2_BUSY:
+ return -EBUSY;
+ case ACPI_RAS2_FAILED:
+ case ACPI_RAS2_ABORTED:
+ case ACPI_RAS2_INVALID_DATA:
+ return -EINVAL;
+ default: /* 0 or other, Success */
+ return 0;
+ }
+}
+
+static int ras2_check_pcc_chan(struct ras2_pcc_subspace *pcc_subspace)
+{
+ struct acpi_ras2_shmem __iomem *gen_comm_base = pcc_subspace->comm_addr;
+ u32 cap_status, rc;
+ u16 status;
+
+ /*
+ * As per ACPI spec, the PCC space will be initialized by
+ * platform and should have set the command completion bit when
+ * PCC can be used by OSPM.
+ *
+ * Poll PCC status register every 3us(delay_us) for maximum of
+ * deadline_us(timeout_us) until PCC command complete bit is set(cond).
+ */
+ rc = readw_relaxed_poll_timeout(&gen_comm_base->status, status,
+ status & PCC_STATUS_CMD_COMPLETE, 3,
+ pcc_subspace->deadline_us);
+ if (rc) {
+ pr_warn("PCC check channel timeout for pcc_id=%d rc=%d\n",
+ pcc_subspace->pcc_id, rc);
+ return rc;
+ }
+
+ if (status & PCC_STATUS_ERROR) {
+ pr_warn("Error in executing last command=%d for pcc_id=%d\n",
+ pcc_subspace->last_cmd, pcc_subspace->pcc_id);
+ status &= ~PCC_STATUS_ERROR;
+ writew_relaxed(status, &gen_comm_base->status);
+ return -EIO;
+ }
+
+ cap_status = readw_relaxed(&gen_comm_base->set_caps_status);
+ writew_relaxed(0x0, &gen_comm_base->set_caps_status);
+ return ras2_report_cap_error(cap_status);
+}
+
+/**
+ * ras2_send_pcc_cmd() - Send RAS2 command via PCC channel
+ * @ras2_ctx: pointer to the RAS2 context structure
+ * @cmd: command to send
+ *
+ * Returns: 0 on success, an error otherwise
+ */
+int ras2_send_pcc_cmd(struct ras2_mem_ctx *ras2_ctx, u16 cmd)
+{
+ struct ras2_pcc_subspace *pcc_subspace = ras2_ctx->pcc_subspace;
+ struct acpi_ras2_shmem __iomem *gen_comm_base = pcc_subspace->comm_addr;
+ struct mbox_chan *pcc_channel;
+ unsigned int time_delta;
+ int rc;
+
+ rc = ras2_check_pcc_chan(pcc_subspace);
+ if (rc < 0)
+ return rc;
+
+ pcc_channel = pcc_subspace->pcc_chan->mchan;
+
+ /*
+ * Handle the Minimum Request Turnaround Time(MRTT).
+ * "The minimum amount of time that OSPM must wait after the completion
+ * of a command before issuing the next command, in microseconds."
+ */
+ if (pcc_subspace->pcc_mrtt) {
+ time_delta = ktime_us_delta(ktime_get(),
+ pcc_subspace->last_cmd_cmpl_time);
+ if (pcc_subspace->pcc_mrtt > time_delta)
+ udelay(pcc_subspace->pcc_mrtt - time_delta);
+ }
+
+ /*
+ * Handle the non-zero Maximum Periodic Access Rate(MPAR).
+ * "The maximum number of periodic requests that the subspace channel can
+ * support, reported in commands per minute. 0 indicates no limitation."
+ *
+ * This parameter should be ideally zero or large enough so that it can
+ * handle maximum number of requests that all the cores in the system can
+ * collectively generate. If it is not, we will follow the spec and just
+ * not send the request to the platform after hitting the MPAR limit in
+ * any 60s window.
+ */
+ if (pcc_subspace->pcc_mpar) {
+ if (pcc_subspace->mpar_count == 0) {
+ time_delta = ktime_ms_delta(ktime_get(),
+ pcc_subspace->last_mpar_reset);
+ if (time_delta < 60 * MSEC_PER_SEC) {
+ dev_dbg(ras2_ctx->dev,
+ "PCC cmd not sent due to MPAR limit");
+ return -EIO;
+ }
+ pcc_subspace->last_mpar_reset = ktime_get();
+ pcc_subspace->mpar_count = pcc_subspace->pcc_mpar;
+ }
+ pcc_subspace->mpar_count--;
+ }
+
+ /* Write to the shared comm region */
+ writew_relaxed(cmd, &gen_comm_base->command);
+
+ /* Flip CMD COMPLETE bit */
+ writew_relaxed(0, &gen_comm_base->status);
+
+ /* Ring doorbell */
+ rc = mbox_send_message(pcc_channel, &cmd);
+ if (rc < 0) {
+ dev_warn(ras2_ctx->dev,
+ "Err sending PCC mbox message. cmd:%d, rc:%d\n",
+ cmd, rc);
+ return rc;
+ }
+
+ pcc_subspace->last_cmd = cmd;
+
+ /*
+ * If Minimum Request Turnaround Time is non-zero, we need
+ * to record the completion time of both READ and WRITE
+ * command for proper handling of MRTT, so we need to check
+ * for pcc_mrtt in addition to CMD_READ.
+ */
+ if (cmd == PCC_CMD_EXEC_RAS2 || pcc_subspace->pcc_mrtt) {
+ rc = ras2_check_pcc_chan(pcc_subspace);
+ if (pcc_subspace->pcc_mrtt)
+ pcc_subspace->last_cmd_cmpl_time = ktime_get();
+ }
+
+ if (pcc_channel->mbox->txdone_irq)
+ mbox_chan_txdone(pcc_channel, rc);
+ else
+ mbox_client_txdone(pcc_channel, rc);
+
+ return rc < 0 ? rc : 0;
+}
+EXPORT_SYMBOL_GPL(ras2_send_pcc_cmd);
+
+static void ras2_list_pcc_release(struct ras2_pcc_subspace *pcc_subspace)
+{
+ pcc_mbox_free_channel(pcc_subspace->pcc_chan);
+ kfree(pcc_subspace);
+}
+
+static int ras2_register_pcc_channel(struct ras2_mem_ctx *ras2_ctx, int pcc_id)
+{
+ struct ras2_pcc_subspace *pcc_subspace;
+ struct pcc_mbox_chan *pcc_chan;
+ struct mbox_client *mbox_cl;
+
+ if (pcc_id < 0)
+ return -EINVAL;
+
+ pcc_subspace = kzalloc(sizeof(*pcc_subspace), GFP_KERNEL);
+ if (!pcc_subspace)
+ return -ENOMEM;
+
+ mbox_cl = &pcc_subspace->mbox_client;
+ mbox_cl->knows_txdone = true;
+
+ pcc_chan = pcc_mbox_request_channel(mbox_cl, pcc_id);
+ if (IS_ERR(pcc_chan)) {
+ kfree(pcc_subspace);
+ return PTR_ERR(pcc_chan);
+ }
+
+ pcc_subspace->pcc_id = pcc_id;
+ pcc_subspace->pcc_chan = pcc_chan;
+ pcc_subspace->comm_addr = pcc_chan->shmem;
+ pcc_subspace->deadline_us = RAS2_NUM_RETRIES * pcc_chan->latency;
+ pcc_subspace->pcc_mrtt = pcc_chan->min_turnaround_time;
+ pcc_subspace->pcc_mpar = pcc_chan->max_access_rate;
+ pcc_subspace->mbox_client.knows_txdone = true;
+ pcc_subspace->pcc_chnl_acq = true;
+
+ ras2_ctx->pcc_subspace = pcc_subspace;
+ ras2_ctx->comm_addr = pcc_subspace->comm_addr;
+ ras2_ctx->dev = pcc_chan->mchan->mbox->dev;
+
+ mutex_init(&pcc_subspace->pcc_lock);
+ ras2_ctx->pcc_lock = &pcc_subspace->pcc_lock;
+
+ return 0;
+}
+
+static DEFINE_IDA(ras2_ida);
+static void ras2_release(struct device *device)
+{
+ struct auxiliary_device *auxdev = to_auxiliary_dev(device);
+ struct ras2_mem_ctx *ras2_ctx =
+ container_of(auxdev, struct ras2_mem_ctx, adev);
+
+ ida_free(&ras2_ida, auxdev->id);
+ ras2_list_pcc_release(ras2_ctx->pcc_subspace);
+ kfree(ras2_ctx);
+}
+
+static int ras2_add_aux_device(char *name, int channel, u32 pxm_inst)
+{
+ struct ras2_mem_ctx *ras2_ctx;
+ u64 start = 0, end = 0;
+ int id, rc;
+
+ ras2_ctx = kzalloc(sizeof(*ras2_ctx), GFP_KERNEL);
+ if (!ras2_ctx)
+ return -ENOMEM;
+
+ ras2_ctx->sys_comp_nid = pxm_to_node(pxm_inst);
+ rc = nid_get_mem_physaddr_range(ras2_ctx->sys_comp_nid,
+ &start, &end);
+ if (rc < 0 || (!start && !end)) {
+ pr_warn("Failed to find phy addr range for NUMA node(%u) rc=%d\n",
+ pxm_inst, rc);
+ goto ctx_free;
+ }
+ ras2_ctx->mem_base_addr = start;
+ ras2_ctx->mem_size = end - start;
+
+ rc = ras2_register_pcc_channel(ras2_ctx, channel);
+ if (rc < 0) {
+ pr_debug("Failed to register pcc channel rc=%d\n", rc);
+ goto ctx_free;
+ }
+
+ id = ida_alloc(&ras2_ida, GFP_KERNEL);
+ if (id < 0) {
+ rc = id;
+ goto ctx_free;
+ }
+
+ ras2_ctx->adev.id = id;
+ ras2_ctx->adev.name = RAS2_MEM_DEV_ID_NAME;
+ ras2_ctx->adev.dev.release = ras2_release;
+ ras2_ctx->adev.dev.parent = ras2_ctx->dev;
+
+ rc = auxiliary_device_init(&ras2_ctx->adev);
+ if (rc)
+ goto ida_free;
+
+ rc = auxiliary_device_add(&ras2_ctx->adev);
+ if (rc) {
+ auxiliary_device_uninit(&ras2_ctx->adev);
+ return rc;
+ }
+
+ return 0;
+
+ida_free:
+ ida_free(&ras2_ida, id);
+ctx_free:
+ kfree(ras2_ctx);
+
+ return rc;
+}
+
+static int acpi_ras2_parse(struct acpi_table_ras2 *ras2_tab)
+{
+ struct acpi_ras2_pcc_desc *pcc_desc_list;
+ int rc;
+ u16 i;
+
+ if (ras2_tab->header.length < sizeof(*ras2_tab)) {
+ pr_warn(FW_WARN "ACPI RAS2 table present but broken (too short #1)\n");
+ return -EINVAL;
+ }
+
+ if (!ras2_tab->num_pcc_descs) {
+ pr_warn(FW_WARN "No PCC descs in ACPI RAS2 table\n");
+ return -EINVAL;
+ }
+
+ pcc_desc_list = (struct acpi_ras2_pcc_desc *)(ras2_tab + 1);
+ for (i = 0; i < ras2_tab->num_pcc_descs; i++, pcc_desc_list++) {
+ if (pcc_desc_list->feature_type != RAS2_FEAT_TYPE_MEMORY)
+ continue;
+
+ rc = ras2_add_aux_device(RAS2_MEM_DEV_ID_NAME,
+ pcc_desc_list->channel_id,
+ pcc_desc_list->instance);
+ if (rc)
+ pr_warn("Failed to add RAS2 auxiliary device rc=%d\n", rc);
+ }
+
+ return 0;
+}
+
+void __init acpi_ras2_init(void)
+{
+ struct acpi_table_ras2 *ras2_tab;
+ acpi_status status;
+
+ status = acpi_get_table(ACPI_SIG_RAS2, 0,
+ (struct acpi_table_header **)&ras2_tab);
+ if (ACPI_FAILURE(status)) {
+ pr_err("Failed to get table, %s\n", acpi_format_exception(status));
+ return;
+ }
+
+ if (acpi_ras2_parse(ras2_tab))
+ pr_err("Failed to parse RAS2 table\n");
+
+ acpi_put_table((struct acpi_table_header *)ras2_tab);
+}
diff --git a/include/acpi/ras2.h b/include/acpi/ras2.h
new file mode 100644
index 000000000000..cb053b5f37e7
--- /dev/null
+++ b/include/acpi/ras2.h
@@ -0,0 +1,63 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * ACPI RAS2 driver header file
+ *
+ * Copyright (c) 2024-2025 HiSilicon Limited
+ */
+
+#ifndef _ACPI_RAS2_H
+#define _ACPI_RAS2_H
+
+#include <linux/acpi.h>
+#include <linux/auxiliary_bus.h>
+#include <linux/mailbox_client.h>
+#include <linux/mutex.h>
+#include <linux/types.h>
+
+struct device;
+
+/*
+ * ACPI spec 6.5 Table 5.82: PCC command codes used by
+ * RAS2 platform communication channel.
+ */
+#define PCC_CMD_EXEC_RAS2 0x01
+
+#define RAS2_AUX_DEV_NAME "ras2"
+#define RAS2_MEM_DEV_ID_NAME "acpi_ras2_mem"
+
+/**
+ * struct ras2_mem_ctx - Context for RAS2 memory features
+ * @adev: Auxiliary device object
+ * @comm_addr: Pointer to RAS2 PCC shared memory region
+ * @dev: Pointer to device backing struct mbox_controller for PCC
+ * @pcc_subspace: Pointer to local data structure for PCC communication
+ * @pcc_lock: Pointer to PCC lock to provide mutually exclusive access
+ * to PCC channel subspace
+ * @sys_comp_nid: Node ID of the system component that the RAS feature
+ * is associated with. See ACPI spec 6.5 Table 5.80: RAS2
+ * Platform Communication Channel Descriptor format,
+ * Field: Instance
+ * @mem_base_addr: Base of the lowest physical continuous memory range
+ * of the memory associated with the NUMA domain
+ * @mem_size: Size of the lowest physical continuous memory range
+ * of the memory associated with the NUMA domain
+ */
+struct ras2_mem_ctx {
+ struct auxiliary_device adev;
+ struct acpi_ras2_shmem __iomem *comm_addr;
+ struct device *dev;
+ void *pcc_subspace;
+ struct mutex *pcc_lock;
+ u32 sys_comp_nid;
+ u64 mem_base_addr;
+ u64 mem_size;
+};
+
+#ifdef CONFIG_ACPI_RAS2
+void __init acpi_ras2_init(void);
+int ras2_send_pcc_cmd(struct ras2_mem_ctx *ras2_ctx, u16 cmd);
+#else
+static inline void acpi_ras2_init(void) { }
+#endif
+
+#endif /* _ACPI_RAS2_H */
--
2.43.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* [PATCH v11 3/3] ras: mem: Add memory ACPI RAS2 driver
2025-08-12 14:26 [PATCH v11 0/3] ACPI: Add support for ACPI RAS2 feature table shiju.jose
2025-08-12 14:26 ` [PATCH v11 1/3] mm: Add support to retrieve physical address range of memory from the node ID shiju.jose
2025-08-12 14:26 ` [PATCH v11 2/3] ACPI:RAS2: Add ACPI RAS2 driver shiju.jose
@ 2025-08-12 14:26 ` shiju.jose
2025-08-19 20:12 ` [PATCH v11 0/3] ACPI: Add support for ACPI RAS2 feature table Daniel Ferguson
3 siblings, 0 replies; 13+ messages in thread
From: shiju.jose @ 2025-08-12 14:26 UTC (permalink / raw)
To: rafael, bp, akpm, rppt, dferguson, linux-edac, linux-acpi,
linux-mm, linux-doc, tony.luck, lenb, leo.duran, Yazen.Ghannam,
mchehab
Cc: jonathan.cameron, linuxarm, rientjes, jiaqiyan, Jon.Grimm,
dave.hansen, naoya.horiguchi, james.morse, jthoughton,
somasundaram.a, erdemaktas, pgonda, duenwen, gthelen, wschwartz,
wbs, nifan.cxl, tanxiaofei, prime.zeng, roberto.sassu,
kangkang.shen, wanghuiqiang, shiju.jose
From: Shiju Jose <shiju.jose@huawei.com>
The memory ACPI RAS2 auxiliary driver binds to the auxiliary device
added by the ACPI RAS2 table parser.
The driver uses a PCC subspace for communicating with the ACPI-compliant
platform.
According to the ACPI specification rev 6.5, section 5.2.21.1.1
RAS2 Platform Communication Channel Descriptor, “RAS2 supports multiple
PCC channels, where a channel is dedicated to a given component
instance.”
A device with the ACPI RAS2 scrub feature registers with the EDAC device
driver, which retrieves the scrub descriptor from the EDAC scrub module and
exposes the scrub control attributes for the RAS2 scrub instance to userspace
in /sys/bus/edac/devices/acpi_ras_memX/scrub0/.
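For example, with an acpi_ras_mem0 instance, a demand scrub of a memory
region can be driven from userspace as below (the full set of controls is
documented in the Documentation/edac/scrub.rst update in this patch):

# echo 0x300000 > /sys/bus/edac/devices/acpi_ras_mem0/scrub0/size
# echo 0x200000 > /sys/bus/edac/devices/acpi_ras_mem0/scrub0/addr
# cat /sys/bus/edac/devices/acpi_ras_mem0/scrub0/addr

Reading back 'addr' returns a non-zero value while the demand scrub is in
progress and 0 once it has finished.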
Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: Daniel Ferguson <danielf@os.amperecomputing.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
---
Documentation/edac/scrub.rst | 73 ++++++
drivers/ras/Kconfig | 11 +
drivers/ras/Makefile | 1 +
drivers/ras/acpi_ras2.c | 424 +++++++++++++++++++++++++++++++++++
include/acpi/ras2.h | 14 ++
5 files changed, 523 insertions(+)
create mode 100644 drivers/ras/acpi_ras2.c
diff --git a/Documentation/edac/scrub.rst b/Documentation/edac/scrub.rst
index 2cfa74fa1ffd..4c6ee84fb691 100644
--- a/Documentation/edac/scrub.rst
+++ b/Documentation/edac/scrub.rst
@@ -340,3 +340,76 @@ controller or platform when unexpectedly high error rates are detected.
Sysfs files for scrubbing are documented in
`Documentation/ABI/testing/sysfs-edac-ecs`
+
+3. ACPI RAS2 Hardware-based Memory Scrubbing
+
+3.1. On demand scrubbing for a specific memory region.
+
+3.1.1. Query the status of demand scrubbing
+
+Read back 'addr': non-zero means demand scrubbing is in progress, zero means the scrub has finished.
+
+# cat /sys/bus/edac/devices/acpi_ras_mem0/scrub0/addr
+
+0
+
+3.1.2. Query the device's default/current scrub cycle setting.
+
+ Applicable to both on-demand and background scrubbing.
+
+# cat /sys/bus/edac/devices/acpi_ras_mem0/scrub0/current_cycle_duration
+
+36000
+
+3.1.3. Query the range of device-supported scrub cycles for a memory region.
+
+# cat /sys/bus/edac/devices/acpi_ras_mem0/scrub0/min_cycle_duration
+
+3600
+
+# cat /sys/bus/edac/devices/acpi_ras_mem0/scrub0/max_cycle_duration
+
+86400
+
+3.1.4. Program scrubbing for the memory region in the RAS2 device to repeat
+every 43200 seconds (half a day).
+
+# echo 43200 > /sys/bus/edac/devices/acpi_ras_mem0/scrub0/current_cycle_duration
+
+3.1.5. Program address and size of the memory region to scrub
+
+Write 'size' of the memory region to scrub.
+
+# echo 0x300000 > /sys/bus/edac/devices/acpi_ras_mem0/scrub0/size
+
+Writing 'addr' starts demand scrubbing; please make sure the other attributes
+are set prior to that.
+
+# echo 0x200000 > /sys/bus/edac/devices/acpi_ras_mem0/scrub0/addr
+
+Read back 'addr': non-zero means demand scrubbing is in progress, zero means the scrub has finished.
+
+# cat /sys/bus/edac/devices/acpi_ras_mem0/scrub0/addr
+
+0x200000
+
+# cat /sys/bus/edac/devices/acpi_ras_mem0/scrub0/addr
+
+0
+
+3.2. Background scrubbing the entire memory
+
+3.2.1. Query the status of background scrubbing.
+
+# cat /sys/bus/edac/devices/acpi_ras_mem0/scrub0/enable_background
+
+0
+
+3.2.2. Program background scrubbing for the RAS2 device to repeat every 21600
+seconds (a quarter of a day).
+
+# echo 21600 > /sys/bus/edac/devices/acpi_ras_mem0/scrub0/current_cycle_duration
+
+3.2.3. Start 'background scrubbing'.
+
+# echo 1 > /sys/bus/edac/devices/acpi_ras_mem0/scrub0/enable_background
diff --git a/drivers/ras/Kconfig b/drivers/ras/Kconfig
index fc4f4bb94a4c..a88002f1f462 100644
--- a/drivers/ras/Kconfig
+++ b/drivers/ras/Kconfig
@@ -46,4 +46,15 @@ config RAS_FMPM
Memory will be retired during boot time and run time depending on
platform-specific policies.
+config MEM_ACPI_RAS2
+ tristate "Memory ACPI RAS2 driver"
+ depends on ACPI_RAS2
+ depends on EDAC
+ depends on EDAC_SCRUB
+ help
+ The driver binds to the auxiliary device added by the ACPI RAS2
+ table parser. It uses a PCC channel subspace for communicating with
+ the ACPI-compliant platform to provide control of memory scrub
+ parameters to the user via the EDAC scrub interface.
+
endif
diff --git a/drivers/ras/Makefile b/drivers/ras/Makefile
index 11f95d59d397..a0e6e903d6b0 100644
--- a/drivers/ras/Makefile
+++ b/drivers/ras/Makefile
@@ -2,6 +2,7 @@
obj-$(CONFIG_RAS) += ras.o
obj-$(CONFIG_DEBUG_FS) += debugfs.o
obj-$(CONFIG_RAS_CEC) += cec.o
+obj-$(CONFIG_MEM_ACPI_RAS2) += acpi_ras2.o
obj-$(CONFIG_RAS_FMPM) += amd/fmpm.o
obj-y += amd/atl/
diff --git a/drivers/ras/acpi_ras2.c b/drivers/ras/acpi_ras2.c
new file mode 100644
index 000000000000..3971653b477a
--- /dev/null
+++ b/drivers/ras/acpi_ras2.c
@@ -0,0 +1,424 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * ACPI RAS2 memory driver
+ *
+ * Copyright (c) 2024-2025 HiSilicon Limited.
+ *
+ */
+
+#define pr_fmt(fmt) "ACPI RAS2 MEMORY: " fmt
+
+#include <linux/bitfield.h>
+#include <linux/edac.h>
+#include <linux/platform_device.h>
+#include <acpi/ras2.h>
+
+#define RAS2_SUPPORT_HW_PARTOL_SCRUB BIT(0)
+#define RAS2_TYPE_PATROL_SCRUB 0x0000
+
+#define RAS2_GET_PATROL_PARAMETERS 0x01
+#define RAS2_START_PATROL_SCRUBBER 0x02
+#define RAS2_STOP_PATROL_SCRUBBER 0x03
+
+/*
+ * RAS2 patrol scrub
+ */
+#define RAS2_PS_SC_HRS_IN_MASK GENMASK(15, 8)
+#define RAS2_PS_EN_BACKGROUND BIT(0)
+#define RAS2_PS_SC_HRS_OUT_MASK GENMASK(7, 0)
+#define RAS2_PS_MIN_SC_HRS_OUT_MASK GENMASK(15, 8)
+#define RAS2_PS_MAX_SC_HRS_OUT_MASK GENMASK(23, 16)
+#define RAS2_PS_FLAG_SCRUB_RUNNING BIT(0)
+
+#define RAS2_SCRUB_NAME_LEN 128
+#define RAS2_HOUR_IN_SECS 3600
+
+enum ras2_od_scrub_status {
+ OD_SCRUB_STS_IDLE,
+ OD_SCRUB_STS_INIT,
+ OD_SCRUB_STS_ACTIVE,
+};
+
+struct acpi_ras2_ps_shared_mem {
+ struct acpi_ras2_shmem common;
+ struct acpi_ras2_patrol_scrub_param params;
+};
+
+#define TO_ACPI_RAS2_PS_SHMEM(_addr) \
+ container_of(_addr, struct acpi_ras2_ps_shared_mem, common)
+
+static int ras2_is_patrol_scrub_support(struct ras2_mem_ctx *ras2_ctx)
+{
+ struct acpi_ras2_shmem __iomem *common = (void *)ras2_ctx->comm_addr;
+
+ guard(mutex)(ras2_ctx->pcc_lock);
+ common->set_caps[0] = 0;
+
+ return common->features[0] & RAS2_SUPPORT_HW_PARTOL_SCRUB;
+}
+
+static int ras2_update_patrol_scrub_params_cache(struct ras2_mem_ctx *ras2_ctx)
+{
+ struct acpi_ras2_ps_shared_mem __iomem *ps_sm =
+ TO_ACPI_RAS2_PS_SHMEM(ras2_ctx->comm_addr);
+ int ret;
+
+ ps_sm->common.set_caps[0] = RAS2_SUPPORT_HW_PARTOL_SCRUB;
+ ps_sm->params.command = RAS2_GET_PATROL_PARAMETERS;
+ ps_sm->params.req_addr_range[0] = ras2_ctx->mem_base_addr;
+ ps_sm->params.req_addr_range[1] = ras2_ctx->mem_size;
+ ret = ras2_send_pcc_cmd(ras2_ctx, PCC_CMD_EXEC_RAS2);
+ if (ret) {
+ dev_err(ras2_ctx->dev, "failed to read parameters\n");
+ return ret;
+ }
+
+ ras2_ctx->min_scrub_cycle = FIELD_GET(RAS2_PS_MIN_SC_HRS_OUT_MASK,
+ ps_sm->params.scrub_params_out);
+ ras2_ctx->max_scrub_cycle = FIELD_GET(RAS2_PS_MAX_SC_HRS_OUT_MASK,
+ ps_sm->params.scrub_params_out);
+ ras2_ctx->scrub_cycle_hrs = FIELD_GET(RAS2_PS_SC_HRS_OUT_MASK,
+ ps_sm->params.scrub_params_out);
+ if (ras2_ctx->bg_scrub) {
+ ras2_ctx->base = 0;
+ ras2_ctx->size = 0;
+ ras2_ctx->od_scrub_sts = OD_SCRUB_STS_IDLE;
+ return 0;
+ }
+
+ if (ps_sm->params.flags & RAS2_PS_FLAG_SCRUB_RUNNING) {
+ ras2_ctx->base = ps_sm->params.actl_addr_range[0];
+ ras2_ctx->size = ps_sm->params.actl_addr_range[1];
+ } else if (ras2_ctx->od_scrub_sts != OD_SCRUB_STS_INIT) {
+ /*
+ * When demand scrubbing is finished driver resets actual
+ * address range to 0 when readback. Otherwise userspace
+ * assumes demand scrubbing is in progress.
+ */
+ ras2_ctx->base = 0;
+ ras2_ctx->size = 0;
+ ras2_ctx->od_scrub_sts = OD_SCRUB_STS_IDLE;
+ }
+
+ return 0;
+}
+
+/* Context - PCC lock must be held */
+static int ras2_get_patrol_scrub_running(struct ras2_mem_ctx *ras2_ctx,
+ bool *running)
+{
+ struct acpi_ras2_ps_shared_mem __iomem *ps_sm =
+ TO_ACPI_RAS2_PS_SHMEM(ras2_ctx->comm_addr);
+ int ret;
+
+ ps_sm->common.set_caps[0] = RAS2_SUPPORT_HW_PARTOL_SCRUB;
+ ps_sm->params.command = RAS2_GET_PATROL_PARAMETERS;
+ ps_sm->params.req_addr_range[0] = ras2_ctx->mem_base_addr;
+ ps_sm->params.req_addr_range[1] = ras2_ctx->mem_size;
+
+ ret = ras2_send_pcc_cmd(ras2_ctx, PCC_CMD_EXEC_RAS2);
+ if (ret) {
+ dev_err(ras2_ctx->dev, "failed to read parameters\n");
+ return ret;
+ }
+
+ *running = ps_sm->params.flags & RAS2_PS_FLAG_SCRUB_RUNNING;
+
+ return 0;
+}
+
+static int ras2_hw_scrub_read_min_scrub_cycle(struct device *dev, void *drv_data,
+ u32 *min)
+{
+ struct ras2_mem_ctx *ras2_ctx = drv_data;
+
+ *min = ras2_ctx->min_scrub_cycle * RAS2_HOUR_IN_SECS;
+
+ return 0;
+}
+
+static int ras2_hw_scrub_read_max_scrub_cycle(struct device *dev, void *drv_data,
+ u32 *max)
+{
+ struct ras2_mem_ctx *ras2_ctx = drv_data;
+
+ *max = ras2_ctx->max_scrub_cycle * RAS2_HOUR_IN_SECS;
+
+ return 0;
+}
+
+static int ras2_hw_scrub_cycle_read(struct device *dev, void *drv_data,
+ u32 *scrub_cycle_secs)
+{
+ struct ras2_mem_ctx *ras2_ctx = drv_data;
+
+ *scrub_cycle_secs = ras2_ctx->scrub_cycle_hrs * RAS2_HOUR_IN_SECS;
+
+ return 0;
+}
+
+static int ras2_hw_scrub_cycle_write(struct device *dev, void *drv_data,
+ u32 scrub_cycle_secs)
+{
+ u8 scrub_cycle_hrs = scrub_cycle_secs / RAS2_HOUR_IN_SECS;
+ struct ras2_mem_ctx *ras2_ctx = drv_data;
+ bool running;
+ int ret;
+
+ guard(mutex)(ras2_ctx->pcc_lock);
+ ret = ras2_get_patrol_scrub_running(ras2_ctx, &running);
+ if (ret)
+ return ret;
+
+ if (running)
+ return -EBUSY;
+
+ if (scrub_cycle_hrs < ras2_ctx->min_scrub_cycle ||
+ scrub_cycle_hrs > ras2_ctx->max_scrub_cycle)
+ return -EINVAL;
+
+ ras2_ctx->scrub_cycle_hrs = scrub_cycle_hrs;
+
+ return 0;
+}
+
+static int ras2_hw_scrub_read_addr(struct device *dev, void *drv_data, u64 *base)
+{
+ struct ras2_mem_ctx *ras2_ctx = drv_data;
+ int ret;
+
+ /*
+ * When BG scrubbing is enabled the actual address range is not valid.
+ * Return -EBUSY now unless find out a method to retrieve actual full PA range.
+ */
+ if (ras2_ctx->bg_scrub)
+ return -EBUSY;
+
+ ret = ras2_update_patrol_scrub_params_cache(ras2_ctx);
+ if (ret)
+ return ret;
+
+ *base = ras2_ctx->base;
+
+ return 0;
+}
+
+static int ras2_hw_scrub_read_size(struct device *dev, void *drv_data, u64 *size)
+{
+ struct ras2_mem_ctx *ras2_ctx = drv_data;
+ int ret;
+
+ if (ras2_ctx->bg_scrub)
+ return -EBUSY;
+
+ ret = ras2_update_patrol_scrub_params_cache(ras2_ctx);
+ if (ret)
+ return ret;
+
+ *size = ras2_ctx->size;
+
+ return 0;
+}
+
+static int ras2_hw_scrub_write_addr(struct device *dev, void *drv_data, u64 base)
+{
+ struct ras2_mem_ctx *ras2_ctx = drv_data;
+ struct acpi_ras2_ps_shared_mem __iomem *ps_sm =
+ TO_ACPI_RAS2_PS_SHMEM(ras2_ctx->comm_addr);
+ bool running;
+ int ret;
+
+ if (ras2_ctx->bg_scrub)
+ return -EBUSY;
+
+ guard(mutex)(ras2_ctx->pcc_lock);
+ ps_sm->common.set_caps[0] = RAS2_SUPPORT_HW_PARTOL_SCRUB;
+
+ if (!ras2_ctx->size || ras2_ctx->size > ras2_ctx->mem_size ||
+ base < ras2_ctx->mem_base_addr ||
+ (base + ras2_ctx->size) >
+ (ras2_ctx->mem_base_addr + ras2_ctx->mem_size)) {
+ dev_warn(dev,
+ "%s: Invalid address range, base=0x%llx size=0x%llx\n",
+ __func__, base, ras2_ctx->size);
+ return -ERANGE;
+ }
+
+ ret = ras2_get_patrol_scrub_running(ras2_ctx, &running);
+ if (ret)
+ return ret;
+
+ if (running)
+ return -EBUSY;
+
+ ras2_ctx->base = base;
+ ps_sm->params.scrub_params_in &= ~RAS2_PS_SC_HRS_IN_MASK;
+ ps_sm->params.scrub_params_in |= FIELD_PREP(RAS2_PS_SC_HRS_IN_MASK,
+ ras2_ctx->scrub_cycle_hrs);
+ ps_sm->params.req_addr_range[0] = ras2_ctx->base;
+ ps_sm->params.req_addr_range[1] = ras2_ctx->size;
+ ps_sm->params.scrub_params_in &= ~RAS2_PS_EN_BACKGROUND;
+ ps_sm->params.command = RAS2_START_PATROL_SCRUBBER;
+
+ ret = ras2_send_pcc_cmd(ras2_ctx, PCC_CMD_EXEC_RAS2);
+ if (ret) {
+ dev_err(dev, "Failed to start demand scrubbing rc(%d)\n", ret);
+ if (ret != -EBUSY) {
+ ps_sm->params.req_addr_range[0] = 0;
+ ps_sm->params.req_addr_range[1] = 0;
+ ras2_ctx->base = 0;
+ ras2_ctx->size = 0;
+ ras2_ctx->od_scrub_sts = OD_SCRUB_STS_IDLE;
+ }
+ return ret;
+ }
+ ras2_ctx->od_scrub_sts = OD_SCRUB_STS_ACTIVE;
+
+ return ras2_update_patrol_scrub_params_cache(ras2_ctx);
+}
+
+static int ras2_hw_scrub_write_size(struct device *dev, void *drv_data, u64 size)
+{
+ struct ras2_mem_ctx *ras2_ctx = drv_data;
+ bool running;
+ int ret;
+
+ if (!size) {
+ dev_warn(dev, "%s: Invalid address range size=0x%llx\n",
+ __func__, size);
+ return -EINVAL;
+ }
+
+ if (ras2_ctx->bg_scrub)
+ return -EBUSY;
+
+ guard(mutex)(ras2_ctx->pcc_lock);
+ ret = ras2_get_patrol_scrub_running(ras2_ctx, &running);
+ if (ret)
+ return ret;
+
+ if (running)
+ return -EBUSY;
+
+ ras2_ctx->size = size;
+ ras2_ctx->od_scrub_sts = OD_SCRUB_STS_INIT;
+
+ return 0;
+}
+
+static int ras2_hw_scrub_set_enabled_bg(struct device *dev, void *drv_data, bool enable)
+{
+ struct ras2_mem_ctx *ras2_ctx = drv_data;
+ struct acpi_ras2_ps_shared_mem __iomem *ps_sm =
+ TO_ACPI_RAS2_PS_SHMEM(ras2_ctx->comm_addr);
+ bool running;
+ int ret;
+
+ guard(mutex)(ras2_ctx->pcc_lock);
+ ps_sm->common.set_caps[0] = RAS2_SUPPORT_HW_PARTOL_SCRUB;
+ ret = ras2_get_patrol_scrub_running(ras2_ctx, &running);
+ if (ret)
+ return ret;
+
+ if (enable) {
+ if (ras2_ctx->bg_scrub || running)
+ return -EBUSY;
+ ps_sm->params.req_addr_range[0] = 0;
+ ps_sm->params.req_addr_range[1] = 0;
+ ps_sm->params.scrub_params_in &= ~RAS2_PS_SC_HRS_IN_MASK;
+ ps_sm->params.scrub_params_in |= FIELD_PREP(RAS2_PS_SC_HRS_IN_MASK,
+ ras2_ctx->scrub_cycle_hrs);
+ ps_sm->params.command = RAS2_START_PATROL_SCRUBBER;
+ } else {
+ if (!ras2_ctx->bg_scrub)
+ return -EPERM;
+ ps_sm->params.command = RAS2_STOP_PATROL_SCRUBBER;
+ }
+
+ ps_sm->params.scrub_params_in &= ~RAS2_PS_EN_BACKGROUND;
+ ps_sm->params.scrub_params_in |= FIELD_PREP(RAS2_PS_EN_BACKGROUND,
+ enable);
+ ret = ras2_send_pcc_cmd(ras2_ctx, PCC_CMD_EXEC_RAS2);
+ if (ret) {
+ dev_err(dev, "Failed to %s background scrubbing\n",
+ str_enable_disable(enable));
+ return ret;
+ }
+
+ if (enable) {
+ ras2_ctx->bg_scrub = true;
+ /* Update the cache to account for rounding of supplied parameters and similar */
+ ret = ras2_update_patrol_scrub_params_cache(ras2_ctx);
+ } else {
+ ret = ras2_update_patrol_scrub_params_cache(ras2_ctx);
+ ras2_ctx->bg_scrub = false;
+ }
+
+ return ret;
+}
+
+static int ras2_hw_scrub_get_enabled_bg(struct device *dev, void *drv_data, bool *enabled)
+{
+ struct ras2_mem_ctx *ras2_ctx = drv_data;
+
+ *enabled = ras2_ctx->bg_scrub;
+
+ return 0;
+}
+
+static const struct edac_scrub_ops ras2_scrub_ops = {
+ .read_addr = ras2_hw_scrub_read_addr,
+ .read_size = ras2_hw_scrub_read_size,
+ .write_addr = ras2_hw_scrub_write_addr,
+ .write_size = ras2_hw_scrub_write_size,
+ .get_enabled_bg = ras2_hw_scrub_get_enabled_bg,
+ .set_enabled_bg = ras2_hw_scrub_set_enabled_bg,
+ .get_min_cycle = ras2_hw_scrub_read_min_scrub_cycle,
+ .get_max_cycle = ras2_hw_scrub_read_max_scrub_cycle,
+ .get_cycle_duration = ras2_hw_scrub_cycle_read,
+ .set_cycle_duration = ras2_hw_scrub_cycle_write,
+};
+
+static int ras2_probe(struct auxiliary_device *auxdev,
+ const struct auxiliary_device_id *id)
+{
+ struct ras2_mem_ctx *ras2_ctx = container_of(auxdev, struct ras2_mem_ctx, adev);
+ struct edac_dev_feature ras_features;
+ char scrub_name[RAS2_SCRUB_NAME_LEN];
+ int ret;
+
+ if (!ras2_is_patrol_scrub_support(ras2_ctx))
+ return -EOPNOTSUPP;
+
+ ret = ras2_update_patrol_scrub_params_cache(ras2_ctx);
+ if (ret)
+ return ret;
+
+ sprintf(scrub_name, "acpi_ras_mem%d", auxdev->id);
+
+ ras_features.ft_type = RAS_FEAT_SCRUB;
+ ras_features.instance = 0;
+ ras_features.scrub_ops = &ras2_scrub_ops;
+ ras_features.ctx = ras2_ctx;
+
+ return edac_dev_register(&auxdev->dev, scrub_name, NULL, 1,
+ &ras_features);
+}
+
+static const struct auxiliary_device_id ras2_mem_dev_id_table[] = {
+ { .name = RAS2_AUX_DEV_NAME "." RAS2_MEM_DEV_ID_NAME, },
+ { }
+};
+
+MODULE_DEVICE_TABLE(auxiliary, ras2_mem_dev_id_table);
+
+static struct auxiliary_driver ras2_mem_driver = {
+ .name = RAS2_MEM_DEV_ID_NAME,
+ .probe = ras2_probe,
+ .id_table = ras2_mem_dev_id_table,
+};
+module_auxiliary_driver(ras2_mem_driver);
+
+MODULE_IMPORT_NS("ACPI_RAS2");
+MODULE_DESCRIPTION("ACPI RAS2 memory driver");
+MODULE_LICENSE("GPL");
diff --git a/include/acpi/ras2.h b/include/acpi/ras2.h
index cb053b5f37e7..fe7b7c454ac8 100644
--- a/include/acpi/ras2.h
+++ b/include/acpi/ras2.h
@@ -41,6 +41,13 @@ struct device;
* of the memory associated with the NUMA domain
* @mem_size: Size of the lowest physical continuous memory range
* of the memory associated with the NUMA domain
+ * @base: Base address of the memory region to scrub
+ * @size: Size of the memory region to scrub
+ * @scrub_cycle_hrs: Current scrub rate in hours
+ * @min_scrub_cycle: Minimum scrub rate supported
+ * @max_scrub_cycle: Maximum scrub rate supported
+ * @od_scrub_sts: Status of demand scrubbing (memory region)
+ * @bg_scrub: Status of background patrol scrubbing
*/
struct ras2_mem_ctx {
struct auxiliary_device adev;
@@ -51,6 +58,13 @@ struct ras2_mem_ctx {
u32 sys_comp_nid;
u64 mem_base_addr;
u64 mem_size;
+ u64 base;
+ u64 size;
+ u8 scrub_cycle_hrs;
+ u8 min_scrub_cycle;
+ u8 max_scrub_cycle;
+ u8 od_scrub_sts;
+ bool bg_scrub;
};
#ifdef CONFIG_ACPI_RAS2
--
2.43.0
^ permalink raw reply related [flat|nested] 13+ messages in thread
* Re: [PATCH v11 1/3] mm: Add support to retrieve physical address range of memory from the node ID
2025-08-12 14:26 ` [PATCH v11 1/3] mm: Add support to retrieve physical address range of memory from the node ID shiju.jose
@ 2025-08-19 16:54 ` Jonathan Cameron
2025-08-20 7:34 ` Mike Rapoport
0 siblings, 1 reply; 13+ messages in thread
From: Jonathan Cameron @ 2025-08-19 16:54 UTC (permalink / raw)
To: shiju.jose
Cc: rafael, bp, akpm, rppt, dferguson, linux-edac, linux-acpi,
linux-mm, linux-doc, tony.luck, lenb, leo.duran, Yazen.Ghannam,
mchehab, linuxarm, rientjes, jiaqiyan, Jon.Grimm, dave.hansen,
naoya.horiguchi, james.morse, jthoughton, somasundaram.a,
erdemaktas, pgonda, duenwen, gthelen, wschwartz, wbs, nifan.cxl,
tanxiaofei, prime.zeng, roberto.sassu, kangkang.shen,
wanghuiqiang
On Tue, 12 Aug 2025 15:26:13 +0100
<shiju.jose@huawei.com> wrote:
> From: Shiju Jose <shiju.jose@huawei.com>
>
> In the numa_memblks, a lookup facility is required to retrieve the
> physical address range of memory in a NUMA node. ACPI RAS2 memory
> features are among the use cases.
>
> Suggested-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Looks fine to me. Mike, what do you think?
One passing comment inline.
Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> ---
> include/linux/numa.h | 9 +++++++++
> include/linux/numa_memblks.h | 2 ++
> mm/numa.c | 10 ++++++++++
> mm/numa_memblks.c | 37 ++++++++++++++++++++++++++++++++++++
> 4 files changed, 58 insertions(+)
>
> diff --git a/include/linux/numa.h b/include/linux/numa.h
> index e6baaf6051bc..1d1aabebd26b 100644
> --- a/include/linux/numa.h
> +++ b/include/linux/numa.h
> @@ -41,6 +41,10 @@ int memory_add_physaddr_to_nid(u64 start);
> int phys_to_target_node(u64 start);
> #endif
>
> +#ifndef nid_get_mem_physaddr_range
> +int nid_get_mem_physaddr_range(int nid, u64 *start, u64 *end);
> +#endif
> +
> int numa_fill_memblks(u64 start, u64 end);
>
> #else /* !CONFIG_NUMA */
> @@ -63,6 +67,11 @@ static inline int phys_to_target_node(u64 start)
> return 0;
> }
>
> +static inline int nid_get_mem_physaddr_range(int nid, u64 *start, u64 *end)
> +{
> + return 0;
> +}
> +
> static inline void alloc_offline_node_data(int nid) {}
> #endif
>
> diff --git a/include/linux/numa_memblks.h b/include/linux/numa_memblks.h
> index 991076cba7c5..7b32d96d0134 100644
> --- a/include/linux/numa_memblks.h
> +++ b/include/linux/numa_memblks.h
> @@ -55,6 +55,8 @@ extern int phys_to_target_node(u64 start);
> #define phys_to_target_node phys_to_target_node
> extern int memory_add_physaddr_to_nid(u64 start);
> #define memory_add_physaddr_to_nid memory_add_physaddr_to_nid
> +extern int nid_get_mem_physaddr_range(int nid, u64 *start, u64 *end);
> +#define nid_get_mem_physaddr_range nid_get_mem_physaddr_range
> #endif /* CONFIG_NUMA_KEEP_MEMINFO */
>
> #endif /* CONFIG_NUMA_MEMBLKS */
> diff --git a/mm/numa.c b/mm/numa.c
> index 7d5e06fe5bd4..5335af1fefee 100644
> --- a/mm/numa.c
> +++ b/mm/numa.c
> @@ -59,3 +59,13 @@ int phys_to_target_node(u64 start)
> }
> EXPORT_SYMBOL_GPL(phys_to_target_node);
> #endif
> +
> +#ifndef nid_get_mem_physaddr_range
> +int nid_get_mem_physaddr_range(int nid, u64 *start, u64 *end)
> +{
> + pr_info_once("Unknown target phys addr range for node=%d\n", nid);
> +
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(nid_get_mem_physaddr_range);
> +#endif
> diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c
> index 541a99c4071a..e1e56b7a3499 100644
> --- a/mm/numa_memblks.c
> +++ b/mm/numa_memblks.c
> @@ -590,4 +590,41 @@ int memory_add_physaddr_to_nid(u64 start)
> }
> EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
>
> +/**
> + * nid_get_mem_physaddr_range - Get the physical address range
> + * of the memblk in the NUMA node.
> + * @nid: NUMA node ID of the memblk
> + * @start: Start address of the memblk
> + * @end: End address of the memblk
> + *
> + * Find the lowest contiguous physical memory address range of the memblk
> + * in the NUMA node with the given nid and return the start and end
> + * addresses.
> + *
> + * RETURNS:
> + * 0 on success, -errno on failure.
> + */
> +int nid_get_mem_physaddr_range(int nid, u64 *start, u64 *end)
> +{
> + struct numa_meminfo *mi = &numa_meminfo;
> + int i;
> +
> + if (!numa_valid_node(nid))
> + return -EINVAL;
> +
> + for (i = 0; i < mi->nr_blks; i++) {
> + if (mi->blk[i].nid == nid) {
> + *start = mi->blk[i].start;
> + /*
> + * Assumption: mi->blk[i].end is the last address
> + * in the range + 1.
This was my fault for asking on internal review if this was documented
anywhere. It's kind of implicitly obvious when reading numa_memblk.c
because there are a bunch of end - 1 prints.
So can probably drop this comment.
> + */
> + *end = mi->blk[i].end;
> + return 0;
> + }
> + }
> +
> + return -ENODEV;
> +}
> +EXPORT_SYMBOL_GPL(nid_get_mem_physaddr_range);
> #endif /* CONFIG_NUMA_KEEP_MEMINFO */
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v11 0/3] ACPI: Add support for ACPI RAS2 feature table
2025-08-12 14:26 [PATCH v11 0/3] ACPI: Add support for ACPI RAS2 feature table shiju.jose
` (2 preceding siblings ...)
2025-08-12 14:26 ` [PATCH v11 3/3] ras: mem: Add memory " shiju.jose
@ 2025-08-19 20:12 ` Daniel Ferguson
3 siblings, 0 replies; 13+ messages in thread
From: Daniel Ferguson @ 2025-08-19 20:12 UTC (permalink / raw)
To: shiju.jose, rafael, bp, akpm, rppt, dferguson, linux-edac,
linux-acpi, linux-mm, linux-doc, tony.luck, lenb, leo.duran,
Yazen.Ghannam, mchehab
Cc: jonathan.cameron, linuxarm, rientjes, jiaqiyan, Jon.Grimm,
dave.hansen, naoya.horiguchi, james.morse, jthoughton,
somasundaram.a, erdemaktas, pgonda, duenwen, gthelen, wschwartz,
wbs, nifan.cxl, tanxiaofei, prime.zeng, roberto.sassu,
kangkang.shen, wanghuiqiang
> Changes
> =======
> v10 -> v11:
> 1. Simplified code by removing workarounds previously added to support
> non-compliant case of single PCC channel shared across all proximity
> domains (which is no longer required).
> https://lore.kernel.org/all/f5b28977-0b80-4c39-929b-cf02ab1efb97@os.amperecomputing.com/
I've tested everything again, and it works.
Thanks,
Daniel
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v11 1/3] mm: Add support to retrieve physical address range of memory from the node ID
2025-08-19 16:54 ` Jonathan Cameron
@ 2025-08-20 7:34 ` Mike Rapoport
2025-08-20 8:54 ` Jonathan Cameron
0 siblings, 1 reply; 13+ messages in thread
From: Mike Rapoport @ 2025-08-20 7:34 UTC (permalink / raw)
To: Jonathan Cameron
Cc: shiju.jose, rafael, bp, akpm, dferguson, linux-edac, linux-acpi,
linux-mm, linux-doc, tony.luck, lenb, leo.duran, Yazen.Ghannam,
mchehab, linuxarm, rientjes, jiaqiyan, Jon.Grimm, dave.hansen,
naoya.horiguchi, james.morse, jthoughton, somasundaram.a,
erdemaktas, pgonda, duenwen, gthelen, wschwartz, wbs, nifan.cxl,
tanxiaofei, prime.zeng, roberto.sassu, kangkang.shen,
wanghuiqiang
On Tue, Aug 19, 2025 at 05:54:20PM +0100, Jonathan Cameron wrote:
> On Tue, 12 Aug 2025 15:26:13 +0100
> <shiju.jose@huawei.com> wrote:
>
> > From: Shiju Jose <shiju.jose@huawei.com>
> >
> > In the numa_memblks, a lookup facility is required to retrieve the
> > physical address range of memory in a NUMA node. ACPI RAS2 memory
> > features are among the use cases.
> >
> > Suggested-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> > Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
>
> Looks fine to me. Mike, what do you think?
I still don't see why we can't use existing functions like
get_pfn_range_for_nid() or memblock_search_pfn_nid().
Or even node_start_pfn() and node_spanned_pages().
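E.g., a rough sketch of that variant (this returns the node's spanned
range, which may include holes, rather than the lowest contiguous memblk):

	u64 start = PFN_PHYS(node_start_pfn(nid));
	u64 end = PFN_PHYS(node_start_pfn(nid) + node_spanned_pages(nid));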
> One passing comment inline.
>
> Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
>
> > ---
> > include/linux/numa.h | 9 +++++++++
> > include/linux/numa_memblks.h | 2 ++
> > mm/numa.c | 10 ++++++++++
> > mm/numa_memblks.c | 37 ++++++++++++++++++++++++++++++++++++
> > 4 files changed, 58 insertions(+)
> >
> > diff --git a/include/linux/numa.h b/include/linux/numa.h
> > index e6baaf6051bc..1d1aabebd26b 100644
> > --- a/include/linux/numa.h
> > +++ b/include/linux/numa.h
> > @@ -41,6 +41,10 @@ int memory_add_physaddr_to_nid(u64 start);
> > int phys_to_target_node(u64 start);
> > #endif
> >
> > +#ifndef nid_get_mem_physaddr_range
> > +int nid_get_mem_physaddr_range(int nid, u64 *start, u64 *end);
> > +#endif
> > +
> > int numa_fill_memblks(u64 start, u64 end);
> >
> > #else /* !CONFIG_NUMA */
> > @@ -63,6 +67,11 @@ static inline int phys_to_target_node(u64 start)
> > return 0;
> > }
> >
> > +static inline int nid_get_mem_physaddr_range(int nid, u64 *start, u64 *end)
> > +{
> > + return 0;
> > +}
> > +
> > static inline void alloc_offline_node_data(int nid) {}
> > #endif
> >
> > diff --git a/include/linux/numa_memblks.h b/include/linux/numa_memblks.h
> > index 991076cba7c5..7b32d96d0134 100644
> > --- a/include/linux/numa_memblks.h
> > +++ b/include/linux/numa_memblks.h
> > @@ -55,6 +55,8 @@ extern int phys_to_target_node(u64 start);
> > #define phys_to_target_node phys_to_target_node
> > extern int memory_add_physaddr_to_nid(u64 start);
> > #define memory_add_physaddr_to_nid memory_add_physaddr_to_nid
> > +extern int nid_get_mem_physaddr_range(int nid, u64 *start, u64 *end);
> > +#define nid_get_mem_physaddr_range nid_get_mem_physaddr_range
> > #endif /* CONFIG_NUMA_KEEP_MEMINFO */
> >
> > #endif /* CONFIG_NUMA_MEMBLKS */
> > diff --git a/mm/numa.c b/mm/numa.c
> > index 7d5e06fe5bd4..5335af1fefee 100644
> > --- a/mm/numa.c
> > +++ b/mm/numa.c
> > @@ -59,3 +59,13 @@ int phys_to_target_node(u64 start)
> > }
> > EXPORT_SYMBOL_GPL(phys_to_target_node);
> > #endif
> > +
> > +#ifndef nid_get_mem_physaddr_range
> > +int nid_get_mem_physaddr_range(int nid, u64 *start, u64 *end)
> > +{
> > + pr_info_once("Unknown target phys addr range for node=%d\n", nid);
> > +
> > + return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(nid_get_mem_physaddr_range);
> > +#endif
> > diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c
> > index 541a99c4071a..e1e56b7a3499 100644
> > --- a/mm/numa_memblks.c
> > +++ b/mm/numa_memblks.c
> > @@ -590,4 +590,41 @@ int memory_add_physaddr_to_nid(u64 start)
> > }
> > EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> >
> > +/**
> > + * nid_get_mem_physaddr_range - Get the physical address range
> > + * of the memblk in the NUMA node.
> > + * @nid: NUMA node ID of the memblk
> > + * @start: Start address of the memblk
> > + * @end: End address of the memblk
> > + *
> > + * Find the lowest contiguous physical memory address range of the memblk
> > + * in the NUMA node with the given nid and return the start and end
> > + * addresses.
> > + *
> > + * RETURNS:
> > + * 0 on success, -errno on failure.
> > + */
> > +int nid_get_mem_physaddr_range(int nid, u64 *start, u64 *end)
> > +{
> > + struct numa_meminfo *mi = &numa_meminfo;
> > + int i;
> > +
> > + if (!numa_valid_node(nid))
> > + return -EINVAL;
> > +
> > + for (i = 0; i < mi->nr_blks; i++) {
> > + if (mi->blk[i].nid == nid) {
> > + *start = mi->blk[i].start;
> > + /*
> > + * Assumption: mi->blk[i].end is the last address
> > + * in the range + 1.
>
> This was my fault for asking on internal review if this was documented
> anywhere. It's kind of implicitly obvious when reading numa_memblk.c
> because there are a bunch of end - 1 prints.
> So can probably drop this comment.
>
> > + */
> > + *end = mi->blk[i].end;
> > + return 0;
> > + }
> > + }
> > +
> > + return -ENODEV;
> > +}
> > +EXPORT_SYMBOL_GPL(nid_get_mem_physaddr_range);
> > #endif /* CONFIG_NUMA_KEEP_MEMINFO */
>
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v11 1/3] mm: Add support to retrieve physical address range of memory from the node ID
2025-08-20 7:34 ` Mike Rapoport
@ 2025-08-20 8:54 ` Jonathan Cameron
2025-08-20 10:00 ` Shiju Jose
0 siblings, 1 reply; 13+ messages in thread
From: Jonathan Cameron @ 2025-08-20 8:54 UTC (permalink / raw)
To: Mike Rapoport
Cc: shiju.jose, rafael, bp, akpm, dferguson, linux-edac, linux-acpi,
linux-mm, linux-doc, tony.luck, lenb, leo.duran, Yazen.Ghannam,
mchehab, linuxarm, rientjes, jiaqiyan, Jon.Grimm, dave.hansen,
naoya.horiguchi, james.morse, jthoughton, somasundaram.a,
erdemaktas, pgonda, duenwen, gthelen, wschwartz, wbs, nifan.cxl,
tanxiaofei, prime.zeng, roberto.sassu, kangkang.shen,
wanghuiqiang
On Wed, 20 Aug 2025 10:34:13 +0300
Mike Rapoport <rppt@kernel.org> wrote:
> On Tue, Aug 19, 2025 at 05:54:20PM +0100, Jonathan Cameron wrote:
> > On Tue, 12 Aug 2025 15:26:13 +0100
> > <shiju.jose@huawei.com> wrote:
> >
> > > From: Shiju Jose <shiju.jose@huawei.com>
> > >
> > > In the numa_memblks, a lookup facility is required to retrieve the
> > > physical address range of memory in a NUMA node. ACPI RAS2 memory
> > > features are among the use cases.
> > >
> > > Suggested-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> > > Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
> >
> > Looks fine to me. Mike, what do you think?
>
> I still don't see why we can't use existing functions like
> get_pfn_range_for_nid() or memblock_search_pfn_nid().
>
> Or even node_start_pfn() and node_spanned_pages().
Good point. No reason anyone would scrub this on memory that hasn't
been hotplugged yet, so no need to use numa-memblk to get the info.
I guess I was thinking of the wrong hammer :)
I'm not sure node_spanned_pages() works though, as we must not include
ranges that might be on another node as we'd give a wrong impression of
what was being scrubbed.
Should be able to use some combination of node_start_pfn() and maybe
memblock_search_pfn_nid() to get it though (that also gets the nid
we already know but meh, no ral harm in that.)
Jonathan
>
> > One passing comment inline.
> >
> > Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> >
> > > ---
> > > include/linux/numa.h | 9 +++++++++
> > > include/linux/numa_memblks.h | 2 ++
> > > mm/numa.c | 10 ++++++++++
> > > mm/numa_memblks.c | 37 ++++++++++++++++++++++++++++++++++++
> > > 4 files changed, 58 insertions(+)
> > >
> > > diff --git a/include/linux/numa.h b/include/linux/numa.h
> > > index e6baaf6051bc..1d1aabebd26b 100644
> > > --- a/include/linux/numa.h
> > > +++ b/include/linux/numa.h
> > > @@ -41,6 +41,10 @@ int memory_add_physaddr_to_nid(u64 start);
> > > int phys_to_target_node(u64 start);
> > > #endif
> > >
> > > +#ifndef nid_get_mem_physaddr_range
> > > +int nid_get_mem_physaddr_range(int nid, u64 *start, u64 *end);
> > > +#endif
> > > +
> > > int numa_fill_memblks(u64 start, u64 end);
> > >
> > > #else /* !CONFIG_NUMA */
> > > @@ -63,6 +67,11 @@ static inline int phys_to_target_node(u64 start)
> > > return 0;
> > > }
> > >
> > > +static inline int nid_get_mem_physaddr_range(int nid, u64 *start, u64 *end)
> > > +{
> > > + return 0;
> > > +}
> > > +
> > > static inline void alloc_offline_node_data(int nid) {}
> > > #endif
> > >
> > > diff --git a/include/linux/numa_memblks.h b/include/linux/numa_memblks.h
> > > index 991076cba7c5..7b32d96d0134 100644
> > > --- a/include/linux/numa_memblks.h
> > > +++ b/include/linux/numa_memblks.h
> > > @@ -55,6 +55,8 @@ extern int phys_to_target_node(u64 start);
> > > #define phys_to_target_node phys_to_target_node
> > > extern int memory_add_physaddr_to_nid(u64 start);
> > > #define memory_add_physaddr_to_nid memory_add_physaddr_to_nid
> > > +extern int nid_get_mem_physaddr_range(int nid, u64 *start, u64 *end);
> > > +#define nid_get_mem_physaddr_range nid_get_mem_physaddr_range
> > > #endif /* CONFIG_NUMA_KEEP_MEMINFO */
> > >
> > > #endif /* CONFIG_NUMA_MEMBLKS */
> > > diff --git a/mm/numa.c b/mm/numa.c
> > > index 7d5e06fe5bd4..5335af1fefee 100644
> > > --- a/mm/numa.c
> > > +++ b/mm/numa.c
> > > @@ -59,3 +59,13 @@ int phys_to_target_node(u64 start)
> > > }
> > > EXPORT_SYMBOL_GPL(phys_to_target_node);
> > > #endif
> > > +
> > > +#ifndef nid_get_mem_physaddr_range
> > > +int nid_get_mem_physaddr_range(int nid, u64 *start, u64 *end)
> > > +{
> > > + pr_info_once("Unknown target phys addr range for node=%d\n", nid);
> > > +
> > > + return 0;
> > > +}
> > > +EXPORT_SYMBOL_GPL(nid_get_mem_physaddr_range);
> > > +#endif
> > > diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c
> > > index 541a99c4071a..e1e56b7a3499 100644
> > > --- a/mm/numa_memblks.c
> > > +++ b/mm/numa_memblks.c
> > > @@ -590,4 +590,41 @@ int memory_add_physaddr_to_nid(u64 start)
> > > }
> > > EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> > >
> > > +/**
> > > + * nid_get_mem_physaddr_range - Get the physical address range
> > > + * of the memblk in the NUMA node.
> > > + * @nid: NUMA node ID of the memblk
> > > + * @start: Start address of the memblk
> > > + * @end: End address of the memblk
> > > + *
> > > + * Find the lowest contiguous physical memory address range of the memblk
> > > + * in the NUMA node with the given nid and return the start and end
> > > + * addresses.
> > > + *
> > > + * RETURNS:
> > > + * 0 on success, -errno on failure.
> > > + */
> > > +int nid_get_mem_physaddr_range(int nid, u64 *start, u64 *end)
> > > +{
> > > + struct numa_meminfo *mi = &numa_meminfo;
> > > + int i;
> > > +
> > > + if (!numa_valid_node(nid))
> > > + return -EINVAL;
> > > +
> > > + for (i = 0; i < mi->nr_blks; i++) {
> > > + if (mi->blk[i].nid == nid) {
> > > + *start = mi->blk[i].start;
> > > + /*
> > > + * Assumption: mi->blk[i].end is the last address
> > > + * in the range + 1.
> >
> > This was my fault for asking on internal review if this was documented
> > anywhere. It's kind of implicitly obvious when reading numa_memblk.c
> > because there are a bunch of end - 1 prints.
> > So can probably drop this comment.
> >
> > > + */
> > > + *end = mi->blk[i].end;
> > > + return 0;
> > > + }
> > > + }
> > > +
> > > + return -ENODEV;
> > > +}
> > > +EXPORT_SYMBOL_GPL(nid_get_mem_physaddr_range);
> > > #endif /* CONFIG_NUMA_KEEP_MEMINFO */
> >
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* RE: [PATCH v11 1/3] mm: Add support to retrieve physical address range of memory from the node ID
2025-08-20 8:54 ` Jonathan Cameron
@ 2025-08-20 10:00 ` Shiju Jose
2025-08-20 17:02 ` Mike Rapoport
0 siblings, 1 reply; 13+ messages in thread
From: Shiju Jose @ 2025-08-20 10:00 UTC (permalink / raw)
To: Jonathan Cameron, Mike Rapoport
Cc: rafael@kernel.org, bp@alien8.de, akpm@linux-foundation.org,
dferguson@amperecomputing.com, linux-edac@vger.kernel.org,
linux-acpi@vger.kernel.org, linux-mm@kvack.org,
linux-doc@vger.kernel.org, tony.luck@intel.com, lenb@kernel.org,
leo.duran@amd.com, Yazen.Ghannam@amd.com, mchehab@kernel.org,
Linuxarm, rientjes@google.com, jiaqiyan@google.com,
Jon.Grimm@amd.com, dave.hansen@linux.intel.com,
naoya.horiguchi@nec.com, james.morse@arm.com,
jthoughton@google.com, somasundaram.a@hpe.com,
erdemaktas@google.com, pgonda@google.com, duenwen@google.com,
gthelen@google.com, wschwartz@amperecomputing.com,
wbs@os.amperecomputing.com, nifan.cxl@gmail.com, tanxiaofei,
Zengtao (B), Roberto Sassu, kangkang.shen@futurewei.com,
wanghuiqiang
>-----Original Message-----
>From: Jonathan Cameron <jonathan.cameron@huawei.com>
>Sent: 20 August 2025 09:54
>To: Mike Rapoport <rppt@kernel.org>
>Cc: Shiju Jose <shiju.jose@huawei.com>; rafael@kernel.org; bp@alien8.de;
>akpm@linux-foundation.org; dferguson@amperecomputing.com; linux-
>edac@vger.kernel.org; linux-acpi@vger.kernel.org; linux-mm@kvack.org; linux-
>doc@vger.kernel.org; tony.luck@intel.com; lenb@kernel.org;
>leo.duran@amd.com; Yazen.Ghannam@amd.com; mchehab@kernel.org;
>Linuxarm <linuxarm@huawei.com>; rientjes@google.com;
>jiaqiyan@google.com; Jon.Grimm@amd.com; dave.hansen@linux.intel.com;
>naoya.horiguchi@nec.com; james.morse@arm.com; jthoughton@google.com;
>somasundaram.a@hpe.com; erdemaktas@google.com; pgonda@google.com;
>duenwen@google.com; gthelen@google.com;
>wschwartz@amperecomputing.com; wbs@os.amperecomputing.com;
>nifan.cxl@gmail.com; tanxiaofei <tanxiaofei@huawei.com>; Zengtao (B)
><prime.zeng@hisilicon.com>; Roberto Sassu <roberto.sassu@huawei.com>;
>kangkang.shen@futurewei.com; wanghuiqiang <wanghuiqiang@huawei.com>
>Subject: Re: [PATCH v11 1/3] mm: Add support to retrieve physical address
>range of memory from the node ID
>
>On Wed, 20 Aug 2025 10:34:13 +0300
>Mike Rapoport <rppt@kernel.org> wrote:
>
>> On Tue, Aug 19, 2025 at 05:54:20PM +0100, Jonathan Cameron wrote:
>> > On Tue, 12 Aug 2025 15:26:13 +0100
>> > <shiju.jose@huawei.com> wrote:
>> >
>> > > From: Shiju Jose <shiju.jose@huawei.com>
>> > >
>> > > In the numa_memblks, a lookup facility is required to retrieve the
>> > > physical address range of memory in a NUMA node. ACPI RAS2 memory
>> > > features are among the use cases.
>> > >
>> > > Suggested-by: Jonathan Cameron <jonathan.cameron@huawei.com>
>> > > Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
>> >
>> > Looks fine to me. Mike, what do you think?
>>
>> I still don't see why we can't use existing functions like
>> get_pfn_range_for_nid() or memblock_search_pfn_nid().
>>
>> Or even node_start_pfn() and node_spanned_pages().
>
>Good point. No reason anyone would scrub this on memory that hasn't been
>hotplugged yet, so no need to use numa-memblk to get the info.
>I guess I was thinking of the wrong hammer :)
>
>I'm not sure node_spanned_pages() works though, as we must not include
>ranges that might be on another node as we'd give a wrong impression of what
>was being scrubbed.
>
>Should be able to use some combination of node_start_pfn() and maybe
>memblock_search_pfn_nid() to get it though (that also gets the nid we already
>know but meh, no real harm in that.)
Thanks Mike and Jonathan.
The following approaches were tried, as you suggested, instead of the newly
proposed nid_get_mem_physaddr_range().
Methods 1 to 3 give the same result as nid_get_mem_physaddr_range(), but
Method 4 gives a different value for the size.
Please advise which method should be used for RAS2?
Thanks,
Shiju
Method 1
start_pfn = node_start_pfn(ras2_ctx->sys_comp_nid);
end_pfn = node_end_pfn(ras2_ctx->sys_comp_nid);
start = __pfn_to_phys(start_pfn);
end = __pfn_to_phys(end_pfn);
ras2_ctx->mem_base_addr = start;
ras2_ctx->mem_size = end - start;
pr_info("mem_base_addr=0x%lx mem_size=0x%lx\n", ras2_ctx->mem_base_addr, ras2_ctx->mem_size);
Method 2
start_pfn = node_start_pfn(ras2_ctx->sys_comp_nid);
size_pfn = node_spanned_pages(ras2_ctx->sys_comp_nid);
ras2_ctx->mem_base_addr = __pfn_to_phys(start_pfn);
ras2_ctx->mem_size = __pfn_to_phys(size_pfn);
pr_info("mem_base_addr=0x%lx mem_size=0x%lx\n", ras2_ctx->mem_base_addr, ras2_ctx->mem_size);
Method 3
get_pfn_range_for_nid(ras2_ctx->sys_comp_nid, &start_pfn, &end_pfn);
start = __pfn_to_phys(start_pfn);
end = __pfn_to_phys(end_pfn);
ras2_ctx->mem_base_addr = start;
ras2_ctx->mem_size = end - start;
pr_info("mem_base_addr=0x%lx mem_size=0x%lx\n", ras2_ctx->mem_base_addr, ras2_ctx->mem_size);
Method 4
pfn = node_start_pfn(ras2_ctx->sys_comp_nid);
rc = memblock_search_pfn_nid(pfn, &start_pfn, &end_pfn);
if (rc == NUMA_NO_NODE) {
	pr_warn("Failed to find phys addr range for NUMA node(%u) rc=%d\n",
		ras2_ctx->sys_comp_nid, rc);
	goto ctx_free;
}
start = __pfn_to_phys(start_pfn);
end = __pfn_to_phys(end_pfn);
ras2_ctx->mem_base_addr = start;
ras2_ctx->mem_size = end - start;
pr_info("mem_base_addr=0x%lx mem_size=0x%lx\n", ras2_ctx->mem_base_addr, ras2_ctx->mem_size);
>
>Jonathan
>
>
>
>
>>
>> > One passing comment inline.
>> >
>> > Reviewed-by: Jonathan Cameron <jonathan.cameron@huawei.com>
>> >
>> > > ---
>> > > include/linux/numa.h | 9 +++++++++
>> > > include/linux/numa_memblks.h | 2 ++
>> > > mm/numa.c | 10 ++++++++++
>> > > mm/numa_memblks.c | 37
>++++++++++++++++++++++++++++++++++++
>> > > 4 files changed, 58 insertions(+)
>> > >
>> > > diff --git a/include/linux/numa.h b/include/linux/numa.h index
>> > > e6baaf6051bc..1d1aabebd26b 100644
>> > > --- a/include/linux/numa.h
>> > > +++ b/include/linux/numa.h
>> > > @@ -41,6 +41,10 @@ int memory_add_physaddr_to_nid(u64 start); int
>> > > phys_to_target_node(u64 start); #endif
>> > >
>> > > +#ifndef nid_get_mem_physaddr_range int
>> > > +nid_get_mem_physaddr_range(int nid, u64 *start, u64 *end); #endif
>> > > +
>> > > int numa_fill_memblks(u64 start, u64 end);
>> > >
>> > > #else /* !CONFIG_NUMA */
>> > > @@ -63,6 +67,11 @@ static inline int phys_to_target_node(u64 start)
>> > > return 0;
>> > > }
>> > >
>> > > +static inline int nid_get_mem_physaddr_range(int nid, u64 *start,
>> > > +u64 *end) {
>> > > + return 0;
>> > > +}
>> > > +
>> > > static inline void alloc_offline_node_data(int nid) {} #endif
>> > >
>> > > diff --git a/include/linux/numa_memblks.h
>> > > b/include/linux/numa_memblks.h index 991076cba7c5..7b32d96d0134
>> > > 100644
>> > > --- a/include/linux/numa_memblks.h
>> > > +++ b/include/linux/numa_memblks.h
>> > > @@ -55,6 +55,8 @@ extern int phys_to_target_node(u64 start);
>> > > #define phys_to_target_node phys_to_target_node extern int
>> > > memory_add_physaddr_to_nid(u64 start); #define
>> > > memory_add_physaddr_to_nid memory_add_physaddr_to_nid
>> > > +extern int nid_get_mem_physaddr_range(int nid, u64 *start, u64
>> > > +*end); #define nid_get_mem_physaddr_range
>> > > +nid_get_mem_physaddr_range
>> > > #endif /* CONFIG_NUMA_KEEP_MEMINFO */
>> > >
>> > > #endif /* CONFIG_NUMA_MEMBLKS */
>> > > diff --git a/mm/numa.c b/mm/numa.c index
>> > > 7d5e06fe5bd4..5335af1fefee 100644
>> > > --- a/mm/numa.c
>> > > +++ b/mm/numa.c
>> > > @@ -59,3 +59,13 @@ int phys_to_target_node(u64 start) }
>> > > EXPORT_SYMBOL_GPL(phys_to_target_node);
>> > > #endif
>> > > +
>> > > +#ifndef nid_get_mem_physaddr_range int
>> > > +nid_get_mem_physaddr_range(int nid, u64 *start, u64 *end) {
>> > > + pr_info_once("Unknown target phys addr range for node=%d\n",
>> > > +nid);
>> > > +
>> > > + return 0;
>> > > +}
>> > > +EXPORT_SYMBOL_GPL(nid_get_mem_physaddr_range);
>> > > +#endif
>> > > diff --git a/mm/numa_memblks.c b/mm/numa_memblks.c index
>> > > 541a99c4071a..e1e56b7a3499 100644
>> > > --- a/mm/numa_memblks.c
>> > > +++ b/mm/numa_memblks.c
>> > > @@ -590,4 +590,41 @@ int memory_add_physaddr_to_nid(u64 start) }
>> > > EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
>> > >
>> > > +/**
>> > > + * nid_get_mem_physaddr_range - Get the physical address range
>> > > + * of the memblk in the NUMA node.
>> > > + * @nid: NUMA node ID of the memblk
>> > > + * @start: Start address of the memblk
>> > > + * @end: End address of the memblk
>> > > + *
>> > > + * Find the lowest contiguous physical memory address range of
>> > > +the memblk
>> > > + * in the NUMA node with the given nid and return the start and
>> > > +end
>> > > + * addresses.
>> > > + *
>> > > + * RETURNS:
>> > > + * 0 on success, -errno on failure.
>> > > + */
>> > > +int nid_get_mem_physaddr_range(int nid, u64 *start, u64 *end) {
>> > > + struct numa_meminfo *mi = &numa_meminfo;
>> > > + int i;
>> > > +
>> > > + if (!numa_valid_node(nid))
>> > > + return -EINVAL;
>> > > +
>> > > + for (i = 0; i < mi->nr_blks; i++) {
>> > > + if (mi->blk[i].nid == nid) {
>> > > + *start = mi->blk[i].start;
>> > > + /*
>> > > + * Assumption: mi->blk[i].end is the last address
>> > > + * in the range + 1.
>> >
>> > This was my fault for asking on internal review if this was
>> > documented anywhere. It's kind of implicitly obvious when reading
>> > numa_memblk.c because there are a bunch of end - 1 prints.
>> > So can probably drop this comment.
>> >
>> > > + */
>> > > + *end = mi->blk[i].end;
>> > > + return 0;
>> > > + }
>> > > + }
>> > > +
>> > > + return -ENODEV;
>> > > +}
>> > > +EXPORT_SYMBOL_GPL(nid_get_mem_physaddr_range);
>> > > #endif /* CONFIG_NUMA_KEEP_MEMINFO */
>> >
>>
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v11 1/3] mm: Add support to retrieve physical address range of memory from the node ID
2025-08-20 10:00 ` Shiju Jose
@ 2025-08-20 17:02 ` Mike Rapoport
2025-08-21 9:06 ` Jonathan Cameron
0 siblings, 1 reply; 13+ messages in thread
From: Mike Rapoport @ 2025-08-20 17:02 UTC (permalink / raw)
To: Shiju Jose
Cc: Jonathan Cameron, rafael@kernel.org, bp@alien8.de,
akpm@linux-foundation.org, dferguson@amperecomputing.com,
linux-edac@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-mm@kvack.org, linux-doc@vger.kernel.org,
tony.luck@intel.com, lenb@kernel.org, leo.duran@amd.com,
Yazen.Ghannam@amd.com, mchehab@kernel.org, Linuxarm,
rientjes@google.com, jiaqiyan@google.com, Jon.Grimm@amd.com,
dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
james.morse@arm.com, jthoughton@google.com,
somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
duenwen@google.com, gthelen@google.com,
wschwartz@amperecomputing.com, wbs@os.amperecomputing.com,
nifan.cxl@gmail.com, tanxiaofei, Zengtao (B), Roberto Sassu,
kangkang.shen@futurewei.com, wanghuiqiang
On Wed, Aug 20, 2025 at 10:00:50AM +0000, Shiju Jose wrote:
> >-----Original Message-----
> >From: Jonathan Cameron <jonathan.cameron@huawei.com>
> >Sent: 20 August 2025 09:54
> >To: Mike Rapoport <rppt@kernel.org>
> >Cc: Shiju Jose <shiju.jose@huawei.com>; rafael@kernel.org; bp@alien8.de;
> >akpm@linux-foundation.org; dferguson@amperecomputing.com; linux-
> >edac@vger.kernel.org; linux-acpi@vger.kernel.org; linux-mm@kvack.org; linux-
> >doc@vger.kernel.org; tony.luck@intel.com; lenb@kernel.org;
> >leo.duran@amd.com; Yazen.Ghannam@amd.com; mchehab@kernel.org;
> >Linuxarm <linuxarm@huawei.com>; rientjes@google.com;
> >jiaqiyan@google.com; Jon.Grimm@amd.com; dave.hansen@linux.intel.com;
> >naoya.horiguchi@nec.com; james.morse@arm.com; jthoughton@google.com;
> >somasundaram.a@hpe.com; erdemaktas@google.com; pgonda@google.com;
> >duenwen@google.com; gthelen@google.com;
> >wschwartz@amperecomputing.com; wbs@os.amperecomputing.com;
> >nifan.cxl@gmail.com; tanxiaofei <tanxiaofei@huawei.com>; Zengtao (B)
> ><prime.zeng@hisilicon.com>; Roberto Sassu <roberto.sassu@huawei.com>;
> >kangkang.shen@futurewei.com; wanghuiqiang <wanghuiqiang@huawei.com>
> >Subject: Re: [PATCH v11 1/3] mm: Add support to retrieve physical address
> >range of memory from the node ID
> >
> >On Wed, 20 Aug 2025 10:34:13 +0300
> >Mike Rapoport <rppt@kernel.org> wrote:
> >
> >> On Tue, Aug 19, 2025 at 05:54:20PM +0100, Jonathan Cameron wrote:
> >> > On Tue, 12 Aug 2025 15:26:13 +0100
> >> > <shiju.jose@huawei.com> wrote:
> >> >
> >> > > From: Shiju Jose <shiju.jose@huawei.com>
> >> > >
> >> > > In the numa_memblks, a lookup facility is required to retrieve the
> >> > > physical address range of memory in a NUMA node. ACPI RAS2 memory
> >> > > features are among the use cases.
> >> > >
> >> > > Suggested-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> >> > > Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
> >> >
> >> > Looks fine to me. Mike, what do you think?
> >>
> >> I still don't see why we can't use existing functions like
> >> get_pfn_range_for_nid() or memblock_search_pfn_nid().
> >>
> >> Or even node_start_pfn() and node_spanned_pages().
> >
> >Good point. No reason anyone would scrub this on memory that hasn't been
> >hotplugged yet, so no need to use numa-memblk to get the info.
> >I guess I was thinking of the wrong hammer :)
> >
> >I'm not sure node_spanned_pages() works though, as we must not include
> >ranges that might be on another node as we'd give a wrong impression of what
> >was being scrubbed.
If nodes are not interleaved node_spanned_pages() would work, even if there
are holes inside the node, like e.g. e820-reserved memory.
So with non-interleaved nodes node_start_pfn() and either
node_spanned_pages() or node_end_pfn() will give the node extents and they
are faster than get_pfn_range_for_nid().
If the nodes are interleaved, though, a single mem_base, mem_size are not
enough for a node as there are a few contiguous ranges in that node, e.g.
0 4G 8G 12G 16G
+-------------+ +-------------+ +-------------+ +-------------+
| node 0 | | node 1 | | node 0 | | node 1 |
+-------------+ +-------------+ +-------------+ +-------------+
I didn't look into the details of the RAS2 driver, but isn't that something
it should handle?
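For completeness, a sketch of how all of a node's contiguous chunks could be
walked (illustrative only, name made up, and again assuming memblock is kept
after boot):

#include <linux/memblock.h>
#include <linux/pfn.h>
#include <linux/printk.h>

/* Sketch: print every contiguous memblock range that belongs to @nid. */
static void dump_node_chunks(int nid)
{
	unsigned long start_pfn, end_pfn;
	int i;

	for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL)
		pr_info("nid %d: [%#llx-%#llx)\n", nid,
			(u64)PFN_PHYS(start_pfn), (u64)PFN_PHYS(end_pfn));
}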
> >Should be able to use some combination of node_start_pfn() and maybe
> >memblock_search_pfn_nid() to get it though (that also gets the nid we already
> > >know but meh, no real harm in that.)
>
> Thanks Mike and Jonathan.
>
> The following approaches were tried as you suggested, instead of newly proposed
> nid_get_mem_physaddr_range().
> Methods 1 to 3 give the same result as nid_get_mem_physaddr_range(), but
> Method 4 gives a different value for the size.
I believe that's because on x86 the node 0 is really scrambled because of
e820/efi reservations that never make it to memblock.
> Please advise which method should be used for the RAS2?
>
> Thanks,
> Shiju
>
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v11 1/3] mm: Add support to retrieve physical address range of memory from the node ID
2025-08-20 17:02 ` Mike Rapoport
@ 2025-08-21 9:06 ` Jonathan Cameron
2025-08-21 16:16 ` Luck, Tony
0 siblings, 1 reply; 13+ messages in thread
From: Jonathan Cameron @ 2025-08-21 9:06 UTC (permalink / raw)
To: Mike Rapoport
Cc: Shiju Jose, rafael@kernel.org, bp@alien8.de,
akpm@linux-foundation.org, dferguson@amperecomputing.com,
linux-edac@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-mm@kvack.org, linux-doc@vger.kernel.org,
tony.luck@intel.com, lenb@kernel.org, leo.duran@amd.com,
Yazen.Ghannam@amd.com, mchehab@kernel.org, Linuxarm,
rientjes@google.com, jiaqiyan@google.com, Jon.Grimm@amd.com,
dave.hansen@linux.intel.com, naoya.horiguchi@nec.com,
james.morse@arm.com, jthoughton@google.com,
somasundaram.a@hpe.com, erdemaktas@google.com, pgonda@google.com,
duenwen@google.com, gthelen@google.com,
wschwartz@amperecomputing.com, wbs@os.amperecomputing.com,
nifan.cxl@gmail.com, tanxiaofei, Zengtao (B), Roberto Sassu,
kangkang.shen@futurewei.com, wanghuiqiang
On Wed, 20 Aug 2025 20:02:38 +0300
Mike Rapoport <rppt@kernel.org> wrote:
> On Wed, Aug 20, 2025 at 10:00:50AM +0000, Shiju Jose wrote:
> > >-----Original Message-----
> > >From: Jonathan Cameron <jonathan.cameron@huawei.com>
> > >Sent: 20 August 2025 09:54
> > >To: Mike Rapoport <rppt@kernel.org>
> > >Cc: Shiju Jose <shiju.jose@huawei.com>; rafael@kernel.org; bp@alien8.de;
> > >akpm@linux-foundation.org; dferguson@amperecomputing.com; linux-
> > >edac@vger.kernel.org; linux-acpi@vger.kernel.org; linux-mm@kvack.org; linux-
> > >doc@vger.kernel.org; tony.luck@intel.com; lenb@kernel.org;
> > >leo.duran@amd.com; Yazen.Ghannam@amd.com; mchehab@kernel.org;
> > >Linuxarm <linuxarm@huawei.com>; rientjes@google.com;
> > >jiaqiyan@google.com; Jon.Grimm@amd.com; dave.hansen@linux.intel.com;
> > >naoya.horiguchi@nec.com; james.morse@arm.com; jthoughton@google.com;
> > >somasundaram.a@hpe.com; erdemaktas@google.com; pgonda@google.com;
> > >duenwen@google.com; gthelen@google.com;
> > >wschwartz@amperecomputing.com; wbs@os.amperecomputing.com;
> > >nifan.cxl@gmail.com; tanxiaofei <tanxiaofei@huawei.com>; Zengtao (B)
> > ><prime.zeng@hisilicon.com>; Roberto Sassu <roberto.sassu@huawei.com>;
> > >kangkang.shen@futurewei.com; wanghuiqiang <wanghuiqiang@huawei.com>
> > >Subject: Re: [PATCH v11 1/3] mm: Add support to retrieve physical address
> > >range of memory from the node ID
> > >
> > >On Wed, 20 Aug 2025 10:34:13 +0300
> > >Mike Rapoport <rppt@kernel.org> wrote:
> > >
> > >> On Tue, Aug 19, 2025 at 05:54:20PM +0100, Jonathan Cameron wrote:
> > >> > On Tue, 12 Aug 2025 15:26:13 +0100
> > >> > <shiju.jose@huawei.com> wrote:
> > >> >
> > >> > > From: Shiju Jose <shiju.jose@huawei.com>
> > >> > >
> > >> > > In the numa_memblks, a lookup facility is required to retrieve the
> > >> > > physical address range of memory in a NUMA node. ACPI RAS2 memory
> > >> > > features are among the use cases.
> > >> > >
> > >> > > Suggested-by: Jonathan Cameron <jonathan.cameron@huawei.com>
> > >> > > Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
> > >> >
> > >> > Looks fine to me. Mike, what do you think?
> > >>
> > >> I still don't see why we can't use existing functions like
> > >> get_pfn_range_for_nid() or memblock_search_pfn_nid().
> > >>
> > >> Or even node_start_pfn() and node_spanned_pages().
> > >
> > >Good point. No reason anyone would scrub this on memory that hasn't been
> > >hotplugged yet, so no need to use numa-memblk to get the info.
> > >I guess I was thinking of the wrong hammer :)
> > >
> > >I'm not sure node_spanned_pages() works though, as we must not include
> > >ranges that might be on another node as we'd give a wrong impression of what
> > >was being scrubbed.
>
> If nodes are not interleaved node_spanned_pages() would work, even if there
> are holes inside the node, like e.g. e820-reserved memory.
> So with non-interleaved nodes node_start_pfn() and either
> node_spanned_pages() or node_end_pfn() will give the node extents and they
> are faster than get_pfn_range_for_nid().
>
> If the nodes are interleaved, though, a single mem_base, mem_size are not
> enough for a node as there are a few contiguous ranges in that node, e.g.
>
> 0 4G 8G 12G 16G
> +-------------+ +-------------+ +-------------+ +-------------+
> | node 0 | | node 1 | | node 0 | | node 1 |
> +-------------+ +-------------+ +-------------+ +-------------+
>
> I didn't look into the details of the RAS2 driver, but isn't that something
> it should handle?
The aim here is that a query made prior to setting a specific range returns
data for at least one range that the scrub controller covers, and nothing
it doesn't. So just presenting the first chunk for a node is fine.
There is plenty of info for userspace to figure things out if it wants
to trigger a scrub on 8-12G in your example, but until it does we want
to return 0-4G for the default range.
I hacked up some SRAT tables to give something like the above for testing.
>
> > >Should be able to use some combination of node_start_pfn() and maybe
> > >memblock_search_pfn_nid() to get it though (that also gets the nid we already
> > >know but meh, no real harm in that.)
> >
> > Thanks Mike and Jonathan.
> >
> > The following approaches were tried as you suggested, instead of newly proposed
> > nid_get_mem_physaddr_range().
> > Methods 1 to 3 give the same result as nid_get_mem_physaddr_range(), but
> > Method 4 gives a different value for the size.
>
> I believe that's because on x86 the node 0 is really scrambled because of
> e820/efi reservations that never make it to memblock.
Fun question of whether we should take any notice of those.
Would depend on whether anyone's scrub firmware gets confused if we scrub
them and they aren't backed by memory. If they are, we can rely on system
constraints refusing to scrub that stuff at an 'unsafe' level, and if we
set it higher than it otherwise would be, the only possibility is that we
see earlier error detections in those ranges and have to deal with them.
Jonathan
>
> > Please advise which method should be used for the RAS2?
> >
> > Thanks,
> > Shiju
> >
>
^ permalink raw reply [flat|nested] 13+ messages in thread
* RE: [PATCH v11 1/3] mm: Add support to retrieve physical address range of memory from the node ID
2025-08-21 9:06 ` Jonathan Cameron
@ 2025-08-21 16:16 ` Luck, Tony
2025-08-24 12:41 ` Mike Rapoport
0 siblings, 1 reply; 13+ messages in thread
From: Luck, Tony @ 2025-08-21 16:16 UTC (permalink / raw)
To: Jonathan Cameron, Mike Rapoport
Cc: Shiju Jose, rafael@kernel.org, bp@alien8.de,
akpm@linux-foundation.org, dferguson@amperecomputing.com,
linux-edac@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-mm@kvack.org, linux-doc@vger.kernel.org, lenb@kernel.org,
leo.duran@amd.com, Yazen.Ghannam@amd.com, mchehab@kernel.org,
Linuxarm, rientjes@google.com, jiaqiyan@google.com,
Jon.Grimm@amd.com, dave.hansen@linux.intel.com,
naoya.horiguchi@nec.com, james.morse@arm.com,
jthoughton@google.com, Somasundaram A, Aktas, Erdem,
pgonda@google.com, duenwen@google.com, gthelen@google.com,
wschwartz@amperecomputing.com, wbs@os.amperecomputing.com,
nifan.cxl@gmail.com, tanxiaofei, Zengtao (B), Roberto Sassu,
kangkang.shen@futurewei.com, wanghuiqiang
> > I believe that's because on x86 the node 0 is really scrambled because of
> > e820/efi reservations that never make it to memblock.
>
> Fun question of whether we should take any notice of those.
> Would depend on whether anyone's scrub firmware gets confused if we scrub
> them and they aren't backed by memory. If they are, we can rely on system
> constraints refusing to scrub that stuff at an 'unsafe' level, and if we
> set it higher than it otherwise would be, the only possibility is that we
> see earlier error detections in those ranges and have to deal with them.
Yes. On x86 the physical memory map below 4GB is a bunch of address
ranges with varying properties:
1) There's the "low" MMIO region (often 2G to very nearly 4G) where 32-bit
PCI devices have device BAR ranges mapped. Some of this isn't memory
at all. It's device registers that may have side effects when read. I don't think
the x86 patrol scrubbers can access this at all.
2) There's EFI allocated memory that is accessible to the OS (e.g. ACPI tables)
3) There's BIOS protected memory for use by SMI that the OS can't access at all
4) There are ranges listed in E820 or efi_memory_map that are usable by OS.
What is the use case for OS control of the patrol scrubbers?
While you might want to specifically scrub some range to make sure there
are no lurking problems before allocating to some important process/guest,
I'd expect that you'd want to make sure that types 2 & 3 listed above still
get a periodic scan to clean up single bit errors.
-Tony
^ permalink raw reply [flat|nested] 13+ messages in thread
* Re: [PATCH v11 1/3] mm: Add support to retrieve physical address range of memory from the node ID
2025-08-21 16:16 ` Luck, Tony
@ 2025-08-24 12:41 ` Mike Rapoport
0 siblings, 0 replies; 13+ messages in thread
From: Mike Rapoport @ 2025-08-24 12:41 UTC (permalink / raw)
To: Luck, Tony
Cc: Jonathan Cameron, Shiju Jose, rafael@kernel.org, bp@alien8.de,
akpm@linux-foundation.org, dferguson@amperecomputing.com,
linux-edac@vger.kernel.org, linux-acpi@vger.kernel.org,
linux-mm@kvack.org, linux-doc@vger.kernel.org, lenb@kernel.org,
leo.duran@amd.com, Yazen.Ghannam@amd.com, mchehab@kernel.org,
Linuxarm, rientjes@google.com, jiaqiyan@google.com,
Jon.Grimm@amd.com, dave.hansen@linux.intel.com,
naoya.horiguchi@nec.com, james.morse@arm.com,
jthoughton@google.com, Somasundaram A, Aktas, Erdem,
pgonda@google.com, duenwen@google.com, gthelen@google.com,
wschwartz@amperecomputing.com, wbs@os.amperecomputing.com,
nifan.cxl@gmail.com, tanxiaofei, Zengtao (B), Roberto Sassu,
kangkang.shen@futurewei.com, wanghuiqiang
On Thu, Aug 21, 2025 at 04:16:02PM +0000, Luck, Tony wrote:
> > > I believe that's because on x86 the node 0 is really scrambled because of
> > > e820/efi reservations that never make it to memblock.
> >
> > Fun question of whether we should take any notice of those.
> > Would depend on whether anyone's scrub firmware gets confused if we scrub
> > them and they aren't backed by memory. If they are, we can rely on system
> > constraints refusing to scrub that stuff at an 'unsafe' level, and if we
> > set it higher than it otherwise would be, the only possibility is that we
> > see earlier error detections in those ranges and have to deal with them.
>
> Yes. On x86 the physical memory map below 4GB is a bunch of address
> ranges with varying properties:
>
> 1) There's the "low" MMIO region (often 2G to very nearly 4G) where 32-bit
> PCI devices have device BAR ranges mapped. Some of this isn't memory
> at all. It's device registers that may have side effects when read. I don't think
> the x86 patrol scrubbers can access this at all.
> 2) There's EFI allocated memory that is accessible to the OS (e.g. ACPI tables)
> 3) There's BIOS protected memory for use by SMI that the OS can't access at all
> 4) There are ranges listed in E820 or efi_memory_map that are usable by OS.
There is a slight problem here with getting the first contiguous range from
a node to seed the scrubber.
If we use PXM, the range for node 0 will usually cover the entire node
including types 1 and 3. And if we take it from memblock, it does not include
type 2, or anything reserved in e820/efi.
> What is the use case for OS control of the patrol scrubbers?
>
> While you might want to specifically scrub some range to make sure there
> are no lurking problems before allocating to some important process/guest,
> I'd expect that you'd want to make sure that types 2 & 3 listed above still
> get a periodic scan to clean up single bit errors.
>
> -Tony
--
Sincerely yours,
Mike.
^ permalink raw reply [flat|nested] 13+ messages in thread
end of thread
Thread overview: 13+ messages
2025-08-12 14:26 [PATCH v11 0/3] ACPI: Add support for ACPI RAS2 feature table shiju.jose
2025-08-12 14:26 ` [PATCH v11 1/3] mm: Add support to retrieve physical address range of memory from the node ID shiju.jose
2025-08-19 16:54 ` Jonathan Cameron
2025-08-20 7:34 ` Mike Rapoport
2025-08-20 8:54 ` Jonathan Cameron
2025-08-20 10:00 ` Shiju Jose
2025-08-20 17:02 ` Mike Rapoport
2025-08-21 9:06 ` Jonathan Cameron
2025-08-21 16:16 ` Luck, Tony
2025-08-24 12:41 ` Mike Rapoport
2025-08-12 14:26 ` [PATCH v11 2/3] ACPI:RAS2: Add ACPI RAS2 driver shiju.jose
2025-08-12 14:26 ` [PATCH v11 3/3] ras: mem: Add memory " shiju.jose
2025-08-19 20:12 ` [PATCH v11 0/3] ACPI: Add support for ACPI RAS2 feature table Daniel Ferguson