* [RFC 0/3] Apply SRAT defined PXM to entire CFMWS
@ 2023-05-10 18:44 alison.schofield
2023-05-10 18:44 ` [RFC 1/3] x86/numa: Introduce numa_find_node(start, end) alison.schofield
` (3 more replies)
0 siblings, 4 replies; 12+ messages in thread
From: alison.schofield @ 2023-05-10 18:44 UTC (permalink / raw)
To: Dan Williams, Ira Weiny, Vishal Verma, Dave Jiang, Ben Widawsky
Cc: Alison Schofield, linux-cxl
From: Alison Schofield <alison.schofield@intel.com>
The patch that created fake NUMA nodes for CFMWS windows
not defined in the SRAT made wrong assumptions about the
SRAT-defined entries. Specifically, it assumed an SRAT entry
using cfmws->start covers the entire CFMWS range, start through
end. That assumption is wrong, so the ACPI driver needs
to examine the SRAT-created memblks more closely to discover
partial definitions of the HPA range.
This work-in-progress addresses that issue. The first 2 patches
introduce numa helpers that are used in the 3rd patch, where the
ACPI driver's parsing of the CFMWS is updated.
The patch commit logs, especially Patch 3's, describe more
of the approach, other approaches considered, and open
questions. So, perhaps, scan patches 1 & 2, then dive into #3 and
confirm or refute this approach.
I did not include our NUMA or ACPI friends in this posting,
because I want to get a direction check from CXL folks before
addressing how the helpers can get merged into the NUMA arch.
Thanks for looking!
Alison Schofield (3):
x86/numa: Introduce numa_find_node(start, end)
x86/numa: Introduce numa_remove_memblks(node, start, end)
ACPI: NUMA: Apply SRAT proximity domain to entire CFMWS window
arch/x86/include/asm/numa.h | 2 ++
arch/x86/mm/numa.c | 36 ++++++++++++++++++++++++++++++++++++
drivers/acpi/numa/srat.c | 32 ++++++++++++++++++++++++++------
3 files changed, 64 insertions(+), 6 deletions(-)
--
2.37.3
* [RFC 1/3] x86/numa: Introduce numa_find_node(start, end)
2023-05-10 18:44 [RFC 0/3] Apply SRAT defined PXM to entire CFMWS alison.schofield
@ 2023-05-10 18:44 ` alison.schofield
2023-05-11 22:46 ` Dan Williams
2023-05-10 18:44 ` [RFC 2/3] x86/numa: Introduce numa_remove_memblks(node, start, end) alison.schofield
` (2 subsequent siblings)
3 siblings, 1 reply; 12+ messages in thread
From: alison.schofield @ 2023-05-10 18:44 UTC (permalink / raw)
To: Dan Williams, Ira Weiny, Vishal Verma, Dave Jiang, Ben Widawsky
Cc: Alison Schofield, linux-cxl
From: Alison Schofield <alison.schofield@intel.com>
phys_to_target_node(phys_addr_t start) returns a NUMA node id for
a single physical address. There is no existing way to discover
whether a NUMA node is assigned to any address in a range.
Repeatedly calling phys_to_target_node() from start to end is too
expensive to consider. Examining the numa_meminfo memblks directly is nicer.
Introduce numa_find_node(start, end) to return the first NUMA node
found anywhere in the start/end HPA range.
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
---
arch/x86/include/asm/numa.h | 1 +
arch/x86/mm/numa.c | 14 ++++++++++++++
2 files changed, 15 insertions(+)
diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index e3bae2b60a0d..5f2b811f1a5f 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -34,6 +34,7 @@ extern nodemask_t numa_nodes_parsed __initdata;
extern int __init numa_add_memblk(int nodeid, u64 start, u64 end);
extern void __init numa_set_distance(int from, int to, int distance);
+extern int __init numa_find_node(u64 start, u64 end);
static inline void set_apicid_to_node(int apicid, s16 node)
{
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 2aadb2019b4f..62990977f720 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -225,6 +225,20 @@ static void __init alloc_node_data(int nid)
node_set_online(nid);
}
+/* Return the node of the first memblk that lies within start/end */
+int __init numa_find_node(u64 start, u64 end)
+{
+ struct numa_meminfo *mi = &numa_meminfo;
+
+ for (int i = 0; i < mi->nr_blks; i++) {
+ struct numa_memblk *bi = &mi->blk[i];
+
+ if (start <= bi->start && end >= bi->end)
+ return bi->nid;
+ }
+ return NUMA_NO_NODE;
+}
+
/**
* numa_cleanup_meminfo - Cleanup a numa_meminfo
* @mi: numa_meminfo to clean up
--
2.37.3
* [RFC 2/3] x86/numa: Introduce numa_remove_memblks(node, start, end)
2023-05-10 18:44 [RFC 0/3] Apply SRAT defined PXM to entire CFMWS alison.schofield
2023-05-10 18:44 ` [RFC 1/3] x86/numa: Introduce numa_find_node(start, end) alison.schofield
@ 2023-05-10 18:44 ` alison.schofield
2023-05-11 22:50 ` Dan Williams
2023-05-10 18:44 ` [RFC 3/3] ACPI: NUMA: Apply SRAT proximity domain to entire CFMWS window alison.schofield
2023-05-11 22:42 ` [RFC 0/3] Apply SRAT defined PXM to entire CFMWS Dan Williams
3 siblings, 1 reply; 12+ messages in thread
From: alison.schofield @ 2023-05-10 18:44 UTC (permalink / raw)
To: Dan Williams, Ira Weiny, Vishal Verma, Dave Jiang, Ben Widawsky
Cc: Alison Schofield, linux-cxl
From: Alison Schofield <alison.schofield@intel.com>
Add support for removing memblks from numa_meminfo that belong
to a given node and lie within a start/end range.
numa_add_memblk() allows in-kernel users to add a memblk to
a NUMA node. There is no method exposed to remove a memblk.
The use case here is to allow the ACPI driver to remove
redundant memblks when it knows they exist, rather than
implementing a separate pass that needlessly walks all memblks
during a later cleanup phase.
numa_cleanup_meminfo() exists for merging memblks; however,
it only considers adjacent memblks, and it moves the memblks
to numa_reserved_meminfo before doing the cleanup.
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
---
arch/x86/include/asm/numa.h | 1 +
arch/x86/mm/numa.c | 22 ++++++++++++++++++++++
2 files changed, 23 insertions(+)
diff --git a/arch/x86/include/asm/numa.h b/arch/x86/include/asm/numa.h
index 5f2b811f1a5f..cb8b9a8cae32 100644
--- a/arch/x86/include/asm/numa.h
+++ b/arch/x86/include/asm/numa.h
@@ -35,6 +35,7 @@ extern nodemask_t numa_nodes_parsed __initdata;
extern int __init numa_add_memblk(int nodeid, u64 start, u64 end);
extern void __init numa_set_distance(int from, int to, int distance);
extern int __init numa_find_node(u64 start, u64 end);
+extern void __init numa_remove_memblks(int node, u64 start, u64 end);
static inline void set_apicid_to_node(int apicid, s16 node)
{
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 62990977f720..42d70f01ca0a 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -239,6 +239,28 @@ int __init numa_find_node(u64 start, u64 end)
return NUMA_NO_NODE;
}
+/**
+ * numa_remove_memblks - Remove memblocks from a node
+ * @node: node whose memblks are removed
+ * @start: start addr of memblks to remove
+ * @end: end addr of memblks to remove
+ *
+ * Remove any memblks of @node that lie within the start/end range
+ */
+void __init numa_remove_memblks(int node, u64 start, u64 end)
+{
+ struct numa_meminfo *mi = &numa_meminfo;
+
+ for (int i = 0; i < mi->nr_blks; i++) {
+ struct numa_memblk *bi = &mi->blk[i];
+
+ if (bi->nid != node)
+ continue;
+ if (start <= bi->start && end >= bi->end)
+ numa_remove_memblk_from(i--, mi);
+ }
+}
+
/**
* numa_cleanup_meminfo - Cleanup a numa_meminfo
* @mi: numa_meminfo to clean up
--
2.37.3
* [RFC 3/3] ACPI: NUMA: Apply SRAT proximity domain to entire CFMWS window
2023-05-10 18:44 [RFC 0/3] Apply SRAT defined PXM to entire CFMWS alison.schofield
2023-05-10 18:44 ` [RFC 1/3] x86/numa: Introduce numa_find_node(start, end) alison.schofield
2023-05-10 18:44 ` [RFC 2/3] x86/numa: Introduce numa_remove_memblks(node, start, end) alison.schofield
@ 2023-05-10 18:44 ` alison.schofield
2023-05-11 23:16 ` Dan Williams
2023-05-11 22:42 ` [RFC 0/3] Apply SRAT defined PXM to entire CFMWS Dan Williams
3 siblings, 1 reply; 12+ messages in thread
From: alison.schofield @ 2023-05-10 18:44 UTC (permalink / raw)
To: Dan Williams, Ira Weiny, Vishal Verma, Dave Jiang, Ben Widawsky
Cc: Alison Schofield, linux-cxl
From: Alison Schofield <alison.schofield@intel.com>
Commit fd49f99c1809 ("ACPI: NUMA: Add a node and memblk for each
CFMWS not in SRAT") did not account for the case where an SRAT entry
only partially describes a CFMWS window. It assumed an SRAT entry
covered the entire CFMWS HPA range, start through end.
Broaden the search for an SRAT-defined NUMA node by replacing the
previously used phys_to_target_node(start) search with the newly
introduced numa_find_node(), which can discover a NUMA node anywhere
in the CFMWS HPA range.
If any NUMA node is discovered, proactively clean up by removing any
memblks, partial or whole, in the HPA range, and add one memblk that
encompasses the entire range. That has the effect of applying the
SRAT-defined proximity domain to the entire range, as well as doing a
memblk cleanup at the point a redundancy is created.
Considered and rejected: letting numa_cleanup_meminfo() try to remove
the redundancy. It does not currently address this case, because these
memblks will be moved to numa_reserved_meminfo before any numa_meminfo
merge is done. Also, the merge logic in numa_cleanup_meminfo() works
on adjacent memblks, so it would need to grow in complexity to search
for these potential cases.
Considered, and open to reconsidering: allowing an extra memblk for
every CFMWS HPA range that is also described in the SRAT. Is that a
concern? If not, the memblk removal work could be skipped entirely.
Fixes: fd49f99c1809 ("ACPI: NUMA: Add a node and memblk for each CFMWS not in SRAT")
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
---
drivers/acpi/numa/srat.c | 32 ++++++++++++++++++++++++++------
1 file changed, 26 insertions(+), 6 deletions(-)
diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
index 1f4fc5f8a819..f41b65e9b085 100644
--- a/drivers/acpi/numa/srat.c
+++ b/drivers/acpi/numa/srat.c
@@ -301,27 +301,47 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
void *arg, const unsigned long table_end)
{
+ int node, found_node, *fake_pxm = arg;
struct acpi_cedt_cfmws *cfmws;
- int *fake_pxm = arg;
u64 start, end;
- int node;
cfmws = (struct acpi_cedt_cfmws *)header;
start = cfmws->base_hpa;
end = cfmws->base_hpa + cfmws->window_size;
- /* Skip if the SRAT already described the NUMA details for this HPA */
- node = phys_to_target_node(start);
- if (node != NUMA_NO_NODE)
+ /*
+ * The SRAT may have already described the NUMA details for
+ * this CFMWS HPA range, yet it may not have created memblks
+ * for the entire range. Look for a node with a memblk covering
+ * any part of the HPA range. Don't bother figuring out if it
+ * is partially or wholly described. Replace any memblks in the
+ * range with one single memblk that covers the entire range.
+ *
+ * This preserves the SRAT defined node and Proximity Domain.
+ */
+
+ found_node = numa_find_node(start, end);
+ if (found_node != NUMA_NO_NODE) {
+ numa_remove_memblks(found_node, start, end);
+ if (numa_add_memblk(found_node, start, end) < 0) {
+ /* CXL driver must handle the NUMA_NO_NODE case */
+ pr_warn("ACPI NUMA: failed to add memblk for CFMWS node %d [mem %#llx-%#llx]\n",
+ found_node, start, end);
+ }
return 0;
+ }
+ /*
+ * SRAT did not describe this window at all.
+ * Create a new node with a fake proximity domain. Add a
+ * memblk covering the entire HPA range.
+ */
node = acpi_map_pxm_to_node(*fake_pxm);
if (node == NUMA_NO_NODE) {
pr_err("ACPI NUMA: Too many proximity domains while processing CFMWS.\n");
return -EINVAL;
}
-
if (numa_add_memblk(node, start, end) < 0) {
/* CXL driver must handle the NUMA_NO_NODE case */
pr_warn("ACPI NUMA: Failed to add memblk for CFMWS node %d [mem %#llx-%#llx]\n",
--
2.37.3
* RE: [RFC 0/3] Apply SRAT defined PXM to entire CFMWS
2023-05-10 18:44 [RFC 0/3] Apply SRAT defined PXM to entire CFMWS alison.schofield
` (2 preceding siblings ...)
2023-05-10 18:44 ` [RFC 3/3] ACPI: NUMA: Apply SRAT proximity domain to entire CFMWS window alison.schofield
@ 2023-05-11 22:42 ` Dan Williams
2023-05-12 16:33 ` Jonathan Cameron
3 siblings, 1 reply; 12+ messages in thread
From: Dan Williams @ 2023-05-11 22:42 UTC (permalink / raw)
To: alison.schofield, Dan Williams, Ira Weiny, Vishal Verma,
Dave Jiang, Ben Widawsky
Cc: Alison Schofield, linux-cxl
alison.schofield@ wrote:
> From: Alison Schofield <alison.schofield@intel.com>
>
> The patch that created fake NUMA nodes for CFMWS windows
"fake" seems the wrong word to me. These are first class Linux NUMA
nodes in the end. So to me it's the "patch that extended SRAT proximity
domains by the potential performance-class windows in the CFMWS", or
something like that.
> not defined in the SRAT made wrong assumptions about the
> SRAT-defined entries. Specifically, it assumed an SRAT entry
> using cfmws->start covers the entire CFMWS range, start through
> end. That assumption is wrong, so the ACPI driver needs
I also wouldn't say it was wrong, just incomplete. I.e. this is just
extending the heuristic to say that if the BIOS describes *any* portion
of a CFMWS window with a proximity domain, might as well reuse that
proximity domain for the entire window since everything in that window
is expected to be of a similar performance class.
> to examine the SRAT-created memblks more closely to discover
> partial definitions of the HPA range.
>
> This work-in-progress addresses that issue. The first 2 patches
> introduce numa helpers that are used in the 3rd patch, where the
> ACPI driver's parsing of the CFMWS is updated.
>
> The patch commit logs, especially Patch 3's, describe more
> of the approach, other approaches considered, and open
> questions. So, perhaps, scan patches 1 & 2, then dive into #3 and
> confirm or refute this approach.
>
> I did not include our NUMA or ACPI friends in this posting,
> because I want to get a direction check from CXL folks before
> addressing how the helpers can get merged into the NUMA arch.
>
> Thanks for looking!
>
> Alison Schofield (3):
> x86/numa: Introduce numa_find_node(start, end)
> x86/numa: Introduce numa_remove_memblks(node, start, end)
> ACPI: NUMA: Apply SRAT proximity domain to entire CFMWS window
>
> arch/x86/include/asm/numa.h | 2 ++
> arch/x86/mm/numa.c | 36 ++++++++++++++++++++++++++++++++++++
> drivers/acpi/numa/srat.c | 32 ++++++++++++++++++++++++++------
> 3 files changed, 64 insertions(+), 6 deletions(-)
>
> --
> 2.37.3
>
* RE: [RFC 1/3] x86/numa: Introduce numa_find_node(start, end)
2023-05-10 18:44 ` [RFC 1/3] x86/numa: Introduce numa_find_node(start, end) alison.schofield
@ 2023-05-11 22:46 ` Dan Williams
0 siblings, 0 replies; 12+ messages in thread
From: Dan Williams @ 2023-05-11 22:46 UTC (permalink / raw)
To: alison.schofield, Dan Williams, Ira Weiny, Vishal Verma,
Dave Jiang, Ben Widawsky
Cc: Alison Schofield, linux-cxl
alison.schofield@ wrote:
> From: Alison Schofield <alison.schofield@intel.com>
>
> phys_to_target_node(phys_addr_t start) returns a NUMA node id for
> a single physical address. There is no existing way to discover
> whether a NUMA node is assigned to any address in a range.
>
> Repeatedly calling phys_to_target_node() from start to end is too
> expensive to consider. Examining the numa_meminfo memblks directly is nicer.
>
> Introduce numa_find_node(start, end) to return the first NUMA node
> found anywhere in the start/end HPA range.
I don't think this patch stands on its own; it assumes that something
*wants* to scan for nodes by a range. Maybe it becomes clearer in a
follow-on patch, but this feels like it wants squashing.
* RE: [RFC 2/3] x86/numa: Introduce numa_remove_memblks(node, start, end)
2023-05-10 18:44 ` [RFC 2/3] x86/numa: Introduce numa_remove_memblks(node, start, end) alison.schofield
@ 2023-05-11 22:50 ` Dan Williams
0 siblings, 0 replies; 12+ messages in thread
From: Dan Williams @ 2023-05-11 22:50 UTC (permalink / raw)
To: alison.schofield, Dan Williams, Ira Weiny, Vishal Verma,
Dave Jiang, Ben Widawsky
Cc: Alison Schofield, linux-cxl
alison.schofield@ wrote:
> From: Alison Schofield <alison.schofield@intel.com>
>
> Add support for removing memblks from numa_meminfo that belong
> to a given node and lie within a start/end range.
>
> numa_add_memblk() allows in-kernel users to add a memblk to
> a NUMA node. There is no method exposed to remove a memblk.
>
> The use case here is to allow the ACPI driver to remove
> redundant memblks when it knows they exist, rather than
> implementing a separate pass that needlessly walks all memblks
> during a later cleanup phase.
Again, this assumes a particular way to solve the memblk update
problem; it's not clear that a remove method is the way to go versus
an update method.
I.e., it's not clear to me where the "redundant" memblks appear.
> numa_cleanup_meminfo() exists for merging memblks; however,
> it only considers adjacent memblks, and it moves the memblks
> to numa_reserved_meminfo before doing the cleanup.
So either the algorithm for extending memblks when they overlap a
CFMWS window needs to be described here, or this patch needs to be
squashed.
* RE: [RFC 3/3] ACPI: NUMA: Apply SRAT proximity domain to entire CFMWS window
2023-05-10 18:44 ` [RFC 3/3] ACPI: NUMA: Apply SRAT proximity domain to entire CFMWS window alison.schofield
@ 2023-05-11 23:16 ` Dan Williams
2023-05-12 0:01 ` Alison Schofield
0 siblings, 1 reply; 12+ messages in thread
From: Dan Williams @ 2023-05-11 23:16 UTC (permalink / raw)
To: alison.schofield, Dan Williams, Ira Weiny, Vishal Verma,
Dave Jiang, Ben Widawsky
Cc: Alison Schofield, linux-cxl
alison.schofield@ wrote:
> From: Alison Schofield <alison.schofield@intel.com>
>
> Commit fd49f99c1809 ("ACPI: NUMA: Add a node and memblk for each
> CFMWS not in SRAT") did not account for the case where an SRAT entry
> only partially describes a CFMWS window. It assumed an SRAT entry
> covered the entire CFMWS HPA range, start through end.
>
> Broaden the search for an SRAT-defined NUMA node by replacing the
> previously used phys_to_target_node(start) search with the newly
> introduced numa_find_node(), which can discover a NUMA node anywhere
> in the CFMWS HPA range.
>
> If any NUMA node is discovered, proactively clean up by removing any
> memblks, partial or whole, in the HPA range, and add one memblk that
> encompasses the entire range. That has the effect of applying the
> SRAT-defined proximity domain to the entire range, as well as doing a
> memblk cleanup at the point a redundancy is created.
>
> Considered and rejected: letting numa_cleanup_meminfo() try to remove
> the redundancy. It does not currently address this case, because these
> memblks will be moved to numa_reserved_meminfo before any numa_meminfo
> merge is done. Also, the merge logic in numa_cleanup_meminfo() works
> on adjacent memblks, so it would need to grow in complexity to search
> for these potential cases.
>
> Considered, and open to reconsidering: allowing an extra memblk for
> every CFMWS HPA range that is also described in the SRAT. Is that a
> concern? If not, the memblk removal work could be skipped entirely.
>
> Fixes: fd49f99c1809 ("ACPI: NUMA: Add a node and memblk for each CFMWS not in SRAT")
> Signed-off-by: Alison Schofield <alison.schofield@intel.com>
> ---
> drivers/acpi/numa/srat.c | 32 ++++++++++++++++++++++++++------
> 1 file changed, 26 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
> index 1f4fc5f8a819..f41b65e9b085 100644
> --- a/drivers/acpi/numa/srat.c
> +++ b/drivers/acpi/numa/srat.c
> @@ -301,27 +301,47 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
> static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
> void *arg, const unsigned long table_end)
> {
> + int node, found_node, *fake_pxm = arg;
> struct acpi_cedt_cfmws *cfmws;
> - int *fake_pxm = arg;
> u64 start, end;
> - int node;
>
> cfmws = (struct acpi_cedt_cfmws *)header;
> start = cfmws->base_hpa;
> end = cfmws->base_hpa + cfmws->window_size;
>
> - /* Skip if the SRAT already described the NUMA details for this HPA */
> - node = phys_to_target_node(start);
> - if (node != NUMA_NO_NODE)
> + /*
> + * The SRAT may have already described the NUMA details for
> + * this CFMWS HPA range, yet it may not have created memblks
> + * for the entire range. Look for a node with a memblk covering
> + * any part of the HPA range. Don't bother figuring out if it
> + * is partially or wholly described. Replace any memblks in the
> + * range with one single memblk that covers the entire range.
> + *
> + * This preserves the SRAT defined node and Proximity Domain.
> + */
> +
> + found_node = numa_find_node(start, end);
> + if (found_node != NUMA_NO_NODE) {
> + numa_remove_memblks(found_node, start, end);
> + if (numa_add_memblk(found_node, start, end) < 0) {
I worry that this blows away information in the case where the BIOS for
some reason decides to describe multiple proximity domains per CXL
window.
I think this wants to be an algorithm like
numa_fill_memblks(start, end)
...where that walks the range and, for any gaps in that range,
extends the last memblk to the end of the gap.
For example, if the CFMWS to SRAT mapping is this:
┌──────────────────────────────────────────────────┐
│ WINDOW0 │
├──────────┬────────────┬────────────┬─────────────┤
│PXM0 │ GAP │ PXM1 │ GAP │
└──────────┴────────────┴────────────┴─────────────┘
┌──────────────────────────────────────────────────┐
│ WINDOW0 │
├───────────────────────┬──────────────────────────┤
│PXM0 │ PXM1 │
└───────────────────────┴──────────────────────────┘
...and if that finds nothing to fill, *then* create the next proximity
domain for the window.
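A rough sketch of what such a helper might look like, assuming the
numa_meminfo/numa_memblk structures used in patches 1 and 2. Only the
numa_fill_memblks() name comes from the suggestion above; the overlap
test, the choice of which neighbor absorbs an interior gap, and the
-ENOENT return are illustrative rather than a posted implementation:
#include <linux/sort.h>	/* sort() */
static struct numa_memblk *blks[NR_NODE_MEMBLKS] __initdata;
static int __init cmp_memblk_start(const void *a, const void *b)
{
	const struct numa_memblk *ma = *(const struct numa_memblk **)a;
	const struct numa_memblk *mb = *(const struct numa_memblk **)b;

	return (ma->start > mb->start) - (ma->start < mb->start);
}
/* Extend SRAT-defined memblks to cover the whole start/end window */
static int __init numa_fill_memblks(u64 start, u64 end)
{
	struct numa_meminfo *mi = &numa_meminfo;
	u64 prev_end;
	int count = 0;

	/* Collect the memblks that overlap the window */
	for (int i = 0; i < mi->nr_blks; i++) {
		struct numa_memblk *bi = &mi->blk[i];

		if (bi->start < end && bi->end > start)
			blks[count++] = bi;
	}
	if (!count)
		return -ENOENT;	/* no SRAT coverage, caller fakes a node */

	/* Walk the window in address order */
	sort(blks, count, sizeof(blks[0]), cmp_memblk_start, NULL);

	/* Stretch the first and last memblks to the window edges */
	blks[0]->start = min(blks[0]->start, start);
	blks[count - 1]->end = max(blks[count - 1]->end, end);

	/* Pull each memblk's start back to close any interior gap */
	prev_end = blks[0]->end;
	for (int i = 1; i < count; i++) {
		if (blks[i]->start > prev_end)
			blks[i]->start = prev_end;
		prev_end = max(prev_end, blks[i]->end);
	}
	return 0;
}
With something along those lines, acpi_parse_cfmws() would try the fill
first, and only fall back to the fake proximity domain path when the
helper reports that no memblk overlaps the window.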
> + /* CXL driver must handle the NUMA_NO_NODE case */
> + pr_warn("ACPI NUMA: failed to add memblk for CFMWS node %d [mem %#llx-%#llx]\n",
> + found_node, start, end);
> + }
> return 0;
> + }
>
> + /*
> + * SRAT did not describe this window at all.
> + * Create a new node with a fake proximity domain. Add a
> + * memblk covering the entire HPA range.
> + */
> node = acpi_map_pxm_to_node(*fake_pxm);
>
> if (node == NUMA_NO_NODE) {
> pr_err("ACPI NUMA: Too many proximity domains while processing CFMWS.\n");
> return -EINVAL;
> }
> -
Unrelated whitespace change.
> if (numa_add_memblk(node, start, end) < 0) {
> /* CXL driver must handle the NUMA_NO_NODE case */
> pr_warn("ACPI NUMA: Failed to add memblk for CFMWS node %d [mem %#llx-%#llx]\n",
> --
> 2.37.3
>
* Re: [RFC 3/3] ACPI: NUMA: Apply SRAT proximity domain to entire CFMWS window
2023-05-11 23:16 ` Dan Williams
@ 2023-05-12 0:01 ` Alison Schofield
2023-05-12 0:18 ` Dan Williams
0 siblings, 1 reply; 12+ messages in thread
From: Alison Schofield @ 2023-05-12 0:01 UTC (permalink / raw)
To: Dan Williams; +Cc: Ira Weiny, Vishal Verma, Dave Jiang, Ben Widawsky, linux-cxl
On Thu, May 11, 2023 at 04:16:53PM -0700, Dan Williams wrote:
> alison.schofield@ wrote:
snip...
> > + /*
> > + * The SRAT may have already described the NUMA details for
> > + * this CFMWS HPA range, yet it may not have created memblks
> > + * for the entire range. Look for a node with a memblk covering
> > + * any part of the HPA range. Don't bother figuring out if it
> > + * is partially or wholly described. Replace any memblks in the
> > + * range with one single memblk that covers the entire range.
> > + *
> > + * This preserves the SRAT defined node and Proximity Domain.
> > + */
> > +
> > + found_node = numa_find_node(start, end);
> > + if (found_node != NUMA_NO_NODE) {
> > + numa_remove_memblks(found_node, start, end);
> > + if (numa_add_memblk(found_node, start, end) < 0) {
>
> I worry that this blows away information in the case where the BIOS for
> some reason decides to describe multiple proximity domains per CXL
> window.
>
> I think this wants to be an algorithm like
>
> numa_fill_memblks(start, end)
>
> ...where that walks the range and, for any gaps in that range,
> extends the last memblk to the end of the gap.
>
> For example, if the CFMWS to SRAT mapping is this:
>
> ┌──────────────────────────────────────────────────┐
> │ WINDOW0 │
> ├──────────┬────────────┬────────────┬─────────────┤
> │PXM0 │ GAP │ PXM1 │ GAP │
> └──────────┴────────────┴────────────┴─────────────┘
>
>
> ┌──────────────────────────────────────────────────┐
> │ WINDOW0 │
> ├───────────────────────┬──────────────────────────┤
> │PXM0 │ PXM1 │
> └───────────────────────┴──────────────────────────┘
>
> ...and if that finds nothing to fill, *then* create the next proximity
> domain for the window.
Thanks for the feedback, Dan.
Is there any guarantee that the first PXM begins at cfmws.start?
I.e., that gaps always 'follow'?
This one's easy: gap-pxm0 becomes all pxm0
This one's easy too: gap-pxm0-gap becomes all pxm0
This one's not so obvious: gap-pxm0-gap-pxm1
Imagine those are quarters. Is the end result 50/50 or 75/25?
I'm wondering if there is any policy in existence, or if this is new
and we are setting the policy now?
Alison
>
>
> > + /* CXL driver must handle the NUMA_NO_NODE case */
> > + pr_warn("ACPI NUMA: failed to add memblk for CFMWS node %d [mem %#llx-%#llx]\n",
> > + found_node, start, end);
> > + }
> > return 0;
> > + }
> >
> > + /*
> > + * SRAT did not describe this window at all.
> > + * Create a new node with a fake proximity domain. Add a
> > + * memblk covering the entire HPA range.
> > + */
> > node = acpi_map_pxm_to_node(*fake_pxm);
> >
> > if (node == NUMA_NO_NODE) {
> > pr_err("ACPI NUMA: Too many proximity domains while processing CFMWS.\n");
> > return -EINVAL;
> > }
> > -
>
> Unrelated whitespace change.
>
> > if (numa_add_memblk(node, start, end) < 0) {
> > /* CXL driver must handle the NUMA_NO_NODE case */
> > pr_warn("ACPI NUMA: Failed to add memblk for CFMWS node %d [mem %#llx-%#llx]\n",
> > --
> > 2.37.3
> >
>
>
* Re: [RFC 3/3] ACPI: NUMA: Apply SRAT proximity domain to entire CFMWS window
2023-05-12 0:01 ` Alison Schofield
@ 2023-05-12 0:18 ` Dan Williams
2023-05-12 16:45 ` Jonathan Cameron
0 siblings, 1 reply; 12+ messages in thread
From: Dan Williams @ 2023-05-12 0:18 UTC (permalink / raw)
To: Alison Schofield, Dan Williams
Cc: Ira Weiny, Vishal Verma, Dave Jiang, Ben Widawsky, linux-cxl
Alison Schofield wrote:
> On Thu, May 11, 2023 at 04:16:53PM -0700, Dan Williams wrote:
> > alison.schofield@ wrote:
>
> snip...
>
> > > + /*
> > > + * The SRAT may have already described the NUMA details for
> > > + * this CFMWS HPA range, yet it may not have created memblks
> > > + * for the entire range. Look for a node with a memblk covering
> > > + * any part of the HPA range. Don't bother figuring out if it
> > > + * is partially or wholly described. Replace any memblks in the
> > > + * range with one single memblk that covers the entire range.
> > > + *
> > > + * This preserves the SRAT defined node and Proximity Domain.
> > > + */
> > > +
> > > + found_node = numa_find_node(start, end);
> > > + if (found_node != NUMA_NO_NODE) {
> > > + numa_remove_memblks(found_node, start, end);
> > > + if (numa_add_memblk(found_node, start, end) < 0) {
> >
> > I worry that this blows away information in the case where the BIOS for
> > some reason decides to describe multiple proximity domains per CXL
> > window.
> >
> > I think this wants to be an algorithm like
> >
> > numa_fill_memblks(start, end)
> >
> > ...where that walks the range and, for any gaps in that range,
> > extends the last memblk to the end of the gap.
> >
> > For example, if the CFMWS to SRAT mapping is this:
> >
> > ┌──────────────────────────────────────────────────┐
> > │ WINDOW0 │
> > ├──────────┬────────────┬────────────┬─────────────┤
> > │PXM0 │ GAP │ PXM1 │ GAP │
> > └──────────┴────────────┴────────────┴─────────────┘
> >
> >
> > ┌──────────────────────────────────────────────────┐
> > │ WINDOW0 │
> > ├───────────────────────┬──────────────────────────┤
> > │PXM0 │ PXM1 │
> > └───────────────────────┴──────────────────────────┘
> >
> > ...and if that finds nothing to fill, *then* create the next proximity
> > domain for the window.
>
> Thanks for the feedback, Dan.
>
> Is there any guarantee that the first PXM begins at cfmws.start?
> I.e., that gaps always 'follow'?
>
> This one's easy: gap-pxm0 becomes all pxm0
> This one's easy too: gap-pxm0-gap becomes all pxm0
>
> This one's not so obvious: gap-pxm0-gap-pxm1
Good point.
I would say that pxm0 consuming both gaps is ok. pxm1 grabbing the
second gap is also ok, whatever makes the code simpler.
> Imagine those are quarters. Is the end result 50/50 or 75/25?
The window is all one QoS class anyway as far as CXL is concerned. So
the end distribution is not much of a concern. I.e. I would not worry if
you end up with 75/25 even if 50/50 was possible.
> I'm wondering if there is any policy in existence, or if this is new
> and we are setting the policy now?
We are trailblazing on our own. The reason why I think we have the
freedom to make an arbitrary decision here is that it is unlikely that
the BIOS describes multiple proximity domains per window, but on the
off chance it does, let's not throw away information.
* Re: [RFC 0/3] Apply SRAT defined PXM to entire CFMWS
2023-05-11 22:42 ` [RFC 0/3] Apply SRAT defined PXM to entire CFMWS Dan Williams
@ 2023-05-12 16:33 ` Jonathan Cameron
0 siblings, 0 replies; 12+ messages in thread
From: Jonathan Cameron @ 2023-05-12 16:33 UTC (permalink / raw)
To: Dan Williams
Cc: alison.schofield, Ira Weiny, Vishal Verma, Dave Jiang,
Ben Widawsky, linux-cxl
On Thu, 11 May 2023 15:42:39 -0700
Dan Williams <dan.j.williams@intel.com> wrote:
> alison.schofield@ wrote:
> > From: Alison Schofield <alison.schofield@intel.com>
> >
> > The patch that created fake NUMA nodes for CFMWS windows
>
> "fake" seems the wrong word to me. These are first class Linux NUMA
> nodes in the end. So to me it's the "patch that extended SRAT proximity
> domains by the potential performance-class windows in the CFMWS", or
> something like that.
>
> > not defined in the SRAT made wrong assumptions about the
> > SRAT-defined entries. Specifically, it assumed an SRAT entry
> > using cfmws->start covers the entire CFMWS range, start through
> > end. That assumption is wrong, so the ACPI driver needs
>
> I also wouldn't say it was wrong, just incomplete. I.e. this is just
> extending the heuristic to say that if the BIOS describes *any* portion
> of a CFMWS window with a proximity domain, might as well reuse that
> proximity domain for the entire window since everything in that window
> is expected to be of a similar performance class.
Some strong assumptions in this statement.
A platform might not care about QTG groups because it's not doing any
fancy QoS stuff (or at least not enough of it to cover 'similar'
performance). At that point, you could have just a few CFMWS entries
with a wide range of different characteristics. At some point I think
we'll need to do something more sophisticated to cover that case.
Maybe not yet, though.
The moment we have platforms doing interleave (or not) in the host
bridges, we will get very different bandwidth at least, potentially to
different parts of identical memory on the same device. What fun ;)
Jonathan
>
> > to examine the SRAT-created memblks more closely to discover
> > partial definitions of the HPA range.
> >
> > This work-in-progress addresses that issue. The first 2 patches
> > introduce numa helpers that are used in the 3rd patch, where the
> > ACPI driver's parsing of the CFMWS is updated.
> >
> > The patch commit logs, especially Patch 3's, describe more
> > of the approach, other approaches considered, and open
> > questions. So, perhaps, scan patches 1 & 2, then dive into #3 and
> > confirm or refute this approach.
> >
> > I did not include our NUMA or ACPI friends in this posting,
> > because I want to get a direction check from CXL folks before
> > addressing how the helpers can get merged into the NUMA arch.
> >
> > Thanks for looking!
> >
> > Alison Schofield (3):
> > x86/numa: Introduce numa_find_node(start, end)
> > x86/numa: Introduce numa_remove_memblks(node, start, end)
> > ACPI: NUMA: Apply SRAT proximity domain to entire CFMWS window
> >
> > arch/x86/include/asm/numa.h | 2 ++
> > arch/x86/mm/numa.c | 36 ++++++++++++++++++++++++++++++++++++
> > drivers/acpi/numa/srat.c | 32 ++++++++++++++++++++++++++------
> > 3 files changed, 64 insertions(+), 6 deletions(-)
> >
> > --
> > 2.37.3
> >
>
>
* Re: [RFC 3/3] ACPI: NUMA: Apply SRAT proximity domain to entire CFMWS window
2023-05-12 0:18 ` Dan Williams
@ 2023-05-12 16:45 ` Jonathan Cameron
0 siblings, 0 replies; 12+ messages in thread
From: Jonathan Cameron @ 2023-05-12 16:45 UTC (permalink / raw)
To: Dan Williams
Cc: Alison Schofield, Ira Weiny, Vishal Verma, Dave Jiang,
Ben Widawsky, linux-cxl
On Thu, 11 May 2023 17:18:27 -0700
Dan Williams <dan.j.williams@intel.com> wrote:
> Alison Schofield wrote:
> > On Thu, May 11, 2023 at 04:16:53PM -0700, Dan Williams wrote:
> > > alison.schofield@ wrote:
> >
> > snip...
> >
> > > > + /*
> > > > + * The SRAT may have already described the NUMA details for
> > > > + * this CFMWS HPA range, yet it may not have created memblks
> > > > + * for the entire range. Look for a node with a memblk covering
> > > > + * any part of the HPA range. Don't bother figuring out if it
> > > > + * is partially or wholly described. Replace any memblks in the
> > > > + * range with one single memblk that covers the entire range.
> > > > + *
> > > > + * This preserves the SRAT defined node and Proximity Domain.
> > > > + */
> > > > +
> > > > + found_node = numa_find_node(start, end);
> > > > + if (found_node != NUMA_NO_NODE) {
> > > > + numa_remove_memblks(found_node, start, end);
> > > > + if (numa_add_memblk(found_node, start, end) < 0) {
> > >
> > > I worry that this blows away information in the case where the BIOS for
> > > some reason decides to describe multiple proximity domains per CXL
> > > window.
> > >
> > > I think this wants to be an algorithm like
> > >
> > > numa_fill_memblks(start, end)
> > >
> > > ...where that walks the range and, for any gaps in that range,
> > > extends the last memblk to the end of the gap.
> > >
> > > For example, if the CFMWS to SRAT mapping is this:
> > >
> > > ┌──────────────────────────────────────────────────┐
> > > │ WINDOW0 │
> > > ├──────────┬────────────┬────────────┬─────────────┤
> > > │PXM0 │ GAP │ PXM1 │ GAP │
> > > └──────────┴────────────┴────────────┴─────────────┘
> > >
> > >
> > > ┌──────────────────────────────────────────────────┐
> > > │ WINDOW0 │
> > > ├───────────────────────┬──────────────────────────┤
> > > │PXM0 │ PXM1 │
> > > └───────────────────────┴──────────────────────────┘
> > >
> > > ...and if that finds nothing to fill, *then* create the next proximity
> > > domain for the window.
> >
> > Thanks for the feedback, Dan.
> >
> > Is there any guarantee that the first PXM begins at cfmws.start?
> > I.e., that gaps always 'follow'?
> >
> > This one's easy: gap-pxm0 becomes all pxm0
> > This one's easy too: gap-pxm0-gap becomes all pxm0
> >
> > This one's not so obvious: gap-pxm0-gap-pxm1
>
> Good point.
>
> I would say that pxm0 consuming both gaps is ok. pxm1 grabbing the
> second gap is also ok, whatever makes the code simpler.
Agreed - it's best effort on a heuristic. My guess is we'll end up
revisiting this when hotplug means we see similar patterns to those
that led a BIOS to do the above, but this is fine for now.
>
> > Imagine those are quarters. Is the end result 50/50 or 75/25?
>
> The window is all one QoS class anyway as far as CXL is concerned. So
> the end distribution is not much of a concern. I.e. I would not worry if
> you end up with 75/25 even if 50/50 was possible.
I don't like tying QoS class to a narrow range of access characteristics
but that's a problem for another day.
>
> > I'm wondering if there is any policy in existence, or if this is new
> > and we are setting the policy now?
>
> We are trailblazing on our own. The reason why I think we have the
> freedom to make an arbitrary decision here is that it is unlikely that
> the BIOS describes multiple proximity domains per window, but on the
> off chance it does, let's not throw away information.
Agree with not throwing it away; I'm fairly sure you'll see this :)
If nothing else, it's handy for a system that doesn't do interleaving
across devices for some reason, where the OS is then going to use NUMA
interleave. So it can be useful to have multiple proximity domains even
for identical memory if the access paths are different.
Ha! I see an excuse to reuse old work:
https://github.com/hisilicon/acpi-numa-whitepaper/releases/tag/v0.93
See section 5.1.
I should bring that back one day when things are quieter (so likely
never). It was targeted as an ACPI whitepaper and was out for review
when something happened (check the date and who I work for if curious ;)
Jonathan