[PATCH 0/2] CXL: Apply SRAT defined PXM to entire CFMWS window

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH 0/2] CXL: Apply SRAT defined PXM to entire CFMWS window
@ 2023-05-19  0:04 alison.schofield
  2023-05-19  0:04 ` [PATCH 1/2] x86/numa: Introduce numa_fill_memblks() alison.schofield
  2023-05-19  0:04 ` [PATCH 2/2] ACPI: NUMA: Apply SRAT proximity domain to entire CFMWS window alison.schofield
  0 siblings, 2 replies; 12+ messages in thread
From: alison.schofield @ 2023-05-19  0:04 UTC (permalink / raw)
  To: Rafael J. Wysocki, Len Brown, Dan Williams, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Peter Zijlstra, Andrew Morton, Jonathan Cameron,
	Dave Jiang
  Cc: Alison Schofield, x86, linux-cxl, linux-acpi, linux-kernel

From: Alison Schofield <alison.schofield@intel.com>

The CXL subsystem requires the creation of NUMA nodes for CFMWS
Windows not described in the SRAT. The existing implementation
only addresses windows that the SRAT describes completely or
not at all. This work addresses the case of partially described
CFMWS Windows by extending proximity domains in a portion of
a CFMWS window to the entire window.

Introduce a NUMA helper, numa_fill_memblks(), to fill gaps in
a numa_meminfo memblk address range. Update the CFMWS parsing
in the ACPI driver to use numa_fill_memblks() to extend SRAT
defined proximity domains to entire CXL windows.

An RFC of this patchset was previously posted for CXL folks
review.[1] The RFC feedback led to the implementation here,
extending existing memblks (Dan). Also, both Jonathan and
Dan influenced the changelog comments in the ACPI patch, with
regards to setting expectations on this evolving heuristic.

Repeating here to set reviewer expectations:
*Note that this heuristic will evolve when CFMWS Windows present
a wider range of characteristics. The extension of the proximity
domain, implemented here, is likely a step in developing a more
sophisticated performance profile in the future.

[1] https://lore.kernel.org/linux-cxl/cover.1683742429.git.alison.schofield@intel.com/

Alison Schofield (2):
  x86/numa: Introduce numa_fill_memblks()
  ACPI: NUMA: Apply SRAT proximity domain to entire CFMWS window

 arch/x86/include/asm/sparsemem.h |  2 +
 arch/x86/mm/numa.c               | 82 ++++++++++++++++++++++++++++++++
 drivers/acpi/numa/srat.c         | 11 +++--
 include/linux/numa.h             |  7 +++
 4 files changed, 99 insertions(+), 3 deletions(-)

base-commit: f81d8f759e7f80c643027e631c586369836aac90
-- 
2.37.3

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 1/2] x86/numa: Introduce numa_fill_memblks()
  2023-05-19  0:04 [PATCH 0/2] CXL: Apply SRAT defined PXM to entire CFMWS window alison.schofield
@ 2023-05-19  0:04 ` alison.schofield
  2023-05-19  0:08   ` Dave Hansen
  2023-06-03 23:53   ` Dan Williams
  2023-05-19  0:04 ` [PATCH 2/2] ACPI: NUMA: Apply SRAT proximity domain to entire CFMWS window alison.schofield
  1 sibling, 2 replies; 12+ messages in thread
From: alison.schofield @ 2023-05-19  0:04 UTC (permalink / raw)
  To: Rafael J. Wysocki, Len Brown, Dan Williams, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Peter Zijlstra, Andrew Morton, Jonathan Cameron,
	Dave Jiang
  Cc: Alison Schofield, x86, linux-cxl, linux-acpi, linux-kernel

From: Alison Schofield <alison.schofield@intel.com>

numa_fill_memblks() fills in the gaps in numa_meminfo memblks
over an HPA address range.

The initial use case is the ACPI driver that needs to extend
SRAT defined proximity domains to an entire CXL CFMWS Window[1].

The APCI driver expects to use numa_fill_memblks() while parsing
the CFMWS. Extending the memblks created during SRAT parsing, to
cover the entire CFMWS Window, is desirable because everything in
a CFMWS Window is expected to be of a similar performance class.

Requires CONFIG_NUMA_KEEP_MEMINFO.

[1] A CXL CFMWS Window represents a contiguous CXL memory resource,
aka an HPA range. The CFMWS (CXL Fixed Memory Window Structure) is
part of the ACPI CEDT (CXL Early Discovery Table).

Signed-off-by: Alison Schofield <alison.schofield@intel.com>
---
 arch/x86/include/asm/sparsemem.h |  2 +
 arch/x86/mm/numa.c               | 82 ++++++++++++++++++++++++++++++++
 include/linux/numa.h             |  7 +++
 3 files changed, 91 insertions(+)

diff --git a/arch/x86/include/asm/sparsemem.h b/arch/x86/include/asm/sparsemem.h
index 64df897c0ee3..1be13b2dfe8b 100644
--- a/arch/x86/include/asm/sparsemem.h
+++ b/arch/x86/include/asm/sparsemem.h
@@ -37,6 +37,8 @@ extern int phys_to_target_node(phys_addr_t start);
 #define phys_to_target_node phys_to_target_node
 extern int memory_add_physaddr_to_nid(u64 start);
 #define memory_add_physaddr_to_nid memory_add_physaddr_to_nid
+extern int numa_fill_memblks(u64 start, u64 end);
+#define numa_fill_memblks numa_fill_memblks
 #endif
 #endif /* __ASSEMBLY__ */
 
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 2aadb2019b4f..6c8f9cff71da 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -11,6 +11,7 @@
 #include <linux/nodemask.h>
 #include <linux/sched.h>
 #include <linux/topology.h>
+#include <linux/sort.h>
 
 #include <asm/e820/api.h>
 #include <asm/proto.h>
@@ -961,4 +962,85 @@ int memory_add_physaddr_to_nid(u64 start)
 	return nid;
 }
 EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
+
+static int __init cmp_memblk(const void *a, const void *b)
+{
+	const struct numa_memblk *ma = *(const struct numa_memblk **)a;
+	const struct numa_memblk *mb = *(const struct numa_memblk **)b;
+
+	if (ma->start != mb->start)
+		return (ma->start < mb->start) ? -1 : 1;
+
+	if (ma->end != mb->end)
+		return (ma->end < mb->end) ? -1 : 1;
+
+	return 0;
+}
+
+static struct numa_memblk *numa_memblk_list[NR_NODE_MEMBLKS] __initdata;
+
+/**
+ * numa_fill_memblks - Fill gaps in numa_meminfo memblks
+ * @start: address to begin fill
+ * @end: address to end fill
+ *
+ * Find and extend numa_meminfo memblks to cover the @start/@end
+ * HPA address range, following these rules:
+ * 1. The first memblk must start at @start
+ * 2. The last memblk must end at @end
+ * 3. Fill the gaps between memblks by extending numa_memblk.end
+ * Result: All addresses in start/end range are included in
+ *	   numa_meminfo.
+ *
+ * RETURNS:
+ * 0		  : Success. numa_meminfo fully describes start/end
+ * NUMA_NO_MEMBLK : No memblk exists in start/end range
+ */
+
+int __init numa_fill_memblks(u64 start, u64 end)
+{
+	struct numa_meminfo *mi = &numa_meminfo;
+	struct numa_memblk **blk = &numa_memblk_list[0];
+	int count = 0;
+
+	for (int i = 0; i < mi->nr_blks; i++) {
+		struct numa_memblk *bi = &mi->blk[i];
+
+		if (start <= bi->start && end >= bi->end) {
+			blk[count] = &mi->blk[i];
+			count++;
+		}
+	}
+	if (!count)
+		return NUMA_NO_MEMBLK;
+
+	if (count == 1) {
+		blk[0]->start = start;
+		blk[0]->end = end;
+		return 0;
+	}
+
+	sort(&blk[0], count, sizeof(blk[0]), cmp_memblk, NULL);
+	blk[0]->start = start;
+	blk[count - 1]->end = end;
+
+	for (int i = 0, j = 1; j < count; i++, j++) {
+		/* Overlaps OK. sort() put the lesser end first */
+		if (blk[i]->start == blk[j]->start)
+			continue;
+
+		/* No gap */
+		if (blk[i]->end == blk[j]->start)
+			continue;
+
+		/* Fill the gap */
+		if (blk[i]->end < blk[j]->start) {
+			blk[i]->end = blk[j]->start;
+			continue;
+		}
+	}
+	return 0;
+}
+EXPORT_SYMBOL_GPL(numa_fill_memblks);
+
 #endif
diff --git a/include/linux/numa.h b/include/linux/numa.h
index 59df211d051f..0f512c0aba54 100644
--- a/include/linux/numa.h
+++ b/include/linux/numa.h
@@ -12,6 +12,7 @@
 #define MAX_NUMNODES    (1 << NODES_SHIFT)
 
 #define	NUMA_NO_NODE	(-1)
+#define	NUMA_NO_MEMBLK	(-1)
 
 /* optionally keep NUMA memory info available post init */
 #ifdef CONFIG_NUMA_KEEP_MEMINFO
@@ -43,6 +44,12 @@ static inline int phys_to_target_node(u64 start)
 	return 0;
 }
 #endif
+#ifndef numa_fill_memblks
+static inline int __init numa_fill_memblks(u64 start, u64 end)
+{
+	return NUMA_NO_MEMBLK;
+}
+#endif
 #else /* !CONFIG_NUMA */
 static inline int numa_map_to_online_node(int node)
 {
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] x86/numa: Introduce numa_fill_memblks()
  2023-05-19  0:04 ` [PATCH 1/2] x86/numa: Introduce numa_fill_memblks() alison.schofield
@ 2023-05-19  0:08   ` Dave Hansen
  2023-05-19  0:26     ` Alison Schofield
  2023-06-03 23:53   ` Dan Williams
  1 sibling, 1 reply; 12+ messages in thread
From: Dave Hansen @ 2023-05-19  0:08 UTC (permalink / raw)
  To: alison.schofield, Rafael J. Wysocki, Len Brown, Dan Williams,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra, Andrew Morton,
	Jonathan Cameron, Dave Jiang
  Cc: x86, linux-cxl, linux-acpi, linux-kernel

On 5/18/23 17:04, alison.schofield@intel.com wrote:
> The initial use case is the ACPI driver that needs to extend
> SRAT defined proximity domains to an entire CXL CFMWS Window[1].

Dumb question time: Why didn't the SRAT just cover this sucker in the
first place?  Are we fixing up a BIOS bug or is there a legitimate
reason that the SRAT didn't cover it up front?



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] x86/numa: Introduce numa_fill_memblks()
  2023-05-19  0:08   ` Dave Hansen
@ 2023-05-19  0:26     ` Alison Schofield
  2023-05-19  0:40       ` Dan Williams
  2023-05-19  0:43       ` Dave Hansen
  0 siblings, 2 replies; 12+ messages in thread
From: Alison Schofield @ 2023-05-19  0:26 UTC (permalink / raw)
  To: Dave Hansen
  Cc: Rafael J. Wysocki, Len Brown, Dan Williams, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Peter Zijlstra, Andrew Morton, Jonathan Cameron,
	Dave Jiang, x86, linux-cxl, linux-acpi, linux-kernel

On Thu, May 18, 2023 at 05:08:16PM -0700, Dave Hansen wrote:
> On 5/18/23 17:04, alison.schofield@intel.com wrote:
> > The initial use case is the ACPI driver that needs to extend
> > SRAT defined proximity domains to an entire CXL CFMWS Window[1].
> 
> Dumb question time: Why didn't the SRAT just cover this sucker in the
> first place?  Are we fixing up a BIOS bug or is there a legitimate
> reason that the SRAT didn't cover it up front?
> 
>

There is no requirement that the BIOS describe (in the SRAT) all the
HPA assigned to a CFMWS Window. The HPA range may not actually map to
any memory at boot time. It can be persistent capacity or may be there
to enable hot-plug. IIUC BIOS can pick and choose and define volatile
regions wherever it pleases.

So, no we're not fixing up a BIOS bug, nor doing a BIOS sanity check.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] x86/numa: Introduce numa_fill_memblks()
  2023-05-19  0:26     ` Alison Schofield
@ 2023-05-19  0:40       ` Dan Williams
  2023-05-19  0:43       ` Dave Hansen
  1 sibling, 0 replies; 12+ messages in thread
From: Dan Williams @ 2023-05-19  0:40 UTC (permalink / raw)
  To: Alison Schofield, Dave Hansen
  Cc: Rafael J. Wysocki, Len Brown, Dan Williams, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Peter Zijlstra, Andrew Morton, Jonathan Cameron,
	Dave Jiang, x86, linux-cxl, linux-acpi, linux-kernel

Alison Schofield wrote:
> On Thu, May 18, 2023 at 05:08:16PM -0700, Dave Hansen wrote:
> > On 5/18/23 17:04, alison.schofield@intel.com wrote:
> > > The initial use case is the ACPI driver that needs to extend
> > > SRAT defined proximity domains to an entire CXL CFMWS Window[1].
> > 
> > Dumb question time: Why didn't the SRAT just cover this sucker in the
> > first place?  Are we fixing up a BIOS bug or is there a legitimate
> > reason that the SRAT didn't cover it up front?
> > 
> >
> 
> There is no requirement that the BIOS describe (in the SRAT) all the
> HPA assigned to a CFMWS Window. The HPA range may not actually map to
> any memory at boot time. It can be persistent capacity or may be there
> to enable hot-plug. IIUC BIOS can pick and choose and define volatile
> regions wherever it pleases.
> 
> So, no we're not fixing up a BIOS bug, nor doing a BIOS sanity check.
> 

Another way to think about it is that CXL is dynamic and SRAT is static.
ACPI hotplug assumes you're just onlining an address range that was
declared offline in SRAT. CXL hotplug allows various capacities and
performance types to be added. So, how many potential proximity domains
performance targets could exist in a given CXL window? It depends, and
because it depends BIOS goes hands off and lets the OS define that
policy.

The Linux policy for now is just keep it simple. Add a proximity domain
for every unmapped (no SRAT intersections) CXL Window, and put the onus
on the platform owner to assign devices of similar performance to a
given CXL window to keep the proximity domain proliferation to a
minimum.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] x86/numa: Introduce numa_fill_memblks()
  2023-05-19  0:26     ` Alison Schofield
  2023-05-19  0:40       ` Dan Williams
@ 2023-05-19  0:43       ` Dave Hansen
  2023-05-19  1:56         ` Dan Williams
  1 sibling, 1 reply; 12+ messages in thread
From: Dave Hansen @ 2023-05-19  0:43 UTC (permalink / raw)
  To: Alison Schofield
  Cc: Rafael J. Wysocki, Len Brown, Dan Williams, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Peter Zijlstra, Andrew Morton, Jonathan Cameron,
	Dave Jiang, x86, linux-cxl, linux-acpi, linux-kernel

On 5/18/23 17:26, Alison Schofield wrote:
> On Thu, May 18, 2023 at 05:08:16PM -0700, Dave Hansen wrote:
>> On 5/18/23 17:04, alison.schofield@intel.com wrote:
>>> The initial use case is the ACPI driver that needs to extend
>>> SRAT defined proximity domains to an entire CXL CFMWS Window[1].
>>
>> Dumb question time: Why didn't the SRAT just cover this sucker in the
>> first place?  Are we fixing up a BIOS bug or is there a legitimate
>> reason that the SRAT didn't cover it up front?
>>
> There is no requirement that the BIOS describe (in the SRAT) all the
> HPA assigned to a CFMWS Window. The HPA range may not actually map to
> any memory at boot time. It can be persistent capacity or may be there
> to enable hot-plug. IIUC BIOS can pick and choose and define volatile
> regions wherever it pleases.

I understand that it _can_ do this.  I'm trying to get to the reasoning
of why.

Is this essentially so that the physical address space doesn't have to
be *committed* to a single use up front?  For RAM, I guess this wasn't a
problem because there was only a finite amount of RAM that could get
hotplugged into a single node.

But with these fancy schmancy new devices, it's really hard to figure
out how much space will show up and what performance it will have until
you actually start poking at it.  The firmware wasn't _quite_ sure how
it wanted to burn the physical address space at the time the SRAT was
created.  But, now it knows, and this is handling the case where the
firmware only expands an adjacent chunk of physical address space.

Close?

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] x86/numa: Introduce numa_fill_memblks()
  2023-05-19  0:43       ` Dave Hansen
@ 2023-05-19  1:56         ` Dan Williams
  0 siblings, 0 replies; 12+ messages in thread
From: Dan Williams @ 2023-05-19  1:56 UTC (permalink / raw)
  To: Dave Hansen, Alison Schofield
  Cc: Rafael J. Wysocki, Len Brown, Dan Williams, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Peter Zijlstra, Andrew Morton, Jonathan Cameron,
	Dave Jiang, x86, linux-cxl, linux-acpi, linux-kernel

Dave Hansen wrote:
> On 5/18/23 17:26, Alison Schofield wrote:
> > On Thu, May 18, 2023 at 05:08:16PM -0700, Dave Hansen wrote:
> >> On 5/18/23 17:04, alison.schofield@intel.com wrote:
> >>> The initial use case is the ACPI driver that needs to extend
> >>> SRAT defined proximity domains to an entire CXL CFMWS Window[1].
> >>
> >> Dumb question time: Why didn't the SRAT just cover this sucker in the
> >> first place?  Are we fixing up a BIOS bug or is there a legitimate
> >> reason that the SRAT didn't cover it up front?
> >>
> > There is no requirement that the BIOS describe (in the SRAT) all the
> > HPA assigned to a CFMWS Window. The HPA range may not actually map to
> > any memory at boot time. It can be persistent capacity or may be there
> > to enable hot-plug. IIUC BIOS can pick and choose and define volatile
> > regions wherever it pleases.
> 
> I understand that it _can_ do this.  I'm trying to get to the reasoning
> of why.
> 
> Is this essentially so that the physical address space doesn't have to
> be *committed* to a single use up front?  For RAM, I guess this wasn't a
> problem because there was only a finite amount of RAM that could get
> hotplugged into a single node.

Right, for RAM the hotplug degrees of freedom was predetermined by the
platform definition.

> But with these fancy schmancy new devices, it's really hard to figure
> out how much space will show up and what performance it will have until
> you actually start poking at it.

It's less "until actually start poking at it" and more the BIOS just
declines to poke at some CXL topologies at boot, and does not poke
post-boot.

> The firmware wasn't _quite_ sure how
> it wanted to burn the physical address space at the time the SRAT was
> created.  But, now it knows, and this is handling the case where the
> firmware only expands an adjacent chunk of physical address space.

For devices that are present at boot the BIOS mostly does the right
thing and just maps them into the EFI memory map and produces all the
other ACPI collateral. For devices that are added after boot, or devices
that fall outside of a configuration that the BIOS is prepared to handle
it just creates a CXL Window with empty capacity and says "OS, you take
it from here. Here's some physical address space you can map things,
good luck!"

Compare that to ACPI hotplug where the platform knows about a
preconfigured amount of memory that might come online later, and can
produce all the relevant ACPI collateral upfront.

In other forums I have advocated against SRAT covering the unmapped
capacity of a CXL window because of the lies that firmware would need to
convey in the HMAT and SLIT for those empty proximity domains. The CXL
specification provides for an architectural way to get all the
information about a memory range that previously had to be packaged up
into an ACPI table.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH 1/2] x86/numa: Introduce numa_fill_memblks()
  2023-05-19  0:04 ` [PATCH 1/2] x86/numa: Introduce numa_fill_memblks() alison.schofield
  2023-05-19  0:08   ` Dave Hansen
@ 2023-06-03 23:53   ` Dan Williams
  2023-06-06 20:03     ` Alison Schofield
  1 sibling, 1 reply; 12+ messages in thread
From: Dan Williams @ 2023-06-03 23:53 UTC (permalink / raw)
  To: alison.schofield, Rafael J. Wysocki, Len Brown, Dan Williams,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen,
	H. Peter Anvin, Andy Lutomirski, Peter Zijlstra, Andrew Morton,
	Jonathan Cameron, Dave Jiang
  Cc: Alison Schofield, x86, linux-cxl, linux-acpi, linux-kernel

alison.schofield@ wrote:
> From: Alison Schofield <alison.schofield@intel.com>
> 
> numa_fill_memblks() fills in the gaps in numa_meminfo memblks
> over an HPA address range.
> 
> The initial use case is the ACPI driver that needs to extend
> SRAT defined proximity domains to an entire CXL CFMWS Window[1].

I feel like this demands more explanation because the "need" is not
apparent. In fact its a Linux policy choice not a requirement. The next
patch has some of this, but this story is needed earlier for someone
that reads this patch first. Something like:

---

The CFWMS is an ACPI data structure that indicates *potential* locations
where CXL memory can be placed. It is the playground where the CXL
driver has free reign to establish regions.  That space can be populated
by BIOS created regions, or driver created regions, after hotplug or
other reconfiguration.

When the BIOS creates a region in a CXL Window it additionally describes
that subset of the Window range in the other typical ACPI tables SRAT,
SLIT, and HMAT. The rationale for the BIOS not pre-describing the entire
CXL Window in SRAT, SLIT, and HMAT is that it can not predict the
future. I.e. there is nothing stopping higher or lower performance
devices being placed in the same Window. Compare that to ACPI memory
hotplug that just onlines additional capacity in the proximity domain
with little freedom for dynamic performance differentiation.

That leaves the OS with a choice, should unpopulated window capacity
match the proximity domain of an existing region, or should it allocate
a new one? This patch takes the simple position of minimizing proximity
domain proliferation and reuse any proximity domain intersection for the
entire Window. If the Window has no intersections then allocate a new
proximity domain. Note that SRAT, SLIT and HMAT information can be
enumerated dynamically in a standard way from device provided data.
Think of CXL as the end of ACPI needing to describe memory attributes,
CXL offers a standard discovery model for performance attributes, but
Linux still needs to interoperate with the old regime.

---

> 
> The APCI driver expects to use numa_fill_memblks() while parsing

s/APCI/ACPI/

Again, the ACPI code does not have any expectation, this is pure OS
policy decision about how to handle undescribed memory.

> the CFMWS. Extending the memblks created during SRAT parsing, to
> cover the entire CFMWS Window, is desirable because everything in
> a CFMWS Window is expected to be of a similar performance class.
> 
> Requires CONFIG_NUMA_KEEP_MEMINFO.

Not sure this adds anything to the description.

> 
> [1] A CXL CFMWS Window represents a contiguous CXL memory resource,
> aka an HPA range. The CFMWS (CXL Fixed Memory Window Structure) is
> part of the ACPI CEDT (CXL Early Discovery Table).
> 
> Signed-off-by: Alison Schofield <alison.schofield@intel.com>
> ---
>  arch/x86/include/asm/sparsemem.h |  2 +
>  arch/x86/mm/numa.c               | 82 ++++++++++++++++++++++++++++++++
>  include/linux/numa.h             |  7 +++
>  3 files changed, 91 insertions(+)
> 
> diff --git a/arch/x86/include/asm/sparsemem.h b/arch/x86/include/asm/sparsemem.h
> index 64df897c0ee3..1be13b2dfe8b 100644
> --- a/arch/x86/include/asm/sparsemem.h
> +++ b/arch/x86/include/asm/sparsemem.h
> @@ -37,6 +37,8 @@ extern int phys_to_target_node(phys_addr_t start);
>  #define phys_to_target_node phys_to_target_node
>  extern int memory_add_physaddr_to_nid(u64 start);
>  #define memory_add_physaddr_to_nid memory_add_physaddr_to_nid
> +extern int numa_fill_memblks(u64 start, u64 end);
> +#define numa_fill_memblks numa_fill_memblks

What is this for? The other defines are due to being an arch-specific
API and the #define is how the arch declares that it has a local version
to replace the generic one.

>  #endif
>  #endif /* __ASSEMBLY__ */
>  
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index 2aadb2019b4f..6c8f9cff71da 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -11,6 +11,7 @@
>  #include <linux/nodemask.h>
>  #include <linux/sched.h>
>  #include <linux/topology.h>
> +#include <linux/sort.h>
>  
>  #include <asm/e820/api.h>
>  #include <asm/proto.h>
> @@ -961,4 +962,85 @@ int memory_add_physaddr_to_nid(u64 start)
>  	return nid;
>  }
>  EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> +
> +static int __init cmp_memblk(const void *a, const void *b)
> +{
> +	const struct numa_memblk *ma = *(const struct numa_memblk **)a;
> +	const struct numa_memblk *mb = *(const struct numa_memblk **)b;
> +
> +	if (ma->start != mb->start)
> +		return (ma->start < mb->start) ? -1 : 1;
> +
> +	if (ma->end != mb->end)
> +		return (ma->end < mb->end) ? -1 : 1;

Why is this sorting by start and end? I can maybe guess, but a comment
would help a future intrepid reader.

> +
> +	return 0;
> +}
> +
> +static struct numa_memblk *numa_memblk_list[NR_NODE_MEMBLKS] __initdata;
> +
> +/**
> + * numa_fill_memblks - Fill gaps in numa_meminfo memblks
> + * @start: address to begin fill
> + * @end: address to end fill
> + *
> + * Find and extend numa_meminfo memblks to cover the @start/@end
> + * HPA address range, following these rules:
> + * 1. The first memblk must start at @start
> + * 2. The last memblk must end at @end

Why these requirements? I worry this is too strict because of the
existence of numa_cleanup_meminfo() which indicates that Linux has seen
quite messy firmware tables, or otherwise needs to cleanup after the
"numa=fake=" command line option. Is it not enough to just check for any
intersection?

> + * 3. Fill the gaps between memblks by extending numa_memblk.end
> + * Result: All addresses in start/end range are included in
> + *	   numa_meminfo.
> + *
> + * RETURNS:
> + * 0		  : Success. numa_meminfo fully describes start/end
> + * NUMA_NO_MEMBLK : No memblk exists in start/end range

This probably wants to clarify whether @end is inclusive or exclusive.

> + */
> +
> +int __init numa_fill_memblks(u64 start, u64 end)
> +{
> +	struct numa_meminfo *mi = &numa_meminfo;
> +	struct numa_memblk **blk = &numa_memblk_list[0];
> +	int count = 0;
> +
> +	for (int i = 0; i < mi->nr_blks; i++) {
> +		struct numa_memblk *bi = &mi->blk[i];
> +
> +		if (start <= bi->start && end >= bi->end) {

Maybe a comment about what this is doing? This is looking for to see if
any CXL window completely overlaps any SRAT entry?

> +			blk[count] = &mi->blk[i];
> +			count++;
> +		}
> +	}
> +	if (!count)
> +		return NUMA_NO_MEMBLK;
> +
> +	if (count == 1) {
> +		blk[0]->start = start;
> +		blk[0]->end = end;
> +		return 0;

So this is updating numa_meminfo in place?

> +	}
> +
> +	sort(&blk[0], count, sizeof(blk[0]), cmp_memblk, NULL);
> +	blk[0]->start = start;
> +	blk[count - 1]->end = end;
> +
> +	for (int i = 0, j = 1; j < count; i++, j++) {
> +		/* Overlaps OK. sort() put the lesser end first */
> +		if (blk[i]->start == blk[j]->start)
> +			continue;
> +
> +		/* No gap */
> +		if (blk[i]->end == blk[j]->start)
> +			continue;
> +
> +		/* Fill the gap */
> +		if (blk[i]->end < blk[j]->start) {
> +			blk[i]->end = blk[j]->start;
> +			continue;
> +		}

This looks clever to sort an array of pointers into the existing
numa_meminfo, I think it needs some comments to explain the cleverness,
but I am not seeing anything glaringly wrong about the approach.

> +	}
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(numa_fill_memblks);
> +
>  #endif
> diff --git a/include/linux/numa.h b/include/linux/numa.h
> index 59df211d051f..0f512c0aba54 100644
> --- a/include/linux/numa.h
> +++ b/include/linux/numa.h
> @@ -12,6 +12,7 @@
>  #define MAX_NUMNODES    (1 << NODES_SHIFT)
>  
>  #define	NUMA_NO_NODE	(-1)
> +#define	NUMA_NO_MEMBLK	(-1)
>  
>  /* optionally keep NUMA memory info available post init */
>  #ifdef CONFIG_NUMA_KEEP_MEMINFO
> @@ -43,6 +44,12 @@ static inline int phys_to_target_node(u64 start)
>  	return 0;
>  }
>  #endif
> +#ifndef numa_fill_memblks
> +static inline int __init numa_fill_memblks(u64 start, u64 end)
> +{
> +	return NUMA_NO_MEMBLK;
> +}
> +#endif

Why does linux/numa.h need to care about this x86-specific init routine?

>  #else /* !CONFIG_NUMA */
>  static inline int numa_map_to_online_node(int node)
>  {
> -- 
> 2.37.3
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] x86/numa: Introduce numa_fill_memblks()
  2023-06-03 23:53   ` Dan Williams
@ 2023-06-06 20:03     ` Alison Schofield
  2023-06-06 20:45       ` Dan Williams
  0 siblings, 1 reply; 12+ messages in thread
From: Alison Schofield @ 2023-06-06 20:03 UTC (permalink / raw)
  To: Dan Williams
  Cc: Rafael J. Wysocki, Len Brown, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Andy Lutomirski,
	Peter Zijlstra, Andrew Morton, Jonathan Cameron, Dave Jiang, x86,
	linux-cxl, linux-acpi, linux-kernel

On Sat, Jun 03, 2023 at 04:53:13PM -0700, Dan Williams wrote:
> alison.schofield@ wrote:
> > From: Alison Schofield <alison.schofield@intel.com>
> > 
> > numa_fill_memblks() fills in the gaps in numa_meminfo memblks
> > over an HPA address range.
> > 
> > The initial use case is the ACPI driver that needs to extend
> > SRAT defined proximity domains to an entire CXL CFMWS Window[1].
> 
> I feel like this demands more explanation because the "need" is not
> apparent. In fact its a Linux policy choice not a requirement. The next
> patch has some of this, but this story is needed earlier for someone
> that reads this patch first. Something like:
> 

Hi Dan,

Thanks for the review :)

Sure, I can add the story below to make the 'need' for this function
more apparent, as well as s/needs/want so as not to conflate need with
requirement.

> ---
> 
> The CFWMS is an ACPI data structure that indicates *potential* locations
> where CXL memory can be placed. It is the playground where the CXL
> driver has free reign to establish regions.  That space can be populated
> by BIOS created regions, or driver created regions, after hotplug or
> other reconfiguration.
> 
> When the BIOS creates a region in a CXL Window it additionally describes
> that subset of the Window range in the other typical ACPI tables SRAT,
> SLIT, and HMAT. The rationale for the BIOS not pre-describing the entire
> CXL Window in SRAT, SLIT, and HMAT is that it can not predict the
> future. I.e. there is nothing stopping higher or lower performance
> devices being placed in the same Window. Compare that to ACPI memory
> hotplug that just onlines additional capacity in the proximity domain
> with little freedom for dynamic performance differentiation.
> 
> That leaves the OS with a choice, should unpopulated window capacity
> match the proximity domain of an existing region, or should it allocate
> a new one? This patch takes the simple position of minimizing proximity
> domain proliferation and reuse any proximity domain intersection for the
> entire Window. If the Window has no intersections then allocate a new
> proximity domain. Note that SRAT, SLIT and HMAT information can be
> enumerated dynamically in a standard way from device provided data.
> Think of CXL as the end of ACPI needing to describe memory attributes,
> CXL offers a standard discovery model for performance attributes, but
> Linux still needs to interoperate with the old regime.
> 
> ---
> 
> > 
> > The APCI driver expects to use numa_fill_memblks() while parsing
> 
> s/APCI/ACPI/
> 
> Again, the ACPI code does not have any expectation, this is pure OS
> policy decision about how to handle undescribed memory.
> 

The intent was to show the pending use case, perhaps 'wants to' use
this function to enact a purely OS policy decision!


> > the CFMWS. Extending the memblks created during SRAT parsing, to
> > cover the entire CFMWS Window, is desirable because everything in
> > a CFMWS Window is expected to be of a similar performance class.
> > 
> > Requires CONFIG_NUMA_KEEP_MEMINFO.
> 
> Not sure this adds anything to the description.
> 
> > 
> > [1] A CXL CFMWS Window represents a contiguous CXL memory resource,
> > aka an HPA range. The CFMWS (CXL Fixed Memory Window Structure) is
> > part of the ACPI CEDT (CXL Early Discovery Table).
> > 
> > Signed-off-by: Alison Schofield <alison.schofield@intel.com>
> > ---
> >  arch/x86/include/asm/sparsemem.h |  2 +
> >  arch/x86/mm/numa.c               | 82 ++++++++++++++++++++++++++++++++
> >  include/linux/numa.h             |  7 +++
> >  3 files changed, 91 insertions(+)
> > 
> > diff --git a/arch/x86/include/asm/sparsemem.h b/arch/x86/include/asm/sparsemem.h
> > index 64df897c0ee3..1be13b2dfe8b 100644
> > --- a/arch/x86/include/asm/sparsemem.h
> > +++ b/arch/x86/include/asm/sparsemem.h
> > @@ -37,6 +37,8 @@ extern int phys_to_target_node(phys_addr_t start);
> >  #define phys_to_target_node phys_to_target_node
> >  extern int memory_add_physaddr_to_nid(u64 start);
> >  #define memory_add_physaddr_to_nid memory_add_physaddr_to_nid
> > +extern int numa_fill_memblks(u64 start, u64 end);
> > +#define numa_fill_memblks numa_fill_memblks
> 
> What is this for? The other defines are due to being an arch-specific
> API and the #define is how the arch declares that it has a local version
> to replace the generic one.

That define, along with the numa.h change below, are to support builds of
CONFIG_ARM64 and CONFIG_LOONGARCH, both include the caller acpi_parse_cfmws(),
of numa_fill_memblks().

> 
> >  #endif
> >  #endif /* __ASSEMBLY__ */
> >  
> > diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> > index 2aadb2019b4f..6c8f9cff71da 100644
> > --- a/arch/x86/mm/numa.c
> > +++ b/arch/x86/mm/numa.c
> > @@ -11,6 +11,7 @@
> >  #include <linux/nodemask.h>
> >  #include <linux/sched.h>
> >  #include <linux/topology.h>
> > +#include <linux/sort.h>
> >  
> >  #include <asm/e820/api.h>
> >  #include <asm/proto.h>
> > @@ -961,4 +962,85 @@ int memory_add_physaddr_to_nid(u64 start)
> >  	return nid;
> >  }
> >  EXPORT_SYMBOL_GPL(memory_add_physaddr_to_nid);
> > +
> > +static int __init cmp_memblk(const void *a, const void *b)
> > +{
> > +	const struct numa_memblk *ma = *(const struct numa_memblk **)a;
> > +	const struct numa_memblk *mb = *(const struct numa_memblk **)b;
> > +
> > +	if (ma->start != mb->start)
> > +		return (ma->start < mb->start) ? -1 : 1;
> > +
> > +	if (ma->end != mb->end)
> > +		return (ma->end < mb->end) ? -1 : 1;
> 
> Why is this sorting by start and end? I can maybe guess, but a comment
> would help a future intrepid reader.

Sure, can add comment. It compares ends only if starts are the same.
It's putting the list in order for numa_fill_memblks() to walk and
fill.

> 
> > +
> > +	return 0;
> > +}
> > +
> > +static struct numa_memblk *numa_memblk_list[NR_NODE_MEMBLKS] __initdata;
> > +
> > +/**
> > + * numa_fill_memblks - Fill gaps in numa_meminfo memblks
> > + * @start: address to begin fill
> > + * @end: address to end fill
> > + *
> > + * Find and extend numa_meminfo memblks to cover the @start/@end
> > + * HPA address range, following these rules:
> > + * 1. The first memblk must start at @start
> > + * 2. The last memblk must end at @end
> 
> Why these requirements? I worry this is too strict because of the
> existence of numa_cleanup_meminfo() which indicates that Linux has seen
> quite messy firmware tables, or otherwise needs to cleanup after the
> "numa=fake=" command line option. Is it not enough to just check for any
> intersection?

Yes, it would be enough to just check for intersection, and not
force the alignment. Will change code to reflect.

> 
> > + * 3. Fill the gaps between memblks by extending numa_memblk.end
> > + * Result: All addresses in start/end range are included in
> > + *	   numa_meminfo.
> > + *
> > + * RETURNS:
> > + * 0		  : Success. numa_meminfo fully describes start/end
> > + * NUMA_NO_MEMBLK : No memblk exists in start/end range
> 
> This probably wants to clarify whether @end is inclusive or exclusive.

It's exclusive and I'll add comment.

> 
> > + */
> > +
> > +int __init numa_fill_memblks(u64 start, u64 end)
> > +{
> > +	struct numa_meminfo *mi = &numa_meminfo;
> > +	struct numa_memblk **blk = &numa_memblk_list[0];
> > +	int count = 0;
> > +
> > +	for (int i = 0; i < mi->nr_blks; i++) {
> > +		struct numa_memblk *bi = &mi->blk[i];
> > +
> > +		if (start <= bi->start && end >= bi->end) {
> 
> Maybe a comment about what this is doing? This is looking for to see if
> any CXL window completely overlaps any SRAT entry?

Based on your first comment about messy tables, I can see the need to
expand the search to include any intersection. Will do.

>    
> > +			blk[count] = &mi->blk[i];
> > +			count++;
> > +		}
> > +	}
> > +	if (!count)
> > +		return NUMA_NO_MEMBLK;
> > +
> > +	if (count == 1) {
> > +		blk[0]->start = start;
> > +		blk[0]->end = end;
> > +		return 0;
> 
> So this is updating numa_meminfo in place?

Yes.

> 
> > +	}
> > +
> > +	sort(&blk[0], count, sizeof(blk[0]), cmp_memblk, NULL);
> > +	blk[0]->start = start;
> > +	blk[count - 1]->end = end;
> > +
> > +	for (int i = 0, j = 1; j < count; i++, j++) {
> > +		/* Overlaps OK. sort() put the lesser end first */
> > +		if (blk[i]->start == blk[j]->start)
> > +			continue;
> > +
> > +		/* No gap */
> > +		if (blk[i]->end == blk[j]->start)
> > +			continue;
> > +
> > +		/* Fill the gap */
> > +		if (blk[i]->end < blk[j]->start) {
> > +			blk[i]->end = blk[j]->start;
> > +			continue;
> > +		}
> 
> This looks clever to sort an array of pointers into the existing
> numa_meminfo, I think it needs some comments to explain the cleverness,
> but I am not seeing anything glaringly wrong about the approach.
> 

I'll add comments on all the above.


> > +	}
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL_GPL(numa_fill_memblks);
> > +
> >  #endif
> > diff --git a/include/linux/numa.h b/include/linux/numa.h
> > index 59df211d051f..0f512c0aba54 100644
> > --- a/include/linux/numa.h
> > +++ b/include/linux/numa.h
> > @@ -12,6 +12,7 @@
> >  #define MAX_NUMNODES    (1 << NODES_SHIFT)
> >  
> >  #define	NUMA_NO_NODE	(-1)
> > +#define	NUMA_NO_MEMBLK	(-1)
> >  
> >  /* optionally keep NUMA memory info available post init */
> >  #ifdef CONFIG_NUMA_KEEP_MEMINFO
> > @@ -43,6 +44,12 @@ static inline int phys_to_target_node(u64 start)
> >  	return 0;
> >  }
> >  #endif
> > +#ifndef numa_fill_memblks
> > +static inline int __init numa_fill_memblks(u64 start, u64 end)
> > +{
> > +	return NUMA_NO_MEMBLK;
> > +}
> > +#endif
> 
> Why does linux/numa.h need to care about this x86-specific init routine?
> 

This is how I got ARM64 and LOONGARCH to build.


> >  #else /* !CONFIG_NUMA */
> >  static inline int numa_map_to_online_node(int node)
> >  {
> > -- 
> > 2.37.3
> > 
> 
> 

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] x86/numa: Introduce numa_fill_memblks()
  2023-06-06 20:03     ` Alison Schofield
@ 2023-06-06 20:45       ` Dan Williams
  2023-06-07 10:44         ` Mike Rapoport
  0 siblings, 1 reply; 12+ messages in thread
From: Dan Williams @ 2023-06-06 20:45 UTC (permalink / raw)
  To: Alison Schofield, Dan Williams
  Cc: Rafael J. Wysocki, Len Brown, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, Andy Lutomirski,
	Peter Zijlstra, Andrew Morton, Jonathan Cameron, Dave Jiang, x86,
	linux-cxl, linux-acpi, linux-kernel, rppt

[ add Mike, see "[Mike]" note below... ]

Alison Schofield wrote:
> On Sat, Jun 03, 2023 at 04:53:13PM -0700, Dan Williams wrote:
> > alison.schofield@ wrote:
> > > From: Alison Schofield <alison.schofield@intel.com>
> > > 
> > > numa_fill_memblks() fills in the gaps in numa_meminfo memblks
> > > over an HPA address range.
> > > 
> > > The initial use case is the ACPI driver that needs to extend
> > > SRAT defined proximity domains to an entire CXL CFMWS Window[1].
> > 
> > I feel like this demands more explanation because the "need" is not
> > apparent. In fact its a Linux policy choice not a requirement. The next
> > patch has some of this, but this story is needed earlier for someone
> > that reads this patch first. Something like:
> > 
> 
> Hi Dan,
> 
> Thanks for the review :)
> 
> Sure, I can add the story below to make the 'need' for this function
> more apparent, as well as s/needs/want so as not to conflate need with
> requirement.
> 
> > ---
> > 
> > The CFWMS is an ACPI data structure that indicates *potential* locations
> > where CXL memory can be placed. It is the playground where the CXL
> > driver has free reign to establish regions.  That space can be populated
> > by BIOS created regions, or driver created regions, after hotplug or
> > other reconfiguration.
> > 
> > When the BIOS creates a region in a CXL Window it additionally describes
> > that subset of the Window range in the other typical ACPI tables SRAT,
> > SLIT, and HMAT. The rationale for the BIOS not pre-describing the entire
> > CXL Window in SRAT, SLIT, and HMAT is that it can not predict the
> > future. I.e. there is nothing stopping higher or lower performance
> > devices being placed in the same Window. Compare that to ACPI memory
> > hotplug that just onlines additional capacity in the proximity domain
> > with little freedom for dynamic performance differentiation.
> > 
> > That leaves the OS with a choice, should unpopulated window capacity
> > match the proximity domain of an existing region, or should it allocate
> > a new one? This patch takes the simple position of minimizing proximity
> > domain proliferation and reuse any proximity domain intersection for the
> > entire Window. If the Window has no intersections then allocate a new
> > proximity domain. Note that SRAT, SLIT and HMAT information can be
> > enumerated dynamically in a standard way from device provided data.
> > Think of CXL as the end of ACPI needing to describe memory attributes,
> > CXL offers a standard discovery model for performance attributes, but
> > Linux still needs to interoperate with the old regime.
> > 
> > ---
> > 
> > > 
> > > The APCI driver expects to use numa_fill_memblks() while parsing
> > 
> > s/APCI/ACPI/
> > 
> > Again, the ACPI code does not have any expectation, this is pure OS
> > policy decision about how to handle undescribed memory.
> > 
> 
> The intent was to show the pending use case, perhaps 'wants to' use
> this function to enact a purely OS policy decision!

Sounds good, yeah I tend to read "need" as a requirement and assume that
Linux is out of spec or something breaks if it does not do the needed
thing.

> 
> 
> > > the CFMWS. Extending the memblks created during SRAT parsing, to
> > > cover the entire CFMWS Window, is desirable because everything in
> > > a CFMWS Window is expected to be of a similar performance class.
> > > 
> > > Requires CONFIG_NUMA_KEEP_MEMINFO.
> > 
> > Not sure this adds anything to the description.
> > 
> > > 
> > > [1] A CXL CFMWS Window represents a contiguous CXL memory resource,
> > > aka an HPA range. The CFMWS (CXL Fixed Memory Window Structure) is
> > > part of the ACPI CEDT (CXL Early Discovery Table).
> > > 
> > > Signed-off-by: Alison Schofield <alison.schofield@intel.com>
> > > ---
> > >  arch/x86/include/asm/sparsemem.h |  2 +
> > >  arch/x86/mm/numa.c               | 82 ++++++++++++++++++++++++++++++++
> > >  include/linux/numa.h             |  7 +++
> > >  3 files changed, 91 insertions(+)
> > > 
> > > diff --git a/arch/x86/include/asm/sparsemem.h b/arch/x86/include/asm/sparsemem.h
> > > index 64df897c0ee3..1be13b2dfe8b 100644
> > > --- a/arch/x86/include/asm/sparsemem.h
> > > +++ b/arch/x86/include/asm/sparsemem.h
> > > @@ -37,6 +37,8 @@ extern int phys_to_target_node(phys_addr_t start);
> > >  #define phys_to_target_node phys_to_target_node
> > >  extern int memory_add_physaddr_to_nid(u64 start);
> > >  #define memory_add_physaddr_to_nid memory_add_physaddr_to_nid
> > > +extern int numa_fill_memblks(u64 start, u64 end);
> > > +#define numa_fill_memblks numa_fill_memblks
> > 
> > What is this for? The other defines are due to being an arch-specific
> > API and the #define is how the arch declares that it has a local version
> > to replace the generic one.
> 
> That define, along with the numa.h change below, are to support builds of
> CONFIG_ARM64 and CONFIG_LOONGARCH, both include the caller acpi_parse_cfmws(),
> of numa_fill_memblks().

[Mike]

Hmm, ok, but this is piling onto the maintenance burden of x86 not
getting onboard with memblock for numa info yet. At a minimum that
avoidance of touching the ARM64 and LOONGARCH cases needs to be called
out, but it would be useful to have a discussion about the options here
with questions like:

- What's blocking x86 from switching to memblock?

- Or, does the memblock API support what numa_fill_memblks() wants to
  do? I.e. add a real numa_fill_memblks() implementation to
  drivers/base/arch_numa.c rather than skip SRAT based fixups for the
  generic case.

Last I remember it was the conceptual disconnect of x86 not marking Reserved
ranges as memory like other architectures:

https://lore.kernel.org/all/20200708091520.GE128651@kernel.org/

...but its been a while since this last came up and I have not been
following memblock developments. Maybe the anwser is the same in the
end, add x86-specific numa_fill_memblks, but this is as good a time as
any to revisit carrying that burden.

[..]
snipped the rest, looks like we are aligned there.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH 1/2] x86/numa: Introduce numa_fill_memblks()
  2023-06-06 20:45       ` Dan Williams
@ 2023-06-07 10:44         ` Mike Rapoport
  0 siblings, 0 replies; 12+ messages in thread
From: Mike Rapoport @ 2023-06-07 10:44 UTC (permalink / raw)
  To: Dan Williams
  Cc: Alison Schofield, Rafael J. Wysocki, Len Brown, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Peter Zijlstra, Andrew Morton, Jonathan Cameron,
	Dave Jiang, x86, linux-cxl, linux-acpi, linux-kernel, rppt

On Tue, Jun 06, 2023 at 01:45:35PM -0700, Dan Williams wrote:
> [ add Mike, see "[Mike]" note below... ]
> 
> Alison Schofield wrote:
> > On Sat, Jun 03, 2023 at 04:53:13PM -0700, Dan Williams wrote:
> > > alison.schofield@ wrote:
> > > > From: Alison Schofield <alison.schofield@intel.com>
> > > > 
> > > > numa_fill_memblks() fills in the gaps in numa_meminfo memblks
> > > > over an HPA address range.
> > > > 
> > > > The initial use case is the ACPI driver that needs to extend
> > > > SRAT defined proximity domains to an entire CXL CFMWS Window[1].
> > > 
> > > I feel like this demands more explanation because the "need" is not
> > > apparent. In fact its a Linux policy choice not a requirement. The next
> > > patch has some of this, but this story is needed earlier for someone
> > > that reads this patch first. Something like:
> > > 
> > 
> > Hi Dan,
> > 
> > Thanks for the review :)
> > 
> > Sure, I can add the story below to make the 'need' for this function
> > more apparent, as well as s/needs/want so as not to conflate need with
> > requirement.
> > 
> > > ---
> > > 
> > > The CFWMS is an ACPI data structure that indicates *potential* locations
> > > where CXL memory can be placed. It is the playground where the CXL
> > > driver has free reign to establish regions.  That space can be populated
> > > by BIOS created regions, or driver created regions, after hotplug or
> > > other reconfiguration.
> > > 
> > > When the BIOS creates a region in a CXL Window it additionally describes
> > > that subset of the Window range in the other typical ACPI tables SRAT,
> > > SLIT, and HMAT. The rationale for the BIOS not pre-describing the entire
> > > CXL Window in SRAT, SLIT, and HMAT is that it can not predict the
> > > future. I.e. there is nothing stopping higher or lower performance
> > > devices being placed in the same Window. Compare that to ACPI memory
> > > hotplug that just onlines additional capacity in the proximity domain
> > > with little freedom for dynamic performance differentiation.
> > > 
> > > That leaves the OS with a choice, should unpopulated window capacity
> > > match the proximity domain of an existing region, or should it allocate
> > > a new one? This patch takes the simple position of minimizing proximity
> > > domain proliferation and reuse any proximity domain intersection for the
> > > entire Window. If the Window has no intersections then allocate a new
> > > proximity domain. Note that SRAT, SLIT and HMAT information can be
> > > enumerated dynamically in a standard way from device provided data.
> > > Think of CXL as the end of ACPI needing to describe memory attributes,
> > > CXL offers a standard discovery model for performance attributes, but
> > > Linux still needs to interoperate with the old regime.
> > > 
> > > ---
> > > 
> > > > 
> > > > The APCI driver expects to use numa_fill_memblks() while parsing
> > > 
> > > s/APCI/ACPI/
> > > 
> > > Again, the ACPI code does not have any expectation, this is pure OS
> > > policy decision about how to handle undescribed memory.
> > > 
> > 
> > The intent was to show the pending use case, perhaps 'wants to' use
> > this function to enact a purely OS policy decision!
> 
> Sounds good, yeah I tend to read "need" as a requirement and assume that
> Linux is out of spec or something breaks if it does not do the needed
> thing.
> 
> > 
> > 
> > > > the CFMWS. Extending the memblks created during SRAT parsing, to
> > > > cover the entire CFMWS Window, is desirable because everything in
> > > > a CFMWS Window is expected to be of a similar performance class.
> > > > 
> > > > Requires CONFIG_NUMA_KEEP_MEMINFO.
> > > 
> > > Not sure this adds anything to the description.
> > > 
> > > > 
> > > > [1] A CXL CFMWS Window represents a contiguous CXL memory resource,
> > > > aka an HPA range. The CFMWS (CXL Fixed Memory Window Structure) is
> > > > part of the ACPI CEDT (CXL Early Discovery Table).
> > > > 
> > > > Signed-off-by: Alison Schofield <alison.schofield@intel.com>
> > > > ---
> > > >  arch/x86/include/asm/sparsemem.h |  2 +
> > > >  arch/x86/mm/numa.c               | 82 ++++++++++++++++++++++++++++++++
> > > >  include/linux/numa.h             |  7 +++
> > > >  3 files changed, 91 insertions(+)
> > > > 
> > > > diff --git a/arch/x86/include/asm/sparsemem.h b/arch/x86/include/asm/sparsemem.h
> > > > index 64df897c0ee3..1be13b2dfe8b 100644
> > > > --- a/arch/x86/include/asm/sparsemem.h
> > > > +++ b/arch/x86/include/asm/sparsemem.h
> > > > @@ -37,6 +37,8 @@ extern int phys_to_target_node(phys_addr_t start);
> > > >  #define phys_to_target_node phys_to_target_node
> > > >  extern int memory_add_physaddr_to_nid(u64 start);
> > > >  #define memory_add_physaddr_to_nid memory_add_physaddr_to_nid
> > > > +extern int numa_fill_memblks(u64 start, u64 end);
> > > > +#define numa_fill_memblks numa_fill_memblks
> > > 
> > > What is this for? The other defines are due to being an arch-specific
> > > API and the #define is how the arch declares that it has a local version
> > > to replace the generic one.
> > 
> > That define, along with the numa.h change below, are to support builds of
> > CONFIG_ARM64 and CONFIG_LOONGARCH, both include the caller acpi_parse_cfmws(),
> > of numa_fill_memblks().
> 
> [Mike]
> 
> Hmm, ok, but this is piling onto the maintenance burden of x86 not
> getting onboard with memblock for numa info yet. At a minimum that
> avoidance of touching the ARM64 and LOONGARCH cases needs to be called
> out, but it would be useful to have a discussion about the options here
> with questions like:
> 
> - What's blocking x86 from switching to memblock?

To start with, someone need to work on it :)

There are some differences in how drivers/base/arch_numa.c and
arch/x86/mm/numa.c handle SRAT ranges. E.g. x86 checks that SRAT covers all
the memory reported by e820 and have this peculiar dance around hotplugable
memory for the sake of movable_node.
Another x86 specific thing that is build around numa_meminfo is the
numa_emulation.

I don't see a conceptual reason why arch_numa.c cannot handle x86, but
that's quite some work needed to make that happen.
 
> - Or, does the memblock API support what numa_fill_memblks() wants to
>   do? I.e. add a real numa_fill_memblks() implementation to
>   drivers/base/arch_numa.c rather than skip SRAT based fixups for the
>   generic case.
 
memblock does not have a notion of empty physical ranges, so it will
require a new set of regions to support what numa_fill_memblks() wants to
do.

With this patch numa_meminfo essentially becomes a superset of
memblock.memory and to have a generic implementation in
drivers/base/arch_numa.c this set should be kept somewhere.

> Last I remember it was the conceptual disconnect of x86 not marking Reserved
> ranges as memory like other architectures:
> 
> https://lore.kernel.org/all/20200708091520.GE128651@kernel.org/

This was more about e820 vs memblock, I don't think it's relevant here.
 
> ...but its been a while since this last came up and I have not been
> following memblock developments. Maybe the anwser is the same in the
> end, add x86-specific numa_fill_memblks, but this is as good a time as
> any to revisit carrying that burden.

I've been thinking about how to make arch_numa to support x86 and (sigh)
loongarch, and the simplest way looks like shoving numa_meminfo there and
then optimizing redundant pieces.

For CXL on arm64/riscv we'd need a new data structure for empty physical
ranges anyway at some point and numa_meminfo quite fits the requirements.
We can later reconsider numa_meminfo vs memblock relationship.

That said, add x86-specific numa_fill_memblks and revisit this later is a
option as well :)

-- 
Sincerely yours,
Mike.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH 2/2] ACPI: NUMA: Apply SRAT proximity domain to entire CFMWS window
  2023-05-19  0:04 [PATCH 0/2] CXL: Apply SRAT defined PXM to entire CFMWS window alison.schofield
  2023-05-19  0:04 ` [PATCH 1/2] x86/numa: Introduce numa_fill_memblks() alison.schofield
@ 2023-05-19  0:04 ` alison.schofield
  1 sibling, 0 replies; 12+ messages in thread
From: alison.schofield @ 2023-05-19  0:04 UTC (permalink / raw)
  To: Rafael J. Wysocki, Len Brown, Dan Williams, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, H. Peter Anvin,
	Andy Lutomirski, Peter Zijlstra, Andrew Morton, Jonathan Cameron,
	Dave Jiang
  Cc: Alison Schofield, x86, linux-cxl, linux-acpi, linux-kernel

From: Alison Schofield <alison.schofield@intel.com>

Commit fd49f99c1809 ("ACPI: NUMA: Add a node and memblk for each
CFMWS not in SRAT") did not account for the case where the BIOS
only partially describes a CFMWS Window in the SRAT. That means
the omitted address ranges, of a partially described CFMWS Window,
do not get assigned to a NUMA node.

Replace the call to phys_to_target_node() with numa_add_memblks().
Numa_add_memblks() searches an HPA range for existing memblk(s)
and extends those memblk(s) to fill the entire CFMWS Window.

Extending the existing memblks is a simple strategy that reuses
SRAT defined proximity domains from part of a window to fill out
the entire window, based on the knowledge* that all of a CFMWS
window is of a similar performance class.

*Note that this heuristic will evolve when CFMWS Windows present
a wider range of characteristics. The extension of the proximity
domain, implemented here, is likely a step in developing a more
sophisticated performance profile in the future.

There is no change in behavior when the SRAT does not describe
the CFMWS Window at all. In that case, a new NUMA node with a
single memblk covering the entire CFMWS Window is created.

Fixes: fd49f99c1809 ("ACPI: NUMA: Add a node and memblk for each CFMWS not in SRAT")
Signed-off-by: Alison Schofield <alison.schofield@intel.com>
---
 drivers/acpi/numa/srat.c | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
index 1f4fc5f8a819..12f330b0eac0 100644
--- a/drivers/acpi/numa/srat.c
+++ b/drivers/acpi/numa/srat.c
@@ -310,11 +310,16 @@ static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
 	start = cfmws->base_hpa;
 	end = cfmws->base_hpa + cfmws->window_size;

-	/* Skip if the SRAT already described the NUMA details for this HPA */
-	node = phys_to_target_node(start);
-	if (node != NUMA_NO_NODE)
+	/*
+	 * The SRAT may have already described NUMA details for all,
+	 * or a portion of, this CFMWS HPA range. Extend the memblks
+	 * found for any portion of the window to cover the entire
+	 * window.
+	 */
+	if (!numa_fill_memblks(start, end))
 		return 0;

+	/* No SRAT description. Create a new node. */
 	node = acpi_map_pxm_to_node(*fake_pxm);

 	if (node == NUMA_NO_NODE) {
-- 
2.37.3

^ permalink raw reply related	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2023-06-07 10:45 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-05-19  0:04 [PATCH 0/2] CXL: Apply SRAT defined PXM to entire CFMWS window alison.schofield
2023-05-19  0:04 ` [PATCH 1/2] x86/numa: Introduce numa_fill_memblks() alison.schofield
2023-05-19  0:08   ` Dave Hansen
2023-05-19  0:26     ` Alison Schofield
2023-05-19  0:40       ` Dan Williams
2023-05-19  0:43       ` Dave Hansen
2023-05-19  1:56         ` Dan Williams
2023-06-03 23:53   ` Dan Williams
2023-06-06 20:03     ` Alison Schofield
2023-06-06 20:45       ` Dan Williams
2023-06-07 10:44         ` Mike Rapoport
2023-05-19  0:04 ` [PATCH 2/2] ACPI: NUMA: Apply SRAT proximity domain to entire CFMWS window alison.schofield

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox