linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] memory,acpi: resize memory blocks based on CFMW alignment
@ 2024-10-08  4:43 Gregory Price
  2024-10-08  4:43 ` [PATCH 1/3] memory: extern memory_block_size_bytes and set_memory_block_size_order Gregory Price
                   ` (3 more replies)
  0 siblings, 4 replies; 21+ messages in thread
From: Gregory Price @ 2024-10-08  4:43 UTC (permalink / raw)
  To: linux-cxl, x86, linux-mm, linux-acpi
  Cc: linux-kernel, dave.hansen, luto, peterz, tglx, mingo, bp, hpa,
	david, osalvador, gregkh, rafael, akpm, dan.j.williams,
	Jonathan.Cameron, alison.schofield, rrichter, terry.bowman, lenb,
	dave.jiang, ira.weiny

When physical address capacity is not aligned to the size of a memory
block managed size, the misaligned portion is not mapped - creating
an effective loss of capacity.

This appears to be a calculated decision based on the fact that most
regions would generally be aligned, and the loss of capacity would be
relatively limited. With CXL devices, this is no longer the case.

CXL exposes its memory for management through the ACPI CEDT (CXL Early
Detection Table) in a field called the CXL Fixed Memory Window.  Per
the CXL specification, this memory must be aligned to at least 256MB.

On X86, memory block capacity increases based on the overall capacity
of the machine - eventually reaching a maximum of 2GB per memory block.
When a CFMW aligns on 256MB, this causes a loss of at least 2GB of
capacity, and in some cases more.

It is also possible for multiple CFMW to be exposed for a single device.
This can happen if a reserved region intersects with the target memory
location of the memory device. This happens on AMD x86 platforms. 

This patch set detects the alignments of all CFMW in the ACPI CEDT,
and changes the memory block size downward to meet the largest common
denomenator of the supported memory regions.

To do this, we needed 3 changes:
    1) extern memory block management functions for the acpi driver
    2) modify x86 to update its cached block size value
    3) add code in acpi/numa/srat.c to do the alignment check

Presently this only affects x86, since this is the only architecture
that implements set_memory_block_size_order.

Presently this appears to only affect x86, and we only mitigated there
since it is the only arch to implement set_memory_block_size_order.

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Gregory Price <gourry@gourry.net>

Gregory Price (3):
  memory: extern memory_block_size_bytes and set_memory_block_size_order
  x86/mm: if memblock size is adjusted, update the cached value
  acpi,srat: reduce memory block size if CFMWS has a smaller alignment

 arch/x86/mm/init_64.c    | 17 ++++++++++++--
 drivers/acpi/numa/srat.c | 48 ++++++++++++++++++++++++++++++++++++++++
 drivers/base/memory.c    |  6 +++++
 include/linux/memory.h   |  4 ++--
 4 files changed, 71 insertions(+), 4 deletions(-)

-- 
2.43.0



^ permalink raw reply	[flat|nested] 21+ messages in thread

* [PATCH 1/3] memory: extern memory_block_size_bytes and set_memory_block_size_order
  2024-10-08  4:43 [PATCH 0/3] memory,acpi: resize memory blocks based on CFMW alignment Gregory Price
@ 2024-10-08  4:43 ` Gregory Price
  2024-10-08 14:03   ` David Hildenbrand
  2024-10-08  4:43 ` [PATCH 2/3] x86/mm: if memblock size is adjusted, update the cached value Gregory Price
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 21+ messages in thread
From: Gregory Price @ 2024-10-08  4:43 UTC (permalink / raw)
  To: linux-cxl, x86, linux-mm, linux-acpi
  Cc: linux-kernel, dave.hansen, luto, peterz, tglx, mingo, bp, hpa,
	david, osalvador, gregkh, rafael, akpm, dan.j.williams,
	Jonathan.Cameron, alison.schofield, rrichter, terry.bowman, lenb,
	dave.jiang, ira.weiny

On CXL systems, block alignment may be as small as 256MB, which may
require a resize of the block size during initialization.  This is done
in the ACPI driver, so the resize function need to be made available.

Presently, only x86 provides the functionality to resize memory
block sizes.  Wire up a weak stub for set_memory_block_size_order,
similar to memory_block_size_bytes, that simply returns -ENODEV.

Since set_memory_block_size_order is now extern, we also need to
drop the __init macro.

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 arch/x86/mm/init_64.c  | 2 +-
 drivers/base/memory.c  | 6 ++++++
 include/linux/memory.h | 4 ++--
 3 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index ff253648706f..6086f99449fa 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1424,7 +1424,7 @@ void mark_rodata_ro(void)
 
 /* Adjustable memory block size */
 static unsigned long set_memory_block_size;
-int __init set_memory_block_size_order(unsigned int order)
+int set_memory_block_size_order(unsigned int order)
 {
 	unsigned long size = 1UL << order;
 
diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 67858eeb92ed..f9045642f69e 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -110,6 +110,12 @@ static void memory_block_release(struct device *dev)
 	kfree(mem);
 }
 
+int __weak set_memory_block_size_order(unsigned int order)
+{
+	return -ENODEV;
+}
+EXPORT_SYMBOL_GPL(set_memory_block_size_order);
+
 unsigned long __weak memory_block_size_bytes(void)
 {
 	return MIN_MEMORY_BLOCK_SIZE;
diff --git a/include/linux/memory.h b/include/linux/memory.h
index c0afee5d126e..c57706178354 100644
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -86,8 +86,8 @@ struct memory_block {
 };
 
 int arch_get_memory_phys_device(unsigned long start_pfn);
-unsigned long memory_block_size_bytes(void);
-int set_memory_block_size_order(unsigned int order);
+extern unsigned long memory_block_size_bytes(void);
+extern int set_memory_block_size_order(unsigned int order);
 
 /* These states are exposed to userspace as text strings in sysfs */
 #define	MEM_ONLINE		(1<<0) /* exposed to userspace */
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 2/3] x86/mm: if memblock size is adjusted, update the cached value
  2024-10-08  4:43 [PATCH 0/3] memory,acpi: resize memory blocks based on CFMW alignment Gregory Price
  2024-10-08  4:43 ` [PATCH 1/3] memory: extern memory_block_size_bytes and set_memory_block_size_order Gregory Price
@ 2024-10-08  4:43 ` Gregory Price
  2024-10-08  4:43 ` [PATCH 3/3] acpi,srat: reduce memory block size if CFMWS has a smaller alignment Gregory Price
  2024-10-08 14:38 ` [PATCH 0/3] memory,acpi: resize memory blocks based on CFMW alignment Ira Weiny
  3 siblings, 0 replies; 21+ messages in thread
From: Gregory Price @ 2024-10-08  4:43 UTC (permalink / raw)
  To: linux-cxl, x86, linux-mm, linux-acpi
  Cc: linux-kernel, dave.hansen, luto, peterz, tglx, mingo, bp, hpa,
	david, osalvador, gregkh, rafael, akpm, dan.j.williams,
	Jonathan.Cameron, alison.schofield, rrichter, terry.bowman, lenb,
	dave.jiang, ira.weiny

When parsing CFMWS, we need to know the currently registered memory
block size to avoid accidentally adjusting the size upward. Querying
the size causes it to be cached in a static global.  Update the static
global if the value is subsequently updated.

Print a warning that this has occurred for debugging purposes if
the memory block size is changed at an unexpected time (post-init).

Signed-off-by: Gregory Price <gourry@gourry.net>
---
 arch/x86/mm/init_64.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 6086f99449fa..733dfa899232 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1424,6 +1424,7 @@ void mark_rodata_ro(void)
 
 /* Adjustable memory block size */
 static unsigned long set_memory_block_size;
+static unsigned long memory_block_size_probed;
 int set_memory_block_size_order(unsigned int order)
 {
 	unsigned long size = 1UL << order;
@@ -1432,6 +1433,19 @@ int set_memory_block_size_order(unsigned int order)
 		return -EINVAL;
 
 	set_memory_block_size = size;
+
+	/*
+	 * If the block size has already been probed, we need to change it.
+	 * This can happen during ACPI/CFMWS parsing, since it needs to
+	 * probe the system for the existing block size to avoid increasing
+	 * the block size if lower memory happens to be misaligned
+	 */
+	if (memory_block_size_probed) {
+		memory_block_size_probed = size;
+		pr_warn("x86/mm: Memory block size changed: %ldMB\n",
+			size >> 20);
+	}
+
 	return 0;
 }
 
@@ -1471,7 +1485,6 @@ static unsigned long probe_memory_block_size(void)
 	return bz;
 }
 
-static unsigned long memory_block_size_probed;
 unsigned long memory_block_size_bytes(void)
 {
 	if (!memory_block_size_probed)
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* [PATCH 3/3] acpi,srat: reduce memory block size if CFMWS has a smaller alignment
  2024-10-08  4:43 [PATCH 0/3] memory,acpi: resize memory blocks based on CFMW alignment Gregory Price
  2024-10-08  4:43 ` [PATCH 1/3] memory: extern memory_block_size_bytes and set_memory_block_size_order Gregory Price
  2024-10-08  4:43 ` [PATCH 2/3] x86/mm: if memblock size is adjusted, update the cached value Gregory Price
@ 2024-10-08  4:43 ` Gregory Price
  2024-10-08 14:58   ` Ira Weiny
  2024-10-08 14:38 ` [PATCH 0/3] memory,acpi: resize memory blocks based on CFMW alignment Ira Weiny
  3 siblings, 1 reply; 21+ messages in thread
From: Gregory Price @ 2024-10-08  4:43 UTC (permalink / raw)
  To: linux-cxl, x86, linux-mm, linux-acpi
  Cc: linux-kernel, dave.hansen, luto, peterz, tglx, mingo, bp, hpa,
	david, osalvador, gregkh, rafael, akpm, dan.j.williams,
	Jonathan.Cameron, alison.schofield, rrichter, terry.bowman, lenb,
	dave.jiang, ira.weiny

The CXL Fixed Memory Window allows for memory aligned down to the
size of 256MB.  However, by default on x86, memory blocks increase
in size as total System RAM capacity increases. On x86, this caps
out at 2G when 64GB of System RAM is reached.

When the CFMWS regions are not aligned to memory block size, this
results in lost capacity on either side of the alignment.

Parse all CFMWS to detect the largest common denomenator among all
regions, and reduce the block size accordingly.

This can only be done when MEMORY_HOTPLUG and SPARSEMEM configs are
enabled, but the surrounding code may not necessarily require these
configs, so build accordingly.

Suggested-by: Dan Williams <dan.j.williams@intel.com>
Signed-off-by: Gregory Price <gourry@gourry.net>
---
 drivers/acpi/numa/srat.c | 48 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 48 insertions(+)

diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
index 44f91f2c6c5d..9367d36eba9a 100644
--- a/drivers/acpi/numa/srat.c
+++ b/drivers/acpi/numa/srat.c
@@ -14,6 +14,7 @@
 #include <linux/errno.h>
 #include <linux/acpi.h>
 #include <linux/memblock.h>
+#include <linux/memory.h>
 #include <linux/numa.h>
 #include <linux/nodemask.h>
 #include <linux/topology.h>
@@ -333,6 +334,37 @@ acpi_parse_memory_affinity(union acpi_subtable_headers *header,
 	return 0;
 }
 
+#if defined(CONFIG_MEMORY_HOTPLUG)
+/*
+ * CXL allows CFMW to be aligned along 256MB boundaries, but large memory
+ * systems default to larger alignments (2GB on x86). Misalignments can
+ * cause some capacity to become unreachable. Calculate the largest supported
+ * alignment for all CFMW to maximize the amount of mappable capacity.
+ */
+static int __init acpi_align_cfmws(union acpi_subtable_headers *header,
+				   void *arg, const unsigned long table_end)
+{
+	struct acpi_cedt_cfmws *cfmws = (struct acpi_cedt_cfmws *)header;
+	u64 start = cfmws->base_hpa;
+	u64 size = cfmws->window_size;
+	unsigned long *fin_bz = arg;
+	unsigned long bz;
+
+	for (bz = SZ_64T; bz >= SZ_256M; bz >>= 1) {
+		if (IS_ALIGNED(start, bz) && IS_ALIGNED(size, bz))
+			break;
+	}
+
+	/* Only adjust downward, we never want to increase block size */
+	if (bz < *fin_bz && bz >= SZ_256M)
+		*fin_bz = bz;
+	else if (bz < SZ_256M)
+		pr_err("CFMWS: [BIOS BUG] base/size alignment violates spec\n");
+
+	return 0;
+}
+#endif /* defined(CONFIG_MEMORY_HOTPLUG) */
+
 static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
 				   void *arg, const unsigned long table_end)
 {
@@ -501,6 +533,10 @@ acpi_table_parse_srat(enum acpi_srat_type id,
 int __init acpi_numa_init(void)
 {
 	int i, fake_pxm, cnt = 0;
+#if defined(CONFIG_MEMORY_HOTPLUG)
+	unsigned long block_sz = memory_block_size_bytes();
+	unsigned long cfmw_align = block_sz;
+#endif /* defined(CONFIG_MEMORY_HOTPLUG) */
 
 	if (acpi_disabled)
 		return -EINVAL;
@@ -552,6 +588,18 @@ int __init acpi_numa_init(void)
 	}
 	last_real_pxm = fake_pxm;
 	fake_pxm++;
+
+#if defined(CONFIG_MEMORY_HOTPLUG)
+	/* Calculate and set largest supported memory block size alignment */
+	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_align_cfmws,
+			      &cfmw_align);
+	if (cfmw_align < block_sz && cfmw_align >= SZ_256M) {
+		if (set_memory_block_size_order(ffs(cfmw_align)-1))
+			pr_warn("CFMWS: Unable to adjust memory block size\n");
+	}
+#endif /* defined(CONFIG_MEMORY_HOTPLUG) */
+
+	/* Then parse and fill the numa nodes with the described memory */
 	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
 			      &fake_pxm);
 
-- 
2.43.0



^ permalink raw reply related	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] memory: extern memory_block_size_bytes and set_memory_block_size_order
  2024-10-08  4:43 ` [PATCH 1/3] memory: extern memory_block_size_bytes and set_memory_block_size_order Gregory Price
@ 2024-10-08 14:03   ` David Hildenbrand
  2024-10-08 14:51     ` Gregory Price
  0 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2024-10-08 14:03 UTC (permalink / raw)
  To: Gregory Price, linux-cxl, x86, linux-mm, linux-acpi
  Cc: linux-kernel, dave.hansen, luto, peterz, tglx, mingo, bp, hpa,
	osalvador, gregkh, rafael, akpm, dan.j.williams, Jonathan.Cameron,
	alison.schofield, rrichter, terry.bowman, lenb, dave.jiang,
	ira.weiny

On 08.10.24 06:43, Gregory Price wrote:
> On CXL systems, block alignment may be as small as 256MB, which may
> require a resize of the block size during initialization.  This is done
> in the ACPI driver, so the resize function need to be made available.
> 
> Presently, only x86 provides the functionality to resize memory
> block sizes.  Wire up a weak stub for set_memory_block_size_order,
> similar to memory_block_size_bytes, that simply returns -ENODEV.
> 
> Since set_memory_block_size_order is now extern, we also need to
> drop the __init macro.
> 
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
>   arch/x86/mm/init_64.c  | 2 +-
>   drivers/base/memory.c  | 6 ++++++
>   include/linux/memory.h | 4 ++--
>   3 files changed, 9 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index ff253648706f..6086f99449fa 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -1424,7 +1424,7 @@ void mark_rodata_ro(void)
>   
>   /* Adjustable memory block size */
>   static unsigned long set_memory_block_size;
> -int __init set_memory_block_size_order(unsigned int order)
> +int set_memory_block_size_order(unsigned int order)
>   {
>   	unsigned long size = 1UL << order;
>   
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 67858eeb92ed..f9045642f69e 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -110,6 +110,12 @@ static void memory_block_release(struct device *dev)
>   	kfree(mem);
>   }
>   
> +int __weak set_memory_block_size_order(unsigned int order)
> +{
> +	return -ENODEV;
> +}
> +EXPORT_SYMBOL_GPL(set_memory_block_size_order);

I can understand what you are trying to achieve, but letting arbitrary 
modules mess with this sounds like a bad idea.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/3] memory,acpi: resize memory blocks based on CFMW alignment
  2024-10-08  4:43 [PATCH 0/3] memory,acpi: resize memory blocks based on CFMW alignment Gregory Price
                   ` (2 preceding siblings ...)
  2024-10-08  4:43 ` [PATCH 3/3] acpi,srat: reduce memory block size if CFMWS has a smaller alignment Gregory Price
@ 2024-10-08 14:38 ` Ira Weiny
  2024-10-08 14:49   ` Gregory Price
  3 siblings, 1 reply; 21+ messages in thread
From: Ira Weiny @ 2024-10-08 14:38 UTC (permalink / raw)
  To: Gregory Price, linux-cxl, x86, linux-mm, linux-acpi
  Cc: linux-kernel, dave.hansen, luto, peterz, tglx, mingo, bp, hpa,
	david, osalvador, gregkh, rafael, akpm, dan.j.williams,
	Jonathan.Cameron, alison.schofield, rrichter, terry.bowman, lenb,
	dave.jiang, ira.weiny

Gregory Price wrote:
> When physical address capacity is not aligned to the size of a memory
> block managed size, the misaligned portion is not mapped - creating
> an effective loss of capacity.
> 
> This appears to be a calculated decision based on the fact that most
> regions would generally be aligned, and the loss of capacity would be
> relatively limited. With CXL devices, this is no longer the case.
> 
> CXL exposes its memory for management through the ACPI CEDT (CXL Early
> Detection Table) in a field called the CXL Fixed Memory Window.  Per
> the CXL specification, this memory must be aligned to at least 256MB.
> 
> On X86, memory block capacity increases based on the overall capacity
> of the machine - eventually reaching a maximum of 2GB per memory block.
> When a CFMW aligns on 256MB, this causes a loss of at least 2GB of
> capacity, and in some cases more.
> 
> It is also possible for multiple CFMW to be exposed for a single device.
> This can happen if a reserved region intersects with the target memory
> location of the memory device. This happens on AMD x86 platforms. 

I'm not clear why you mention reserved regions here.  IIUC CFMW's can
overlap to describe different attributes which may be utilized based on
the devices which are mapped within them.  For this reason, all CFMW's
must be scanned to find the lowest common denominator even if the HPA
range has already been evaluated.

Is that what you are trying to say?

> 
> This patch set detects the alignments of all CFMW in the ACPI CEDT,
> and changes the memory block size downward to meet the largest common
> denomenator of the supported memory regions.
> 
> To do this, we needed 3 changes:
>     1) extern memory block management functions for the acpi driver
>     2) modify x86 to update its cached block size value
>     3) add code in acpi/numa/srat.c to do the alignment check
> 
> Presently this only affects x86, since this is the only architecture
> that implements set_memory_block_size_order.
> 
> Presently this appears to only affect x86, and we only mitigated there
> since it is the only arch to implement set_memory_block_size_order.

NIT : duplicate statement

Ira


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 0/3] memory,acpi: resize memory blocks based on CFMW alignment
  2024-10-08 14:38 ` [PATCH 0/3] memory,acpi: resize memory blocks based on CFMW alignment Ira Weiny
@ 2024-10-08 14:49   ` Gregory Price
  0 siblings, 0 replies; 21+ messages in thread
From: Gregory Price @ 2024-10-08 14:49 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-cxl, x86, linux-mm, linux-acpi, linux-kernel, dave.hansen,
	luto, peterz, tglx, mingo, bp, hpa, david, osalvador, gregkh,
	rafael, akpm, dan.j.williams, Jonathan.Cameron, alison.schofield,
	rrichter, terry.bowman, lenb, dave.jiang

On Tue, Oct 08, 2024 at 09:38:35AM -0500, Ira Weiny wrote:
> Gregory Price wrote:
> > When physical address capacity is not aligned to the size of a memory
> > block managed size, the misaligned portion is not mapped - creating
> > an effective loss of capacity.
> > 
> > This appears to be a calculated decision based on the fact that most
> > regions would generally be aligned, and the loss of capacity would be
> > relatively limited. With CXL devices, this is no longer the case.
> > 
> > CXL exposes its memory for management through the ACPI CEDT (CXL Early
> > Detection Table) in a field called the CXL Fixed Memory Window.  Per
> > the CXL specification, this memory must be aligned to at least 256MB.
> > 
> > On X86, memory block capacity increases based on the overall capacity
> > of the machine - eventually reaching a maximum of 2GB per memory block.
> > When a CFMW aligns on 256MB, this causes a loss of at least 2GB of
> > capacity, and in some cases more.
> > 
> > It is also possible for multiple CFMW to be exposed for a single device.
> > This can happen if a reserved region intersects with the target memory
> > location of the memory device. This happens on AMD x86 platforms. 
> 
> I'm not clear why you mention reserved regions here.  IIUC CFMW's can
> overlap to describe different attributes which may be utilized based on
> the devices which are mapped within them.  For this reason, all CFMW's
> must be scanned to find the lowest common denominator even if the HPA
> range has already been evaluated.
> 
> Is that what you are trying to say?
>

On AMD systems, depending on the capacity, it is possible for a single
memory expander to be represented by multiple CFMW.

an example: There are two memory expanders w/ 256GB total capacity

[    0.000000] BIOS-e820: [mem 0x000000c050000000-0x000000fcffffffff] soft reserved
[    0.000000] BIOS-e820: [mem 0x000000fd00000000-0x000000ffffffffff] reserved
[    0.000000] BIOS-e820: [mem 0x0000010000000000-0x000001034fffffff] soft reserved

[0A4h 0164   1]                Subtable Type : 01 [CXL Fixed Memory Window Structure]
[0A5h 0165   1]                     Reserved : 00
[0A6h 0166   2]                       Length : 0028
[0A8h 0168   4]                     Reserved : 00000000
[0ACh 0172   8]          Window base address : 000000C050000000
[0B4h 0180   8]                  Window size : 0000002000000000
[0BCh 0188   1]     Interleave Members (2^n) : 00
[0BDh 0189   1]        Interleave Arithmetic : 00
[0BEh 0190   2]                     Reserved : 0000
[0C0h 0192   4]                  Granularity : 00000000
[0C4h 0196   2]                 Restrictions : 0006
[0C6h 0198   2]                        QtgId : 0001
[0C8h 0200   4]                 First Target : 00000007

[0CCh 0204   1]                Subtable Type : 01 [CXL Fixed Memory Window Structure]
[0CDh 0205   1]                     Reserved : 00
[0CEh 0206   2]                       Length : 0028
[0D0h 0208   4]                     Reserved : 00000000
[0D4h 0212   8]          Window base address : 000000E050000000
[0DCh 0220   8]                  Window size : 0000001CB0000000
[0E4h 0228   1]     Interleave Members (2^n) : 00
[0E5h 0229   1]        Interleave Arithmetic : 00
[0E6h 0230   2]                     Reserved : 0000
[0E8h 0232   4]                  Granularity : 00000000
[0ECh 0236   2]                 Restrictions : 0006
[0EEh 0238   2]                        QtgId : 0002
[0F0h 0240   4]                 First Target : 00000006

[0F4h 0244   1]                Subtable Type : 01 [CXL Fixed Memory Window Structure]
[0F5h 0245   1]                     Reserved : 00
[0F6h 0246   2]                       Length : 0028
[0F8h 0248   4]                     Reserved : 00000000
[0FCh 0252   8]          Window base address : 0000010000000000
[104h 0260   8]                  Window size : 0000000350000000
[10Ch 0268   1]     Interleave Members (2^n) : 00
[10Dh 0269   1]        Interleave Arithmetic : 00
[10Eh 0270   2]                     Reserved : 0000
[110h 0272   4]                  Granularity : 00000000
[114h 0276   2]                 Restrictions : 0006
[116h 0278   2]                        QtgId : 0003
[118h 0280   4]                 First Target : 00000006

Note that there are two soft reserved regions, but 3 CFMWS.  This is
because the first device is contained entirely within the first region,
and the second device is split across the first and the second.

The reserved space at the top of the 1TB memory region is reserved by
the system for something else - similar I imagine to the PCI hole at the
top of 4GB.

The e820 entries are not aligned to 2GB - so you will lose capacity
right off the bat. Then on top of this, when you go to map the windows,
you're met with more bases and lengths not aligned to 2GB which results
in even futher loss of usable capacity.

> > 
> > This patch set detects the alignments of all CFMW in the ACPI CEDT,
> > and changes the memory block size downward to meet the largest common
> > denomenator of the supported memory regions.
> > 
> > To do this, we needed 3 changes:
> >     1) extern memory block management functions for the acpi driver
> >     2) modify x86 to update its cached block size value
> >     3) add code in acpi/numa/srat.c to do the alignment check
> > 
> > Presently this only affects x86, since this is the only architecture
> > that implements set_memory_block_size_order.
> > 
> > Presently this appears to only affect x86, and we only mitigated there
> > since it is the only arch to implement set_memory_block_size_order.
> 
> NIT : duplicate statement
> 
> Ira


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] memory: extern memory_block_size_bytes and set_memory_block_size_order
  2024-10-08 14:03   ` David Hildenbrand
@ 2024-10-08 14:51     ` Gregory Price
  2024-10-08 15:02       ` David Hildenbrand
  0 siblings, 1 reply; 21+ messages in thread
From: Gregory Price @ 2024-10-08 14:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-cxl, x86, linux-mm, linux-acpi, linux-kernel, dave.hansen,
	luto, peterz, tglx, mingo, bp, hpa, osalvador, gregkh, rafael,
	akpm, dan.j.williams, Jonathan.Cameron, alison.schofield,
	rrichter, terry.bowman, lenb, dave.jiang, ira.weiny

On Tue, Oct 08, 2024 at 04:03:37PM +0200, David Hildenbrand wrote:
> On 08.10.24 06:43, Gregory Price wrote:
> > On CXL systems, block alignment may be as small as 256MB, which may
> > require a resize of the block size during initialization.  This is done
> > in the ACPI driver, so the resize function need to be made available.
> > 
> > Presently, only x86 provides the functionality to resize memory
> > block sizes.  Wire up a weak stub for set_memory_block_size_order,
> > similar to memory_block_size_bytes, that simply returns -ENODEV.
> > 
> > Since set_memory_block_size_order is now extern, we also need to
> > drop the __init macro.
> > 
> > Signed-off-by: Gregory Price <gourry@gourry.net>
> > ---
> >   arch/x86/mm/init_64.c  | 2 +-
> >   drivers/base/memory.c  | 6 ++++++
> >   include/linux/memory.h | 4 ++--
> >   3 files changed, 9 insertions(+), 3 deletions(-)
> > 
> > diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> > index ff253648706f..6086f99449fa 100644
> > --- a/arch/x86/mm/init_64.c
> > +++ b/arch/x86/mm/init_64.c
> > @@ -1424,7 +1424,7 @@ void mark_rodata_ro(void)
> >   /* Adjustable memory block size */
> >   static unsigned long set_memory_block_size;
> > -int __init set_memory_block_size_order(unsigned int order)
> > +int set_memory_block_size_order(unsigned int order)
> >   {
> >   	unsigned long size = 1UL << order;
> > diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> > index 67858eeb92ed..f9045642f69e 100644
> > --- a/drivers/base/memory.c
> > +++ b/drivers/base/memory.c
> > @@ -110,6 +110,12 @@ static void memory_block_release(struct device *dev)
> >   	kfree(mem);
> >   }
> > +int __weak set_memory_block_size_order(unsigned int order)
> > +{
> > +	return -ENODEV;
> > +}
> > +EXPORT_SYMBOL_GPL(set_memory_block_size_order);
> 
> I can understand what you are trying to achieve, but letting arbitrary
> modules mess with this sounds like a bad idea.
> 

I suppose the alternative is trying to scan the CEDT from inside each
machine, rather than the ACPI driver?  Seems less maintainable.

I don't entirely disagree with your comment.  I hummed and hawwed over
externing this - hence the warning in the x86 machine.

Open to better answers.

> -- 
> Cheers,
> 
> David / dhildenb
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] acpi,srat: reduce memory block size if CFMWS has a smaller alignment
  2024-10-08  4:43 ` [PATCH 3/3] acpi,srat: reduce memory block size if CFMWS has a smaller alignment Gregory Price
@ 2024-10-08 14:58   ` Ira Weiny
  2024-10-08 15:17     ` Gregory Price
  0 siblings, 1 reply; 21+ messages in thread
From: Ira Weiny @ 2024-10-08 14:58 UTC (permalink / raw)
  To: Gregory Price, linux-cxl, x86, linux-mm, linux-acpi
  Cc: linux-kernel, dave.hansen, luto, peterz, tglx, mingo, bp, hpa,
	david, osalvador, gregkh, rafael, akpm, dan.j.williams,
	Jonathan.Cameron, alison.schofield, rrichter, terry.bowman, lenb,
	dave.jiang, ira.weiny

Gregory Price wrote:
> The CXL Fixed Memory Window allows for memory aligned down to the
> size of 256MB.  However, by default on x86, memory blocks increase
> in size as total System RAM capacity increases. On x86, this caps
> out at 2G when 64GB of System RAM is reached.
> 
> When the CFMWS regions are not aligned to memory block size, this
> results in lost capacity on either side of the alignment.
> 
> Parse all CFMWS to detect the largest common denomenator among all
> regions, and reduce the block size accordingly.
> 
> This can only be done when MEMORY_HOTPLUG and SPARSEMEM configs are
> enabled, but the surrounding code may not necessarily require these
> configs, so build accordingly.
> 
> Suggested-by: Dan Williams <dan.j.williams@intel.com>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
>  drivers/acpi/numa/srat.c | 48 ++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 48 insertions(+)
> 
> diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
> index 44f91f2c6c5d..9367d36eba9a 100644
> --- a/drivers/acpi/numa/srat.c
> +++ b/drivers/acpi/numa/srat.c
> @@ -14,6 +14,7 @@
>  #include <linux/errno.h>
>  #include <linux/acpi.h>
>  #include <linux/memblock.h>
> +#include <linux/memory.h>
>  #include <linux/numa.h>
>  #include <linux/nodemask.h>
>  #include <linux/topology.h>
> @@ -333,6 +334,37 @@ acpi_parse_memory_affinity(union acpi_subtable_headers *header,
>  	return 0;
>  }
>  
> +#if defined(CONFIG_MEMORY_HOTPLUG)

Generally we avoid config defines in *.c files...  See more below.

> +/*
> + * CXL allows CFMW to be aligned along 256MB boundaries, but large memory
> + * systems default to larger alignments (2GB on x86). Misalignments can
> + * cause some capacity to become unreachable. Calculate the largest supported
> + * alignment for all CFMW to maximize the amount of mappable capacity.
> + */
> +static int __init acpi_align_cfmws(union acpi_subtable_headers *header,
> +				   void *arg, const unsigned long table_end)
> +{
> +	struct acpi_cedt_cfmws *cfmws = (struct acpi_cedt_cfmws *)header;
> +	u64 start = cfmws->base_hpa;
> +	u64 size = cfmws->window_size;
> +	unsigned long *fin_bz = arg;
> +	unsigned long bz;
> +
> +	for (bz = SZ_64T; bz >= SZ_256M; bz >>= 1) {
> +		if (IS_ALIGNED(start, bz) && IS_ALIGNED(size, bz))
> +			break;
> +	}
> +
> +	/* Only adjust downward, we never want to increase block size */
> +	if (bz < *fin_bz && bz >= SZ_256M)
> +		*fin_bz = bz;
> +	else if (bz < SZ_256M)
> +		pr_err("CFMWS: [BIOS BUG] base/size alignment violates spec\n");
> +
> +	return 0;
> +}
> +#endif /* defined(CONFIG_MEMORY_HOTPLUG) */
> +
>  static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
>  				   void *arg, const unsigned long table_end)
>  {
> @@ -501,6 +533,10 @@ acpi_table_parse_srat(enum acpi_srat_type id,
>  int __init acpi_numa_init(void)
>  {
>  	int i, fake_pxm, cnt = 0;
> +#if defined(CONFIG_MEMORY_HOTPLUG)
> +	unsigned long block_sz = memory_block_size_bytes();

To help address David's comment as well;

Is there a way to scan all the alignments of the windows and pass the
desired alignment to the arch in a new call and have the arch determine if
changing the order is ok?

Also the call to the arch would be a noop for !CONFIG_MEMORY_HOTPLUG which
cleans up this function WRT CONFIG_MEMORY_HOTPLUG.

Ira

> +	unsigned long cfmw_align = block_sz;
> +#endif /* defined(CONFIG_MEMORY_HOTPLUG) */
>  
>  	if (acpi_disabled)
>  		return -EINVAL;
> @@ -552,6 +588,18 @@ int __init acpi_numa_init(void)
>  	}
>  	last_real_pxm = fake_pxm;
>  	fake_pxm++;
> +
> +#if defined(CONFIG_MEMORY_HOTPLUG)
> +	/* Calculate and set largest supported memory block size alignment */
> +	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_align_cfmws,
> +			      &cfmw_align);
> +	if (cfmw_align < block_sz && cfmw_align >= SZ_256M) {
> +		if (set_memory_block_size_order(ffs(cfmw_align)-1))
> +			pr_warn("CFMWS: Unable to adjust memory block size\n");
> +	}
> +#endif /* defined(CONFIG_MEMORY_HOTPLUG) */
> +
> +	/* Then parse and fill the numa nodes with the described memory */
>  	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
>  			      &fake_pxm);
>  
> -- 
> 2.43.0
> 
> 




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] memory: extern memory_block_size_bytes and set_memory_block_size_order
  2024-10-08 14:51     ` Gregory Price
@ 2024-10-08 15:02       ` David Hildenbrand
  2024-10-08 15:21         ` Gregory Price
  0 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2024-10-08 15:02 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-cxl, x86, linux-mm, linux-acpi, linux-kernel, dave.hansen,
	luto, peterz, tglx, mingo, bp, hpa, osalvador, gregkh, rafael,
	akpm, dan.j.williams, Jonathan.Cameron, alison.schofield,
	rrichter, terry.bowman, lenb, dave.jiang, ira.weiny

On 08.10.24 16:51, Gregory Price wrote:
> On Tue, Oct 08, 2024 at 04:03:37PM +0200, David Hildenbrand wrote:
>> On 08.10.24 06:43, Gregory Price wrote:
>>> On CXL systems, block alignment may be as small as 256MB, which may
>>> require a resize of the block size during initialization.  This is done
>>> in the ACPI driver, so the resize function need to be made available.
>>>
>>> Presently, only x86 provides the functionality to resize memory
>>> block sizes.  Wire up a weak stub for set_memory_block_size_order,
>>> similar to memory_block_size_bytes, that simply returns -ENODEV.
>>>
>>> Since set_memory_block_size_order is now extern, we also need to
>>> drop the __init macro.
>>>
>>> Signed-off-by: Gregory Price <gourry@gourry.net>
>>> ---
>>>    arch/x86/mm/init_64.c  | 2 +-
>>>    drivers/base/memory.c  | 6 ++++++
>>>    include/linux/memory.h | 4 ++--
>>>    3 files changed, 9 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>>> index ff253648706f..6086f99449fa 100644
>>> --- a/arch/x86/mm/init_64.c
>>> +++ b/arch/x86/mm/init_64.c
>>> @@ -1424,7 +1424,7 @@ void mark_rodata_ro(void)
>>>    /* Adjustable memory block size */
>>>    static unsigned long set_memory_block_size;
>>> -int __init set_memory_block_size_order(unsigned int order)
>>> +int set_memory_block_size_order(unsigned int order)
>>>    {
>>>    	unsigned long size = 1UL << order;
>>> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
>>> index 67858eeb92ed..f9045642f69e 100644
>>> --- a/drivers/base/memory.c
>>> +++ b/drivers/base/memory.c
>>> @@ -110,6 +110,12 @@ static void memory_block_release(struct device *dev)
>>>    	kfree(mem);
>>>    }
>>> +int __weak set_memory_block_size_order(unsigned int order)
>>> +{
>>> +	return -ENODEV;
>>> +}
>>> +EXPORT_SYMBOL_GPL(set_memory_block_size_order);
>>
>> I can understand what you are trying to achieve, but letting arbitrary
>> modules mess with this sounds like a bad idea.
>>
> 
> I suppose the alternative is trying to scan the CEDT from inside each
> machine, rather than the ACPI driver?  Seems less maintainable.
> 
> I don't entirely disagree with your comment.  I hummed and hawwed over
> externing this - hence the warning in the x86 machine.
> 
> Open to better answers.

Maybe an interface to add more restrictions on the maximum size might be 
better (instead of setting the size/order, you would impose another 
upper limit). Just imagine having various users of such an interface ..

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] acpi,srat: reduce memory block size if CFMWS has a smaller alignment
  2024-10-08 14:58   ` Ira Weiny
@ 2024-10-08 15:17     ` Gregory Price
  2024-10-08 16:46       ` Dan Williams
  2024-10-08 19:02       ` Ira Weiny
  0 siblings, 2 replies; 21+ messages in thread
From: Gregory Price @ 2024-10-08 15:17 UTC (permalink / raw)
  To: Ira Weiny
  Cc: linux-cxl, x86, linux-mm, linux-acpi, linux-kernel, dave.hansen,
	luto, peterz, tglx, mingo, bp, hpa, david, osalvador, gregkh,
	rafael, akpm, dan.j.williams, Jonathan.Cameron, alison.schofield,
	rrichter, terry.bowman, lenb, dave.jiang

On Tue, Oct 08, 2024 at 09:58:35AM -0500, Ira Weiny wrote:
> Gregory Price wrote:
> > The CXL Fixed Memory Window allows for memory aligned down to the
> > size of 256MB.  However, by default on x86, memory blocks increase
> > in size as total System RAM capacity increases. On x86, this caps
> > out at 2G when 64GB of System RAM is reached.
> > 
> > When the CFMWS regions are not aligned to memory block size, this
> > results in lost capacity on either side of the alignment.
> > 
> > Parse all CFMWS to detect the largest common denomenator among all
> > regions, and reduce the block size accordingly.
> > 
> > This can only be done when MEMORY_HOTPLUG and SPARSEMEM configs are
> > enabled, but the surrounding code may not necessarily require these
> > configs, so build accordingly.
> > 
> > Suggested-by: Dan Williams <dan.j.williams@intel.com>
> > Signed-off-by: Gregory Price <gourry@gourry.net>
> > ---
> >  drivers/acpi/numa/srat.c | 48 ++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 48 insertions(+)
> > 
> > diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
> > index 44f91f2c6c5d..9367d36eba9a 100644
> > --- a/drivers/acpi/numa/srat.c
> > +++ b/drivers/acpi/numa/srat.c
> > @@ -14,6 +14,7 @@
> >  #include <linux/errno.h>
> >  #include <linux/acpi.h>
> >  #include <linux/memblock.h>
> > +#include <linux/memory.h>
> >  #include <linux/numa.h>
> >  #include <linux/nodemask.h>
> >  #include <linux/topology.h>
> > @@ -333,6 +334,37 @@ acpi_parse_memory_affinity(union acpi_subtable_headers *header,
> >  	return 0;
> >  }
> >  
> > +#if defined(CONFIG_MEMORY_HOTPLUG)
> 
> Generally we avoid config defines in *.c files...  See more below.
> 
> > +/*
> > + * CXL allows CFMW to be aligned along 256MB boundaries, but large memory
> > + * systems default to larger alignments (2GB on x86). Misalignments can
> > + * cause some capacity to become unreachable. Calculate the largest supported
> > + * alignment for all CFMW to maximize the amount of mappable capacity.
> > + */
> > +static int __init acpi_align_cfmws(union acpi_subtable_headers *header,
> > +				   void *arg, const unsigned long table_end)
> > +{
> > +	struct acpi_cedt_cfmws *cfmws = (struct acpi_cedt_cfmws *)header;
> > +	u64 start = cfmws->base_hpa;
> > +	u64 size = cfmws->window_size;
> > +	unsigned long *fin_bz = arg;
> > +	unsigned long bz;
> > +
> > +	for (bz = SZ_64T; bz >= SZ_256M; bz >>= 1) {
> > +		if (IS_ALIGNED(start, bz) && IS_ALIGNED(size, bz))
> > +			break;
> > +	}
> > +
> > +	/* Only adjust downward, we never want to increase block size */
> > +	if (bz < *fin_bz && bz >= SZ_256M)
> > +		*fin_bz = bz;
> > +	else if (bz < SZ_256M)
> > +		pr_err("CFMWS: [BIOS BUG] base/size alignment violates spec\n");
> > +
> > +	return 0;
> > +}
> > +#endif /* defined(CONFIG_MEMORY_HOTPLUG) */
> > +
> >  static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
> >  				   void *arg, const unsigned long table_end)
> >  {
> > @@ -501,6 +533,10 @@ acpi_table_parse_srat(enum acpi_srat_type id,
> >  int __init acpi_numa_init(void)
> >  {
> >  	int i, fake_pxm, cnt = 0;
> > +#if defined(CONFIG_MEMORY_HOTPLUG)
> > +	unsigned long block_sz = memory_block_size_bytes();
> 
> To help address David's comment as well;
> 
> Is there a way to scan all the alignments of the windows and pass the
> desired alignment to the arch in a new call and have the arch determine if
> changing the order is ok?
> 

At least on x86, it's only OK during init, so it would probably look like
setting a static bit (like the global value in x86) and just refusing to
update once it is locked.

I could implement that on the x86 side as an example.

FWIW: this was Dan's suggestion (quoting discord, sorry Dan!)
```
    I am assuming we would call it here
        drivers/acpi/numa/srat.c::acpi_parse_cfmws()
    which should be before page-allocator init
```

It's only safe before page-allocator init (i.e. once blocks start getting
populated and used), and this area occurs before that.


> Also the call to the arch would be a noop for !CONFIG_MEMORY_HOTPLUG which
> cleans up this function WRT CONFIG_MEMORY_HOTPLUG.
> 
> Ira
>

The ifdefs are a nasty result of the HOTPLUG and SPARSEMEM configs
being, from my perview, horrendously inconsistent throughout the system.

As an example, MIN_MEMORY_BLOCK_SIZE depends on SECTION_SIZE_BITS
which on some architectures is dependent on CONFIG_SPARSEMEM, and
on others is defined unconditionally.  Compound this with memblock
usage appearing to imply CONFIG_MEMORY_HOTPLUG which implies
CONFIG_SPARSEMEM (see drivers/base/memory.c) - but mm/memblock.c
makes no such assumption.

The result of this is that if you extern these functions and build
x86 with each combination of HOTPLUG/SPARSMEM on/off, it builds - but
loongarch (and others) fail to build because SECTION_SIZE_BITS doesn't
get defined in some configurations.

It's not clear if removing those ifdefs from those archs is "correct"
(for some definition of correct) and I didn't want to increase scope.

So it's really not clear how to wire this all up.

I spent the better part of a week trying to detangle this mess just
to get things building successfully in LKP and decided to just add the
ifdefs to get it out and get opinions on the issue :[
 
> > +	unsigned long cfmw_align = block_sz;
> > +#endif /* defined(CONFIG_MEMORY_HOTPLUG) */
> >  
> >  	if (acpi_disabled)
> >  		return -EINVAL;
> > @@ -552,6 +588,18 @@ int __init acpi_numa_init(void)
> >  	}
> >  	last_real_pxm = fake_pxm;
> >  	fake_pxm++;
> > +
> > +#if defined(CONFIG_MEMORY_HOTPLUG)
> > +	/* Calculate and set largest supported memory block size alignment */
> > +	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_align_cfmws,
> > +			      &cfmw_align);
> > +	if (cfmw_align < block_sz && cfmw_align >= SZ_256M) {
> > +		if (set_memory_block_size_order(ffs(cfmw_align)-1))
> > +			pr_warn("CFMWS: Unable to adjust memory block size\n");
> > +	}
> > +#endif /* defined(CONFIG_MEMORY_HOTPLUG) */
> > +
> > +	/* Then parse and fill the numa nodes with the described memory */
> >  	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
> >  			      &fake_pxm);
> >  
> > -- 
> > 2.43.0
> > 
> > 
> 
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] memory: extern memory_block_size_bytes and set_memory_block_size_order
  2024-10-08 15:02       ` David Hildenbrand
@ 2024-10-08 15:21         ` Gregory Price
  2024-10-08 19:04           ` Ira Weiny
  2024-10-14 11:54           ` David Hildenbrand
  0 siblings, 2 replies; 21+ messages in thread
From: Gregory Price @ 2024-10-08 15:21 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-cxl, x86, linux-mm, linux-acpi, linux-kernel, dave.hansen,
	luto, peterz, tglx, mingo, bp, hpa, osalvador, gregkh, rafael,
	akpm, dan.j.williams, Jonathan.Cameron, alison.schofield,
	rrichter, terry.bowman, lenb, dave.jiang, ira.weiny

On Tue, Oct 08, 2024 at 05:02:33PM +0200, David Hildenbrand wrote:
> On 08.10.24 16:51, Gregory Price wrote:
> > > > +int __weak set_memory_block_size_order(unsigned int order)
> > > > +{
> > > > +	return -ENODEV;
> > > > +}
> > > > +EXPORT_SYMBOL_GPL(set_memory_block_size_order);
> > > 
> > > I can understand what you are trying to achieve, but letting arbitrary
> > > modules mess with this sounds like a bad idea.
> > > 
> > 
> > I suppose the alternative is trying to scan the CEDT from inside each
> > machine, rather than the ACPI driver?  Seems less maintainable.
> > 
> > I don't entirely disagree with your comment.  I hummed and hawwed over
> > externing this - hence the warning in the x86 machine.
> > 
> > Open to better answers.
> 
> Maybe an interface to add more restrictions on the maximum size might be
> better (instead of setting the size/order, you would impose another upper
> limit).

That is effectively what set_memory_block_size_order is, though.  Once
blocks are exposed to the allocators, its no longer safe to change the
size (in part because it was built assuming it wouldn't change, but I
imagine there are other dragons waiting in the shadows to bite me).

So this would basically amount to a lock-bit being set in the architecture,
beyond which block size can no longer be changed and a big ol' splat
can be generated that says "NO TOUCH".

> Just imagine having various users of such an interface ..

I don't wanna D:

> 
> -- 
> Cheers,
> 
> David / dhildenb
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] acpi,srat: reduce memory block size if CFMWS has a smaller alignment
  2024-10-08 15:17     ` Gregory Price
@ 2024-10-08 16:46       ` Dan Williams
  2024-10-14 11:50         ` David Hildenbrand
  2024-10-08 19:02       ` Ira Weiny
  1 sibling, 1 reply; 21+ messages in thread
From: Dan Williams @ 2024-10-08 16:46 UTC (permalink / raw)
  To: Gregory Price, Ira Weiny
  Cc: linux-cxl, x86, linux-mm, linux-acpi, linux-kernel, dave.hansen,
	luto, peterz, tglx, mingo, bp, hpa, david, osalvador, gregkh,
	rafael, akpm, dan.j.williams, Jonathan.Cameron, alison.schofield,
	rrichter, terry.bowman, lenb, dave.jiang

Gregory Price wrote:
> On Tue, Oct 08, 2024 at 09:58:35AM -0500, Ira Weiny wrote:
> > Gregory Price wrote:
> > > The CXL Fixed Memory Window allows for memory aligned down to the
> > > size of 256MB.  However, by default on x86, memory blocks increase
> > > in size as total System RAM capacity increases. On x86, this caps
> > > out at 2G when 64GB of System RAM is reached.
> > > 
> > > When the CFMWS regions are not aligned to memory block size, this
> > > results in lost capacity on either side of the alignment.
> > > 
> > > Parse all CFMWS to detect the largest common denomenator among all
> > > regions, and reduce the block size accordingly.
> > > 
> > > This can only be done when MEMORY_HOTPLUG and SPARSEMEM configs are
> > > enabled, but the surrounding code may not necessarily require these
> > > configs, so build accordingly.
> > > 
> > > Suggested-by: Dan Williams <dan.j.williams@intel.com>
> > > Signed-off-by: Gregory Price <gourry@gourry.net>
> > > ---
[..]
> > To help address David's comment as well;
> > 
> > Is there a way to scan all the alignments of the windows and pass the
> > desired alignment to the arch in a new call and have the arch determine if
> > changing the order is ok?
> > 
> 
> At least on x86, it's only OK during init, so it would probably look like
> setting a static bit (like the global value in x86) and just refusing to
> update once it is locked.
> 
> I could implement that on the x86 side as an example.
> 
> FWIW: this was Dan's suggestion (quoting discord, sorry Dan!)
> ```
>     I am assuming we would call it here
>         drivers/acpi/numa/srat.c::acpi_parse_cfmws()
>     which should be before page-allocator init
> ```
> 
> It's only safe before page-allocator init (i.e. once blocks start getting
> populated and used), and this area occurs before that.

I will note though that drivers/acpi/numa/srat.c is always built-in, so
there is no need for set_memory_block_size_order() to be EXPORT_SYMBOL
for modules to play with, just an extern for NUMA init to access.


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] acpi,srat: reduce memory block size if CFMWS has a smaller alignment
  2024-10-08 15:17     ` Gregory Price
  2024-10-08 16:46       ` Dan Williams
@ 2024-10-08 19:02       ` Ira Weiny
  1 sibling, 0 replies; 21+ messages in thread
From: Ira Weiny @ 2024-10-08 19:02 UTC (permalink / raw)
  To: Gregory Price, Ira Weiny
  Cc: linux-cxl, x86, linux-mm, linux-acpi, linux-kernel, dave.hansen,
	luto, peterz, tglx, mingo, bp, hpa, david, osalvador, gregkh,
	rafael, akpm, dan.j.williams, Jonathan.Cameron, alison.schofield,
	rrichter, terry.bowman, lenb, dave.jiang

Gregory Price wrote:
> On Tue, Oct 08, 2024 at 09:58:35AM -0500, Ira Weiny wrote:
> > Gregory Price wrote:
> > > The CXL Fixed Memory Window allows for memory aligned down to the
> > > size of 256MB.  However, by default on x86, memory blocks increase
> > > in size as total System RAM capacity increases. On x86, this caps
> > > out at 2G when 64GB of System RAM is reached.
> > > 
> > > When the CFMWS regions are not aligned to memory block size, this
> > > results in lost capacity on either side of the alignment.
> > > 
> > > Parse all CFMWS to detect the largest common denomenator among all
> > > regions, and reduce the block size accordingly.
> > > 
> > > This can only be done when MEMORY_HOTPLUG and SPARSEMEM configs are
> > > enabled, but the surrounding code may not necessarily require these
> > > configs, so build accordingly.
> > > 
> > > Suggested-by: Dan Williams <dan.j.williams@intel.com>
> > > Signed-off-by: Gregory Price <gourry@gourry.net>
> > > ---
> > >  drivers/acpi/numa/srat.c | 48 ++++++++++++++++++++++++++++++++++++++++
> > >  1 file changed, 48 insertions(+)
> > > 
> > > diff --git a/drivers/acpi/numa/srat.c b/drivers/acpi/numa/srat.c
> > > index 44f91f2c6c5d..9367d36eba9a 100644
> > > --- a/drivers/acpi/numa/srat.c
> > > +++ b/drivers/acpi/numa/srat.c
> > > @@ -14,6 +14,7 @@
> > >  #include <linux/errno.h>
> > >  #include <linux/acpi.h>
> > >  #include <linux/memblock.h>
> > > +#include <linux/memory.h>
> > >  #include <linux/numa.h>
> > >  #include <linux/nodemask.h>
> > >  #include <linux/topology.h>
> > > @@ -333,6 +334,37 @@ acpi_parse_memory_affinity(union acpi_subtable_headers *header,
> > >  	return 0;
> > >  }
> > >  
> > > +#if defined(CONFIG_MEMORY_HOTPLUG)
> > 
> > Generally we avoid config defines in *.c files...  See more below.
> > 
> > > +/*
> > > + * CXL allows CFMW to be aligned along 256MB boundaries, but large memory
> > > + * systems default to larger alignments (2GB on x86). Misalignments can
> > > + * cause some capacity to become unreachable. Calculate the largest supported
> > > + * alignment for all CFMW to maximize the amount of mappable capacity.
> > > + */
> > > +static int __init acpi_align_cfmws(union acpi_subtable_headers *header,
> > > +				   void *arg, const unsigned long table_end)
> > > +{
> > > +	struct acpi_cedt_cfmws *cfmws = (struct acpi_cedt_cfmws *)header;
> > > +	u64 start = cfmws->base_hpa;
> > > +	u64 size = cfmws->window_size;
> > > +	unsigned long *fin_bz = arg;
> > > +	unsigned long bz;
> > > +
> > > +	for (bz = SZ_64T; bz >= SZ_256M; bz >>= 1) {
> > > +		if (IS_ALIGNED(start, bz) && IS_ALIGNED(size, bz))
> > > +			break;
> > > +	}
> > > +
> > > +	/* Only adjust downward, we never want to increase block size */
> > > +	if (bz < *fin_bz && bz >= SZ_256M)
> > > +		*fin_bz = bz;
> > > +	else if (bz < SZ_256M)
> > > +		pr_err("CFMWS: [BIOS BUG] base/size alignment violates spec\n");
> > > +
> > > +	return 0;
> > > +}
> > > +#endif /* defined(CONFIG_MEMORY_HOTPLUG) */
> > > +
> > >  static int __init acpi_parse_cfmws(union acpi_subtable_headers *header,
> > >  				   void *arg, const unsigned long table_end)
> > >  {
> > > @@ -501,6 +533,10 @@ acpi_table_parse_srat(enum acpi_srat_type id,
> > >  int __init acpi_numa_init(void)
> > >  {
> > >  	int i, fake_pxm, cnt = 0;
> > > +#if defined(CONFIG_MEMORY_HOTPLUG)
> > > +	unsigned long block_sz = memory_block_size_bytes();
> > 
> > To help address David's comment as well;
> > 
> > Is there a way to scan all the alignments of the windows and pass the
> > desired alignment to the arch in a new call and have the arch determine if
> > changing the order is ok?
> > 
> 
> At least on x86, it's only OK during init, so it would probably look like
> setting a static bit (like the global value in x86) and just refusing to
> update once it is locked.
> 
> I could implement that on the x86 side as an example.
> 
> FWIW: this was Dan's suggestion (quoting discord, sorry Dan!)
> ```
>     I am assuming we would call it here
>         drivers/acpi/numa/srat.c::acpi_parse_cfmws()
>     which should be before page-allocator init
> ```
> 
> It's only safe before page-allocator init (i.e. once blocks start getting
> populated and used), and this area occurs before that.

Right I was expecting the call to the arch to be here.

> 
> 
> > Also the call to the arch would be a noop for !CONFIG_MEMORY_HOTPLUG which
> > cleans up this function WRT CONFIG_MEMORY_HOTPLUG.
> > 
> > Ira
> >
> 
> The ifdefs are a nasty result of the HOTPLUG and SPARSEMEM configs
> being, from my perview, horrendously inconsistent throughout the system.

:-(

> 
> As an example, MIN_MEMORY_BLOCK_SIZE depends on SECTION_SIZE_BITS
> which on some architectures is dependent on CONFIG_SPARSEMEM, and
> on others is defined unconditionally.  Compound this with memblock
> usage appearing to imply CONFIG_MEMORY_HOTPLUG which implies
> CONFIG_SPARSEMEM (see drivers/base/memory.c) - but mm/memblock.c
> makes no such assumption.
> 
> The result of this is that if you extern these functions and build
> x86 with each combination of HOTPLUG/SPARSMEM on/off, it builds - but
> loongarch (and others) fail to build because SECTION_SIZE_BITS doesn't
> get defined in some configurations.
> 
> It's not clear if removing those ifdefs from those archs is "correct"
> (for some definition of correct) and I didn't want to increase scope.
> 
> So it's really not clear how to wire this all up.
> 
> I spent the better part of a week trying to detangle this mess just
> to get things building successfully in LKP and decided to just add the
> ifdefs to get it out and get opinions on the issue :[

Yea sometimes it is best to throw things out there.  An idea I have is to
have something like CONFIG_ARCH_HAS_ADVISE_ALIGNMENT which is only defined
for x86.  Other arch's get a default which is a noop.

So the code would be something like:

	unsigned long long cfmw_align = SZ_64T;

	/* Find largest CXL window alignment */
	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, max_cfwm_align,
			      &cfmw_align);

	/* name/interface TBD */
	if (arch_advise_alignment(cfmw_align))
		pr_warn("...", cfmw_align);

This would set the alignment up early in init if the arch allows for it.
Because it is an 'advise' call (again name TBD) it does not mean anything
is set for sure.

FWIW I am sure that David and Dan have better ideas.  Just trying to help.

Ira

>  
> > > +	unsigned long cfmw_align = block_sz;
> > > +#endif /* defined(CONFIG_MEMORY_HOTPLUG) */
> > >  
> > >  	if (acpi_disabled)
> > >  		return -EINVAL;
> > > @@ -552,6 +588,18 @@ int __init acpi_numa_init(void)
> > >  	}
> > >  	last_real_pxm = fake_pxm;
> > >  	fake_pxm++;
> > > +
> > > +#if defined(CONFIG_MEMORY_HOTPLUG)
> > > +	/* Calculate and set largest supported memory block size alignment */
> > > +	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_align_cfmws,
> > > +			      &cfmw_align);
> > > +	if (cfmw_align < block_sz && cfmw_align >= SZ_256M) {
> > > +		if (set_memory_block_size_order(ffs(cfmw_align)-1))
> > > +			pr_warn("CFMWS: Unable to adjust memory block size\n");
> > > +	}
> > > +#endif /* defined(CONFIG_MEMORY_HOTPLUG) */
> > > +
> > > +	/* Then parse and fill the numa nodes with the described memory */
> > >  	acpi_table_parse_cedt(ACPI_CEDT_TYPE_CFMWS, acpi_parse_cfmws,
> > >  			      &fake_pxm);
> > >  
> > > -- 
> > > 2.43.0
> > > 
> > > 
> > 
> > 
> 




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] memory: extern memory_block_size_bytes and set_memory_block_size_order
  2024-10-08 15:21         ` Gregory Price
@ 2024-10-08 19:04           ` Ira Weiny
  2024-10-08 19:45             ` Gregory Price
  2024-10-14 11:54           ` David Hildenbrand
  1 sibling, 1 reply; 21+ messages in thread
From: Ira Weiny @ 2024-10-08 19:04 UTC (permalink / raw)
  To: Gregory Price, David Hildenbrand
  Cc: linux-cxl, x86, linux-mm, linux-acpi, linux-kernel, dave.hansen,
	luto, peterz, tglx, mingo, bp, hpa, osalvador, gregkh, rafael,
	akpm, dan.j.williams, Jonathan.Cameron, alison.schofield,
	rrichter, terry.bowman, lenb, dave.jiang, ira.weiny

Gregory Price wrote:
> On Tue, Oct 08, 2024 at 05:02:33PM +0200, David Hildenbrand wrote:
> > On 08.10.24 16:51, Gregory Price wrote:
> > > > > +int __weak set_memory_block_size_order(unsigned int order)
> > > > > +{
> > > > > +	return -ENODEV;
> > > > > +}
> > > > > +EXPORT_SYMBOL_GPL(set_memory_block_size_order);
> > > > 
> > > > I can understand what you are trying to achieve, but letting arbitrary
> > > > modules mess with this sounds like a bad idea.
> > > > 
> > > 
> > > I suppose the alternative is trying to scan the CEDT from inside each
> > > machine, rather than the ACPI driver?  Seems less maintainable.
> > > 
> > > I don't entirely disagree with your comment.  I hummed and hawwed over
> > > externing this - hence the warning in the x86 machine.
> > > 
> > > Open to better answers.
> > 
> > Maybe an interface to add more restrictions on the maximum size might be
> > better (instead of setting the size/order, you would impose another upper
> > limit).
> 
> That is effectively what set_memory_block_size_order is, though.  Once
> blocks are exposed to the allocators, its no longer safe to change the
> size (in part because it was built assuming it wouldn't change, but I
> imagine there are other dragons waiting in the shadows to bite me).

Yea I think this is along the idea I had.  But much clearer.

Ira

> 
> So this would basically amount to a lock-bit being set in the architecture,
> beyond which block size can no longer be changed and a big ol' splat
> can be generated that says "NO TOUCH".
> 
> > Just imagine having various users of such an interface ..
> 
> I don't wanna D:
> 
> > 
> > -- 
> > Cheers,
> > 
> > David / dhildenb
> > 




^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] memory: extern memory_block_size_bytes and set_memory_block_size_order
  2024-10-08 19:04           ` Ira Weiny
@ 2024-10-08 19:45             ` Gregory Price
  0 siblings, 0 replies; 21+ messages in thread
From: Gregory Price @ 2024-10-08 19:45 UTC (permalink / raw)
  To: Ira Weiny
  Cc: David Hildenbrand, linux-cxl, x86, linux-mm, linux-acpi,
	linux-kernel, dave.hansen, luto, peterz, tglx, mingo, bp, hpa,
	osalvador, gregkh, rafael, akpm, dan.j.williams, Jonathan.Cameron,
	alison.schofield, rrichter, terry.bowman, lenb, dave.jiang

On Tue, Oct 08, 2024 at 02:04:05PM -0500, Ira Weiny wrote:
> Gregory Price wrote:
> > On Tue, Oct 08, 2024 at 05:02:33PM +0200, David Hildenbrand wrote:
> > > On 08.10.24 16:51, Gregory Price wrote:
> > > > > > +int __weak set_memory_block_size_order(unsigned int order)
> > > > > > +{
> > > > > > +	return -ENODEV;
> > > > > > +}
> > > > > > +EXPORT_SYMBOL_GPL(set_memory_block_size_order);
> > > > > 
> > > > > I can understand what you are trying to achieve, but letting arbitrary
> > > > > modules mess with this sounds like a bad idea.
> > > > > 
> > > > 
> > > > I suppose the alternative is trying to scan the CEDT from inside each
> > > > machine, rather than the ACPI driver?  Seems less maintainable.
> > > > 
> > > > I don't entirely disagree with your comment.  I hummed and hawwed over
> > > > externing this - hence the warning in the x86 machine.
> > > > 
> > > > Open to better answers.
> > > 
> > > Maybe an interface to add more restrictions on the maximum size might be
> > > better (instead of setting the size/order, you would impose another upper
> > > limit).
> > 
> > That is effectively what set_memory_block_size_order is, though.  Once
> > blocks are exposed to the allocators, its no longer safe to change the
> > size (in part because it was built assuming it wouldn't change, but I
> > imagine there are other dragons waiting in the shadows to bite me).
> 
> Yea I think this is along the idea I had.  But much clearer.
> 
> Ira
> 

Dan seems to think I can just extern without EXPORT, so let me see if I can
get that working first.  Then I'll see if I can add a lock bit.

I'll see if i can make an arch_advise call that does away with some of the
ifdef spaghetti.

> > 
> > So this would basically amount to a lock-bit being set in the architecture,
> > beyond which block size can no longer be changed and a big ol' splat
> > can be generated that says "NO TOUCH".
> > 
> > > Just imagine having various users of such an interface ..
> > 
> > I don't wanna D:
> > 
> > > 
> > > -- 
> > > Cheers,
> > > 
> > > David / dhildenb
> > > 
> 
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 3/3] acpi,srat: reduce memory block size if CFMWS has a smaller alignment
  2024-10-08 16:46       ` Dan Williams
@ 2024-10-14 11:50         ` David Hildenbrand
  0 siblings, 0 replies; 21+ messages in thread
From: David Hildenbrand @ 2024-10-14 11:50 UTC (permalink / raw)
  To: Dan Williams, Gregory Price, Ira Weiny
  Cc: linux-cxl, x86, linux-mm, linux-acpi, linux-kernel, dave.hansen,
	luto, peterz, tglx, mingo, bp, hpa, osalvador, gregkh, rafael,
	akpm, Jonathan.Cameron, alison.schofield, rrichter, terry.bowman,
	lenb, dave.jiang

On 08.10.24 18:46, Dan Williams wrote:
> Gregory Price wrote:
>> On Tue, Oct 08, 2024 at 09:58:35AM -0500, Ira Weiny wrote:
>>> Gregory Price wrote:
>>>> The CXL Fixed Memory Window allows for memory aligned down to the
>>>> size of 256MB.  However, by default on x86, memory blocks increase
>>>> in size as total System RAM capacity increases. On x86, this caps
>>>> out at 2G when 64GB of System RAM is reached.
>>>>
>>>> When the CFMWS regions are not aligned to memory block size, this
>>>> results in lost capacity on either side of the alignment.
>>>>
>>>> Parse all CFMWS to detect the largest common denomenator among all
>>>> regions, and reduce the block size accordingly.
>>>>
>>>> This can only be done when MEMORY_HOTPLUG and SPARSEMEM configs are
>>>> enabled, but the surrounding code may not necessarily require these
>>>> configs, so build accordingly.
>>>>
>>>> Suggested-by: Dan Williams <dan.j.williams@intel.com>
>>>> Signed-off-by: Gregory Price <gourry@gourry.net>
>>>> ---
> [..]
>>> To help address David's comment as well;
>>>
>>> Is there a way to scan all the alignments of the windows and pass the
>>> desired alignment to the arch in a new call and have the arch determine if
>>> changing the order is ok?
>>>
>>
>> At least on x86, it's only OK during init, so it would probably look like
>> setting a static bit (like the global value in x86) and just refusing to
>> update once it is locked.
>>
>> I could implement that on the x86 side as an example.
>>
>> FWIW: this was Dan's suggestion (quoting discord, sorry Dan!)
>> ```
>>      I am assuming we would call it here
>>          drivers/acpi/numa/srat.c::acpi_parse_cfmws()
>>      which should be before page-allocator init
>> ```
>>
>> It's only safe before page-allocator init (i.e. once blocks start getting
>> populated and used), and this area occurs before that.

Sorry for the late reply. It must also be called before 
memory_dev_init(), which happens after the buddy is up IIRC.

> 
> I will note though that drivers/acpi/numa/srat.c is always built-in, so
> there is no need for set_memory_block_size_order() to be EXPORT_SYMBOL
> for modules to play with, just an extern for NUMA init to access.

That's the magic piece I was missing.

Because it didn't make too much sense for me a call to this function 
would ever makes sense after modules where loaded.

This really only works that early during boot, that modules are not 
loaded yet.

So this patch here should be dropped.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] memory: extern memory_block_size_bytes and set_memory_block_size_order
  2024-10-08 15:21         ` Gregory Price
  2024-10-08 19:04           ` Ira Weiny
@ 2024-10-14 11:54           ` David Hildenbrand
  2024-10-14 14:25             ` Gregory Price
  1 sibling, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2024-10-14 11:54 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-cxl, x86, linux-mm, linux-acpi, linux-kernel, dave.hansen,
	luto, peterz, tglx, mingo, bp, hpa, osalvador, gregkh, rafael,
	akpm, dan.j.williams, Jonathan.Cameron, alison.schofield,
	rrichter, terry.bowman, lenb, dave.jiang, ira.weiny

On 08.10.24 17:21, Gregory Price wrote:
> On Tue, Oct 08, 2024 at 05:02:33PM +0200, David Hildenbrand wrote:
>> On 08.10.24 16:51, Gregory Price wrote:
>>>>> +int __weak set_memory_block_size_order(unsigned int order)
>>>>> +{
>>>>> +	return -ENODEV;
>>>>> +}
>>>>> +EXPORT_SYMBOL_GPL(set_memory_block_size_order);
>>>>
>>>> I can understand what you are trying to achieve, but letting arbitrary
>>>> modules mess with this sounds like a bad idea.
>>>>
>>>
>>> I suppose the alternative is trying to scan the CEDT from inside each
>>> machine, rather than the ACPI driver?  Seems less maintainable.
>>>
>>> I don't entirely disagree with your comment.  I hummed and hawwed over
>>> externing this - hence the warning in the x86 machine.
>>>
>>> Open to better answers.
>>
>> Maybe an interface to add more restrictions on the maximum size might be
>> better (instead of setting the size/order, you would impose another upper
>> limit).
> 
> That is effectively what set_memory_block_size_order is, though.  Once
> blocks are exposed to the allocators, its no longer safe to change the
> size (in part because it was built assuming it wouldn't change, but I
> imagine there are other dragons waiting in the shadows to bite me).

Yes, we must run very early.

How is this supposed to interact with code like

set_block_size()

that also calls set_memory_block_size_order() on UV systems (assuming 
there will be CXL support sooner or later?)?


> 
> So this would basically amount to a lock-bit being set in the architecture,
> beyond which block size can no longer be changed and a big ol' splat
> can be generated that says "NO TOUCH".
> 
>> Just imagine having various users of such an interface ..
> 
> I don't wanna D:

Right, and it also doesn't make sense as explained in my other comment: 
this should never apply to loaded modules. :)

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] memory: extern memory_block_size_bytes and set_memory_block_size_order
  2024-10-14 11:54           ` David Hildenbrand
@ 2024-10-14 14:25             ` Gregory Price
  2024-10-14 20:32               ` David Hildenbrand
  0 siblings, 1 reply; 21+ messages in thread
From: Gregory Price @ 2024-10-14 14:25 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-cxl, x86, linux-mm, linux-acpi, linux-kernel, dave.hansen,
	luto, peterz, tglx, mingo, bp, hpa, osalvador, gregkh, rafael,
	akpm, dan.j.williams, Jonathan.Cameron, alison.schofield,
	rrichter, terry.bowman, lenb, dave.jiang, ira.weiny

On Mon, Oct 14, 2024 at 01:54:27PM +0200, David Hildenbrand wrote:
> On 08.10.24 17:21, Gregory Price wrote:
> > On Tue, Oct 08, 2024 at 05:02:33PM +0200, David Hildenbrand wrote:
> > > On 08.10.24 16:51, Gregory Price wrote:
> > > > > > +int __weak set_memory_block_size_order(unsigned int order)
> > > > > > +{
> > > > > > +	return -ENODEV;
> > > > > > +}
> > > > > > +EXPORT_SYMBOL_GPL(set_memory_block_size_order);
> > > > > 
> > > > > I can understand what you are trying to achieve, but letting arbitrary
> > > > > modules mess with this sounds like a bad idea.
> > > > > 
> > > > 
> > > > I suppose the alternative is trying to scan the CEDT from inside each
> > > > machine, rather than the ACPI driver?  Seems less maintainable.
> > > > 
> > > > I don't entirely disagree with your comment.  I hummed and hawwed over
> > > > externing this - hence the warning in the x86 machine.
> > > > 
> > > > Open to better answers.
> > > 
> > > Maybe an interface to add more restrictions on the maximum size might be
> > > better (instead of setting the size/order, you would impose another upper
> > > limit).
> > 
> > That is effectively what set_memory_block_size_order is, though.  Once
> > blocks are exposed to the allocators, its no longer safe to change the
> > size (in part because it was built assuming it wouldn't change, but I
> > imagine there are other dragons waiting in the shadows to bite me).
> 
> Yes, we must run very early.
> 
> How is this supposed to interact with code like
> 
> set_block_size()
> 
> that also calls set_memory_block_size_order() on UV systems (assuming there
> will be CXL support sooner or later?)?
> 
> 

Tying the other email to this one - just clarifying the way forward here.

It sounds like you're saying at a minimum drop EXPORT tags to prevent
modules from calling it - but it also sounds like built-ins need to be
prevented from touching it as well after a certain point in early boot.

Do you think I should go down the advise() path as suggested by Ira,
just adding a arch_lock_blocksize() bit and have set_..._order check it,
or should we just move towards each architecture having to go through
the ACPI:CEDT itself?

Doesn't sound like we've quite hit a consensus on where the actual
adjustment logic should land - just that this shouldn't be touched by modules.

~Gregory


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] memory: extern memory_block_size_bytes and set_memory_block_size_order
  2024-10-14 14:25             ` Gregory Price
@ 2024-10-14 20:32               ` David Hildenbrand
  2024-10-14 22:40                 ` Gregory Price
  0 siblings, 1 reply; 21+ messages in thread
From: David Hildenbrand @ 2024-10-14 20:32 UTC (permalink / raw)
  To: Gregory Price
  Cc: linux-cxl, x86, linux-mm, linux-acpi, linux-kernel, dave.hansen,
	luto, peterz, tglx, mingo, bp, hpa, osalvador, gregkh, rafael,
	akpm, dan.j.williams, Jonathan.Cameron, alison.schofield,
	rrichter, terry.bowman, lenb, dave.jiang, ira.weiny

On 14.10.24 16:25, Gregory Price wrote:
> On Mon, Oct 14, 2024 at 01:54:27PM +0200, David Hildenbrand wrote:
>> On 08.10.24 17:21, Gregory Price wrote:
>>> On Tue, Oct 08, 2024 at 05:02:33PM +0200, David Hildenbrand wrote:
>>>> On 08.10.24 16:51, Gregory Price wrote:
>>>>>>> +int __weak set_memory_block_size_order(unsigned int order)
>>>>>>> +{
>>>>>>> +	return -ENODEV;
>>>>>>> +}
>>>>>>> +EXPORT_SYMBOL_GPL(set_memory_block_size_order);
>>>>>>
>>>>>> I can understand what you are trying to achieve, but letting arbitrary
>>>>>> modules mess with this sounds like a bad idea.
>>>>>>
>>>>>
>>>>> I suppose the alternative is trying to scan the CEDT from inside each
>>>>> machine, rather than the ACPI driver?  Seems less maintainable.
>>>>>
>>>>> I don't entirely disagree with your comment.  I hummed and hawwed over
>>>>> externing this - hence the warning in the x86 machine.
>>>>>
>>>>> Open to better answers.
>>>>
>>>> Maybe an interface to add more restrictions on the maximum size might be
>>>> better (instead of setting the size/order, you would impose another upper
>>>> limit).
>>>
>>> That is effectively what set_memory_block_size_order is, though.  Once
>>> blocks are exposed to the allocators, its no longer safe to change the
>>> size (in part because it was built assuming it wouldn't change, but I
>>> imagine there are other dragons waiting in the shadows to bite me).
>>
>> Yes, we must run very early.
>>
>> How is this supposed to interact with code like
>>
>> set_block_size()
>>
>> that also calls set_memory_block_size_order() on UV systems (assuming there
>> will be CXL support sooner or later?)?
>>
>>
> 
> Tying the other email to this one - just clarifying the way forward here.
> 
> It sounds like you're saying at a minimum drop EXPORT tags to prevent
> modules from calling it - but it also sounds like built-ins need to be
> prevented from touching it as well after a certain point in early boot.

Right, at least the EXPORT is not required.

> 
> Do you think I should go down the advise() path as suggested by Ira,
> just adding a arch_lock_blocksize() bit and have set_..._order check it,
> or should we just move towards each architecture having to go through
> the ACPI:CEDT itself?

Let's summarize what we currently have on x86 is:

1) probe_memory_block_size()

Triggered on first memory_block_size_bytes() invocation. Makes a 
decision based on:

a) Already set size using set_memory_block_size_order()
b) RAM size
c) Bare metal vs. virt (bare metal -> use max)
d) Virt: largest block size aligned to memory end


2) set_memory_block_size_order()

Triggered by set_block_size() on UV systems.


I don't think set_memory_block_size_order() is the right tool to use. We 
just want to leave that alone I think -- it's a direct translation of a 
kernel cmdline parameter that should win.

You essentially want to tweak the b)->d) logic to take other alignment 
into consideration.

Maybe have some simple callback mechanism probe_memory_block_size() that 
can consult other sources for alignment requirements?

If that's not an option, then another way to set further min-alignment 
requirements (whereby we take MIN(old_align, new_align))?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH 1/3] memory: extern memory_block_size_bytes and set_memory_block_size_order
  2024-10-14 20:32               ` David Hildenbrand
@ 2024-10-14 22:40                 ` Gregory Price
  0 siblings, 0 replies; 21+ messages in thread
From: Gregory Price @ 2024-10-14 22:40 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-cxl, x86, linux-mm, linux-acpi, linux-kernel, dave.hansen,
	luto, peterz, tglx, mingo, bp, hpa, osalvador, gregkh, rafael,
	akpm, dan.j.williams, Jonathan.Cameron, alison.schofield,
	rrichter, terry.bowman, lenb, dave.jiang, ira.weiny

On Mon, Oct 14, 2024 at 10:32:36PM +0200, David Hildenbrand wrote:
> On 14.10.24 16:25, Gregory Price wrote:
> > On Mon, Oct 14, 2024 at 01:54:27PM +0200, David Hildenbrand wrote:
> > > On 08.10.24 17:21, Gregory Price wrote:
> > > > On Tue, Oct 08, 2024 at 05:02:33PM +0200, David Hildenbrand wrote:
> > > > > On 08.10.24 16:51, Gregory Price wrote:
> > > > > > > > +int __weak set_memory_block_size_order(unsigned int order)
> > > > > > > > +{
> > > > > > > > +	return -ENODEV;
> > > > > > > > +}
> > > > > > > > +EXPORT_SYMBOL_GPL(set_memory_block_size_order);
> > > > > > > 
> > > > > > > I can understand what you are trying to achieve, but letting arbitrary
> > > > > > > modules mess with this sounds like a bad idea.
> > > > > > > 
> > > > > > 
> > > > > > I suppose the alternative is trying to scan the CEDT from inside each
> > > > > > machine, rather than the ACPI driver?  Seems less maintainable.
> > > > > > 
> > > > > > I don't entirely disagree with your comment.  I hummed and hawwed over
> > > > > > externing this - hence the warning in the x86 machine.
> > > > > > 
> > > > > > Open to better answers.
> > > > > 
> > > > > Maybe an interface to add more restrictions on the maximum size might be
> > > > > better (instead of setting the size/order, you would impose another upper
> > > > > limit).
> > > > 
> > > > That is effectively what set_memory_block_size_order is, though.  Once
> > > > blocks are exposed to the allocators, its no longer safe to change the
> > > > size (in part because it was built assuming it wouldn't change, but I
> > > > imagine there are other dragons waiting in the shadows to bite me).
> > > 
> > > Yes, we must run very early.
> > > 
> > > How is this supposed to interact with code like
> > > 
> > > set_block_size()
> > > 
> > > that also calls set_memory_block_size_order() on UV systems (assuming there
> > > will be CXL support sooner or later?)?
> > > 
> > > 
> > 
> > Tying the other email to this one - just clarifying the way forward here.
> > 
> > It sounds like you're saying at a minimum drop EXPORT tags to prevent
> > modules from calling it - but it also sounds like built-ins need to be
> > prevented from touching it as well after a certain point in early boot.
> 
> Right, at least the EXPORT is not required.
> 
> > 
> > Do you think I should go down the advise() path as suggested by Ira,
> > just adding a arch_lock_blocksize() bit and have set_..._order check it,
> > or should we just move towards each architecture having to go through
> > the ACPI:CEDT itself?
> 
> Let's summarize what we currently have on x86 is:
> 
> 1) probe_memory_block_size()
> 
> Triggered on first memory_block_size_bytes() invocation. Makes a decision
> based on:
> 
> a) Already set size using set_memory_block_size_order()
> b) RAM size
> c) Bare metal vs. virt (bare metal -> use max)
> d) Virt: largest block size aligned to memory end
> 
> 
> 2) set_memory_block_size_order()
> 
> Triggered by set_block_size() on UV systems.
> 
> 
> I don't think set_memory_block_size_order() is the right tool to use. We
> just want to leave that alone I think -- it's a direct translation of a
> kernel cmdline parameter that should win.
> 
> You essentially want to tweak the b)->d) logic to take other alignment into
> consideration.
> 
> Maybe have some simple callback mechanism probe_memory_block_size() that can
> consult other sources for alignment requirements?
>

Thanks for this - I'll cobble something together.

Probably this ends up falling out similar to what Ira suggested. 

drivers/acpi/numa/srat.c
    acpi_numa_init():
        order = parse_cfwm(...)
        memblock_advise_size(order);

drivers/base/memory.c
    static int memblock_size_order = 0; /* let arch choose */

    int memblock_advise_size(order)
        int old_order;
        int new_order;
        if (order <= 0)
            return -EINVAL;

        do {
            old_order = memblock_size_order;
            new_order = MIN(old_order, order);
        } while (!atomic_cmpxchg(&memblock_size_order, old_order, new_order));

        /* memblock_size_order is now <= order, if -1 then the probe won */
        return new_order;

    int memblock_probe_size()
        return atomic_xchg(&memblock_size_order, -1);

drivers/base/memblock.h
    #ifdef HOTPLUG
        export memblock_advise_size()
        export memblock_probe_size()
    #else
        static memblock_advice_size() { return -ENODEV; } /* always fail */
        static memblock_probe_size() { return 0; } /* arch chooses */
    #endif

arch/*/mm/...
    probe_block_size():
        memblock_probe_size();
        /* select minimum across above suggested values */

> If that's not an option, then another way to set further min-alignment
> requirements (whereby we take MIN(old_align, new_align))?
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 


^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2024-10-14 22:40 UTC | newest]

Thread overview: 21+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-10-08  4:43 [PATCH 0/3] memory,acpi: resize memory blocks based on CFMW alignment Gregory Price
2024-10-08  4:43 ` [PATCH 1/3] memory: extern memory_block_size_bytes and set_memory_block_size_order Gregory Price
2024-10-08 14:03   ` David Hildenbrand
2024-10-08 14:51     ` Gregory Price
2024-10-08 15:02       ` David Hildenbrand
2024-10-08 15:21         ` Gregory Price
2024-10-08 19:04           ` Ira Weiny
2024-10-08 19:45             ` Gregory Price
2024-10-14 11:54           ` David Hildenbrand
2024-10-14 14:25             ` Gregory Price
2024-10-14 20:32               ` David Hildenbrand
2024-10-14 22:40                 ` Gregory Price
2024-10-08  4:43 ` [PATCH 2/3] x86/mm: if memblock size is adjusted, update the cached value Gregory Price
2024-10-08  4:43 ` [PATCH 3/3] acpi,srat: reduce memory block size if CFMWS has a smaller alignment Gregory Price
2024-10-08 14:58   ` Ira Weiny
2024-10-08 15:17     ` Gregory Price
2024-10-08 16:46       ` Dan Williams
2024-10-14 11:50         ` David Hildenbrand
2024-10-08 19:02       ` Ira Weiny
2024-10-08 14:38 ` [PATCH 0/3] memory,acpi: resize memory blocks based on CFMW alignment Ira Weiny
2024-10-08 14:49   ` Gregory Price

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).