linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem
@ 2023-06-15 22:00 Vishal Verma
  2023-06-15 22:00 ` [PATCH 1/3] mm/memory_hotplug: Allow an override for the memmap_on_memory param Vishal Verma
                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Vishal Verma @ 2023-06-15 22:00 UTC (permalink / raw)
  To: Rafael J. Wysocki, Len Brown, Andrew Morton, David Hildenbrand,
	Oscar Salvador, Dan Williams, Dave Jiang
  Cc: linux-acpi, linux-kernel, linux-mm, nvdimm, linux-cxl, Huang Ying,
	Dave Hansen, Vishal Verma

The dax/kmem driver can potentially hot-add large amounts of memory
originating from CXL memory expanders, or NVDIMMs, or other 'device
memories'. There is a chance there isn't enough regular system memory
available to fit ythe memmap for this new memory. It's therefore
desirable, if all other conditions are met, for the kmem managed memory
to place its memmap on the newly added memory itself.

Arrange for this by first allowing for a module parameter override for
the mhp_supports_memmap_on_memory() test using a flag, adjusting the
only other caller of this interface in dirvers/acpi/acpi_memoryhotplug.c,
exporting the symbol so it can be called by kmem.c, and finally changing
the kmem driver to add_memory() in chunks of memory_block_size_bytes().

Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
Vishal Verma (3):
      mm/memory_hotplug: Allow an override for the memmap_on_memory param
      mm/memory_hotplug: Export symbol mhp_supports_memmap_on_memory()
      dax/kmem: Always enroll hotplugged memory for memmap_on_memory

 include/linux/memory_hotplug.h |  2 +-
 drivers/acpi/acpi_memhotplug.c |  2 +-
 drivers/dax/kmem.c             | 49 +++++++++++++++++++++++++++++++-----------
 mm/memory_hotplug.c            | 25 ++++++++++++++-------
 4 files changed, 55 insertions(+), 23 deletions(-)
---
base-commit: f1fcbaa18b28dec10281551dfe6ed3a3ed80e3d6
change-id: 20230613-vv-kmem_memmap-5483c8d04279

Best regards,
-- 
Vishal Verma <vishal.l.verma@intel.com>



^ permalink raw reply	[flat|nested] 27+ messages in thread

* [PATCH 1/3] mm/memory_hotplug: Allow an override for the memmap_on_memory param
  2023-06-15 22:00 [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem Vishal Verma
@ 2023-06-15 22:00 ` Vishal Verma
  2023-06-16  6:35   ` Huang, Ying
                     ` (2 more replies)
  2023-06-15 22:00 ` [PATCH 2/3] mm/memory_hotplug: Export symbol mhp_supports_memmap_on_memory() Vishal Verma
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 27+ messages in thread
From: Vishal Verma @ 2023-06-15 22:00 UTC (permalink / raw)
  To: Rafael J. Wysocki, Len Brown, Andrew Morton, David Hildenbrand,
	Oscar Salvador, Dan Williams, Dave Jiang
  Cc: linux-acpi, linux-kernel, linux-mm, nvdimm, linux-cxl, Huang Ying,
	Dave Hansen, Vishal Verma

For memory hotplug to consider MHP_MEMMAP_ON_MEMORY behavior, the
'memmap_on_memory' module parameter was a hard requirement.

In preparation for the dax/kmem driver to use memmap_on_memory
semantics, arrange for the module parameter check to be bypassed via the
appropriate mhp_flag.

Recall that the kmem driver could contribute huge amounts of hotplugged
memory originating from special purposes devices such as CXL memory
expanders. In some cases memmap_on_memory may be the /only/ way this new
memory can be hotplugged. Hence it makes sense for kmem to have a way to
force memmap_on_memory without depending on a module param, if all the
other conditions for it are met.

The only other user of this interface is acpi/acpi_memoryhotplug.c,
which only enables the mhp_flag if an initial
mhp_supports_memmap_on_memory() test passes. Maintain the existing
behavior and semantics for this by performing the initial check from
acpi without the MHP_MEMMAP_ON_MEMORY flag, so its decision falls back
to the module parameter.

Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
 include/linux/memory_hotplug.h |  2 +-
 drivers/acpi/acpi_memhotplug.c |  2 +-
 mm/memory_hotplug.c            | 24 ++++++++++++++++--------
 3 files changed, 18 insertions(+), 10 deletions(-)

diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 9fcbf5706595..c9ddcd3cad70 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -358,7 +358,7 @@ extern struct zone *zone_for_pfn_range(int online_type, int nid,
 extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
 				      struct mhp_params *params);
 void arch_remove_linear_mapping(u64 start, u64 size);
-extern bool mhp_supports_memmap_on_memory(unsigned long size);
+extern bool mhp_supports_memmap_on_memory(unsigned long size, mhp_t mhp_flags);
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 #endif /* __LINUX_MEMORY_HOTPLUG_H */
diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 24f662d8bd39..119d3bb49753 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -211,7 +211,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
 		if (!info->length)
 			continue;
 
-		if (mhp_supports_memmap_on_memory(info->length))
+		if (mhp_supports_memmap_on_memory(info->length, 0))
 			mhp_flags |= MHP_MEMMAP_ON_MEMORY;
 		result = __add_memory(mgid, info->start_addr, info->length,
 				      mhp_flags);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8e0fa209d533..bb3845830922 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1283,15 +1283,21 @@ static int online_memory_block(struct memory_block *mem, void *arg)
 	return device_online(&mem->dev);
 }
 
-bool mhp_supports_memmap_on_memory(unsigned long size)
+bool mhp_supports_memmap_on_memory(unsigned long size, mhp_t mhp_flags)
 {
 	unsigned long nr_vmemmap_pages = size / PAGE_SIZE;
 	unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
 	unsigned long remaining_size = size - vmemmap_size;
 
 	/*
-	 * Besides having arch support and the feature enabled at runtime, we
-	 * need a few more assumptions to hold true:
+	 * The MHP_MEMMAP_ON_MEMORY flag indicates a caller that wants to force
+	 * memmap_on_memory (if other conditions are met), regardless of the
+	 * module parameter. drivers/dax/kmem.c is an example, where large
+	 * amounts of hotplug memory may come from, and the only option to
+	 * successfully online all of it is to place the memmap on this memory.
+	 *
+	 * Besides having arch support and the feature enabled at runtime or
+	 * via the mhp_flag, we need a few more assumptions to hold true:
 	 *
 	 * a) We span a single memory block: memory onlining/offlinin;g happens
 	 *    in memory block granularity. We don't want the vmemmap of online
@@ -1315,10 +1321,12 @@ bool mhp_supports_memmap_on_memory(unsigned long size)
 	 *       altmap as an alternative source of memory, and we do not exactly
 	 *       populate a single PMD.
 	 */
-	return mhp_memmap_on_memory() &&
-	       size == memory_block_size_bytes() &&
-	       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
-	       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
+
+	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) || mhp_memmap_on_memory())
+		return size == memory_block_size_bytes() &&
+		       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
+		       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
+	return false;
 }
 
 /*
@@ -1375,7 +1383,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
 	 * Self hosted memmap array
 	 */
 	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
-		if (!mhp_supports_memmap_on_memory(size)) {
+		if (!mhp_supports_memmap_on_memory(size, mhp_flags)) {
 			ret = -EINVAL;
 			goto error;
 		}

-- 
2.40.1



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 2/3] mm/memory_hotplug: Export symbol mhp_supports_memmap_on_memory()
  2023-06-15 22:00 [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem Vishal Verma
  2023-06-15 22:00 ` [PATCH 1/3] mm/memory_hotplug: Allow an override for the memmap_on_memory param Vishal Verma
@ 2023-06-15 22:00 ` Vishal Verma
  2023-06-16  7:47   ` David Hildenbrand
  2023-06-15 22:00 ` [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory Vishal Verma
  2023-06-16  7:44 ` [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem David Hildenbrand
  3 siblings, 1 reply; 27+ messages in thread
From: Vishal Verma @ 2023-06-15 22:00 UTC (permalink / raw)
  To: Rafael J. Wysocki, Len Brown, Andrew Morton, David Hildenbrand,
	Oscar Salvador, Dan Williams, Dave Jiang
  Cc: linux-acpi, linux-kernel, linux-mm, nvdimm, linux-cxl, Huang Ying,
	Dave Hansen, Vishal Verma

In preparation for the dax/kmem driver, which can be built as a module,
to use this interface, export it with EXPORT_SYMBOL_GPL().

Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
 mm/memory_hotplug.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index bb3845830922..92922080d3fa 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1328,6 +1328,7 @@ bool mhp_supports_memmap_on_memory(unsigned long size, mhp_t mhp_flags)
 		       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
 	return false;
 }
+EXPORT_SYMBOL_GPL(mhp_supports_memmap_on_memory);
 
 /*
  * NOTE: The caller must call lock_device_hotplug() to serialize hotplug

-- 
2.40.1



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory
  2023-06-15 22:00 [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem Vishal Verma
  2023-06-15 22:00 ` [PATCH 1/3] mm/memory_hotplug: Allow an override for the memmap_on_memory param Vishal Verma
  2023-06-15 22:00 ` [PATCH 2/3] mm/memory_hotplug: Export symbol mhp_supports_memmap_on_memory() Vishal Verma
@ 2023-06-15 22:00 ` Vishal Verma
  2023-06-16  6:42   ` Huang, Ying
                     ` (2 more replies)
  2023-06-16  7:44 ` [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem David Hildenbrand
  3 siblings, 3 replies; 27+ messages in thread
From: Vishal Verma @ 2023-06-15 22:00 UTC (permalink / raw)
  To: Rafael J. Wysocki, Len Brown, Andrew Morton, David Hildenbrand,
	Oscar Salvador, Dan Williams, Dave Jiang
  Cc: linux-acpi, linux-kernel, linux-mm, nvdimm, linux-cxl, Huang Ying,
	Dave Hansen, Vishal Verma

With DAX memory regions originating from CXL memory expanders or
NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory
on a system without enough 'regular' main memory to support the memmap
for it. To avoid this, ensure that all kmem managed hotplugged memory is
added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the
new memory region being hot added.

To do this, call add_memory() in chunks of memory_block_size_bytes() as
that is a requirement for memmap_on_memory. Additionally, Use the
mhp_flag to force the memmap_on_memory checks regardless of the
respective module parameter setting.

Cc: "Rafael J. Wysocki" <rafael@kernel.org>
Cc: Len Brown <lenb@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: David Hildenbrand <david@redhat.com>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Dave Jiang <dave.jiang@intel.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Huang Ying <ying.huang@intel.com>
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
---
 drivers/dax/kmem.c | 49 ++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 36 insertions(+), 13 deletions(-)

diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
index 7b36db6f1cbd..0751346193ef 100644
--- a/drivers/dax/kmem.c
+++ b/drivers/dax/kmem.c
@@ -12,6 +12,7 @@
 #include <linux/mm.h>
 #include <linux/mman.h>
 #include <linux/memory-tiers.h>
+#include <linux/memory_hotplug.h>
 #include "dax-private.h"
 #include "bus.h"
 
@@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 	data->mgid = rc;
 
 	for (i = 0; i < dev_dax->nr_range; i++) {
+		u64 cur_start, cur_len, remaining;
 		struct resource *res;
 		struct range range;
 
@@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
 		res->flags = IORESOURCE_SYSTEM_RAM;
 
 		/*
-		 * Ensure that future kexec'd kernels will not treat
-		 * this as RAM automatically.
+		 * Add memory in chunks of memory_block_size_bytes() so that
+		 * it is considered for MHP_MEMMAP_ON_MEMORY
+		 * @range has already been aligned to memory_block_size_bytes(),
+		 * so the following loop will always break it down cleanly.
 		 */
-		rc = add_memory_driver_managed(data->mgid, range.start,
-				range_len(&range), kmem_name, MHP_NID_IS_MGID);
+		cur_start = range.start;
+		cur_len = memory_block_size_bytes();
+		remaining = range_len(&range);
+		while (remaining) {
+			mhp_t mhp_flags = MHP_NID_IS_MGID;
 
-		if (rc) {
-			dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
-					i, range.start, range.end);
-			remove_resource(res);
-			kfree(res);
-			data->res[i] = NULL;
-			if (mapped)
-				continue;
-			goto err_request_mem;
+			if (mhp_supports_memmap_on_memory(cur_len,
+							  MHP_MEMMAP_ON_MEMORY))
+				mhp_flags |= MHP_MEMMAP_ON_MEMORY;
+			/*
+			 * Ensure that future kexec'd kernels will not treat
+			 * this as RAM automatically.
+			 */
+			rc = add_memory_driver_managed(data->mgid, cur_start,
+						       cur_len, kmem_name,
+						       mhp_flags);
+
+			if (rc) {
+				dev_warn(dev,
+					 "mapping%d: %#llx-%#llx memory add failed\n",
+					 i, cur_start, cur_start + cur_len - 1);
+				remove_resource(res);
+				kfree(res);
+				data->res[i] = NULL;
+				if (mapped)
+					continue;
+				goto err_request_mem;
+			}
+
+			cur_start += cur_len;
+			remaining -= cur_len;
 		}
 		mapped++;
 	}

-- 
2.40.1



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] mm/memory_hotplug: Allow an override for the memmap_on_memory param
  2023-06-15 22:00 ` [PATCH 1/3] mm/memory_hotplug: Allow an override for the memmap_on_memory param Vishal Verma
@ 2023-06-16  6:35   ` Huang, Ying
  2023-06-16  7:46   ` David Hildenbrand
  2023-06-23  8:40   ` Aneesh Kumar K.V
  2 siblings, 0 replies; 27+ messages in thread
From: Huang, Ying @ 2023-06-16  6:35 UTC (permalink / raw)
  To: Vishal Verma
  Cc: Rafael J. Wysocki, Len Brown, Andrew Morton, David Hildenbrand,
	Oscar Salvador, Dan Williams, Dave Jiang, linux-acpi,
	linux-kernel, linux-mm, nvdimm, linux-cxl, Dave Hansen

Hi, Vishal,

Thanks for your patch!

Vishal Verma <vishal.l.verma@intel.com> writes:

> For memory hotplug to consider MHP_MEMMAP_ON_MEMORY behavior, the
> 'memmap_on_memory' module parameter was a hard requirement.
>
> In preparation for the dax/kmem driver to use memmap_on_memory
> semantics, arrange for the module parameter check to be bypassed via the
> appropriate mhp_flag.
>
> Recall that the kmem driver could contribute huge amounts of hotplugged
> memory originating from special purposes devices such as CXL memory
> expanders. In some cases memmap_on_memory may be the /only/ way this new
> memory can be hotplugged. Hence it makes sense for kmem to have a way to
> force memmap_on_memory without depending on a module param, if all the
> other conditions for it are met.
>
> The only other user of this interface is acpi/acpi_memoryhotplug.c,
> which only enables the mhp_flag if an initial
> mhp_supports_memmap_on_memory() test passes. Maintain the existing
> behavior and semantics for this by performing the initial check from
> acpi without the MHP_MEMMAP_ON_MEMORY flag, so its decision falls back
> to the module parameter.
>
> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> Cc: Len Brown <lenb@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> ---
>  include/linux/memory_hotplug.h |  2 +-
>  drivers/acpi/acpi_memhotplug.c |  2 +-
>  mm/memory_hotplug.c            | 24 ++++++++++++++++--------
>  3 files changed, 18 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 9fcbf5706595..c9ddcd3cad70 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -358,7 +358,7 @@ extern struct zone *zone_for_pfn_range(int online_type, int nid,
>  extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
>  				      struct mhp_params *params);
>  void arch_remove_linear_mapping(u64 start, u64 size);
> -extern bool mhp_supports_memmap_on_memory(unsigned long size);
> +extern bool mhp_supports_memmap_on_memory(unsigned long size, mhp_t mhp_flags);
>  #endif /* CONFIG_MEMORY_HOTPLUG */
>  
>  #endif /* __LINUX_MEMORY_HOTPLUG_H */
> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> index 24f662d8bd39..119d3bb49753 100644
> --- a/drivers/acpi/acpi_memhotplug.c
> +++ b/drivers/acpi/acpi_memhotplug.c
> @@ -211,7 +211,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
>  		if (!info->length)
>  			continue;
>  
> -		if (mhp_supports_memmap_on_memory(info->length))
> +		if (mhp_supports_memmap_on_memory(info->length, 0))
>  			mhp_flags |= MHP_MEMMAP_ON_MEMORY;
>  		result = __add_memory(mgid, info->start_addr, info->length,
>  				      mhp_flags);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 8e0fa209d533..bb3845830922 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1283,15 +1283,21 @@ static int online_memory_block(struct memory_block *mem, void *arg)
>  	return device_online(&mem->dev);
>  }
>  
> -bool mhp_supports_memmap_on_memory(unsigned long size)
> +bool mhp_supports_memmap_on_memory(unsigned long size, mhp_t mhp_flags)
>  {
>  	unsigned long nr_vmemmap_pages = size / PAGE_SIZE;
>  	unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
>  	unsigned long remaining_size = size - vmemmap_size;
>  
>  	/*
> -	 * Besides having arch support and the feature enabled at runtime, we
> -	 * need a few more assumptions to hold true:
> +	 * The MHP_MEMMAP_ON_MEMORY flag indicates a caller that wants to force
> +	 * memmap_on_memory (if other conditions are met), regardless of the
> +	 * module parameter. drivers/dax/kmem.c is an example, where large
> +	 * amounts of hotplug memory may come from, and the only option to
> +	 * successfully online all of it is to place the memmap on this memory.
> +	 *
> +	 * Besides having arch support and the feature enabled at runtime or
> +	 * via the mhp_flag, we need a few more assumptions to hold true:
>  	 *
>  	 * a) We span a single memory block: memory onlining/offlinin;g happens
>  	 *    in memory block granularity. We don't want the vmemmap of online
> @@ -1315,10 +1321,12 @@ bool mhp_supports_memmap_on_memory(unsigned long size)
>  	 *       altmap as an alternative source of memory, and we do not exactly
>  	 *       populate a single PMD.
>  	 */
> -	return mhp_memmap_on_memory() &&
> -	       size == memory_block_size_bytes() &&
> -	       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
> -	       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
> +
> +	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) || mhp_memmap_on_memory())
> +		return size == memory_block_size_bytes() &&
> +		       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
> +		       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
> +	return false;
>  }
>  
>  /*
> @@ -1375,7 +1383,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>  	 * Self hosted memmap array
>  	 */
>  	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
> -		if (!mhp_supports_memmap_on_memory(size)) {
> +		if (!mhp_supports_memmap_on_memory(size, mhp_flags)) {
>  			ret = -EINVAL;
>  			goto error;
>  		}

It appears that we need to deal with the hot-remove path too.
try_remove_memory() will call mhp_memmap_on_memory() and only work with
MHP_MEMMAP_ON_MEMORY properly if it returns true.

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory
  2023-06-15 22:00 ` [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory Vishal Verma
@ 2023-06-16  6:42   ` Huang, Ying
  2023-06-16  7:54   ` David Hildenbrand
  2023-06-20 13:14   ` Tarun Sahu
  2 siblings, 0 replies; 27+ messages in thread
From: Huang, Ying @ 2023-06-16  6:42 UTC (permalink / raw)
  To: Vishal Verma
  Cc: Rafael J. Wysocki, Len Brown, Andrew Morton, David Hildenbrand,
	Oscar Salvador, Dan Williams, Dave Jiang, linux-acpi,
	linux-kernel, linux-mm, nvdimm, linux-cxl, Dave Hansen

Vishal Verma <vishal.l.verma@intel.com> writes:

> With DAX memory regions originating from CXL memory expanders or
> NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory
> on a system without enough 'regular' main memory to support the memmap
> for it. To avoid this, ensure that all kmem managed hotplugged memory is
> added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the
> new memory region being hot added.
>
> To do this, call add_memory() in chunks of memory_block_size_bytes() as
> that is a requirement for memmap_on_memory. Additionally, Use the
> mhp_flag to force the memmap_on_memory checks regardless of the
> respective module parameter setting.
>
> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> Cc: Len Brown <lenb@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> ---
>  drivers/dax/kmem.c | 49 ++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 36 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 7b36db6f1cbd..0751346193ef 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -12,6 +12,7 @@
>  #include <linux/mm.h>
>  #include <linux/mman.h>
>  #include <linux/memory-tiers.h>
> +#include <linux/memory_hotplug.h>
>  #include "dax-private.h"
>  #include "bus.h"
>  
> @@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>  	data->mgid = rc;
>  
>  	for (i = 0; i < dev_dax->nr_range; i++) {
> +		u64 cur_start, cur_len, remaining;
>  		struct resource *res;
>  		struct range range;
>  
> @@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>  		res->flags = IORESOURCE_SYSTEM_RAM;
>  
>  		/*
> -		 * Ensure that future kexec'd kernels will not treat
> -		 * this as RAM automatically.
> +		 * Add memory in chunks of memory_block_size_bytes() so that
> +		 * it is considered for MHP_MEMMAP_ON_MEMORY
> +		 * @range has already been aligned to memory_block_size_bytes(),
> +		 * so the following loop will always break it down cleanly.
>  		 */
> -		rc = add_memory_driver_managed(data->mgid, range.start,
> -				range_len(&range), kmem_name, MHP_NID_IS_MGID);
> +		cur_start = range.start;
> +		cur_len = memory_block_size_bytes();
> +		remaining = range_len(&range);
> +		while (remaining) {
> +			mhp_t mhp_flags = MHP_NID_IS_MGID;
>  
> -		if (rc) {
> -			dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
> -					i, range.start, range.end);
> -			remove_resource(res);
> -			kfree(res);
> -			data->res[i] = NULL;
> -			if (mapped)
> -				continue;
> -			goto err_request_mem;
> +			if (mhp_supports_memmap_on_memory(cur_len,
> +							  MHP_MEMMAP_ON_MEMORY))
> +				mhp_flags |= MHP_MEMMAP_ON_MEMORY;
> +			/*
> +			 * Ensure that future kexec'd kernels will not treat
> +			 * this as RAM automatically.
> +			 */
> +			rc = add_memory_driver_managed(data->mgid, cur_start,
> +						       cur_len, kmem_name,
> +						       mhp_flags);
> +
> +			if (rc) {
> +				dev_warn(dev,
> +					 "mapping%d: %#llx-%#llx memory add failed\n",
> +					 i, cur_start, cur_start + cur_len - 1);
> +				remove_resource(res);
> +				kfree(res);
> +				data->res[i] = NULL;
> +				if (mapped)
> +					continue;
> +				goto err_request_mem;
> +			}
> +
> +			cur_start += cur_len;
> +			remaining -= cur_len;
>  		}
>  		mapped++;
>  	}

It appears that we need to hot-remove memory in the granularity of
memory_block_size_bytes() too, according to try_remove_memory().  If so,
it seems better to allocate one dax_kmem_data.res[] element for each
memory block instead of dax region?

Best Regards,
Huang, Ying


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem
  2023-06-15 22:00 [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem Vishal Verma
                   ` (2 preceding siblings ...)
  2023-06-15 22:00 ` [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory Vishal Verma
@ 2023-06-16  7:44 ` David Hildenbrand
  2023-06-21 19:32   ` Verma, Vishal L
  2023-07-13 19:12   ` Jeff Moyer
  3 siblings, 2 replies; 27+ messages in thread
From: David Hildenbrand @ 2023-06-16  7:44 UTC (permalink / raw)
  To: Vishal Verma, Rafael J. Wysocki, Len Brown, Andrew Morton,
	Oscar Salvador, Dan Williams, Dave Jiang
  Cc: linux-acpi, linux-kernel, linux-mm, nvdimm, linux-cxl, Huang Ying,
	Dave Hansen

On 16.06.23 00:00, Vishal Verma wrote:
> The dax/kmem driver can potentially hot-add large amounts of memory
> originating from CXL memory expanders, or NVDIMMs, or other 'device
> memories'. There is a chance there isn't enough regular system memory
> available to fit ythe memmap for this new memory. It's therefore
> desirable, if all other conditions are met, for the kmem managed memory
> to place its memmap on the newly added memory itself.
> 
> Arrange for this by first allowing for a module parameter override for
> the mhp_supports_memmap_on_memory() test using a flag, adjusting the
> only other caller of this interface in dirvers/acpi/acpi_memoryhotplug.c,
> exporting the symbol so it can be called by kmem.c, and finally changing
> the kmem driver to add_memory() in chunks of memory_block_size_bytes().

1) Why is the override a requirement here? Just let the admin configure 
it then then add conditional support for kmem.

2) I recall that there are cases where we don't want the memmap to land 
on slow memory (which online_movable would achieve). Just imagine the 
slow PMEM case. So this might need another configuration knob on the 
kmem side.


I recall some discussions on doing that chunk handling internally (so 
kmem can just perform one add_memory() and we'd split that up internally).

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] mm/memory_hotplug: Allow an override for the memmap_on_memory param
  2023-06-15 22:00 ` [PATCH 1/3] mm/memory_hotplug: Allow an override for the memmap_on_memory param Vishal Verma
  2023-06-16  6:35   ` Huang, Ying
@ 2023-06-16  7:46   ` David Hildenbrand
  2023-06-22 13:37     ` Jonathan Cameron
  2023-06-23  8:40   ` Aneesh Kumar K.V
  2 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2023-06-16  7:46 UTC (permalink / raw)
  To: Vishal Verma, Rafael J. Wysocki, Len Brown, Andrew Morton,
	Oscar Salvador, Dan Williams, Dave Jiang
  Cc: linux-acpi, linux-kernel, linux-mm, nvdimm, linux-cxl, Huang Ying,
	Dave Hansen

On 16.06.23 00:00, Vishal Verma wrote:
> For memory hotplug to consider MHP_MEMMAP_ON_MEMORY behavior, the
> 'memmap_on_memory' module parameter was a hard requirement.
> 
> In preparation for the dax/kmem driver to use memmap_on_memory
> semantics, arrange for the module parameter check to be bypassed via the
> appropriate mhp_flag.
> 
> Recall that the kmem driver could contribute huge amounts of hotplugged
> memory originating from special purposes devices such as CXL memory
> expanders. In some cases memmap_on_memory may be the /only/ way this new
> memory can be hotplugged. Hence it makes sense for kmem to have a way to
> force memmap_on_memory without depending on a module param, if all the
> other conditions for it are met.

Just let the admin configure it. After all, an admin is involved in 
configuring the dax/kmem device to begin with. If add_memory() fails you 
could give a useful hint to the admin.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/3] mm/memory_hotplug: Export symbol mhp_supports_memmap_on_memory()
  2023-06-15 22:00 ` [PATCH 2/3] mm/memory_hotplug: Export symbol mhp_supports_memmap_on_memory() Vishal Verma
@ 2023-06-16  7:47   ` David Hildenbrand
  0 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2023-06-16  7:47 UTC (permalink / raw)
  To: Vishal Verma, Rafael J. Wysocki, Len Brown, Andrew Morton,
	Oscar Salvador, Dan Williams, Dave Jiang
  Cc: linux-acpi, linux-kernel, linux-mm, nvdimm, linux-cxl, Huang Ying,
	Dave Hansen

On 16.06.23 00:00, Vishal Verma wrote:
> In preparation for the dax/kmem driver, which can be built as a module,
> to use this interface, export it with EXPORT_SYMBOL_GPL().
> 
> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> Cc: Len Brown <lenb@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> ---
>   mm/memory_hotplug.c | 1 +
>   1 file changed, 1 insertion(+)
> 
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index bb3845830922..92922080d3fa 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1328,6 +1328,7 @@ bool mhp_supports_memmap_on_memory(unsigned long size, mhp_t mhp_flags)
>   		       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
>   	return false;
>   }
> +EXPORT_SYMBOL_GPL(mhp_supports_memmap_on_memory);
>   
>   /*
>    * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory
  2023-06-15 22:00 ` [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory Vishal Verma
  2023-06-16  6:42   ` Huang, Ying
@ 2023-06-16  7:54   ` David Hildenbrand
  2023-07-11 14:30     ` Aneesh Kumar K.V
  2023-06-20 13:14   ` Tarun Sahu
  2 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2023-06-16  7:54 UTC (permalink / raw)
  To: Vishal Verma, Rafael J. Wysocki, Len Brown, Andrew Morton,
	Oscar Salvador, Dan Williams, Dave Jiang
  Cc: linux-acpi, linux-kernel, linux-mm, nvdimm, linux-cxl, Huang Ying,
	Dave Hansen

On 16.06.23 00:00, Vishal Verma wrote:
> With DAX memory regions originating from CXL memory expanders or
> NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory
> on a system without enough 'regular' main memory to support the memmap
> for it. To avoid this, ensure that all kmem managed hotplugged memory is
> added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the
> new memory region being hot added.
> 
> To do this, call add_memory() in chunks of memory_block_size_bytes() as
> that is a requirement for memmap_on_memory. Additionally, Use the
> mhp_flag to force the memmap_on_memory checks regardless of the
> respective module parameter setting.
> 
> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> Cc: Len Brown <lenb@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> ---
>   drivers/dax/kmem.c | 49 ++++++++++++++++++++++++++++++++++++-------------
>   1 file changed, 36 insertions(+), 13 deletions(-)
> 
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 7b36db6f1cbd..0751346193ef 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -12,6 +12,7 @@
>   #include <linux/mm.h>
>   #include <linux/mman.h>
>   #include <linux/memory-tiers.h>
> +#include <linux/memory_hotplug.h>
>   #include "dax-private.h"
>   #include "bus.h"
>   
> @@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>   	data->mgid = rc;
>   
>   	for (i = 0; i < dev_dax->nr_range; i++) {
> +		u64 cur_start, cur_len, remaining;
>   		struct resource *res;
>   		struct range range;
>   
> @@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>   		res->flags = IORESOURCE_SYSTEM_RAM;
>   
>   		/*
> -		 * Ensure that future kexec'd kernels will not treat
> -		 * this as RAM automatically.
> +		 * Add memory in chunks of memory_block_size_bytes() so that
> +		 * it is considered for MHP_MEMMAP_ON_MEMORY
> +		 * @range has already been aligned to memory_block_size_bytes(),
> +		 * so the following loop will always break it down cleanly.
>   		 */
> -		rc = add_memory_driver_managed(data->mgid, range.start,
> -				range_len(&range), kmem_name, MHP_NID_IS_MGID);
> +		cur_start = range.start;
> +		cur_len = memory_block_size_bytes();
> +		remaining = range_len(&range);
> +		while (remaining) {
> +			mhp_t mhp_flags = MHP_NID_IS_MGID;
>   
> -		if (rc) {
> -			dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
> -					i, range.start, range.end);
> -			remove_resource(res);
> -			kfree(res);
> -			data->res[i] = NULL;
> -			if (mapped)
> -				continue;
> -			goto err_request_mem;
> +			if (mhp_supports_memmap_on_memory(cur_len,
> +							  MHP_MEMMAP_ON_MEMORY))
> +				mhp_flags |= MHP_MEMMAP_ON_MEMORY;
> +			/*
> +			 * Ensure that future kexec'd kernels will not treat
> +			 * this as RAM automatically.
> +			 */
> +			rc = add_memory_driver_managed(data->mgid, cur_start,
> +						       cur_len, kmem_name,
> +						       mhp_flags);
> +
> +			if (rc) {
> +				dev_warn(dev,
> +					 "mapping%d: %#llx-%#llx memory add failed\n",
> +					 i, cur_start, cur_start + cur_len - 1);
> +				remove_resource(res);
> +				kfree(res);
> +				data->res[i] = NULL;
> +				if (mapped)
> +					continue;
> +				goto err_request_mem;
> +			}
> +
> +			cur_start += cur_len;
> +			remaining -= cur_len;
>   		}
>   		mapped++;
>   	}
> 

Maybe the better alternative is teach 
add_memory_resource()/try_remove_memory() to do that internally.

In the add_memory_resource() case, it might be a loop around that 
memmap_on_memory + arch_add_memory code path (well, and the error path 
also needs adjustment):

	/*
	 * Self hosted memmap array
	 */
	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
		if (!mhp_supports_memmap_on_memory(size)) {
			ret = -EINVAL;
			goto error;
		}
		mhp_altmap.free = PHYS_PFN(size);
		mhp_altmap.base_pfn = PHYS_PFN(start);
		params.altmap = &mhp_altmap;
	}

	/* call arch's memory hotadd */
	ret = arch_add_memory(nid, start, size, &params);
	if (ret < 0)
		goto error;


Note that we want to handle that on a per memory-block basis, because we 
don't want the vmemmap of memory block #2 to end up on memory block #1. 
It all gets messy with memory onlining/offlining etc otherwise ...

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory
  2023-06-15 22:00 ` [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory Vishal Verma
  2023-06-16  6:42   ` Huang, Ying
  2023-06-16  7:54   ` David Hildenbrand
@ 2023-06-20 13:14   ` Tarun Sahu
  2 siblings, 0 replies; 27+ messages in thread
From: Tarun Sahu @ 2023-06-20 13:14 UTC (permalink / raw)
  To: Vishal Verma, Rafael J. Wysocki, Len Brown, Andrew Morton,
	David Hildenbrand, Oscar Salvador, Dan Williams, Dave Jiang
  Cc: linux-acpi, linux-kernel, linux-mm, nvdimm, linux-cxl, Huang Ying,
	Dave Hansen, Vishal Verma, aneesh.kumar


Hi Vishal,

Vishal Verma <vishal.l.verma@intel.com> writes:

> With DAX memory regions originating from CXL memory expanders or
> NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory
> on a system without enough 'regular' main memory to support the memmap
> for it. To avoid this, ensure that all kmem managed hotplugged memory is
> added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the
> new memory region being hot added.
>

Some architectures doesn't have support for MEMMAP_ON_MEMORY, bypassing
the check mhp_memmap_on_memory() might cause problems on such
architectures (for e.g PPC64).

> To do this, call add_memory() in chunks of memory_block_size_bytes() as
> that is a requirement for memmap_on_memory. Additionally, Use the
> mhp_flag to force the memmap_on_memory checks regardless of the
> respective module parameter setting.
>
> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> Cc: Len Brown <lenb@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> ---
>  drivers/dax/kmem.c | 49 ++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 36 insertions(+), 13 deletions(-)
>
> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> index 7b36db6f1cbd..0751346193ef 100644
> --- a/drivers/dax/kmem.c
> +++ b/drivers/dax/kmem.c
> @@ -12,6 +12,7 @@
>  #include <linux/mm.h>
>  #include <linux/mman.h>
>  #include <linux/memory-tiers.h>
> +#include <linux/memory_hotplug.h>
>  #include "dax-private.h"
>  #include "bus.h"
>  
> @@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>  	data->mgid = rc;
>  
>  	for (i = 0; i < dev_dax->nr_range; i++) {
> +		u64 cur_start, cur_len, remaining;
>  		struct resource *res;
>  		struct range range;
>  
> @@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>  		res->flags = IORESOURCE_SYSTEM_RAM;
>  
>  		/*
> -		 * Ensure that future kexec'd kernels will not treat
> -		 * this as RAM automatically.
> +		 * Add memory in chunks of memory_block_size_bytes() so that
> +		 * it is considered for MHP_MEMMAP_ON_MEMORY
> +		 * @range has already been aligned to memory_block_size_bytes(),
> +		 * so the following loop will always break it down cleanly.
>  		 */
> -		rc = add_memory_driver_managed(data->mgid, range.start,
> -				range_len(&range), kmem_name, MHP_NID_IS_MGID);
> +		cur_start = range.start;
> +		cur_len = memory_block_size_bytes();
> +		remaining = range_len(&range);
> +		while (remaining) {
> +			mhp_t mhp_flags = MHP_NID_IS_MGID;
>  
> -		if (rc) {
> -			dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
> -					i, range.start, range.end);
> -			remove_resource(res);
> -			kfree(res);
> -			data->res[i] = NULL;
> -			if (mapped)
> -				continue;
> -			goto err_request_mem;
> +			if (mhp_supports_memmap_on_memory(cur_len,
> +							  MHP_MEMMAP_ON_MEMORY))
> +				mhp_flags |= MHP_MEMMAP_ON_MEMORY;
> +			/*
> +			 * Ensure that future kexec'd kernels will not treat
> +			 * this as RAM automatically.
> +			 */
> +			rc = add_memory_driver_managed(data->mgid, cur_start,
> +						       cur_len, kmem_name,
> +						       mhp_flags);
> +
> +			if (rc) {
> +				dev_warn(dev,
> +					 "mapping%d: %#llx-%#llx memory add failed\n",
> +					 i, cur_start, cur_start + cur_len - 1);
> +				remove_resource(res);
> +				kfree(res);
> +				data->res[i] = NULL;
> +				if (mapped)
> +					continue;
> +				goto err_request_mem;
> +			}
> +
> +			cur_start += cur_len;
> +			remaining -= cur_len;
>  		}
>  		mapped++;
>  	}
>
> -- 
> 2.40.1


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem
  2023-06-16  7:44 ` [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem David Hildenbrand
@ 2023-06-21 19:32   ` Verma, Vishal L
  2023-06-22 13:55     ` David Hildenbrand
  2023-07-13 19:12   ` Jeff Moyer
  1 sibling, 1 reply; 27+ messages in thread
From: Verma, Vishal L @ 2023-06-21 19:32 UTC (permalink / raw)
  To: Williams, Dan J, osalvador@suse.de, Jiang, Dave,
	rafael@kernel.org, lenb@kernel.org, david@redhat.com,
	akpm@linux-foundation.org
  Cc: Huang, Ying, linux-mm@kvack.org, linux-cxl@vger.kernel.org,
	nvdimm@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-acpi@vger.kernel.org, dave.hansen@linux.intel.com

On Fri, 2023-06-16 at 09:44 +0200, David Hildenbrand wrote:
> On 16.06.23 00:00, Vishal Verma wrote:
> > The dax/kmem driver can potentially hot-add large amounts of memory
> > originating from CXL memory expanders, or NVDIMMs, or other 'device
> > memories'. There is a chance there isn't enough regular system memory
> > available to fit ythe memmap for this new memory. It's therefore
> > desirable, if all other conditions are met, for the kmem managed memory
> > to place its memmap on the newly added memory itself.
> > 
> > Arrange for this by first allowing for a module parameter override for
> > the mhp_supports_memmap_on_memory() test using a flag, adjusting the
> > only other caller of this interface in dirvers/acpi/acpi_memoryhotplug.c,
> > exporting the symbol so it can be called by kmem.c, and finally changing
> > the kmem driver to add_memory() in chunks of memory_block_size_bytes().
> 
> 1) Why is the override a requirement here? Just let the admin configure 
> it then then add conditional support for kmem.

Configure it in the current way using the module parameter to
memory_hotplug? The whole module param check feels a bit awkward,
especially since memory_hotplug is builtin, the only way to supply the
param is on the kernel command line as opposed to a modprobe config.

The goal with extending mhp_supports_memmap_on_memory() to check for
support with or without consideration for the module param is that it
makes it a bit more flexible for callers like kmem.

> 2) I recall that there are cases where we don't want the memmap to land 
> on slow memory (which online_movable would achieve). Just imagine the
> slow PMEM case. So this might need another configuration knob on the 
> kmem side.
> 
> I recall some discussions on doing that chunk handling internally (so
> kmem can just perform one add_memory() and we'd split that up internally).
> 
Another config knob isn't unreasonable - though the thinking in making
this behavior the new default policy was that with CXL based memory
expanders, the performance delta from main memory wouldn't be as big as
the pmem - main memory delta. With pmem devices being phased out, it's
not clear we'd need a knob, and it can always be added if it ends up
becoming necessary.

The other comments about doing the per-memblock loop internally, and
fixing up the removal paths all sound good, and I've started reworking
those - thanks for taking a look!

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] mm/memory_hotplug: Allow an override for the memmap_on_memory param
  2023-06-16  7:46   ` David Hildenbrand
@ 2023-06-22 13:37     ` Jonathan Cameron
  0 siblings, 0 replies; 27+ messages in thread
From: Jonathan Cameron @ 2023-06-22 13:37 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Vishal Verma, Rafael J. Wysocki, Len Brown, Andrew Morton,
	Oscar Salvador, Dan Williams, Dave Jiang, linux-acpi,
	linux-kernel, linux-mm, nvdimm, linux-cxl, Huang Ying,
	Dave Hansen

On Fri, 16 Jun 2023 09:46:59 +0200
David Hildenbrand <david@redhat.com> wrote:

> On 16.06.23 00:00, Vishal Verma wrote:
> > For memory hotplug to consider MHP_MEMMAP_ON_MEMORY behavior, the
> > 'memmap_on_memory' module parameter was a hard requirement.
> > 
> > In preparation for the dax/kmem driver to use memmap_on_memory
> > semantics, arrange for the module parameter check to be bypassed via the
> > appropriate mhp_flag.
> > 
> > Recall that the kmem driver could contribute huge amounts of hotplugged
> > memory originating from special purposes devices such as CXL memory
> > expanders. In some cases memmap_on_memory may be the /only/ way this new
> > memory can be hotplugged. Hence it makes sense for kmem to have a way to
> > force memmap_on_memory without depending on a module param, if all the
> > other conditions for it are met.  
> 
> Just let the admin configure it. After all, an admin is involved in 
> configuring the dax/kmem device to begin with. If add_memory() fails you 
> could give a useful hint to the admin.
> 

Agreed. If it were just the default then fine, but making it the only option
limits admin choices.

Jonathan


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem
  2023-06-21 19:32   ` Verma, Vishal L
@ 2023-06-22 13:55     ` David Hildenbrand
  0 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2023-06-22 13:55 UTC (permalink / raw)
  To: Verma, Vishal L, Williams, Dan J, osalvador@suse.de, Jiang, Dave,
	rafael@kernel.org, lenb@kernel.org, akpm@linux-foundation.org
  Cc: Huang, Ying, linux-mm@kvack.org, linux-cxl@vger.kernel.org,
	nvdimm@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-acpi@vger.kernel.org, dave.hansen@linux.intel.com

On 21.06.23 21:32, Verma, Vishal L wrote:
> On Fri, 2023-06-16 at 09:44 +0200, David Hildenbrand wrote:
>> On 16.06.23 00:00, Vishal Verma wrote:
>>> The dax/kmem driver can potentially hot-add large amounts of memory
>>> originating from CXL memory expanders, or NVDIMMs, or other 'device
>>> memories'. There is a chance there isn't enough regular system memory
>>> available to fit ythe memmap for this new memory. It's therefore
>>> desirable, if all other conditions are met, for the kmem managed memory
>>> to place its memmap on the newly added memory itself.
>>>
>>> Arrange for this by first allowing for a module parameter override for
>>> the mhp_supports_memmap_on_memory() test using a flag, adjusting the
>>> only other caller of this interface in dirvers/acpi/acpi_memoryhotplug.c,
>>> exporting the symbol so it can be called by kmem.c, and finally changing
>>> the kmem driver to add_memory() in chunks of memory_block_size_bytes().
>>
>> 1) Why is the override a requirement here? Just let the admin configure
>> it then then add conditional support for kmem.
> 
> Configure it in the current way using the module parameter to
> memory_hotplug? The whole module param check feels a bit awkward,
> especially since memory_hotplug is builtin, the only way to supply the
> param is on the kernel command line as opposed to a modprobe config.

Yes, and that's nothing special. Runtime toggling is not implemented.

> 
> The goal with extending mhp_supports_memmap_on_memory() to check for
> support with or without consideration for the module param is that it
> makes it a bit more flexible for callers like kmem.

Not convinced yet that the global parameter should be bypassed TBH. And 
if so, this should be a separate patch on top that is completely 
optional for the remainder of the series.


In any case, there has to be some admin control over that, because

1) You usually don't want vmemmap on potentially slow memory
2) Using memmap-on-memory prohibits gigantic pages from forming on that 
memory (when runtime-allocating them).

So "just doing that" without any config knob is problematic.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] mm/memory_hotplug: Allow an override for the memmap_on_memory param
  2023-06-15 22:00 ` [PATCH 1/3] mm/memory_hotplug: Allow an override for the memmap_on_memory param Vishal Verma
  2023-06-16  6:35   ` Huang, Ying
  2023-06-16  7:46   ` David Hildenbrand
@ 2023-06-23  8:40   ` Aneesh Kumar K.V
  2023-06-23 12:35     ` David Hildenbrand
  2 siblings, 1 reply; 27+ messages in thread
From: Aneesh Kumar K.V @ 2023-06-23  8:40 UTC (permalink / raw)
  To: Vishal Verma, Rafael J. Wysocki, Len Brown, Andrew Morton,
	David Hildenbrand, Oscar Salvador, Dan Williams, Dave Jiang
  Cc: linux-acpi, linux-kernel, linux-mm, nvdimm, linux-cxl, Huang Ying,
	Dave Hansen, Vishal Verma

Vishal Verma <vishal.l.verma@intel.com> writes:

> For memory hotplug to consider MHP_MEMMAP_ON_MEMORY behavior, the
> 'memmap_on_memory' module parameter was a hard requirement.
>
> In preparation for the dax/kmem driver to use memmap_on_memory
> semantics, arrange for the module parameter check to be bypassed via the
> appropriate mhp_flag.
>
> Recall that the kmem driver could contribute huge amounts of hotplugged
> memory originating from special purposes devices such as CXL memory
> expanders. In some cases memmap_on_memory may be the /only/ way this new
> memory can be hotplugged. Hence it makes sense for kmem to have a way to
> force memmap_on_memory without depending on a module param, if all the
> other conditions for it are met.
>
> The only other user of this interface is acpi/acpi_memoryhotplug.c,
> which only enables the mhp_flag if an initial
> mhp_supports_memmap_on_memory() test passes. Maintain the existing
> behavior and semantics for this by performing the initial check from
> acpi without the MHP_MEMMAP_ON_MEMORY flag, so its decision falls back
> to the module parameter.
>
> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
> Cc: Len Brown <lenb@kernel.org>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Dan Williams <dan.j.williams@intel.com>
> Cc: Dave Jiang <dave.jiang@intel.com>
> Cc: Dave Hansen <dave.hansen@linux.intel.com>
> Cc: Huang Ying <ying.huang@intel.com>
> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
> ---
>  include/linux/memory_hotplug.h |  2 +-
>  drivers/acpi/acpi_memhotplug.c |  2 +-
>  mm/memory_hotplug.c            | 24 ++++++++++++++++--------
>  3 files changed, 18 insertions(+), 10 deletions(-)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 9fcbf5706595..c9ddcd3cad70 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -358,7 +358,7 @@ extern struct zone *zone_for_pfn_range(int online_type, int nid,
>  extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
>  				      struct mhp_params *params);
>  void arch_remove_linear_mapping(u64 start, u64 size);
> -extern bool mhp_supports_memmap_on_memory(unsigned long size);
> +extern bool mhp_supports_memmap_on_memory(unsigned long size, mhp_t mhp_flags);
>  #endif /* CONFIG_MEMORY_HOTPLUG */
>  
>  #endif /* __LINUX_MEMORY_HOTPLUG_H */
> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> index 24f662d8bd39..119d3bb49753 100644
> --- a/drivers/acpi/acpi_memhotplug.c
> +++ b/drivers/acpi/acpi_memhotplug.c
> @@ -211,7 +211,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
>  		if (!info->length)
>  			continue;
>  
> -		if (mhp_supports_memmap_on_memory(info->length))
> +		if (mhp_supports_memmap_on_memory(info->length, 0))
>  			mhp_flags |= MHP_MEMMAP_ON_MEMORY;
>  		result = __add_memory(mgid, info->start_addr, info->length,
>  				      mhp_flags);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 8e0fa209d533..bb3845830922 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1283,15 +1283,21 @@ static int online_memory_block(struct memory_block *mem, void *arg)
>  	return device_online(&mem->dev);
>  }
>  
> -bool mhp_supports_memmap_on_memory(unsigned long size)
> +bool mhp_supports_memmap_on_memory(unsigned long size, mhp_t mhp_flags)
>  {
>  	unsigned long nr_vmemmap_pages = size / PAGE_SIZE;
>  	unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
>  	unsigned long remaining_size = size - vmemmap_size;
>  
>  	/*
> -	 * Besides having arch support and the feature enabled at runtime, we
> -	 * need a few more assumptions to hold true:
> +	 * The MHP_MEMMAP_ON_MEMORY flag indicates a caller that wants to force
> +	 * memmap_on_memory (if other conditions are met), regardless of the
> +	 * module parameter. drivers/dax/kmem.c is an example, where large
> +	 * amounts of hotplug memory may come from, and the only option to
> +	 * successfully online all of it is to place the memmap on this memory.
> +	 *
> +	 * Besides having arch support and the feature enabled at runtime or
> +	 * via the mhp_flag, we need a few more assumptions to hold true:
>  	 *
>  	 * a) We span a single memory block: memory onlining/offlinin;g happens
>  	 *    in memory block granularity. We don't want the vmemmap of online
> @@ -1315,10 +1321,12 @@ bool mhp_supports_memmap_on_memory(unsigned long size)
>  	 *       altmap as an alternative source of memory, and we do not exactly
>  	 *       populate a single PMD.
>  	 */
> -	return mhp_memmap_on_memory() &&
> -	       size == memory_block_size_bytes() &&
> -	       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
> -	       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
> +
> +	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) || mhp_memmap_on_memory())
> +		return size == memory_block_size_bytes() &&
> +		       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
> +		       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
> +	return false;
>  }
>  
>  /*
> @@ -1375,7 +1383,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>  	 * Self hosted memmap array
>  	 */
>  	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
> -		if (!mhp_supports_memmap_on_memory(size)) {
> +		if (!mhp_supports_memmap_on_memory(size, mhp_flags)) {
>  			ret = -EINVAL;
>  			goto error;
>  		}
>

I was working on enabling memmap_on_memory for ppc64 and came across
this. Can we do this as below?

commit d55eddf71032cad3e057d8fa4c61ad2b8e05aaa8
Author: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Date:   Fri Jun 23 12:11:31 2023 +0530

    mm/hotplug: Allow to force enable memmap on memory
    
    In some cases like DAX kmem, we want to enable memmap on memory
    even if we have memory_hotplug.memmap_on_memory disabled. This
    patch enables such usage. DAX kmem can use MHP_MEMMAP_ON_MEMORY
    directly to request for memmap on memory.
    
    Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 24f662d8bd39..4d0096fc4cc2 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -211,8 +211,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
 		if (!info->length)
 			continue;
 
-		if (mhp_supports_memmap_on_memory(info->length))
-			mhp_flags |= MHP_MEMMAP_ON_MEMORY;
+		mhp_flags |= get_memmap_on_memory_flags();
 		result = __add_memory(mgid, info->start_addr, info->length,
 				      mhp_flags);
 
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 9fcbf5706595..f7f534084393 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -359,6 +359,13 @@ extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
 				      struct mhp_params *params);
 void arch_remove_linear_mapping(u64 start, u64 size);
 extern bool mhp_supports_memmap_on_memory(unsigned long size);
+bool mhp_memmap_on_memory(void);
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
+static inline unsigned long get_memmap_on_memory_flags(void)
+{
+	if (mhp_memmap_on_memory())
+		return MHP_MEMMAP_ON_MEMORY;
+	return 0;
+}
 #endif /* __LINUX_MEMORY_HOTPLUG_H */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index eb18d1ac7e84..85d29909fd89 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -50,7 +50,7 @@ static bool memmap_on_memory __ro_after_init;
 module_param(memmap_on_memory, bool, 0444);
 MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug");
 
-static inline bool mhp_memmap_on_memory(void)
+bool mhp_memmap_on_memory(void)
 {
 	return memmap_on_memory;
 }
@@ -1316,10 +1316,9 @@ bool mhp_supports_memmap_on_memory(unsigned long size)
 	 *       altmap as an alternative source of memory, and we do not exactly
 	 *       populate a single PMD.
 	 */
-	return mhp_memmap_on_memory() &&
-	       size == memory_block_size_bytes() &&
-	       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
-	       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
+	return size == memory_block_size_bytes() &&
+		IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
+		IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
 }
 
 /*



^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 1/3] mm/memory_hotplug: Allow an override for the memmap_on_memory param
  2023-06-23  8:40   ` Aneesh Kumar K.V
@ 2023-06-23 12:35     ` David Hildenbrand
  0 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2023-06-23 12:35 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Vishal Verma, Rafael J. Wysocki, Len Brown,
	Andrew Morton, Oscar Salvador, Dan Williams, Dave Jiang
  Cc: linux-acpi, linux-kernel, linux-mm, nvdimm, linux-cxl, Huang Ying,
	Dave Hansen

On 23.06.23 10:40, Aneesh Kumar K.V wrote:
> Vishal Verma <vishal.l.verma@intel.com> writes:
> 
>> For memory hotplug to consider MHP_MEMMAP_ON_MEMORY behavior, the
>> 'memmap_on_memory' module parameter was a hard requirement.
>>
>> In preparation for the dax/kmem driver to use memmap_on_memory
>> semantics, arrange for the module parameter check to be bypassed via the
>> appropriate mhp_flag.
>>
>> Recall that the kmem driver could contribute huge amounts of hotplugged
>> memory originating from special purposes devices such as CXL memory
>> expanders. In some cases memmap_on_memory may be the /only/ way this new
>> memory can be hotplugged. Hence it makes sense for kmem to have a way to
>> force memmap_on_memory without depending on a module param, if all the
>> other conditions for it are met.
>>
>> The only other user of this interface is acpi/acpi_memoryhotplug.c,
>> which only enables the mhp_flag if an initial
>> mhp_supports_memmap_on_memory() test passes. Maintain the existing
>> behavior and semantics for this by performing the initial check from
>> acpi without the MHP_MEMMAP_ON_MEMORY flag, so its decision falls back
>> to the module parameter.
>>
>> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
>> Cc: Len Brown <lenb@kernel.org>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Dave Jiang <dave.jiang@intel.com>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Huang Ying <ying.huang@intel.com>
>> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
>> ---
>>   include/linux/memory_hotplug.h |  2 +-
>>   drivers/acpi/acpi_memhotplug.c |  2 +-
>>   mm/memory_hotplug.c            | 24 ++++++++++++++++--------
>>   3 files changed, 18 insertions(+), 10 deletions(-)
>>
>> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
>> index 9fcbf5706595..c9ddcd3cad70 100644
>> --- a/include/linux/memory_hotplug.h
>> +++ b/include/linux/memory_hotplug.h
>> @@ -358,7 +358,7 @@ extern struct zone *zone_for_pfn_range(int online_type, int nid,
>>   extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
>>   				      struct mhp_params *params);
>>   void arch_remove_linear_mapping(u64 start, u64 size);
>> -extern bool mhp_supports_memmap_on_memory(unsigned long size);
>> +extern bool mhp_supports_memmap_on_memory(unsigned long size, mhp_t mhp_flags);
>>   #endif /* CONFIG_MEMORY_HOTPLUG */
>>   
>>   #endif /* __LINUX_MEMORY_HOTPLUG_H */
>> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
>> index 24f662d8bd39..119d3bb49753 100644
>> --- a/drivers/acpi/acpi_memhotplug.c
>> +++ b/drivers/acpi/acpi_memhotplug.c
>> @@ -211,7 +211,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
>>   		if (!info->length)
>>   			continue;
>>   
>> -		if (mhp_supports_memmap_on_memory(info->length))
>> +		if (mhp_supports_memmap_on_memory(info->length, 0))
>>   			mhp_flags |= MHP_MEMMAP_ON_MEMORY;
>>   		result = __add_memory(mgid, info->start_addr, info->length,
>>   				      mhp_flags);
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 8e0fa209d533..bb3845830922 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -1283,15 +1283,21 @@ static int online_memory_block(struct memory_block *mem, void *arg)
>>   	return device_online(&mem->dev);
>>   }
>>   
>> -bool mhp_supports_memmap_on_memory(unsigned long size)
>> +bool mhp_supports_memmap_on_memory(unsigned long size, mhp_t mhp_flags)
>>   {
>>   	unsigned long nr_vmemmap_pages = size / PAGE_SIZE;
>>   	unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
>>   	unsigned long remaining_size = size - vmemmap_size;
>>   
>>   	/*
>> -	 * Besides having arch support and the feature enabled at runtime, we
>> -	 * need a few more assumptions to hold true:
>> +	 * The MHP_MEMMAP_ON_MEMORY flag indicates a caller that wants to force
>> +	 * memmap_on_memory (if other conditions are met), regardless of the
>> +	 * module parameter. drivers/dax/kmem.c is an example, where large
>> +	 * amounts of hotplug memory may come from, and the only option to
>> +	 * successfully online all of it is to place the memmap on this memory.
>> +	 *
>> +	 * Besides having arch support and the feature enabled at runtime or
>> +	 * via the mhp_flag, we need a few more assumptions to hold true:
>>   	 *
>>   	 * a) We span a single memory block: memory onlining/offlinin;g happens
>>   	 *    in memory block granularity. We don't want the vmemmap of online
>> @@ -1315,10 +1321,12 @@ bool mhp_supports_memmap_on_memory(unsigned long size)
>>   	 *       altmap as an alternative source of memory, and we do not exactly
>>   	 *       populate a single PMD.
>>   	 */
>> -	return mhp_memmap_on_memory() &&
>> -	       size == memory_block_size_bytes() &&
>> -	       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
>> -	       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
>> +
>> +	if ((mhp_flags & MHP_MEMMAP_ON_MEMORY) || mhp_memmap_on_memory())
>> +		return size == memory_block_size_bytes() &&
>> +		       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
>> +		       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
>> +	return false;
>>   }
>>   
>>   /*
>> @@ -1375,7 +1383,7 @@ int __ref add_memory_resource(int nid, struct resource *res, mhp_t mhp_flags)
>>   	 * Self hosted memmap array
>>   	 */
>>   	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
>> -		if (!mhp_supports_memmap_on_memory(size)) {
>> +		if (!mhp_supports_memmap_on_memory(size, mhp_flags)) {
>>   			ret = -EINVAL;
>>   			goto error;
>>   		}
>>
> 
> I was working on enabling memmap_on_memory for ppc64 and came across
> this. Can we do this as below?
> 
> commit d55eddf71032cad3e057d8fa4c61ad2b8e05aaa8
> Author: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> Date:   Fri Jun 23 12:11:31 2023 +0530
> 
>      mm/hotplug: Allow to force enable memmap on memory
>      
>      In some cases like DAX kmem, we want to enable memmap on memory
>      even if we have memory_hotplug.memmap_on_memory disabled. This
>      patch enables such usage. DAX kmem can use MHP_MEMMAP_ON_MEMORY
>      directly to request for memmap on memory.
>      
>      Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> 
> diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
> index 24f662d8bd39..4d0096fc4cc2 100644
> --- a/drivers/acpi/acpi_memhotplug.c
> +++ b/drivers/acpi/acpi_memhotplug.c
> @@ -211,8 +211,7 @@ static int acpi_memory_enable_device(struct acpi_memory_device *mem_device)
>   		if (!info->length)
>   			continue;
>   
> -		if (mhp_supports_memmap_on_memory(info->length))
> -			mhp_flags |= MHP_MEMMAP_ON_MEMORY;
> +		mhp_flags |= get_memmap_on_memory_flags();
>   		result = __add_memory(mgid, info->start_addr, info->length,
>   				      mhp_flags);
>   
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 9fcbf5706595..f7f534084393 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -359,6 +359,13 @@ extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
>   				      struct mhp_params *params);
>   void arch_remove_linear_mapping(u64 start, u64 size);
>   extern bool mhp_supports_memmap_on_memory(unsigned long size);
> +bool mhp_memmap_on_memory(void);
>   #endif /* CONFIG_MEMORY_HOTPLUG */
>   
> +static inline unsigned long get_memmap_on_memory_flags(void)
> +{
> +	if (mhp_memmap_on_memory())
> +		return MHP_MEMMAP_ON_MEMORY;
> +	return 0;
> +}
>   #endif /* __LINUX_MEMORY_HOTPLUG_H */
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index eb18d1ac7e84..85d29909fd89 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -50,7 +50,7 @@ static bool memmap_on_memory __ro_after_init;
>   module_param(memmap_on_memory, bool, 0444);
>   MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory hotplug");
>   
> -static inline bool mhp_memmap_on_memory(void)
> +bool mhp_memmap_on_memory(void)
>   {
>   	return memmap_on_memory;
>   }
> @@ -1316,10 +1316,9 @@ bool mhp_supports_memmap_on_memory(unsigned long size)
>   	 *       altmap as an alternative source of memory, and we do not exactly
>   	 *       populate a single PMD.
>   	 */
> -	return mhp_memmap_on_memory() &&
> -	       size == memory_block_size_bytes() &&
> -	       IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
> -	       IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
> +	return size == memory_block_size_bytes() &&
> +		IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
> +		IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
>   }
>   
>   /*
> 

Any forced enablement will have to cleanly handle try_remove_memory() as 
well.

memmap_on_memory needs a doc update then, and even the "forced 
enablement" has to still be explicitly enabled by the admin.

... but I'm still not convinced that memmap_on_memory should be 
overridden. Needs good justification.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory
  2023-06-16  7:54   ` David Hildenbrand
@ 2023-07-11 14:30     ` Aneesh Kumar K.V
  2023-07-11 15:21       ` David Hildenbrand
  0 siblings, 1 reply; 27+ messages in thread
From: Aneesh Kumar K.V @ 2023-07-11 14:30 UTC (permalink / raw)
  To: David Hildenbrand, Vishal Verma, Rafael J. Wysocki, Len Brown,
	Andrew Morton, Oscar Salvador, Dan Williams, Dave Jiang
  Cc: linux-acpi, linux-kernel, linux-mm, nvdimm, linux-cxl, Huang Ying,
	Dave Hansen

David Hildenbrand <david@redhat.com> writes:

> On 16.06.23 00:00, Vishal Verma wrote:
>> With DAX memory regions originating from CXL memory expanders or
>> NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory
>> on a system without enough 'regular' main memory to support the memmap
>> for it. To avoid this, ensure that all kmem managed hotplugged memory is
>> added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the
>> new memory region being hot added.
>>
>> To do this, call add_memory() in chunks of memory_block_size_bytes() as
>> that is a requirement for memmap_on_memory. Additionally, Use the
>> mhp_flag to force the memmap_on_memory checks regardless of the
>> respective module parameter setting.
>>
>> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
>> Cc: Len Brown <lenb@kernel.org>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Dave Jiang <dave.jiang@intel.com>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Huang Ying <ying.huang@intel.com>
>> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
>> ---
>>   drivers/dax/kmem.c | 49 ++++++++++++++++++++++++++++++++++++-------------
>>   1 file changed, 36 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>> index 7b36db6f1cbd..0751346193ef 100644
>> --- a/drivers/dax/kmem.c
>> +++ b/drivers/dax/kmem.c
>> @@ -12,6 +12,7 @@
>>   #include <linux/mm.h>
>>   #include <linux/mman.h>
>>   #include <linux/memory-tiers.h>
>> +#include <linux/memory_hotplug.h>
>>   #include "dax-private.h"
>>   #include "bus.h"
>>
>> @@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>   	data->mgid = rc;
>>
>>   	for (i = 0; i < dev_dax->nr_range; i++) {
>> +		u64 cur_start, cur_len, remaining;
>>   		struct resource *res;
>>   		struct range range;
>>
>> @@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>   		res->flags = IORESOURCE_SYSTEM_RAM;
>>
>>   		/*
>> -		 * Ensure that future kexec'd kernels will not treat
>> -		 * this as RAM automatically.
>> +		 * Add memory in chunks of memory_block_size_bytes() so that
>> +		 * it is considered for MHP_MEMMAP_ON_MEMORY
>> +		 * @range has already been aligned to memory_block_size_bytes(),
>> +		 * so the following loop will always break it down cleanly.
>>   		 */
>> -		rc = add_memory_driver_managed(data->mgid, range.start,
>> -				range_len(&range), kmem_name, MHP_NID_IS_MGID);
>> +		cur_start = range.start;
>> +		cur_len = memory_block_size_bytes();
>> +		remaining = range_len(&range);
>> +		while (remaining) {
>> +			mhp_t mhp_flags = MHP_NID_IS_MGID;
>>
>> -		if (rc) {
>> -			dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
>> -					i, range.start, range.end);
>> -			remove_resource(res);
>> -			kfree(res);
>> -			data->res[i] = NULL;
>> -			if (mapped)
>> -				continue;
>> -			goto err_request_mem;
>> +			if (mhp_supports_memmap_on_memory(cur_len,
>> +							  MHP_MEMMAP_ON_MEMORY))
>> +				mhp_flags |= MHP_MEMMAP_ON_MEMORY;
>> +			/*
>> +			 * Ensure that future kexec'd kernels will not treat
>> +			 * this as RAM automatically.
>> +			 */
>> +			rc = add_memory_driver_managed(data->mgid, cur_start,
>> +						       cur_len, kmem_name,
>> +						       mhp_flags);
>> +
>> +			if (rc) {
>> +				dev_warn(dev,
>> +					 "mapping%d: %#llx-%#llx memory add failed\n",
>> +					 i, cur_start, cur_start + cur_len - 1);
>> +				remove_resource(res);
>> +				kfree(res);
>> +				data->res[i] = NULL;
>> +				if (mapped)
>> +					continue;
>> +				goto err_request_mem;
>> +			}
>> +
>> +			cur_start += cur_len;
>> +			remaining -= cur_len;
>>   		}
>>   		mapped++;
>>   	}
>>
>
> Maybe the better alternative is teach
> add_memory_resource()/try_remove_memory() to do that internally.
>
> In the add_memory_resource() case, it might be a loop around that
> memmap_on_memory + arch_add_memory code path (well, and the error path
> also needs adjustment):
>
> 	/*
> 	 * Self hosted memmap array
> 	 */
> 	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
> 		if (!mhp_supports_memmap_on_memory(size)) {
> 			ret = -EINVAL;
> 			goto error;
> 		}
> 		mhp_altmap.free = PHYS_PFN(size);
> 		mhp_altmap.base_pfn = PHYS_PFN(start);
> 		params.altmap = &mhp_altmap;
> 	}
>
> 	/* call arch's memory hotadd */
> 	ret = arch_add_memory(nid, start, size, &params);
> 	if (ret < 0)
> 		goto error;
>
>
> Note that we want to handle that on a per memory-block basis, because we
> don't want the vmemmap of memory block #2 to end up on memory block #1.
> It all gets messy with memory onlining/offlining etc otherwise ...
>

I tried to implement this inside add_memory_driver_managed() and also
within dax/kmem. IMHO doing the error handling inside dax/kmem is
better. Here is how it looks:

1. If any blocks got added before (mapped > 0) we loop through all successful request_mem_regions
2. For each succesful request_mem_regions if any blocks got added, we
keep the resource. If none got added, we will kfree the resource



	for (i = 0; i < dev_dax->nr_range; i++) {
		u64 cur_start, cur_len, remaining;
		struct resource *res;
		struct range range;
		bool block_added;

		rc = dax_kmem_range(dev_dax, i, &range);
		if (rc)
			continue;

		/* Region is permanently reserved if hotremove fails. */
		res = request_mem_region(range.start, range_len(&range), data->res_name);
		if (!res) {
			dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve region\n",
					i, range.start, range.end);
			/*
			 * Once some memory has been onlined we can't
			 * assume that it can be un-onlined safely.
			 */
			if (mapped)
				continue;
			rc = -EBUSY;
			goto err_request_mem;
		}
		data->res[i] = res;

		/*
		 * Set flags appropriate for System RAM.  Leave ..._BUSY clear
		 * so that add_memory() can add a child resource.  Do not
		 * inherit flags from the parent since it may set new flags
		 * unknown to us that will break add_memory() below.
		 */
		res->flags = IORESOURCE_SYSTEM_RAM;

		/*
		 * Add memory in chunks of memory_block_size_bytes() so that
		 * it is considered for MHP_MEMMAP_ON_MEMORY
		 * @range has already been aligned to memory_block_size_bytes(),
		 * so the following loop will always break it down cleanly.
		 */
		cur_start = range.start;
		cur_len = memory_block_size_bytes();
		remaining = range_len(&range);
		block_added = false;
		while (remaining) {
			/*
			 * If alignment rules are not satisified we will
			 * fallback normal memmap allocation.
			 */
			mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY;
			/*
			 * Ensure that future kexec'd kernels will not treat
			 * this as RAM automatically.
			 */
			rc = add_memory_driver_managed(data->mgid, cur_start,
						       cur_len, kmem_name,
						       mhp_flags);

			if (rc)
				dev_warn(dev,
					 "mapping%d: %#llx-%#llx memory add failed\n",
					 i, cur_start, cur_start + cur_len - 1);
			else
				block_added = true;

			cur_start += cur_len;
			remaining -= cur_len;
		}
		if (!block_added) {
			/*
			 * None of the blocks got added, remove the resource.
			 */
			remove_resource(res);
			kfree(res);
			data->res[i] = NULL;
		} else
			mapped++;
	}
	if (mapped) {
		dev_set_drvdata(dev, data);
		return 0;
	}

err_request_mem:
	/*
	 *  If none of the resources got mapped.
	 *  unregister the group.
	 */
	memory_group_unregister(data->mgid);
err_reg_mgid:
	kfree(data->res_name);


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory
  2023-07-11 14:30     ` Aneesh Kumar K.V
@ 2023-07-11 15:21       ` David Hildenbrand
  2023-07-13  6:45         ` Verma, Vishal L
  0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2023-07-11 15:21 UTC (permalink / raw)
  To: Aneesh Kumar K.V, Vishal Verma, Rafael J. Wysocki, Len Brown,
	Andrew Morton, Oscar Salvador, Dan Williams, Dave Jiang
  Cc: linux-acpi, linux-kernel, linux-mm, nvdimm, linux-cxl, Huang Ying,
	Dave Hansen

On 11.07.23 16:30, Aneesh Kumar K.V wrote:
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 16.06.23 00:00, Vishal Verma wrote:
>>> With DAX memory regions originating from CXL memory expanders or
>>> NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory
>>> on a system without enough 'regular' main memory to support the memmap
>>> for it. To avoid this, ensure that all kmem managed hotplugged memory is
>>> added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the
>>> new memory region being hot added.
>>>
>>> To do this, call add_memory() in chunks of memory_block_size_bytes() as
>>> that is a requirement for memmap_on_memory. Additionally, Use the
>>> mhp_flag to force the memmap_on_memory checks regardless of the
>>> respective module parameter setting.
>>>
>>> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
>>> Cc: Len Brown <lenb@kernel.org>
>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>> Cc: David Hildenbrand <david@redhat.com>
>>> Cc: Oscar Salvador <osalvador@suse.de>
>>> Cc: Dan Williams <dan.j.williams@intel.com>
>>> Cc: Dave Jiang <dave.jiang@intel.com>
>>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>>> Cc: Huang Ying <ying.huang@intel.com>
>>> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
>>> ---
>>>    drivers/dax/kmem.c | 49 ++++++++++++++++++++++++++++++++++++-------------
>>>    1 file changed, 36 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>>> index 7b36db6f1cbd..0751346193ef 100644
>>> --- a/drivers/dax/kmem.c
>>> +++ b/drivers/dax/kmem.c
>>> @@ -12,6 +12,7 @@
>>>    #include <linux/mm.h>
>>>    #include <linux/mman.h>
>>>    #include <linux/memory-tiers.h>
>>> +#include <linux/memory_hotplug.h>
>>>    #include "dax-private.h"
>>>    #include "bus.h"
>>>
>>> @@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>    	data->mgid = rc;
>>>
>>>    	for (i = 0; i < dev_dax->nr_range; i++) {
>>> +		u64 cur_start, cur_len, remaining;
>>>    		struct resource *res;
>>>    		struct range range;
>>>
>>> @@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>    		res->flags = IORESOURCE_SYSTEM_RAM;
>>>
>>>    		/*
>>> -		 * Ensure that future kexec'd kernels will not treat
>>> -		 * this as RAM automatically.
>>> +		 * Add memory in chunks of memory_block_size_bytes() so that
>>> +		 * it is considered for MHP_MEMMAP_ON_MEMORY
>>> +		 * @range has already been aligned to memory_block_size_bytes(),
>>> +		 * so the following loop will always break it down cleanly.
>>>    		 */
>>> -		rc = add_memory_driver_managed(data->mgid, range.start,
>>> -				range_len(&range), kmem_name, MHP_NID_IS_MGID);
>>> +		cur_start = range.start;
>>> +		cur_len = memory_block_size_bytes();
>>> +		remaining = range_len(&range);
>>> +		while (remaining) {
>>> +			mhp_t mhp_flags = MHP_NID_IS_MGID;
>>>
>>> -		if (rc) {
>>> -			dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
>>> -					i, range.start, range.end);
>>> -			remove_resource(res);
>>> -			kfree(res);
>>> -			data->res[i] = NULL;
>>> -			if (mapped)
>>> -				continue;
>>> -			goto err_request_mem;
>>> +			if (mhp_supports_memmap_on_memory(cur_len,
>>> +							  MHP_MEMMAP_ON_MEMORY))
>>> +				mhp_flags |= MHP_MEMMAP_ON_MEMORY;
>>> +			/*
>>> +			 * Ensure that future kexec'd kernels will not treat
>>> +			 * this as RAM automatically.
>>> +			 */
>>> +			rc = add_memory_driver_managed(data->mgid, cur_start,
>>> +						       cur_len, kmem_name,
>>> +						       mhp_flags);
>>> +
>>> +			if (rc) {
>>> +				dev_warn(dev,
>>> +					 "mapping%d: %#llx-%#llx memory add failed\n",
>>> +					 i, cur_start, cur_start + cur_len - 1);
>>> +				remove_resource(res);
>>> +				kfree(res);
>>> +				data->res[i] = NULL;
>>> +				if (mapped)
>>> +					continue;
>>> +				goto err_request_mem;
>>> +			}
>>> +
>>> +			cur_start += cur_len;
>>> +			remaining -= cur_len;
>>>    		}
>>>    		mapped++;
>>>    	}
>>>
>>
>> Maybe the better alternative is teach
>> add_memory_resource()/try_remove_memory() to do that internally.
>>
>> In the add_memory_resource() case, it might be a loop around that
>> memmap_on_memory + arch_add_memory code path (well, and the error path
>> also needs adjustment):
>>
>> 	/*
>> 	 * Self hosted memmap array
>> 	 */
>> 	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
>> 		if (!mhp_supports_memmap_on_memory(size)) {
>> 			ret = -EINVAL;
>> 			goto error;
>> 		}
>> 		mhp_altmap.free = PHYS_PFN(size);
>> 		mhp_altmap.base_pfn = PHYS_PFN(start);
>> 		params.altmap = &mhp_altmap;
>> 	}
>>
>> 	/* call arch's memory hotadd */
>> 	ret = arch_add_memory(nid, start, size, &params);
>> 	if (ret < 0)
>> 		goto error;
>>
>>
>> Note that we want to handle that on a per memory-block basis, because we
>> don't want the vmemmap of memory block #2 to end up on memory block #1.
>> It all gets messy with memory onlining/offlining etc otherwise ...
>>
> 
> I tried to implement this inside add_memory_driver_managed() and also
> within dax/kmem. IMHO doing the error handling inside dax/kmem is
> better. Here is how it looks:
> 
> 1. If any blocks got added before (mapped > 0) we loop through all successful request_mem_regions
> 2. For each succesful request_mem_regions if any blocks got added, we
> keep the resource. If none got added, we will kfree the resource
> 

Doing this unconditional splitting outside of 
add_memory_driver_managed() is undesirable for at least two reasons

1) You end up always creating individual entries in the resource tree
    (/proc/iomem) even if MHP_MEMMAP_ON_MEMORY is not effective.
2) As we call arch_add_memory() in memory block granularity (e.g., 128
    MiB on x86), we might not make use of large PUDs (e.g., 1 GiB) in the
    identify mapping -- even if MHP_MEMMAP_ON_MEMORY is not effective.

While you could sense for support and do the split based on that, it 
will be beneficial for other users (especially DIMMs) if we do that 
internally -- where we already know if MHP_MEMMAP_ON_MEMORY can be 
effective or not.

In general, we avoid placing important kernel data-structures on slow 
memory. That's one of the reasons why PMEM decided to mostly always use 
ZONE_MOVABLE such that exactly what this patch does would not happen. So 
I'm wondering if there would be demand for an additional toggle.

Because even with memmap_on_memory enabled in general, you might not 
want to do that for dax/kmem.

IMHO, this patch should be dropped from your ppc64 series, as it's an 
independent change that might be valuable for other architectures as well.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory
  2023-07-11 15:21       ` David Hildenbrand
@ 2023-07-13  6:45         ` Verma, Vishal L
  2023-07-13  7:23           ` David Hildenbrand
  0 siblings, 1 reply; 27+ messages in thread
From: Verma, Vishal L @ 2023-07-13  6:45 UTC (permalink / raw)
  To: david@redhat.com, akpm@linux-foundation.org, rafael@kernel.org,
	osalvador@suse.de, aneesh.kumar@linux.ibm.com, Williams, Dan J,
	lenb@kernel.org, Jiang, Dave
  Cc: Huang, Ying, linux-mm@kvack.org, linux-cxl@vger.kernel.org,
	nvdimm@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-acpi@vger.kernel.org, dave.hansen@linux.intel.com

On Tue, 2023-07-11 at 17:21 +0200, David Hildenbrand wrote:
> On 11.07.23 16:30, Aneesh Kumar K.V wrote:
> > David Hildenbrand <david@redhat.com> writes:
> > > 
> > > Maybe the better alternative is teach
> > > add_memory_resource()/try_remove_memory() to do that internally.
> > > 
> > > In the add_memory_resource() case, it might be a loop around that
> > > memmap_on_memory + arch_add_memory code path (well, and the error path
> > > also needs adjustment):
> > > 
> > >         /*
> > >          * Self hosted memmap array
> > >          */
> > >         if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
> > >                 if (!mhp_supports_memmap_on_memory(size)) {
> > >                         ret = -EINVAL;
> > >                         goto error;
> > >                 }
> > >                 mhp_altmap.free = PHYS_PFN(size);
> > >                 mhp_altmap.base_pfn = PHYS_PFN(start);
> > >                 params.altmap = &mhp_altmap;
> > >         }
> > > 
> > >         /* call arch's memory hotadd */
> > >         ret = arch_add_memory(nid, start, size, &params);
> > >         if (ret < 0)
> > >                 goto error;
> > > 
> > > 
> > > Note that we want to handle that on a per memory-block basis, because we
> > > don't want the vmemmap of memory block #2 to end up on memory block #1.
> > > It all gets messy with memory onlining/offlining etc otherwise ...
> > > 
> > 
> > I tried to implement this inside add_memory_driver_managed() and also
> > within dax/kmem. IMHO doing the error handling inside dax/kmem is
> > better. Here is how it looks:
> > 
> > 1. If any blocks got added before (mapped > 0) we loop through all successful request_mem_regions
> > 2. For each succesful request_mem_regions if any blocks got added, we
> > keep the resource. If none got added, we will kfree the resource
> > 
> 
> Doing this unconditional splitting outside of 
> add_memory_driver_managed() is undesirable for at least two reasons
> 
> 1) You end up always creating individual entries in the resource tree
>     (/proc/iomem) even if MHP_MEMMAP_ON_MEMORY is not effective.
> 2) As we call arch_add_memory() in memory block granularity (e.g., 128
>     MiB on x86), we might not make use of large PUDs (e.g., 1 GiB) in the
>     identify mapping -- even if MHP_MEMMAP_ON_MEMORY is not effective.
> 
> While you could sense for support and do the split based on that, it 
> will be beneficial for other users (especially DIMMs) if we do that 
> internally -- where we already know if MHP_MEMMAP_ON_MEMORY can be 
> effective or not.

I'm taking a shot at implementing the splitting internally in
memory_hotplug.c. The caller (kmem) side does become trivial with this
approach, but there's a slight complication if I don't have the module
param override (patch 1 of this series).

The kmem diff now looks like:

   diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
   index 898ca9505754..8be932f63f90 100644
   --- a/drivers/dax/kmem.c
   +++ b/drivers/dax/kmem.c
   @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
           data->mgid = rc;
    
           for (i = 0; i < dev_dax->nr_range; i++) {
   +               mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY |
   +                                 MHP_SPLIT_MEMBLOCKS;
                   struct resource *res;
                   struct range range;
    
   @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
                    * this as RAM automatically.
                    */
                   rc = add_memory_driver_managed(data->mgid, range.start,
   -                               range_len(&range), kmem_name, MHP_NID_IS_MGID);
   +                               range_len(&range), kmem_name, mhp_flags);
    
                   if (rc) {
                           dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
   

However this begins to fail if the memmap_on_memory modparam is not
set, as add_memory_driver_managed EINVALs from the
mhp_supports_memmap_on_memory() check.

The way to work around this would probably include doing the
mhp_supports_memmap_on_memory() check in kmem, in a loop to check for
each memblock sized chunk, and that feels like a leak of the
implementation details into the caller.

Any suggestions on how to go about this?
> 
> In general, we avoid placing important kernel data-structures on slow
> memory. That's one of the reasons why PMEM decided to mostly always use 
> ZONE_MOVABLE such that exactly what this patch does would not happen. So 
> I'm wondering if there would be demand for an additional toggle.
> 
> Because even with memmap_on_memory enabled in general, you might not 
> want to do that for dax/kmem.
> 
> IMHO, this patch should be dropped from your ppc64 series, as it's an
> independent change that might be valuable for other architectures as well.
> 
Sure thing, I can pick this back up and Aneesh can drop this from his set.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory
  2023-07-13  6:45         ` Verma, Vishal L
@ 2023-07-13  7:23           ` David Hildenbrand
  2023-07-13 15:15             ` Verma, Vishal L
  0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2023-07-13  7:23 UTC (permalink / raw)
  To: Verma, Vishal L, akpm@linux-foundation.org, rafael@kernel.org,
	osalvador@suse.de, aneesh.kumar@linux.ibm.com, Williams, Dan J,
	lenb@kernel.org, Jiang, Dave
  Cc: Huang, Ying, linux-mm@kvack.org, linux-cxl@vger.kernel.org,
	nvdimm@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-acpi@vger.kernel.org, dave.hansen@linux.intel.com

On 13.07.23 08:45, Verma, Vishal L wrote:
> On Tue, 2023-07-11 at 17:21 +0200, David Hildenbrand wrote:
>> On 11.07.23 16:30, Aneesh Kumar K.V wrote:
>>> David Hildenbrand <david@redhat.com> writes:
>>>>
>>>> Maybe the better alternative is teach
>>>> add_memory_resource()/try_remove_memory() to do that internally.
>>>>
>>>> In the add_memory_resource() case, it might be a loop around that
>>>> memmap_on_memory + arch_add_memory code path (well, and the error path
>>>> also needs adjustment):
>>>>
>>>>          /*
>>>>           * Self hosted memmap array
>>>>           */
>>>>          if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
>>>>                  if (!mhp_supports_memmap_on_memory(size)) {
>>>>                          ret = -EINVAL;
>>>>                          goto error;
>>>>                  }
>>>>                  mhp_altmap.free = PHYS_PFN(size);
>>>>                  mhp_altmap.base_pfn = PHYS_PFN(start);
>>>>                  params.altmap = &mhp_altmap;
>>>>          }
>>>>
>>>>          /* call arch's memory hotadd */
>>>>          ret = arch_add_memory(nid, start, size, &params);
>>>>          if (ret < 0)
>>>>                  goto error;
>>>>
>>>>
>>>> Note that we want to handle that on a per memory-block basis, because we
>>>> don't want the vmemmap of memory block #2 to end up on memory block #1.
>>>> It all gets messy with memory onlining/offlining etc otherwise ...
>>>>
>>>
>>> I tried to implement this inside add_memory_driver_managed() and also
>>> within dax/kmem. IMHO doing the error handling inside dax/kmem is
>>> better. Here is how it looks:
>>>
>>> 1. If any blocks got added before (mapped > 0) we loop through all successful request_mem_regions
>>> 2. For each succesful request_mem_regions if any blocks got added, we
>>> keep the resource. If none got added, we will kfree the resource
>>>
>>
>> Doing this unconditional splitting outside of
>> add_memory_driver_managed() is undesirable for at least two reasons
>>
>> 1) You end up always creating individual entries in the resource tree
>>      (/proc/iomem) even if MHP_MEMMAP_ON_MEMORY is not effective.
>> 2) As we call arch_add_memory() in memory block granularity (e.g., 128
>>      MiB on x86), we might not make use of large PUDs (e.g., 1 GiB) in the
>>      identify mapping -- even if MHP_MEMMAP_ON_MEMORY is not effective.
>>
>> While you could sense for support and do the split based on that, it
>> will be beneficial for other users (especially DIMMs) if we do that
>> internally -- where we already know if MHP_MEMMAP_ON_MEMORY can be
>> effective or not.
> 
> I'm taking a shot at implementing the splitting internally in
> memory_hotplug.c. The caller (kmem) side does become trivial with this
> approach, but there's a slight complication if I don't have the module
> param override (patch 1 of this series).
> 
> The kmem diff now looks like:
> 
>     diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>     index 898ca9505754..8be932f63f90 100644
>     --- a/drivers/dax/kmem.c
>     +++ b/drivers/dax/kmem.c
>     @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>             data->mgid = rc;
>      
>             for (i = 0; i < dev_dax->nr_range; i++) {
>     +               mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY |
>     +                                 MHP_SPLIT_MEMBLOCKS;
>                     struct resource *res;
>                     struct range range;
>      
>     @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>                      * this as RAM automatically.
>                      */
>                     rc = add_memory_driver_managed(data->mgid, range.start,
>     -                               range_len(&range), kmem_name, MHP_NID_IS_MGID);
>     +                               range_len(&range), kmem_name, mhp_flags);
>      
>                     if (rc) {
>                             dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
>     
> 

Why do we need the MHP_SPLIT_MEMBLOCKS?

In add_memory_driver_managed(), if memmap_on_memory = 1 AND is effective for a
single memory block, you can simply split up internally, no?

Essentially in add_memory_resource() something like

if (mhp_flags & MHP_MEMMAP_ON_MEMORY &&
     mhp_supports_memmap_on_memory(memory_block_size_bytes())) {
	for (cur_start = start, cur_start < start + size;
	     cur_start += memory_block_size_bytes()) {
		mhp_altmap.free = PHYS_PFN(memory_block_size_bytes());
		mhp_altmap.base_pfn = PHYS_PFN(start);
		params.altmap = &mhp_altmap;

		ret = arch_add_memory(nid, start,
				      memory_block_size_bytes(), &params);
		if (ret < 0) ...

		ret = create_memory_block_devices(start, memory_block_size_bytes(),
						  mhp_altmap.alloc, group);
		if (ret) ...
		
	}
} else {
	/* old boring stuff */
}

Of course, doing it a bit cleaner, factoring out adding of mem+creating devices into
a helper so we can use it on the other path, avoiding repeating memory_block_size_bytes()
...

If any adding of memory failed, we remove what we already added. That works, because as
long as we're holding the relevant locks, memory cannot get onlined in the meantime.

Then we only have to teach remove_memory() to deal with individual blocks when finding
blocks that have an altmap.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory
  2023-07-13  7:23           ` David Hildenbrand
@ 2023-07-13 15:15             ` Verma, Vishal L
  2023-07-13 15:23               ` David Hildenbrand
  0 siblings, 1 reply; 27+ messages in thread
From: Verma, Vishal L @ 2023-07-13 15:15 UTC (permalink / raw)
  To: david@redhat.com, akpm@linux-foundation.org, rafael@kernel.org,
	osalvador@suse.de, aneesh.kumar@linux.ibm.com, Williams, Dan J,
	Jiang, Dave, lenb@kernel.org
  Cc: Huang, Ying, linux-mm@kvack.org, linux-cxl@vger.kernel.org,
	nvdimm@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-acpi@vger.kernel.org, dave.hansen@linux.intel.com

On Thu, 2023-07-13 at 09:23 +0200, David Hildenbrand wrote:
> On 13.07.23 08:45, Verma, Vishal L wrote:
> > 
> > I'm taking a shot at implementing the splitting internally in
> > memory_hotplug.c. The caller (kmem) side does become trivial with this
> > approach, but there's a slight complication if I don't have the module
> > param override (patch 1 of this series).
> > 
> > The kmem diff now looks like:
> > 
> >     diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> >     index 898ca9505754..8be932f63f90 100644
> >     --- a/drivers/dax/kmem.c
> >     +++ b/drivers/dax/kmem.c
> >     @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> >             data->mgid = rc;
> >      
> >             for (i = 0; i < dev_dax->nr_range; i++) {
> >     +               mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY |
> >     +                                 MHP_SPLIT_MEMBLOCKS;
> >                     struct resource *res;
> >                     struct range range;
> >      
> >     @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> >                      * this as RAM automatically.
> >                      */
> >                     rc = add_memory_driver_managed(data->mgid, range.start,
> >     -                               range_len(&range), kmem_name, MHP_NID_IS_MGID);
> >     +                               range_len(&range), kmem_name, mhp_flags);
> >      
> >                     if (rc) {
> >                             dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
> >     
> > 
> 
> Why do we need the MHP_SPLIT_MEMBLOCKS?

I thought we still wanted either an opt-in or opt-out for the kmem
driver to be able to do memmap_on_memory, in case there were
performance implications or the lack of 1GiB PUDs. I haven't
implemented that yet, but I was thinking along the lines of a sysfs
knob exposed by kmem, that controls setting of this new
MHP_SPLIT_MEMBLOCKS flag.

> 
> In add_memory_driver_managed(), if memmap_on_memory = 1 AND is effective for a
> single memory block, you can simply split up internally, no?
> 
> Essentially in add_memory_resource() something like
> 
> if (mhp_flags & MHP_MEMMAP_ON_MEMORY &&
>      mhp_supports_memmap_on_memory(memory_block_size_bytes())) {
>         for (cur_start = start, cur_start < start + size;
>              cur_start += memory_block_size_bytes()) {
>                 mhp_altmap.free = PHYS_PFN(memory_block_size_bytes());
>                 mhp_altmap.base_pfn = PHYS_PFN(start);
>                 params.altmap = &mhp_altmap;
> 
>                 ret = arch_add_memory(nid, start,
>                                       memory_block_size_bytes(), &params);
>                 if (ret < 0) ...
> 
>                 ret = create_memory_block_devices(start, memory_block_size_bytes(),
>                                                   mhp_altmap.alloc, group);
>                 if (ret) ...
>                 
>         }
> } else {
>         /* old boring stuff */
> }
> 
> Of course, doing it a bit cleaner, factoring out adding of mem+creating devices into
> a helper so we can use it on the other path, avoiding repeating memory_block_size_bytes()
> ...

My current approach was looping a level higher, on the call to
add_memory_resource, but this looks reasonable too, and I can switch to
this. In fact it is better as in my case I had to loop twice, once for
the regular add_memory() path and once for the _driver_managed() path.
Yours should avoid that.

> 
> If any adding of memory failed, we remove what we already added. That works, because as
> long as we're holding the relevant locks, memory cannot get onlined in the meantime.
> 
> Then we only have to teach remove_memory() to deal with individual blocks when finding
> blocks that have an altmap.
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory
  2023-07-13 15:15             ` Verma, Vishal L
@ 2023-07-13 15:23               ` David Hildenbrand
  2023-07-13 15:40                 ` Verma, Vishal L
  0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2023-07-13 15:23 UTC (permalink / raw)
  To: Verma, Vishal L, akpm@linux-foundation.org, rafael@kernel.org,
	osalvador@suse.de, aneesh.kumar@linux.ibm.com, Williams, Dan J,
	Jiang, Dave, lenb@kernel.org
  Cc: Huang, Ying, linux-mm@kvack.org, linux-cxl@vger.kernel.org,
	nvdimm@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-acpi@vger.kernel.org, dave.hansen@linux.intel.com

On 13.07.23 17:15, Verma, Vishal L wrote:
> On Thu, 2023-07-13 at 09:23 +0200, David Hildenbrand wrote:
>> On 13.07.23 08:45, Verma, Vishal L wrote:
>>>
>>> I'm taking a shot at implementing the splitting internally in
>>> memory_hotplug.c. The caller (kmem) side does become trivial with this
>>> approach, but there's a slight complication if I don't have the module
>>> param override (patch 1 of this series).
>>>
>>> The kmem diff now looks like:
>>>
>>>      diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>>>      index 898ca9505754..8be932f63f90 100644
>>>      --- a/drivers/dax/kmem.c
>>>      +++ b/drivers/dax/kmem.c
>>>      @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>              data->mgid = rc;
>>>       
>>>              for (i = 0; i < dev_dax->nr_range; i++) {
>>>      +               mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY |
>>>      +                                 MHP_SPLIT_MEMBLOCKS;
>>>                      struct resource *res;
>>>                      struct range range;
>>>       
>>>      @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>                       * this as RAM automatically.
>>>                       */
>>>                      rc = add_memory_driver_managed(data->mgid, range.start,
>>>      -                               range_len(&range), kmem_name, MHP_NID_IS_MGID);
>>>      +                               range_len(&range), kmem_name, mhp_flags);
>>>       
>>>                      if (rc) {
>>>                              dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
>>>      
>>>
>>
>> Why do we need the MHP_SPLIT_MEMBLOCKS?
> 
> I thought we still wanted either an opt-in or opt-out for the kmem
> driver to be able to do memmap_on_memory, in case there were
> performance implications or the lack of 1GiB PUDs. I haven't
> implemented that yet, but I was thinking along the lines of a sysfs
> knob exposed by kmem, that controls setting of this new
> MHP_SPLIT_MEMBLOCKS flag.

Why is MHP_MEMMAP_ON_MEMORY not sufficient for that?

> 
>>
>> In add_memory_driver_managed(), if memmap_on_memory = 1 AND is effective for a
>> single memory block, you can simply split up internally, no?
>>
>> Essentially in add_memory_resource() something like
>>
>> if (mhp_flags & MHP_MEMMAP_ON_MEMORY &&
>>       mhp_supports_memmap_on_memory(memory_block_size_bytes())) {
>>          for (cur_start = start, cur_start < start + size;
>>               cur_start += memory_block_size_bytes()) {
>>                  mhp_altmap.free = PHYS_PFN(memory_block_size_bytes());
>>                  mhp_altmap.base_pfn = PHYS_PFN(start);
>>                  params.altmap = &mhp_altmap;
>>
>>                  ret = arch_add_memory(nid, start,
>>                                        memory_block_size_bytes(), &params);
>>                  if (ret < 0) ...
>>
>>                  ret = create_memory_block_devices(start, memory_block_size_bytes(),
>>                                                    mhp_altmap.alloc, group);
>>                  if (ret) ...
>>                  
>>          }
>> } else {
>>          /* old boring stuff */
>> }
>>
>> Of course, doing it a bit cleaner, factoring out adding of mem+creating devices into
>> a helper so we can use it on the other path, avoiding repeating memory_block_size_bytes()
>> ...
> 
> My current approach was looping a level higher, on the call to
> add_memory_resource, but this looks reasonable too, and I can switch to
> this. In fact it is better as in my case I had to loop twice, once for
> the regular add_memory() path and once for the _driver_managed() path.
> Yours should avoid that.

As we only care about the altmap here, handling it for arch_add_memory() 
+ create_memory_block_devices() should be sufficient.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory
  2023-07-13 15:23               ` David Hildenbrand
@ 2023-07-13 15:40                 ` Verma, Vishal L
  2023-07-13 15:43                   ` David Hildenbrand
  0 siblings, 1 reply; 27+ messages in thread
From: Verma, Vishal L @ 2023-07-13 15:40 UTC (permalink / raw)
  To: david@redhat.com, akpm@linux-foundation.org, rafael@kernel.org,
	osalvador@suse.de, aneesh.kumar@linux.ibm.com, Williams, Dan J,
	Jiang, Dave, lenb@kernel.org
  Cc: Huang, Ying, linux-mm@kvack.org, linux-cxl@vger.kernel.org,
	nvdimm@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-acpi@vger.kernel.org, dave.hansen@linux.intel.com

On Thu, 2023-07-13 at 17:23 +0200, David Hildenbrand wrote:
> On 13.07.23 17:15, Verma, Vishal L wrote:
> > On Thu, 2023-07-13 at 09:23 +0200, David Hildenbrand wrote:
> > > On 13.07.23 08:45, Verma, Vishal L wrote:
> > > > 
> > > > I'm taking a shot at implementing the splitting internally in
> > > > memory_hotplug.c. The caller (kmem) side does become trivial with this
> > > > approach, but there's a slight complication if I don't have the module
> > > > param override (patch 1 of this series).
> > > > 
> > > > The kmem diff now looks like:
> > > > 
> > > >      diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
> > > >      index 898ca9505754..8be932f63f90 100644
> > > >      --- a/drivers/dax/kmem.c
> > > >      +++ b/drivers/dax/kmem.c
> > > >      @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> > > >              data->mgid = rc;
> > > >       
> > > >              for (i = 0; i < dev_dax->nr_range; i++) {
> > > >      +               mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY |
> > > >      +                                 MHP_SPLIT_MEMBLOCKS;
> > > >                      struct resource *res;
> > > >                      struct range range;
> > > >       
> > > >      @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
> > > >                       * this as RAM automatically.
> > > >                       */
> > > >                      rc = add_memory_driver_managed(data->mgid, range.start,
> > > >      -                               range_len(&range), kmem_name, MHP_NID_IS_MGID);
> > > >      +                               range_len(&range), kmem_name, mhp_flags);
> > > >       
> > > >                      if (rc) {
> > > >                              dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
> > > >      
> > > > 
> > > 
> > > Why do we need the MHP_SPLIT_MEMBLOCKS?
> > 
> > I thought we still wanted either an opt-in or opt-out for the kmem
> > driver to be able to do memmap_on_memory, in case there were
> > performance implications or the lack of 1GiB PUDs. I haven't
> > implemented that yet, but I was thinking along the lines of a sysfs
> > knob exposed by kmem, that controls setting of this new
> > MHP_SPLIT_MEMBLOCKS flag.
> 
> Why is MHP_MEMMAP_ON_MEMORY not sufficient for that?
> 
> 
Ah I see what you mean now - knob just controls MHP_MEMMAP_ON_MEMORY,
and memory_hotplug is free to split to memblocks if it needs to to
satisfy that.

That sounds reasonable. Let me give this a try and see if I run into
anything else. Thanks David!

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory
  2023-07-13 15:40                 ` Verma, Vishal L
@ 2023-07-13 15:43                   ` David Hildenbrand
  0 siblings, 0 replies; 27+ messages in thread
From: David Hildenbrand @ 2023-07-13 15:43 UTC (permalink / raw)
  To: Verma, Vishal L, akpm@linux-foundation.org, rafael@kernel.org,
	osalvador@suse.de, aneesh.kumar@linux.ibm.com, Williams, Dan J,
	Jiang, Dave, lenb@kernel.org
  Cc: Huang, Ying, linux-mm@kvack.org, linux-cxl@vger.kernel.org,
	nvdimm@lists.linux.dev, linux-kernel@vger.kernel.org,
	linux-acpi@vger.kernel.org, dave.hansen@linux.intel.com

On 13.07.23 17:40, Verma, Vishal L wrote:
> On Thu, 2023-07-13 at 17:23 +0200, David Hildenbrand wrote:
>> On 13.07.23 17:15, Verma, Vishal L wrote:
>>> On Thu, 2023-07-13 at 09:23 +0200, David Hildenbrand wrote:
>>>> On 13.07.23 08:45, Verma, Vishal L wrote:
>>>>>
>>>>> I'm taking a shot at implementing the splitting internally in
>>>>> memory_hotplug.c. The caller (kmem) side does become trivial with this
>>>>> approach, but there's a slight complication if I don't have the module
>>>>> param override (patch 1 of this series).
>>>>>
>>>>> The kmem diff now looks like:
>>>>>
>>>>>       diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>>>>>       index 898ca9505754..8be932f63f90 100644
>>>>>       --- a/drivers/dax/kmem.c
>>>>>       +++ b/drivers/dax/kmem.c
>>>>>       @@ -105,6 +105,8 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>>>               data->mgid = rc;
>>>>>        
>>>>>               for (i = 0; i < dev_dax->nr_range; i++) {
>>>>>       +               mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY |
>>>>>       +                                 MHP_SPLIT_MEMBLOCKS;
>>>>>                       struct resource *res;
>>>>>                       struct range range;
>>>>>        
>>>>>       @@ -141,7 +143,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>>>>                        * this as RAM automatically.
>>>>>                        */
>>>>>                       rc = add_memory_driver_managed(data->mgid, range.start,
>>>>>       -                               range_len(&range), kmem_name, MHP_NID_IS_MGID);
>>>>>       +                               range_len(&range), kmem_name, mhp_flags);
>>>>>        
>>>>>                       if (rc) {
>>>>>                               dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
>>>>>       
>>>>>
>>>>
>>>> Why do we need the MHP_SPLIT_MEMBLOCKS?
>>>
>>> I thought we still wanted either an opt-in or opt-out for the kmem
>>> driver to be able to do memmap_on_memory, in case there were
>>> performance implications or the lack of 1GiB PUDs. I haven't
>>> implemented that yet, but I was thinking along the lines of a sysfs
>>> knob exposed by kmem, that controls setting of this new
>>> MHP_SPLIT_MEMBLOCKS flag.
>>
>> Why is MHP_MEMMAP_ON_MEMORY not sufficient for that?
>>
>>
> Ah I see what you mean now - knob just controls MHP_MEMMAP_ON_MEMORY,
> and memory_hotplug is free to split to memblocks if it needs to to
> satisfy that.

And if you don't want memmap holes in a larger area you're adding (for 
example to runtime-allocate 1 GiB pages), simply check the size your 
adding, and if it's, say, less than 1 G, don't set the flag.

But that's probably a corner case use case not worth considering for now.

> 
> That sounds reasonable. Let me give this a try and see if I run into
> anything else. Thanks David!

Sure!

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem
  2023-06-16  7:44 ` [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem David Hildenbrand
  2023-06-21 19:32   ` Verma, Vishal L
@ 2023-07-13 19:12   ` Jeff Moyer
  2023-07-14  8:35     ` David Hildenbrand
  1 sibling, 1 reply; 27+ messages in thread
From: Jeff Moyer @ 2023-07-13 19:12 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Vishal Verma, Rafael J. Wysocki, Len Brown, Andrew Morton,
	Oscar Salvador, Dan Williams, Dave Jiang, linux-acpi,
	linux-kernel, linux-mm, nvdimm, linux-cxl, Huang Ying,
	Dave Hansen

David Hildenbrand <david@redhat.com> writes:

> On 16.06.23 00:00, Vishal Verma wrote:
>> The dax/kmem driver can potentially hot-add large amounts of memory
>> originating from CXL memory expanders, or NVDIMMs, or other 'device
>> memories'. There is a chance there isn't enough regular system memory
>> available to fit ythe memmap for this new memory. It's therefore
>> desirable, if all other conditions are met, for the kmem managed memory
>> to place its memmap on the newly added memory itself.
>>
>> Arrange for this by first allowing for a module parameter override for
>> the mhp_supports_memmap_on_memory() test using a flag, adjusting the
>> only other caller of this interface in dirvers/acpi/acpi_memoryhotplug.c,
>> exporting the symbol so it can be called by kmem.c, and finally changing
>> the kmem driver to add_memory() in chunks of memory_block_size_bytes().
>
> 1) Why is the override a requirement here? Just let the admin
> configure it then then add conditional support for kmem.
>
> 2) I recall that there are cases where we don't want the memmap to
> land on slow memory (which online_movable would achieve). Just imagine
> the slow PMEM case. So this might need another configuration knob on
> the kmem side.

From my memory, the case where you don't want the memmap to land on
*persistent memory* is when the device is small (such as NVDIMM-N), and
you want to reserve as much space as possible for the application data.
This has nothing to do with the speed of access.

-Jeff



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem
  2023-07-13 19:12   ` Jeff Moyer
@ 2023-07-14  8:35     ` David Hildenbrand
  2023-07-14 13:54       ` Jeff Moyer
  0 siblings, 1 reply; 27+ messages in thread
From: David Hildenbrand @ 2023-07-14  8:35 UTC (permalink / raw)
  To: Jeff Moyer, Dan Williams
  Cc: Vishal Verma, Rafael J. Wysocki, Len Brown, Andrew Morton,
	Oscar Salvador, Dave Jiang, linux-acpi, linux-kernel, linux-mm,
	nvdimm, linux-cxl, Huang Ying, Dave Hansen

On 13.07.23 21:12, Jeff Moyer wrote:
> David Hildenbrand <david@redhat.com> writes:
> 
>> On 16.06.23 00:00, Vishal Verma wrote:
>>> The dax/kmem driver can potentially hot-add large amounts of memory
>>> originating from CXL memory expanders, or NVDIMMs, or other 'device
>>> memories'. There is a chance there isn't enough regular system memory
>>> available to fit ythe memmap for this new memory. It's therefore
>>> desirable, if all other conditions are met, for the kmem managed memory
>>> to place its memmap on the newly added memory itself.
>>>
>>> Arrange for this by first allowing for a module parameter override for
>>> the mhp_supports_memmap_on_memory() test using a flag, adjusting the
>>> only other caller of this interface in dirvers/acpi/acpi_memoryhotplug.c,
>>> exporting the symbol so it can be called by kmem.c, and finally changing
>>> the kmem driver to add_memory() in chunks of memory_block_size_bytes().
>>
>> 1) Why is the override a requirement here? Just let the admin
>> configure it then then add conditional support for kmem.
>>
>> 2) I recall that there are cases where we don't want the memmap to
>> land on slow memory (which online_movable would achieve). Just imagine
>> the slow PMEM case. So this might need another configuration knob on
>> the kmem side.
> 
>  From my memory, the case where you don't want the memmap to land on
> *persistent memory* is when the device is small (such as NVDIMM-N), and
> you want to reserve as much space as possible for the application data.
> This has nothing to do with the speed of access.

Now that you mention it, I also do remember the origin of the altmap --
to achieve exactly that: place the memmap on the device.

commit 4b94ffdc4163bae1ec73b6e977ffb7a7da3d06d3
Author: Dan Williams <dan.j.williams@intel.com>
Date:   Fri Jan 15 16:56:22 2016 -0800

     x86, mm: introduce vmem_altmap to augment vmemmap_populate()
     
     In support of providing struct page for large persistent memory
     capacities, use struct vmem_altmap to change the default policy for
     allocating memory for the memmap array.  The default vmemmap_populate()
     allocates page table storage area from the page allocator.  Given
     persistent memory capacities relative to DRAM it may not be feasible to
     store the memmap in 'System Memory'.  Instead vmem_altmap represents
     pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
     requests.

In PFN_MODE_PMEM (and only then), we use the altmap (don't see a way to
configure it).


BUT that case is completely different from the "System RAM" mode. The memmap
of an NVDIMM in pmem mode is barely used by core-mm (i.e., not the buddy).

In comparison, if the buddy and everybody else works on the memmap in
"System RAM", it's much more significant if that resides on slow memory.


Looking at

commit 9b6e63cbf85b89b2dbffa4955dbf2df8250e5375
Author: Michal Hocko <mhocko@suse.com>
Date:   Tue Oct 3 16:16:19 2017 -0700

     mm, page_alloc: add scheduling point to memmap_init_zone
     
     memmap_init_zone gets a pfn range to initialize and it can be really
     large resulting in a soft lockup on non-preemptible kernels
     
       NMI watchdog: BUG: soft lockup - CPU#31 stuck for 23s! [kworker/u642:5:1720]
       [...]
       task: ffff88ecd7e902c0 ti: ffff88eca4e50000 task.ti: ffff88eca4e50000
       RIP: move_pfn_range_to_zone+0x185/0x1d0
       [...]
       Call Trace:
         devm_memremap_pages+0x2c7/0x430
         pmem_attach_disk+0x2fd/0x3f0 [nd_pmem]
         nvdimm_bus_probe+0x64/0x110 [libnvdimm]


It's hard to tell if that was only required due to the memmap for these devices
being that large, or also partially because the access to the memmap is slower
that it makes a real difference.


I recall that we're also often using ZONE_MOVABLE on such slow memory
to not end up placing other kernel data structures on there: especially,
user space page tables as I've been told.


@Dan, any insight on the performance aspects when placing the memmap on
(slow) memory and having that memory be consumed by the buddy where we frequently
operate on the memmap?

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem
  2023-07-14  8:35     ` David Hildenbrand
@ 2023-07-14 13:54       ` Jeff Moyer
  0 siblings, 0 replies; 27+ messages in thread
From: Jeff Moyer @ 2023-07-14 13:54 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Dan Williams, Vishal Verma, Rafael J. Wysocki, Len Brown,
	Andrew Morton, Oscar Salvador, Dave Jiang, linux-acpi,
	linux-kernel, linux-mm, nvdimm, linux-cxl, Huang Ying,
	Dave Hansen

David Hildenbrand <david@redhat.com> writes:

> On 13.07.23 21:12, Jeff Moyer wrote:
>> David Hildenbrand <david@redhat.com> writes:
>>
>>> On 16.06.23 00:00, Vishal Verma wrote:
>>>> The dax/kmem driver can potentially hot-add large amounts of memory
>>>> originating from CXL memory expanders, or NVDIMMs, or other 'device
>>>> memories'. There is a chance there isn't enough regular system memory
>>>> available to fit ythe memmap for this new memory. It's therefore
>>>> desirable, if all other conditions are met, for the kmem managed memory
>>>> to place its memmap on the newly added memory itself.
>>>>
>>>> Arrange for this by first allowing for a module parameter override for
>>>> the mhp_supports_memmap_on_memory() test using a flag, adjusting the
>>>> only other caller of this interface in dirvers/acpi/acpi_memoryhotplug.c,
>>>> exporting the symbol so it can be called by kmem.c, and finally changing
>>>> the kmem driver to add_memory() in chunks of memory_block_size_bytes().
>>>
>>> 1) Why is the override a requirement here? Just let the admin
>>> configure it then then add conditional support for kmem.
>>>
>>> 2) I recall that there are cases where we don't want the memmap to
>>> land on slow memory (which online_movable would achieve). Just imagine
>>> the slow PMEM case. So this might need another configuration knob on
>>> the kmem side.
>>
>>  From my memory, the case where you don't want the memmap to land on
>> *persistent memory* is when the device is small (such as NVDIMM-N), and
>> you want to reserve as much space as possible for the application data.
>> This has nothing to do with the speed of access.
>
> Now that you mention it, I also do remember the origin of the altmap --
> to achieve exactly that: place the memmap on the device.
>
> commit 4b94ffdc4163bae1ec73b6e977ffb7a7da3d06d3
> Author: Dan Williams <dan.j.williams@intel.com>
> Date:   Fri Jan 15 16:56:22 2016 -0800
>
>     x86, mm: introduce vmem_altmap to augment vmemmap_populate()
>       In support of providing struct page for large persistent memory
>     capacities, use struct vmem_altmap to change the default policy for
>     allocating memory for the memmap array.  The default vmemmap_populate()
>     allocates page table storage area from the page allocator.  Given
>     persistent memory capacities relative to DRAM it may not be feasible to
>     store the memmap in 'System Memory'.  Instead vmem_altmap represents
>     pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
>     requests.
>
> In PFN_MODE_PMEM (and only then), we use the altmap (don't see a way to
> configure it).

Configuration is done at pmem namespace creation time.  The metadata for
the namespace indicates where the memmap resides.  See the
ndctl-create-namespace man page:

       -M, --map=
           A pmem namespace in "fsdax" or "devdax" mode requires allocation of
           per-page metadata. The allocation can be drawn from either:

           ·   "mem": typical system memory

           ·   "dev": persistent memory reserved from the namespace

                   Given relative capacities of "Persistent Memory" to "System
                   RAM" the allocation defaults to reserving space out of the
                   namespace directly ("--map=dev"). The overhead is 64-bytes per
                   4K (16GB per 1TB) on x86.

> BUT that case is completely different from the "System RAM" mode. The memmap
> of an NVDIMM in pmem mode is barely used by core-mm (i.e., not the buddy).

Right.  (btw, I don't think system ram mode existed back then.)

> In comparison, if the buddy and everybody else works on the memmap in
> "System RAM", it's much more significant if that resides on slow memory.

Agreed.

> Looking at
>
> commit 9b6e63cbf85b89b2dbffa4955dbf2df8250e5375
> Author: Michal Hocko <mhocko@suse.com>
> Date:   Tue Oct 3 16:16:19 2017 -0700
>
>     mm, page_alloc: add scheduling point to memmap_init_zone
>       memmap_init_zone gets a pfn range to initialize and it can be
> really
>     large resulting in a soft lockup on non-preemptible kernels
>         NMI watchdog: BUG: soft lockup - CPU#31 stuck for 23s!
> [kworker/u642:5:1720]
>       [...]
>       task: ffff88ecd7e902c0 ti: ffff88eca4e50000 task.ti: ffff88eca4e50000
>       RIP: move_pfn_range_to_zone+0x185/0x1d0
>       [...]
>       Call Trace:
>         devm_memremap_pages+0x2c7/0x430
>         pmem_attach_disk+0x2fd/0x3f0 [nd_pmem]
>         nvdimm_bus_probe+0x64/0x110 [libnvdimm]
>
>
> It's hard to tell if that was only required due to the memmap for these devices
> being that large, or also partially because the access to the memmap is slower
> that it makes a real difference.

I believe the main driver was the size.  At the time, Intel was
advertising 3TiB/socket for pmem.  I can't remember the exact DRAM
configuration sizes from the time.

> I recall that we're also often using ZONE_MOVABLE on such slow memory
> to not end up placing other kernel data structures on there: especially,
> user space page tables as I've been told.

Part of the issue was preserving the media.  The page structure gets
lots of updates, and that could cause premature wear.

> @Dan, any insight on the performance aspects when placing the memmap on
> (slow) memory and having that memory be consumed by the buddy where we frequently
> operate on the memmap?

I'm glad you're asking these questions.  We definitely want to make sure
we don't conflate requirements based on some particular
technology/implementation.  Also, I wouldn't make any assumptions about
the performance of CXL devices.  As I understand it, there could be a
broad spectrum of performance profiles.

And now Dan can correct anything I got wrong.  ;-)

Cheers,
Jeff



^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2023-07-14 13:48 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-06-15 22:00 [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem Vishal Verma
2023-06-15 22:00 ` [PATCH 1/3] mm/memory_hotplug: Allow an override for the memmap_on_memory param Vishal Verma
2023-06-16  6:35   ` Huang, Ying
2023-06-16  7:46   ` David Hildenbrand
2023-06-22 13:37     ` Jonathan Cameron
2023-06-23  8:40   ` Aneesh Kumar K.V
2023-06-23 12:35     ` David Hildenbrand
2023-06-15 22:00 ` [PATCH 2/3] mm/memory_hotplug: Export symbol mhp_supports_memmap_on_memory() Vishal Verma
2023-06-16  7:47   ` David Hildenbrand
2023-06-15 22:00 ` [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory Vishal Verma
2023-06-16  6:42   ` Huang, Ying
2023-06-16  7:54   ` David Hildenbrand
2023-07-11 14:30     ` Aneesh Kumar K.V
2023-07-11 15:21       ` David Hildenbrand
2023-07-13  6:45         ` Verma, Vishal L
2023-07-13  7:23           ` David Hildenbrand
2023-07-13 15:15             ` Verma, Vishal L
2023-07-13 15:23               ` David Hildenbrand
2023-07-13 15:40                 ` Verma, Vishal L
2023-07-13 15:43                   ` David Hildenbrand
2023-06-20 13:14   ` Tarun Sahu
2023-06-16  7:44 ` [PATCH 0/3] mm: use memmap_on_memory semantics for dax/kmem David Hildenbrand
2023-06-21 19:32   ` Verma, Vishal L
2023-06-22 13:55     ` David Hildenbrand
2023-07-13 19:12   ` Jeff Moyer
2023-07-14  8:35     ` David Hildenbrand
2023-07-14 13:54       ` Jeff Moyer

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).