Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
To: David Hildenbrand <david@redhat.com>,
	Vishal Verma <vishal.l.verma@intel.com>,
	"Rafael J. Wysocki" <rafael@kernel.org>,
	Len Brown <lenb@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Oscar Salvador <osalvador@suse.de>,
	Dan Williams <dan.j.williams@intel.com>,
	Dave Jiang <dave.jiang@intel.com>
Cc: linux-acpi@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, nvdimm@lists.linux.dev,
	linux-cxl@vger.kernel.org, Huang Ying <ying.huang@intel.com>,
	Dave Hansen <dave.hansen@linux.intel.com>
Subject: Re: [PATCH 3/3] dax/kmem: Always enroll hotplugged memory for memmap_on_memory
Date: Tue, 11 Jul 2023 20:00:32 +0530
Message-ID: <87edleplkn.fsf@linux.ibm.com>
In-Reply-To: <aadbedeb-424d-a146-392d-d56680263691@redhat.com>

David Hildenbrand <david@redhat.com> writes:

> On 16.06.23 00:00, Vishal Verma wrote:
>> With DAX memory regions originating from CXL memory expanders or
>> NVDIMMs, the kmem driver may be hot-adding huge amounts of system memory
>> on a system without enough 'regular' main memory to support the memmap
>> for it. To avoid this, ensure that all kmem managed hotplugged memory is
>> added with the MHP_MEMMAP_ON_MEMORY flag to place the memmap on the
>> new memory region being hot added.
>>
>> To do this, call add_memory() in chunks of memory_block_size_bytes() as
>> that is a requirement for memmap_on_memory. Additionally, use the
>> mhp_flag to force the memmap_on_memory checks regardless of the
>> respective module parameter setting.
>>
>> Cc: "Rafael J. Wysocki" <rafael@kernel.org>
>> Cc: Len Brown <lenb@kernel.org>
>> Cc: Andrew Morton <akpm@linux-foundation.org>
>> Cc: David Hildenbrand <david@redhat.com>
>> Cc: Oscar Salvador <osalvador@suse.de>
>> Cc: Dan Williams <dan.j.williams@intel.com>
>> Cc: Dave Jiang <dave.jiang@intel.com>
>> Cc: Dave Hansen <dave.hansen@linux.intel.com>
>> Cc: Huang Ying <ying.huang@intel.com>
>> Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
>> ---
>>   drivers/dax/kmem.c | 49 ++++++++++++++++++++++++++++++++++++-------------
>>   1 file changed, 36 insertions(+), 13 deletions(-)
>>
>> diff --git a/drivers/dax/kmem.c b/drivers/dax/kmem.c
>> index 7b36db6f1cbd..0751346193ef 100644
>> --- a/drivers/dax/kmem.c
>> +++ b/drivers/dax/kmem.c
>> @@ -12,6 +12,7 @@
>>   #include <linux/mm.h>
>>   #include <linux/mman.h>
>>   #include <linux/memory-tiers.h>
>> +#include <linux/memory_hotplug.h>
>>   #include "dax-private.h"
>>   #include "bus.h"
>>
>> @@ -105,6 +106,7 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>   	data->mgid = rc;
>>
>>   	for (i = 0; i < dev_dax->nr_range; i++) {
>> +		u64 cur_start, cur_len, remaining;
>>   		struct resource *res;
>>   		struct range range;
>>
>> @@ -137,21 +139,42 @@ static int dev_dax_kmem_probe(struct dev_dax *dev_dax)
>>   		res->flags = IORESOURCE_SYSTEM_RAM;
>>
>>   		/*
>> -		 * Ensure that future kexec'd kernels will not treat
>> -		 * this as RAM automatically.
>> +		 * Add memory in chunks of memory_block_size_bytes() so that
>> +		 * it is considered for MHP_MEMMAP_ON_MEMORY
>> +		 * @range has already been aligned to memory_block_size_bytes(),
>> +		 * so the following loop will always break it down cleanly.
>>   		 */
>> -		rc = add_memory_driver_managed(data->mgid, range.start,
>> -				range_len(&range), kmem_name, MHP_NID_IS_MGID);
>> +		cur_start = range.start;
>> +		cur_len = memory_block_size_bytes();
>> +		remaining = range_len(&range);
>> +		while (remaining) {
>> +			mhp_t mhp_flags = MHP_NID_IS_MGID;
>>
>> -		if (rc) {
>> -			dev_warn(dev, "mapping%d: %#llx-%#llx memory add failed\n",
>> -					i, range.start, range.end);
>> -			remove_resource(res);
>> -			kfree(res);
>> -			data->res[i] = NULL;
>> -			if (mapped)
>> -				continue;
>> -			goto err_request_mem;
>> +			if (mhp_supports_memmap_on_memory(cur_len,
>> +							  MHP_MEMMAP_ON_MEMORY))
>> +				mhp_flags |= MHP_MEMMAP_ON_MEMORY;
>> +			/*
>> +			 * Ensure that future kexec'd kernels will not treat
>> +			 * this as RAM automatically.
>> +			 */
>> +			rc = add_memory_driver_managed(data->mgid, cur_start,
>> +						       cur_len, kmem_name,
>> +						       mhp_flags);
>> +
>> +			if (rc) {
>> +				dev_warn(dev,
>> +					 "mapping%d: %#llx-%#llx memory add failed\n",
>> +					 i, cur_start, cur_start + cur_len - 1);
>> +				remove_resource(res);
>> +				kfree(res);
>> +				data->res[i] = NULL;
>> +				if (mapped)
>> +					continue;
>> +				goto err_request_mem;
>> +			}
>> +
>> +			cur_start += cur_len;
>> +			remaining -= cur_len;
>>   		}
>>   		mapped++;
>>   	}
>>
>
> Maybe the better alternative is to teach
> add_memory_resource()/try_remove_memory() to do that internally.
>
> In the add_memory_resource() case, it might be a loop around that
> memmap_on_memory + arch_add_memory code path (well, and the error path
> also needs adjustment):
>
> 	/*
> 	 * Self hosted memmap array
> 	 */
> 	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
> 		if (!mhp_supports_memmap_on_memory(size)) {
> 			ret = -EINVAL;
> 			goto error;
> 		}
> 		mhp_altmap.free = PHYS_PFN(size);
> 		mhp_altmap.base_pfn = PHYS_PFN(start);
> 		params.altmap = &mhp_altmap;
> 	}
>
> 	/* call arch's memory hotadd */
> 	ret = arch_add_memory(nid, start, size, &params);
> 	if (ret < 0)
> 		goto error;
>
>
> Note that we want to handle that on a per memory-block basis, because we
> don't want the vmemmap of memory block #2 to end up on memory block #1.
> It all gets messy with memory onlining/offlining etc otherwise ...
>
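
To make the per-block point above concrete, here is a small user-space sketch (not kernel code) of where a self-hosted memmap would sit when every memory block carries its own. It assumes 4 KiB pages and a 64-byte struct page, which is typical for x86-64 but not guaranteed, and all sketch_* names are made up for illustration. It follows the altmap setup quoted above, where base_pfn is the start of the range being added, so the memmap occupies the first pages of the block it describes.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values for illustration only; real kernels derive these per arch. */
#define SKETCH_PAGE_SIZE        4096ULL	/* assumption: 4 KiB base pages */
#define SKETCH_STRUCT_PAGE_SIZE 64ULL	/* assumption: sizeof(struct page) */

/* Bytes of memmap needed to describe one memory block of block_size bytes. */
uint64_t sketch_memmap_bytes(uint64_t block_size)
{
	assert(block_size % SKETCH_PAGE_SIZE == 0);
	return (block_size / SKETCH_PAGE_SIZE) * SKETCH_STRUCT_PAGE_SIZE;
}

/*
 * Address where block number block_idx of the range keeps its own memmap.
 * With one altmap per block, this is simply the start of that block, so
 * no block depends on the memmap of a neighbouring block.
 */
uint64_t sketch_memmap_start(uint64_t range_start, uint64_t block_size,
			     uint64_t block_idx)
{
	return range_start + block_idx * block_size;
}

/* First byte of the block that remains usable as regular memory. */
uint64_t sketch_usable_start(uint64_t range_start, uint64_t block_size,
			     uint64_t block_idx)
{
	return sketch_memmap_start(range_start, block_size, block_idx) +
	       sketch_memmap_bytes(block_size);
}
```

With a 128 MiB memory block these assumptions give 2 MiB of memmap per block, so block #1 of a range starting at 0x100000000 would keep its memmap at 0x108000000 and its first usable byte would be at 0x108200000.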

I tried to implement this inside add_memory_driver_managed() and also
within dax/kmem. IMHO doing the error handling inside dax/kmem is
better. Here is how it looks:

1. If any blocks were added earlier (mapped > 0), we keep looping over
   all successful request_mem_region() calls.
2. For each successful request_mem_region(), if any block was added we
   keep the resource. If none was added, we remove and kfree() the resource.



	for (i = 0; i < dev_dax->nr_range; i++) {
		u64 cur_start, cur_len, remaining;
		struct resource *res;
		struct range range;
		bool block_added;

		rc = dax_kmem_range(dev_dax, i, &range);
		if (rc)
			continue;

		/* Region is permanently reserved if hotremove fails. */
		res = request_mem_region(range.start, range_len(&range), data->res_name);
		if (!res) {
			dev_warn(dev, "mapping%d: %#llx-%#llx could not reserve region\n",
					i, range.start, range.end);
			/*
			 * Once some memory has been onlined we can't
			 * assume that it can be un-onlined safely.
			 */
			if (mapped)
				continue;
			rc = -EBUSY;
			goto err_request_mem;
		}
		data->res[i] = res;

		/*
		 * Set flags appropriate for System RAM.  Leave ..._BUSY clear
		 * so that add_memory() can add a child resource.  Do not
		 * inherit flags from the parent since it may set new flags
		 * unknown to us that will break add_memory() below.
		 */
		res->flags = IORESOURCE_SYSTEM_RAM;

		/*
		 * Add memory in chunks of memory_block_size_bytes() so that
		 * it is considered for MHP_MEMMAP_ON_MEMORY.
		 * @range has already been aligned to memory_block_size_bytes(),
		 * so the following loop will always break it down cleanly.
		 */
		cur_start = range.start;
		cur_len = memory_block_size_bytes();
		remaining = range_len(&range);
		block_added = false;
		while (remaining) {
			/*
			 * If alignment rules are not satisfied we will
			 * fall back to normal memmap allocation.
			 */
			mhp_t mhp_flags = MHP_NID_IS_MGID | MHP_MEMMAP_ON_MEMORY;
			/*
			 * Ensure that future kexec'd kernels will not treat
			 * this as RAM automatically.
			 */
			rc = add_memory_driver_managed(data->mgid, cur_start,
						       cur_len, kmem_name,
						       mhp_flags);

			if (rc)
				dev_warn(dev,
					 "mapping%d: %#llx-%#llx memory add failed\n",
					 i, cur_start, cur_start + cur_len - 1);
			else
				block_added = true;

			cur_start += cur_len;
			remaining -= cur_len;
		}
		if (!block_added) {
			/*
			 * None of the blocks got added, remove the resource.
			 */
			remove_resource(res);
			kfree(res);
			data->res[i] = NULL;
		} else {
			mapped++;
		}
	}
	if (mapped) {
		dev_set_drvdata(dev, data);
		return 0;
	}

err_request_mem:
	/*
	 * None of the resources got mapped, so unregister the
	 * memory group.
	 */
	memory_group_unregister(data->mgid);
err_reg_mgid:
	kfree(data->res_name);
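
The per-range bookkeeping in the loop above can be tried outside the kernel with a small user-space model. This is only a sketch under stated assumptions: sketch_add_fn is a hypothetical stand-in for add_memory_driver_managed(), and sketch_add_range(), struct sketch_result and sketch_fail_above() are illustrative names, not kernel APIs. Like the loop above, it keeps going after a failed block and only gives up the resource if no block at all was added.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for add_memory_driver_managed(): 0 means success. */
typedef int (*sketch_add_fn)(uint64_t start, uint64_t len, void *ctx);

struct sketch_result {
	unsigned int blocks_added;	/* blocks the callback accepted */
	unsigned int blocks_failed;	/* blocks the callback rejected */
	bool keep_resource;		/* true if at least one block was added */
};

/*
 * Walk [start, start + len) in block_size steps. A failed block is only
 * counted; the walk continues so later blocks still get a chance.
 */
struct sketch_result sketch_add_range(uint64_t start, uint64_t len,
				      uint64_t block_size,
				      sketch_add_fn add, void *ctx)
{
	struct sketch_result r = { 0, 0, false };
	uint64_t cur_start = start;
	uint64_t remaining = len;

	/* The range must already be aligned to the block size. */
	assert(block_size != 0 && len % block_size == 0);

	while (remaining) {
		if (add(cur_start, block_size, ctx))
			r.blocks_failed++;
		else
			r.blocks_added++;

		cur_start += block_size;
		remaining -= block_size;
	}

	r.keep_resource = r.blocks_added > 0;
	return r;
}

/* Example callback: reject every block starting at or above *ctx. */
int sketch_fail_above(uint64_t start, uint64_t len, void *ctx)
{
	uint64_t limit = *(uint64_t *)ctx;

	(void)len;
	return start >= limit ? -1 : 0;
}
```

For a four-block range where the callback rejects the upper two blocks, the model reports two added and two failed blocks and keeps the resource. If the callback rejects every block, keep_resource is false, which corresponds to the remove_resource()/kfree() branch in the loop above.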
