From: "David Hildenbrand (Arm)" <david@kernel.org>
To: Gregory Price <gourry@gourry.net>,
linux-mm@kvack.org, nvdimm@lists.linux.dev
Cc: linux-kernel@vger.kernel.org, linux-cxl@vger.kernel.org,
driver-core@lists.linux.dev, linux-kselftest@vger.kernel.org,
kernel-team@meta.com, osalvador@suse.de,
gregkh@linuxfoundation.org, rafael@kernel.org, dakr@kernel.org,
djbw@kernel.org, vishal.l.verma@intel.com, dave.jiang@intel.com,
akpm@linux-foundation.org, ljs@kernel.org, liam@infradead.org,
vbabka@kernel.org, rppt@kernel.org, surenb@google.com,
mhocko@suse.com, shuah@kernel.org, alison.schofield@intel.com,
Smita.KoralahalliChannabasappa@amd.com, ira.weiny@intel.com,
apopple@nvidia.com
Subject: Re: [PATCH v5 5/9] mm/memory_hotplug: offline_and_remove_memory_ranges()
Date: Thu, 25 Jun 2026 09:22:01 +0200 [thread overview]
Message-ID: <d48feca1-0203-43ff-bd66-6243291a51ba@kernel.org> (raw)
In-Reply-To: <20260624145744.3532049-6-gourry@gourry.net>
On 6/24/26 16:57, Gregory Price wrote:
> offline_and_remove_memory() handles a single contiguous range.
>
> Callers that manage a device composed of several ranges (dax/kmem)
> currently have to call it in a loop, which gives up atomicity.
>
> In addition to pushing rollback logic into the driver, the lack
> of atomicity creates a race condition between system daemons trying
> to manage the same resource:
>
> - Manager 1: Offlines memory blocks. Removes device.
> ^^^^
> - Manager 2: Detects offline memory blocks, re-onlines them.
>
> Add offline_and_remove_memory_ranges(), which takes an array of ranges
> and processes them as one operation under a single lock_device_hotplug():
>
> - Phase 1 offlines every block of every range.
> - Phase 2 removes the ranges only if all ranges are offline.
> - If any offline fails, the whole operation is reverted.
>
> This gives callers all-or-nothing semantics for the offline step, so a
> failed or interrupted unplug leaves the device in a consistent state.
>
> This also resolves the battling managers race - the second manager's
> operation simply fails when the block is destroyed / cannot be onlined.
>
> offline_and_remove_memory() becomes a thin wrapper that passes its single
> range to the new helper, so the offline/rollback logic lives in one place.
>
> Suggested-by: David Hildenbrand (Arm) <david@kernel.org>
> Signed-off-by: Gregory Price <gourry@gourry.net>
> ---
> include/linux/memory_hotplug.h | 7 +++
> mm/memory_hotplug.c | 94 ++++++++++++++++++++++++----------
> 2 files changed, 74 insertions(+), 27 deletions(-)
>
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index d3edeb80aadb..7f1da7c428dc 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -267,6 +267,7 @@ extern int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
> extern int remove_memory(u64 start, u64 size);
> extern void __remove_memory(u64 start, u64 size);
> extern int offline_and_remove_memory(u64 start, u64 size);
> +int offline_and_remove_memory_ranges(const struct range *ranges, int nr_ranges);
>
> #else
> static inline void try_offline_node(int nid) {}
> @@ -283,6 +284,12 @@ static inline int remove_memory(u64 start, u64 size)
> }
>
> static inline void __remove_memory(u64 start, u64 size) {}
> +
> +static inline int offline_and_remove_memory_ranges(const struct range *ranges,
> + int nr_ranges)
Best to use "unsigned int" right from the start and use two tabs to indent.
> +{
> + return -EBUSY;
> +}
> #endif /* CONFIG_MEMORY_HOTREMOVE */
>
> #ifdef CONFIG_MEMORY_HOTPLUG
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index a66346def504..7d56e0c6ede0 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -2429,58 +2429,98 @@ static int try_reonline_memory_block(struct memory_block *mem, void *arg)
> */
> int offline_and_remove_memory(u64 start, u64 size)
> {
> - const unsigned long mb_count = size / memory_block_size_bytes();
> + struct range range = { .start = start, .end = start + size - 1 };
I prefer this more readable as:
struct range range = {
.start = start,
.end = start + size - 1,
};
> +
> + return offline_and_remove_memory_ranges(&range, 1);
> +}
> +EXPORT_SYMBOL_GPL(offline_and_remove_memory);
> +
> +/**
> + * offline_and_remove_memory_ranges - offline and remove multiple memory ranges
> + * @ranges: array of physical address ranges to offline and remove
> + * @nr_ranges: number of entries in @ranges
> + *
> + * Offline and remove several memory ranges as one operation, serialized
> + * against other hotplug operations by a single lock_device_hotplug().
> + *
> + * This offlines all ranges before removing any of them. If offlining any
> + * range fails, the entire process is reverted and nothing is removed.
> + * This provides a fully atomic semantic for unplugging an entire device.
> + *
> + * Each range must be memory-block aligned in start and size.
> + *
> + * Return: 0 on success, negative errno otherwise. On failure no range has
> + * been removed.
> + */
> +int offline_and_remove_memory_ranges(const struct range *ranges, int nr_ranges)
> +{
> + unsigned long mb_total = 0;
> uint8_t *online_types, *tmp;
> - int rc;
> + int i, rc = 0;
>
> - if (!IS_ALIGNED(start, memory_block_size_bytes()) ||
> - !IS_ALIGNED(size, memory_block_size_bytes()) || !size)
> + if (!ranges || nr_ranges <= 0)
With "unsigned int" this will be !nr_ranges.
Wondering whether we would WARN_ON_ONCE() here.
> return -EINVAL;
>
> + for (i = 0; i < nr_ranges; i++) {
> + u64 start = ranges[i].start;
> + u64 size = range_len(&ranges[i]);
Both can be const.
> +
> + if (!IS_ALIGNED(start, memory_block_size_bytes()) ||
> + !IS_ALIGNED(size, memory_block_size_bytes()) || !size)
> + return -EINVAL;
> + mb_total += size / memory_block_size_bytes();
> + }
> +
> /*
> - * We'll remember the old online type of each memory block, so we can
> - * try to revert whatever we did when offlining one memory block fails
> - * after offlining some others succeeded.
> + * Remember the old online type of every memory block across all ranges,
> + * so we can revert if offlining a later block fails. All entries start
> + * as MMOP_OFFLINE so blocks we never touched are skipped on rollback.
> */
> - online_types = kmalloc_array(mb_count, sizeof(*online_types),
> + online_types = kmalloc_array(mb_total, sizeof(*online_types),
> GFP_KERNEL);
Is "mb_total" really more expressive than "mb_count"?
> if (!online_types)
> return -ENOMEM;
> - /*
> - * Initialize all states to MMOP_OFFLINE, so when we abort processing in
> - * try_offline_memory_block(), we'll skip all unprocessed blocks in
> - * try_reonline_memory_block().
> - */
> - memset(online_types, MMOP_OFFLINE, mb_count);
> + memset(online_types, MMOP_OFFLINE, mb_total);
>
> lock_device_hotplug();
>
> + /* Phase 1: offline every block in every range. */
> tmp = online_types;
> - rc = walk_memory_blocks(start, size, &tmp, try_offline_memory_block);
> + for (i = 0; i < nr_ranges; i++) {
> + rc = walk_memory_blocks(ranges[i].start, range_len(&ranges[i]),
> + &tmp, try_offline_memory_block);
> + if (rc)
> + break;
> + }
>
> /*
> - * In case we succeeded to offline all memory, remove it.
> - * This cannot fail as it cannot get onlined in the meantime.
> + * Phase 2: Remove each range. This essentially cannot fail as we hold
> + * the hotplug lock . WARN if that assumption is ever broken.
> */
> if (!rc) {
> - rc = try_remove_memory(start, size);
> - if (rc)
> - pr_err("%s: Failed to remove memory: %d", __func__, rc);
> + for (i = 0; i < nr_ranges; i++) {
> + rc = try_remove_memory(ranges[i].start,
> + range_len(&ranges[i]));
> + if (WARN_ON_ONCE(rc)) {
> + pr_err("%s: Failed to remove memory: %d",
> + __func__, rc);
> + break;
Do we really want to break? I'd say, just warn and continue, and fake rc == 0.
Something is seriously messed up already, and we partially removed memory. There
is no clean rollback possible.
Similar to __remove_memory(), ignoring the error because it offlined it already.
> + }
> + }
> }
In general, looks much cleaner to me, thanks!
--
Cheers,
David
next prev parent reply other threads:[~2026-06-25 7:22 UTC|newest]
Thread overview: 19+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-06-24 14:57 [PATCH v5 0/9] dax/kmem: atomic whole-device hotplug via sysfs Gregory Price
2026-06-24 14:57 ` [PATCH v5 1/9] mm/memory: add memory_block_aligned_range() helper Gregory Price
2026-06-24 14:57 ` [PATCH v5 2/9] mm/memory_hotplug: pass online_type to online_memory_block() via arg Gregory Price
2026-06-24 16:28 ` Gupta, Pankaj
2026-06-24 14:57 ` [PATCH v5 3/9] mm/memory_hotplug: export mhp_get_default_online_type Gregory Price
2026-06-24 14:57 ` [PATCH v5 4/9] mm/memory_hotplug: add __add_memory_driver_managed() with online_type arg Gregory Price
2026-06-24 16:41 ` Gupta, Pankaj
2026-06-24 14:57 ` [PATCH v5 5/9] mm/memory_hotplug: offline_and_remove_memory_ranges() Gregory Price
2026-06-25 7:22 ` David Hildenbrand (Arm) [this message]
2026-06-24 14:57 ` [PATCH v5 6/9] dax: plumb hotplug online_type through dax Gregory Price
2026-06-24 14:57 ` [PATCH v5 7/9] dax/kmem: extract hotplug/hotremove helper functions Gregory Price
2026-06-24 14:57 ` [PATCH v5 8/9] dax/kmem: add sysfs interface for atomic whole-device hotplug Gregory Price
[not found] ` <20260624151122.AFE551F000E9@smtp.kernel.org>
2026-06-24 21:28 ` Gregory Price
2026-06-25 6:17 ` Hannes Reinecke
2026-06-25 6:43 ` Gregory Price
2026-06-25 7:40 ` David Hildenbrand (Arm)
2026-06-24 14:57 ` [PATCH v5 9/9] selftests/dax: add dax/kmem hotplug sysfs regression test Gregory Price
2026-06-24 18:59 ` [PATCH v5 0/9] dax/kmem: atomic whole-device hotplug via sysfs Gregory Price
2026-06-25 7:41 ` David Hildenbrand (Arm)
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=d48feca1-0203-43ff-bd66-6243291a51ba@kernel.org \
--to=david@kernel.org \
--cc=Smita.KoralahalliChannabasappa@amd.com \
--cc=akpm@linux-foundation.org \
--cc=alison.schofield@intel.com \
--cc=apopple@nvidia.com \
--cc=dakr@kernel.org \
--cc=dave.jiang@intel.com \
--cc=djbw@kernel.org \
--cc=driver-core@lists.linux.dev \
--cc=gourry@gourry.net \
--cc=gregkh@linuxfoundation.org \
--cc=ira.weiny@intel.com \
--cc=kernel-team@meta.com \
--cc=liam@infradead.org \
--cc=linux-cxl@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-kselftest@vger.kernel.org \
--cc=linux-mm@kvack.org \
--cc=ljs@kernel.org \
--cc=mhocko@suse.com \
--cc=nvdimm@lists.linux.dev \
--cc=osalvador@suse.de \
--cc=rafael@kernel.org \
--cc=rppt@kernel.org \
--cc=shuah@kernel.org \
--cc=surenb@google.com \
--cc=vbabka@kernel.org \
--cc=vishal.l.verma@intel.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox