* [PATCH v2 0/1] mm/memory_hotplug: introduce and use PG_offline_skippable
@ 2025-05-20 16:42 David Hildenbrand
2025-05-20 16:42 ` [PATCH v2 1/1] mm/memory_hotplug: PG_offline_skippable for offlining memory blocks with PageOffline pages David Hildenbrand
0 siblings, 1 reply; 8+ messages in thread
From: David Hildenbrand @ 2025-05-20 16:42 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, virtualization, David Hildenbrand, Michael S. Tsirkin,
Jason Wang, Xuan Zhuo, Eugenio Pérez, Andrew Morton,
Oscar Salvador, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Zi Yan, Matthew Wilcox (Oracle)
This is a requirement for making PageOffline pages not have a refcount
in the long future ("frozen"), and for reworking non-folio page migration
in the near future.
I have patches mostly ready to go to handle the latter. For turning all
PageOffline() pages frozen, the non-folio page migration and memory
ballooning drivers will have to be reworked first, to no longer rely on
the refcount of PageOffline pages.
Introduce PG_offline_skippable that only applies to PageOffline() pages --
of course, reusing one of the existing PG_ flags for now -- and convert
virtio-mem to make use of the new way: to allow for skipping PageOffline
pages during memory offlining, treating them as if they would not be
allocated.
Note that the existing mechanism relied on the driver (virtio-mem)
dropping its reference during MEM_GOING_OFFLINE, which is complicated and
not compatible with the concept of frozen pages (no refcount).
Tested with virtio-mem on x86, including partially hotplugging a memory
block (hotplugging 64MiB with a 128 MiB memory block size), and repeatedly
onlining+offlining the memory block. Also tested that forced driver
unloading with partially plugged memory blocks keeps working as is.
Cc: David Hildenbrand <david@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: "Eugenio Pérez" <eperezma@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
v1 -> v2:
* Handle forced driver unloading of virtio-mem: clear the flag
* Added a comment for a PageOfflineSkippable check
* Added more details to the patch description
* Drop patch #2 ("mm/memory_hotplug: remove -EBUSY handling from
scan_movable_pages()")
David Hildenbrand (1):
mm/memory_hotplug: PG_offline_skippable for offlining memory blocks
with PageOffline pages
drivers/virtio/virtio_mem.c | 141 ++++++++----------------------------
include/linux/page-flags.h | 29 ++++++--
mm/memory_hotplug.c | 17 +++--
mm/page_alloc.c | 8 +-
mm/page_isolation.c | 21 ++----
5 files changed, 74 insertions(+), 142 deletions(-)
--
2.49.0
^ permalink raw reply [flat|nested] 8+ messages in thread
* [PATCH v2 1/1] mm/memory_hotplug: PG_offline_skippable for offlining memory blocks with PageOffline pages
2025-05-20 16:42 [PATCH v2 0/1] mm/memory_hotplug: introduce and use PG_offline_skippable David Hildenbrand
@ 2025-05-20 16:42 ` David Hildenbrand
2025-05-21 13:32 ` Oscar Salvador
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: David Hildenbrand @ 2025-05-20 16:42 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mm, virtualization, David Hildenbrand, Michael S. Tsirkin,
Jason Wang, Xuan Zhuo, Eugenio Pérez, Andrew Morton,
Oscar Salvador, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Zi Yan, Matthew Wilcox (Oracle)
A long-term goal is supporting frozen PageOffline pages, and later
PageOffline pages that don't have a refcount at all. Some more work for
that is needed -- in particular around non-folio page migration and
memory ballooning drivers -- but let's start by handling PageOffline pages
that can be skipped during memory offlining differently.
Note that PageOffline is used to mark pages that are logically offline
in an otherwise online memory block (e.g., 128 MiB). If a memory
block is offline, the memmap is considered compeltely uninitialized
and stale (see pfn_to_online_page()).
Let's introduce a PageOffline specific page flag (PG_offline_skippable)
that for now reuses PG_owner_2. In the memdesc future, it will be one of
a small number of per-memdesc flags stored alongside the type.
By setting PG_offline_skippable, a driver indicates that it can
restore the PageOffline state of these specific pages when re-onlining a
memory block: it knows that these pages are supposed to be PageOffline()
without the information in the vmemmap, so it can filter them out and
not expose them to the buddy -> they stay PageOffline().
While PG_offline_offlineable might be clearer, it is also super
confusing. Alternatives (PG_offline_sticky?) also don't quite feel right.
So let's use "skippable" for now.
The flag is not supposed to be used for movable PageOffline pages as
used for balloon compaction; movable PageOffline() pages can simply be
migrated during the memory offlining stage, turning the migration
destination page PageOffline() and turning the migration source page
into a free buddy page.
Let's convert the single user from our MEM_GOING_OFFLINE approach
to the new PG_offline_skippable approach: virtio-mem. Fortunately,
this simplifies the code quite a lot. The only corner case we have to
take care of is when force-unloading the virtio-mem driver: we have to
prevent partially-plugged memory blocks from getting offlined by
clearing PG_offline_skippable again.
What if someone decides to grab a reference on these pages although they
really shouldn't? After all, we'll now keep the refcount at 1 (until we
can properly stop using the refcount completely).
Well, less worse things will happen than would currently: currently,
if someone would grab a reference to these pages, in MEM_GOING_OFFLINE
we would run into the
if (WARN_ON(!page_ref_dec_and_test(page)))
dump_page(page, "fake-offline page referenced");
And once that unexpected reference would get dropped, we would end up
freeing that page to the buddy: ouch.
Now, we'll allow for offlining that memory, and when that unexpected
reference would get dropped, we would not end up freeing that page to
the buddy. Once we have frozen PageOffline() pages, it will all get a
lot cleaner.
Note that we didn't see the existing WARN_ON so far, because nobody
should ever be referencing such pages.
An alternative might be to have another callback chain from memory hotplug
code, where a driver that owns that page could agree to skip the
PageOffline() page. However, we would have to repeatedly issue these
callbacks for individual PageOffline() pages, which does not sound
compelling. As we have spare bits, let's use this simpler approach for
now.
Acked-by: Zi Yan <ziy@nvidia.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
---
drivers/virtio/virtio_mem.c | 141 ++++++++----------------------------
include/linux/page-flags.h | 29 ++++++--
mm/memory_hotplug.c | 17 +++--
mm/page_alloc.c | 8 +-
mm/page_isolation.c | 21 ++----
5 files changed, 74 insertions(+), 142 deletions(-)
diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 56d0dbe621637..e3b00ac8ada29 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -278,10 +278,6 @@ static DEFINE_MUTEX(virtio_mem_mutex);
static LIST_HEAD(virtio_mem_devices);
static void virtio_mem_online_page_cb(struct page *page, unsigned int order);
-static void virtio_mem_fake_offline_going_offline(unsigned long pfn,
- unsigned long nr_pages);
-static void virtio_mem_fake_offline_cancel_offline(unsigned long pfn,
- unsigned long nr_pages);
static void virtio_mem_retry(struct virtio_mem *vm);
static int virtio_mem_create_resource(struct virtio_mem *vm);
static void virtio_mem_delete_resource(struct virtio_mem *vm);
@@ -924,64 +920,6 @@ static void virtio_mem_sbm_notify_online(struct virtio_mem *vm,
virtio_mem_sbm_set_mb_state(vm, mb_id, new_state);
}
-static void virtio_mem_sbm_notify_going_offline(struct virtio_mem *vm,
- unsigned long mb_id)
-{
- const unsigned long nr_pages = PFN_DOWN(vm->sbm.sb_size);
- unsigned long pfn;
- int sb_id;
-
- for (sb_id = 0; sb_id < vm->sbm.sbs_per_mb; sb_id++) {
- if (virtio_mem_sbm_test_sb_plugged(vm, mb_id, sb_id, 1))
- continue;
- pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
- sb_id * vm->sbm.sb_size);
- virtio_mem_fake_offline_going_offline(pfn, nr_pages);
- }
-}
-
-static void virtio_mem_sbm_notify_cancel_offline(struct virtio_mem *vm,
- unsigned long mb_id)
-{
- const unsigned long nr_pages = PFN_DOWN(vm->sbm.sb_size);
- unsigned long pfn;
- int sb_id;
-
- for (sb_id = 0; sb_id < vm->sbm.sbs_per_mb; sb_id++) {
- if (virtio_mem_sbm_test_sb_plugged(vm, mb_id, sb_id, 1))
- continue;
- pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
- sb_id * vm->sbm.sb_size);
- virtio_mem_fake_offline_cancel_offline(pfn, nr_pages);
- }
-}
-
-static void virtio_mem_bbm_notify_going_offline(struct virtio_mem *vm,
- unsigned long bb_id,
- unsigned long pfn,
- unsigned long nr_pages)
-{
- /*
- * When marked as "fake-offline", all online memory of this device block
- * is allocated by us. Otherwise, we don't have any memory allocated.
- */
- if (virtio_mem_bbm_get_bb_state(vm, bb_id) !=
- VIRTIO_MEM_BBM_BB_FAKE_OFFLINE)
- return;
- virtio_mem_fake_offline_going_offline(pfn, nr_pages);
-}
-
-static void virtio_mem_bbm_notify_cancel_offline(struct virtio_mem *vm,
- unsigned long bb_id,
- unsigned long pfn,
- unsigned long nr_pages)
-{
- if (virtio_mem_bbm_get_bb_state(vm, bb_id) !=
- VIRTIO_MEM_BBM_BB_FAKE_OFFLINE)
- return;
- virtio_mem_fake_offline_cancel_offline(pfn, nr_pages);
-}
-
/*
* This callback will either be called synchronously from add_memory() or
* asynchronously (e.g., triggered via user space). We have to be careful
@@ -1040,12 +978,6 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
break;
}
vm->hotplug_active = true;
- if (vm->in_sbm)
- virtio_mem_sbm_notify_going_offline(vm, id);
- else
- virtio_mem_bbm_notify_going_offline(vm, id,
- mhp->start_pfn,
- mhp->nr_pages);
break;
case MEM_GOING_ONLINE:
mutex_lock(&vm->hotplug_mutex);
@@ -1094,12 +1026,6 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
case MEM_CANCEL_OFFLINE:
if (!vm->hotplug_active)
break;
- if (vm->in_sbm)
- virtio_mem_sbm_notify_cancel_offline(vm, id);
- else
- virtio_mem_bbm_notify_cancel_offline(vm, id,
- mhp->start_pfn,
- mhp->nr_pages);
vm->hotplug_active = false;
mutex_unlock(&vm->hotplug_mutex);
break;
@@ -1157,6 +1083,7 @@ static void virtio_mem_set_fake_offline(unsigned long pfn,
SetPageDirty(page);
else
__SetPageOffline(page);
+ __SetPageOfflineSkippable(page);
VM_WARN_ON_ONCE(!PageOffline(page));
}
page_offline_end();
@@ -1172,6 +1099,7 @@ static void virtio_mem_clear_fake_offline(unsigned long pfn,
for (; nr_pages--; pfn++) {
struct page *page = pfn_to_page(pfn);
+ __ClearPageOfflineSkippable(page);
if (!onlined)
/* generic_online_page() will clear PageOffline(). */
ClearPageDirty(page);
@@ -1261,41 +1189,6 @@ static int virtio_mem_fake_offline(struct virtio_mem *vm, unsigned long pfn,
return -EBUSY;
}
-/*
- * Handle fake-offline pages when memory is going offline - such that the
- * pages can be skipped by mm-core when offlining.
- */
-static void virtio_mem_fake_offline_going_offline(unsigned long pfn,
- unsigned long nr_pages)
-{
- struct page *page;
- unsigned long i;
-
- /* Drop our reference to the pages so the memory can get offlined. */
- for (i = 0; i < nr_pages; i++) {
- page = pfn_to_page(pfn + i);
- if (WARN_ON(!page_ref_dec_and_test(page)))
- dump_page(page, "fake-offline page referenced");
- }
-}
-
-/*
- * Handle fake-offline pages when memory offlining is canceled - to undo
- * what we did in virtio_mem_fake_offline_going_offline().
- */
-static void virtio_mem_fake_offline_cancel_offline(unsigned long pfn,
- unsigned long nr_pages)
-{
- unsigned long i;
-
- /*
- * Get the reference again that we dropped via page_ref_dec_and_test()
- * when going offline.
- */
- for (i = 0; i < nr_pages; i++)
- page_ref_inc(pfn_to_page(pfn + i));
-}
-
static void virtio_mem_online_page(struct virtio_mem *vm,
struct page *page, unsigned int order)
{
@@ -2981,6 +2874,23 @@ static int virtio_mem_probe(struct virtio_device *vdev)
return rc;
}
+static void virtio_mem_sbm_mark_mb_non_skippable(struct virtio_mem *vm,
+ unsigned long mb_id)
+{
+ const unsigned long nr_pages = PFN_DOWN(vm->sbm.sb_size);
+ unsigned long i, pfn;
+ int sb_id;
+
+ for (sb_id = 0; sb_id < vm->sbm.sbs_per_mb; sb_id++) {
+ if (virtio_mem_sbm_test_sb_plugged(vm, mb_id, sb_id, 1))
+ continue;
+ pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
+ sb_id * vm->sbm.sb_size);
+ for (i = 0; i < nr_pages; i++)
+ __ClearPageOfflineSkippable(pfn_to_page(pfn + i));
+ }
+}
+
static void virtio_mem_deinit_hotplug(struct virtio_mem *vm)
{
unsigned long mb_id;
@@ -3013,10 +2923,17 @@ static void virtio_mem_deinit_hotplug(struct virtio_mem *vm)
VIRTIO_MEM_SBM_MB_UNUSED);
}
/*
- * After we unregistered our callbacks, user space can no longer
- * offline partially plugged online memory blocks. No need to
- * worry about them.
+ * User space could offline partially plugged online memory
+ * blocks and re-online them without us being able to keep
+ * unplugged pieces PageOffline. Prevent offlining by marking
+ * unplugged pieces as non-skippable.
*/
+ virtio_mem_sbm_for_each_mb(vm, mb_id,
+ VIRTIO_MEM_SBM_MB_KERNEL_PARTIAL)
+ virtio_mem_sbm_mark_mb_non_skippable(vm, mb_id);
+ virtio_mem_sbm_for_each_mb(vm, mb_id,
+ VIRTIO_MEM_SBM_MB_MOVABLE_PARTIAL)
+ virtio_mem_sbm_mark_mb_non_skippable(vm, mb_id);
}
/* unregister callbacks */
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 4fe5ee67535b2..2f3b62b2b4c0f 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -178,6 +178,9 @@ enum pageflags {
PG_vmemmap_self_hosted = PG_owner_priv_1,
#endif
+ /* The driver allows for skipping these pages during memory offlining */
+ PG_offline_skippable = PG_owner_2,
+
/*
* Flags only valid for compound pages. Stored in first tail page's
* flags word. Cannot use the first 8 flags or any flag marked as
@@ -1036,14 +1039,23 @@ PAGE_TYPE_OPS(Buddy, buddy, buddy)
* refcount of 1 and PageOffline(). generic_online_page() will
* take care of clearing PageOffline().
*
- * If a driver wants to allow to offline unmovable PageOffline() pages without
- * putting them back to the buddy, it can do so via the memory notifier by
- * decrementing the reference count in MEM_GOING_OFFLINE and incrementing the
- * reference count in MEM_CANCEL_OFFLINE. When offlining, the PageOffline()
- * pages (now with a reference count of zero) are treated like free (unmanaged)
- * pages, allowing the containing memory block to get offlined. A driver that
- * relies on this feature is aware that re-onlining the memory block will
- * require not giving them to the buddy via generic_online_page().
+ * If a driver wants to allow for offlining unmovable PageOffline() pages
+ * when offlining a memory block without exposing these pages as "free" to
+ * the buddy, it can mark them as PG_offline_skippable.
+ *
+ * By marking these PageOffline pages PG_offline_skippable, the driver
+ * acknowledges that it
+ * (a) knows which pages are PageOffline even without the memmap.
+ * (b) implements the online_page_callback_t callback to exclude these pages
+ * from getting exposed to the buddy, so they will stay PageOffline()
+ * when onlining a memory block.
+ * (c) synchronizes against concurrent memory onlining/offlining whenever
+ * adjusting the PG_offline_skippable flag.
+ *
+ * Note that the refcount of these pages will for now be assumed to always
+ * be 1: nobody except the owner should be referencing these pages.
+ * Offlining code will complain if the refcount is not 1. In the future,
+ * these pages will always be frozen and not have a refcount.
*
* Memory offlining code will not adjust the managed page count for any
* PageOffline() pages, treating them like they were never exposed to the
@@ -1055,6 +1067,7 @@ PAGE_TYPE_OPS(Buddy, buddy, buddy)
* page_offline_freeze()/page_offline_thaw().
*/
PAGE_TYPE_OPS(Offline, offline, offline)
+__PAGEFLAG(OfflineSkippable, offline_skippable, PF_NO_COMPOUND)
extern void page_offline_freeze(void);
extern void page_offline_thaw(void);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b1caedbade5b1..ed04fc4ca6761 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1767,12 +1767,10 @@ static int scan_movable_pages(unsigned long start, unsigned long end,
goto found;
/*
- * PageOffline() pages that are not marked __PageMovable() and
- * have a reference count > 0 (after MEM_GOING_OFFLINE) are
- * definitely unmovable. If their reference count would be 0,
- * they could at least be skipped when offlining memory.
+ * PageOffline() pages that are neither "movable" nor
+ * "skippable" prevent memory offlining.
*/
- if (PageOffline(page) && page_count(page))
+ if (PageOffline(page) && !PageOfflineSkippable(page))
return -EBUSY;
if (!PageHuge(page))
@@ -1807,6 +1805,15 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
struct page *page;
page = pfn_to_page(pfn);
+
+ /*
+ * Only PageOffline() pages that are marked "skippable" cannot
+ * be migrated but can be skipped when offlining. See
+ * scan_movable_pages().
+ */
+ if (PageOffline(page) && PageOfflineSkippable(page))
+ continue;
+
folio = page_folio(page);
if (!folio_try_get(folio))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f6482223e28a2..7e4c41e46a911 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7023,12 +7023,12 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
continue;
}
/*
- * At this point all remaining PageOffline() pages have a
- * reference count of 0 and can simply be skipped.
+ * At this point all remaining PageOffline() pages must be
+ * "skippable" and have exactly one reference.
*/
if (PageOffline(page)) {
- BUG_ON(page_count(page));
- BUG_ON(PageBuddy(page));
+ WARN_ON_ONCE(!PageOfflineSkippable(page));
+ WARN_ON_ONCE(page_count(page) != 1);
already_offline++;
pfn++;
continue;
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index b2fc5266e3d26..debd898b794ea 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -121,16 +121,11 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
continue;
/*
- * We treat all PageOffline() pages as movable when offlining
- * to give drivers a chance to decrement their reference count
- * in MEM_GOING_OFFLINE in order to indicate that these pages
- * can be offlined as there are no direct references anymore.
- * For actually unmovable PageOffline() where the driver does
- * not support this, we will fail later when trying to actually
- * move these pages that still have a reference count > 0.
- * (false negatives in this function only)
+ * PageOffline() pages that are marked as "skippable"
+ * are treated like movable pages for memory offlining.
*/
- if ((flags & MEMORY_OFFLINE) && PageOffline(page))
+ if ((flags & MEMORY_OFFLINE) && PageOffline(page) &&
+ PageOfflineSkippable(page))
continue;
if (__PageMovable(page) || PageLRU(page))
@@ -577,11 +572,11 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
/* A HWPoisoned page cannot be also PageBuddy */
pfn++;
else if ((flags & MEMORY_OFFLINE) && PageOffline(page) &&
- !page_count(page))
+ PageOfflineSkippable(page))
/*
- * The responsible driver agreed to skip PageOffline()
- * pages when offlining memory by dropping its
- * reference in MEM_GOING_OFFLINE.
+ * If the page is a skippable PageOffline() page,
+ * we can offline the memory block, as the driver will
+ * re-discover them when re-onlining the memory.
*/
pfn++;
else
--
2.49.0
^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [PATCH v2 1/1] mm/memory_hotplug: PG_offline_skippable for offlining memory blocks with PageOffline pages
2025-05-20 16:42 ` [PATCH v2 1/1] mm/memory_hotplug: PG_offline_skippable for offlining memory blocks with PageOffline pages David Hildenbrand
@ 2025-05-21 13:32 ` Oscar Salvador
2025-05-21 13:44 ` David Hildenbrand
2025-05-23 9:20 ` Oscar Salvador
2025-05-30 16:54 ` Vlastimil Babka
2 siblings, 1 reply; 8+ messages in thread
From: Oscar Salvador @ 2025-05-21 13:32 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, virtualization, Michael S. Tsirkin,
Jason Wang, Xuan Zhuo, Eugenio Pérez, Andrew Morton,
Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Zi Yan, Matthew Wilcox (Oracle)
On Tue, May 20, 2025 at 06:42:11PM +0200, David Hildenbrand wrote:
> A long-term goal is supporting frozen PageOffline pages, and later
> PageOffline pages that don't have a refcount at all. Some more work for
> that is needed -- in particular around non-folio page migration and
> memory ballooning drivers -- but let's start by handling PageOffline pages
> that can be skipped during memory offlining differently.
>
> Note that PageOffline is used to mark pages that are logically offline
> in an otherwise online memory block (e.g., 128 MiB). If a memory
> block is offline, the memmap is considered compeltely uninitialized
> and stale (see pfn_to_online_page()).
>
> Let's introduce a PageOffline specific page flag (PG_offline_skippable)
> that for now reuses PG_owner_2. In the memdesc future, it will be one of
> a small number of per-memdesc flags stored alongside the type.
>
> By setting PG_offline_skippable, a driver indicates that it can
> restore the PageOffline state of these specific pages when re-onlining a
> memory block: it knows that these pages are supposed to be PageOffline()
> without the information in the vmemmap, so it can filter them out and
> not expose them to the buddy -> they stay PageOffline().
>
> While PG_offline_offlineable might be clearer, it is also super
> confusing. Alternatives (PG_offline_sticky?) also don't quite feel right.
> So let's use "skippable" for now.
>
> The flag is not supposed to be used for movable PageOffline pages as
> used for balloon compaction; movable PageOffline() pages can simply be
> migrated during the memory offlining stage, turning the migration
> destination page PageOffline() and turning the migration source page
> into a free buddy page.
>
> Let's convert the single user from our MEM_GOING_OFFLINE approach
> to the new PG_offline_skippable approach: virtio-mem. Fortunately,
> this simplifies the code quite a lot. The only corner case we have to
> take care of is when force-unloading the virtio-mem driver: we have to
> prevent partially-plugged memory blocks from getting offlined by
> clearing PG_offline_skippable again.
>
> What if someone decides to grab a reference on these pages although they
> really shouldn't? After all, we'll now keep the refcount at 1 (until we
> can properly stop using the refcount completely).
>
> Well, less worse things will happen than would currently: currently,
> if someone would grab a reference to these pages, in MEM_GOING_OFFLINE
> we would run into the
> if (WARN_ON(!page_ref_dec_and_test(page)))
> dump_page(page, "fake-offline page referenced");
>
> And once that unexpected reference would get dropped, we would end up
> freeing that page to the buddy: ouch.
>
> Now, we'll allow for offlining that memory, and when that unexpected
> reference would get dropped, we would not end up freeing that page to
> the buddy. Once we have frozen PageOffline() pages, it will all get a
> lot cleaner.
>
> Note that we didn't see the existing WARN_ON so far, because nobody
> should ever be referencing such pages.
>
> An alternative might be to have another callback chain from memory hotplug
> code, where a driver that owns that page could agree to skip the
> PageOffline() page. However, we would have to repeatedly issue these
> callbacks for individual PageOffline() pages, which does not sound
> compelling. As we have spare bits, let's use this simpler approach for
> now.
>
> Acked-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Hi David, sorry for jumping in late
> @@ -1157,6 +1083,7 @@ static void virtio_mem_set_fake_offline(unsigned long pfn,
> SetPageDirty(page);
> else
> __SetPageOffline(page);
> + __SetPageOfflineSkippable(page);
> VM_WARN_ON_ONCE(!PageOffline(page));
I think I am having some issues understanding this, let me see if I get
it.
- virtio-mem defines PageOffline pages, which are logically-offlined
pages within an onlined memory-block
- PageOffline pages have a refcount of '0' once they are properly
initialized, meaning that refcount > 0 implies somebody is holding
a refcount and that should not really happen
- logically-offline pages belonging to onlined memory-blocks are marked PageDirty,
while logically-offlined pages we allocated via alloc_contig_range are marked
PageOffline (I am getting a bit lost between fake-online, fake-offline, my fault)
- If we want to release logically-offline pages belonging to an onlined memory-block,
we ClearDirty them and be done
- If we want to release logically-offlined pages belonging we allocated
via alloc_contig_range, we clear PageOffline and be done
- PageOfflineSkipabble are unmovable PageOffline pages, which cannot be
migrated?
- So for a PageOffline to be able to be migrated away must be Movable or
marked PageOfflineSkipabble, making do_migrate_range ignore it
- PageOfflineSkipabble will be marked PageOffline upon re-onlining? Will
still be marked as PageOfflineSkipabble?
> +
> + /*
> + * Only PageOffline() pages that are marked "skippable" cannot
> + * be migrated but can be skipped when offlining. See
It is probably me, and nevermind the comment but I somehow find
"PageOfflineSkipabble are not migrated but skipped when offlining" a bit
easier.
--
Oscar Salvador
SUSE Labs
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v2 1/1] mm/memory_hotplug: PG_offline_skippable for offlining memory blocks with PageOffline pages
2025-05-21 13:32 ` Oscar Salvador
@ 2025-05-21 13:44 ` David Hildenbrand
2025-05-23 9:18 ` Oscar Salvador
0 siblings, 1 reply; 8+ messages in thread
From: David Hildenbrand @ 2025-05-21 13:44 UTC (permalink / raw)
To: Oscar Salvador
Cc: linux-kernel, linux-mm, virtualization, Michael S. Tsirkin,
Jason Wang, Xuan Zhuo, Eugenio Pérez, Andrew Morton,
Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Zi Yan, Matthew Wilcox (Oracle)
>>
>> Acked-by: Zi Yan <ziy@nvidia.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>
> Hi David, sorry for jumping in late
Hi,
not late at all :)
>
>> @@ -1157,6 +1083,7 @@ static void virtio_mem_set_fake_offline(unsigned long pfn,
>> SetPageDirty(page);
>> else
>> __SetPageOffline(page);
>> + __SetPageOfflineSkippable(page);
>> VM_WARN_ON_ONCE(!PageOffline(page));
>
> I think I am having some issues understanding this, let me see if I get
> it.
>
> - virtio-mem defines PageOffline pages, which are logically-offlined
> pages within an onlined memory-block
PageOffline is a more generic concept also e.g., used by memory
balloning. virtio-mem uses that concept, yes.
> - PageOffline pages have a refcount of '0' once they are properly
> initialized, meaning that refcount > 0 implies somebody is holding
> a refcount and that should not really happen
PageOffline pages currently always have a refcount > 0 (except
temporarily in this special case we are removing here). In the future,
it won't have a refcount at all.
> - logically-offline pages belonging to onlined memory-blocks are marked PageDirty,
> while logically-offlined pages we allocated via alloc_contig_range are marked
> PageOffline (I am getting a bit lost between fake-online, fake-offline, my fault)
No, the code is confusing.
All pages are PageOffline. Only the ones that have not been onlined are
*in addition* PageDirty.
The relevant bit is documented in page-flags.h:
"When a memory block gets onlined, all pages are initialized with a
refcount of 1 and PageOffline(). generic_online_page() will take care of
clearing PageOffline()."
> - If we want to release logically-offline pages belonging to an onlined memory-block,
> we ClearDirty them and be done
PageOffline gets cleared in both cases: See the comment in
virtio_mem_clear_fake_offline()
"/* generic_online_page() will clear PageOffline(). */"
I'll note that I am planning on removing that PageDirty() handling
completely, and also letting handling PageOffline() clearing be always
performed by memory freeing core (the latter is easier to achieve).
> - If we want to release logically-offlined pages belonging we allocated
> via alloc_contig_range, we clear PageOffline and be done
Yes, in the future that clearing will be done by the core (so the plan).
> - PageOfflineSkipabble are unmovable PageOffline pages, which cannot
be migrated?
Yes, but they can simply be skipped -- consider them memory holes we
(the driver) can rediscover when re-onlining the memory. See below.
> - So for a PageOffline to be able to be migrated away must be Movable or
> marked PageOfflineSkipabble, making do_migrate_range ignore it
Yes.
> - PageOfflineSkipabble will be marked PageOffline upon re-onlining? Will
> still be marked as PageOfflineSkipabble?
When re-onlining, the core will set them all PageOffline, and virtio-mem
will intercept page onlining using the page_online_cb.
virtio-mem will then online the actually plugged parts (->
generic_online_page(), which clears PageOffline and exposes them to the
buddy) and set the unplugged/hole parts as PageOfflineSkipabble again.
That logic resides in virtio_mem_online_page_cb().
>
>> +
>> + /*
>> + * Only PageOffline() pages that are marked "skippable" cannot
>> + * be migrated but can be skipped when offlining. See
>
> It is probably me, and nevermind the comment but I somehow find
> "PageOfflineSkipabble are not migrated but skipped when offlining" a bit
> easier.
Definitely. Thanks!
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v2 1/1] mm/memory_hotplug: PG_offline_skippable for offlining memory blocks with PageOffline pages
2025-05-21 13:44 ` David Hildenbrand
@ 2025-05-23 9:18 ` Oscar Salvador
0 siblings, 0 replies; 8+ messages in thread
From: Oscar Salvador @ 2025-05-23 9:18 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, virtualization, Michael S. Tsirkin,
Jason Wang, Xuan Zhuo, Eugenio Pérez, Andrew Morton,
Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Zi Yan, Matthew Wilcox (Oracle)
On Wed, May 21, 2025 at 03:44:32PM +0200, David Hildenbrand wrote:
> No, the code is confusing.
>
> All pages are PageOffline. Only the ones that have not been onlined are *in
> addition* PageDirty.
>
> The relevant bit is documented in page-flags.h:
>
> "When a memory block gets onlined, all pages are initialized with a refcount
> of 1 and PageOffline(). generic_online_page() will take care of clearing
> PageOffline()."
Ah ok, I see it in __free_pages_core().
> > - If we want to release logically-offline pages belonging to an onlined memory-block,
> > we ClearDirty them and be done
>
> PageOffline gets cleared in both cases: See the comment in
> virtio_mem_clear_fake_offline()
>
> "/* generic_online_page() will clear PageOffline(). */"
>
> I'll note that I am planning on removing that PageDirty() handling
> completely, and also letting handling PageOffline() clearing be always
> performed by memory freeing core (the latter is easier to achieve).
Cool, I think that would be much clearer.
> When re-onlining, the core will set them all PageOffline, and virtio-mem
> will intercept page onlining using the page_online_cb.
>
> virtio-mem will then online the actually plugged parts (->
> generic_online_page(), which clears PageOffline and exposes them to the
> buddy) and set the unplugged/hole parts as PageOfflineSkipabble again.
>
> That logic resides in virtio_mem_online_page_cb().
Sorry, I had to re-cache this.
Ok, I think that now I caught up with the code.
I see that we mark it Offline in memmap_init_range(), and then
__free_pages_core() will clear the flag before releasing them to the
buddy.
Ok, I think it is much clear now, thanks for helping me out with the
details!
--
Oscar Salvador
SUSE Labs
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v2 1/1] mm/memory_hotplug: PG_offline_skippable for offlining memory blocks with PageOffline pages
2025-05-20 16:42 ` [PATCH v2 1/1] mm/memory_hotplug: PG_offline_skippable for offlining memory blocks with PageOffline pages David Hildenbrand
2025-05-21 13:32 ` Oscar Salvador
@ 2025-05-23 9:20 ` Oscar Salvador
2025-05-30 16:54 ` Vlastimil Babka
2 siblings, 0 replies; 8+ messages in thread
From: Oscar Salvador @ 2025-05-23 9:20 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-kernel, linux-mm, virtualization, Michael S. Tsirkin,
Jason Wang, Xuan Zhuo, Eugenio Pérez, Andrew Morton,
Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
Brendan Jackman, Johannes Weiner, Zi Yan, Matthew Wilcox (Oracle)
On Tue, May 20, 2025 at 06:42:11PM +0200, David Hildenbrand wrote:
> A long-term goal is supporting frozen PageOffline pages, and later
> PageOffline pages that don't have a refcount at all. Some more work for
> that is needed -- in particular around non-folio page migration and
> memory ballooning drivers -- but let's start by handling PageOffline pages
> that can be skipped during memory offlining differently.
>
> Note that PageOffline is used to mark pages that are logically offline
> in an otherwise online memory block (e.g., 128 MiB). If a memory
> block is offline, the memmap is considered compeltely uninitialized
> and stale (see pfn_to_online_page()).
>
> Let's introduce a PageOffline specific page flag (PG_offline_skippable)
> that for now reuses PG_owner_2. In the memdesc future, it will be one of
> a small number of per-memdesc flags stored alongside the type.
>
> By setting PG_offline_skippable, a driver indicates that it can
> restore the PageOffline state of these specific pages when re-onlining a
> memory block: it knows that these pages are supposed to be PageOffline()
> without the information in the vmemmap, so it can filter them out and
> not expose them to the buddy -> they stay PageOffline().
>
> While PG_offline_offlineable might be clearer, it is also super
> confusing. Alternatives (PG_offline_sticky?) also don't quite feel right.
> So let's use "skippable" for now.
>
> The flag is not supposed to be used for movable PageOffline pages as
> used for balloon compaction; movable PageOffline() pages can simply be
> migrated during the memory offlining stage, turning the migration
> destination page PageOffline() and turning the migration source page
> into a free buddy page.
>
> Let's convert the single user from our MEM_GOING_OFFLINE approach
> to the new PG_offline_skippable approach: virtio-mem. Fortunately,
> this simplifies the code quite a lot. The only corner case we have to
> take care of is when force-unloading the virtio-mem driver: we have to
> prevent partially-plugged memory blocks from getting offlined by
> clearing PG_offline_skippable again.
>
> What if someone decides to grab a reference on these pages although they
> really shouldn't? After all, we'll now keep the refcount at 1 (until we
> can properly stop using the refcount completely).
>
> Well, less worse things will happen than would currently: currently,
> if someone would grab a reference to these pages, in MEM_GOING_OFFLINE
> we would run into the
> if (WARN_ON(!page_ref_dec_and_test(page)))
> dump_page(page, "fake-offline page referenced");
>
> And once that unexpected reference would get dropped, we would end up
> freeing that page to the buddy: ouch.
>
> Now, we'll allow for offlining that memory, and when that unexpected
> reference would get dropped, we would not end up freeing that page to
> the buddy. Once we have frozen PageOffline() pages, it will all get a
> lot cleaner.
>
> Note that we didn't see the existing WARN_ON so far, because nobody
> should ever be referencing such pages.
>
> An alternative might be to have another callback chain from memory hotplug
> code, where a driver that owns that page could agree to skip the
> PageOffline() page. However, we would have to repeatedly issue these
> callbacks for individual PageOffline() pages, which does not sound
> compelling. As we have spare bits, let's use this simpler approach for
> now.
>
> Acked-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Oscar Salvador <osalvador@suse.de>
--
Oscar Salvador
SUSE Labs
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v2 1/1] mm/memory_hotplug: PG_offline_skippable for offlining memory blocks with PageOffline pages
2025-05-20 16:42 ` [PATCH v2 1/1] mm/memory_hotplug: PG_offline_skippable for offlining memory blocks with PageOffline pages David Hildenbrand
2025-05-21 13:32 ` Oscar Salvador
2025-05-23 9:20 ` Oscar Salvador
@ 2025-05-30 16:54 ` Vlastimil Babka
2025-06-02 20:55 ` David Hildenbrand
2 siblings, 1 reply; 8+ messages in thread
From: Vlastimil Babka @ 2025-05-30 16:54 UTC (permalink / raw)
To: David Hildenbrand, linux-kernel
Cc: linux-mm, virtualization, Michael S. Tsirkin, Jason Wang,
Xuan Zhuo, Eugenio Pérez, Andrew Morton, Oscar Salvador,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Matthew Wilcox (Oracle)
On 5/20/25 18:42, David Hildenbrand wrote:
> A long-term goal is supporting frozen PageOffline pages, and later
> PageOffline pages that don't have a refcount at all. Some more work for
Looking forward to that :)
> that is needed -- in particular around non-folio page migration and
> memory ballooning drivers -- but let's start by handling PageOffline pages
> that can be skipped during memory offlining differently.
>
> Note that PageOffline is used to mark pages that are logically offline
> in an otherwise online memory block (e.g., 128 MiB). If a memory
> block is offline, the memmap is considered compeltely uninitialized
> and stale (see pfn_to_online_page()).
>
> Let's introduce a PageOffline specific page flag (PG_offline_skippable)
> that for now reuses PG_owner_2. In the memdesc future, it will be one of
> a small number of per-memdesc flags stored alongside the type.
>
> By setting PG_offline_skippable, a driver indicates that it can
> restore the PageOffline state of these specific pages when re-onlining a
> memory block: it knows that these pages are supposed to be PageOffline()
> without the information in the vmemmap, so it can filter them out and
> not expose them to the buddy -> they stay PageOffline().
>
> While PG_offline_offlineable might be clearer, it is also super
> confusing. Alternatives (PG_offline_sticky?) also don't quite feel right.
> So let's use "skippable" for now.
>
> The flag is not supposed to be used for movable PageOffline pages as
> used for balloon compaction; movable PageOffline() pages can simply be
> migrated during the memory offlining stage, turning the migration
> destination page PageOffline() and turning the migration source page
> into a free buddy page.
>
> Let's convert the single user from our MEM_GOING_OFFLINE approach
> to the new PG_offline_skippable approach: virtio-mem. Fortunately,
> this simplifies the code quite a lot. The only corner case we have to
> take care of is when force-unloading the virtio-mem driver: we have to
> prevent partially-plugged memory blocks from getting offlined by
> clearing PG_offline_skippable again.
>
> What if someone decides to grab a reference on these pages although they
> really shouldn't? After all, we'll now keep the refcount at 1 (until we
> can properly stop using the refcount completely).
>
> Well, less worse things will happen than would currently: currently,
> if someone would grab a reference to these pages, in MEM_GOING_OFFLINE
> we would run into the
> if (WARN_ON(!page_ref_dec_and_test(page)))
> dump_page(page, "fake-offline page referenced");
>
> And once that unexpected reference would get dropped, we would end up
> freeing that page to the buddy: ouch.
>
> Now, we'll allow for offlining that memory, and when that unexpected
> reference would get dropped, we would not end up freeing that page to
> the buddy. Once we have frozen PageOffline() pages, it will all get a
> lot cleaner.
Hmm, a question on that later in the code (assuming I identified the right
place).
> Note that we didn't see the existing WARN_ON so far, because nobody
> should ever be referencing such pages.
It's mostly a speculative refcount increase from a pfn walker, such as
compaction scanner, that can happen due to its inherent raciness.
> An alternative might be to have another callback chain from memory hotplug
> code, where a driver that owns that page could agree to skip the
> PageOffline() page. However, we would have to repeatedly issue these
> callbacks for individual PageOffline() pages, which does not sound
> compelling. As we have spare bits, let's use this simpler approach for
> now.
>
> Acked-by: Zi Yan <ziy@nvidia.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz> # page allocator
I'll leave hotplug to the experts :)
<snip>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f6482223e28a2..7e4c41e46a911 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7023,12 +7023,12 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
> continue;
> }
> /*
> - * At this point all remaining PageOffline() pages have a
> - * reference count of 0 and can simply be skipped.
> + * At this point all remaining PageOffline() pages must be
> + * "skippable" and have exactly one reference.
> */
> if (PageOffline(page)) {
> - BUG_ON(page_count(page));
> - BUG_ON(PageBuddy(page));
> + WARN_ON_ONCE(!PageOfflineSkippable(page));
> + WARN_ON_ONCE(page_count(page) != 1);
So is this the part where an unexpected speculative refcount might be
detected? Should be harmless then as it will then decrease the refcount from
e.g. 2 to 1 and nothing will happen right.
That's assuming that once we pass __offline_isolated_pages(), the following
actions wont modify the refcount or the struct page won't be zeroed, or
removed completely (vmemmap). Probably something already prevents that...
> already_offline++;
> pfn++;
> continue;
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index b2fc5266e3d26..debd898b794ea 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -121,16 +121,11 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
> continue;
>
> /*
> - * We treat all PageOffline() pages as movable when offlining
> - * to give drivers a chance to decrement their reference count
> - * in MEM_GOING_OFFLINE in order to indicate that these pages
> - * can be offlined as there are no direct references anymore.
> - * For actually unmovable PageOffline() where the driver does
> - * not support this, we will fail later when trying to actually
> - * move these pages that still have a reference count > 0.
> - * (false negatives in this function only)
> + * PageOffline() pages that are marked as "skippable"
> + * are treated like movable pages for memory offlining.
> */
> - if ((flags & MEMORY_OFFLINE) && PageOffline(page))
> + if ((flags & MEMORY_OFFLINE) && PageOffline(page) &&
> + PageOfflineSkippable(page))
> continue;
>
> if (__PageMovable(page) || PageLRU(page))
> @@ -577,11 +572,11 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
> /* A HWPoisoned page cannot be also PageBuddy */
> pfn++;
> else if ((flags & MEMORY_OFFLINE) && PageOffline(page) &&
> - !page_count(page))
> + PageOfflineSkippable(page))
> /*
> - * The responsible driver agreed to skip PageOffline()
> - * pages when offlining memory by dropping its
> - * reference in MEM_GOING_OFFLINE.
> + * If the page is a skippable PageOffline() page,
> + * we can offline the memory block, as the driver will
> + * re-discover them when re-onlining the memory.
> */
> pfn++;
> else
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [PATCH v2 1/1] mm/memory_hotplug: PG_offline_skippable for offlining memory blocks with PageOffline pages
2025-05-30 16:54 ` Vlastimil Babka
@ 2025-06-02 20:55 ` David Hildenbrand
0 siblings, 0 replies; 8+ messages in thread
From: David Hildenbrand @ 2025-06-02 20:55 UTC (permalink / raw)
To: Vlastimil Babka, linux-kernel
Cc: linux-mm, virtualization, Michael S. Tsirkin, Jason Wang,
Xuan Zhuo, Eugenio Pérez, Andrew Morton, Oscar Salvador,
Suren Baghdasaryan, Michal Hocko, Brendan Jackman,
Johannes Weiner, Zi Yan, Matthew Wilcox (Oracle)
On 30.05.25 18:54, Vlastimil Babka wrote:
> On 5/20/25 18:42, David Hildenbrand wrote:
>> A long-term goal is supporting frozen PageOffline pages, and later
>> PageOffline pages that don't have a refcount at all. Some more work for
>
> Looking forward to that :)
:) It's definitely ... challenging.
>
>> that is needed -- in particular around non-folio page migration and
>> memory ballooning drivers -- but let's start by handling PageOffline pages
>> that can be skipped during memory offlining differently.
>>
>> Note that PageOffline is used to mark pages that are logically offline
>> in an otherwise online memory block (e.g., 128 MiB). If a memory
>> block is offline, the memmap is considered compeltely uninitialized
>> and stale (see pfn_to_online_page()).
>>
>> Let's introduce a PageOffline specific page flag (PG_offline_skippable)
>> that for now reuses PG_owner_2. In the memdesc future, it will be one of
>> a small number of per-memdesc flags stored alongside the type.
>>
>> By setting PG_offline_skippable, a driver indicates that it can
>> restore the PageOffline state of these specific pages when re-onlining a
>> memory block: it knows that these pages are supposed to be PageOffline()
>> without the information in the vmemmap, so it can filter them out and
>> not expose them to the buddy -> they stay PageOffline().
>>
>> While PG_offline_offlineable might be clearer, it is also super
>> confusing. Alternatives (PG_offline_sticky?) also don't quite feel right.
>> So let's use "skippable" for now.
>>
>> The flag is not supposed to be used for movable PageOffline pages as
>> used for balloon compaction; movable PageOffline() pages can simply be
>> migrated during the memory offlining stage, turning the migration
>> destination page PageOffline() and turning the migration source page
>> into a free buddy page.
>>
>> Let's convert the single user from our MEM_GOING_OFFLINE approach
>> to the new PG_offline_skippable approach: virtio-mem. Fortunately,
>> this simplifies the code quite a lot. The only corner case we have to
>> take care of is when force-unloading the virtio-mem driver: we have to
>> prevent partially-plugged memory blocks from getting offlined by
>> clearing PG_offline_skippable again.
>>
>> What if someone decides to grab a reference on these pages although they
>> really shouldn't? After all, we'll now keep the refcount at 1 (until we
>> can properly stop using the refcount completely).
>>
>> Well, less worse things will happen than would currently: currently,
>> if someone would grab a reference to these pages, in MEM_GOING_OFFLINE
>> we would run into the
>> if (WARN_ON(!page_ref_dec_and_test(page)))
>> dump_page(page, "fake-offline page referenced");
>>
>> And once that unexpected reference would get dropped, we would end up
>> freeing that page to the buddy: ouch.
>>
>> Now, we'll allow for offlining that memory, and when that unexpected
>> reference would get dropped, we would not end up freeing that page to
>> the buddy. Once we have frozen PageOffline() pages, it will all get a
>> lot cleaner.
>
> Hmm, a question on that later in the code (assuming I identified the right
> place).
>
>> Note that we didn't see the existing WARN_ON so far, because nobody
>> should ever be referencing such pages.
>
> It's mostly a speculative refcount increase from a pfn walker, such as
> compaction scanner, that can happen due to its inherent raciness.
PFN walkers are in general careful to only try grabbing a reference if
the target page is a possible candidate.
E.g., compaction checks for LRU or movable non-folio pages before
calling folio_get_nontail_page().
KSM uses MEM_GOING_OFFLINE to pause scanning (and thereby
folio_try_get()) while memory is getting offlined, so it won't try
accessing any page in the trees that might now be stale.
>
>> An alternative might be to have another callback chain from memory hotplug
>> code, where a driver that owns that page could agree to skip the
>> PageOffline() page. However, we would have to repeatedly issue these
>> callbacks for individual PageOffline() pages, which does not sound
>> compelling. As we have spare bits, let's use this simpler approach for
>> now.
>>
>> Acked-by: Zi Yan <ziy@nvidia.com>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>
> Acked-by: Vlastimil Babka <vbabka@suse.cz> # page allocator
>
> I'll leave hotplug to the experts :)
>
> <snip>
>
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index f6482223e28a2..7e4c41e46a911 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -7023,12 +7023,12 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
>> continue;
>> }
>> /*
>> - * At this point all remaining PageOffline() pages have a
>> - * reference count of 0 and can simply be skipped.
>> + * At this point all remaining PageOffline() pages must be
>> + * "skippable" and have exactly one reference.
>> */
>> if (PageOffline(page)) {
>> - BUG_ON(page_count(page));
>> - BUG_ON(PageBuddy(page));
>> + WARN_ON_ONCE(!PageOfflineSkippable(page));
>> + WARN_ON_ONCE(page_count(page) != 1);
>
> So is this the part where an unexpected speculative refcount might be
> detected? Should be harmless then as it will then decrease the refcount from
> e.g. 2 to 1 and nothing will happen right.
Yes, unless it would be taken for a longer time, which is not what
usually happens: speculative references are immediately dropped if the
target page is of no use.
> That's assuming that once we pass __offline_isolated_pages(), the following
> actions wont modify the refcount or the struct page won't be zeroed, or
> removed completely (vmemmap). Probably something already prevents that...
Right, we'd have similar problems also if the refcount would be 0 and
someone would do a folio_try_get(), while we are tearing down or
poisoning the vmemmap.
The idea here is that these folios were PageOffline() already for a
while (before even starting memory offlining), so any speculative
references from pre-PageOffline times should long be gone.
It's still racy, but not more racy than our existing
pfn_to_online_page() race, or the race we already had in the old code.
Thanks!
--
Cheers,
David / dhildenb
^ permalink raw reply [flat|nested] 8+ messages in thread
end of thread, other threads:[~2025-06-02 20:56 UTC | newest]
Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-20 16:42 [PATCH v2 0/1] mm/memory_hotplug: introduce and use PG_offline_skippable David Hildenbrand
2025-05-20 16:42 ` [PATCH v2 1/1] mm/memory_hotplug: PG_offline_skippable for offlining memory blocks with PageOffline pages David Hildenbrand
2025-05-21 13:32 ` Oscar Salvador
2025-05-21 13:44 ` David Hildenbrand
2025-05-23 9:18 ` Oscar Salvador
2025-05-23 9:20 ` Oscar Salvador
2025-05-30 16:54 ` Vlastimil Babka
2025-06-02 20:55 ` David Hildenbrand
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).