linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v1 0/2] mm/memory_hotplug: introduce and use PG_offline_skippable
@ 2025-05-14 11:15 David Hildenbrand
  2025-05-14 11:15 ` [PATCH v1 1/2] mm/memory_hotplug: PG_offline_skippable for offlining memory blocks with PageOffline pages David Hildenbrand
                   ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: David Hildenbrand @ 2025-05-14 11:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, virtualization, David Hildenbrand, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Andrew Morton,
	Oscar Salvador, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Matthew Wilcox (Oracle)

This is a requirement for making PageOffline pages not have a refcount
in the long future ("frozen"), and for reworking non-folio page migration
in the near future.

I have patches mostly ready to go to handle the latter. For turning all
PageOffline() pages frozen, the non-folio page migration and memory
ballooning drivers will have to be reworked first, to no longer rely on
the refcount of PageOffline pages.

Introduce PG_offline_skippable that only applies to PageOffline() pages --
of course, reusing one of the existing PG_ flags for now -- and convert
virtio-mem to make use of the new way: to allow for skipping PageOffline
pages during memory offlining, treating them as if they would not be
allocated.

Note that the existing mechanism relied on the driver (virtio-mem)
dropping its reference during MEM_GOING_OFFLINE, which is complicated and
not compatible with the concept of frozen pages (no refcount).

Tested with virtio-mem on x86, including  partially hotplugging a memory
block (hotplugging 64MiB with a 128 MiB memory block size), and repeatedly
onlining+offlining the memory block.

Cc: David Hildenbrand <david@redhat.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Cc: "Eugenio Pérez" <eperezma@redhat.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Oscar Salvador <osalvador@suse.de>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Zi Yan <ziy@nvidia.com>
Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>

David Hildenbrand (2):
  mm/memory_hotplug: PG_offline_skippable for offlining memory blocks
    with PageOffline pages
  mm/memory_hotplug: remove -EBUSY handling from scan_movable_pages()

 drivers/virtio/virtio_mem.c | 111 +-----------------------------------
 include/linux/page-flags.h  |  29 +++++++---
 mm/memory_hotplug.c         |  22 ++-----
 mm/page_alloc.c             |   8 +--
 mm/page_isolation.c         |  21 +++----
 5 files changed, 40 insertions(+), 151 deletions(-)


base-commit: 2f6baf8dadecc2bec7d6bc931f7e0d58d8443d76
-- 
2.49.0



^ permalink raw reply	[flat|nested] 14+ messages in thread

* [PATCH v1 1/2] mm/memory_hotplug: PG_offline_skippable for offlining memory blocks with PageOffline pages
  2025-05-14 11:15 [PATCH v1 0/2] mm/memory_hotplug: introduce and use PG_offline_skippable David Hildenbrand
@ 2025-05-14 11:15 ` David Hildenbrand
  2025-05-14 19:00   ` Zi Yan
  2025-05-14 11:15 ` [PATCH v1 2/2] mm/memory_hotplug: remove -EBUSY handling from scan_movable_pages() David Hildenbrand
  2025-05-14 13:45 ` [PATCH v1 0/2] mm/memory_hotplug: introduce and use PG_offline_skippable Zi Yan
  2 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand @ 2025-05-14 11:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, virtualization, David Hildenbrand, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Andrew Morton,
	Oscar Salvador, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Matthew Wilcox (Oracle)

A long-term goal is supporting frozen PageOffline pages, and later
PageOffline pages that don't have a refcount at all. Some more work for
that is needed -- in particular around non-folio page migration and
memory ballooning drivers -- but let's start by handling PageOffline pages
that can be skipped during memory offlining differently.

Let's introduce a PageOffline specific page flag (PG_offline_skippable)
that for now reuses PG_owner_2. In the memdesc future, it will be one of
a small number of per-memdesc flags stored alongside the type.

By setting PG_offline_skippable, a driver indicates that it can
restore the PageOffline state of these specific pages when re-onlining a
memory block: it knows that these pages are supposed to be PageOffline()
without the information in the vmemmap, so it can filter them out and
not expose them to the buddy -> they stay PageOffline().

While PG_offline_offlineable might be clearer, it is also super
confusing. Alternatives (PG_offline_sticky?) also don't quite feel right.
So let's use "skippable" for now.

The flag is not supposed to be used for movable PageOffline pages as
used for balloon compaction; movable PageOffline() pages can simply be
migrated during the memory offlining stage.

Let's convert the single user from our MEM_GOING_OFFLINE approach
to the new PG_offline_skippable approach: virtio-mem. Fortunately,
this simplifies the code quite a lot.

What if someone decides to grab a reference on these pages although they
really shouldn't? After all, we'll now keep the refcount at 1 (until we
can properly stop using the refcount completely).

Well, less worse things will happen than would currently: currently,
if someone would grab a reference to these pages, in MEM_GOING_OFFLINE
we would run into the
		if (WARN_ON(!page_ref_dec_and_test(page)))
			dump_page(page, "fake-offline page referenced");

And once that unexpected reference would get dropped, we would end up
freeing that page to the buddy: ouch.

Now, we'll allow for offlining that memory, and when that unexpected
reference would get dropped, we would not end up freeing that page to
the buddy. Once we have frozen PageOffline() pages, it will all get a
lot cleaner.

Note that we didn't see the existing WARN_ON so far, because nobody
should ever be referencing such pages.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 drivers/virtio/virtio_mem.c | 111 +-----------------------------------
 include/linux/page-flags.h  |  29 +++++++---
 mm/memory_hotplug.c         |  12 ++--
 mm/page_alloc.c             |   8 +--
 mm/page_isolation.c         |  21 +++----
 5 files changed, 42 insertions(+), 139 deletions(-)

diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
index 56d0dbe621637..77b72843c4b53 100644
--- a/drivers/virtio/virtio_mem.c
+++ b/drivers/virtio/virtio_mem.c
@@ -278,10 +278,6 @@ static DEFINE_MUTEX(virtio_mem_mutex);
 static LIST_HEAD(virtio_mem_devices);
 
 static void virtio_mem_online_page_cb(struct page *page, unsigned int order);
-static void virtio_mem_fake_offline_going_offline(unsigned long pfn,
-						  unsigned long nr_pages);
-static void virtio_mem_fake_offline_cancel_offline(unsigned long pfn,
-						   unsigned long nr_pages);
 static void virtio_mem_retry(struct virtio_mem *vm);
 static int virtio_mem_create_resource(struct virtio_mem *vm);
 static void virtio_mem_delete_resource(struct virtio_mem *vm);
@@ -924,64 +920,6 @@ static void virtio_mem_sbm_notify_online(struct virtio_mem *vm,
 	virtio_mem_sbm_set_mb_state(vm, mb_id, new_state);
 }
 
-static void virtio_mem_sbm_notify_going_offline(struct virtio_mem *vm,
-						unsigned long mb_id)
-{
-	const unsigned long nr_pages = PFN_DOWN(vm->sbm.sb_size);
-	unsigned long pfn;
-	int sb_id;
-
-	for (sb_id = 0; sb_id < vm->sbm.sbs_per_mb; sb_id++) {
-		if (virtio_mem_sbm_test_sb_plugged(vm, mb_id, sb_id, 1))
-			continue;
-		pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
-			       sb_id * vm->sbm.sb_size);
-		virtio_mem_fake_offline_going_offline(pfn, nr_pages);
-	}
-}
-
-static void virtio_mem_sbm_notify_cancel_offline(struct virtio_mem *vm,
-						 unsigned long mb_id)
-{
-	const unsigned long nr_pages = PFN_DOWN(vm->sbm.sb_size);
-	unsigned long pfn;
-	int sb_id;
-
-	for (sb_id = 0; sb_id < vm->sbm.sbs_per_mb; sb_id++) {
-		if (virtio_mem_sbm_test_sb_plugged(vm, mb_id, sb_id, 1))
-			continue;
-		pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
-			       sb_id * vm->sbm.sb_size);
-		virtio_mem_fake_offline_cancel_offline(pfn, nr_pages);
-	}
-}
-
-static void virtio_mem_bbm_notify_going_offline(struct virtio_mem *vm,
-						unsigned long bb_id,
-						unsigned long pfn,
-						unsigned long nr_pages)
-{
-	/*
-	 * When marked as "fake-offline", all online memory of this device block
-	 * is allocated by us. Otherwise, we don't have any memory allocated.
-	 */
-	if (virtio_mem_bbm_get_bb_state(vm, bb_id) !=
-	    VIRTIO_MEM_BBM_BB_FAKE_OFFLINE)
-		return;
-	virtio_mem_fake_offline_going_offline(pfn, nr_pages);
-}
-
-static void virtio_mem_bbm_notify_cancel_offline(struct virtio_mem *vm,
-						 unsigned long bb_id,
-						 unsigned long pfn,
-						 unsigned long nr_pages)
-{
-	if (virtio_mem_bbm_get_bb_state(vm, bb_id) !=
-	    VIRTIO_MEM_BBM_BB_FAKE_OFFLINE)
-		return;
-	virtio_mem_fake_offline_cancel_offline(pfn, nr_pages);
-}
-
 /*
  * This callback will either be called synchronously from add_memory() or
  * asynchronously (e.g., triggered via user space). We have to be careful
@@ -1040,12 +978,6 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
 			break;
 		}
 		vm->hotplug_active = true;
-		if (vm->in_sbm)
-			virtio_mem_sbm_notify_going_offline(vm, id);
-		else
-			virtio_mem_bbm_notify_going_offline(vm, id,
-							    mhp->start_pfn,
-							    mhp->nr_pages);
 		break;
 	case MEM_GOING_ONLINE:
 		mutex_lock(&vm->hotplug_mutex);
@@ -1094,12 +1026,6 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
 	case MEM_CANCEL_OFFLINE:
 		if (!vm->hotplug_active)
 			break;
-		if (vm->in_sbm)
-			virtio_mem_sbm_notify_cancel_offline(vm, id);
-		else
-			virtio_mem_bbm_notify_cancel_offline(vm, id,
-							     mhp->start_pfn,
-							     mhp->nr_pages);
 		vm->hotplug_active = false;
 		mutex_unlock(&vm->hotplug_mutex);
 		break;
@@ -1157,6 +1083,7 @@ static void virtio_mem_set_fake_offline(unsigned long pfn,
 			SetPageDirty(page);
 		else
 			__SetPageOffline(page);
+		__SetPageOfflineSkippable(page);
 		VM_WARN_ON_ONCE(!PageOffline(page));
 	}
 	page_offline_end();
@@ -1172,6 +1099,7 @@ static void virtio_mem_clear_fake_offline(unsigned long pfn,
 	for (; nr_pages--; pfn++) {
 		struct page *page = pfn_to_page(pfn);
 
+		__ClearPageOfflineSkippable(page);
 		if (!onlined)
 			/* generic_online_page() will clear PageOffline(). */
 			ClearPageDirty(page);
@@ -1261,41 +1189,6 @@ static int virtio_mem_fake_offline(struct virtio_mem *vm, unsigned long pfn,
 	return -EBUSY;
 }
 
-/*
- * Handle fake-offline pages when memory is going offline - such that the
- * pages can be skipped by mm-core when offlining.
- */
-static void virtio_mem_fake_offline_going_offline(unsigned long pfn,
-						  unsigned long nr_pages)
-{
-	struct page *page;
-	unsigned long i;
-
-	/* Drop our reference to the pages so the memory can get offlined. */
-	for (i = 0; i < nr_pages; i++) {
-		page = pfn_to_page(pfn + i);
-		if (WARN_ON(!page_ref_dec_and_test(page)))
-			dump_page(page, "fake-offline page referenced");
-	}
-}
-
-/*
- * Handle fake-offline pages when memory offlining is canceled - to undo
- * what we did in virtio_mem_fake_offline_going_offline().
- */
-static void virtio_mem_fake_offline_cancel_offline(unsigned long pfn,
-						   unsigned long nr_pages)
-{
-	unsigned long i;
-
-	/*
-	 * Get the reference again that we dropped via page_ref_dec_and_test()
-	 * when going offline.
-	 */
-	for (i = 0; i < nr_pages; i++)
-		page_ref_inc(pfn_to_page(pfn + i));
-}
-
 static void virtio_mem_online_page(struct virtio_mem *vm,
 				   struct page *page, unsigned int order)
 {
diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 1c1d49554c71a..581510ae8e83a 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -178,6 +178,9 @@ enum pageflags {
 	PG_vmemmap_self_hosted = PG_owner_priv_1,
 #endif
 
+	/* The driver allows for skipping these pages during memory offlining */
+	PG_offline_skippable = PG_owner_2,
+
 	/*
 	 * Flags only valid for compound pages.  Stored in first tail page's
 	 * flags word.  Cannot use the first 8 flags or any flag marked as
@@ -1029,14 +1032,23 @@ PAGE_TYPE_OPS(Buddy, buddy, buddy)
  * refcount of 1 and PageOffline(). generic_online_page() will
  * take care of clearing PageOffline().
  *
- * If a driver wants to allow to offline unmovable PageOffline() pages without
- * putting them back to the buddy, it can do so via the memory notifier by
- * decrementing the reference count in MEM_GOING_OFFLINE and incrementing the
- * reference count in MEM_CANCEL_OFFLINE. When offlining, the PageOffline()
- * pages (now with a reference count of zero) are treated like free (unmanaged)
- * pages, allowing the containing memory block to get offlined. A driver that
- * relies on this feature is aware that re-onlining the memory block will
- * require not giving them to the buddy via generic_online_page().
+ * If a driver wants to allow for offlining unmovable PageOffline() pages
+ * when offlining a memory block without exposing these pages as "free" to
+ * the buddy, it can mark them as PG_offline_skippable.
+ *
+ * By marking these PageOffline pages PG_offline_skippable, the driver
+ * acknowledges that it
+ * (a) knows which pages are PageOffline even without the memmap.
+ * (b) implements the online_page_callback_t callback to exclude these pages
+ *     from getting exposed to the buddy, so they will stay PageOffline()
+ *     when onlining a memory block.
+ * (c) synchronizes against concurrent memory onlining/offlining whenever
+ *     adjusting the PG_offline_skippable flag.
+ *
+ * Note that the refcount of these pages will for now be assumed to always
+ * be 1: nobody except the owner should be referencing these pages.
+ * Offlining code will complain if the refcount is not 1. In the future,
+ * these pages will always be frozen and not have a refcount.
  *
  * Memory offlining code will not adjust the managed page count for any
  * PageOffline() pages, treating them like they were never exposed to the
@@ -1048,6 +1060,7 @@ PAGE_TYPE_OPS(Buddy, buddy, buddy)
  * page_offline_freeze()/page_offline_thaw().
  */
 PAGE_TYPE_OPS(Offline, offline, offline)
+__PAGEFLAG(OfflineSkippable, offline_skippable, PF_NO_COMPOUND)
 
 extern void page_offline_freeze(void);
 extern void page_offline_thaw(void);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index b1caedbade5b1..0cc5537f234bb 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1767,12 +1767,10 @@ static int scan_movable_pages(unsigned long start, unsigned long end,
 			goto found;
 
 		/*
-		 * PageOffline() pages that are not marked __PageMovable() and
-		 * have a reference count > 0 (after MEM_GOING_OFFLINE) are
-		 * definitely unmovable. If their reference count would be 0,
-		 * they could at least be skipped when offlining memory.
+		 * PageOffline() pages that are neither "movable" nor
+		 * "skippable" prevent memory offlining.
 		 */
-		if (PageOffline(page) && page_count(page))
+		if (PageOffline(page) && !PageOfflineSkippable(page))
 			return -EBUSY;
 
 		if (!PageHuge(page))
@@ -1807,6 +1805,10 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
 		struct page *page;
 
 		page = pfn_to_page(pfn);
+
+		if (PageOffline(page) && PageOfflineSkippable(page))
+			continue;
+
 		folio = page_folio(page);
 
 		if (!folio_try_get(folio))
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a6fe1e9b95941..325b51c905076 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -7023,12 +7023,12 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
 			continue;
 		}
 		/*
-		 * At this point all remaining PageOffline() pages have a
-		 * reference count of 0 and can simply be skipped.
+		 * At this point all remaining PageOffline() pages must be
+		 * "skippable" and have exactly one reference.
 		 */
 		if (PageOffline(page)) {
-			BUG_ON(page_count(page));
-			BUG_ON(PageBuddy(page));
+			WARN_ON_ONCE(!PageOfflineSkippable(page));
+			WARN_ON_ONCE(page_count(page) != 1);
 			already_offline++;
 			pfn++;
 			continue;
diff --git a/mm/page_isolation.c b/mm/page_isolation.c
index b2fc5266e3d26..debd898b794ea 100644
--- a/mm/page_isolation.c
+++ b/mm/page_isolation.c
@@ -121,16 +121,11 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
 			continue;
 
 		/*
-		 * We treat all PageOffline() pages as movable when offlining
-		 * to give drivers a chance to decrement their reference count
-		 * in MEM_GOING_OFFLINE in order to indicate that these pages
-		 * can be offlined as there are no direct references anymore.
-		 * For actually unmovable PageOffline() where the driver does
-		 * not support this, we will fail later when trying to actually
-		 * move these pages that still have a reference count > 0.
-		 * (false negatives in this function only)
+		 * PageOffline() pages that are marked as "skippable"
+		 * are treated like movable pages for memory offlining.
 		 */
-		if ((flags & MEMORY_OFFLINE) && PageOffline(page))
+		if ((flags & MEMORY_OFFLINE) && PageOffline(page) &&
+		    PageOfflineSkippable(page))
 			continue;
 
 		if (__PageMovable(page) || PageLRU(page))
@@ -577,11 +572,11 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
 			/* A HWPoisoned page cannot be also PageBuddy */
 			pfn++;
 		else if ((flags & MEMORY_OFFLINE) && PageOffline(page) &&
-			 !page_count(page))
+			 PageOfflineSkippable(page))
 			/*
-			 * The responsible driver agreed to skip PageOffline()
-			 * pages when offlining memory by dropping its
-			 * reference in MEM_GOING_OFFLINE.
+			 * If the page is a skippable PageOffline() page,
+			 * we can offline the memory block, as the driver will
+			 * re-discover them when re-onlining the memory.
 			 */
 			pfn++;
 		else
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* [PATCH v1 2/2] mm/memory_hotplug: remove -EBUSY handling from scan_movable_pages()
  2025-05-14 11:15 [PATCH v1 0/2] mm/memory_hotplug: introduce and use PG_offline_skippable David Hildenbrand
  2025-05-14 11:15 ` [PATCH v1 1/2] mm/memory_hotplug: PG_offline_skippable for offlining memory blocks with PageOffline pages David Hildenbrand
@ 2025-05-14 11:15 ` David Hildenbrand
  2025-05-14 19:57   ` David Hildenbrand
  2025-05-14 13:45 ` [PATCH v1 0/2] mm/memory_hotplug: introduce and use PG_offline_skippable Zi Yan
  2 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand @ 2025-05-14 11:15 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, virtualization, David Hildenbrand, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Andrew Morton,
	Oscar Salvador, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Matthew Wilcox (Oracle)

Now that we can reliably identify PageOffline() pages that allow for
memory offlining in has_unmovable_pages(), start_isolate_page_range()
will fail on PageOffline() pages that would prevent memory offlining, and
we no longer have to detect them in scan_movable_pages() anymore.

Note that the previous mechanism relied on MEM_GOING_OFFLINE, whereby we
were not able to distinguish the types of PageOffline() before
MEM_GOING_OFFLINE.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory_hotplug.c | 16 +---------------
 1 file changed, 1 insertion(+), 15 deletions(-)

diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0cc5537f234bb..beace5b695aee 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1743,13 +1743,11 @@ bool mhp_range_allowed(u64 start, u64 size, bool need_mapping)
 /*
  * Scan pfn range [start,end) to find movable/migratable pages (LRU pages,
  * non-lru movable pages and hugepages). Will skip over most unmovable
- * pages (esp., pages that can be skipped when offlining), but bail out on
- * definitely unmovable pages.
+ * pages (esp., pages that can be skipped when offlining).
  *
  * Returns:
  *	0 in case a movable page is found and movable_pfn was updated.
  *	-ENOENT in case no movable page was found.
- *	-EBUSY in case a definitely unmovable page was found.
  */
 static int scan_movable_pages(unsigned long start, unsigned long end,
 			      unsigned long *movable_pfn)
@@ -1766,13 +1764,6 @@ static int scan_movable_pages(unsigned long start, unsigned long end,
 		if (__PageMovable(page))
 			goto found;
 
-		/*
-		 * PageOffline() pages that are neither "movable" nor
-		 * "skippable" prevent memory offlining.
-		 */
-		if (PageOffline(page) && !PageOfflineSkippable(page))
-			return -EBUSY;
-
 		if (!PageHuge(page))
 			continue;
 		folio = page_folio(page);
@@ -2051,11 +2042,6 @@ int offline_pages(unsigned long start_pfn, unsigned long nr_pages,
 			}
 		} while (!ret);
 
-		if (ret != -ENOENT) {
-			reason = "unmovable page";
-			goto failed_removal_isolated;
-		}
-
 		/*
 		 * Dissolve free hugetlb folios in the memory block before doing
 		 * offlining actually in order to make hugetlbfs's object
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 14+ messages in thread

* Re: [PATCH v1 0/2] mm/memory_hotplug: introduce and use PG_offline_skippable
  2025-05-14 11:15 [PATCH v1 0/2] mm/memory_hotplug: introduce and use PG_offline_skippable David Hildenbrand
  2025-05-14 11:15 ` [PATCH v1 1/2] mm/memory_hotplug: PG_offline_skippable for offlining memory blocks with PageOffline pages David Hildenbrand
  2025-05-14 11:15 ` [PATCH v1 2/2] mm/memory_hotplug: remove -EBUSY handling from scan_movable_pages() David Hildenbrand
@ 2025-05-14 13:45 ` Zi Yan
  2025-05-14 14:12   ` David Hildenbrand
  2 siblings, 1 reply; 14+ messages in thread
From: Zi Yan @ 2025-05-14 13:45 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, virtualization, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Andrew Morton,
	Oscar Salvador, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Matthew Wilcox (Oracle)

On 14 May 2025, at 7:15, David Hildenbrand wrote:

> This is a requirement for making PageOffline pages not have a refcount
> in the long future ("frozen"), and for reworking non-folio page migration
> in the near future.
>
> I have patches mostly ready to go to handle the latter. For turning all
> PageOffline() pages frozen, the non-folio page migration and memory
> ballooning drivers will have to be reworked first, to no longer rely on
> the refcount of PageOffline pages.
>
> Introduce PG_offline_skippable that only applies to PageOffline() pages --
> of course, reusing one of the existing PG_ flags for now -- and convert
> virtio-mem to make use of the new way: to allow for skipping PageOffline
> pages during memory offlining, treating them as if they would not be
> allocated.

IIUC, based on Documentation/admin-guide/mm/memory-hotplug.rst,
to offline a page, the page first needs to be set PageOffline() to be
removed from page allocator. Next, the page is removed from its memory
block. When will PG_offline_skippable be used? The second phase when
the page is being removed from its memory block?

Thanks.

>
> Note that the existing mechanism relied on the driver (virtio-mem)
> dropping its reference during MEM_GOING_OFFLINE, which is complicated and
> not compatible with the concept of frozen pages (no refcount).
>
> Tested with virtio-mem on x86, including  partially hotplugging a memory
> block (hotplugging 64MiB with a 128 MiB memory block size), and repeatedly
> onlining+offlining the memory block.
>
> Cc: David Hildenbrand <david@redhat.com>
> Cc: "Michael S. Tsirkin" <mst@redhat.com>
> Cc: Jason Wang <jasowang@redhat.com>
> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com>
> Cc: "Eugenio Pérez" <eperezma@redhat.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: Oscar Salvador <osalvador@suse.de>
> Cc: Vlastimil Babka <vbabka@suse.cz>
> Cc: Suren Baghdasaryan <surenb@google.com>
> Cc: Michal Hocko <mhocko@suse.com>
> Cc: Brendan Jackman <jackmanb@google.com>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Zi Yan <ziy@nvidia.com>
> Cc: "Matthew Wilcox (Oracle)" <willy@infradead.org>
>
> David Hildenbrand (2):
>   mm/memory_hotplug: PG_offline_skippable for offlining memory blocks
>     with PageOffline pages
>   mm/memory_hotplug: remove -EBUSY handling from scan_movable_pages()
>
>  drivers/virtio/virtio_mem.c | 111 +-----------------------------------
>  include/linux/page-flags.h  |  29 +++++++---
>  mm/memory_hotplug.c         |  22 ++-----
>  mm/page_alloc.c             |   8 +--
>  mm/page_isolation.c         |  21 +++----
>  5 files changed, 40 insertions(+), 151 deletions(-)
>
>
> base-commit: 2f6baf8dadecc2bec7d6bc931f7e0d58d8443d76
> -- 
> 2.49.0


--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1 0/2] mm/memory_hotplug: introduce and use PG_offline_skippable
  2025-05-14 13:45 ` [PATCH v1 0/2] mm/memory_hotplug: introduce and use PG_offline_skippable Zi Yan
@ 2025-05-14 14:12   ` David Hildenbrand
  2025-05-14 15:49     ` Zi Yan
  0 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand @ 2025-05-14 14:12 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-kernel, linux-mm, virtualization, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Andrew Morton,
	Oscar Salvador, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Matthew Wilcox (Oracle)

On 14.05.25 15:45, Zi Yan wrote:
> On 14 May 2025, at 7:15, David Hildenbrand wrote:
> 
>> This is a requirement for making PageOffline pages not have a refcount
>> in the long future ("frozen"), and for reworking non-folio page migration
>> in the near future.
>>
>> I have patches mostly ready to go to handle the latter. For turning all
>> PageOffline() pages frozen, the non-folio page migration and memory
>> ballooning drivers will have to be reworked first, to no longer rely on
>> the refcount of PageOffline pages.
>>
>> Introduce PG_offline_skippable that only applies to PageOffline() pages --
>> of course, reusing one of the existing PG_ flags for now -- and convert
>> virtio-mem to make use of the new way: to allow for skipping PageOffline
>> pages during memory offlining, treating them as if they would not be
>> allocated.
> 

Thanks for taking a look!

> IIUC, based on Documentation/admin-guide/mm/memory-hotplug.rst,
> to offline a page, the page first needs to be set PageOffline() to be

PageOffline is not mentioned in there. :)

Note that PageOffline() is a bit confusing because it's "Memory block 
online but page is logically offline (e.g., has a memmap that can be 
touched, but the page content should not be touched)".

(memory block offline -> all pages offline and have effectively no state 
because the memmap is stale)

> removed from page allocator.

Usually, all pages are freed back to the buddy (isolated pageblock -> 
put onto the isolated list). Memory offlining code can then simply grab 
these "free" pages from the buddy -- no PageOffline involved.

If something fails during memory offlining, these isolated pages are 
simply put back on the appropriate migratetype list and become ordinary 
free pages that can be allocated immediately.

Some PageOffline pages can be migrated using the non-folio migration: 
this is done for memory ballooning (memory comapction). As they get 
migrated, they are freed back to the buddy, PageOffline() is cleared -- 
they become PageBuddy() -- and the above applies.

Other PageOffline pages can be skipped during memory offlining 
(virtio-mem use case, what we are doing her). We don't want them to ever 
go through the buddy, especially because if memory offlining fails they 
must definitely not be treated like free pages that can be allocated 
immediately.

Next, the page is removed from its memory
> block. When will PG_offline_skippable be used? The second phase when
> the page is being removed from its memory block?

PG_offline_skippable is used during memory offlining, while we look for 
any pages that are not PageBuddy (... or hwpoisoned ...), to migrate 
them off the memory so they get converted to PageBuddy.

PageOffline + PageOfflineSkippable are checked on that phase, such that 
they don't require any migration.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1 0/2] mm/memory_hotplug: introduce and use PG_offline_skippable
  2025-05-14 14:12   ` David Hildenbrand
@ 2025-05-14 15:49     ` Zi Yan
  2025-05-14 17:28       ` David Hildenbrand
  0 siblings, 1 reply; 14+ messages in thread
From: Zi Yan @ 2025-05-14 15:49 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, virtualization, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Andrew Morton,
	Oscar Salvador, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Matthew Wilcox (Oracle)

On 14 May 2025, at 10:12, David Hildenbrand wrote:

> On 14.05.25 15:45, Zi Yan wrote:
>> On 14 May 2025, at 7:15, David Hildenbrand wrote:
>>
>>> This is a requirement for making PageOffline pages not have a refcount
>>> in the long future ("frozen"), and for reworking non-folio page migration
>>> in the near future.
>>>
>>> I have patches mostly ready to go to handle the latter. For turning all
>>> PageOffline() pages frozen, the non-folio page migration and memory
>>> ballooning drivers will have to be reworked first, to no longer rely on
>>> the refcount of PageOffline pages.
>>>
>>> Introduce PG_offline_skippable that only applies to PageOffline() pages --
>>> of course, reusing one of the existing PG_ flags for now -- and convert
>>> virtio-mem to make use of the new way: to allow for skipping PageOffline
>>> pages during memory offlining, treating them as if they would not be
>>> allocated.
>>
>
> Thanks for taking a look!
>
>> IIUC, based on Documentation/admin-guide/mm/memory-hotplug.rst,
>> to offline a page, the page first needs to be set PageOffline() to be
>
> PageOffline is not mentioned in there. :)

Sorry, I was mixing the code with the documentation as I was reading
both.

>
> Note that PageOffline() is a bit confusing because it's "Memory block online but page is logically offline (e.g., has a memmap that can be touched, but the page content should not be touched)".

So PageOffline() is before memory block offline, which is the first phase of
memory hotunplug.

>
> (memory block offline -> all pages offline and have effectively no state because the memmap is stale)

What do you mean by memmap is stale? When a memory block is offline, memmap is
still present, so pfn scanner can see these pages. pfn scanner checks memmap
to know that it should not touch these pages, right?

>
>> removed from page allocator.
>
> Usually, all pages are freed back to the buddy (isolated pageblock -> put onto the isolated list). Memory offlining code can then simply grab these "free" pages from the buddy -- no PageOffline involved.
>
> If something fails during memory offlining, these isolated pages are simply put back on the appropriate migratetype list and become ordinary free pages that can be allocated immediately.

I am familiar with this part. Then, when PageOffline is used?

From the comment in page-flags.h, I see two examples: inflated pages by balloon driver
and not onlined pages when onlining the section. These are two different operations:
1) inflated pages are going to be offline, 2) not onlined pages are going to be
online. But you mentioned above that Memory off lining code does not involve
PageOffline, so inflated pages by balloon driver is not part of memory offlining
code, but a different way of offlining pages. Am I getting it right?

I read a little bit more on memory ballooning and virtio-mem and understand
that memory ballooning still keeps the inflated page but guest cannot allocate
and use it, whereas virtio-mem and memory hotunplug remove the page from
Linux completely (i.e., Linux no longer sees the memory).

It seems that I am mixing memory offlining and memory hotunplug. IIUC,
memory offlining means no one can allocate and use the offlined memory, but
Linux still sees it; memory hotunplug means Linux no longer sees it (no related
memmap and other metadata). Am I getting it right?

>
> Some PageOffline pages can be migrated using the non-folio migration: this is done for memory ballooning (memory comapction). As they get migrated, they are freed back to the buddy, PageOffline() is cleared -- they become PageBuddy() -- and the above applies.

After a PageOffline page is migrated, the destination page becomes PageOffline, right?
OK, I see it in balloon_page_insert().

>
> Other PageOffline pages can be skipped during memory offlining (virtio-mem use case, what we are doing her). We don't want them to ever go through the buddy, especially because if memory offlining fails they must definitely not be treated like free pages that can be allocated immediately.

What do you mean by "skipped during memory offlining"? Are you implying when
virtio-mem is offlining some pages by marking it PageOffline and PG_offline_skippable,
someone else can do memory offlining in parallel?

>
> Next, the page is removed from its memory
>> block. When will PG_offline_skippable be used? The second phase when
>> the page is being removed from its memory block?
>
> PG_offline_skippable is used during memory offlining, while we look for any pages that are not PageBuddy (... or hwpoisoned ...), to migrate them off the memory so they get converted to PageBuddy.
>
> PageOffline + PageOfflineSkippable are checked on that phase, such that they don't require any migration.

Hmm, if you just do not want to get PageOffline migrated, not setting it
__PageMovable would work right? PageOffline + __PageMovable is used by
ballooning, as these inflated pages can be migrated. PageOffline without
__PageMovable should be virtio-mem. Am I missing any other user?

--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1 0/2] mm/memory_hotplug: introduce and use PG_offline_skippable
  2025-05-14 15:49     ` Zi Yan
@ 2025-05-14 17:28       ` David Hildenbrand
  2025-05-14 17:43         ` Zi Yan
  0 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand @ 2025-05-14 17:28 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-kernel, linux-mm, virtualization, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Andrew Morton,
	Oscar Salvador, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Matthew Wilcox (Oracle)

>>
>> Note that PageOffline() is a bit confusing because it's "Memory block online but page is logically offline (e.g., has a memmap that can be touched, but the page content should not be touched)".
> 
> So PageOffline() is before memory block offline, which is the first phase of
> memory hotunplug.

Yes.

> 
>>
>> (memory block offline -> all pages offline and have effectively no state because the memmap is stale)
> 
> What do you mean by memmap is stale? When a memory block is offline, memmap is
> still present, so pfn scanner can see these pages. pfn scanner checks memmap
> to know that it should not touch these pages, right?

See pfn_to_online_page() for exactly that use case.

For an offline memory section (either because it was just added or 
because it was just offlined), the memmap is assumed to contain garbage 
and should not be touched.

See remove_pfn_range_from_zone() -> page_init_poison().

> 
>>
>>> removed from page allocator.
>>
>> Usually, all pages are freed back to the buddy (isolated pageblock -> put onto the isolated list). Memory offlining code can then simply grab these "free" pages from the buddy -- no PageOffline involved.
>>
>> If something fails during memory offlining, these isolated pages are simply put back on the appropriate migratetype list and become ordinary free pages that can be allocated immediately.
> 
> I am familiar with this part. Then, when PageOffline is used?
> 
>  From the comment in page-flags.h, I see two examples: inflated pages by balloon driver
> and not onlined pages when onlining the section. These are two different operations:
> 1) inflated pages are going to be offline, 2) not onlined pages are going to be
> online. But you mentioned above that Memory off lining code does not involve
> PageOffline, so inflated pages by balloon driver is not part of memory offlining
> code, but a different way of offlining pages. Am I getting it right?

Yes. PageOffline means logically offline, for whatever reason someone 
decides to turn pages logically offline.

Memory ballooning uses and virtio-mem are two users, there are more.

> 
> I read a little bit more on memory ballooning and virtio-mem and understand
> that memory ballooning still keeps the inflated page but guest cannot allocate
> and use it, whereas virtio-mem and memory hotunplug remove the page from
> Linux completely (i.e., Linux no longer sees the memory).

In virtio-mem terms, they are considered "fake offline" -- memory 
behaves as if it would never have been onlined, but there is a memmap 
for it. Like a (current) memory hole.

> 
> It seems that I am mixing memory offlining and memory hotunplug. IIUC,
> memory offlining means no one can allocate and use the offlined memory, but
> Linux still sees it; memory hotunplug means Linux no longer sees it (no related
> memmap and other metadata). Am I getting it right?

The doc has this "Phases of Memory Hotplug" description, where it is 
roughly divided into that, yes.

> 
>>
>> Some PageOffline pages can be migrated using the non-folio migration: this is done for memory ballooning (memory comapction). As they get migrated, they are freed back to the buddy, PageOffline() is cleared -- they become PageBuddy() -- and the above applies.
> 
> After a PageOffline page is migrated, the destination page becomes PageOffline, right?
> OK, I see it in balloon_page_insert().

Yes.

> 
>>
>> Other PageOffline pages can be skipped during memory offlining (virtio-mem use case, what we are doing her). We don't want them to ever go through the buddy, especially because if memory offlining fails they must definitely not be treated like free pages that can be allocated immediately.
> 
> What do you mean by "skipped during memory offlining"? Are you implying when
> virtio-mem is offlining some pages by marking it PageOffline and PG_offline_skippable,
> someone else can do memory offlining in parallel?

It could happen (e.g., manually offline a Linux memory block using 
sysfs), but that is not the primary use case.

virtio-mem unplugs memory in the following sequence:

1) alloc_contig_range() small blocks (e.g., 2 MiB)

2) Report the blocks to the hypervisor

3) Mark them fake-offline: PageOffline (+ PageOfflineSkippable now)

Once all small blocks that comprise a Linux memory block (e.g., 128 MiB) 
are fake-offline, offline the memory block and remove the memory using 
offline_and_remove_memory().

In that operation -- offline_and_remove_memory() -- memory offlining 
code must be able to skip these PageOffline pages, otherwise 
offline_and_remove_memory() will just fail, saying that there are 
unmovable pages in there.

> 
>>
>> Next, the page is removed from its memory
>>> block. When will PG_offline_skippable be used? The second phase when
>>> the page is being removed from its memory block?
>>
>> PG_offline_skippable is used during memory offlining, while we look for any pages that are not PageBuddy (... or hwpoisoned ...), to migrate them off the memory so they get converted to PageBuddy.
>>
>> PageOffline + PageOfflineSkippable are checked on that phase, such that they don't require any migration.
> 
> Hmm, if you just do not want to get PageOffline migrated, not setting it
> __PageMovable would work right? PageOffline + __PageMovable is used by
> ballooning, as these inflated pages can be migrated. PageOffline without
> __PageMovable should be virtio-mem. Am I missing any other user?

Sure. Just imagine !CONFIG_BALLOON_COMPACTION.

In summary, we have

1) Migratable PageOffline pages (balloon compaction)

2) Unmigratable PageOffline pages (e.g., XEN balloon, hyper-v balloon,
    memtrace, in the future likely some memory holes, ... )

3) Skippable PageOffline pages (virtio-mem)

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1 0/2] mm/memory_hotplug: introduce and use PG_offline_skippable
  2025-05-14 17:28       ` David Hildenbrand
@ 2025-05-14 17:43         ` Zi Yan
  2025-05-14 17:46           ` David Hildenbrand
  0 siblings, 1 reply; 14+ messages in thread
From: Zi Yan @ 2025-05-14 17:43 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, virtualization, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Andrew Morton,
	Oscar Salvador, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Matthew Wilcox (Oracle)

On 14 May 2025, at 13:28, David Hildenbrand wrote:

>>>
>>> Note that PageOffline() is a bit confusing because it's "Memory block online but page is logically offline (e.g., has a memmap that can be touched, but the page content should not be touched)".
>>
>> So PageOffline() is before memory block offline, which is the first phase of
>> memory hotunplug.
>
> Yes.
>
>>
>>>
>>> (memory block offline -> all pages offline and have effectively no state because the memmap is stale)
>>
>> What do you mean by memmap is stale? When a memory block is offline, memmap is
>> still present, so pfn scanner can see these pages. pfn scanner checks memmap
>> to know that it should not touch these pages, right?
>
> See pfn_to_online_page() for exactly that use case.
>
> For an offline memory section (either because it was just added or because it was just offlined), the memmap is assumed to contain garbage and should not be touched.
>
> See remove_pfn_range_from_zone() -> page_init_poison().
>
>>
>>>
>>>> removed from page allocator.
>>>
>>> Usually, all pages are freed back to the buddy (isolated pageblock -> put onto the isolated list). Memory offlining code can then simply grab these "free" pages from the buddy -- no PageOffline involved.
>>>
>>> If something fails during memory offlining, these isolated pages are simply put back on the appropriate migratetype list and become ordinary free pages that can be allocated immediately.
>>
>> I am familiar with this part. Then, when PageOffline is used?
>>
>>  From the comment in page-flags.h, I see two examples: inflated pages by balloon driver
>> and not onlined pages when onlining the section. These are two different operations:
>> 1) inflated pages are going to be offline, 2) not onlined pages are going to be
>> online. But you mentioned above that Memory off lining code does not involve
>> PageOffline, so inflated pages by balloon driver is not part of memory offlining
>> code, but a different way of offlining pages. Am I getting it right?
>
> Yes. PageOffline means logically offline, for whatever reason someone decides to turn pages logically offline.
>
> Memory ballooning uses and virtio-mem are two users, there are more.
>
>>
>> I read a little bit more on memory ballooning and virtio-mem and understand
>> that memory ballooning still keeps the inflated page but guest cannot allocate
>> and use it, whereas virtio-mem and memory hotunplug remove the page from
>> Linux completely (i.e., Linux no longer sees the memory).
>
> In virtio-mem terms, they are considered "fake offline" -- memory behaves as if it would never have been onlined, but there is a memmap for it. Like a (current) memory hole.
>
>>
>> It seems that I am mixing memory offlining and memory hotunplug. IIUC,
>> memory offlining means no one can allocate and use the offlined memory, but
>> Linux still sees it; memory hotunplug means Linux no longer sees it (no related
>> memmap and other metadata). Am I getting it right?
>
> The doc has this "Phases of Memory Hotplug" description, where it is roughly divided into that, yes.
>
>>
>>>
>>> Some PageOffline pages can be migrated using the non-folio migration: this is done for memory ballooning (memory comapction). As they get migrated, they are freed back to the buddy, PageOffline() is cleared -- they become PageBuddy() -- and the above applies.
>>
>> After a PageOffline page is migrated, the destination page becomes PageOffline, right?
>> OK, I see it in balloon_page_insert().
>
> Yes.
>
>>
>>>
>>> Other PageOffline pages can be skipped during memory offlining (virtio-mem use case, what we are doing her). We don't want them to ever go through the buddy, especially because if memory offlining fails they must definitely not be treated like free pages that can be allocated immediately.
>>
>> What do you mean by "skipped during memory offlining"? Are you implying when
>> virtio-mem is offlining some pages by marking it PageOffline and PG_offline_skippable,
>> someone else can do memory offlining in parallel?
>
> It could happen (e.g., manually offline a Linux memory block using sysfs), but that is not the primary use case.
>
> virtio-mem unplugs memory in the following sequence:
>
> 1) alloc_contig_range() small blocks (e.g., 2 MiB)
>
> 2) Report the blocks to the hypervisor
>
> 3) Mark them fake-offline: PageOffline (+ PageOfflineSkippable now)
>
> Once all small blocks that comprise a Linux memory block (e.g., 128 MiB) are fake-offline, offline the memory block and remove the memory using offline_and_remove_memory().
>
> In that operation -- offline_and_remove_memory() -- memory offlining code must be able to skip these PageOffline pages, otherwise offline_and_remove_memory() will just fail, saying that there are unmovable pages in there.
>
>>
>>>
>>> Next, the page is removed from its memory
>>>> block. When will PG_offline_skippable be used? The second phase when
>>>> the page is being removed from its memory block?
>>>
>>> PG_offline_skippable is used during memory offlining, while we look for any pages that are not PageBuddy (... or hwpoisoned ...), to migrate them off the memory so they get converted to PageBuddy.
>>>
>>> PageOffline + PageOfflineSkippable are checked on that phase, such that they don't require any migration.
>>
>> Hmm, if you just do not want to get PageOffline migrated, not setting it
>> __PageMovable would work right? PageOffline + __PageMovable is used by
>> ballooning, as these inflated pages can be migrated. PageOffline without
>> __PageMovable should be virtio-mem. Am I missing any other user?
>
> Sure. Just imagine !CONFIG_BALLOON_COMPACTION.
>
> In summary, we have
>
> 1) Migratable PageOffline pages (balloon compaction)
>
> 2) Unmigratable PageOffline pages (e.g., XEN balloon, hyper-v balloon,
>    memtrace, in the future likely some memory holes, ... )
>
> 3) Skippable PageOffline pages (virtio-mem)

Thank you for all the explanation. Now I understand how memory offline
and memory hotunplug work and shall begin to check the patches. :)


--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1 0/2] mm/memory_hotplug: introduce and use PG_offline_skippable
  2025-05-14 17:43         ` Zi Yan
@ 2025-05-14 17:46           ` David Hildenbrand
  0 siblings, 0 replies; 14+ messages in thread
From: David Hildenbrand @ 2025-05-14 17:46 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-kernel, linux-mm, virtualization, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Andrew Morton,
	Oscar Salvador, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Matthew Wilcox (Oracle)

>>>> Next, the page is removed from its memory
>>>>> block. When will PG_offline_skippable be used? The second phase when
>>>>> the page is being removed from its memory block?
>>>>
>>>> PG_offline_skippable is used during memory offlining, while we look for any pages that are not PageBuddy (... or hwpoisoned ...), to migrate them off the memory so they get converted to PageBuddy.
>>>>
>>>> PageOffline + PageOfflineSkippable are checked on that phase, such that they don't require any migration.
>>>
>>> Hmm, if you just do not want to get PageOffline migrated, not setting it
>>> __PageMovable would work right? PageOffline + __PageMovable is used by
>>> ballooning, as these inflated pages can be migrated. PageOffline without
>>> __PageMovable should be virtio-mem. Am I missing any other user?
>>
>> Sure. Just imagine !CONFIG_BALLOON_COMPACTION.
>>
>> In summary, we have
>>
>> 1) Migratable PageOffline pages (balloon compaction)
>>
>> 2) Unmigratable PageOffline pages (e.g., XEN balloon, hyper-v balloon,
>>     memtrace, in the future likely some memory holes, ... )
>>
>> 3) Skippable PageOffline pages (virtio-mem)
> 
> Thank you for all the explanation. Now I understand how memory offline
> and memory hotunplug work and shall begin to check the patches. :)

Sure, if you think the doc or some comments could be updated, I'm happy 
to review such changes.

It's always very helpful to receive feedback from someone that's new to 
this code.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1 1/2] mm/memory_hotplug: PG_offline_skippable for offlining memory blocks with PageOffline pages
  2025-05-14 11:15 ` [PATCH v1 1/2] mm/memory_hotplug: PG_offline_skippable for offlining memory blocks with PageOffline pages David Hildenbrand
@ 2025-05-14 19:00   ` Zi Yan
  2025-05-14 19:51     ` David Hildenbrand
  0 siblings, 1 reply; 14+ messages in thread
From: Zi Yan @ 2025-05-14 19:00 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, virtualization, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Andrew Morton,
	Oscar Salvador, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Matthew Wilcox (Oracle)

On 14 May 2025, at 7:15, David Hildenbrand wrote:

> A long-term goal is supporting frozen PageOffline pages, and later
> PageOffline pages that don't have a refcount at all. Some more work for
> that is needed -- in particular around non-folio page migration and
> memory ballooning drivers -- but let's start by handling PageOffline pages
> that can be skipped during memory offlining differently.
>
> Let's introduce a PageOffline specific page flag (PG_offline_skippable)
> that for now reuses PG_owner_2. In the memdesc future, it will be one of
> a small number of per-memdesc flags stored alongside the type.
>
> By setting PG_offline_skippable, a driver indicates that it can
> restore the PageOffline state of these specific pages when re-onlining a
> memory block: it knows that these pages are supposed to be PageOffline()
> without the information in the vmemmap, so it can filter them out and
> not expose them to the buddy -> they stay PageOffline().
>
> While PG_offline_offlineable might be clearer, it is also super
> confusing. Alternatives (PG_offline_sticky?) also don't quite feel right.
> So let's use "skippable" for now.
>
> The flag is not supposed to be used for movable PageOffline pages as
> used for balloon compaction; movable PageOffline() pages can simply be
> migrated during the memory offlining stage.
>
> Let's convert the single user from our MEM_GOING_OFFLINE approach
> to the new PG_offline_skippable approach: virtio-mem. Fortunately,
> this simplifies the code quite a lot.
>
> What if someone decides to grab a reference on these pages although they
> really shouldn't? After all, we'll now keep the refcount at 1 (until we
> can properly stop using the refcount completely).
>
> Well, less worse things will happen than would currently: currently,
> if someone would grab a reference to these pages, in MEM_GOING_OFFLINE
> we would run into the
> 		if (WARN_ON(!page_ref_dec_and_test(page)))
> 			dump_page(page, "fake-offline page referenced");
>
> And once that unexpected reference would get dropped, we would end up
> freeing that page to the buddy: ouch.
>
> Now, we'll allow for offlining that memory, and when that unexpected
> reference would get dropped, we would not end up freeing that page to
> the buddy. Once we have frozen PageOffline() pages, it will all get a
> lot cleaner.
>
> Note that we didn't see the existing WARN_ON so far, because nobody
> should ever be referencing such pages.
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  drivers/virtio/virtio_mem.c | 111 +-----------------------------------
>  include/linux/page-flags.h  |  29 +++++++---
>  mm/memory_hotplug.c         |  12 ++--
>  mm/page_alloc.c             |   8 +--
>  mm/page_isolation.c         |  21 +++----
>  5 files changed, 42 insertions(+), 139 deletions(-)
>
> diff --git a/drivers/virtio/virtio_mem.c b/drivers/virtio/virtio_mem.c
> index 56d0dbe621637..77b72843c4b53 100644
> --- a/drivers/virtio/virtio_mem.c
> +++ b/drivers/virtio/virtio_mem.c
> @@ -278,10 +278,6 @@ static DEFINE_MUTEX(virtio_mem_mutex);
>  static LIST_HEAD(virtio_mem_devices);
>
>  static void virtio_mem_online_page_cb(struct page *page, unsigned int order);
> -static void virtio_mem_fake_offline_going_offline(unsigned long pfn,
> -						  unsigned long nr_pages);
> -static void virtio_mem_fake_offline_cancel_offline(unsigned long pfn,
> -						   unsigned long nr_pages);
>  static void virtio_mem_retry(struct virtio_mem *vm);
>  static int virtio_mem_create_resource(struct virtio_mem *vm);
>  static void virtio_mem_delete_resource(struct virtio_mem *vm);
> @@ -924,64 +920,6 @@ static void virtio_mem_sbm_notify_online(struct virtio_mem *vm,
>  	virtio_mem_sbm_set_mb_state(vm, mb_id, new_state);
>  }
>
> -static void virtio_mem_sbm_notify_going_offline(struct virtio_mem *vm,
> -						unsigned long mb_id)
> -{
> -	const unsigned long nr_pages = PFN_DOWN(vm->sbm.sb_size);
> -	unsigned long pfn;
> -	int sb_id;
> -
> -	for (sb_id = 0; sb_id < vm->sbm.sbs_per_mb; sb_id++) {
> -		if (virtio_mem_sbm_test_sb_plugged(vm, mb_id, sb_id, 1))
> -			continue;
> -		pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
> -			       sb_id * vm->sbm.sb_size);
> -		virtio_mem_fake_offline_going_offline(pfn, nr_pages);
> -	}
> -}
> -
> -static void virtio_mem_sbm_notify_cancel_offline(struct virtio_mem *vm,
> -						 unsigned long mb_id)
> -{
> -	const unsigned long nr_pages = PFN_DOWN(vm->sbm.sb_size);
> -	unsigned long pfn;
> -	int sb_id;
> -
> -	for (sb_id = 0; sb_id < vm->sbm.sbs_per_mb; sb_id++) {
> -		if (virtio_mem_sbm_test_sb_plugged(vm, mb_id, sb_id, 1))
> -			continue;
> -		pfn = PFN_DOWN(virtio_mem_mb_id_to_phys(mb_id) +
> -			       sb_id * vm->sbm.sb_size);
> -		virtio_mem_fake_offline_cancel_offline(pfn, nr_pages);
> -	}
> -}
> -
> -static void virtio_mem_bbm_notify_going_offline(struct virtio_mem *vm,
> -						unsigned long bb_id,
> -						unsigned long pfn,
> -						unsigned long nr_pages)
> -{
> -	/*
> -	 * When marked as "fake-offline", all online memory of this device block
> -	 * is allocated by us. Otherwise, we don't have any memory allocated.
> -	 */
> -	if (virtio_mem_bbm_get_bb_state(vm, bb_id) !=
> -	    VIRTIO_MEM_BBM_BB_FAKE_OFFLINE)
> -		return;
> -	virtio_mem_fake_offline_going_offline(pfn, nr_pages);
> -}
> -
> -static void virtio_mem_bbm_notify_cancel_offline(struct virtio_mem *vm,
> -						 unsigned long bb_id,
> -						 unsigned long pfn,
> -						 unsigned long nr_pages)
> -{
> -	if (virtio_mem_bbm_get_bb_state(vm, bb_id) !=
> -	    VIRTIO_MEM_BBM_BB_FAKE_OFFLINE)
> -		return;
> -	virtio_mem_fake_offline_cancel_offline(pfn, nr_pages);
> -}
> -
>  /*
>   * This callback will either be called synchronously from add_memory() or
>   * asynchronously (e.g., triggered via user space). We have to be careful
> @@ -1040,12 +978,6 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
>  			break;
>  		}
>  		vm->hotplug_active = true;
> -		if (vm->in_sbm)
> -			virtio_mem_sbm_notify_going_offline(vm, id);
> -		else
> -			virtio_mem_bbm_notify_going_offline(vm, id,
> -							    mhp->start_pfn,
> -							    mhp->nr_pages);
>  		break;
>  	case MEM_GOING_ONLINE:
>  		mutex_lock(&vm->hotplug_mutex);
> @@ -1094,12 +1026,6 @@ static int virtio_mem_memory_notifier_cb(struct notifier_block *nb,
>  	case MEM_CANCEL_OFFLINE:
>  		if (!vm->hotplug_active)
>  			break;
> -		if (vm->in_sbm)
> -			virtio_mem_sbm_notify_cancel_offline(vm, id);
> -		else
> -			virtio_mem_bbm_notify_cancel_offline(vm, id,
> -							     mhp->start_pfn,
> -							     mhp->nr_pages);
>  		vm->hotplug_active = false;
>  		mutex_unlock(&vm->hotplug_mutex);
>  		break;
> @@ -1157,6 +1083,7 @@ static void virtio_mem_set_fake_offline(unsigned long pfn,
>  			SetPageDirty(page);
>  		else
>  			__SetPageOffline(page);
> +		__SetPageOfflineSkippable(page);
>  		VM_WARN_ON_ONCE(!PageOffline(page));
>  	}
>  	page_offline_end();
> @@ -1172,6 +1099,7 @@ static void virtio_mem_clear_fake_offline(unsigned long pfn,
>  	for (; nr_pages--; pfn++) {
>  		struct page *page = pfn_to_page(pfn);
>
> +		__ClearPageOfflineSkippable(page);
>  		if (!onlined)
>  			/* generic_online_page() will clear PageOffline(). */
>  			ClearPageDirty(page);
> @@ -1261,41 +1189,6 @@ static int virtio_mem_fake_offline(struct virtio_mem *vm, unsigned long pfn,
>  	return -EBUSY;
>  }
>
> -/*
> - * Handle fake-offline pages when memory is going offline - such that the
> - * pages can be skipped by mm-core when offlining.
> - */
> -static void virtio_mem_fake_offline_going_offline(unsigned long pfn,
> -						  unsigned long nr_pages)
> -{
> -	struct page *page;
> -	unsigned long i;
> -
> -	/* Drop our reference to the pages so the memory can get offlined. */
> -	for (i = 0; i < nr_pages; i++) {
> -		page = pfn_to_page(pfn + i);
> -		if (WARN_ON(!page_ref_dec_and_test(page)))
> -			dump_page(page, "fake-offline page referenced");
> -	}
> -}
> -
> -/*
> - * Handle fake-offline pages when memory offlining is canceled - to undo
> - * what we did in virtio_mem_fake_offline_going_offline().
> - */
> -static void virtio_mem_fake_offline_cancel_offline(unsigned long pfn,
> -						   unsigned long nr_pages)
> -{
> -	unsigned long i;
> -
> -	/*
> -	 * Get the reference again that we dropped via page_ref_dec_and_test()
> -	 * when going offline.
> -	 */
> -	for (i = 0; i < nr_pages; i++)
> -		page_ref_inc(pfn_to_page(pfn + i));
> -}
> -
>  static void virtio_mem_online_page(struct virtio_mem *vm,
>  				   struct page *page, unsigned int order)
>  {
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 1c1d49554c71a..581510ae8e83a 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -178,6 +178,9 @@ enum pageflags {
>  	PG_vmemmap_self_hosted = PG_owner_priv_1,
>  #endif
>
> +	/* The driver allows for skipping these pages during memory offlining */
> +	PG_offline_skippable = PG_owner_2,
> +
>  	/*
>  	 * Flags only valid for compound pages.  Stored in first tail page's
>  	 * flags word.  Cannot use the first 8 flags or any flag marked as
> @@ -1029,14 +1032,23 @@ PAGE_TYPE_OPS(Buddy, buddy, buddy)
>   * refcount of 1 and PageOffline(). generic_online_page() will
>   * take care of clearing PageOffline().
>   *
> - * If a driver wants to allow to offline unmovable PageOffline() pages without
> - * putting them back to the buddy, it can do so via the memory notifier by
> - * decrementing the reference count in MEM_GOING_OFFLINE and incrementing the
> - * reference count in MEM_CANCEL_OFFLINE. When offlining, the PageOffline()
> - * pages (now with a reference count of zero) are treated like free (unmanaged)
> - * pages, allowing the containing memory block to get offlined. A driver that
> - * relies on this feature is aware that re-onlining the memory block will
> - * require not giving them to the buddy via generic_online_page().
> + * If a driver wants to allow for offlining unmovable PageOffline() pages
> + * when offlining a memory block without exposing these pages as "free" to
> + * the buddy, it can mark them as PG_offline_skippable.
> + *
> + * By marking these PageOffline pages PG_offline_skippable, the driver
> + * acknowledges that it
> + * (a) knows which pages are PageOffline even without the memmap.
> + * (b) implements the online_page_callback_t callback to exclude these pages
> + *     from getting exposed to the buddy, so they will stay PageOffline()
> + *     when onlining a memory block.
> + * (c) synchronizes against concurrent memory onlining/offlining whenever
> + *     adjusting the PG_offline_skippable flag.
> + *
> + * Note that the refcount of these pages will for now be assumed to always
> + * be 1: nobody except the owner should be referencing these pages.
> + * Offlining code will complain if the refcount is not 1. In the future,
> + * these pages will always be frozen and not have a refcount.
>   *
>   * Memory offlining code will not adjust the managed page count for any
>   * PageOffline() pages, treating them like they were never exposed to the
> @@ -1048,6 +1060,7 @@ PAGE_TYPE_OPS(Buddy, buddy, buddy)
>   * page_offline_freeze()/page_offline_thaw().
>   */
>  PAGE_TYPE_OPS(Offline, offline, offline)
> +__PAGEFLAG(OfflineSkippable, offline_skippable, PF_NO_COMPOUND)
>
>  extern void page_offline_freeze(void);
>  extern void page_offline_thaw(void);
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index b1caedbade5b1..0cc5537f234bb 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -1767,12 +1767,10 @@ static int scan_movable_pages(unsigned long start, unsigned long end,
>  			goto found;
>
>  		/*
> -		 * PageOffline() pages that are not marked __PageMovable() and
> -		 * have a reference count > 0 (after MEM_GOING_OFFLINE) are
> -		 * definitely unmovable. If their reference count would be 0,
> -		 * they could at least be skipped when offlining memory.
> +		 * PageOffline() pages that are neither "movable" nor
> +		 * "skippable" prevent memory offlining.
>  		 */
> -		if (PageOffline(page) && page_count(page))
> +		if (PageOffline(page) && !PageOfflineSkippable(page))
>  			return -EBUSY;
>
>  		if (!PageHuge(page))
> @@ -1807,6 +1805,10 @@ static void do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>  		struct page *page;
>
>  		page = pfn_to_page(pfn);
> +
> +		if (PageOffline(page) && PageOfflineSkippable(page))
> +			continue;
> +

Some comment like "Skippable PageOffline() pages are not migratable but are
skipped during memory offline" might help understand the change.

Some thoughts after reading the related code:

offline_pages() is a little convoluted, since it has two steps to make
sure a range of memory can be offlined:
1. start_isolate_page_range() checks unmovable (maybe not migratable
is more precise) pages using has_unmovable_pages(), but leaves unmovable
PageOffline() page handling to the driver;
2. scan_movable_pages() and do_migrate_range() migrate pages and handle
different types of PageOffline() pages.

It might make the logic cleaner if start_isolate_page_range() can
have a callback to allow driver to handle PageOffline() pages.

Hmm, Skippable PageOffline() pages are virtio-mem specific, I wonder
why offline_pages() takes care of them. Shouldn't virtio-mem driver
handle them? I also realize that the two steps in offline_pages()
are very similar to alloc_contig_range() and virtio-mem is using
alloc_contig_range(). I wonder if the two steps in offline_pages()
can be abstracted out and shared with virtio-mem. Yes, offline_pages()
operates at memory section granularity, different from the granularity,
pageblock size, of alloc_contig_range() used in virtio-mem, but
both are trying to check unmovable pages and migrate movable pages.


>  		folio = page_folio(page);
>
>  		if (!folio_try_get(folio))
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a6fe1e9b95941..325b51c905076 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -7023,12 +7023,12 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
>  			continue;
>  		}
>  		/*
> -		 * At this point all remaining PageOffline() pages have a
> -		 * reference count of 0 and can simply be skipped.
> +		 * At this point all remaining PageOffline() pages must be
> +		 * "skippable" and have exactly one reference.
>  		 */
>  		if (PageOffline(page)) {
> -			BUG_ON(page_count(page));
> -			BUG_ON(PageBuddy(page));
> +			WARN_ON_ONCE(!PageOfflineSkippable(page));
> +			WARN_ON_ONCE(page_count(page) != 1);
>  			already_offline++;
>  			pfn++;
>  			continue;
> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
> index b2fc5266e3d26..debd898b794ea 100644
> --- a/mm/page_isolation.c
> +++ b/mm/page_isolation.c
> @@ -121,16 +121,11 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
>  			continue;
>
>  		/*
> -		 * We treat all PageOffline() pages as movable when offlining
> -		 * to give drivers a chance to decrement their reference count
> -		 * in MEM_GOING_OFFLINE in order to indicate that these pages
> -		 * can be offlined as there are no direct references anymore.
> -		 * For actually unmovable PageOffline() where the driver does
> -		 * not support this, we will fail later when trying to actually
> -		 * move these pages that still have a reference count > 0.
> -		 * (false negatives in this function only)
> +		 * PageOffline() pages that are marked as "skippable"
> +		 * are treated like movable pages for memory offlining.
>  		 */
> -		if ((flags & MEMORY_OFFLINE) && PageOffline(page))
> +		if ((flags & MEMORY_OFFLINE) && PageOffline(page) &&
> +		    PageOfflineSkippable(page))
>  			continue;

With this change, we are no longer give non-virtio-mem driver a chance
to decrease PageOffline(page) refcount? Or virtio-mem is the only driver
doing this?

>
>  		if (__PageMovable(page) || PageLRU(page))
> @@ -577,11 +572,11 @@ __test_page_isolated_in_pageblock(unsigned long pfn, unsigned long end_pfn,
>  			/* A HWPoisoned page cannot be also PageBuddy */
>  			pfn++;
>  		else if ((flags & MEMORY_OFFLINE) && PageOffline(page) &&
> -			 !page_count(page))
> +			 PageOfflineSkippable(page))

The same question as above.

>  			/*
> -			 * The responsible driver agreed to skip PageOffline()
> -			 * pages when offlining memory by dropping its
> -			 * reference in MEM_GOING_OFFLINE.
> +			 * If the page is a skippable PageOffline() page,
> +			 * we can offline the memory block, as the driver will
> +			 * re-discover them when re-onlining the memory.
>  			 */
>  			pfn++;
>  		else
> -- 
> 2.49.0

Otherwise, LGTM. Acked-by: Zi Yan <ziy@nvidia.com>

--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1 1/2] mm/memory_hotplug: PG_offline_skippable for offlining memory blocks with PageOffline pages
  2025-05-14 19:00   ` Zi Yan
@ 2025-05-14 19:51     ` David Hildenbrand
  2025-05-14 20:30       ` Zi Yan
  0 siblings, 1 reply; 14+ messages in thread
From: David Hildenbrand @ 2025-05-14 19:51 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-kernel, linux-mm, virtualization, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Andrew Morton,
	Oscar Salvador, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Matthew Wilcox (Oracle)

Thanks a bucnh for the review!

>> +
>> +		if (PageOffline(page) && PageOfflineSkippable(page))
>> +			continue;
>> +
> 
> Some comment like "Skippable PageOffline() pages are not migratable but are
> skipped during memory offline" might help understand the change.

I can add a comment like for the other cases.

> 
> Some thoughts after reading the related code:
> 
> offline_pages() is a little convoluted, since it has two steps to make
> sure a range of memory can be offlined:
> 1. start_isolate_page_range() checks unmovable (maybe not migratable
> is more precise) pages using has_unmovable_pages(), but leaves unmovable
> PageOffline() page handling to the driver;

Right, it's best-effort. For ZONE_MOVABLE we're skipping the checks 
completely.

I was wondering for a longer time that -- with the isolate flag being a 
separate bit soon :) -- we could simply isolate the whole range, and 
then fail if we stumble over an unmovable page during migration. That 
is, get rid of has_unmovable_pages() entirely. Un-doing the isolation 
would then properly preserve the migratetype -- no harm done?

Certainly worth a look. What do you think about that?

> 2. scan_movable_pages() and do_migrate_range() migrate pages and handle
> different types of PageOffline() pages.

Right, migrate what we can, skip these special once, and bail out on any 
others (at least in this patch, patch #2 restores the pre-virtio-mem 
behavior).

> 
> It might make the logic cleaner if start_isolate_page_range() can
> have a callback to allow driver to handle PageOffline() pages.

We have to identify them repeadetly, so start_isolate_page_range() would 
not be sufficient. Also, callbacks are rather tricky for the case where 
we cannot really stabilize the page.

> 
> Hmm, Skippable PageOffline() pages are virtio-mem specific, I wonder
> why offline_pages() takes care of them. Shouldn't virtio-mem driver
> handle them?
 > I also realize that the two steps in offline_pages()> are very 
similar to alloc_contig_range() and virtio-mem is using
> alloc_contig_range(). I wonder if the two steps in offline_pages()
> can be abstracted out and shared with virtio-mem.Yes, offline_pages()> operates at memory section granularity, different from the granularity,
> pageblock size, of alloc_contig_range() used in virtio-mem, but
> both are trying to check unmovable pages and migrate movable pages.

Not sure I get completely what you mean. virtio-mem really wants to use 
alloc_contig_range() to allocate ranges it wants to unplug, because it 
will fail fast and allows for smaller ranges.

offline_pages() is primarily also about offlining the memory section, 
which virtio-mem must do in order to remove the Linux memory block.

> 
> 
>>   		folio = page_folio(page);
>>
>>   		if (!folio_try_get(folio))
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index a6fe1e9b95941..325b51c905076 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -7023,12 +7023,12 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
>>   			continue;
>>   		}
>>   		/*
>> -		 * At this point all remaining PageOffline() pages have a
>> -		 * reference count of 0 and can simply be skipped.
>> +		 * At this point all remaining PageOffline() pages must be
>> +		 * "skippable" and have exactly one reference.
>>   		 */
>>   		if (PageOffline(page)) {
>> -			BUG_ON(page_count(page));
>> -			BUG_ON(PageBuddy(page));
>> +			WARN_ON_ONCE(!PageOfflineSkippable(page));
>> +			WARN_ON_ONCE(page_count(page) != 1);
>>   			already_offline++;
>>   			pfn++;
>>   			continue;
>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>> index b2fc5266e3d26..debd898b794ea 100644
>> --- a/mm/page_isolation.c
>> +++ b/mm/page_isolation.c
>> @@ -121,16 +121,11 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
>>   			continue;
>>
>>   		/*
>> -		 * We treat all PageOffline() pages as movable when offlining
>> -		 * to give drivers a chance to decrement their reference count
>> -		 * in MEM_GOING_OFFLINE in order to indicate that these pages
>> -		 * can be offlined as there are no direct references anymore.
>> -		 * For actually unmovable PageOffline() where the driver does
>> -		 * not support this, we will fail later when trying to actually
>> -		 * move these pages that still have a reference count > 0.
>> -		 * (false negatives in this function only)
>> +		 * PageOffline() pages that are marked as "skippable"
>> +		 * are treated like movable pages for memory offlining.
>>   		 */
>> -		if ((flags & MEMORY_OFFLINE) && PageOffline(page))
>> +		if ((flags & MEMORY_OFFLINE) && PageOffline(page) &&
>> +		    PageOfflineSkippable(page))
>>   			continue;
> 
> With this change, we are no longer give non-virtio-mem driver a chance
> to decrease PageOffline(page) refcount? Or virtio-mem is the only driver
> doing this?

The only in-tree driver making use of this so far, yes.


There is one tweak I'll have to perform in the virtio-mem driver to 
cover one corner case: when force-unloading the virtio-mem driver, we 
have to make sure that memory blocks with fake-offline pages cannot get 
offlined (re-onlining would be bad). I'll fix that up.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1 2/2] mm/memory_hotplug: remove -EBUSY handling from scan_movable_pages()
  2025-05-14 11:15 ` [PATCH v1 2/2] mm/memory_hotplug: remove -EBUSY handling from scan_movable_pages() David Hildenbrand
@ 2025-05-14 19:57   ` David Hildenbrand
  0 siblings, 0 replies; 14+ messages in thread
From: David Hildenbrand @ 2025-05-14 19:57 UTC (permalink / raw)
  To: linux-kernel
  Cc: linux-mm, virtualization, Michael S. Tsirkin, Jason Wang,
	Xuan Zhuo, Eugenio Pérez, Andrew Morton, Oscar Salvador,
	Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Zi Yan, Matthew Wilcox (Oracle)

On 14.05.25 13:15, David Hildenbrand wrote:
> Now that we can reliably identify PageOffline() pages that allow for
> memory offlining in has_unmovable_pages(), start_isolate_page_range()
> will fail on PageOffline() pages that would prevent memory offlining, and
> we no longer have to detect them in scan_movable_pages() anymore.
> 
> Note that the previous mechanism relied on MEM_GOING_OFFLINE, whereby we
> were not able to distinguish the types of PageOffline() before
> MEM_GOING_OFFLINE.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---

Hmm, I'm having second thoughts about this patch.

With hyper-v, I think we can hotplug partial memory blocks (e.g., 64 
MiB) and online it to ZONE_MOVABLE. These blocks cannot get offlined so 
far -- in contrast to virtio-mem -- and has_unmovable_pages() would not 
identify that.

So probably best to keep that in for now.

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1 1/2] mm/memory_hotplug: PG_offline_skippable for offlining memory blocks with PageOffline pages
  2025-05-14 19:51     ` David Hildenbrand
@ 2025-05-14 20:30       ` Zi Yan
  2025-05-19 14:39         ` David Hildenbrand
  0 siblings, 1 reply; 14+ messages in thread
From: Zi Yan @ 2025-05-14 20:30 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, virtualization, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Andrew Morton,
	Oscar Salvador, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Matthew Wilcox (Oracle)

On 14 May 2025, at 15:51, David Hildenbrand wrote:

> Thanks a bucnh for the review!
>
>>> +
>>> +		if (PageOffline(page) && PageOfflineSkippable(page))
>>> +			continue;
>>> +
>>
>> Some comment like "Skippable PageOffline() pages are not migratable but are
>> skipped during memory offline" might help understand the change.
>
> I can add a comment like for the other cases.

Thanks.

>>
>> Some thoughts after reading the related code:
>>
>> offline_pages() is a little convoluted, since it has two steps to make
>> sure a range of memory can be offlined:
>> 1. start_isolate_page_range() checks unmovable (maybe not migratable
>> is more precise) pages using has_unmovable_pages(), but leaves unmovable
>> PageOffline() page handling to the driver;
>
> Right, it's best-effort. For ZONE_MOVABLE we're skipping the checks completely.
>
> I was wondering for a longer time that -- with the isolate flag being a separate bit soon :) -- we could simply isolate the whole range, and then fail if we stumble over

Talking about that, I will need your input on my change to move_pfn_range_to_zone()
in online_pages()[1]. Making MIGRATE_ISOLATE a separate bit means if you isolate
a pageblock without a migratetype, unisolating it will give an unpredictable
migratetype.

[1] https://lore.kernel.org/linux-mm/20250509200111.3372279-3-ziy@nvidia.com/

> an unmovable page during migration. That is, get rid of has_unmovable_pages() entirely. Un-doing the isolation would then properly preserve the migratetype -- no harm done?
>
> Certainly worth a look. What do you think about that?

In principle, the method should work and simplifies the code. But it suffers more
penalty (pages are migrated) when an unmovable page is encountered after the
isolation, since before nothing will be migrated. To mitigate this,
we probably would want some retry mechanism. For example, register a callback
to each unmovable page and once the unmovable page is deallocated,
alloc_contig_range() can move forward progress.

>
>> 2. scan_movable_pages() and do_migrate_range() migrate pages and handle
>> different types of PageOffline() pages.
>
> Right, migrate what we can, skip these special once, and bail out on any others (at least in this patch, patch #2 restores the pre-virtio-mem behavior).
>
>>
>> It might make the logic cleaner if start_isolate_page_range() can
>> have a callback to allow driver to handle PageOffline() pages.
>
> We have to identify them repeadetly, so start_isolate_page_range() would not be sufficient. Also, callbacks are rather tricky for the case where we cannot really stabilize the page.

During start_isolate_page_range(), all pageblocks are isolated, so
free pages cannot be used by anyone else, meaning no new PageOffline()
pages or any other unmovable pages, right?

>
>>
>> Hmm, Skippable PageOffline() pages are virtio-mem specific, I wonder
>> why offline_pages() takes care of them. Shouldn't virtio-mem driver
>> handle them?
>> I also realize that the two steps in offline_pages()> are very similar to alloc_contig_range() and virtio-mem is using
>> alloc_contig_range(). I wonder if the two steps in offline_pages()
>> can be abstracted out and shared with virtio-mem.Yes, offline_pages()> operates at memory section granularity, different from the granularity,
>> pageblock size, of alloc_contig_range() used in virtio-mem, but
>> both are trying to check unmovable pages and migrate movable pages.
>
> Not sure I get completely what you mean. virtio-mem really wants to use alloc_contig_range() to allocate ranges it wants to unplug, because it will fail fast and allows for smaller ranges.
>
> offline_pages() is primarily also about offlining the memory section, which virtio-mem must do in order to remove the Linux memory block.

To clarify, I mean the steps of start_isolate_page_range(), scan_movable_pages(),
do_migrate_range(), dissolve_free_hugetlb_folios() and test_pages_isolated() in
offline_pages() is very similar to the steps of start_isolate_page_range(),
__alloc_contig_migrate_range(), replace_free_hugepage_folios(),
and test_pages_isolate() in alloc_contig_range(). So I wonder if a common
function routine can be shared.

>
>>
>>
>>>   		folio = page_folio(page);
>>>
>>>   		if (!folio_try_get(folio))
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index a6fe1e9b95941..325b51c905076 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -7023,12 +7023,12 @@ unsigned long __offline_isolated_pages(unsigned long start_pfn,
>>>   			continue;
>>>   		}
>>>   		/*
>>> -		 * At this point all remaining PageOffline() pages have a
>>> -		 * reference count of 0 and can simply be skipped.
>>> +		 * At this point all remaining PageOffline() pages must be
>>> +		 * "skippable" and have exactly one reference.
>>>   		 */
>>>   		if (PageOffline(page)) {
>>> -			BUG_ON(page_count(page));
>>> -			BUG_ON(PageBuddy(page));
>>> +			WARN_ON_ONCE(!PageOfflineSkippable(page));
>>> +			WARN_ON_ONCE(page_count(page) != 1);
>>>   			already_offline++;
>>>   			pfn++;
>>>   			continue;
>>> diff --git a/mm/page_isolation.c b/mm/page_isolation.c
>>> index b2fc5266e3d26..debd898b794ea 100644
>>> --- a/mm/page_isolation.c
>>> +++ b/mm/page_isolation.c
>>> @@ -121,16 +121,11 @@ static struct page *has_unmovable_pages(unsigned long start_pfn, unsigned long e
>>>   			continue;
>>>
>>>   		/*
>>> -		 * We treat all PageOffline() pages as movable when offlining
>>> -		 * to give drivers a chance to decrement their reference count
>>> -		 * in MEM_GOING_OFFLINE in order to indicate that these pages
>>> -		 * can be offlined as there are no direct references anymore.
>>> -		 * For actually unmovable PageOffline() where the driver does
>>> -		 * not support this, we will fail later when trying to actually
>>> -		 * move these pages that still have a reference count > 0.
>>> -		 * (false negatives in this function only)
>>> +		 * PageOffline() pages that are marked as "skippable"
>>> +		 * are treated like movable pages for memory offlining.
>>>   		 */
>>> -		if ((flags & MEMORY_OFFLINE) && PageOffline(page))
>>> +		if ((flags & MEMORY_OFFLINE) && PageOffline(page) &&
>>> +		    PageOfflineSkippable(page))
>>>   			continue;
>>
>> With this change, we are no longer give non-virtio-mem driver a chance
>> to decrease PageOffline(page) refcount? Or virtio-mem is the only driver
>> doing this?
>
> The only in-tree driver making use of this so far, yes.

Got it. Thanks.
>
>
> There is one tweak I'll have to perform in the virtio-mem driver to cover one corner case: when force-unloading the virtio-mem driver, we have to make sure that memory blocks with fake-offline pages cannot get offlined (re-onlining would be bad). I'll fix that up.


--
Best Regards,
Yan, Zi


^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: [PATCH v1 1/2] mm/memory_hotplug: PG_offline_skippable for offlining memory blocks with PageOffline pages
  2025-05-14 20:30       ` Zi Yan
@ 2025-05-19 14:39         ` David Hildenbrand
  0 siblings, 0 replies; 14+ messages in thread
From: David Hildenbrand @ 2025-05-19 14:39 UTC (permalink / raw)
  To: Zi Yan
  Cc: linux-kernel, linux-mm, virtualization, Michael S. Tsirkin,
	Jason Wang, Xuan Zhuo, Eugenio Pérez, Andrew Morton,
	Oscar Salvador, Vlastimil Babka, Suren Baghdasaryan, Michal Hocko,
	Brendan Jackman, Johannes Weiner, Matthew Wilcox (Oracle)


>>> Some thoughts after reading the related code:
>>>
>>> offline_pages() is a little convoluted, since it has two steps to make
>>> sure a range of memory can be offlined:
>>> 1. start_isolate_page_range() checks unmovable (maybe not migratable
>>> is more precise) pages using has_unmovable_pages(), but leaves unmovable
>>> PageOffline() page handling to the driver;
>>
>> Right, it's best-effort. For ZONE_MOVABLE we're skipping the checks completely.
>>
>> I was wondering for a longer time that -- with the isolate flag being a separate bit soon :) -- we could simply isolate the whole range, and then fail if we stumble over
> 
> Talking about that, I will need your input on my change to move_pfn_range_to_zone()
> in online_pages()[1]. Making MIGRATE_ISOLATE a separate bit means if you isolate
> a pageblock without a migratetype, unisolating it will give an unpredictable
> migratetype.

Sorry for my late reply on that. I think, the rule should be that you 
initialize the migratetype once, before any isolation+un-isolation.

So that should only require care when adding new pageblocks (memory 
hotplug).

> 
> [1] https://lore.kernel.org/linux-mm/20250509200111.3372279-3-ziy@nvidia.com/
> 
>> an unmovable page during migration. That is, get rid of has_unmovable_pages() entirely. Un-doing the isolation would then properly preserve the migratetype -- no harm done?
>>
>> Certainly worth a look. What do you think about that?
> 
> In principle, the method should work and simplifies the code. But it suffers more
> penalty (pages are migrated) when an unmovable page is encountered after the
> isolation, since before nothing will be migrated. To mitigate this,
> we probably would want some retry mechanism. For example, register a callback
> to each unmovable page and once the unmovable page is deallocated,
> alloc_contig_range() can move forward progress.

I was wondering if we could make the isolation code a  bit simpler. For 
example, set all involved pageblocks isolated. If any is already 
isolated, we can easily back off.

Once all are isolated, we could do the has_unmovable_pages() on the 
whole range, not a single pageblock.

Then we could start migrating.

I think the "issue" is, once we drop the zone lock, pages that are 
getting freed end up on the MIGRATE_ISOLATE list -- and I think we also 
must have moved the free pages already to the proper MIGRATE_ISOLATE 
list before we drop the zone lock.

So the possible cleanups might be limited.

> 
>>
>>> 2. scan_movable_pages() and do_migrate_range() migrate pages and handle
>>> different types of PageOffline() pages.
>>
>> Right, migrate what we can, skip these special once, and bail out on any others (at least in this patch, patch #2 restores the pre-virtio-mem behavior).
>>
>>>
>>> It might make the logic cleaner if start_isolate_page_range() can
>>> have a callback to allow driver to handle PageOffline() pages.
>>
>> We have to identify them repeadetly, so start_isolate_page_range() would not be sufficient. Also, callbacks are rather tricky for the case where we cannot really stabilize the page.
> 
> During start_isolate_page_range(), all pageblocks are isolated, so
> free pages cannot be used by anyone else, meaning no new PageOffline()
> pages or any other unmovable pages, right?

Yes, that is correct. Pages (incl. PageOffline) pages could get freed 
back to the buddy.

But we don't want to store per-page callbacks either way.

What would work is a new notifier chain (protected by RCU etc), that we 
can ask for each PageOffline page.

Not sure I prefer that of the approach here that is significantly 
simpler -- and we do have the spare bit for PageOffline pages in the new 
memdec world (for PageOffline, we'll probably need 2/3 flags in the long 
run).

> 
>>
>>>
>>> Hmm, Skippable PageOffline() pages are virtio-mem specific, I wonder
>>> why offline_pages() takes care of them. Shouldn't virtio-mem driver
>>> handle them?
>>> I also realize that the two steps in offline_pages()> are very similar to alloc_contig_range() and virtio-mem is using
>>> alloc_contig_range(). I wonder if the two steps in offline_pages()
>>> can be abstracted out and shared with virtio-mem.Yes, offline_pages()> operates at memory section granularity, different from the granularity,
>>> pageblock size, of alloc_contig_range() used in virtio-mem, but
>>> both are trying to check unmovable pages and migrate movable pages.
>>
>> Not sure I get completely what you mean. virtio-mem really wants to use alloc_contig_range() to allocate ranges it wants to unplug, because it will fail fast and allows for smaller ranges.
>>
>> offline_pages() is primarily also about offlining the memory section, which virtio-mem must do in order to remove the Linux memory block.
> 
> To clarify, I mean the steps of start_isolate_page_range(), scan_movable_pages(),
> do_migrate_range(), dissolve_free_hugetlb_folios() and test_pages_isolated() in
> offline_pages() is very similar to the steps of start_isolate_page_range(),
> __alloc_contig_migrate_range(), replace_free_hugepage_folios(),
> and test_pages_isolate() in alloc_contig_range(). So I wonder if a common
> function routine can be shared.

Ah, yes, I had the same idea a couple of years back, but never got to it.

There are subtle differences (e.g., memory offlining keeps retrying and 
is allowed to dissolve free hugetlb folios), but these could likely be 
modified using a parameter (similar to how we already special-case 
offlining).


Thanks!

-- 
Cheers,

David / dhildenb



^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2025-05-19 14:39 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-05-14 11:15 [PATCH v1 0/2] mm/memory_hotplug: introduce and use PG_offline_skippable David Hildenbrand
2025-05-14 11:15 ` [PATCH v1 1/2] mm/memory_hotplug: PG_offline_skippable for offlining memory blocks with PageOffline pages David Hildenbrand
2025-05-14 19:00   ` Zi Yan
2025-05-14 19:51     ` David Hildenbrand
2025-05-14 20:30       ` Zi Yan
2025-05-19 14:39         ` David Hildenbrand
2025-05-14 11:15 ` [PATCH v1 2/2] mm/memory_hotplug: remove -EBUSY handling from scan_movable_pages() David Hildenbrand
2025-05-14 19:57   ` David Hildenbrand
2025-05-14 13:45 ` [PATCH v1 0/2] mm/memory_hotplug: introduce and use PG_offline_skippable Zi Yan
2025-05-14 14:12   ` David Hildenbrand
2025-05-14 15:49     ` Zi Yan
2025-05-14 17:28       ` David Hildenbrand
2025-05-14 17:43         ` Zi Yan
2025-05-14 17:46           ` David Hildenbrand

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).