From: Rik van Riel
To: linux-kernel@vger.kernel.org
Cc: kernel-team@meta.com, linux-mm@kvack.org, david@kernel.org, willy@infradead.org, surenb@google.com, hannes@cmpxchg.org, ljs@kernel.org, ziy@nvidia.com, usama.arif@linux.dev, Rik van Riel
Subject: [RFC PATCH 16/45] mm: page_alloc: add background superpageblock defragmentation worker
Date: Thu, 30 Apr 2026 16:20:45 -0400
Message-ID: <20260430202233.111010-17-riel@surriel.com>
In-Reply-To: <20260430202233.111010-1-riel@surriel.com>
References: <20260430202233.111010-1-riel@surriel.com>

From: Rik van Riel

Add an event-driven background worker that evacuates movable pages from
tainted superpageblocks when free space runs low. Each superpageblock has
its own work_struct, so defrag targets the specific superpageblock that
needs it rather than scanning the entire system.
Defrag is triggered from spb_update_list() when a tainted superpageblock
drops below threshold: 1 or fewer free pageblocks, or less than 2
pageblocks' worth of free pages. The worker evacuates movable pageblocks
until free space recovers (at least 2 free pageblocks, or 3 pageblocks'
worth of free pages) or no movable pages remain.

Clean superpageblocks (only free + movable pages) are never defragged,
since they don't need it. Superpageblocks with no movable pages are
skipped, since there is nothing to evacuate.

[v19 fold] Drop the now-dead per-pageblock evacuate plumbing
(queue_pageblock_evacuate, evacuate_item, evacuate_pool,
evacuate_freelist, evacuate_item_alloc/free, evacuate_work_fn,
evacuate_irq_work_fn, plus pgdat->evacuate_pending and
pgdat->evacuate_irq_work). The new background superpageblock
defragmentation worker introduced here calls evacuate_pageblock()
directly from within its own work_struct, so the async per-pageblock
work-item pool, the irq_work indirection, and their per-pgdat init in
pageblock_evacuate_init() are no longer used.
Signed-off-by: Rik van Riel
Assisted-by: Claude:claude-opus-4.7 syzkaller
---
 include/linux/mmzone.h |  19 ++-
 mm/internal.h          |   2 +
 mm/mm_init.c           |  87 +++++++----
 mm/page_alloc.c        | 317 ++++++++++++++++++++++++++++-------------
 4 files changed, 301 insertions(+), 124 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index f226dfdd1e99..61fe939e7c0f 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -937,6 +937,23 @@ struct superpageblock {
 	 */
 	struct free_area free_area[NR_PAGE_ORDERS];
 
+#ifdef CONFIG_COMPACTION
+	/* Background defragmentation work for this superpageblock */
+	struct work_struct defrag_work;
+	struct irq_work defrag_irq_work;
+	bool defrag_active;
+	/*
+	 * Back-off state after a no-op defrag pass: defer the next attempt
+	 * until either nr_free_pages has grown by at least pageblock_nr_pages
+	 * or a cooldown elapses, so allocator hot paths cannot re-arm
+	 * defrag_work many times per second on an SB that cannot make progress.
+	 * defrag_last_no_progress_jiffies == 0 means the previous pass made
+	 * progress (or no pass has run yet).
+	 */
+	unsigned long defrag_last_no_progress_jiffies;
+	unsigned long defrag_last_no_progress_pages;
+#endif
+
 	/* Identity */
 	unsigned long start_pfn;
 	struct zone *zone;
@@ -1532,8 +1549,6 @@ typedef struct pglist_data {
 	struct task_struct *kcompactd;
 	bool proactive_compact_trigger;
 	struct workqueue_struct *evacuate_wq;
-	struct llist_head evacuate_pending;
-	struct irq_work evacuate_irq_work;
 #endif
 	/*
 	 * This is a per-node reserve of pages that are not available
diff --git a/mm/internal.h b/mm/internal.h
index 7ee73f9bb76c..02f1c7d36b85 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1026,9 +1026,11 @@ void init_cma_reserved_pageblock(struct page *page);
 #endif /* CONFIG_COMPACTION || CONFIG_CMA */
 
 #ifdef CONFIG_COMPACTION
+void init_superpageblock_defrag(struct superpageblock *sb);
 void superpageblock_clear_has_movable(struct zone *zone, struct page *page);
 void superpageblock_set_has_movable(struct zone *zone, struct page *page);
 #else
+static inline void init_superpageblock_defrag(struct superpageblock *sb) {}
 static inline void superpageblock_clear_has_movable(struct zone *zone,
 						    struct page *page) {}
 static inline void superpageblock_set_has_movable(struct zone *zone,
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 80cfc7c4de98..1f55ff3126a2 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1668,6 +1668,7 @@ void __meminit resize_zone_superpageblocks(struct zone *zone)
 	size_t alloc_size;
 	unsigned long i;
 	int nid = zone_to_nid(zone);
+	unsigned long flags;
 
 	if (!zone->spanned_pages)
 		return;
@@ -1690,6 +1691,37 @@ void __meminit resize_zone_superpageblocks(struct zone *zone)
 		return;
 	}
 
+	/* Initialize new superpageblocks (not from old array) first, outside lock */
+	if (zone->superpageblocks) {
+		old_offset = (zone->superpageblock_base_pfn - new_sb_base) >>
+			     SUPERPAGEBLOCK_ORDER;
+	} else {
+		old_offset = 0;
+	}
+
+	for (i = 0; i < new_nr_sbs; i++) {
+		struct superpageblock *sb = &new_sbs[i];
+		bool is_old = false;
+
+		if (zone->superpageblocks &&
+		    i >= old_offset &&
+		    i < old_offset + zone->nr_superpageblocks)
+			is_old = true;
+
+		if (is_old)
+			continue;
+
+		init_one_superpageblock(sb, zone,
+					new_sb_base + (i << SUPERPAGEBLOCK_ORDER),
+					zone_start, zone_end);
+	}
+
+	/*
+	 * Take zone->lock for the copy+fixup+swap to prevent concurrent
+	 * allocations from traversing free lists while we relocate them.
+	 */
+	spin_lock_irqsave(&zone->lock, flags);
+
 	/*
 	 * Copy existing superpageblocks to their new position.
 	 * The old array covers [old_base, old_base + old_nr * SB_SIZE).
@@ -1703,39 +1735,42 @@
 		       zone->nr_superpageblocks * sizeof(struct superpageblock));
 
 		/*
-		 * Fix up list_head pointers that were self-referencing
-		 * (empty lists) or pointing into the old array.
+		 * Fix up all list_head pointers: both the SPB category list
+		 * and every free_area[order].free_list[migratetype]. Pages on
+		 * buddy free lists have buddy_list.prev/next pointing at the
+		 * old array's list heads — those must be updated to point at
+		 * the new array.
 		 */
 		for (i = old_offset; i < old_offset + zone->nr_superpageblocks; i++) {
 			struct superpageblock *sb = &new_sbs[i];
+			struct superpageblock *old_sb =
+				&zone->superpageblocks[i - old_offset];
+			int order, mt;
 
-			if (list_empty(&sb->list))
+			/* Fix up sb->list (zone category/fullness list) */
+			if (list_empty(&old_sb->list))
 				INIT_LIST_HEAD(&sb->list);
 			else
-				list_replace(&zone->superpageblocks[i - old_offset].list,
-					     &sb->list);
-		}
-	}
-
-	/* Initialize new superpageblocks (slots not covered by old array) */
-	for (i = 0; i < new_nr_sbs; i++) {
-		struct superpageblock *sb = &new_sbs[i];
-		bool is_old = false;
+				list_replace(&old_sb->list, &sb->list);
+
+			/* Fix up all free_area list heads */
+			for (order = 0; order < NR_PAGE_ORDERS; order++) {
+				for (mt = 0; mt < MIGRATE_TYPES; mt++) {
+					struct list_head *old_list =
+						&old_sb->free_area[order].free_list[mt];
+					struct list_head *new_list =
+						&sb->free_area[order].free_list[mt];
+
+					if (list_empty(old_list))
+						INIT_LIST_HEAD(new_list);
+					else
+						list_replace(old_list, new_list);
+				}
+			}
 
-		if (zone->superpageblocks) {
-			old_offset = (zone->superpageblock_base_pfn - new_sb_base) >>
-				     SUPERPAGEBLOCK_ORDER;
-			if (i >= old_offset &&
-			    i < old_offset + zone->nr_superpageblocks)
-				is_old = true;
+			/* Reinitialize defrag work structs (contain stale pointers) */
+			init_superpageblock_defrag(sb);
 		}
-
-		if (is_old)
-			continue;
-
-		init_one_superpageblock(sb, zone,
-					new_sb_base + (i << SUPERPAGEBLOCK_ORDER),
-					zone_start, zone_end);
 	}
 
 	/*
@@ -1774,6 +1809,8 @@ void __meminit resize_zone_superpageblocks(struct zone *zone)
 	zone->superpageblock_base_pfn = new_sb_base;
 	zone->spb_kvmalloced = true;
 
+	spin_unlock_irqrestore(&zone->lock, flags);
+
 	/*
 	 * The boot-time array was allocated with memblock_alloc, which
 	 * is not individually freeable after boot. Only kvfree arrays
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cbf5f48d377e..07d2926ffb3d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -63,10 +63,6 @@
 #include "shuffle.h"
 #include "page_reporting.h"
 
-#ifdef CONFIG_COMPACTION
-static void queue_pageblock_evacuate(struct zone *zone, unsigned long pfn);
-#endif
-
 /* Free Page Internal flags: for internal, non-pcp variants of free_pages(). */
 typedef int __bitwise fpi_t;
 
@@ -753,8 +749,15 @@ static inline enum sb_fullness sb_get_fullness(struct superpageblock *sb,
  *
  * Called after counters change. Removes from current list (if any)
  * and adds to the appropriate list based on current fullness and
- * taint status.
+ * taint status. Also triggers background defragmentation if the
+ * superpageblock is tainted and running low on free space.
  */
+#ifdef CONFIG_COMPACTION
+static void spb_maybe_start_defrag(struct superpageblock *sb);
+#else
+static inline void spb_maybe_start_defrag(struct superpageblock *sb) {}
+#endif
+
 static void spb_update_list(struct superpageblock *sb)
 {
 	struct zone *zone = sb->zone;
@@ -771,6 +774,8 @@ static void spb_update_list(struct superpageblock *sb)
 	cat = spb_get_category(sb);
 	full = sb_get_fullness(sb, cat);
 	list_add_tail(&sb->list, &zone->spb_lists[cat][full]);
+
+	spb_maybe_start_defrag(sb);
 }
 
 /**
@@ -3297,12 +3302,6 @@ try_to_claim_block(struct zone *zone, struct page *page,
 		sb = pfn_to_superpageblock(zone, start_pfn);
 		if (sb)
 			spb_update_list(sb);
-
-		if ((start_type == MIGRATE_UNMOVABLE ||
-		     start_type == MIGRATE_RECLAIMABLE) &&
-		    get_pfnblock_bit(start_page, start_pfn,
-				     PB_has_movable))
-			queue_pageblock_evacuate(zone, start_pfn);
 	}
 #endif
 	return __rmqueue_smallest(zone, order, start_type);
@@ -8100,42 +8099,14 @@ void __init page_alloc_sysctl_init(void)
 
 #ifdef CONFIG_COMPACTION
 /*
- * Pageblock evacuation: asynchronously migrate movable pages out of
- * pageblocks that were stolen for unmovable/reclaimable allocations.
- * This keeps unmovable/reclaimable allocations concentrated in fewer
- * pageblocks, reducing long-term fragmentation.
- *
- * Uses a global pool of 64 pre-allocated work items (~3.5KB total)
- * and a per-pgdat workqueue to keep migration node-local.
+ * Pageblock evacuation: synchronously migrate movable pages out of a
+ * pageblock to consolidate fragmentation. Driven by the background
+ * superpageblock defragmentation worker (see below); has no per-pageblock
+ * scheduling infrastructure of its own.
  */
-struct evacuate_item {
-	struct work_struct work;
-	struct zone *zone;
-	unsigned long start_pfn;
-	struct llist_node free_node;
-};
-
-#define NR_EVACUATE_ITEMS 64
-static struct evacuate_item evacuate_pool[NR_EVACUATE_ITEMS];
-static struct llist_head evacuate_freelist;
-
-static struct evacuate_item *evacuate_item_alloc(void)
-{
-	struct llist_node *node;
-
-	node = llist_del_first(&evacuate_freelist);
-	if (!node)
-		return NULL;
-	return container_of(node, struct evacuate_item, free_node);
-}
-
-static void evacuate_item_free(struct evacuate_item *item)
-{
-	llist_add(&item->free_node, &evacuate_freelist);
-}
-
-static void evacuate_pageblock(struct zone *zone, unsigned long start_pfn)
+static void evacuate_pageblock(struct zone *zone, unsigned long start_pfn,
+			       bool force)
 {
 	unsigned long end_pfn = start_pfn + pageblock_nr_pages;
 	unsigned long pfn = start_pfn;
@@ -8153,8 +8124,14 @@ static void evacuate_pageblock(struct zone *zone, unsigned long start_pfn)
 		.gfp_mask = GFP_HIGHUSER_MOVABLE,
 	};
 
-	/* Verify this pageblock is still worth evacuating */
-	if (get_pageblock_migratetype(pfn_to_page(start_pfn)) == MIGRATE_MOVABLE)
+	/*
+	 * Verify this pageblock is still worth evacuating.
+	 * Skip if it reverted to MOVABLE (steal was undone) — unless
+	 * force is set (background defrag wants to clear movable pages
+	 * out of tainted superpageblocks regardless of pageblock type).
+	 */
+	if (!force &&
+	    get_pageblock_migratetype(pfn_to_page(start_pfn)) == MIGRATE_MOVABLE)
 		return;
 
 	INIT_LIST_HEAD(&cc.migratepages);
@@ -8209,86 +8186,206 @@ static void evacuate_pageblock(struct zone *zone, unsigned long start_pfn)
 		putback_movable_pages(&cc.migratepages);
 }
 
-static void evacuate_work_fn(struct work_struct *work)
+/*
+ * Background superpageblock defragmentation.
+ *
+ * Evacuate movable pageblocks from tainted superpageblocks to consolidate
+ * contamination. Triggered on-demand when a tainted superpageblock runs
+ * low on free space, rather than running on a fixed timer.
+ *
+ * Goals for tainted superpageblocks:
+ * - At least 2 free pageblocks if movable pageblocks still exist
+ * - Or 3 pageblocks worth of free pages while movable pages remain
+ * - Skip superpageblocks with no movable pages (nothing to evacuate)
+ */
+
+/* Target free space: 3 pageblocks worth of free pages */
+#define SPB_DEFRAG_FREE_PAGES_TARGET	(3UL * pageblock_nr_pages)
+
+/**
+ * spb_needs_defrag - Check if a superpageblock needs defragmentation
+ * @sb: superpageblock to check (may be NULL)
+ *
+ * Returns false for NULL, non-tainted, or clean superpageblocks.
+ * A tainted superpageblock needs defrag if it has movable pages that can
+ * be evacuated AND free space is running low (1 or fewer free
+ * pageblocks, or less than 2 pageblocks worth of free pages).
+ */
+/*
+ * Cooldown between defrag attempts that made no progress, in seconds.
+ * Long enough to keep the allocator hot path quiet on saturated SBs;
+ * short enough that a freshly-freed pageblock isn't ignored for long.
+ */
+#define SPB_DEFRAG_NOOP_COOLDOWN_SECS	5
+
+static bool spb_needs_defrag(struct superpageblock *sb)
 {
-	struct evacuate_item *item = container_of(work, struct evacuate_item,
-						  work);
-	evacuate_pageblock(item->zone, item->start_pfn);
-	evacuate_item_free(item);
+	if (!sb)
+		return false;
+
+	if (spb_get_category(sb) != SB_TAINTED)
+		return false;
+
+	/*
+	 * Back off if the previous pass made no progress: do not retry until
+	 * either the cooldown elapses or free pages have grown by at least a
+	 * pageblock's worth (a hint that there might be new material to
+	 * consolidate or evacuate).
+	 */
+	if (sb->defrag_last_no_progress_jiffies &&
+	    time_before(jiffies, sb->defrag_last_no_progress_jiffies +
+			SPB_DEFRAG_NOOP_COOLDOWN_SECS * HZ) &&
+	    sb->nr_free_pages < sb->defrag_last_no_progress_pages +
+				pageblock_nr_pages)
+		return false;
+
+	/*
+	 * Tainted superpageblocks: evacuate movable pages to concentrate
+	 * unmovable/reclaimable allocations. Migration targets are
+	 * allocated system-wide, so no internal free space is needed.
+	 * Maintain the tainted reserve so unmovable claims always
+	 * find room in existing tainted superpageblocks.
+	 */
+	return sb->nr_movable > 0 &&
+	       sb->nr_free < SPB_TAINTED_RESERVE;
 }
 
 /**
- * evacuate_irq_work_fn - IRQ work callback to drain pending evacuations
- * @work: the irq_work embedded in pg_data_t
+ * spb_defrag_done - Check if defrag target has been reached
+ * @sb: superpageblock being defragmented
  *
- * queue_work() can deadlock when called from inside the page allocator
- * because it may try to allocate memory with locks already held.
- * Use irq_work to defer the queue_work() calls to a safe context.
+ * Stop defragmenting when the superpageblock has enough free space
+ * or there are no more movable pages to evacuate.
  */
-static void evacuate_irq_work_fn(struct irq_work *work)
+static bool spb_defrag_done(struct superpageblock *sb)
 {
-	pg_data_t *pgdat = container_of(work, pg_data_t,
-					evacuate_irq_work);
-	struct llist_node *pending;
-	struct evacuate_item *item, *next;
+	/*
+	 * Tainted superpageblocks: keep evacuating movable pages until
+	 * the reserve of free pageblocks is restored, or until there
+	 * are no more movable pages to evacuate.
	 */
+	return !sb->nr_movable ||
+	       sb->nr_free >= SPB_TAINTED_RESERVE;
+}
 
-	if (!pgdat->evacuate_wq)
+/**
+ * spb_defrag_superpageblock - evacuate movable pages from a tainted superpageblock
+ * @sb: the tainted superpageblock to defragment
+ *
+ * Find any pageblock with movable pages (PB_has_movable) and evacuate
+ * them, leaving only unmovable, reclaimable, and free pages behind.
+ * Stop when the free space target is reached.
+ */
+static void spb_defrag_superpageblock(struct superpageblock *sb)
+{
+	unsigned long pfn, end_pfn;
+	struct zone *zone = sb->zone;
+
+	if (!sb->nr_movable)
 		return;
 
+	end_pfn = sb->start_pfn + SUPERPAGEBLOCK_NR_PAGES;
+
+	for (pfn = sb->start_pfn; pfn < end_pfn; pfn += pageblock_nr_pages) {
+		struct page *page;
+
+		if (spb_defrag_done(sb))
+			return;
+
+		if (!pfn_valid(pfn))
+			continue;
+
+		page = pfn_to_page(pfn);
+
+		/* Skip pageblocks without movable pages */
+		if (!get_pfnblock_bit(page, pfn, PB_has_movable))
+			continue;
+
+		/* Skip if fully free — nothing to evacuate */
+		if (get_pfnblock_bit(page, pfn, PB_all_free))
+			continue;
+
+		evacuate_pageblock(zone, pfn, true);
+	}
+}
+
+static void spb_defrag_work_fn(struct work_struct *work)
+{
+	struct superpageblock *sb = container_of(work, struct superpageblock,
+						 defrag_work);
+	u16 nr_free_before = sb->nr_free;
+
+	spb_defrag_superpageblock(sb);
+
 	/*
-	 * Collect all pending items first, then queue them. Use _safe
-	 * because evacuate_work_fn() may run immediately on another
-	 * CPU and free the item before we follow the next pointer.
+	 * If this pass produced no new free pageblocks, arm the no-progress
+	 * cooldown so spb_needs_defrag() rejects re-arms until either time
+	 * passes or nr_free_pages grows enough to suggest new material to
+	 * work on. Use jiffies | 1 so the field is never accidentally zero.
	 */
-	pending = llist_del_all(&pgdat->evacuate_pending);
-	llist_for_each_entry_safe(item, next, pending, free_node) {
-		INIT_WORK(&item->work, evacuate_work_fn);
-		queue_work(pgdat->evacuate_wq, &item->work);
+	if (sb->nr_free == nr_free_before) {
+		sb->defrag_last_no_progress_jiffies = jiffies | 1;
+		sb->defrag_last_no_progress_pages = sb->nr_free_pages;
+	} else {
+		sb->defrag_last_no_progress_jiffies = 0;
 	}
+
+	/* Allow new defrag requests for this superpageblock */
+	sb->defrag_active = false;
 }
 
 /**
- * queue_pageblock_evacuate - schedule async evacuation of movable pages
- * @zone: the zone containing the pageblock
- * @pfn: start PFN of the pageblock (must be pageblock-aligned)
+ * spb_defrag_irq_work_fn - IRQ work callback to safely queue defrag work
+ * @work: the irq_work embedded in struct superpageblock
  *
- * Called from the page allocator when a movable pageblock is claimed
- * for unmovable or reclaimable allocations. Queues the pageblock for
- * background migration of its remaining movable pages. Uses irq_work
- * to defer the actual queue_work() call outside the allocator's lock
- * context.
+ * queue_work() can deadlock when called from inside the page allocator
+ * because it may try to allocate memory with locks already held.
+ * Use irq_work to defer the queue_work() call to a safe context.
 */
-static void queue_pageblock_evacuate(struct zone *zone, unsigned long pfn)
+static void spb_defrag_irq_work_fn(struct irq_work *work)
 {
-	struct evacuate_item *item;
-	pg_data_t *pgdat = zone->zone_pgdat;
+	struct superpageblock *sb = container_of(work, struct superpageblock,
+						 defrag_irq_work);
+	pg_data_t *pgdat = sb->zone->zone_pgdat;
+
+	if (pgdat->evacuate_wq)
+		queue_work(pgdat->evacuate_wq, &sb->defrag_work);
+}
 
-	if (!pgdat->evacuate_irq_work.func)
+/**
+ * spb_maybe_start_defrag - Trigger defrag if a superpageblock needs it
+ * @sb: superpageblock whose counters just changed
+ *
+ * Called from counter update paths (under zone->lock). If the
+ * superpageblock is tainted and running low on free space, schedule
+ * irq_work to queue defrag work outside the allocator's lock context.
+ * The irq_work handler is set up by pageblock_evacuate_init();
+ * before that runs, defrag_irq_work.func is NULL and we skip.
+ */
+static void spb_maybe_start_defrag(struct superpageblock *sb)
+{
+	if (!spb_needs_defrag(sb))
 		return;
 
-	item = evacuate_item_alloc();
-	if (!item)
+	/* Don't pile up work items; one defrag pass per superpageblock at a time */
+	if (sb->defrag_active)
 		return;
 
-	item->zone = zone;
-	item->start_pfn = pfn;
-	llist_add(&item->free_node, &pgdat->evacuate_pending);
-	irq_work_queue(&pgdat->evacuate_irq_work);
+	if (sb->defrag_irq_work.func) {
+		sb->defrag_active = true;
+		irq_work_queue(&sb->defrag_irq_work);
+	}
 }
 
 static int __init pageblock_evacuate_init(void)
 {
-	int nid, i;
-
-	/* Initialize the global freelist of work items */
-	init_llist_head(&evacuate_freelist);
-	for (i = 0; i < NR_EVACUATE_ITEMS; i++)
-		llist_add(&evacuate_pool[i].free_node, &evacuate_freelist);
+	int nid;
 
 	/* Create a per-pgdat workqueue */
 	for_each_online_node(nid) {
 		pg_data_t *pgdat = NODE_DATA(nid);
 		char name[32];
+		int z;
 
 		snprintf(name, sizeof(name), "kevacuate/%d", nid);
 		pgdat->evacuate_wq = alloc_workqueue(name, WQ_MEM_RECLAIM, 1);
@@ -8297,14 +8394,40 @@ static int __init pageblock_evacuate_init(void)
 			continue;
 		}
 
-		init_llist_head(&pgdat->evacuate_pending);
-		init_irq_work(&pgdat->evacuate_irq_work,
-			      evacuate_irq_work_fn);
+		/* Initialize per-superpageblock defrag work structs */
+		for (z = 0; z < MAX_NR_ZONES; z++) {
+			struct zone *zone = &pgdat->node_zones[z];
+			unsigned long j;
+
+			if (!zone->superpageblocks)
+				continue;
+
+			for (j = 0; j < zone->nr_superpageblocks; j++) {
+				INIT_WORK(&zone->superpageblocks[j].defrag_work,
+					  spb_defrag_work_fn);
+				init_irq_work(&zone->superpageblocks[j].defrag_irq_work,
+					      spb_defrag_irq_work_fn);
+			}
+		}
 	}
 
 	return 0;
 }
 late_initcall(pageblock_evacuate_init);
+
+/**
+ * init_superpageblock_defrag - initialize defrag work structs for a superpageblock
+ * @sb: superpageblock to initialize
+ *
+ * Called during boot from pageblock_evacuate_init() and during memory
+ * hotplug from resize_zone_superpageblocks(). Safe to call multiple times
+ * on the same superpageblock (reinitializes work structs).
+ */
+void init_superpageblock_defrag(struct superpageblock *sb)
+{
+	INIT_WORK(&sb->defrag_work, spb_defrag_work_fn);
+	init_irq_work(&sb->defrag_irq_work, spb_defrag_irq_work_fn);
+}
 #endif /* CONFIG_COMPACTION */
 
 #ifdef CONFIG_CONTIG_ALLOC
-- 
2.52.0