Linux cgroups development
 help / color / mirror / Atom feed
* [RFC PATCH 1/5] mm, swap: add virtual swap device infrastructure
From: Nhat Pham @ 2026-05-28 21:29 UTC (permalink / raw)
  To: kasong
  Cc: Liam.Howlett, akpm, apopple, axelrasmussen, baohua, baolin.wang,
	bhe, byungchul, cgroups, chengming.zhou, chrisl, corbet, david,
	dev.jain, gourry, hannes, hughd, jannh, joshua.hahnjy, lance.yang,
	lenb, linux-doc, linux-kernel, linux-mm, linux-pm,
	lorenzo.stoakes, matthew.brost, mhocko, muchun.song, npache,
	nphamcs, pavel, peterx, peterz, pfalcato, rafael, rakie.kim,
	roman.gushchin, rppt, ryan.roberts, shakeel.butt, shikemeng,
	surenb, tglx, vbabka, weixugc, ying.huang, yosry.ahmed, yuanchu,
	zhengqi.arch, ziy, kernel-team, riel, haowenchao22
In-Reply-To: <20260528212955.1912856-1-nphamcs@gmail.com>

Create a massive virtual swap device at boot, along with the
dynamic cluster infrastructure that the rest of the vswap layer
is built on:

  - swap_cluster_info_dynamic: per-cluster dynamic info kept in
    an xarray, allowing arbitrary-size devices without the static
    cluster_info[] array.
  - virtual_table: a per-slot side table for vswap backend metadata
    (tag-encoded in low bits). The field itself is added in the
    next patch; this commit only introduces the dynamic cluster
    container that will hold it.
  - The size of the vswap device is ALIGN_DOWN(UINT_MAX,
    SWAPFILE_CLUSTER) pages.

Gated by a new CONFIG_VSWAP (depends on SWAP && 64BIT). For now,
the vswap device cannot be swapon'd or swapoff'd — it is created
unconditionally at boot when CONFIG_VSWAP=y and lives for the
lifetime of the kernel. The SWP_VSWAP flag and swap_is_vswap()
helper let hot paths skip per-device bookkeeping that doesn't
apply (avail-list management, percpu_ref get/put, hibernation
target lookup, etc.).

This patch is pure scaffolding: it introduces the device, the
dynamic-cluster machinery, and the general shape of a vswap
allocator (with sanity checks), but does not hook the vswap device
into any allocation path. folio_alloc_swap will not produce vswap
entries until a subsequent patch wires it in. Backends (zswap,
zero, physical disk) and the vswap-aware swap-out / swap-in /
writeback paths arrive in subsequent patches.

Suggested-by: Kairui Song <kasong@tencent.com>
Co-developed-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Nhat Pham <nphamcs@gmail.com>
---
 MAINTAINERS          |   1 +
 include/linux/swap.h |   4 +
 mm/Kconfig           |  10 ++
 mm/page_io.c         |  18 ++-
 mm/swap.h            |  46 ++++++--
 mm/swap_state.c      |  43 ++++---
 mm/swap_table.h      |   2 +
 mm/swapfile.c        | 264 +++++++++++++++++++++++++++++++++++++++----
 mm/vswap.h           |  29 +++++
 mm/zswap.c           |  10 +-
 10 files changed, 375 insertions(+), 52 deletions(-)
 create mode 100644 mm/vswap.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 9be179722d42..e96bd0bf6307 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -17041,6 +17041,7 @@ F:	mm/swap.h
 F:	mm/swap_table.h
 F:	mm/swap_state.c
 F:	mm/swapfile.c
+F:	mm/vswap.h
 
 MEMORY MANAGEMENT - THP (TRANSPARENT HUGE PAGE)
 M:	Andrew Morton <akpm@linux-foundation.org>
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 6d72778e6cc3..ee9b1e76b058 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -214,6 +214,7 @@ enum {
 	SWP_STABLE_WRITES = (1 << 11),	/* no overwrite PG_writeback pages */
 	SWP_SYNCHRONOUS_IO = (1 << 12),	/* synchronous IO is efficient */
 	SWP_HIBERNATION = (1 << 13),	/* pinned for hibernation */
+	SWP_VSWAP	= (1 << 14),	/* virtual swap device */
 					/* add others here before... */
 };
 
@@ -282,6 +283,7 @@ struct swap_info_struct {
 	struct work_struct reclaim_work; /* reclaim worker */
 	struct list_head discard_clusters; /* discard clusters list */
 	struct plist_node avail_list;   /* entry in swap_avail_head */
+	struct xarray cluster_info_pool; /* Xarray for vswap dynamic cluster info */
 };
 
 static inline swp_entry_t page_swap_entry(struct page *page)
@@ -473,6 +475,8 @@ void swap_free_hibernation_slot(swp_entry_t entry);
 
 static inline void put_swap_device(struct swap_info_struct *si)
 {
+	if (si->flags & SWP_VSWAP)
+		return;
 	percpu_ref_put(&si->users);
 }
 
diff --git a/mm/Kconfig b/mm/Kconfig
index 776b67c66e82..fc395ae3dde8 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -19,6 +19,16 @@ menuconfig SWAP
 	  used to provide more virtual memory than the actual RAM present
 	  in your computer.  If unsure say Y.
 
+config VSWAP
+	bool "Virtual swap device"
+	depends on SWAP && 64BIT
+	help
+	  Adds a virtual swap layer that decouples swap entries in page
+	  tables from physical backing storage. Swap entries are allocated
+	  from a virtual swap device and can be backed by zswap, a physical
+	  swapfile, or kept in memory — with the backing changeable at
+	  runtime without invalidating page table entries.
+
 config ZSWAP
 	bool "Compressed cache for swap pages"
 	depends on SWAP
diff --git a/mm/page_io.c b/mm/page_io.c
index f2d8fe7fd057..8126be6e4cfb 100644
--- a/mm/page_io.c
+++ b/mm/page_io.c
@@ -295,8 +295,7 @@ int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug)
 	}
 	rcu_read_unlock();
 
-	__swap_writepage(folio, swap_plug);
-	return 0;
+	return __swap_writepage(folio, swap_plug);
 out_unlock:
 	folio_unlock(folio);
 	return ret;
@@ -458,11 +457,18 @@ static void swap_writepage_bdev_async(struct folio *folio,
 	submit_bio(bio);
 }
 
-void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug)
+int __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug)
 {
 	struct swap_info_struct *sis = __swap_entry_to_info(folio->swap);
 
 	VM_BUG_ON_FOLIO(!folio_test_swapcache(folio), folio);
+
+	if (sis->flags & SWP_VSWAP) {
+		/* Prevent the page from getting reclaimed. */
+		folio_set_dirty(folio);
+		return AOP_WRITEPAGE_ACTIVATE;
+	}
+
 	/*
 	 * ->flags can be updated non-atomically,
 	 * but that will never affect SWP_FS_OPS, so the data_race
@@ -479,6 +485,7 @@ void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug)
 		swap_writepage_bdev_sync(folio, sis);
 	else
 		swap_writepage_bdev_async(folio, sis);
+	return 0;
 }
 
 void swap_write_unplug(struct swap_iocb *sio)
@@ -684,6 +691,11 @@ void swap_read_folio(struct folio *folio, struct swap_iocb **plug)
 	if (zswap_load(folio) != -ENOENT)
 		goto finish;
 
+	if (unlikely(sis->flags & SWP_VSWAP)) {
+		folio_unlock(folio);
+		goto finish;
+	}
+
 	/* We have to read from slower devices. Increase zswap protection. */
 	zswap_folio_swapin(folio);
 
diff --git a/mm/swap.h b/mm/swap.h
index 81c06aae7ccd..479ee5871cb9 100644
--- a/mm/swap.h
+++ b/mm/swap.h
@@ -65,6 +65,13 @@ struct swap_cluster_info {
 	struct list_head list;
 };
 
+struct swap_cluster_info_dynamic {
+	struct swap_cluster_info ci;	/* Underlying cluster info */
+	unsigned int index;		/* for cluster_index() */
+	struct rcu_head rcu;		/* For kfree_rcu deferred free */
+	/* Backend pointers (virtual_table) added in a later patch. */
+};
+
 /* All on-list cluster must have a non-zero flag. */
 enum swap_cluster_flags {
 	CLUSTER_FLAG_NONE = 0, /* For temporary off-list cluster */
@@ -75,6 +82,7 @@ enum swap_cluster_flags {
 	CLUSTER_FLAG_USABLE = CLUSTER_FLAG_FRAG,
 	CLUSTER_FLAG_FULL,
 	CLUSTER_FLAG_DISCARD,
+	CLUSTER_FLAG_DEAD,	/* Vswap dynamic cluster pending kfree_rcu */
 	CLUSTER_FLAG_MAX,
 };
 
@@ -108,9 +116,19 @@ static inline struct swap_info_struct *__swap_entry_to_info(swp_entry_t entry)
 static inline struct swap_cluster_info *__swap_offset_to_cluster(
 		struct swap_info_struct *si, pgoff_t offset)
 {
+	unsigned int cluster_idx = offset / SWAPFILE_CLUSTER;
+
 	VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
 	VM_WARN_ON_ONCE(offset >= roundup(si->max, SWAPFILE_CLUSTER));
-	return &si->cluster_info[offset / SWAPFILE_CLUSTER];
+
+	if (si->flags & SWP_VSWAP) {
+		struct swap_cluster_info_dynamic *ci_dyn;
+
+		ci_dyn = xa_load(&si->cluster_info_pool, cluster_idx);
+		return ci_dyn ? &ci_dyn->ci : NULL;
+	}
+
+	return &si->cluster_info[cluster_idx];
 }
 
 static inline struct swap_cluster_info *__swap_entry_to_cluster(swp_entry_t entry)
@@ -122,7 +140,7 @@ static inline struct swap_cluster_info *__swap_entry_to_cluster(swp_entry_t entr
 static __always_inline struct swap_cluster_info *__swap_cluster_lock(
 		struct swap_info_struct *si, unsigned long offset, bool irq)
 {
-	struct swap_cluster_info *ci = __swap_offset_to_cluster(si, offset);
+	struct swap_cluster_info *ci;
 
 	/*
 	 * Nothing modifies swap cache in an IRQ context. All access to
@@ -135,10 +153,24 @@ static __always_inline struct swap_cluster_info *__swap_cluster_lock(
 	 */
 	VM_WARN_ON_ONCE(!in_task());
 	VM_WARN_ON_ONCE(percpu_ref_is_zero(&si->users)); /* race with swapoff */
-	if (irq)
-		spin_lock_irq(&ci->lock);
-	else
-		spin_lock(&ci->lock);
+
+	rcu_read_lock();
+	ci = __swap_offset_to_cluster(si, offset);
+	if (ci) {
+		if (irq)
+			spin_lock_irq(&ci->lock);
+		else
+			spin_lock(&ci->lock);
+
+		if (ci->flags == CLUSTER_FLAG_DEAD) {
+			if (irq)
+				spin_unlock_irq(&ci->lock);
+			else
+				spin_unlock(&ci->lock);
+			ci = NULL;
+		}
+	}
+	rcu_read_unlock();
 	return ci;
 }
 
@@ -250,7 +282,7 @@ static inline void swap_read_unplug(struct swap_iocb *plug)
 }
 void swap_write_unplug(struct swap_iocb *sio);
 int swap_writeout(struct folio *folio, struct swap_iocb **swap_plug);
-void __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
+int __swap_writepage(struct folio *folio, struct swap_iocb **swap_plug);
 
 /* linux/mm/swap_state.c */
 extern struct address_space swap_space __read_mostly;
diff --git a/mm/swap_state.c b/mm/swap_state.c
index 04f5ce992401..b063c47138c5 100644
--- a/mm/swap_state.c
+++ b/mm/swap_state.c
@@ -90,8 +90,10 @@ struct folio *swap_cache_get_folio(swp_entry_t entry)
 	struct folio *folio;
 
 	for (;;) {
+		rcu_read_lock();
 		swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
 					swp_cluster_offset(entry));
+		rcu_read_unlock();
 		if (!swp_tb_is_folio(swp_tb))
 			return NULL;
 		folio = swp_tb_to_folio(swp_tb);
@@ -113,8 +115,10 @@ bool swap_cache_has_folio(swp_entry_t entry)
 {
 	unsigned long swp_tb;
 
+	rcu_read_lock();
 	swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
 				swp_cluster_offset(entry));
+	rcu_read_unlock();
 	return swp_tb_is_folio(swp_tb);
 }
 
@@ -130,8 +134,10 @@ void *swap_cache_get_shadow(swp_entry_t entry)
 {
 	unsigned long swp_tb;
 
+	rcu_read_lock();
 	swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
 				swp_cluster_offset(entry));
+	rcu_read_unlock();
 	if (swp_tb_is_shadow(swp_tb))
 		return swp_tb_to_shadow(swp_tb);
 	return NULL;
@@ -400,14 +406,16 @@ void __swap_cache_replace_folio(struct swap_cluster_info *ci,
  * -ENOENT / -EEXIST: Target swap entry is unavailable or cached, the caller
  *                    should abort or try to use the cached folio instead
  */
-static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
-					swp_entry_t targ_entry, gfp_t gfp,
+static struct folio *__swap_cache_alloc(swp_entry_t targ_entry, gfp_t gfp,
 					unsigned int order, struct vm_fault *vmf,
 					struct mempolicy *mpol, pgoff_t ilx)
 {
 	int err;
 	swp_entry_t entry;
 	struct folio *folio;
+	struct swap_cluster_info *ci;
+	struct swap_info_struct *si = __swap_entry_to_info(targ_entry);
+	unsigned long offset = swp_offset(targ_entry);
 	void *shadow = NULL;
 	unsigned short memcg_id;
 	unsigned long address, nr_pages = 1UL << order;
@@ -417,9 +425,12 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
 	entry.val = round_down(targ_entry.val, nr_pages);
 
 	/* Check if the slot and range are available, skip allocation if not */
-	spin_lock(&ci->lock);
-	err = __swap_cache_add_check(ci, targ_entry, nr_pages, NULL, NULL);
-	spin_unlock(&ci->lock);
+	err = -ENOENT;
+	ci = swap_cluster_lock(si, offset);
+	if (ci) {
+		err = __swap_cache_add_check(ci, targ_entry, nr_pages, NULL, NULL);
+		swap_cluster_unlock(ci);
+	}
 	if (unlikely(err))
 		return ERR_PTR(err);
 
@@ -440,10 +451,13 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
 		return ERR_PTR(-ENOMEM);
 
 	/* Double check the range is still not in conflict */
-	spin_lock(&ci->lock);
-	err = __swap_cache_add_check(ci, targ_entry, nr_pages, &shadow, &memcg_id);
+	err = -ENOENT;
+	ci = swap_cluster_lock(si, offset);
+	if (ci)
+		err = __swap_cache_add_check(ci, targ_entry, nr_pages, &shadow, &memcg_id);
 	if (unlikely(err)) {
-		spin_unlock(&ci->lock);
+		if (ci)
+			swap_cluster_unlock(ci);
 		folio_put(folio);
 		return ERR_PTR(err);
 	}
@@ -451,13 +465,14 @@ static struct folio *__swap_cache_alloc(struct swap_cluster_info *ci,
 	__folio_set_locked(folio);
 	__folio_set_swapbacked(folio);
 	__swap_cache_do_add_folio(ci, folio, entry);
-	spin_unlock(&ci->lock);
+	swap_cluster_unlock(ci);
 
 	if (mem_cgroup_swapin_charge_folio(folio, memcg_id,
 					   vmf ? vmf->vma->vm_mm : NULL, gfp)) {
-		spin_lock(&ci->lock);
+		/* The folio pins the cluster */
+		ci = swap_cluster_lock(si, offset);
 		__swap_cache_do_del_folio(ci, folio, entry, shadow);
-		spin_unlock(&ci->lock);
+		swap_cluster_unlock(ci);
 		folio_unlock(folio);
 		/* nr_pages refs from swap cache, 1 from allocation */
 		folio_put_refs(folio, nr_pages + 1);
@@ -501,9 +516,7 @@ struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp,
 {
 	int order, err;
 	struct folio *ret;
-	struct swap_cluster_info *ci;
 
-	ci = __swap_entry_to_cluster(targ_entry);
 	order = highest_order(orders);
 
 	/* orders must be non-zero, and must not exceed cluster size. */
@@ -511,12 +524,12 @@ struct folio *swap_cache_alloc_folio(swp_entry_t targ_entry, gfp_t gfp,
 		return ERR_PTR(-EINVAL);
 
 	do {
-		ret = __swap_cache_alloc(ci, targ_entry, gfp, order,
+		ret = __swap_cache_alloc(targ_entry, gfp, order,
 					 vmf, mpol, ilx);
 		if (!IS_ERR(ret))
 			break;
 		err = PTR_ERR(ret);
-		if (!order || (err && err != -EBUSY && err != -ENOMEM))
+		if (err && err != -EBUSY && err != -ENOMEM)
 			break;
 		count_mthp_stat(order, MTHP_STAT_SWPIN_FALLBACK);
 		order = next_order(&orders, order);
diff --git a/mm/swap_table.h b/mm/swap_table.h
index e6613e62f8d0..fd7f0fb9836a 100644
--- a/mm/swap_table.h
+++ b/mm/swap_table.h
@@ -255,6 +255,8 @@ static inline unsigned long swap_table_get(struct swap_cluster_info *ci,
 	unsigned long swp_tb;
 
 	VM_WARN_ON_ONCE(off >= SWAPFILE_CLUSTER);
+	if (!ci)
+		return SWP_TB_NULL;
 
 	rcu_read_lock();
 	table = rcu_dereference(ci->table);
diff --git a/mm/swapfile.c b/mm/swapfile.c
index a9a1e477fec9..f6d2529159ff 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -42,10 +42,12 @@
 #include <linux/suspend.h>
 #include <linux/zswap.h>
 #include <linux/plist.h>
+#include <linux/major.h>
 
 #include <asm/tlbflush.h>
 #include <linux/leafops.h>
 #include "swap_table.h"
+#include "vswap.h"
 #include "internal.h"
 #include "swap.h"
 
@@ -401,6 +403,8 @@ static inline bool cluster_is_usable(struct swap_cluster_info *ci, int order)
 static inline unsigned int cluster_index(struct swap_info_struct *si,
 					 struct swap_cluster_info *ci)
 {
+	if (si->flags & SWP_VSWAP)
+		return container_of(ci, struct swap_cluster_info_dynamic, ci)->index;
 	return ci - si->cluster_info;
 }
 
@@ -734,6 +738,22 @@ static void free_cluster(struct swap_info_struct *si, struct swap_cluster_info *
 		return;
 	}
 
+	if (si->flags & SWP_VSWAP) {
+		struct swap_cluster_info_dynamic *ci_dyn;
+
+		ci_dyn = container_of(ci, struct swap_cluster_info_dynamic, ci);
+		if (ci->flags != CLUSTER_FLAG_NONE) {
+			spin_lock(&si->lock);
+			list_del(&ci->list);
+			spin_unlock(&si->lock);
+		}
+		swap_cluster_free_table(ci);
+		xa_erase(&si->cluster_info_pool, ci_dyn->index);
+		ci->flags = CLUSTER_FLAG_DEAD;
+		kfree_rcu(ci_dyn, rcu);
+		return;
+	}
+
 	__free_cluster(si, ci);
 }
 
@@ -836,14 +856,21 @@ static int swap_cluster_setup_bad_slot(struct swap_info_struct *si,
  * stolen by a lower order). @usable will be set to false if that happens.
  */
 static bool cluster_reclaim_range(struct swap_info_struct *si,
-				  struct swap_cluster_info *ci,
+				  struct swap_cluster_info **pcip,
 				  unsigned long start, unsigned int order,
 				  bool *usable)
 {
+	struct swap_cluster_info *ci = *pcip;
 	unsigned int nr_pages = 1 << order;
 	unsigned long offset = start, end = start + nr_pages;
 	unsigned long swp_tb;
 
+	/*
+	 * Take RCU read lock before releasing the cluster lock to keep ci
+	 * alive — for vswap dynamic clusters, ci is freed via kfree_rcu
+	 * and the grace period could otherwise elapse in the window.
+	 */
+	rcu_read_lock();
 	spin_unlock(&ci->lock);
 	do {
 		swp_tb = swap_table_get(ci, offset % SWAPFILE_CLUSTER);
@@ -853,7 +880,15 @@ static bool cluster_reclaim_range(struct swap_info_struct *si,
 			if (__try_to_reclaim_swap(si, offset, TTRS_ANYWAY) < 0)
 				break;
 	} while (++offset < end);
-	spin_lock(&ci->lock);
+	rcu_read_unlock();
+
+	/* Re-lookup: dynamic cluster may have been freed while lock was dropped */
+	ci = swap_cluster_lock(si, start);
+	*pcip = ci;
+	if (!ci) {
+		*usable = false;
+		return false;
+	}
 
 	/*
 	 * We just dropped ci->lock so cluster could be used by another
@@ -984,7 +1019,8 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 		if (!cluster_scan_range(si, ci, offset, nr_pages, &need_reclaim))
 			continue;
 		if (need_reclaim) {
-			ret = cluster_reclaim_range(si, ci, offset, order, &usable);
+			ret = cluster_reclaim_range(si, &ci, offset, order,
+						    &usable);
 			if (!usable)
 				goto out;
 			if (cluster_is_empty(ci))
@@ -1002,8 +1038,10 @@ static unsigned int alloc_swap_scan_cluster(struct swap_info_struct *si,
 		break;
 	}
 out:
-	relocate_cluster(si, ci);
-	swap_cluster_unlock(ci);
+	if (ci) {
+		relocate_cluster(si, ci);
+		swap_cluster_unlock(ci);
+	}
 	if (si->flags & SWP_SOLIDSTATE) {
 		this_cpu_write(percpu_swap_cluster.offset[order], next);
 		this_cpu_write(percpu_swap_cluster.si[order], si);
@@ -1035,6 +1073,41 @@ static unsigned int alloc_swap_scan_list(struct swap_info_struct *si,
 	return found;
 }
 
+static unsigned int alloc_swap_scan_dynamic(struct swap_info_struct *si,
+					    struct folio *folio)
+{
+	struct swap_cluster_info_dynamic *ci_dyn;
+	struct swap_cluster_info *ci;
+	unsigned long offset;
+
+	WARN_ON(!(si->flags & SWP_VSWAP));
+
+	ci_dyn = kzalloc(sizeof(*ci_dyn), GFP_ATOMIC);
+	if (!ci_dyn)
+		return SWAP_ENTRY_INVALID;
+
+	spin_lock_init(&ci_dyn->ci.lock);
+	INIT_LIST_HEAD(&ci_dyn->ci.list);
+
+	if (swap_cluster_alloc_table(&ci_dyn->ci, GFP_ATOMIC)) {
+		kfree(ci_dyn);
+		return SWAP_ENTRY_INVALID;
+	}
+
+	if (xa_alloc(&si->cluster_info_pool, &ci_dyn->index, ci_dyn,
+		     XA_LIMIT(1, DIV_ROUND_UP(si->max, SWAPFILE_CLUSTER) - 1),
+		     GFP_ATOMIC)) {
+		swap_cluster_free_table(&ci_dyn->ci);
+		kfree(ci_dyn);
+		return SWAP_ENTRY_INVALID;
+	}
+
+	ci = &ci_dyn->ci;
+	spin_lock(&ci->lock);
+	offset = cluster_offset(si, ci);
+	return alloc_swap_scan_cluster(si, ci, folio, offset);
+}
+
 static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
 {
 	long to_scan = 1;
@@ -1057,7 +1130,9 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
 				spin_unlock(&ci->lock);
 				nr_reclaim = __try_to_reclaim_swap(si, offset,
 								   TTRS_ANYWAY);
-				spin_lock(&ci->lock);
+				ci = swap_cluster_lock(si, offset);
+				if (!ci)
+					goto next;
 				if (nr_reclaim) {
 					offset += abs(nr_reclaim);
 					continue;
@@ -1071,6 +1146,7 @@ static void swap_reclaim_full_clusters(struct swap_info_struct *si, bool force)
 			relocate_cluster(si, ci);
 
 		swap_cluster_unlock(ci);
+next:
 		if (to_scan <= 0)
 			break;
 		cond_resched();
@@ -1141,6 +1217,12 @@ static unsigned long cluster_alloc_swap_entry(struct swap_info_struct *si,
 			goto done;
 	}
 
+	if (si->flags & SWP_VSWAP) {
+		found = alloc_swap_scan_dynamic(si, folio);
+		if (found)
+			goto done;
+	}
+
 	if (!(si->flags & SWP_PAGE_DISCARD)) {
 		found = alloc_swap_scan_list(si, &si->free_clusters, folio, false);
 		if (found)
@@ -1259,6 +1341,13 @@ static void add_to_avail_list(struct swap_info_struct *si, bool swapon)
 			goto skip;
 	}
 
+	/*
+	 * Keep vswap off the avail list — it is not allocated from by
+	 * the physical swap allocator (swap_alloc_fast/slow).
+	 */
+	if (swap_is_vswap(si))
+		goto skip;
+
 	plist_add(&si->avail_list, &swap_avail_head);
 
 skip:
@@ -1341,6 +1430,10 @@ static void swap_range_free(struct swap_info_struct *si, unsigned long offset,
 
 static bool get_swap_device_info(struct swap_info_struct *si)
 {
+	/* vswap device is always alive — no ref counting needed */
+	if (swap_is_vswap(si))
+		return true;
+
 	if (!percpu_ref_tryget_live(&si->users))
 		return false;
 	/*
@@ -1376,11 +1469,11 @@ static bool swap_alloc_fast(struct folio *folio)
 		return false;
 
 	ci = swap_cluster_lock(si, offset);
-	if (cluster_is_usable(ci, order)) {
+	if (ci && cluster_is_usable(ci, order)) {
 		if (cluster_is_empty(ci))
 			offset = cluster_offset(si, ci);
 		alloc_swap_scan_cluster(si, ci, folio, offset);
-	} else {
+	} else if (ci) {
 		swap_cluster_unlock(ci);
 	}
 
@@ -1484,6 +1577,7 @@ int swap_retry_table_alloc(swp_entry_t entry, gfp_t gfp)
 	if (!si)
 		return 0;
 
+	/* Entry is in use (being faulted in), so its cluster is alive. */
 	ci = __swap_offset_to_cluster(si, offset);
 	ret = swap_extend_table_alloc(si, ci, gfp);
 
@@ -1711,6 +1805,7 @@ int folio_alloc_swap(struct folio *folio)
 	unsigned int order = folio_order(folio);
 	unsigned int size = 1 << order;
 
+	VM_WARN_ON_FOLIO(folio_test_swapcache(folio), folio);
 	VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
 	VM_BUG_ON_FOLIO(!folio_test_uptodate(folio), folio);
 
@@ -1873,7 +1968,8 @@ struct swap_info_struct *get_swap_device(swp_entry_t entry)
 	return NULL;
 put_out:
 	pr_err("%s: %s%08lx\n", __func__, Bad_offset, entry.val);
-	percpu_ref_put(&si->users);
+	if (!swap_is_vswap(si))
+		percpu_ref_put(&si->users);
 	return NULL;
 }
 
@@ -2005,6 +2101,7 @@ static bool folio_maybe_swapped(struct folio *folio)
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_locked(folio), folio);
 	VM_WARN_ON_ONCE_FOLIO(!folio_test_swapcache(folio), folio);
 
+	/* Folio is locked and in swap cache, so ci->count > 0: cluster is alive. */
 	ci = __swap_entry_to_cluster(entry);
 	ci_off = swp_cluster_offset(entry);
 	ci_end = ci_off + folio_nr_pages(folio);
@@ -2142,9 +2239,9 @@ swp_entry_t swap_alloc_hibernation_slot(int type)
 	pcp_offset = this_cpu_read(percpu_swap_cluster.offset[0]);
 	if (pcp_si == si && pcp_offset) {
 		ci = swap_cluster_lock(si, pcp_offset);
-		if (cluster_is_usable(ci, 0))
+		if (ci && cluster_is_usable(ci, 0))
 			offset = alloc_swap_scan_cluster(si, ci, NULL, pcp_offset);
-		else
+		else if (ci)
 			swap_cluster_unlock(ci);
 	}
 	if (!offset)
@@ -2192,6 +2289,9 @@ static int __find_hibernation_swap_type(dev_t device, sector_t offset)
 
 		if (!(sis->flags & SWP_WRITEOK))
 			continue;
+		/* vswap has no bdev — never a hibernation target */
+		if (swap_is_vswap(sis))
+			continue;
 
 		if (device == sis->bdev->bd_dev) {
 			struct swap_extent *se = first_se(sis);
@@ -2379,6 +2479,9 @@ int find_first_swap(dev_t *device)
 
 		if (!(sis->flags & SWP_WRITEOK))
 			continue;
+		/* vswap has no bdev — never a hibernation target */
+		if (swap_is_vswap(sis))
+			continue;
 		*device = sis->bdev->bd_dev;
 		spin_unlock(&swap_lock);
 		return type;
@@ -2590,8 +2693,10 @@ static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
 						&vmf);
 		}
 		if (!folio) {
+			rcu_read_lock();
 			swp_tb = swap_table_get(__swap_entry_to_cluster(entry),
 						swp_cluster_offset(entry));
+			rcu_read_unlock();
 			if (swp_tb_get_count(swp_tb) <= 0)
 				continue;
 			return -ENOMEM;
@@ -2737,8 +2842,10 @@ static unsigned int find_next_to_unuse(struct swap_info_struct *si,
 	 * allocations from this area (while holding swap_lock).
 	 */
 	for (i = prev + 1; i < si->max; i++) {
+		rcu_read_lock();
 		swp_tb = swap_table_get(__swap_offset_to_cluster(si, i),
 					i % SWAPFILE_CLUSTER);
+		rcu_read_unlock();
 		if (!swp_tb_is_null(swp_tb) && !swp_tb_is_bad(swp_tb))
 			break;
 		if ((i % LATENCY_LIMIT) == 0)
@@ -2977,6 +3084,11 @@ static int setup_swap_extents(struct swap_info_struct *sis,
 	struct inode *inode = mapping->host;
 	int ret;
 
+	if (sis->flags & SWP_VSWAP) {
+		*span = 0;
+		return 0;
+	}
+
 	if (S_ISBLK(inode->i_mode)) {
 		ret = add_swap_extent(sis, 0, sis->max, 0);
 		*span = sis->pages;
@@ -3001,15 +3113,22 @@ static int setup_swap_extents(struct swap_info_struct *sis,
 
 static void _enable_swap_info(struct swap_info_struct *si)
 {
-	atomic_long_add(si->pages, &nr_swap_pages);
-	total_swap_pages += si->pages;
+	if (!swap_is_vswap(si)) {
+		atomic_long_add(si->pages, &nr_swap_pages);
+		total_swap_pages += si->pages;
+	}
 
 	assert_spin_locked(&swap_lock);
 
-	plist_add(&si->list, &swap_active_head);
-
-	/* Add back to available list */
-	add_to_avail_list(si, true);
+	/*
+	 * Vswap has no backing file and no swapoff support — keep it
+	 * off swap_active_head (used by swapoff filename lookup and
+	 * swap_sync_discard) and swap_avail_head (physical allocator).
+	 */
+	if (!swap_is_vswap(si)) {
+		plist_add(&si->list, &swap_active_head);
+		add_to_avail_list(si, true);
+	}
 }
 
 /*
@@ -3046,6 +3165,8 @@ static void wait_for_allocation(struct swap_info_struct *si)
 	struct swap_cluster_info *ci;
 
 	BUG_ON(si->flags & SWP_WRITEOK);
+	if (si->flags & SWP_VSWAP)
+		return;
 
 	for (offset = 0; offset < end; offset += SWAPFILE_CLUSTER) {
 		ci = swap_cluster_lock(si, offset);
@@ -3184,7 +3305,8 @@ SYSCALL_DEFINE1(swapoff, const char __user *, specialfile)
 
 	destroy_swap_extents(p, p->swap_file);
 
-	if (!(p->flags & SWP_SOLIDSTATE))
+	if (!(p->flags & SWP_VSWAP) &&
+	    !(p->flags & SWP_SOLIDSTATE))
 		atomic_dec(&nr_rotate_swap);
 
 	mutex_lock(&swapon_mutex);
@@ -3294,6 +3416,19 @@ static void swap_stop(struct seq_file *swap, void *v)
 	mutex_unlock(&swapon_mutex);
 }
 
+static const char *swap_type_str(struct swap_info_struct *si)
+{
+	struct file *file = si->swap_file;
+
+	if (si->flags & SWP_VSWAP)
+		return "vswap\t";
+
+	if (S_ISBLK(file_inode(file)->i_mode))
+		return "partition";
+
+	return "file\t";
+}
+
 static int swap_show(struct seq_file *swap, void *v)
 {
 	struct swap_info_struct *si = v;
@@ -3313,8 +3448,7 @@ static int swap_show(struct seq_file *swap, void *v)
 	len = seq_file_path(swap, file, " \t\n\\");
 	seq_printf(swap, "%*s%s\t%lu\t%s%lu\t%s%d\n",
 			len < 40 ? 40 - len : 1, " ",
-			S_ISBLK(file_inode(file)->i_mode) ?
-				"partition" : "file\t",
+			swap_type_str(si),
 			bytes, bytes < 10000000 ? "\t" : "",
 			inuse, inuse < 10000000 ? "\t" : "",
 			si->prio);
@@ -3446,7 +3580,6 @@ static int claim_swapfile(struct swap_info_struct *si, struct inode *inode)
 	return 0;
 }
 
-
 /*
  * Find out how many pages are allowed for a single swap device. There
  * are two limiting factors:
@@ -3552,10 +3685,43 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
 				    unsigned long maxpages)
 {
 	unsigned long nr_clusters = DIV_ROUND_UP(maxpages, SWAPFILE_CLUSTER);
-	struct swap_cluster_info *cluster_info;
+	struct swap_cluster_info *cluster_info = NULL;
+	struct swap_cluster_info_dynamic *ci_dyn;
 	int err = -ENOMEM;
 	unsigned long i;
 
+	/* For SWP_VSWAP files, initialize Xarray pool instead of static array */
+	if (si->flags & SWP_VSWAP) {
+		/*
+		 * Pre-allocate cluster 0 and mark slot 0 (header page)
+		 * as bad so the allocator never hands out page offset 0.
+		 */
+		ci_dyn = kzalloc(sizeof(*ci_dyn), GFP_KERNEL);
+		if (!ci_dyn)
+			goto err;
+		spin_lock_init(&ci_dyn->ci.lock);
+		INIT_LIST_HEAD(&ci_dyn->ci.list);
+
+		nr_clusters = 0;
+		xa_init_flags(&si->cluster_info_pool, XA_FLAGS_ALLOC);
+		err = xa_insert(&si->cluster_info_pool, 0, ci_dyn, GFP_KERNEL);
+		if (err) {
+			kfree(ci_dyn);
+			goto err;
+		}
+
+		err = swap_cluster_setup_bad_slot(si, &ci_dyn->ci, 0, false);
+		if (err) {
+			xa_erase(&si->cluster_info_pool, 0);
+			swap_cluster_free_table(&ci_dyn->ci);
+			kfree(ci_dyn);
+			xa_destroy(&si->cluster_info_pool);
+			goto err;
+		}
+
+		goto setup_cluster_info;
+	}
+
 	cluster_info = kvzalloc_objs(*cluster_info, nr_clusters);
 	if (!cluster_info)
 		goto err;
@@ -3580,6 +3746,10 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
 	err = swap_cluster_setup_bad_slot(si, cluster_info, 0, false);
 	if (err)
 		goto err;
+
+	if (!swap_header)
+		goto setup_cluster_info;
+
 	for (i = 0; i < swap_header->info.nr_badpages; i++) {
 		unsigned int page_nr = swap_header->info.badpages[i];
 
@@ -3599,6 +3769,7 @@ static int setup_swap_clusters_info(struct swap_info_struct *si,
 			goto err;
 	}
 
+setup_cluster_info:
 	INIT_LIST_HEAD(&si->free_clusters);
 	INIT_LIST_HEAD(&si->full_clusters);
 	INIT_LIST_HEAD(&si->discard_clusters);
@@ -3635,7 +3806,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 	struct dentry *dentry;
 	int prio;
 	int error;
-	union swap_header *swap_header;
+	union swap_header *swap_header = NULL;
 	int nr_extents;
 	sector_t span;
 	unsigned long maxpages;
@@ -3709,7 +3880,6 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 		goto bad_swap_unlock_inode;
 	}
 	swap_header = kmap_local_folio(folio, 0);
-
 	maxpages = read_swap_header(si, swap_header, inode);
 	if (unlikely(!maxpages)) {
 		error = -EINVAL;
@@ -3744,7 +3914,7 @@ SYSCALL_DEFINE2(swapon, const char __user *, specialfile, int, swap_flags)
 
 	if (si->bdev && !bdev_rot(si->bdev)) {
 		si->flags |= SWP_SOLIDSTATE;
-	} else {
+	} else if (!(si->flags & SWP_SOLIDSTATE)) {
 		atomic_inc(&nr_rotate_swap);
 		inced_nr_rotate_swap = true;
 	}
@@ -3966,3 +4136,47 @@ static int __init swapfile_init(void)
 	return 0;
 }
 subsys_initcall(swapfile_init);
+
+#ifdef CONFIG_VSWAP
+struct swap_info_struct *vswap_si;
+
+static int __init vswap_init(void)
+{
+	struct swap_info_struct *si;
+	unsigned long maxpages;
+	int err;
+
+	si = alloc_swap_info();
+	if (IS_ERR(si))
+		return PTR_ERR(si);
+
+	maxpages = min(swapfile_maximum_size,
+		       ALIGN_DOWN((unsigned long)UINT_MAX, SWAPFILE_CLUSTER));
+	si->flags |= SWP_VSWAP | SWP_SOLIDSTATE | SWP_WRITEOK;
+	si->bdev = NULL;
+	si->max = maxpages;
+	si->pages = maxpages - 1;
+	si->prio = SHRT_MAX;
+	si->list.prio = -si->prio;
+	si->avail_list.prio = -si->prio;
+
+	err = setup_swap_clusters_info(si, NULL, maxpages);
+	if (err)
+		goto fail;
+
+	mutex_lock(&swapon_mutex);
+	enable_swap_info(si);
+	mutex_unlock(&swapon_mutex);
+
+	vswap_si = si;
+	pr_info("vswap: created virtual swap device (%lu pages)\n", maxpages);
+	return 0;
+
+fail:
+	spin_lock(&swap_lock);
+	si->flags = 0;
+	spin_unlock(&swap_lock);
+	return err;
+}
+late_initcall(vswap_init);
+#endif
diff --git a/mm/vswap.h b/mm/vswap.h
new file mode 100644
index 000000000000..094ff16cb5a4
--- /dev/null
+++ b/mm/vswap.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Virtual swap space
+ *
+ * Copyright (C) 2026 Nhat Pham
+ */
+#ifndef _MM_VSWAP_H
+#define _MM_VSWAP_H
+
+#include <linux/swap.h>
+
+#ifdef CONFIG_VSWAP
+
+extern struct swap_info_struct *vswap_si;
+
+static inline bool swap_is_vswap(struct swap_info_struct *si)
+{
+	return si->flags & SWP_VSWAP;
+}
+
+#else
+
+static inline bool swap_is_vswap(struct swap_info_struct *si)
+{
+	return false;
+}
+
+#endif /* CONFIG_VSWAP */
+#endif /* _MM_VSWAP_H */
diff --git a/mm/zswap.c b/mm/zswap.c
index 761cd699e0a3..993406074d58 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -994,11 +994,16 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	struct swap_info_struct *si;
 	int ret = 0;
 
-	/* try to allocate swap cache folio */
 	si = get_swap_device(swpentry);
 	if (!si)
 		return -EEXIST;
 
+	if (si->flags & SWP_VSWAP) {
+		put_swap_device(si);
+		return -EINVAL;
+	}
+
+	/* try to allocate swap cache folio */
 	mpol = get_task_policy(current);
 	folio = swap_cache_alloc_folio(swpentry, GFP_KERNEL, BIT(0), NULL, mpol,
 				       NO_INTERLEAVE_INDEX);
@@ -1049,7 +1054,8 @@ static int zswap_writeback_entry(struct zswap_entry *entry,
 	folio_set_reclaim(folio);
 
 	/* start writeback */
-	__swap_writepage(folio, NULL);
+	ret = __swap_writepage(folio, NULL);
+	WARN_ON_ONCE(ret);
 
 out:
 	if (ret) {
-- 
2.53.0-Meta


^ permalink raw reply related

* [RFC PATCH 0/5] mm, swap: Virtual Swap Space (Swap Table Edition)
From: Nhat Pham @ 2026-05-28 21:29 UTC (permalink / raw)
  To: kasong
  Cc: Liam.Howlett, akpm, apopple, axelrasmussen, baohua, baolin.wang,
	bhe, byungchul, cgroups, chengming.zhou, chrisl, corbet, david,
	dev.jain, gourry, hannes, hughd, jannh, joshua.hahnjy, lance.yang,
	lenb, linux-doc, linux-kernel, linux-mm, linux-pm,
	lorenzo.stoakes, matthew.brost, mhocko, muchun.song, npache,
	nphamcs, pavel, peterx, peterz, pfalcato, rafael, rakie.kim,
	roman.gushchin, rppt, ryan.roberts, shakeel.butt, shikemeng,
	surenb, tglx, vbabka, weixugc, ying.huang, yosry.ahmed, yuanchu,
	zhengqi.arch, ziy, kernel-team, riel, haowenchao22

Based on: mm-unstable @ 444fc9435e57 + swap-table phase IV v5 [2].

I manually adapted Kairui's ghost device implementation (from [4])
for my vswap device. I've credited him as Co-developed-by on Patch I
since a substantial portion of the dynamic-cluster infrastructure is
his (I did propose the idea of using xarray/radix tree for dynamic
swap clusters allocation and management though :P).

From here on out, for simplicity, I will refer to swap table phase IV
as "P4", and the older v6 virtual swap space implementation as "v6".


I. Context and Motivation

Virtual swap decouples PTE swap entries from physical swap backing,
allowing pages to be compressed by zswap without pre-allocating a
physical swap slot. See [1] for a more involved discussion on the
motivation of swap virtualization, but in short, a swap virtualization
scheme needs to satisfy 3 requirements, which are all driven by real
pressing use cases of many parties using swap:

1. No backend coupling. For instance, a zswap entry should not
   require a physical swap slot to be allocated. This prevents
   wastage of coupled backend resources, and allows zswap to be
   used in systems that do not have enough storage capacity for
   physical swap (without having to resort to silly hacks). The same
   should hold for zero-filled swap pages, and swap cached folios too.

2. Dynamic swap space. The virtualization scheme should not require
   static provisioning, to accommodate dynamic and unpredictable swap
   usage. This massively simplifies operational provisioning, and
   allow the in-memory compression backend to be maximally utilized.
   It also makes sure we do not induce unbounded overhead on unused
   swap capacity.

3. Efficient backend transfer. The virtualization scheme should not
   introduces PTE/rmap walking overhead for backend transfer. This
   is crucial for systems that want to support multiple swap backends
   in a tiering fashion (for e.g zswap -> disk swap).

There are a lot of other future use cases as well - see [1] for more
details.

This series reimplements the virtual swap space concept (see [1])
on top of Kairui Song's swap table infrastructure, on top of [2]
and in accordance with his proposal in [3]. The proposal's idea
is interesting, so I decided to give it a shot myself. I'm still not
100% sure that this is bug-proof, but hey, it compiles, and has
not crashed in my simple stress testing :)

The prototype here is feature-complete relative to the swap-table P4
baseline — swapout, swapin, freeing, swapoff, zswap writeback, zswap
shrinker, memcg charging, and THP swapin all work for
both vswap and direct-physical entries — and satisfies all three
requirements above: no backend coupling (zswap/zero entries hold no
physical slot), dynamic swap space (clusters allocated on demand via
xarray, no static provisioning), and efficient backend transfer
(in-place vtable updates, no PTE/rmap walking).

II. Design

With vswap, pages are assigned virtual swap entries on a ghost device
with no backing storage. These entries are backed by zswap, zero pages,
or (lazily) physical swap slots. Physical backing is allocated only
when needed — on zswap writeback or reclaim writeout, after the rmap
step.

Compared to the standalone v6 implementation [1], which introduces a
24-byte per-entry swap descriptor and its own cluster allocator, this
edition uses swap_table infrastructure, and share a lot of the allocator
logic. Per-slot metadata is stored in a tag-encoded virtual_table
(atomic_long_t, 8 bytes per slot), and physical clusters store
Pointer-tagged rmap entries in the swap_table for reverse lookup back to
the virtual cluster.

Here are some data layout diagrams:

  Case 1: vswap entry (virtualized)

  PTE                  swap_cluster_info_dynamic
  vswap_entry          +-------------------------+
  (swp_entry_t) ------>| swap_cluster_info (ci)  |
                       | +--------------------+  |
                       | | swap_table         |  |
                       | |   PFN / Shadow     |  |
                       | | memcg_table        |  |
                       | | count,flags,order  |  |
                       | | lock, list         |  |
                       | +--------------------+  |
                       |                         |
                       | virtual_table           |
                       | +--------------------+  |
                       | | NONE               |  |
                       | | PHYS               |  |
                       | | ZERO               |  |
                       | | ZSWAP(entry*)      |  |
                       | | FOLIO(folio*)      |  |
                       | +--------------------+  |
                       +-------------------------+
                              |
                              | PHYS resolves to
                              v
                       PHYSICAL CLUSTER (swap_cluster_info)
                       +--------------------------+
                       | swap_table per-slot:     |
                       |   NULL   - free          |
                       |   PFN    - cached folio  |
                       |   Shadow - swapped out   |
                       |   Pointer- vswap rmap    |
                       |   Bad    - unusable      |
                       |                          |
                       | Vswap-backing slot:      |
                       |   Pointer(C|swp_entry_t) |
                       |     rmap back to vswap   |
                       +--------------------------+

  Case 2: direct-mapped physical entry (no vswap)

  PTE                  PHYSICAL CLUSTER (swap_cluster_info)
  phys_entry           +--------------------------+
  (swp_entry_t) ------>| swap_table per-slot:     |
                       |   NULL   - free          |
                       |   PFN    - cached folio  |
                       |   Shadow - swapped out   |
                       |   Bad    - unusable      |
                       +--------------------------+

struct swap_cluster_info_dynamic {
    struct swap_cluster_info ci;       /* swap_table, lock, etc. */
    unsigned int index;                /* position in xarray */
    struct rcu_head rcu;               /* kfree_rcu deferred free */
    atomic_long_t *virtual_table;      /* backend info, 8 B/slot */
};

Each vswap cluster (swap_cluster_info_dynamic) extends the classic
swap_cluster_info struct with a virtual_table array that stores the
backend information for each virtual swap entry in the cluster. Each
entry is tag-encoded in the low 3 bits to indicate backend types:

  NONE:   |----- 0000 ------|000|  free / unbacked
  PHYS:   |-- (type:5,off:N)|001|  on a physical swapfile (shifted)
  ZERO:   |----- 0000 ------|010|  zero-filled page
  ZSWAP:  |--- zswap_entry* |011|  compressed in zswap
  FOLIO:  |--- folio* ------|100|  in-memory folio

We still have room for 3 more future backend types, for e.g. CRAM, i.e
compressed-CXL-as-swap, which is laid out in [10] and [11]. Worst
case scenario, we can add more fields to this extended struct.

Other design points:
- Both vswap entries (Case 1) and directly-mapped physical entries
  (Case 2) coexist as first-class citizens. All the common swap
  code paths — swapout, swapin, swap freeing, swapoff, zswap
  writeback, THP swapin, etc. work for both. When CONFIG_VSWAP=n,
  the vswap branches compile out and behavior should be identical to
  today's swap-table P4 (at least that is my intention).
- Pointer-tagged swap_table on physical clusters for rmap (physical
  -> virtual) lookup.
- Virtual swap slots not backed by physical swap are not charged to
  memcg swap counters — only physical backing is charged (I made the
  case for this in [7]).
- Careful separation of vswap and physical swap allocation paths and
  structures adds a lot of complexity, but is crucial to make sure
  both paths are efficient and do not conflict with each other (for
  correctness and performance). I do re-use a lot of the allocation
  logic wherever possible though.

  An example of this is the per-cpu cluster caching. I have found that
  caching virtual and physical clusters in the same structure is a
  recipe for bugs and performance regressions :) For instance, zswap
  shrinker will invalidate the cached virtual cluster, and cache its
  physical cluster instead, which will be reverted by the next vswap
  allocation.


And a lot more of these random tidbits off the top of my head. See the
patches for a proof-of-concept implementation.


III. Follow-ups:

In no particular order (and most of which can be done as follow-up
patch series rather than shoving everything in the initial landing):

- More thorough stress testing is very much needed.

- Performance benchmarks to make sure I don't accidentally regress
  the vswap-less case, and that the vswap's case performance is
  good. I suspect I will have to port a lot of the
  optimizations I implemented in v6 over here - some of the
  inefficiencies are inherent in any swap virtualization, and
  would require the same fix (for e.g the MRU cluster caching
  for faster cluster lookup - see [8] and [9]).

- Runtime enable/disable of the vswap device. To be honest, I don't
  know if there is a value in this. My preference is vswap can be
  optimized to the point that any overhead is negligible. Failing that,
  maybe we can come up with some simple heuristics that automatically
  decides for users?

  In this RFC, CONFIG_VSWAP=y means the vswap device is always created at
  boot, and CONFIG_VSWAP=n means the vswap device is never created. This
  *might* be enough just on its own.

  Is a runtime knob (sysfs or sysctl) worth the complexity beyond
  these heuristics? I'm not sure yet. Maintaining both cases
  at runtime also has overhead for checking as well, and some of the
  checks are not cheap :)

  Besides, what does swapon/swapoff buy us here? We do not want
  multiple vswap devices - they're identical performance-wise, so we
  will just fragment clusters unnecessarily. We do not care about
  sizing, since the metadata layer is completely dynamic. If we want
  to opt-out of vswap at runtime per-cgroup, maybe swap.tier by
  Youngjun (see [12]) is a better interface than swapon/swapoff?

- Defer per-cluster memcg_table and zeromap allocation on physical
  clusters. A physical swap cluster backing vswap entries only do
  not really need their memcg_table, but the current design forces
  us to allocate it anyway. This is a waste of memory, and is an
  overhead regression compared to my older design on the zswap-only
  case, which Johannes has pointed out multiple times (see [6]),
  and is one of the biggest reasons why I have not been satisfied
  with this approach thus far. It honestly is a bit of a
  deal-breaker...

  That said, I think I might be able to allocate them on demand, i.e
  only when the first direct-mapped slot is allocated on that cluster.
  That will give us the best of BOTH worlds, for both the vswap and
  directly-mapped physical swap cases. No promises, but I will try
  (if this approach is good enough for all parties).

- Widen swap_info_struct->max to unsigned long. The vswap device's
  max is currently clamped to ALIGN_DOWN(UINT_MAX, SWAPFILE_CLUSTER)
  (~16 TiB) to fit in unsigned int. 16 TiB is small for vswap,
  especially when we're getting increasingly big machines memory-wise.

- Supporting 32-bit architectures. I need to do the math carefully.
  But do we want to optimize for these architectures anyway? I think
  the only argument is if somehow virtual swap is so good that we
  can just get rid of the direct-mapped physical swap case entirely,
  so we need to support 32-bit architectures. I'm willing to have my
  mind changed though.

- Add some fat design doc (assuming this approach is acceptable to
  folks).

- Samefilled page handling is still doable BTW, if folks think this
  has value :)


This is an early RFC — I have only done basic functional testing so
far, and still need to run more thorough stress tests and benchmarks.
That said, I figure I should send this out early to get folks's
feedback, before I get myself too deep in this rabbit hole - the
complexity is already mounting...


[1]: https://lore.kernel.org/all/20260505153854.1612033-1-nphamcs@gmail.com/
[2]: https://lore.kernel.org/all/20260517-swap-table-p4-v5-0-88ae43e064c7@tencent.com/
[3]: https://lwn.net/Articles/1072657/
[4]: https://lore.kernel.org/all/20260220-swap-table-p4-v1-15-104795d19815@tencent.com/
[5]: https://lore.kernel.org/all/aerrps94j70MkgdW@gourry-fedora-PF4VCD3F/
[6]: https://lore.kernel.org/all/aZyFxKGXc8J6PIij@cmpxchg.org/
[7]: https://lore.kernel.org/linux-mm/CAKEwX=P4syV38jAVCWq198r2OHXXc=xA-fx1dk6+qYef6yzxWQ@mail.gmail.com/
[8]: https://lore.kernel.org/all/CAKEwX=NrUhUrAFx+8BYJEfaVKpCm-H9JhBzYSrqOQb-NW7QRug@mail.gmail.com/
[9]: https://lore.kernel.org/all/20260505153854.1612033-23-nphamcs@gmail.com/
[10]: https://lore.kernel.org/all/aerrps94j70MkgdW@gourry-fedora-PF4VCD3F/
[11]: https://lore.kernel.org/all/afIKxG5mJZE6QgpR@gourry-fedora-PF4VCD3F/
[12]: https://lore.kernel.org/all/20260527062247.3440692-1-youngjun.park@lge.com/

Nhat Pham (5):
  mm, swap: add virtual swap device infrastructure
  mm, swap: support zswap and zeroswap as vswap backends
  mm, swap: support physical swap as a vswap backend
  mm, swap: only charge physical swap entries
  mm, swap: add debugfs counters for vswap

 MAINTAINERS           |    1 +
 include/linux/swap.h  |   71 +++
 include/linux/zswap.h |    3 +
 mm/Kconfig            |   10 +
 mm/internal.h         |   20 +-
 mm/madvise.c          |    2 +-
 mm/memcontrol.c       |  132 ++++-
 mm/memory.c           |   34 +-
 mm/page_io.c          |  195 ++++++--
 mm/swap.h             |   59 ++-
 mm/swap_state.c       |   51 +-
 mm/swap_table.h       |   56 +++
 mm/swapfile.c         | 1096 +++++++++++++++++++++++++++++++++++++----
 mm/vmscan.c           |    5 +-
 mm/vswap.h            |  445 +++++++++++++++++
 mm/zswap.c            |  167 +++++--
 16 files changed, 2108 insertions(+), 239 deletions(-)
 create mode 100644 mm/vswap.h


base-commit: 401c55d4eacd97ffd24a89829655baa43b2b308e
-- 
2.53.0-Meta


^ permalink raw reply

* Re: [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask()
From: Andrew Morton @ 2026-05-28 19:41 UTC (permalink / raw)
  To: Yury Norov
  Cc: David Hildenbrand, Zi Yan, Matthew Brost, Joshua Hahn, Rakie Kim,
	Byungchul Park, Gregory Price, Ying Huang, Alistair Popple,
	linux-mm, linux-kernel, Farhad Alemi, Waiman Long,
	Rasmus Villemoes, cgroups
In-Reply-To: <20260528190337.878027-1-ynorov@nvidia.com>

On Thu, 28 May 2026 15:03:37 -0400 Yury Norov <ynorov@nvidia.com> wrote:

> Reassigning nodes relative an empty user-provided nodemask is useless,
> and triggers divide-by-zero in the function.
> 
> Reported-by: Farhad Alemi <farhad.alemi@berkeley.edu>
> Link: https://lore.kernel.org/all/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/

Thanks both.

It looks like this is very old code, so we'll be wanting a cc:stable in
this.

> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -370,8 +370,13 @@ static inline int mpol_store_user_nodemask(const struct mempolicy *pol)
>  static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig,
>  				   const nodemask_t *rel)
>  {
> +	unsigned int w = nodes_weight(*rel);
>  	nodemask_t tmp;
> -	nodes_fold(tmp, *orig, nodes_weight(*rel));
> +
> +	if (w == 0)
> +		return -EINVAL;
> +
> +	nodes_fold(tmp, *orig, w);
>  	nodes_onto(*ret, tmp, *rel);
>  }

I suspect we should address this at the mpol level - it should never
have got that far.  Hopefully the mempolicy maintainers can have a
think.



^ permalink raw reply

* Re: [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask()
From: Yury Norov @ 2026-05-28 19:40 UTC (permalink / raw)
  To: Waiman Long
  Cc: Andrew Morton, David Hildenbrand, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, linux-mm, linux-kernel, Farhad Alemi,
	Rasmus Villemoes, cgroups
In-Reply-To: <305848e7-f987-494c-8244-bcf8eed6fb7d@redhat.com>

On Thu, May 28, 2026 at 03:37:04PM -0400, Waiman Long wrote:
> On 5/28/26 3:03 PM, Yury Norov wrote:
> > Reassigning nodes relative an empty user-provided nodemask is useless,
> > and triggers divide-by-zero in the function.
> > 
> > Reported-by: Farhad Alemi <farhad.alemi@berkeley.edu>
> > Link: https://lore.kernel.org/all/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/
> > Signed-off-by: Yury Norov <ynorov@nvidia.com>
> > ---
> >   mm/mempolicy.c | 7 ++++++-
> >   1 file changed, 6 insertions(+), 1 deletion(-)
> > 
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 4e4421b22b59..cd961fa1eb33 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -370,8 +370,13 @@ static inline int mpol_store_user_nodemask(const struct mempolicy *pol)
> >   static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig,
> >   				   const nodemask_t *rel)
> >   {
> > +	unsigned int w = nodes_weight(*rel);
> >   	nodemask_t tmp;
> > -	nodes_fold(tmp, *orig, nodes_weight(*rel));
> > +
> > +	if (w == 0)
> > +		return -EINVAL;
> > +
> > +	nodes_fold(tmp, *orig, w);
> >   	nodes_onto(*ret, tmp, *rel);
> >   }
> 
> mpol_relative_nodemask() is a void function, so this code should fail
> compilation. Right?

Apologize, submitted the wrong file. Will resend shortly.

^ permalink raw reply

* Re: [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask()
From: Matthew Wilcox @ 2026-05-28 19:37 UTC (permalink / raw)
  To: Yury Norov
  Cc: Andrew Morton, David Hildenbrand, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, linux-mm, linux-kernel, Farhad Alemi,
	Waiman Long, Rasmus Villemoes, cgroups
In-Reply-To: <20260528190337.878027-1-ynorov@nvidia.com>

On Thu, May 28, 2026 at 03:03:37PM -0400, Yury Norov wrote:
>  static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig,
          ^^^^

>  				   const nodemask_t *rel)
>  {
> +	unsigned int w = nodes_weight(*rel);
>  	nodemask_t tmp;
> -	nodes_fold(tmp, *orig, nodes_weight(*rel));
> +
> +	if (w == 0)
> +		return -EINVAL;

... this doesn't even compile.

^ permalink raw reply

* Re: [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask()
From: Waiman Long @ 2026-05-28 19:37 UTC (permalink / raw)
  To: Yury Norov, Andrew Morton, David Hildenbrand, Zi Yan,
	Matthew Brost, Joshua Hahn, Rakie Kim, Byungchul Park,
	Gregory Price, Ying Huang, Alistair Popple, linux-mm,
	linux-kernel
  Cc: Farhad Alemi, Rasmus Villemoes, cgroups
In-Reply-To: <20260528190337.878027-1-ynorov@nvidia.com>

On 5/28/26 3:03 PM, Yury Norov wrote:
> Reassigning nodes relative an empty user-provided nodemask is useless,
> and triggers divide-by-zero in the function.
>
> Reported-by: Farhad Alemi <farhad.alemi@berkeley.edu>
> Link: https://lore.kernel.org/all/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/
> Signed-off-by: Yury Norov <ynorov@nvidia.com>
> ---
>   mm/mempolicy.c | 7 ++++++-
>   1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 4e4421b22b59..cd961fa1eb33 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -370,8 +370,13 @@ static inline int mpol_store_user_nodemask(const struct mempolicy *pol)
>   static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig,
>   				   const nodemask_t *rel)
>   {
> +	unsigned int w = nodes_weight(*rel);
>   	nodemask_t tmp;
> -	nodes_fold(tmp, *orig, nodes_weight(*rel));
> +
> +	if (w == 0)
> +		return -EINVAL;
> +
> +	nodes_fold(tmp, *orig, w);
>   	nodes_onto(*ret, tmp, *rel);
>   }
>   

mpol_relative_nodemask() is a void function, so this code should fail 
compilation. Right?

Cheers,
Longman


^ permalink raw reply

* Re: [BUG] lib/bitmap: divide error in bitmap_fold() when sz argument is 0
From: Yury Norov @ 2026-05-28 19:07 UTC (permalink / raw)
  To: Farhad Alemi
  Cc: Andrew Morton, Yury Norov, Waiman Long, David Hildenbrand,
	Rasmus Villemoes, cgroups, linux-mm, linux-kernel
In-Reply-To: <CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com>

Hi Farhad,

Thanks for the report. Submitted the fix and added you in CC.

Thanks,
Yury

On Thu, May 28, 2026 at 11:25:36AM -0700, Farhad Alemi wrote:
> Hello,
> 
> I am reporting a divide-by-zero crash in bitmap_fold() found by syzkaller.
> 
> Summary:
> bitmap_fold() at lib/bitmap.c divides by its `sz` parameter without
> guarding sz != 0:
> 
>   void bitmap_fold(unsigned long *dst, const unsigned long *orig,
>                    unsigned int sz, unsigned int nbits)
>   {
>           ...
>           for_each_set_bit(oldbit, orig, nbits)
>                   set_bit(oldbit % sz, dst);
>   }
> 
> The call chain in the observed crash is:
> 
>   mpol_relative_nodemask()   mm/mempolicy.c
>     nodes_fold(tmp, *orig, nodes_weight(*rel))
>   __nodes_fold()              include/linux/nodemask.h
>     bitmap_fold(dstp->bits, origp->bits, sz, nbits)
>   bitmap_fold()               lib/bitmap.c
> 
> When `nodes_weight(*rel)` is 0 (i.e. the relative-nodes mask is empty),
> the `sz` argument passed to bitmap_fold() is 0, and the
> `oldbit % sz` expression executes a divl by zero.
> 
> Observed on:
> - Linux v6.18.32-dirty (where the bug was originally found), x86_64,
>   QEMU Q35
> - KASAN enabled; panic_on_warn set
> - The only local dirty file in my tree is drivers/tty/serial/serial_core.c,
>   containing a local ttyS0 console guard for the fuzzing harness. It is
>   unrelated to lib/bitmap, mm/mempolicy, or kernel/cgroup/cpuset.
> - The crash fires in a cpu-hotplug kernel thread (Comm: cpuhp/1, PID 21)
>   reached via sched_cpu_deactivate -> cpuset_handle_hotplug ->
>   cpuset_update_tasks_nodemask -> mpol_rebind_mm -> mpol_rebind_policy
>   -> mpol_rebind_nodemask -> mpol_relative_nodemask -> __nodes_fold ->
>   bitmap_fold.
> - Source inspection of linus/master at commit e8c2f9fdadee
>   (v7.1-rc4-754-ge8c2f9fdadee) shows the buggy structure is unchanged:
>   bitmap_fold() at lib/bitmap.c:718 still computes `oldbit % sz` with
>   no sz != 0 guard; __nodes_fold() at include/linux/nodemask.h:365
>   still forwards its sz argument; mpol_relative_nodemask() at
>   mm/mempolicy.c:370 still calls nodes_fold(tmp, *orig,
>   nodes_weight(*rel)). I have not re-run a reproducer against
>   e8c2f9fdadee as no standalone reproducer is available yet.
> 
> Impact:
> A divide-by-zero in a cpu-hotplug kernel thread context kills the
> kernel:
> 
>   Oops: divide error: 0000 [#1] SMP KASAN NOPTI
>   CPU: 1 UID: 0 PID: 21 Comm: cpuhp/1 Not tainted 6.18.32-dirty #1 PREEMPT(full)
>   RIP: 0010:bitmap_fold+0x5e/0xb0 lib/bitmap.c:713
> 
> The crash report's code disassembly pins the trapping instruction to
> `divl 0x4(%rsp)` (bytes `f7 74 24 04`) with %edx pre-zeroed by the
> preceding `xor %edx,%edx` -- i.e. a 32-bit unsigned divide by the
> on-stack `sz` value.
> 
> Relevant stack:
> 
>   bitmap_fold+0x5e/0xb0 lib/bitmap.c:713
>   __nodes_fold include/linux/nodemask.h:369 [inline]
>   mpol_relative_nodemask mm/mempolicy.c:372 [inline]
>   mpol_rebind_nodemask+0x1e9/0x2d0 mm/mempolicy.c:508
>   mpol_rebind_policy mm/mempolicy.c:542 [inline]
>   mpol_rebind_mm+0x3ab/0x680 mm/mempolicy.c:569
>   cpuset_update_tasks_nodemask+0x22e/0x340 kernel/cgroup/cpuset.c:2777
>   hotplug_update_tasks kernel/cgroup/cpuset.c:3882 [inline]
>   cpuset_hotplug_update_tasks kernel/cgroup/cpuset.c:3985 [inline]
>   cpuset_handle_hotplug+0xe52/0x1200 kernel/cgroup/cpuset.c:4089
>   cpuset_cpu_inactive kernel/sched/core.c:8377 [inline]
>   sched_cpu_deactivate+0x497/0x600 kernel/sched/core.c:8493
>   cpuhp_invoke_callback+0x44a/0x860 kernel/cpu.c:195
>   cpuhp_thread_fun+0x40f/0x870 kernel/cpu.c:1105
>   smpboot_thread_fn+0x546/0xa50 kernel/smpboot.c:160
>   kthread+0x73e/0x8c0 kernel/kthread.c:432
> 
> Expected behavior:
> Either bitmap_fold() should guard against sz == 0 (return early or
> WARN+return), or the callers in the nodes_fold / mpol_relative_nodemask
> chain should not pass a zero `sz` (e.g. short-circuit the rebind when
> the relative nodemask is empty).
> 
> Reproducer:
> A standalone .syz or C reproducer was not produced for this seed; the
> crash fired during broader cpu/cgroup/mempolicy fuzzing. The console
> report is attached as crash-report.txt.
> 
> Novelty check:
> I searched the syzbot dashboard's upstream open, fixed, stable, and
> invalid (per-subsystem mempolicy/mm/cgroups) namespaces, the Android
> dashboard, and the marc.info linux-mm and linux-kernel archives, for
> "bitmap_fold", "mpol_rebind_nodemask" + "divide error", "__nodes_fold"
> + "BUG"/"Oops", and "cpuset_handle_hotplug" + "BUG". I did not find an
> exact match. The recent Jinjiang Tu series (mainline commit
> 3d702678f57e, "mm/mempolicy: fix mpol_rebind_nodemask() for
> MPOL_F_NUMA_BALANCING") is a sibling fix in the same function but
> addresses wrong-rebind logic under NUMA balancing, not the
> divide-by-zero in bitmap_fold().
> 
> I appreciate your time and consideration, and I'm grateful for your
> work on this subsystem. I'd be glad to test any candidate patches.
> 
> Regards,

> Oops: divide error: 0000 [#1] SMP KASAN NOPTI
> CPU: 1 UID: 0 PID: 21 Comm: cpuhp/1 Not tainted 6.18.32-dirty #1 PREEMPT(full) 
> Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
> RIP: 0010:bitmap_fold+0x5e/0xb0 lib/bitmap.c:713
> Code: 31 f6 e8 a5 4e 20 fe 41 89 dc 44 89 ea 4c 89 f7 4c 89 e6 e8 84 f2 01 00 49 89 c5 44 39 eb 76 2d e8 f7 fc b9 fd 44 89 e8 31 d2 <f7> 74 24 04 89 d5 89 d0 c1 e8 06 49 8d 3c c7 be 08 00 00 00 e8 39
> RSP: 0018:ffffc9000016f520 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: 0000000000000040 RCX: ffff8881026a0000
> RDX: 0000000000000000 RSI: 0000000000000040 RDI: ffff888126f6f218
> RBP: ffffc9000016f630 R08: ffffc9000016f5a7 R09: 0000000000000000
> R10: ffffc9000016f5a0 R11: fffff5200002deb5 R12: 0000000000000040
> R13: 0000000000000000 R14: ffff888126f6f218 R15: ffffc9000016f5a0
> FS:  0000000000000000(0000) GS:ffff8882abcc4000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fcd8c9c6fe8 CR3: 0000000192758000 CR4: 0000000000750ef0
> PKRU: 55555554
> Call Trace:
>  <TASK>
>  __nodes_fold include/linux/nodemask.h:369 [inline]
>  mpol_relative_nodemask mm/mempolicy.c:372 [inline]
>  mpol_rebind_nodemask+0x1e9/0x2d0 mm/mempolicy.c:508
>  mpol_rebind_policy mm/mempolicy.c:542 [inline]
>  mpol_rebind_mm+0x3ab/0x680 mm/mempolicy.c:569
>  cpuset_update_tasks_nodemask+0x22e/0x340 kernel/cgroup/cpuset.c:2777
>  hotplug_update_tasks kernel/cgroup/cpuset.c:3882 [inline]
>  cpuset_hotplug_update_tasks kernel/cgroup/cpuset.c:3985 [inline]
>  cpuset_handle_hotplug+0xe52/0x1200 kernel/cgroup/cpuset.c:4089
>  cpuset_cpu_inactive kernel/sched/core.c:8377 [inline]
>  sched_cpu_deactivate+0x497/0x600 kernel/sched/core.c:8493
>  cpuhp_invoke_callback+0x44a/0x860 kernel/cpu.c:195
>  cpuhp_thread_fun+0x40f/0x870 kernel/cpu.c:1105
>  smpboot_thread_fn+0x546/0xa50 kernel/smpboot.c:160
>  kthread+0x73e/0x8c0 kernel/kthread.c:432
>  ret_from_fork+0x4b4/0xa30 arch/x86/kernel/process.c:158
>  ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
>  </TASK>
> Modules linked in:
> ---[ end trace 0000000000000000 ]---
> RIP: 0010:bitmap_fold+0x5e/0xb0 lib/bitmap.c:713
> Code: 31 f6 e8 a5 4e 20 fe 41 89 dc 44 89 ea 4c 89 f7 4c 89 e6 e8 84 f2 01 00 49 89 c5 44 39 eb 76 2d e8 f7 fc b9 fd 44 89 e8 31 d2 <f7> 74 24 04 89 d5 89 d0 c1 e8 06 49 8d 3c c7 be 08 00 00 00 e8 39
> RSP: 0018:ffffc9000016f520 EFLAGS: 00010246
> RAX: 0000000000000000 RBX: 0000000000000040 RCX: ffff8881026a0000
> RDX: 0000000000000000 RSI: 0000000000000040 RDI: ffff888126f6f218
> RBP: ffffc9000016f630 R08: ffffc9000016f5a7 R09: 0000000000000000
> R10: ffffc9000016f5a0 R11: fffff5200002deb5 R12: 0000000000000040
> R13: 0000000000000000 R14: ffff888126f6f218 R15: ffffc9000016f5a0
> FS:  0000000000000000(0000) GS:ffff8882abcc4000(0000) knlGS:0000000000000000
> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007fcd8c9c6fe8 CR3: 0000000192758000 CR4: 0000000000750ef0
> PKRU: 55555554
> ----------------
> Code disassembly (best guess):
>    0:	31 f6                	xor    %esi,%esi
>    2:	e8 a5 4e 20 fe       	call   0xfe204eac
>    7:	41 89 dc             	mov    %ebx,%r12d
>    a:	44 89 ea             	mov    %r13d,%edx
>    d:	4c 89 f7             	mov    %r14,%rdi
>   10:	4c 89 e6             	mov    %r12,%rsi
>   13:	e8 84 f2 01 00       	call   0x1f29c
>   18:	49 89 c5             	mov    %rax,%r13
>   1b:	44 39 eb             	cmp    %r13d,%ebx
>   1e:	76 2d                	jbe    0x4d
>   20:	e8 f7 fc b9 fd       	call   0xfdb9fd1c
>   25:	44 89 e8             	mov    %r13d,%eax
>   28:	31 d2                	xor    %edx,%edx
> * 2a:	f7 74 24 04          	divl   0x4(%rsp) <-- trapping instruction
>   2e:	89 d5                	mov    %edx,%ebp
>   30:	89 d0                	mov    %edx,%eax
>   32:	c1 e8 06             	shr    $0x6,%eax
>   35:	49 8d 3c c7          	lea    (%r15,%rax,8),%rdi
>   39:	be 08 00 00 00       	mov    $0x8,%esi
>   3e:	e8                   	.byte 0xe8
>   3f:	39                   	.byte 0x39
> 
> <<<<<<<<<<<<<<< tail report >>>>>>>>>>>>>>>
> 


^ permalink raw reply

* [PATCH] mm: don't allow empty relative nodemask in mpol_relative_nodemask()
From: Yury Norov @ 2026-05-28 19:03 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Zi Yan, Matthew Brost,
	Joshua Hahn, Rakie Kim, Byungchul Park, Gregory Price, Ying Huang,
	Alistair Popple, linux-mm, linux-kernel
  Cc: Yury Norov, Farhad Alemi, Waiman Long, Rasmus Villemoes, cgroups

Reassigning nodes relative an empty user-provided nodemask is useless,
and triggers divide-by-zero in the function.

Reported-by: Farhad Alemi <farhad.alemi@berkeley.edu>
Link: https://lore.kernel.org/all/CA+0ovCgxbZkXa+OU8w3s84R3KNPNxxRfmsNR-udh+afQBbGNmw@mail.gmail.com/
Signed-off-by: Yury Norov <ynorov@nvidia.com>
---
 mm/mempolicy.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 4e4421b22b59..cd961fa1eb33 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -370,8 +370,13 @@ static inline int mpol_store_user_nodemask(const struct mempolicy *pol)
 static void mpol_relative_nodemask(nodemask_t *ret, const nodemask_t *orig,
 				   const nodemask_t *rel)
 {
+	unsigned int w = nodes_weight(*rel);
 	nodemask_t tmp;
-	nodes_fold(tmp, *orig, nodes_weight(*rel));
+
+	if (w == 0)
+		return -EINVAL;
+
+	nodes_fold(tmp, *orig, w);
 	nodes_onto(*ret, tmp, *rel);
 }
 
-- 
2.51.0


^ permalink raw reply related

* [BUG] lib/bitmap: divide error in bitmap_fold() when sz argument is 0
From: Farhad Alemi @ 2026-05-28 18:25 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Yury Norov, Waiman Long, David Hildenbrand, Rasmus Villemoes,
	cgroups, linux-mm, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 4647 bytes --]

Hello,

I am reporting a divide-by-zero crash in bitmap_fold() found by syzkaller.

Summary:
bitmap_fold() at lib/bitmap.c divides by its `sz` parameter without
guarding sz != 0:

  void bitmap_fold(unsigned long *dst, const unsigned long *orig,
                   unsigned int sz, unsigned int nbits)
  {
          ...
          for_each_set_bit(oldbit, orig, nbits)
                  set_bit(oldbit % sz, dst);
  }

The call chain in the observed crash is:

  mpol_relative_nodemask()   mm/mempolicy.c
    nodes_fold(tmp, *orig, nodes_weight(*rel))
  __nodes_fold()              include/linux/nodemask.h
    bitmap_fold(dstp->bits, origp->bits, sz, nbits)
  bitmap_fold()               lib/bitmap.c

When `nodes_weight(*rel)` is 0 (i.e. the relative-nodes mask is empty),
the `sz` argument passed to bitmap_fold() is 0, and the
`oldbit % sz` expression executes a divl by zero.

Observed on:
- Linux v6.18.32-dirty (where the bug was originally found), x86_64,
  QEMU Q35
- KASAN enabled; panic_on_warn set
- The only local dirty file in my tree is drivers/tty/serial/serial_core.c,
  containing a local ttyS0 console guard for the fuzzing harness. It is
  unrelated to lib/bitmap, mm/mempolicy, or kernel/cgroup/cpuset.
- The crash fires in a cpu-hotplug kernel thread (Comm: cpuhp/1, PID 21)
  reached via sched_cpu_deactivate -> cpuset_handle_hotplug ->
  cpuset_update_tasks_nodemask -> mpol_rebind_mm -> mpol_rebind_policy
  -> mpol_rebind_nodemask -> mpol_relative_nodemask -> __nodes_fold ->
  bitmap_fold.
- Source inspection of linus/master at commit e8c2f9fdadee
  (v7.1-rc4-754-ge8c2f9fdadee) shows the buggy structure is unchanged:
  bitmap_fold() at lib/bitmap.c:718 still computes `oldbit % sz` with
  no sz != 0 guard; __nodes_fold() at include/linux/nodemask.h:365
  still forwards its sz argument; mpol_relative_nodemask() at
  mm/mempolicy.c:370 still calls nodes_fold(tmp, *orig,
  nodes_weight(*rel)). I have not re-run a reproducer against
  e8c2f9fdadee as no standalone reproducer is available yet.

Impact:
A divide-by-zero in a cpu-hotplug kernel thread context kills the
kernel:

  Oops: divide error: 0000 [#1] SMP KASAN NOPTI
  CPU: 1 UID: 0 PID: 21 Comm: cpuhp/1 Not tainted 6.18.32-dirty #1 PREEMPT(full)
  RIP: 0010:bitmap_fold+0x5e/0xb0 lib/bitmap.c:713

The crash report's code disassembly pins the trapping instruction to
`divl 0x4(%rsp)` (bytes `f7 74 24 04`) with %edx pre-zeroed by the
preceding `xor %edx,%edx` -- i.e. a 32-bit unsigned divide by the
on-stack `sz` value.

Relevant stack:

  bitmap_fold+0x5e/0xb0 lib/bitmap.c:713
  __nodes_fold include/linux/nodemask.h:369 [inline]
  mpol_relative_nodemask mm/mempolicy.c:372 [inline]
  mpol_rebind_nodemask+0x1e9/0x2d0 mm/mempolicy.c:508
  mpol_rebind_policy mm/mempolicy.c:542 [inline]
  mpol_rebind_mm+0x3ab/0x680 mm/mempolicy.c:569
  cpuset_update_tasks_nodemask+0x22e/0x340 kernel/cgroup/cpuset.c:2777
  hotplug_update_tasks kernel/cgroup/cpuset.c:3882 [inline]
  cpuset_hotplug_update_tasks kernel/cgroup/cpuset.c:3985 [inline]
  cpuset_handle_hotplug+0xe52/0x1200 kernel/cgroup/cpuset.c:4089
  cpuset_cpu_inactive kernel/sched/core.c:8377 [inline]
  sched_cpu_deactivate+0x497/0x600 kernel/sched/core.c:8493
  cpuhp_invoke_callback+0x44a/0x860 kernel/cpu.c:195
  cpuhp_thread_fun+0x40f/0x870 kernel/cpu.c:1105
  smpboot_thread_fn+0x546/0xa50 kernel/smpboot.c:160
  kthread+0x73e/0x8c0 kernel/kthread.c:432

Expected behavior:
Either bitmap_fold() should guard against sz == 0 (return early or
WARN+return), or the callers in the nodes_fold / mpol_relative_nodemask
chain should not pass a zero `sz` (e.g. short-circuit the rebind when
the relative nodemask is empty).

Reproducer:
A standalone .syz or C reproducer was not produced for this seed; the
crash fired during broader cpu/cgroup/mempolicy fuzzing. The console
report is attached as crash-report.txt.

Novelty check:
I searched the syzbot dashboard's upstream open, fixed, stable, and
invalid (per-subsystem mempolicy/mm/cgroups) namespaces, the Android
dashboard, and the marc.info linux-mm and linux-kernel archives, for
"bitmap_fold", "mpol_rebind_nodemask" + "divide error", "__nodes_fold"
+ "BUG"/"Oops", and "cpuset_handle_hotplug" + "BUG". I did not find an
exact match. The recent Jinjiang Tu series (mainline commit
3d702678f57e, "mm/mempolicy: fix mpol_rebind_nodemask() for
MPOL_F_NUMA_BALANCING") is a sibling fix in the same function but
addresses wrong-rebind logic under NUMA balancing, not the
divide-by-zero in bitmap_fold().

I appreciate your time and consideration, and I'm grateful for your
work on this subsystem. I'd be glad to test any candidate patches.

Regards,

[-- Attachment #2: crash-report.txt --]
[-- Type: text/plain, Size: 3959 bytes --]

Oops: divide error: 0000 [#1] SMP KASAN NOPTI
CPU: 1 UID: 0 PID: 21 Comm: cpuhp/1 Not tainted 6.18.32-dirty #1 PREEMPT(full) 
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014
RIP: 0010:bitmap_fold+0x5e/0xb0 lib/bitmap.c:713
Code: 31 f6 e8 a5 4e 20 fe 41 89 dc 44 89 ea 4c 89 f7 4c 89 e6 e8 84 f2 01 00 49 89 c5 44 39 eb 76 2d e8 f7 fc b9 fd 44 89 e8 31 d2 <f7> 74 24 04 89 d5 89 d0 c1 e8 06 49 8d 3c c7 be 08 00 00 00 e8 39
RSP: 0018:ffffc9000016f520 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000040 RCX: ffff8881026a0000
RDX: 0000000000000000 RSI: 0000000000000040 RDI: ffff888126f6f218
RBP: ffffc9000016f630 R08: ffffc9000016f5a7 R09: 0000000000000000
R10: ffffc9000016f5a0 R11: fffff5200002deb5 R12: 0000000000000040
R13: 0000000000000000 R14: ffff888126f6f218 R15: ffffc9000016f5a0
FS:  0000000000000000(0000) GS:ffff8882abcc4000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fcd8c9c6fe8 CR3: 0000000192758000 CR4: 0000000000750ef0
PKRU: 55555554
Call Trace:
 <TASK>
 __nodes_fold include/linux/nodemask.h:369 [inline]
 mpol_relative_nodemask mm/mempolicy.c:372 [inline]
 mpol_rebind_nodemask+0x1e9/0x2d0 mm/mempolicy.c:508
 mpol_rebind_policy mm/mempolicy.c:542 [inline]
 mpol_rebind_mm+0x3ab/0x680 mm/mempolicy.c:569
 cpuset_update_tasks_nodemask+0x22e/0x340 kernel/cgroup/cpuset.c:2777
 hotplug_update_tasks kernel/cgroup/cpuset.c:3882 [inline]
 cpuset_hotplug_update_tasks kernel/cgroup/cpuset.c:3985 [inline]
 cpuset_handle_hotplug+0xe52/0x1200 kernel/cgroup/cpuset.c:4089
 cpuset_cpu_inactive kernel/sched/core.c:8377 [inline]
 sched_cpu_deactivate+0x497/0x600 kernel/sched/core.c:8493
 cpuhp_invoke_callback+0x44a/0x860 kernel/cpu.c:195
 cpuhp_thread_fun+0x40f/0x870 kernel/cpu.c:1105
 smpboot_thread_fn+0x546/0xa50 kernel/smpboot.c:160
 kthread+0x73e/0x8c0 kernel/kthread.c:432
 ret_from_fork+0x4b4/0xa30 arch/x86/kernel/process.c:158
 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
 </TASK>
Modules linked in:
---[ end trace 0000000000000000 ]---
RIP: 0010:bitmap_fold+0x5e/0xb0 lib/bitmap.c:713
Code: 31 f6 e8 a5 4e 20 fe 41 89 dc 44 89 ea 4c 89 f7 4c 89 e6 e8 84 f2 01 00 49 89 c5 44 39 eb 76 2d e8 f7 fc b9 fd 44 89 e8 31 d2 <f7> 74 24 04 89 d5 89 d0 c1 e8 06 49 8d 3c c7 be 08 00 00 00 e8 39
RSP: 0018:ffffc9000016f520 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000040 RCX: ffff8881026a0000
RDX: 0000000000000000 RSI: 0000000000000040 RDI: ffff888126f6f218
RBP: ffffc9000016f630 R08: ffffc9000016f5a7 R09: 0000000000000000
R10: ffffc9000016f5a0 R11: fffff5200002deb5 R12: 0000000000000040
R13: 0000000000000000 R14: ffff888126f6f218 R15: ffffc9000016f5a0
FS:  0000000000000000(0000) GS:ffff8882abcc4000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fcd8c9c6fe8 CR3: 0000000192758000 CR4: 0000000000750ef0
PKRU: 55555554
----------------
Code disassembly (best guess):
   0:	31 f6                	xor    %esi,%esi
   2:	e8 a5 4e 20 fe       	call   0xfe204eac
   7:	41 89 dc             	mov    %ebx,%r12d
   a:	44 89 ea             	mov    %r13d,%edx
   d:	4c 89 f7             	mov    %r14,%rdi
  10:	4c 89 e6             	mov    %r12,%rsi
  13:	e8 84 f2 01 00       	call   0x1f29c
  18:	49 89 c5             	mov    %rax,%r13
  1b:	44 39 eb             	cmp    %r13d,%ebx
  1e:	76 2d                	jbe    0x4d
  20:	e8 f7 fc b9 fd       	call   0xfdb9fd1c
  25:	44 89 e8             	mov    %r13d,%eax
  28:	31 d2                	xor    %edx,%edx
* 2a:	f7 74 24 04          	divl   0x4(%rsp) <-- trapping instruction
  2e:	89 d5                	mov    %edx,%ebp
  30:	89 d0                	mov    %edx,%eax
  32:	c1 e8 06             	shr    $0x6,%eax
  35:	49 8d 3c c7          	lea    (%r15,%rax,8),%rdi
  39:	be 08 00 00 00       	mov    $0x8,%esi
  3e:	e8                   	.byte 0xe8
  3f:	39                   	.byte 0x39

<<<<<<<<<<<<<<< tail report >>>>>>>>>>>>>>>


^ permalink raw reply

* Re: [PATCH] cgroup: pair max limit READ_ONCE() with WRITE_ONCE()
From: Tejun Heo @ 2026-05-28 15:55 UTC (permalink / raw)
  To: Ren Tamura, hannes, mkoutny; +Cc: cgroups, linux-kernel
In-Reply-To: <20260528042839.28472-1-ren.tamura.oss@gmail.com>

Hello,

Applied to cgroup/for-7.2.

Thanks.

--
tejun

^ permalink raw reply

* Re: [PATCH v5 9/9] mm: switch deferred split shrinker to list_lru
From: Usama Arif @ 2026-05-28 15:31 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Shakeel Butt,
	Michal Hocko, Dave Chinner, Roman Gushchin, Muchun Song, Qi Zheng,
	Yosry Ahmed, Zi Yan, Liam R . Howlett, Kiryl Shutsemau,
	Vlastimil Babka, Kairui Song, Mikhail Zaslonko, Vasily Gorbik,
	Baolin Wang, Barry Song, Dev Jain, Lance Yang, Nico Pache,
	Ryan Roberts, cgroups, linux-mm, linux-kernel
In-Reply-To: <ahhK3AKksNbJ4zbY@cmpxchg.org>



On 28/05/2026 15:02, Johannes Weiner wrote:
> On Thu, May 28, 2026 at 02:32:06PM +0100, Usama Arif wrote:
>>
>>
>> On 27/05/2026 21:45, Johannes Weiner wrote:
>>> The deferred split queue handles cgroups in a suboptimal fashion. The
>>> queue is per-NUMA node or per-cgroup, not the intersection. That means
>>> on a cgrouped system, a node-restricted allocation entering reclaim
>>> can end up splitting large pages on other nodes:
>>>
>>>         alloc/unmap
>>>           deferred_split_folio()
>>>             list_add_tail(memcg->split_queue)
>>>             set_shrinker_bit(memcg, node, deferred_shrinker_id)
>>>
>>>         for_each_zone_zonelist_nodemask(restricted_nodes)
>>>           mem_cgroup_iter()
>>>             shrink_slab(node, memcg)
>>>               shrink_slab_memcg(node, memcg)
>>>                 if test_shrinker_bit(memcg, node, deferred_shrinker_id)
>>>                   deferred_split_scan()
>>>                     walks memcg->split_queue
>>>
>>> The shrinker bit adds an imperfect guard rail. As soon as the cgroup
>>> has a single large page on the node of interest, all large pages owned
>>> by that memcg, including those on other nodes, will be split.
>>>
>>> list_lru properly sets up per-node, per-cgroup lists. As a bonus, it
>>> streamlines a lot of the list operations and reclaim walks. It's used
>>> widely by other major shrinkers already. Convert the deferred split
>>> queue as well.
>>>
>>> The list_lru per-memcg heads are instantiated on demand when the first
>>> object of interest is allocated for a cgroup, by calling
>>> folio_memcg_alloc_deferred(). Add calls to where splittable pages are
>>> created: anon faults, swapin faults, khugepaged collapse.
>>>
>>> These calls create all possible node heads for the cgroup at once, so
>>> the migration code (between nodes) doesn't need any special care.
>>>
>>> Reported-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
>>> Tested-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
>>> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
>>> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
>>> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
>>> ---
>>>  include/linux/huge_mm.h    |   7 +-
>>>  include/linux/memcontrol.h |   4 -
>>>  include/linux/mmzone.h     |  12 --
>>>  mm/huge_memory.c           | 364 +++++++++++++------------------------
>>>  mm/internal.h              |   2 +-
>>>  mm/khugepaged.c            |   5 +
>>>  mm/memcontrol.c            |  12 +-
>>>  mm/memory.c                |   4 +
>>>  mm/mm_init.c               |  15 --
>>>  mm/swap_state.c            |  10 +
>>>  10 files changed, 150 insertions(+), 285 deletions(-)
>>>
>>
>> [...]
>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 135f5c0f57bd..f22e61d8c8de 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -5222,6 +5222,10 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>>>  			folio_put(folio);
>>>  			goto next;
>>>  		}
>>> +		if (order > 1 && folio_memcg_alloc_deferred(folio)) {
>>> +			folio_put(folio);
>>
>> Ah sorry, should have caught this in the previous version, do we need
>>
>> count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK);
>>
>> here?
> 
> This isn't an allocation we expect to fail with any sort of routine
> that we'd need to capture it in the event counter. It would warn in
> dmesg if it did. But in practice it can't happen at all, since it's a
> sub-costly-order slab allocation and the allocator would loop and OOM
> kill stuff until it succeeds.
> 
>> or maybe we just goto next instead of goto fallback and trty next
>> viable order?
> 
> Again I don't think it matters, but fallback seems a bit more correct
> because the size of the list_lru allocation doesn't change with lower
> orders (until we hit 0).
> 
> So I think we can just leave it as is.

Ack!

Acked-by: Usama Arif <usama.arif@linux.dev>


^ permalink raw reply

* Re: [PATCH v3] cgroup/dmem: implement dmem.high soft limit via prioritized eviction
From: Maarten Lankhorst @ 2026-05-28 14:09 UTC (permalink / raw)
  To: Qiliang Yuan, Christian Koenig, Huang Rui, Matthew Auld,
	Matthew Brost, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Tejun Heo, Johannes Weiner, Michal Koutný,
	Natalie Vock
  Cc: dri-devel, linux-kernel, cgroups
In-Reply-To: <20260528-feature-dmem-high-v3-1-c642b34bcb2f@gmail.com>

Hello,

Den 2026-05-28 kl. 14:03, skrev Qiliang Yuan:
> The dmem cgroup v2 controller currently only provides a hard "max"
> limit, which causes immediate allocation failures when a cgroup's
> device memory usage reaches its quota.  GPU-bound AI workloads need
> smoother over-subscription support: a soft limit that temporarily
> allows excess usage while applying backpressure through reclaim
> rather than outright failure.
> 
> Add dmem.high, a soft limit that penalizes over-limit cgroups by
> evicting their buffer objects first when eviction is triggered (e.g.
> due to a "max" limit hit).  Unlike the rejected v1 approach which
> used sleep-on-allocation throttling, this version provides a
> meaningful recovery action through prioritized reclaim.
> 
> Expose "high" as a new cgroupfs control file per region via
> set_resource_high() and get_resource_high(), and initialize it to
> PAGE_COUNTER_MAX in reset_all_resource_limits().  Like get_resource_max(),
> get_resource_high() returns PAGE_COUNTER_MAX when the pool is NULL.
> 
> Extend dmem_cgroup_state_evict_valuable() with a "try_high"
> parameter.  When set, the function walks the page_counter parent
> chain to check whether any ancestor exceeds its high limit, then
> verifies that the pool is above its effective minimum to respect
> dmem.min protection.  Only pools meeting both criteria are evicted.
> 
> Refactor ttm_bo_evict_alloc() into a 3-pass eviction strategy.
> Pass 1 uses trylock and targets only BOs whose cgroup exceeds
> dmem.high.  Pass 2 falls back to the standard above-elow eviction.
> Pass 3 begins with a properly-locked high-priority pass in case
> Pass 1 failed due to trylock contention, then proceeds with the
> standard repeat-while-making-progress loop with low-watermark
> fallback.
> 
> Signed-off-by: Qiliang Yuan <realwujing@gmail.com>
> ---
> Introduce a "high" soft limit for the dmem cgroup v2 controller.
> When a "max" limit is hit and eviction is triggered, buffer objects
> belonging to cgroups that exceed their dmem.high limit are targeted
> first, providing a meaningful recovery action through reclaim.
> 
> The dmem cgroup currently only supports hard "max" limits, which
> cause immediate allocation failures for GPU-bound workloads. A soft
> limit enables smoother over-subscription by penalizing over-limit
> cgroups via prioritized eviction rather than outright rejection.
> 
> The implementation adds a "high" cgroupfs control file per region,
> a try_high parameter to dmem_cgroup_state_evict_valuable() for
> tier-1 eviction, and a 3-pass strategy in ttm_bo_evict_alloc().
> ---
> V2 -> V3:
> - Walk the page_counter parent chain in the try_high pass to prevent
>   child cgroups from evading the penalty when a parent cgroup exceeds
>   its dmem.high limit.
> - Check dmem.min protection in the try_high pass to avoid evicting
>   BOs below the effective minimum.
> - Add a properly-locked high-priority retry at the beginning of Pass 3
>   so that actively-used over-limit BOs (which failed trylock in Pass 1)
>   are not skipped while innocent cgroups are evicted.
> - Fix get_resource_high(NULL) returning 0 instead of PAGE_COUNTER_MAX
>   to match the behavior of get_resource_max().
> 
> V1 -> V2:
> - Replace sleep-on-allocation throttling with prioritized eviction.
>   When a "max" limit is hit, BOs from cgroups exceeding dmem.high are
>   evicted first in a dedicated pass. No throttling or sleeping is
>   performed in the charge path.
> - Remove task throttling (schedule_timeout_killable, TIF_NOTIFY_RESUME,
>   resume_user_mode_work() integration) entirely.
> - Add dmem.high cgroupfs control file per region.
> - Extend dmem_cgroup_state_evict_valuable() with try_high parameter
>   to target over-limit cgroups as tier-1 eviction.
> - Refactor ttm_bo_evict_alloc() into a 3-pass eviction strategy:
>   (1) trylock: evict only BOs exceeding dmem.high
>   (2) trylock: above-elow
>   (3) proper-lock: repeat with low fallback.
> - Initialize high to PAGE_COUNTER_MAX in reset_all_resource_limits().
> 
> v1: https://lore.kernel.org/all/20260520-feature-dmem-high-v1-1-97ca0cb7f95a@gmail.com
> v2: https://lore.kernel.org/all/20260522-feature-dmem-high-v2-1-d805deddecbb@gmail.com
> ---
>  drivers/gpu/drm/ttm/ttm_bo.c | 35 ++++++++++++++++++++----
>  include/linux/cgroup_dmem.h  |  4 +--
>  kernel/cgroup/dmem.c         | 65 ++++++++++++++++++++++++++++++++++++++++++--
>  3 files changed, 94 insertions(+), 10 deletions(-)
> 
> diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
> index bcd76f6bb7f02..2f2b428f1d30a 100644
> --- a/drivers/gpu/drm/ttm/ttm_bo.c
> +++ b/drivers/gpu/drm/ttm/ttm_bo.c
> @@ -505,6 +505,8 @@ struct ttm_bo_evict_walk {
>  
>  	/** @limit_pool: Which pool limit we should test against */
>  	struct dmem_cgroup_pool_state *limit_pool;
> +	/** @try_high: Whether to only evict BO's above the high watermark (first pass) */
> +	bool try_high;
>  	/** @try_low: Whether we should attempt to evict BO's with low watermark threshold */
>  	bool try_low;
>  	/** @hit_low: If we cannot evict a bo when @try_low is false (first pass) */
> @@ -518,7 +520,8 @@ static s64 ttm_bo_evict_cb(struct ttm_lru_walk *walk, struct ttm_buffer_object *
>  	s64 lret;
>  
>  	if (!dmem_cgroup_state_evict_valuable(evict_walk->limit_pool, bo->resource->css,
> -					      evict_walk->try_low, &evict_walk->hit_low))
> +					      evict_walk->try_high, evict_walk->try_low,
> +					      &evict_walk->hit_low))
>  		return 0;
>  
>  	if (bo->pin_count || !bo->bdev->funcs->eviction_valuable(bo, evict_walk->place))
> @@ -577,31 +580,51 @@ static int ttm_bo_evict_alloc(struct ttm_device *bdev,
>  	};
>  	s64 lret;
>  
> +	/*
> +	 * Pass 1 (trylock): Only evict BOs whose cgroup is above its
> +	 * dmem.high soft limit. This penalizes over-limit cgroups first.
> +	 */
>  	evict_walk.walk.arg.trylock_only = true;
> +	evict_walk.try_high = true;
>  	lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
> +	evict_walk.try_high = false;
> +	if (lret)
> +		goto out;

I believe the first pass for 'high' should not be trylock only. High needs to be
preferentially evicted, even if the objects are locked elsewhere.


> -	/* One more attempt if we hit low limit? */
> +	/*
> +	 * Pass 2 (trylock): Evict BOs above the effective low watermark.
> +	 * Falls back to low-priority eviction if needed.
> +	 */
> +	lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
>  	if (!lret && evict_walk.hit_low) {
>  		evict_walk.try_low = true;
>  		lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
>  	}
> +
>  	if (lret || !ticket)
>  		goto out;
>  
> -	/* Reset low limit */
> +	/*
> +	 * Pass 3+ (properly locked): Evict while making progress.
> +	 * First retry the high-priority pass with proper locking in case
> +	 * Pass 1 failed due to trylock contention on over-limit BOs.
> +	 * If that still fails, fall back to the standard low-priority eviction.
> +	 */
>  	evict_walk.try_low = evict_walk.hit_low = false;
> -	/* If ticket-locking, repeat while making progress. */
>  	evict_walk.walk.arg.trylock_only = false;
> +	evict_walk.try_high = true;
> +	lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
> +	evict_walk.try_high = false;
> +	if (lret)
> +		goto out;
>  
>  retry:
>  	do {
> -		/* The walk may clear the evict_walk.walk.ticket field */
>  		evict_walk.walk.arg.ticket = ticket;
>  		evict_walk.evicted = 0;
>  		lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
>  	} while (!lret && evict_walk.evicted);
>  
> -	/* We hit the low limit? Try once more */
>  	if (!lret && evict_walk.hit_low && !evict_walk.try_low) {
>  		evict_walk.try_low = true;
>  		goto retry;
> diff --git a/include/linux/cgroup_dmem.h b/include/linux/cgroup_dmem.h
> index dd4869f1d736e..06115d35509b1 100644
> --- a/include/linux/cgroup_dmem.h
> +++ b/include/linux/cgroup_dmem.h
> @@ -23,7 +23,7 @@ int dmem_cgroup_try_charge(struct dmem_cgroup_region *region, u64 size,
>  void dmem_cgroup_uncharge(struct dmem_cgroup_pool_state *pool, u64 size);
>  bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
>  				      struct dmem_cgroup_pool_state *test_pool,
> -				      bool ignore_low, bool *ret_hit_low);
> +				      bool try_high, bool ignore_low, bool *ret_hit_low);
>  
>  void dmem_cgroup_pool_state_put(struct dmem_cgroup_pool_state *pool);
>  #else
> @@ -54,7 +54,7 @@ static inline void dmem_cgroup_uncharge(struct dmem_cgroup_pool_state *pool, u64
>  static inline
>  bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
>  				      struct dmem_cgroup_pool_state *test_pool,
> -				      bool ignore_low, bool *ret_hit_low)
> +				      bool try_high, bool ignore_low, bool *ret_hit_low)
>  {
>  	return true;
>  }
> diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
> index 4753a67d0f0f2..c80444c0da177 100644
> --- a/kernel/cgroup/dmem.c
> +++ b/kernel/cgroup/dmem.c
> @@ -156,6 +156,12 @@ set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val)
>  	page_counter_set_low(&pool->cnt, val);
>  }
>  
> +static void
> +set_resource_high(struct dmem_cgroup_pool_state *pool, u64 val)
> +{
> +	page_counter_set_high(&pool->cnt, val);
> +}
> +
>  static void
>  set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val)
>  {
> @@ -167,6 +173,11 @@ static u64 get_resource_low(struct dmem_cgroup_pool_state *pool)
>  	return pool ? READ_ONCE(pool->cnt.low) : 0;
>  }
>  
> +static u64 get_resource_high(struct dmem_cgroup_pool_state *pool)
> +{
> +	return pool ? READ_ONCE(pool->cnt.high) : PAGE_COUNTER_MAX;
> +}
> +
>  static u64 get_resource_min(struct dmem_cgroup_pool_state *pool)
>  {
>  	return pool ? READ_ONCE(pool->cnt.min) : 0;
> @@ -186,6 +197,7 @@ static void reset_all_resource_limits(struct dmem_cgroup_pool_state *rpool)
>  {
>  	set_resource_min(rpool, 0);
>  	set_resource_low(rpool, 0);
> +	set_resource_high(rpool, PAGE_COUNTER_MAX);
>  	set_resource_max(rpool, PAGE_COUNTER_MAX);
>  }
>  
> @@ -289,10 +301,13 @@ dmem_cgroup_calculate_protection(struct dmem_cgroup_pool_state *limit_pool,
>   * dmem_cgroup_state_evict_valuable() - Check if we should evict from test_pool
>   * @limit_pool: The pool for which we hit limits
>   * @test_pool: The pool for which to test
> + * @try_high: Only evict BOs whose usage exceeds the high limit (first pass)
>   * @ignore_low: Whether we have to respect low watermarks.
>   * @ret_hit_low: Pointer to whether it makes sense to consider low watermark.
>   *
>   * This function returns true if we can evict from @test_pool, false if not.
> + * When @try_high is set, only pools with usage above their high limit are
> + * evictable, enabling prioritized eviction of over-limit cgroups.
>   * When returning false and @ignore_low is false, @ret_hit_low may
>   * be set to true to indicate this function can be retried with @ignore_low
>   * set to true.
> @@ -301,7 +316,7 @@ dmem_cgroup_calculate_protection(struct dmem_cgroup_pool_state *limit_pool,
>   */
>  bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
>  				      struct dmem_cgroup_pool_state *test_pool,
> -				      bool ignore_low, bool *ret_hit_low)
> +				      bool try_high, bool ignore_low, bool *ret_hit_low)
>  {
>  	struct dmem_cgroup_pool_state *pool = test_pool;
>  	struct page_counter *ctest;
> @@ -331,9 +346,38 @@ bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
>  
>  	ctest = &test_pool->cnt;
>  
> +	used = page_counter_read(ctest);
> +
> +	if (try_high) {
> +		struct page_counter *c;
> +
> +		/*
> +		 * Walk the page_counter parent chain to check whether any
> +		 * ancestor cgroup exceeds its dmem.high limit.  This prevents
> +		 * child cgroups from evading the penalty when a parent cgroup
> +		 * is over its high limit.
> +		 */
> +		if (used <= READ_ONCE(ctest->high)) {
> +			for (c = ctest->parent; c; c = c->parent) {
> +				if (page_counter_read(c) > READ_ONCE(c->high))
> +					break;
> +			}
> +			if (!c)
> +				return false;
> +		}
> +
> +		/*
> +		 * Respect dmem.min protection: do not evict BOs below the
> +		 * effective minimum even during the high-priority pass.
> +		 */
> +		dmem_cgroup_calculate_protection(limit_pool, test_pool);
> +		min = READ_ONCE(ctest->emin);
> +
> +		return used > min;
> +	}
> +
>  	dmem_cgroup_calculate_protection(limit_pool, test_pool);
>  
> -	used = page_counter_read(ctest);
>  	min = READ_ONCE(ctest->emin);
>  
>  	if (used <= min)
> @@ -835,6 +879,17 @@ static ssize_t dmem_cgroup_region_low_write(struct kernfs_open_file *of,
>  	return dmemcg_limit_write(of, buf, nbytes, off, set_resource_low);
>  }
>  
> +static int dmem_cgroup_region_high_show(struct seq_file *sf, void *v)
> +{
> +	return dmemcg_limit_show(sf, v, get_resource_high);
> +}
> +
> +static ssize_t dmem_cgroup_region_high_write(struct kernfs_open_file *of,
> +					  char *buf, size_t nbytes, loff_t off)
> +{
> +	return dmemcg_limit_write(of, buf, nbytes, off, set_resource_high);
> +}
> +
>  static int dmem_cgroup_region_max_show(struct seq_file *sf, void *v)
>  {
>  	return dmemcg_limit_show(sf, v, get_resource_max);
> @@ -868,6 +923,12 @@ static struct cftype files[] = {
>  		.seq_show = dmem_cgroup_region_low_show,
>  		.flags = CFTYPE_NOT_ON_ROOT,
>  	},
> +	{
> +		.name = "high",
> +		.write = dmem_cgroup_region_high_write,
> +		.seq_show = dmem_cgroup_region_high_show,
> +		.flags = CFTYPE_NOT_ON_ROOT,
> +	},
>  	{
>  		.name = "max",
>  		.write = dmem_cgroup_region_max_write,

The rest of the patch looks good.


^ permalink raw reply

* Re: [PATCH v5 9/9] mm: switch deferred split shrinker to list_lru
From: Johannes Weiner @ 2026-05-28 14:03 UTC (permalink / raw)
  To: SeongJae Park
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Shakeel Butt,
	Michal Hocko, Dave Chinner, Roman Gushchin, Muchun Song, Qi Zheng,
	Yosry Ahmed, Zi Yan, Liam R . Howlett, Usama Arif,
	Kiryl Shutsemau, Vlastimil Babka, Kairui Song, Mikhail Zaslonko,
	Vasily Gorbik, Baolin Wang, Barry Song, Dev Jain, Lance Yang,
	Nico Pache, Ryan Roberts, cgroups, linux-mm, linux-kernel
In-Reply-To: <20260528070807.144064-1-sj@kernel.org>

On Thu, May 28, 2026 at 12:08:05AM -0700, SeongJae Park wrote:
> From 23b5800dd49085707baee5774b74782c3e424f24 Mon Sep 17 00:00:00 2001
> From: SeongJae Park <sj@kernel.org>
> Date: Wed, 27 May 2026 23:58:07 -0700
> Subject: [PATCH] mm/huge_mm: define memcg_alloc_deferred() for
>  !CONFIG_TRANSPARENT_HUGEPPAGE
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> Without this, UM mode kunit fails like below.
> 
>     $ ./tools/testing/kunit/kunit.py run --kunitconfig mm/damon/tests/
>     [00:00:02] Configuring KUnit Kernel ...
>     [00:00:02] Building KUnit Kernel ...
>     Populating config with:
>     $ make ARCH=um O=.kunit olddefconfig
>     Building with:
>     $ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=8
>     ERROR:root:../mm/swap_state.c: In function ‘__swap_cache_alloc’:
>     ../mm/swap_state.c:468:26: error: implicit declaration of function ‘folio_memcg_alloc_deferred’ [-Wimplicit-function-declaration]
>       468 |         if (order > 1 && folio_memcg_alloc_deferred(folio)) {
>           |                          ^~~~~~~~~~~~~~~~~~~~~~~~~~
>     make[4]: *** [../scripts/Makefile.build:289: mm/swap_state.o] Error 1
>     make[4]: *** Waiting for unfinished jobs....
>     make[3]: *** [../scripts/Makefile.build:548: mm] Error 2
>     make[3]: *** Waiting for unfinished jobs....
>     make[2]: *** [/home/lkhack/linux/Makefile:2143: .] Error 2
>     make[1]: *** [/home/lkhack/linux/Makefile:248: __sub-make] Error 2
>     make: *** [Makefile:248: __sub-make] Error 2
> 
> Fix by implementing the function for CONFIG_TRANSPARENT_HUGEPPAGE unset
> case.
> 
> Fixes: https://lore.kernel.org/20260527204757.2544958-10-hannes@cmpxchg.org
> Signed-off-by: SeongJae Park <sj@kernel.org>

Whoops, thanks for the fix, SJ. I'll incorporate UM builds into my
final compile test before sending.

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

^ permalink raw reply

* Re: [PATCH v5 9/9] mm: switch deferred split shrinker to list_lru
From: Johannes Weiner @ 2026-05-28 14:02 UTC (permalink / raw)
  To: Usama Arif
  Cc: Andrew Morton, David Hildenbrand, Lorenzo Stoakes, Shakeel Butt,
	Michal Hocko, Dave Chinner, Roman Gushchin, Muchun Song, Qi Zheng,
	Yosry Ahmed, Zi Yan, Liam R . Howlett, Kiryl Shutsemau,
	Vlastimil Babka, Kairui Song, Mikhail Zaslonko, Vasily Gorbik,
	Baolin Wang, Barry Song, Dev Jain, Lance Yang, Nico Pache,
	Ryan Roberts, cgroups, linux-mm, linux-kernel
In-Reply-To: <6f9c78b2-3846-4f75-bcc2-41bf91230513@linux.dev>

On Thu, May 28, 2026 at 02:32:06PM +0100, Usama Arif wrote:
> 
> 
> On 27/05/2026 21:45, Johannes Weiner wrote:
> > The deferred split queue handles cgroups in a suboptimal fashion. The
> > queue is per-NUMA node or per-cgroup, not the intersection. That means
> > on a cgrouped system, a node-restricted allocation entering reclaim
> > can end up splitting large pages on other nodes:
> > 
> >         alloc/unmap
> >           deferred_split_folio()
> >             list_add_tail(memcg->split_queue)
> >             set_shrinker_bit(memcg, node, deferred_shrinker_id)
> > 
> >         for_each_zone_zonelist_nodemask(restricted_nodes)
> >           mem_cgroup_iter()
> >             shrink_slab(node, memcg)
> >               shrink_slab_memcg(node, memcg)
> >                 if test_shrinker_bit(memcg, node, deferred_shrinker_id)
> >                   deferred_split_scan()
> >                     walks memcg->split_queue
> > 
> > The shrinker bit adds an imperfect guard rail. As soon as the cgroup
> > has a single large page on the node of interest, all large pages owned
> > by that memcg, including those on other nodes, will be split.
> > 
> > list_lru properly sets up per-node, per-cgroup lists. As a bonus, it
> > streamlines a lot of the list operations and reclaim walks. It's used
> > widely by other major shrinkers already. Convert the deferred split
> > queue as well.
> > 
> > The list_lru per-memcg heads are instantiated on demand when the first
> > object of interest is allocated for a cgroup, by calling
> > folio_memcg_alloc_deferred(). Add calls to where splittable pages are
> > created: anon faults, swapin faults, khugepaged collapse.
> > 
> > These calls create all possible node heads for the cgroup at once, so
> > the migration code (between nodes) doesn't need any special care.
> > 
> > Reported-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
> > Tested-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
> > Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> > Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> > Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> > ---
> >  include/linux/huge_mm.h    |   7 +-
> >  include/linux/memcontrol.h |   4 -
> >  include/linux/mmzone.h     |  12 --
> >  mm/huge_memory.c           | 364 +++++++++++++------------------------
> >  mm/internal.h              |   2 +-
> >  mm/khugepaged.c            |   5 +
> >  mm/memcontrol.c            |  12 +-
> >  mm/memory.c                |   4 +
> >  mm/mm_init.c               |  15 --
> >  mm/swap_state.c            |  10 +
> >  10 files changed, 150 insertions(+), 285 deletions(-)
> > 
> 
> [...]
> 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 135f5c0f57bd..f22e61d8c8de 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -5222,6 +5222,10 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
> >  			folio_put(folio);
> >  			goto next;
> >  		}
> > +		if (order > 1 && folio_memcg_alloc_deferred(folio)) {
> > +			folio_put(folio);
> 
> Ah sorry, should have caught this in the previous version, do we need
> 
> count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK);
> 
> here?

This isn't an allocation we expect to fail with any sort of routine
that we'd need to capture it in the event counter. It would warn in
dmesg if it did. But in practice it can't happen at all, since it's a
sub-costly-order slab allocation and the allocator would loop and OOM
kill stuff until it succeeds.

> or maybe we just goto next instead of goto fallback and trty next
> viable order?

Again I don't think it matters, but fallback seems a bit more correct
because the size of the list_lru allocation doesn't change with lower
orders (until we hit 0).

So I think we can just leave it as is.

^ permalink raw reply

* Re: [PATCH v5 9/9] mm: switch deferred split shrinker to list_lru
From: Usama Arif @ 2026-05-28 13:32 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton
  Cc: David Hildenbrand, Lorenzo Stoakes, Shakeel Butt, Michal Hocko,
	Dave Chinner, Roman Gushchin, Muchun Song, Qi Zheng, Yosry Ahmed,
	Zi Yan, Liam R . Howlett, Kiryl Shutsemau, Vlastimil Babka,
	Kairui Song, Mikhail Zaslonko, Vasily Gorbik, Baolin Wang,
	Barry Song, Dev Jain, Lance Yang, Nico Pache, Ryan Roberts,
	cgroups, linux-mm, linux-kernel
In-Reply-To: <20260527204757.2544958-10-hannes@cmpxchg.org>



On 27/05/2026 21:45, Johannes Weiner wrote:
> The deferred split queue handles cgroups in a suboptimal fashion. The
> queue is per-NUMA node or per-cgroup, not the intersection. That means
> on a cgrouped system, a node-restricted allocation entering reclaim
> can end up splitting large pages on other nodes:
> 
>         alloc/unmap
>           deferred_split_folio()
>             list_add_tail(memcg->split_queue)
>             set_shrinker_bit(memcg, node, deferred_shrinker_id)
> 
>         for_each_zone_zonelist_nodemask(restricted_nodes)
>           mem_cgroup_iter()
>             shrink_slab(node, memcg)
>               shrink_slab_memcg(node, memcg)
>                 if test_shrinker_bit(memcg, node, deferred_shrinker_id)
>                   deferred_split_scan()
>                     walks memcg->split_queue
> 
> The shrinker bit adds an imperfect guard rail. As soon as the cgroup
> has a single large page on the node of interest, all large pages owned
> by that memcg, including those on other nodes, will be split.
> 
> list_lru properly sets up per-node, per-cgroup lists. As a bonus, it
> streamlines a lot of the list operations and reclaim walks. It's used
> widely by other major shrinkers already. Convert the deferred split
> queue as well.
> 
> The list_lru per-memcg heads are instantiated on demand when the first
> object of interest is allocated for a cgroup, by calling
> folio_memcg_alloc_deferred(). Add calls to where splittable pages are
> created: anon faults, swapin faults, khugepaged collapse.
> 
> These calls create all possible node heads for the cgroup at once, so
> the migration code (between nodes) doesn't need any special care.
> 
> Reported-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
> Tested-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/huge_mm.h    |   7 +-
>  include/linux/memcontrol.h |   4 -
>  include/linux/mmzone.h     |  12 --
>  mm/huge_memory.c           | 364 +++++++++++++------------------------
>  mm/internal.h              |   2 +-
>  mm/khugepaged.c            |   5 +
>  mm/memcontrol.c            |  12 +-
>  mm/memory.c                |   4 +
>  mm/mm_init.c               |  15 --
>  mm/swap_state.c            |  10 +
>  10 files changed, 150 insertions(+), 285 deletions(-)
> 

[...]

> diff --git a/mm/memory.c b/mm/memory.c
> index 135f5c0f57bd..f22e61d8c8de 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -5222,6 +5222,10 @@ static struct folio *alloc_anon_folio(struct vm_fault *vmf)
>  			folio_put(folio);
>  			goto next;
>  		}
> +		if (order > 1 && folio_memcg_alloc_deferred(folio)) {
> +			folio_put(folio);

Ah sorry, should have caught this in the previous version, do we need

count_mthp_stat(order, MTHP_STAT_ANON_FAULT_FALLBACK);

here?

or maybe we just goto next instead of goto fallback and trty next
viable order?


> +			goto fallback;
> +		}
>  		folio_throttle_swaprate(folio, gfp);
>  		/*
>  		 * When a folio is not zeroed during allocation

^ permalink raw reply

* Re: [PATCH v5 1/9] mm: list_lru: fix set_shrinker_bit() call during race with cgroup deletion
From: Usama Arif @ 2026-05-28 13:25 UTC (permalink / raw)
  To: Johannes Weiner, Andrew Morton
  Cc: David Hildenbrand, Lorenzo Stoakes, Shakeel Butt, Michal Hocko,
	Dave Chinner, Roman Gushchin, Muchun Song, Qi Zheng, Yosry Ahmed,
	Zi Yan, Liam R . Howlett, Kiryl Shutsemau, Vlastimil Babka,
	Kairui Song, Mikhail Zaslonko, Vasily Gorbik, Baolin Wang,
	Barry Song, Dev Jain, Lance Yang, Nico Pache, Ryan Roberts,
	cgroups, linux-mm, linux-kernel
In-Reply-To: <20260527204757.2544958-2-hannes@cmpxchg.org>



On 27/05/2026 21:45, Johannes Weiner wrote:
> When list_lru_add() races with cgroup deletion, the shrinker bit is set
> on the wrong group and lost. This can cause a shrinker run to miss the
> cgroup that actually has the object.
> 
> When the passed in memcg is dead, the function finds the first non-dead
> parent from the passed in memcg and adds the object there; but the
> shrinker bit is set on the memcg that was passed in.
> 
> This bug is as old as the shrinker bitmap itself.
> 
> Fix it by returning the "effective" memcg from the locking function, and
> have the caller use that.
> 
> Fixes: fae91d6d8be5 ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance")
> Reported-by: Usama Arif <usama.arif@linux.dev>
> Reported-by: Sashiko
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  mm/list_lru.c | 26 +++++++++++++++-----------
>  1 file changed, 15 insertions(+), 11 deletions(-)
> 

Acked-by: Usama Arif <usama.arif@linux.dev>


^ permalink raw reply

* Re: [RFC PATCH rdma-next 0/5] cgroup/rdma: add per-type resource accounting for QP, MR and MR memory
From: Jason Gunthorpe @ 2026-05-28 13:06 UTC (permalink / raw)
  To: Tao Cui; +Cc: cgroups, hannes, leon, linux-rdma, mkoutny, tj
In-Reply-To: <20260528075537.2170697-1-cuitao@kylinos.cn>

On Thu, May 28, 2026 at 03:55:37PM +0800, Tao Cui wrote:
> Hi,Jason
> 
> > memory pin accounting should ideally be limited by the cgroup directly
> > but we argued about that for a while and could never get an agreement
> > of an acceptable implementation. There are many nasty corner cases
> > around cgroups and fork and other cases IIRC
> >
> > So I'm not sure if making it rdma specific can easially solve these
> > problems
> 
> Thanks for the detailed context.  I understand the concern — generic
> pinned-page accounting at the memcg level has difficult ownership
> semantics around fork(), cgroup migration, shared mappings, and page
> lifetime tracking.
> 
> The intent of mr_mem is narrower and RDMA-scoped.  It is not page-level
> ownership tracking — it is object-based accounting tied to the MR
> lifetime:
> 
>   - charged at MR registration time
>   - uncharged at MR destruction time
>   - the charge lives with the MR's creating cgroup for the entire
>     lifetime of the MR object

Okay, that's an interesting framing. Perhaps it can work, you should
include this in the commit message and be sure to CC the cgroup
people.

Jason

^ permalink raw reply

* [PATCH v3] cgroup/dmem: implement dmem.high soft limit via prioritized eviction
From: Qiliang Yuan @ 2026-05-28 12:03 UTC (permalink / raw)
  To: Christian Koenig, Huang Rui, Matthew Auld, Matthew Brost,
	Maarten Lankhorst, Maxime Ripard, Thomas Zimmermann, David Airlie,
	Simona Vetter, Tejun Heo, Johannes Weiner, Michal Koutný,
	Natalie Vock
  Cc: dri-devel, linux-kernel, cgroups, Qiliang Yuan

The dmem cgroup v2 controller currently only provides a hard "max"
limit, which causes immediate allocation failures when a cgroup's
device memory usage reaches its quota.  GPU-bound AI workloads need
smoother over-subscription support: a soft limit that temporarily
allows excess usage while applying backpressure through reclaim
rather than outright failure.

Add dmem.high, a soft limit that penalizes over-limit cgroups by
evicting their buffer objects first when eviction is triggered (e.g.
due to a "max" limit hit).  Unlike the rejected v1 approach which
used sleep-on-allocation throttling, this version provides a
meaningful recovery action through prioritized reclaim.

Expose "high" as a new cgroupfs control file per region via
set_resource_high() and get_resource_high(), and initialize it to
PAGE_COUNTER_MAX in reset_all_resource_limits().  Like get_resource_max(),
get_resource_high() returns PAGE_COUNTER_MAX when the pool is NULL.

Extend dmem_cgroup_state_evict_valuable() with a "try_high"
parameter.  When set, the function walks the page_counter parent
chain to check whether any ancestor exceeds its high limit, then
verifies that the pool is above its effective minimum to respect
dmem.min protection.  Only pools meeting both criteria are evicted.

Refactor ttm_bo_evict_alloc() into a 3-pass eviction strategy.
Pass 1 uses trylock and targets only BOs whose cgroup exceeds
dmem.high.  Pass 2 falls back to the standard above-elow eviction.
Pass 3 begins with a properly-locked high-priority pass in case
Pass 1 failed due to trylock contention, then proceeds with the
standard repeat-while-making-progress loop with low-watermark
fallback.

Signed-off-by: Qiliang Yuan <realwujing@gmail.com>
---
Introduce a "high" soft limit for the dmem cgroup v2 controller.
When a "max" limit is hit and eviction is triggered, buffer objects
belonging to cgroups that exceed their dmem.high limit are targeted
first, providing a meaningful recovery action through reclaim.

The dmem cgroup currently only supports hard "max" limits, which
cause immediate allocation failures for GPU-bound workloads. A soft
limit enables smoother over-subscription by penalizing over-limit
cgroups via prioritized eviction rather than outright rejection.

The implementation adds a "high" cgroupfs control file per region,
a try_high parameter to dmem_cgroup_state_evict_valuable() for
tier-1 eviction, and a 3-pass strategy in ttm_bo_evict_alloc().
---
V2 -> V3:
- Walk the page_counter parent chain in the try_high pass to prevent
  child cgroups from evading the penalty when a parent cgroup exceeds
  its dmem.high limit.
- Check dmem.min protection in the try_high pass to avoid evicting
  BOs below the effective minimum.
- Add a properly-locked high-priority retry at the beginning of Pass 3
  so that actively-used over-limit BOs (which failed trylock in Pass 1)
  are not skipped while innocent cgroups are evicted.
- Fix get_resource_high(NULL) returning 0 instead of PAGE_COUNTER_MAX
  to match the behavior of get_resource_max().

V1 -> V2:
- Replace sleep-on-allocation throttling with prioritized eviction.
  When a "max" limit is hit, BOs from cgroups exceeding dmem.high are
  evicted first in a dedicated pass. No throttling or sleeping is
  performed in the charge path.
- Remove task throttling (schedule_timeout_killable, TIF_NOTIFY_RESUME,
  resume_user_mode_work() integration) entirely.
- Add dmem.high cgroupfs control file per region.
- Extend dmem_cgroup_state_evict_valuable() with try_high parameter
  to target over-limit cgroups as tier-1 eviction.
- Refactor ttm_bo_evict_alloc() into a 3-pass eviction strategy:
  (1) trylock: evict only BOs exceeding dmem.high
  (2) trylock: above-elow
  (3) proper-lock: repeat with low fallback.
- Initialize high to PAGE_COUNTER_MAX in reset_all_resource_limits().

v1: https://lore.kernel.org/all/20260520-feature-dmem-high-v1-1-97ca0cb7f95a@gmail.com
v2: https://lore.kernel.org/all/20260522-feature-dmem-high-v2-1-d805deddecbb@gmail.com
---
 drivers/gpu/drm/ttm/ttm_bo.c | 35 ++++++++++++++++++++----
 include/linux/cgroup_dmem.h  |  4 +--
 kernel/cgroup/dmem.c         | 65 ++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 94 insertions(+), 10 deletions(-)

diff --git a/drivers/gpu/drm/ttm/ttm_bo.c b/drivers/gpu/drm/ttm/ttm_bo.c
index bcd76f6bb7f02..2f2b428f1d30a 100644
--- a/drivers/gpu/drm/ttm/ttm_bo.c
+++ b/drivers/gpu/drm/ttm/ttm_bo.c
@@ -505,6 +505,8 @@ struct ttm_bo_evict_walk {
 
 	/** @limit_pool: Which pool limit we should test against */
 	struct dmem_cgroup_pool_state *limit_pool;
+	/** @try_high: Whether to only evict BO's above the high watermark (first pass) */
+	bool try_high;
 	/** @try_low: Whether we should attempt to evict BO's with low watermark threshold */
 	bool try_low;
 	/** @hit_low: If we cannot evict a bo when @try_low is false (first pass) */
@@ -518,7 +520,8 @@ static s64 ttm_bo_evict_cb(struct ttm_lru_walk *walk, struct ttm_buffer_object *
 	s64 lret;
 
 	if (!dmem_cgroup_state_evict_valuable(evict_walk->limit_pool, bo->resource->css,
-					      evict_walk->try_low, &evict_walk->hit_low))
+					      evict_walk->try_high, evict_walk->try_low,
+					      &evict_walk->hit_low))
 		return 0;
 
 	if (bo->pin_count || !bo->bdev->funcs->eviction_valuable(bo, evict_walk->place))
@@ -577,31 +580,51 @@ static int ttm_bo_evict_alloc(struct ttm_device *bdev,
 	};
 	s64 lret;
 
+	/*
+	 * Pass 1 (trylock): Only evict BOs whose cgroup is above its
+	 * dmem.high soft limit. This penalizes over-limit cgroups first.
+	 */
 	evict_walk.walk.arg.trylock_only = true;
+	evict_walk.try_high = true;
 	lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
+	evict_walk.try_high = false;
+	if (lret)
+		goto out;
 
-	/* One more attempt if we hit low limit? */
+	/*
+	 * Pass 2 (trylock): Evict BOs above the effective low watermark.
+	 * Falls back to low-priority eviction if needed.
+	 */
+	lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
 	if (!lret && evict_walk.hit_low) {
 		evict_walk.try_low = true;
 		lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
 	}
+
 	if (lret || !ticket)
 		goto out;
 
-	/* Reset low limit */
+	/*
+	 * Pass 3+ (properly locked): Evict while making progress.
+	 * First retry the high-priority pass with proper locking in case
+	 * Pass 1 failed due to trylock contention on over-limit BOs.
+	 * If that still fails, fall back to the standard low-priority eviction.
+	 */
 	evict_walk.try_low = evict_walk.hit_low = false;
-	/* If ticket-locking, repeat while making progress. */
 	evict_walk.walk.arg.trylock_only = false;
+	evict_walk.try_high = true;
+	lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
+	evict_walk.try_high = false;
+	if (lret)
+		goto out;
 
 retry:
 	do {
-		/* The walk may clear the evict_walk.walk.ticket field */
 		evict_walk.walk.arg.ticket = ticket;
 		evict_walk.evicted = 0;
 		lret = ttm_lru_walk_for_evict(&evict_walk.walk, bdev, man, 1);
 	} while (!lret && evict_walk.evicted);
 
-	/* We hit the low limit? Try once more */
 	if (!lret && evict_walk.hit_low && !evict_walk.try_low) {
 		evict_walk.try_low = true;
 		goto retry;
diff --git a/include/linux/cgroup_dmem.h b/include/linux/cgroup_dmem.h
index dd4869f1d736e..06115d35509b1 100644
--- a/include/linux/cgroup_dmem.h
+++ b/include/linux/cgroup_dmem.h
@@ -23,7 +23,7 @@ int dmem_cgroup_try_charge(struct dmem_cgroup_region *region, u64 size,
 void dmem_cgroup_uncharge(struct dmem_cgroup_pool_state *pool, u64 size);
 bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 				      struct dmem_cgroup_pool_state *test_pool,
-				      bool ignore_low, bool *ret_hit_low);
+				      bool try_high, bool ignore_low, bool *ret_hit_low);
 
 void dmem_cgroup_pool_state_put(struct dmem_cgroup_pool_state *pool);
 #else
@@ -54,7 +54,7 @@ static inline void dmem_cgroup_uncharge(struct dmem_cgroup_pool_state *pool, u64
 static inline
 bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 				      struct dmem_cgroup_pool_state *test_pool,
-				      bool ignore_low, bool *ret_hit_low)
+				      bool try_high, bool ignore_low, bool *ret_hit_low)
 {
 	return true;
 }
diff --git a/kernel/cgroup/dmem.c b/kernel/cgroup/dmem.c
index 4753a67d0f0f2..c80444c0da177 100644
--- a/kernel/cgroup/dmem.c
+++ b/kernel/cgroup/dmem.c
@@ -156,6 +156,12 @@ set_resource_low(struct dmem_cgroup_pool_state *pool, u64 val)
 	page_counter_set_low(&pool->cnt, val);
 }
 
+static void
+set_resource_high(struct dmem_cgroup_pool_state *pool, u64 val)
+{
+	page_counter_set_high(&pool->cnt, val);
+}
+
 static void
 set_resource_max(struct dmem_cgroup_pool_state *pool, u64 val)
 {
@@ -167,6 +173,11 @@ static u64 get_resource_low(struct dmem_cgroup_pool_state *pool)
 	return pool ? READ_ONCE(pool->cnt.low) : 0;
 }
 
+static u64 get_resource_high(struct dmem_cgroup_pool_state *pool)
+{
+	return pool ? READ_ONCE(pool->cnt.high) : PAGE_COUNTER_MAX;
+}
+
 static u64 get_resource_min(struct dmem_cgroup_pool_state *pool)
 {
 	return pool ? READ_ONCE(pool->cnt.min) : 0;
@@ -186,6 +197,7 @@ static void reset_all_resource_limits(struct dmem_cgroup_pool_state *rpool)
 {
 	set_resource_min(rpool, 0);
 	set_resource_low(rpool, 0);
+	set_resource_high(rpool, PAGE_COUNTER_MAX);
 	set_resource_max(rpool, PAGE_COUNTER_MAX);
 }
 
@@ -289,10 +301,13 @@ dmem_cgroup_calculate_protection(struct dmem_cgroup_pool_state *limit_pool,
  * dmem_cgroup_state_evict_valuable() - Check if we should evict from test_pool
  * @limit_pool: The pool for which we hit limits
  * @test_pool: The pool for which to test
+ * @try_high: Only evict BOs whose usage exceeds the high limit (first pass)
  * @ignore_low: Whether we have to respect low watermarks.
  * @ret_hit_low: Pointer to whether it makes sense to consider low watermark.
  *
  * This function returns true if we can evict from @test_pool, false if not.
+ * When @try_high is set, only pools with usage above their high limit are
+ * evictable, enabling prioritized eviction of over-limit cgroups.
  * When returning false and @ignore_low is false, @ret_hit_low may
  * be set to true to indicate this function can be retried with @ignore_low
  * set to true.
@@ -301,7 +316,7 @@ dmem_cgroup_calculate_protection(struct dmem_cgroup_pool_state *limit_pool,
  */
 bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 				      struct dmem_cgroup_pool_state *test_pool,
-				      bool ignore_low, bool *ret_hit_low)
+				      bool try_high, bool ignore_low, bool *ret_hit_low)
 {
 	struct dmem_cgroup_pool_state *pool = test_pool;
 	struct page_counter *ctest;
@@ -331,9 +346,38 @@ bool dmem_cgroup_state_evict_valuable(struct dmem_cgroup_pool_state *limit_pool,
 
 	ctest = &test_pool->cnt;
 
+	used = page_counter_read(ctest);
+
+	if (try_high) {
+		struct page_counter *c;
+
+		/*
+		 * Walk the page_counter parent chain to check whether any
+		 * ancestor cgroup exceeds its dmem.high limit.  This prevents
+		 * child cgroups from evading the penalty when a parent cgroup
+		 * is over its high limit.
+		 */
+		if (used <= READ_ONCE(ctest->high)) {
+			for (c = ctest->parent; c; c = c->parent) {
+				if (page_counter_read(c) > READ_ONCE(c->high))
+					break;
+			}
+			if (!c)
+				return false;
+		}
+
+		/*
+		 * Respect dmem.min protection: do not evict BOs below the
+		 * effective minimum even during the high-priority pass.
+		 */
+		dmem_cgroup_calculate_protection(limit_pool, test_pool);
+		min = READ_ONCE(ctest->emin);
+
+		return used > min;
+	}
+
 	dmem_cgroup_calculate_protection(limit_pool, test_pool);
 
-	used = page_counter_read(ctest);
 	min = READ_ONCE(ctest->emin);
 
 	if (used <= min)
@@ -835,6 +879,17 @@ static ssize_t dmem_cgroup_region_low_write(struct kernfs_open_file *of,
 	return dmemcg_limit_write(of, buf, nbytes, off, set_resource_low);
 }
 
+static int dmem_cgroup_region_high_show(struct seq_file *sf, void *v)
+{
+	return dmemcg_limit_show(sf, v, get_resource_high);
+}
+
+static ssize_t dmem_cgroup_region_high_write(struct kernfs_open_file *of,
+					  char *buf, size_t nbytes, loff_t off)
+{
+	return dmemcg_limit_write(of, buf, nbytes, off, set_resource_high);
+}
+
 static int dmem_cgroup_region_max_show(struct seq_file *sf, void *v)
 {
 	return dmemcg_limit_show(sf, v, get_resource_max);
@@ -868,6 +923,12 @@ static struct cftype files[] = {
 		.seq_show = dmem_cgroup_region_low_show,
 		.flags = CFTYPE_NOT_ON_ROOT,
 	},
+	{
+		.name = "high",
+		.write = dmem_cgroup_region_high_write,
+		.seq_show = dmem_cgroup_region_high_show,
+		.flags = CFTYPE_NOT_ON_ROOT,
+	},
 	{
 		.name = "max",
 		.write = dmem_cgroup_region_max_write,

---
base-commit: ab5fce87a778cb780a05984a2ca448f2b41aafbf
change-id: 20260519-feature-dmem-high-16997148dc38

Best regards,
-- 
Qiliang Yuan <realwujing@gmail.com>


^ permalink raw reply related

* [PATCH] cgroup/cpuset: Free sched domains on rebuild guard failure
From: Guopeng Zhang @ 2026-05-28  9:37 UTC (permalink / raw)
  To: Waiman Long, Tejun Heo, Johannes Weiner, Michal Koutný
  Cc: Chen Ridong, cgroups, linux-kernel, Guopeng Zhang

From: Guopeng Zhang <zhangguopeng@kylinos.cn>

generate_sched_domains() returns sched-domain masks and optional
attributes that are normally handed to partition_sched_domains(), which
takes ownership of them.

rebuild_sched_domains_locked() has a WARN guard after
generate_sched_domains() and before partition_sched_domains() to avoid
passing offline CPUs into the scheduler domain rebuild path. If that
guard fires, the function currently returns directly without freeing
the generated doms and attr.

Free the generated sched-domain masks and attributes before returning
from the guard failure path.

Signed-off-by: Guopeng Zhang <zhangguopeng@kylinos.cn>
---
 kernel/cgroup/cpuset.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 51327333980a..c5fdebc205d8 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1004,8 +1004,11 @@ void rebuild_sched_domains_locked(void)
 	* prevent the panic.
 	*/
 	for (i = 0; doms && i < ndoms; i++) {
-		if (WARN_ON_ONCE(!cpumask_subset(doms[i], cpu_active_mask)))
+		if (WARN_ON_ONCE(!cpumask_subset(doms[i], cpu_active_mask))) {
+			free_sched_domains(doms, ndoms);
+			kfree(attr);
 			return;
+		}
 	}
 
 	/* Have scheduler rebuild the domains */
-- 
2.43.0


^ permalink raw reply related

* Re: [RFC PATCH bpf-next v7 00/11] mm: BPF struct_ops for dynamic memory protection and async reclaim
From: teawater @ 2026-05-28  8:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Alexei Starovoitov, Daniel Borkmann, John Fastabend,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Kumar Kartikeya Dwivedi, Song Liu, Yonghong Song, Jiri Olsa,
	Johannes Weiner, Roman Gushchin, Shakeel Butt, Muchun Song,
	JP Kobryn, Andrew Morton, Shuah Khan, davem, Jakub Kicinski,
	Jesper Dangaard Brouer, Stanislav Fomichev, KP Singh, Tao Chen,
	Mykyta Yatsenko, Leon Hwang, Anton Protopopov, Amery Hung,
	Tobias Klauser, Eyal Birger, Rong Tao, Hao Luo, Peter Zijlstra,
	Miguel Ojeda, Nathan Chancellor, Kees Cook, Tejun Heo, Jeff Xu,
	mkoutny, Jan Hendrik Farr, Christian Brauner, Randy Dunlap,
	Brian Gerst, Masahiro Yamada, Willem de Bruijn, Jason Xing,
	Paul Chaignon, Chen Ridong, Lance Yang, Jiayuan Chen,
	linux-kernel, bpf, cgroups, linux-mm, netdev, linux-kselftest,
	geliang, baohua, Hui Zhu
In-Reply-To: <ahavmbcdXDX5gNup@tiehlicka>

> 
> On Tue 26-05-26 10:20:00, Hui Zhu wrote:
> 

Hi Michal,

> > 
> > From: Hui Zhu <zhuhui@kylinos.cn>
> >  
> >  Overview:
> >  This series introduces BPF struct_ops support for the memory controller,
> >  enabling userspace BPF programs to implement custom, dynamic memory
> >  management policies per cgroup. The feature allows BPF programs to hook
> >  into the core reclaim and charge paths without requiring kernel
> >  modifications, providing a flexible alternative to static knobs such as
> >  memory.low and memory.min.
> >  
> >  The series enables two complementary use cases.
> >  
> >  Dynamic memory protection: static memory protection thresholds
> >  (memory.low, memory.min) are poor fits for workloads whose actual memory
> >  activity varies over time. A high-priority cgroup holding a large working
> >  set but temporarily idle will still suppress reclaim on its siblings,
> >  wasting available memory. A BPF-driven approach can observe real workload
> >  activity -- page faults, charge/uncharge events -- and activate or
> >  withdraw protection dynamically.
> > 
> Why the same cannot be achieved by dynamically changing protection?

Dynamically adjusting memory.low or memory.min is indeed an
option, but it has a practical drawback: in many production
environments these values are managed and pushed down by a
cluster-level orchestrator (e.g. a container runtime or resource
manager). Modifying them from a separate BPF-based agent risks
conflicts with the orchestrator's own control loop and makes the
system harder to reason about.

Beyond that, the intended use case requires rapid, short-lived
adjustments -- reacting to bursts of page faults or PSI spikes
and reverting just as quickly once the pressure subsides. Mutating
the static knobs for that purpose feels like the wrong abstraction:
the knobs express policy intent, while what we need is a transient
override that sits on top of that policy.

The hooks are therefore not meant to replace the existing limits,
but to complement them: the orchestrator continues to own
memory.low / memory.min, while a BPF program makes small, brief
corrections in response to observed runtime behavior.

> 
> > 
> > The test results at the end of this
> >  letter quantify the difference: in a scenario where the high-priority
> >  cgroup is idle, the BPF-controlled low-priority cgroup achieves roughly
> >  37x higher throughput than with static memory.low.
> >  
> >  Asynchronous proactive reclaim: the memcg_charged and memcg_uncharged
> >  hooks, combined with the BPF workqueue mechanism and the new
> >  bpf_try_to_free_mem_cgroup_pages() kfunc, enable BPF programs to perform
> >  proactive background reclaim without blocking the charge path. The
> >  pattern works as follows: the memcg_charged callback tracks accumulated
> >  memory usage; when usage crosses a configurable threshold, it enqueues an
> >  asynchronous work item via bpf_wq_start() and returns immediately without
> >  throttling the charging task. The workqueue callback then invokes
> >  bpf_try_to_free_mem_cgroup_pages() to reclaim pages from the target
> >  cgroup; if usage remains elevated after reclaim, the callback re-enqueues
> >  itself to continue. This allows a BPF program to keep a cgroup's
> >  footprint below its hard limit (memory.max) entirely in the background,
> >  avoiding the OOM killer or direct-reclaim stalls that would otherwise
> >  occur.
> > 
> How do you account the overall work done to the specific memcg as the
> large part of the reclaim is done from WQ context?

One approach to attribute the reclaim work accurately to the target
memcg would be to expose a kfunc that creates a kthread_worker and
attaches it to a specific cgroup. Reclaim work enqueued to that
worker would then run in a context already associated with the
target memcg, so the accounting would naturally fall to the right
cgroup without any extra bookkeeping.

The tradeoff is additional complexity: creating a per-cgroup worker
introduces resource overhead and lifecycle management concerns
(e.g. when should the worker be torn down). Whether that cost is
justified depends on how strictly the caller needs the reclaim to
be attributed.

That said, I am not certain this is the right direction yet and
would welcome your thoughts on whether this is worth pursuing, or
whether there is a simpler mechanism I am overlooking.


> Also when introducing a BPF hook please focus on describing why existing
> interfaces fail to achieve what you need. For the async reclaim why it
> is not practical or feasible to use userspace driven memory reclaim.


Noted, and thank you for both points. In the next revision I will
add a dedicated section to each hook's description covering:

Why existing interfaces are insufficient. For the async reclaim
case specifically, I will explain why userspace-driven reclaim
(e.g. memory.reclaim, cgroup-aware madvise, or a dedicated
reclaim daemon) is not practical: userspace cannot react at the
granularity or latency required, and the round-trip through a
syscall or procfs write introduces overhead that defeats the
purpose of proactive reclaim.
What gap the new hook fills that cannot be closed by tuning
existing knobs.

Best,
Hui


> -- 
> Michal Hocko
> SUSE Labs
>

^ permalink raw reply

* Re: [RFC PATCH rdma-next 0/5] cgroup/rdma: add per-type resource accounting for QP, MR and MR memory
From: Tao Cui @ 2026-05-28  7:55 UTC (permalink / raw)
  To: jgg; +Cc: cgroups, cuitao, hannes, leon, linux-rdma, mkoutny, tj
In-Reply-To: <20260527133400.GM2487554@ziepe.ca>

Hi,Jason

> memory pin accounting should ideally be limited by the cgroup directly
> but we argued about that for a while and could never get an agreement
> of an acceptable implementation. There are many nasty corner cases
> around cgroups and fork and other cases IIRC
>
> So I'm not sure if making it rdma specific can easially solve these
> problems

Thanks for the detailed context.  I understand the concern — generic
pinned-page accounting at the memcg level has difficult ownership
semantics around fork(), cgroup migration, shared mappings, and page
lifetime tracking.

The intent of mr_mem is narrower and RDMA-scoped.  It is not page-level
ownership tracking — it is object-based accounting tied to the MR
lifetime:

  - charged at MR registration time
  - uncharged at MR destruction time
  - the charge lives with the MR's creating cgroup for the entire
    lifetime of the MR object

This model intentionally defines accounting semantics around MR
object lifetime rather than page ownership:

1. fork(): The accounting model is based on MR object ownership
   rather than ownership of the underlying pages after fork().
   fork() does not duplicate MR objects.  Even though the child
   inherits the uverbs fd and can access the parent's ucontext,
   the MR remains a single kernel object — fork itself creates no
   additional MR registrations or associated RDMA resource accounting.
   The charge is tied to the MR object, not to the number of processes
   that can reach it, so no splitting or re-accounting is needed.

2. Cgroup migration: mr_mem follows the same semantics as the existing
   hca_object — charge at creation time against the invoking task's
   cgroup, uncharge at destruction time.  The RDMA cgroup does not
   implement can_attach/attach callbacks today, so charges do not
   migrate with the task.  This is a known limitation that applies
   equally to hca_handle and hca_object.  mr_mem does not introduce
   any new complication here.

3. Overlap with memory cgroup: mr_mem does not count process memory
   usage — it represents a per-device DMA registration budget: how
   much memory can this cgroup register through a given HCA.  This is
   a different dimension from what memory cgroup tracks.  An
   administrator might set mr_mem limits differently per device, which
   memory cgroup cannot express.

   In particular, mr_mem tracks the registered memory range associated
   with the MR rather than exact dynamically pinned pages (e.g. for
   ODP MRs).  This is a stable, policy-oriented approximation of
   registration footprint — not an attempt at precise physical page
   accounting.

If you think this RDMA-scoped approach still has unresolved problems,
I'd appreciate guidance on which corner cases remain problematic.

Thanks,
Tao

^ permalink raw reply

* Re: [PATCH v5 9/9] mm: switch deferred split shrinker to list_lru
From: SeongJae Park @ 2026-05-28  7:08 UTC (permalink / raw)
  To: Johannes Weiner
  Cc: SeongJae Park, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Shakeel Butt, Michal Hocko, Dave Chinner, Roman Gushchin,
	Muchun Song, Qi Zheng, Yosry Ahmed, Zi Yan, Liam R . Howlett,
	Usama Arif, Kiryl Shutsemau, Vlastimil Babka, Kairui Song,
	Mikhail Zaslonko, Vasily Gorbik, Baolin Wang, Barry Song,
	Dev Jain, Lance Yang, Nico Pache, Ryan Roberts, cgroups, linux-mm,
	linux-kernel
In-Reply-To: <20260527204757.2544958-10-hannes@cmpxchg.org>

Hi Johannes,

On Wed, 27 May 2026 16:45:16 -0400 Johannes Weiner <hannes@cmpxchg.org> wrote:

> The deferred split queue handles cgroups in a suboptimal fashion. The
> queue is per-NUMA node or per-cgroup, not the intersection. That means
> on a cgrouped system, a node-restricted allocation entering reclaim
> can end up splitting large pages on other nodes:
> 
>         alloc/unmap
>           deferred_split_folio()
>             list_add_tail(memcg->split_queue)
>             set_shrinker_bit(memcg, node, deferred_shrinker_id)
> 
>         for_each_zone_zonelist_nodemask(restricted_nodes)
>           mem_cgroup_iter()
>             shrink_slab(node, memcg)
>               shrink_slab_memcg(node, memcg)
>                 if test_shrinker_bit(memcg, node, deferred_shrinker_id)
>                   deferred_split_scan()
>                     walks memcg->split_queue
> 
> The shrinker bit adds an imperfect guard rail. As soon as the cgroup
> has a single large page on the node of interest, all large pages owned
> by that memcg, including those on other nodes, will be split.
> 
> list_lru properly sets up per-node, per-cgroup lists. As a bonus, it
> streamlines a lot of the list operations and reclaim walks. It's used
> widely by other major shrinkers already. Convert the deferred split
> queue as well.
> 
> The list_lru per-memcg heads are instantiated on demand when the first
> object of interest is allocated for a cgroup, by calling
> folio_memcg_alloc_deferred(). Add calls to where splittable pages are
> created: anon faults, swapin faults, khugepaged collapse.
> 
> These calls create all possible node heads for the cgroup at once, so
> the migration code (between nodes) doesn't need any special care.
> 
> Reported-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
> Tested-by: Mikhail Zaslonko <zaslonko@linux.ibm.com>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> Reviewed-by: Lorenzo Stoakes (Oracle) <ljs@kernel.org>
> Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
> ---
>  include/linux/huge_mm.h    |   7 +-
>  include/linux/memcontrol.h |   4 -
>  include/linux/mmzone.h     |  12 --
>  mm/huge_memory.c           | 364 +++++++++++++------------------------
>  mm/internal.h              |   2 +-
>  mm/khugepaged.c            |   5 +
>  mm/memcontrol.c            |  12 +-
>  mm/memory.c                |   4 +
>  mm/mm_init.c               |  15 --
>  mm/swap_state.c            |  10 +
>  10 files changed, 150 insertions(+), 285 deletions(-)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index edece3e26985..f6c2531a27a3 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -423,10 +423,10 @@ static inline int split_huge_page(struct page *page)
>  {
>  	return split_huge_page_to_list_to_order(page, NULL, 0);
>  }
> +
> +int folio_memcg_alloc_deferred(struct folio *folio);
> +
>  void deferred_split_folio(struct folio *folio, bool partially_mapped);
> -#ifdef CONFIG_MEMCG
> -void reparent_deferred_split_queue(struct mem_cgroup *memcg);
> -#endif
>  
>  void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>  		unsigned long address, bool freeze);
> @@ -664,7 +664,6 @@ static inline int folio_split(struct folio *folio, unsigned int new_order,
>  }
>  
>  static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
> -static inline void reparent_deferred_split_queue(struct mem_cgroup *memcg) {}
>  #define split_huge_pmd(__vma, __pmd, __address)	\
>  	do { } while (0)

I found this patch is now in mm-new and it makes UM mode kunit fails like
below.

    $ ./tools/testing/kunit/kunit.py run --kunitconfig mm/damon/tests/
    [00:00:02] Configuring KUnit Kernel ...
    [00:00:02] Building KUnit Kernel ...
    Populating config with:
    $ make ARCH=um O=.kunit olddefconfig
    Building with:
    $ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=8
    ERROR:root:../mm/swap_state.c: In function ‘__swap_cache_alloc’:
    ../mm/swap_state.c:468:26: error: implicit declaration of function ‘folio_memcg_alloc_deferred’ [-Wimplicit-function-declaration]
      468 |         if (order > 1 && folio_memcg_alloc_deferred(folio)) {
          |                          ^~~~~~~~~~~~~~~~~~~~~~~~~~
    make[4]: *** [../scripts/Makefile.build:289: mm/swap_state.o] Error 1
    make[4]: *** Waiting for unfinished jobs....
    make[3]: *** [../scripts/Makefile.build:548: mm] Error 2
    make[3]: *** Waiting for unfinished jobs....
    make[2]: *** [/home/lkhack/linux/Makefile:2143: .] Error 2
    make[1]: *** [/home/lkhack/linux/Makefile:248: __sub-make] Error 2
    make: *** [Makefile:248: __sub-make] Error 2

Maybe we can define the function for CONFIG_TRANSPARENT_HUGEPAGE unset case?  I
confirmed the below attaching temporal fix works for at least kunit.


Thanks,
SJ

[...]
=== >8 ===
From 23b5800dd49085707baee5774b74782c3e424f24 Mon Sep 17 00:00:00 2001
From: SeongJae Park <sj@kernel.org>
Date: Wed, 27 May 2026 23:58:07 -0700
Subject: [PATCH] mm/huge_mm: define memcg_alloc_deferred() for
 !CONFIG_TRANSPARENT_HUGEPPAGE
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Without this, UM mode kunit fails like below.

    $ ./tools/testing/kunit/kunit.py run --kunitconfig mm/damon/tests/
    [00:00:02] Configuring KUnit Kernel ...
    [00:00:02] Building KUnit Kernel ...
    Populating config with:
    $ make ARCH=um O=.kunit olddefconfig
    Building with:
    $ make all compile_commands.json scripts_gdb ARCH=um O=.kunit --jobs=8
    ERROR:root:../mm/swap_state.c: In function ‘__swap_cache_alloc’:
    ../mm/swap_state.c:468:26: error: implicit declaration of function ‘folio_memcg_alloc_deferred’ [-Wimplicit-function-declaration]
      468 |         if (order > 1 && folio_memcg_alloc_deferred(folio)) {
          |                          ^~~~~~~~~~~~~~~~~~~~~~~~~~
    make[4]: *** [../scripts/Makefile.build:289: mm/swap_state.o] Error 1
    make[4]: *** Waiting for unfinished jobs....
    make[3]: *** [../scripts/Makefile.build:548: mm] Error 2
    make[3]: *** Waiting for unfinished jobs....
    make[2]: *** [/home/lkhack/linux/Makefile:2143: .] Error 2
    make[1]: *** [/home/lkhack/linux/Makefile:248: __sub-make] Error 2
    make: *** [Makefile:248: __sub-make] Error 2

Fix by implementing the function for CONFIG_TRANSPARENT_HUGEPPAGE unset
case.

Fixes: https://lore.kernel.org/20260527204757.2544958-10-hannes@cmpxchg.org
Signed-off-by: SeongJae Park <sj@kernel.org>
---
 include/linux/huge_mm.h | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f6c2531a27a35..055de7b8ed487 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -663,6 +663,11 @@ static inline int folio_split(struct folio *folio, unsigned int new_order,
 	return -EINVAL;
 }
 
+static inline int folio_memcg_alloc_deferred(struct folio *folio)
+{
+	return 0;
+}
+
 static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
 #define split_huge_pmd(__vma, __pmd, __address)	\
 	do { } while (0)
-- 
2.47.3



^ permalink raw reply related

* Re: [RFC PATCH bpf-next v7 04/11] libbpf: introduce bpf_map__attach_struct_ops_opts()
From: Leon Hwang @ 2026-05-28  5:53 UTC (permalink / raw)
  To: Yonghong Song, Hui Zhu, Alexei Starovoitov, Daniel Borkmann,
	John Fastabend, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Kumar Kartikeya Dwivedi, Song Liu, Jiri Olsa,
	Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt,
	Muchun Song, JP Kobryn, Andrew Morton, Shuah Khan, davem,
	Jakub Kicinski, Jesper Dangaard Brouer, Stanislav Fomichev,
	KP Singh, Tao Chen, Mykyta Yatsenko, Leon Hwang, Anton Protopopov,
	Amery Hung, Tobias Klauser, Eyal Birger, Rong Tao, Hao Luo,
	Peter Zijlstra, Miguel Ojeda, Nathan Chancellor, Kees Cook,
	Tejun Heo, Jeff Xu, mkoutny, Jan Hendrik Farr, Christian Brauner,
	Randy Dunlap, Brian Gerst, Masahiro Yamada, Willem de Bruijn,
	Jason Xing, Paul Chaignon, Chen Ridong, Lance Yang, Jiayuan Chen,
	linux-kernel, bpf, cgroups, linux-mm, netdev, linux-kselftest
  Cc: geliang, baohua
In-Reply-To: <2fd62ec0-c594-4ac2-a95d-29eafbcb74d6@linux.dev>

On 27/5/26 23:43, Yonghong Song wrote:
> 
> 
> On 5/25/26 7:20 PM, Hui Zhu wrote:
>> From: Roman Gushchin <roman.gushchin@linux.dev>
[...]
>> diff --git a/tools/lib/bpf/libbpf.map b/tools/lib/bpf/libbpf.map
>> index dfed8d60af05..6105619b5ecf 100644
>> --- a/tools/lib/bpf/libbpf.map
>> +++ b/tools/lib/bpf/libbpf.map
>> @@ -454,6 +454,7 @@ LIBBPF_1.7.0 {
>>           bpf_prog_assoc_struct_ops;
>>           bpf_program__assoc_struct_ops;
>>           btf__permute;
>> +        bpf_map__attach_struct_ops_opts;
> 
> Function bpf_map__attach_struct_ops_opts should be in
> LIBBPF_1.8.0.
> 

Pls also keep it in alphabet order.

Thanks,
Leon

>>   } LIBBPF_1.6.0;
>>     LIBBPF_1.8.0 {
> 
> 
> 


^ permalink raw reply

* [PATCH] cgroup: pair max limit READ_ONCE() with WRITE_ONCE()
From: Ren Tamura @ 2026-05-28  4:28 UTC (permalink / raw)
  To: tj, hannes, mkoutny; +Cc: cgroups, linux-kernel, Ren Tamura

cgroup.max.descendants and cgroup.max.depth are shown through seq_file.
Their show callbacks read cgrp->max_descendants and cgrp->max_depth with
READ_ONCE(), respectively.

The corresponding write callbacks update the same scalar fields while
holding the cgroup lock, but the seq_file show path does not serialize
against those stores. This leaves the lockless show-side loads annotated
with READ_ONCE(), while the corresponding stores remain plain stores.

Use WRITE_ONCE() for the updates so the intended lockless access is marked
consistently on both sides. This does not change locking, ordering, or
user-visible semantics.

Assisted-by: OpenAI-Codex:gpt-5.5
Signed-off-by: Ren Tamura <ren.tamura.oss@gmail.com>
---
 kernel/cgroup/cgroup.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 6152add0c..daddfc2b9 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -3726,7 +3726,7 @@ static ssize_t cgroup_max_descendants_write(struct kernfs_open_file *of,
 	if (!cgrp)
 		return -ENOENT;
 
-	cgrp->max_descendants = descendants;
+	WRITE_ONCE(cgrp->max_descendants, descendants);
 
 	cgroup_kn_unlock(of->kn);
 
@@ -3769,7 +3769,7 @@ static ssize_t cgroup_max_depth_write(struct kernfs_open_file *of,
 	if (!cgrp)
 		return -ENOENT;
 
-	cgrp->max_depth = depth;
+	WRITE_ONCE(cgrp->max_depth, depth);
 
 	cgroup_kn_unlock(of->kn);
 

base-commit: eb3f4b7426cfd2b79d65b7d37155480b32259a11
-- 
2.53.0


^ permalink raw reply related

* Re: [PATCH v3] security: Expand task_setscheduler LSM hook to include CPU affinity mask
From: Aaron Tomlin @ 2026-05-28  1:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: tsbogend, paul, jmorris, serge, mingo, juri.lelli,
	vincent.guittot, stephen.smalley.work, casey, longman, tj, hannes,
	mkoutny, chenridong, dietmar.eggemann, rostedt, bsegall, mgorman,
	vschneid, kprateek.nayak, omosnace, kees, neelx, sean, chjohnst,
	steve, mproche, nick.lange, cgroups, linux-mips, linux-fsdevel,
	linux-security-module, selinux, linux-kernel
In-Reply-To: <20260527195858.GC3493090@noisy.programming.kicks-ass.net>

[-- Attachment #1: Type: text/plain, Size: 3314 bytes --]

On Wed, May 27, 2026 at 09:58:58PM +0200, Peter Zijlstra wrote:
> On Wed, May 27, 2026 at 01:41:52PM -0400, Aaron Tomlin wrote:
> 
> > > > The actual use case here is multi-tenant workload isolation and visibility.
> > > > Passing the evaluated cpumask to the BPF LSM allows operators to write a
> > > > simple eBPF program to detect spatial boundary overlaps (e.g., logging an
> > > > event if a requested mask intersects with platform-reserved cores).
> 
> Why isn't cgroups good enough to enforce this? If you create a cgroup
> hierarchy per tenant, and constrain them using the cpuset controller,
> they should not be able to escape, rendering this event impossible.

Hi Peter,

You raise a very fair point. The cpuset cgroup controller is indeed the
kernel's primary vehicle for spatial enforcement, and under normal
circumstances, it successfully prevents a tenant from escaping their
designated cores.

The cpuset controller does govern resource limits, but does not audit
intent. When __sched_setaffinity() is invoked, the kernel compares the
requested in_mask against the task's allowed cpuset. If there is only a
partial intersection, the kernel silently truncates the requested mask to
fit the cpuset, without raising any alarm.

The BPF LSM hook, conversely, receives the raw, untruncated in_mask,
affording operators the visibility to detect, audit, and even reject these
violations of intent before the kernel silently sanitises the input.

This patch does not seek to replace the cpuset controller, but rather to
complement it by providing auditing capabilities.

> > We are not creating a bespoke BPF hook here; rather, we are rectifying a
> > historical blind spot within the API. The existing LSM hook is invoked
> > during sched_setaffinity(), yet it presently receives only the task_struct
> > pointer. Consequently, the security module is essentially asked, "Should
> > Process A be permitted to alter Process B's affinity?" without being
> > informed of the proposed affinity itself. Providing in_mask simply
> > furnishes the existing hook with the requisite payload to make an informed
> > decision.
> 
> It occurs to me that this same argument would require to also pass in
> the new sched_attr, no? That way the LSM can inspect the new policy
> before it becomes effective.

I agree, the underlying logic does indeed extend perfectly to sched_attr.

Presently, the LSM is equally oblivious as to whether a process is
requesting a benign transition to SCHED_BATCH, or attempting to escalate
its privileges by requesting a real-time policy such as SCHED_FIFO with
maximum priority. Just as with the CPU mask, providing the sched_attr
payload would rectify this parallel blind spot, allowing BPF policies to
inspect and mediate scheduling attributes before they become effective.

If you are amenable, I should be more than happy to expand the scope of the
forthcoming patch to include this. Alternatively, we could address the
sched_attr expansion in a separate, subsequent patch. Personally, I would
favour the latter approach, but please do let me know your preference.

I very much look forward to hearing Paul's thoughts on whether this aligns
with the broader LSM vision.

Thank you.


Kind regards,
-- 
Aaron Tomlin

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox