* [PATCH v3 0/4] reparent the THP split queue
From: Qi Zheng @ 2025-09-28 11:16 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, akpm
  Cc: linux-mm, linux-kernel, cgroups, Qi Zheng

Changes in v3:
 - use css_is_dying() in folio_split_queue_lock*() to check if memcg is dying
   (David Hildenbrand, Shakeel Butt and Zi Yan)
 - modify the commit message in [PATCH v2 4/4]
   (Roman Gushchin)
 - fix the build error in [PATCH v2 4/4]
 - collect Acked-bys and Reviewed-bys
 - rebase onto the next-20250926

Changes in v2:
 - fix build errors in [PATCH 2/4] and [PATCH 4/4]
 - some cleanups for [PATCH 3/4] (suggested by David Hildenbrand)
 - collect Acked-bys and Reviewed-bys
 - rebase onto the next-20250922

Hi all,

In the future, we will reparent LRU folios during memcg offline to eliminate
dying memory cgroups, which requires reparenting the THP split queue to its
parent memcg.

Similar to list_lru, the split queue is relatively independent and does not need
to be reparented along with the objcg and LRU folios (which requires holding the
objcg lock and the lru lock). Therefore, we can apply the same mechanism as
list_lru to reparent the split queue first when the memcg is offlined.

The first three patches in this series are separated from the series
"Eliminate Dying Memory Cgroup" [1], mainly to do some cleanup and preparatory
work.

The last patch reparents the THP split queue to its parent memcg during memcg
offline.

Comments and suggestions are welcome!

Thanks,
Qi

[1]. https://lore.kernel.org/all/20250415024532.26632-1-songmuchun@bytedance.com/

Muchun Song (3):
  mm: thp: replace folio_memcg() with folio_memcg_charged()
  mm: thp: introduce folio_split_queue_lock and its variants
  mm: thp: use folio_batch to handle THP splitting in
    deferred_split_scan()

Qi Zheng (1):
  mm: thp: reparent the split queue during memcg offline

 include/linux/huge_mm.h    |   4 +
 include/linux/memcontrol.h |  10 ++
 mm/huge_memory.c           | 236 ++++++++++++++++++++++++++-----------
 mm/memcontrol.c            |   1 +
 4 files changed, 179 insertions(+), 72 deletions(-)

-- 
2.20.1



* [PATCH v3 1/4] mm: thp: replace folio_memcg() with folio_memcg_charged()
From: Qi Zheng @ 2025-09-28 11:16 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, akpm
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

From: Muchun Song <songmuchun@bytedance.com>

folio_memcg_charged() is intended for use when the user is unconcerned
about the returned memcg pointer. It is more efficient than folio_memcg().
Therefore, replace folio_memcg() with folio_memcg_charged().
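
To illustrate why (a simplified sketch, not the exact definitions in
include/linux/memcontrol.h), folio_memcg_charged() only needs to test
whether the folio is charged at all:

	static inline bool folio_memcg_charged(struct folio *folio)
	{
		/* roughly: is any memcg/objcg bound to this folio? */
		return folio->memcg_data != 0;
	}

whereas folio_memcg() has to decode folio->memcg_data and, for kmem
folios, follow the objcg indirection just to produce a pointer that the
caller then only compares against NULL. The WARN_ON_ONCE() below only
cares whether the folio is charged, so the cheaper helper is sufficient.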

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
---
 mm/huge_memory.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 1b81680b4225f..6db24b3a57005 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4014,7 +4014,7 @@ bool __folio_unqueue_deferred_split(struct folio *folio)
 	bool unqueued = false;
 
 	WARN_ON_ONCE(folio_ref_count(folio));
-	WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg(folio));
+	WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg_charged(folio));
 
 	ds_queue = get_deferred_split_queue(folio);
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-- 
2.20.1



* [PATCH v3 2/4] mm: thp: introduce folio_split_queue_lock and its variants
From: Qi Zheng @ 2025-09-28 11:17 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, akpm
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

From: Muchun Song <songmuchun@bytedance.com>

In future memcg removal, the binding between a folio and a memcg may
change, making the split lock within the memcg unstable when held.

A new approach is required to reparent the split queue to its parent. This
patch starts introducing a unified way to acquire the split lock for
future work.

It's a code-only refactoring with no functional changes.
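
After this patch, every place that needs the lock follows the same
pattern, for example:

	ds_queue = folio_split_queue_lock_irqsave(folio, &flags);
	/* manipulate folio->_deferred_list / ds_queue->split_queue_len */
	split_queue_unlock_irqrestore(ds_queue, flags);

so a later patch only has to teach the lock helpers how to pick the
right queue, without touching the callers again.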

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
Acked-by: David Hildenbrand <david@redhat.com>
---
 include/linux/memcontrol.h |  10 ++++
 mm/huge_memory.c           | 104 ++++++++++++++++++++++++++++---------
 2 files changed, 89 insertions(+), 25 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 16fe0306e50ea..99876af13c315 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -1662,6 +1662,11 @@ int alloc_shrinker_info(struct mem_cgroup *memcg);
 void free_shrinker_info(struct mem_cgroup *memcg);
 void set_shrinker_bit(struct mem_cgroup *memcg, int nid, int shrinker_id);
 void reparent_shrinker_deferred(struct mem_cgroup *memcg);
+
+static inline int shrinker_id(struct shrinker *shrinker)
+{
+	return shrinker->id;
+}
 #else
 #define mem_cgroup_sockets_enabled 0
 
@@ -1693,6 +1698,11 @@ static inline void set_shrinker_bit(struct mem_cgroup *memcg,
 				    int nid, int shrinker_id)
 {
 }
+
+static inline int shrinker_id(struct shrinker *shrinker)
+{
+	return -1;
+}
 #endif
 
 #ifdef CONFIG_MEMCG
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 6db24b3a57005..0ac3b97177b7f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1078,26 +1078,83 @@ pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma)
 
 #ifdef CONFIG_MEMCG
 static inline
-struct deferred_split *get_deferred_split_queue(struct folio *folio)
+struct mem_cgroup *folio_split_queue_memcg(struct folio *folio,
+					   struct deferred_split *queue)
 {
-	struct mem_cgroup *memcg = folio_memcg(folio);
-	struct pglist_data *pgdat = NODE_DATA(folio_nid(folio));
+	if (mem_cgroup_disabled())
+		return NULL;
+	if (&NODE_DATA(folio_nid(folio))->deferred_split_queue == queue)
+		return NULL;
+	return container_of(queue, struct mem_cgroup, deferred_split_queue);
+}
 
-	if (memcg)
-		return &memcg->deferred_split_queue;
-	else
-		return &pgdat->deferred_split_queue;
+static struct deferred_split *folio_split_queue_lock(struct folio *folio)
+{
+	struct mem_cgroup *memcg;
+	struct deferred_split *queue;
+
+	memcg = folio_memcg(folio);
+	queue = memcg ? &memcg->deferred_split_queue :
+			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
+	spin_lock(&queue->split_queue_lock);
+
+	return queue;
+}
+
+static struct deferred_split *
+folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
+{
+	struct mem_cgroup *memcg;
+	struct deferred_split *queue;
+
+	memcg = folio_memcg(folio);
+	queue = memcg ? &memcg->deferred_split_queue :
+			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
+	spin_lock_irqsave(&queue->split_queue_lock, *flags);
+
+	return queue;
 }
 #else
 static inline
-struct deferred_split *get_deferred_split_queue(struct folio *folio)
+struct mem_cgroup *folio_split_queue_memcg(struct folio *folio,
+					   struct deferred_split *queue)
+{
+	return NULL;
+}
+
+static struct deferred_split *folio_split_queue_lock(struct folio *folio)
 {
 	struct pglist_data *pgdat = NODE_DATA(folio_nid(folio));
+	struct deferred_split *queue = &pgdat->deferred_split_queue;
+
+	spin_lock(&queue->split_queue_lock);
 
-	return &pgdat->deferred_split_queue;
+	return queue;
+}
+
+static struct deferred_split *
+folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
+{
+	struct pglist_data *pgdat = NODE_DATA(folio_nid(folio));
+	struct deferred_split *queue = &pgdat->deferred_split_queue;
+
+	spin_lock_irqsave(&queue->split_queue_lock, *flags);
+
+	return queue;
 }
 #endif
 
+static inline void split_queue_unlock(struct deferred_split *queue)
+{
+	spin_unlock(&queue->split_queue_lock);
+}
+
+static inline void split_queue_unlock_irqrestore(struct deferred_split *queue,
+						 unsigned long flags)
+{
+	spin_unlock_irqrestore(&queue->split_queue_lock, flags);
+}
+
 static inline bool is_transparent_hugepage(const struct folio *folio)
 {
 	if (!folio_test_large(folio))
@@ -3579,7 +3636,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 		struct page *split_at, struct page *lock_at,
 		struct list_head *list, bool uniform_split)
 {
-	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
+	struct deferred_split *ds_queue;
 	XA_STATE(xas, &folio->mapping->i_pages, folio->index);
 	struct folio *end_folio = folio_next(folio);
 	bool is_anon = folio_test_anon(folio);
@@ -3718,7 +3775,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 	}
 
 	/* Prevent deferred_split_scan() touching ->_refcount */
-	spin_lock(&ds_queue->split_queue_lock);
+	ds_queue = folio_split_queue_lock(folio);
 	if (folio_ref_freeze(folio, 1 + extra_pins)) {
 		struct swap_cluster_info *ci = NULL;
 		struct lruvec *lruvec;
@@ -3740,7 +3797,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 			 */
 			list_del_init(&folio->_deferred_list);
 		}
-		spin_unlock(&ds_queue->split_queue_lock);
+		split_queue_unlock(ds_queue);
 		if (mapping) {
 			int nr = folio_nr_pages(folio);
 
@@ -3835,7 +3892,7 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 		if (ci)
 			swap_cluster_unlock(ci);
 	} else {
-		spin_unlock(&ds_queue->split_queue_lock);
+		split_queue_unlock(ds_queue);
 		ret = -EAGAIN;
 	}
 fail:
@@ -4016,8 +4073,7 @@ bool __folio_unqueue_deferred_split(struct folio *folio)
 	WARN_ON_ONCE(folio_ref_count(folio));
 	WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg_charged(folio));
 
-	ds_queue = get_deferred_split_queue(folio);
-	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+	ds_queue = folio_split_queue_lock_irqsave(folio, &flags);
 	if (!list_empty(&folio->_deferred_list)) {
 		ds_queue->split_queue_len--;
 		if (folio_test_partially_mapped(folio)) {
@@ -4028,7 +4084,7 @@ bool __folio_unqueue_deferred_split(struct folio *folio)
 		list_del_init(&folio->_deferred_list);
 		unqueued = true;
 	}
-	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+	split_queue_unlock_irqrestore(ds_queue, flags);
 
 	return unqueued;	/* useful for debug warnings */
 }
@@ -4036,10 +4092,7 @@ bool __folio_unqueue_deferred_split(struct folio *folio)
 /* partially_mapped=false won't clear PG_partially_mapped folio flag */
 void deferred_split_folio(struct folio *folio, bool partially_mapped)
 {
-	struct deferred_split *ds_queue = get_deferred_split_queue(folio);
-#ifdef CONFIG_MEMCG
-	struct mem_cgroup *memcg = folio_memcg(folio);
-#endif
+	struct deferred_split *ds_queue;
 	unsigned long flags;
 
 	/*
@@ -4062,7 +4115,7 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
 	if (folio_test_swapcache(folio))
 		return;
 
-	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+	ds_queue = folio_split_queue_lock_irqsave(folio, &flags);
 	if (partially_mapped) {
 		if (!folio_test_partially_mapped(folio)) {
 			folio_set_partially_mapped(folio);
@@ -4077,15 +4130,16 @@ void deferred_split_folio(struct folio *folio, bool partially_mapped)
 		VM_WARN_ON_FOLIO(folio_test_partially_mapped(folio), folio);
 	}
 	if (list_empty(&folio->_deferred_list)) {
+		struct mem_cgroup *memcg;
+
+		memcg = folio_split_queue_memcg(folio, ds_queue);
 		list_add_tail(&folio->_deferred_list, &ds_queue->split_queue);
 		ds_queue->split_queue_len++;
-#ifdef CONFIG_MEMCG
 		if (memcg)
 			set_shrinker_bit(memcg, folio_nid(folio),
-					 deferred_split_shrinker->id);
-#endif
+					 shrinker_id(deferred_split_shrinker));
 	}
-	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+	split_queue_unlock_irqrestore(ds_queue, flags);
 }
 
 static unsigned long deferred_split_count(struct shrinker *shrink,
-- 
2.20.1



* [PATCH v3 3/4] mm: thp: use folio_batch to handle THP splitting in deferred_split_scan()
From: Qi Zheng @ 2025-09-28 11:17 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, akpm
  Cc: linux-mm, linux-kernel, cgroups, Muchun Song, Qi Zheng

From: Muchun Song <songmuchun@bytedance.com>

The maintenance of folio->_deferred_list is intricate because the same
list head is reused for a local on-stack list in deferred_split_scan().

Here are some peculiarities:

   1) When a folio is removed from its split queue and added to a local
      on-stack list in deferred_split_scan(), the ->split_queue_len isn't
      updated, leading to an inconsistency between it and the actual
      number of folios in the split queue.

   2) When the folio is split via split_folio() later, it's removed from
      the local list while holding the split queue lock. At this time,
      this lock protects the local list, not the split queue.

   3) To handle the race condition with a third-party freeing or migrating
      the preceding folio, we must ensure there's always one safe (with
      raised refcount) folio before by delaying its folio_put(). More
      details can be found in commit e66f3185fa04 ("mm/thp: fix deferred
      split queue not partially_mapped"). It's rather tricky.

We can use the folio_batch infrastructure to handle this cleanly. In this
case, ->split_queue_len will be consistent with the real number of folios
in the split queue. If list_empty(&folio->_deferred_list) returns false,
it's clear the folio must be in its split queue (not in a local list
anymore).

In the future, we will reparent LRU folios during memcg offline to
eliminate dying memory cgroups, which requires reparenting the split queue
to its parent first. So this patch prepares for using
folio_split_queue_lock_irqsave() as the memcg may change then.
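
The folio_batch usage relied on here is roughly the following (a
simplified sketch of the generic API from include/linux/pagevec.h, not
the exact code in the hunk below):

	struct folio_batch fbatch;
	unsigned int i;

	folio_batch_init(&fbatch);
	/* while holding the split queue lock: pin folios into the batch */
	if (folio_try_get(folio))
		folio_batch_add(&fbatch, folio);
	/* stop filling once folio_batch_space(&fbatch) reaches zero */

	/* after dropping the lock: walk the pinned snapshot */
	for (i = 0; i < folio_batch_count(&fbatch); i++)
		process(fbatch.folios[i]);	/* placeholder for the split work */

	/* drop every reference taken above in one call */
	folios_put(&fbatch);

Because each folio is unlinked from the split queue (with
->split_queue_len updated) at the moment it is added to the batch, the
counter always matches the folios actually on the queue.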

Signed-off-by: Muchun Song <songmuchun@bytedance.com>
Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
Reviewed-by: Zi Yan <ziy@nvidia.com>
Acked-by: David Hildenbrand <david@redhat.com>
---
 mm/huge_memory.c | 84 ++++++++++++++++++++++--------------------------
 1 file changed, 38 insertions(+), 46 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 0ac3b97177b7f..bb32091e3133e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -3781,21 +3781,22 @@ static int __folio_split(struct folio *folio, unsigned int new_order,
 		struct lruvec *lruvec;
 		int expected_refs;
 
-		if (folio_order(folio) > 1 &&
-		    !list_empty(&folio->_deferred_list)) {
-			ds_queue->split_queue_len--;
+		if (folio_order(folio) > 1) {
+			if (!list_empty(&folio->_deferred_list)) {
+				ds_queue->split_queue_len--;
+				/*
+				 * Reinitialize page_deferred_list after removing the
+				 * page from the split_queue, otherwise a subsequent
+				 * split will see list corruption when checking the
+				 * page_deferred_list.
+				 */
+				list_del_init(&folio->_deferred_list);
+			}
 			if (folio_test_partially_mapped(folio)) {
 				folio_clear_partially_mapped(folio);
 				mod_mthp_stat(folio_order(folio),
 					      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
 			}
-			/*
-			 * Reinitialize page_deferred_list after removing the
-			 * page from the split_queue, otherwise a subsequent
-			 * split will see list corruption when checking the
-			 * page_deferred_list.
-			 */
-			list_del_init(&folio->_deferred_list);
 		}
 		split_queue_unlock(ds_queue);
 		if (mapping) {
@@ -4185,40 +4186,44 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 	struct pglist_data *pgdata = NODE_DATA(sc->nid);
 	struct deferred_split *ds_queue = &pgdata->deferred_split_queue;
 	unsigned long flags;
-	LIST_HEAD(list);
-	struct folio *folio, *next, *prev = NULL;
-	int split = 0, removed = 0;
+	struct folio *folio, *next;
+	int split = 0, i;
+	struct folio_batch fbatch;
 
 #ifdef CONFIG_MEMCG
 	if (sc->memcg)
 		ds_queue = &sc->memcg->deferred_split_queue;
 #endif
 
+	folio_batch_init(&fbatch);
+retry:
 	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
 	/* Take pin on all head pages to avoid freeing them under us */
 	list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
 							_deferred_list) {
 		if (folio_try_get(folio)) {
-			list_move(&folio->_deferred_list, &list);
-		} else {
+			folio_batch_add(&fbatch, folio);
+		} else if (folio_test_partially_mapped(folio)) {
 			/* We lost race with folio_put() */
-			if (folio_test_partially_mapped(folio)) {
-				folio_clear_partially_mapped(folio);
-				mod_mthp_stat(folio_order(folio),
-					      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
-			}
-			list_del_init(&folio->_deferred_list);
-			ds_queue->split_queue_len--;
+			folio_clear_partially_mapped(folio);
+			mod_mthp_stat(folio_order(folio),
+				      MTHP_STAT_NR_ANON_PARTIALLY_MAPPED, -1);
 		}
+		list_del_init(&folio->_deferred_list);
+		ds_queue->split_queue_len--;
 		if (!--sc->nr_to_scan)
 			break;
+		if (!folio_batch_space(&fbatch))
+			break;
 	}
 	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
 
-	list_for_each_entry_safe(folio, next, &list, _deferred_list) {
+	for (i = 0; i < folio_batch_count(&fbatch); i++) {
 		bool did_split = false;
 		bool underused = false;
+		struct deferred_split *fqueue;
 
+		folio = fbatch.folios[i];
 		if (!folio_test_partially_mapped(folio)) {
 			/*
 			 * See try_to_map_unused_to_zeropage(): we cannot
@@ -4241,38 +4246,25 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 		}
 		folio_unlock(folio);
 next:
+		if (did_split || !folio_test_partially_mapped(folio))
+			continue;
 		/*
-		 * split_folio() removes folio from list on success.
 		 * Only add back to the queue if folio is partially mapped.
 		 * If thp_underused returns false, or if split_folio fails
 		 * in the case it was underused, then consider it used and
 		 * don't add it back to split_queue.
 		 */
-		if (did_split) {
-			; /* folio already removed from list */
-		} else if (!folio_test_partially_mapped(folio)) {
-			list_del_init(&folio->_deferred_list);
-			removed++;
-		} else {
-			/*
-			 * That unlocked list_del_init() above would be unsafe,
-			 * unless its folio is separated from any earlier folios
-			 * left on the list (which may be concurrently unqueued)
-			 * by one safe folio with refcount still raised.
-			 */
-			swap(folio, prev);
+		fqueue = folio_split_queue_lock_irqsave(folio, &flags);
+		if (list_empty(&folio->_deferred_list)) {
+			list_add_tail(&folio->_deferred_list, &fqueue->split_queue);
+			fqueue->split_queue_len++;
 		}
-		if (folio)
-			folio_put(folio);
+		split_queue_unlock_irqrestore(fqueue, flags);
 	}
+	folios_put(&fbatch);
 
-	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
-	list_splice_tail(&list, &ds_queue->split_queue);
-	ds_queue->split_queue_len -= removed;
-	spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
-
-	if (prev)
-		folio_put(prev);
+	if (sc->nr_to_scan)
+		goto retry;
 
 	/*
 	 * Stop shrinker if we didn't split any page, but the queue is empty.
-- 
2.20.1



* [PATCH v3 4/4] mm: thp: reparent the split queue during memcg offline
From: Qi Zheng @ 2025-09-28 11:45 UTC (permalink / raw)
  To: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, harry.yoo, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, akpm
  Cc: linux-mm, linux-kernel, cgroups, Qi Zheng

From: Qi Zheng <zhengqi.arch@bytedance.com>

Similar to list_lru, the split queue is relatively independent and does
not need to be reparented along with objcg and LRU folios (holding
objcg lock and lru lock). So let's apply the same mechanism as list_lru
to reparent the split queue separately when memcg is offline.

This is also a preparation for reparenting LRU folios.

Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
---
 include/linux/huge_mm.h |  4 ++++
 mm/huge_memory.c        | 46 +++++++++++++++++++++++++++++++++++++++++
 mm/memcontrol.c         |  1 +
 3 files changed, 51 insertions(+)

diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f327d62fc9852..0c211dcbb0ec1 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -417,6 +417,9 @@ static inline int split_huge_page(struct page *page)
 	return split_huge_page_to_list_to_order(page, NULL, ret);
 }
 void deferred_split_folio(struct folio *folio, bool partially_mapped);
+#ifdef CONFIG_MEMCG
+void reparent_deferred_split_queue(struct mem_cgroup *memcg);
+#endif
 
 void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
 		unsigned long address, bool freeze);
@@ -611,6 +614,7 @@ static inline int try_folio_split(struct folio *folio, struct page *page,
 }
 
 static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
+static inline void reparent_deferred_split_queue(struct mem_cgroup *memcg) {}
 #define split_huge_pmd(__vma, __pmd, __address)	\
 	do { } while (0)
 
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bb32091e3133e..5fc0caca71de0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1094,9 +1094,22 @@ static struct deferred_split *folio_split_queue_lock(struct folio *folio)
 	struct deferred_split *queue;
 
 	memcg = folio_memcg(folio);
+retry:
 	queue = memcg ? &memcg->deferred_split_queue :
 			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
 	spin_lock(&queue->split_queue_lock);
+	/*
+	 * Notice:
+	 * 1. The memcg could be NULL if cgroup_disable=memory is set.
+	 * 2. There is a period between setting CSS_DYING and reparenting
+	 *    deferred split queue, and during this period the THPs in the
+	 *    deferred split queue will be hidden from the shrinker side.
+	 */
+	if (unlikely(memcg && css_is_dying(&memcg->css))) {
+		spin_unlock(&queue->split_queue_lock);
+		memcg = parent_mem_cgroup(memcg);
+		goto retry;
+	}
 
 	return queue;
 }
@@ -1108,9 +1121,15 @@ folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
 	struct deferred_split *queue;
 
 	memcg = folio_memcg(folio);
+retry:
 	queue = memcg ? &memcg->deferred_split_queue :
 			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
 	spin_lock_irqsave(&queue->split_queue_lock, *flags);
+	if (unlikely(memcg && css_is_dying(&memcg->css))) {
+		spin_unlock_irqrestore(&queue->split_queue_lock, *flags);
+		memcg = parent_mem_cgroup(memcg);
+		goto retry;
+	}
 
 	return queue;
 }
@@ -4275,6 +4294,33 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
 	return split;
 }
 
+#ifdef CONFIG_MEMCG
+void reparent_deferred_split_queue(struct mem_cgroup *memcg)
+{
+	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
+	struct deferred_split *ds_queue = &memcg->deferred_split_queue;
+	struct deferred_split *parent_ds_queue = &parent->deferred_split_queue;
+	int nid;
+
+	spin_lock_irq(&ds_queue->split_queue_lock);
+	spin_lock_nested(&parent_ds_queue->split_queue_lock, SINGLE_DEPTH_NESTING);
+
+	if (!ds_queue->split_queue_len)
+		goto unlock;
+
+	list_splice_tail_init(&ds_queue->split_queue, &parent_ds_queue->split_queue);
+	parent_ds_queue->split_queue_len += ds_queue->split_queue_len;
+	ds_queue->split_queue_len = 0;
+
+	for_each_node(nid)
+		set_shrinker_bit(parent, nid, shrinker_id(deferred_split_shrinker));
+
+unlock:
+	spin_unlock(&parent_ds_queue->split_queue_lock);
+	spin_unlock_irq(&ds_queue->split_queue_lock);
+}
+#endif
+
 #ifdef CONFIG_DEBUG_FS
 static void split_huge_pages_all(void)
 {
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e090f29eb03bd..d03da72e7585d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -3887,6 +3887,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
 	zswap_memcg_offline_cleanup(memcg);
 
 	memcg_offline_kmem(memcg);
+	reparent_deferred_split_queue(memcg);
 	reparent_shrinker_deferred(memcg);
 	wb_memcg_offline(memcg);
 	lru_gen_offline_memcg(memcg);
-- 
2.20.1



* Re: [PATCH v3 4/4] mm: thp: reparent the split queue during memcg offline
From: Muchun Song @ 2025-09-29  6:20 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, david,
	lorenzo.stoakes, ziy, harry.yoo, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, akpm,
	linux-mm, linux-kernel, cgroups, Qi Zheng



> On Sep 28, 2025, at 19:45, Qi Zheng <qi.zheng@linux.dev> wrote:
> 
> From: Qi Zheng <zhengqi.arch@bytedance.com>
> 
> Similar to list_lru, the split queue is relatively independent and does
> not need to be reparented along with objcg and LRU folios (holding
> objcg lock and lru lock). So let's apply the same mechanism as list_lru
> to reparent the split queue separately when memcg is offline.
> 
> This is also a preparation for reparenting LRU folios.
> 
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> ---
> include/linux/huge_mm.h |  4 ++++
> mm/huge_memory.c        | 46 +++++++++++++++++++++++++++++++++++++++++
> mm/memcontrol.c         |  1 +
> 3 files changed, 51 insertions(+)
> 
> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
> index f327d62fc9852..0c211dcbb0ec1 100644
> --- a/include/linux/huge_mm.h
> +++ b/include/linux/huge_mm.h
> @@ -417,6 +417,9 @@ static inline int split_huge_page(struct page *page)
> 	return split_huge_page_to_list_to_order(page, NULL, ret);
> }
> void deferred_split_folio(struct folio *folio, bool partially_mapped);
> +#ifdef CONFIG_MEMCG
> +void reparent_deferred_split_queue(struct mem_cgroup *memcg);
> +#endif
> 
> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> 		unsigned long address, bool freeze);
> @@ -611,6 +614,7 @@ static inline int try_folio_split(struct folio *folio, struct page *page,
> }
> 
> static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
> +static inline void reparent_deferred_split_queue(struct mem_cgroup *memcg) {}
> #define split_huge_pmd(__vma, __pmd, __address) \
> 	do { } while (0)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index bb32091e3133e..5fc0caca71de0 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1094,9 +1094,22 @@ static struct deferred_split *folio_split_queue_lock(struct folio *folio)
> 	struct deferred_split *queue;
> 
> 	memcg = folio_memcg(folio);
> +retry:
> 	queue = memcg ? &memcg->deferred_split_queue :
> 			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
> 	spin_lock(&queue->split_queue_lock);
> + 	/*
> +	 * Notice:
> +	 * 1. The memcg could be NULL if cgroup_disable=memory is set.
> +	 * 2. There is a period between setting CSS_DYING and reparenting
> +	 *    deferred split queue, and during this period the THPs in the
> +	 *    deferred split queue will be hidden from the shrinker side.
> +	 */
> + 	if (unlikely(memcg && css_is_dying(&memcg->css))) {
> + 		spin_unlock(&queue->split_queue_lock);
> + 		memcg = parent_mem_cgroup(memcg);
> + 		goto retry;
> + 	}
> 
> 	return queue;
> }
> @@ -1108,9 +1121,15 @@ folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
> 	struct deferred_split *queue;
> 
> 	memcg = folio_memcg(folio);
> +retry:
> 	queue = memcg ? &memcg->deferred_split_queue :
> 			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
> 	spin_lock_irqsave(&queue->split_queue_lock, *flags);
> + 	if (unlikely(memcg && css_is_dying(&memcg->css))) {
> + 		spin_unlock_irqrestore(&queue->split_queue_lock, *flags);
> + 		memcg = parent_mem_cgroup(memcg);
> + 		goto retry;
> + 	}
> 
> 	return queue;
> }
> @@ -4275,6 +4294,33 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
> 	return split;
> }
> 
> +#ifdef CONFIG_MEMCG
> +void reparent_deferred_split_queue(struct mem_cgroup *memcg)
> +{
> + 	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
> + 	struct deferred_split *ds_queue = &memcg->deferred_split_queue;
> + 	struct deferred_split *parent_ds_queue = &parent->deferred_split_queue;
> + 	int nid;
> +
> + 	spin_lock_irq(&ds_queue->split_queue_lock);
> + 	spin_lock_nested(&parent_ds_queue->split_queue_lock, SINGLE_DEPTH_NESTING);
> +
> + 	if (!ds_queue->split_queue_len)
> + 		goto unlock;
> +
> + 	list_splice_tail_init(&ds_queue->split_queue, &parent_ds_queue->split_queue);
> + 	parent_ds_queue->split_queue_len += ds_queue->split_queue_len;
> + 	ds_queue->split_queue_len = 0;
> +
> + 	for_each_node(nid)
> + 		set_shrinker_bit(parent, nid, shrinker_id(deferred_split_shrinker));
> +
> +unlock:
> + 	spin_unlock(&parent_ds_queue->split_queue_lock);
> + 	spin_unlock_irq(&ds_queue->split_queue_lock);
> +}
> +#endif
> +
> #ifdef CONFIG_DEBUG_FS
> static void split_huge_pages_all(void)
> {
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index e090f29eb03bd..d03da72e7585d 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -3887,6 +3887,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
> 	zswap_memcg_offline_cleanup(memcg);
> 
> 	memcg_offline_kmem(memcg);
> + 	reparent_deferred_split_queue(memcg);

Since the dying flag of a memcg is not set under split_queue_lock,
two threads holding different split_queue_locks (e.g., one for the
parent memcg and one for the child) can concurrently manipulate the
same split-queue list of a folio. I think we should take the same
solution as list_lru does to fix this.

Muchun,
Thanks.


> 	reparent_shrinker_deferred(memcg);
> 	wb_memcg_offline(memcg);
> 	lru_gen_offline_memcg(memcg);
> -- 
> 2.20.1
> 



* Re: [PATCH v3 4/4] mm: thp: reparent the split queue during memcg offline
From: Qi Zheng @ 2025-09-29  7:22 UTC (permalink / raw)
  To: Muchun Song
  Cc: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, david,
	lorenzo.stoakes, ziy, harry.yoo, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, akpm,
	linux-mm, linux-kernel, cgroups, Qi Zheng



On 9/29/25 2:20 PM, Muchun Song wrote:
> 
> 
>> On Sep 28, 2025, at 19:45, Qi Zheng <qi.zheng@linux.dev> wrote:
>>
>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>
>> Similar to list_lru, the split queue is relatively independent and does
>> not need to be reparented along with objcg and LRU folios (holding
>> objcg lock and lru lock). So let's apply the same mechanism as list_lru
>> to reparent the split queue separately when memcg is offline.
>>
>> This is also a preparation for reparenting LRU folios.
>>
>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>> ---
>> include/linux/huge_mm.h |  4 ++++
>> mm/huge_memory.c        | 46 +++++++++++++++++++++++++++++++++++++++++
>> mm/memcontrol.c         |  1 +
>> 3 files changed, 51 insertions(+)
>>
>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>> index f327d62fc9852..0c211dcbb0ec1 100644
>> --- a/include/linux/huge_mm.h
>> +++ b/include/linux/huge_mm.h
>> @@ -417,6 +417,9 @@ static inline int split_huge_page(struct page *page)
>> 	return split_huge_page_to_list_to_order(page, NULL, ret);
>> }
>> void deferred_split_folio(struct folio *folio, bool partially_mapped);
>> +#ifdef CONFIG_MEMCG
>> +void reparent_deferred_split_queue(struct mem_cgroup *memcg);
>> +#endif
>>
>> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>> 		unsigned long address, bool freeze);
>> @@ -611,6 +614,7 @@ static inline int try_folio_split(struct folio *folio, struct page *page,
>> }
>>
>> static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
>> +static inline void reparent_deferred_split_queue(struct mem_cgroup *memcg) {}
>> #define split_huge_pmd(__vma, __pmd, __address) \
>> 	do { } while (0)
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index bb32091e3133e..5fc0caca71de0 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1094,9 +1094,22 @@ static struct deferred_split *folio_split_queue_lock(struct folio *folio)
>> 	struct deferred_split *queue;
>>
>> 	memcg = folio_memcg(folio);
>> +retry:
>> 	queue = memcg ? &memcg->deferred_split_queue :
>> 			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
>> 	spin_lock(&queue->split_queue_lock);
>> + 	/*
>> +	 * Notice:
>> +	 * 1. The memcg could be NULL if cgroup_disable=memory is set.
>> +	 * 2. There is a period between setting CSS_DYING and reparenting
>> +	 *    deferred split queue, and during this period the THPs in the
>> +	 *    deferred split queue will be hidden from the shrinker side.

The shrinker side can find this deferred split queue by traversing
memcgs, so we should check CSS_DYING after we acquire child
split_queue_lock in :

deferred_split_scan
--> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
     if (css_is_dying(&memcg->css))
     --> retry to get parent split_queue_lock

So during this period, we use parent split_queue_lock to protect
child deferred split queue. It's a little weird, but it's safe.

>> +	 */
>> + 	if (unlikely(memcg && css_is_dying(&memcg->css))) {
>> + 		spin_unlock(&queue->split_queue_lock);
>> + 		memcg = parent_mem_cgroup(memcg);
>> + 		goto retry;
>> + 	}
>>
>> 	return queue;
>> }
>> @@ -1108,9 +1121,15 @@ folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
>> 	struct deferred_split *queue;
>>
>> 	memcg = folio_memcg(folio);
>> +retry:
>> 	queue = memcg ? &memcg->deferred_split_queue :
>> 			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
>> 	spin_lock_irqsave(&queue->split_queue_lock, *flags);
>> + 	if (unlikely(memcg && css_is_dying(&memcg->css))) {
>> + 		spin_unlock_irqrestore(&queue->split_queue_lock, *flags);
>> + 		memcg = parent_mem_cgroup(memcg);
>> + 		goto retry;
>> + 	}
>>
>> 	return queue;
>> }
>> @@ -4275,6 +4294,33 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>> 	return split;
>> }
>>
>> +#ifdef CONFIG_MEMCG
>> +void reparent_deferred_split_queue(struct mem_cgroup *memcg)
>> +{
>> + 	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
>> + 	struct deferred_split *ds_queue = &memcg->deferred_split_queue;
>> + 	struct deferred_split *parent_ds_queue = &parent->deferred_split_queue;
>> + 	int nid;
>> +
>> + 	spin_lock_irq(&ds_queue->split_queue_lock);
>> + 	spin_lock_nested(&parent_ds_queue->split_queue_lock, SINGLE_DEPTH_NESTING);
>> +
>> + 	if (!ds_queue->split_queue_len)
>> + 		goto unlock;
>> +
>> + 	list_splice_tail_init(&ds_queue->split_queue, &parent_ds_queue->split_queue);
>> + 	parent_ds_queue->split_queue_len += ds_queue->split_queue_len;
>> + 	ds_queue->split_queue_len = 0;
>> +
>> + 	for_each_node(nid)
>> + 		set_shrinker_bit(parent, nid, shrinker_id(deferred_split_shrinker));
>> +
>> +unlock:
>> + 	spin_unlock(&parent_ds_queue->split_queue_lock);
>> + 	spin_unlock_irq(&ds_queue->split_queue_lock);
>> +}
>> +#endif
>> +
>> #ifdef CONFIG_DEBUG_FS
>> static void split_huge_pages_all(void)
>> {
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index e090f29eb03bd..d03da72e7585d 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -3887,6 +3887,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>> 	zswap_memcg_offline_cleanup(memcg);
>>
>> 	memcg_offline_kmem(memcg);
>> + 	reparent_deferred_split_queue(memcg);
> 
> Since the dying flag of a memcg is not set under split_queue_lock,
> two threads holding different split_queue_locks (e.g., one for the
> parent memcg and one for the child) can concurrently manipulate the
> same split-queue list of a folio. I think we should take the same

If we ensure that we will check CSS_DYING every time we take the
split_queue_lock, then the lock protecting deferred split queue
must be the same lock.

To be more clear, consider the following case:

CPU0              CPU1              CPU2

                   folio_split_queue_lock
                   --> get child queue and lock

set CSS_DYING

                                     deferred_split_scan
                   unlock child queue lock
                                     --> acquire child queue lock
                                         ***WE SHOULD CHECK CSS_DYING HERE***


reparent split queue

The deferred_split_scan() is problematic now, I will fix it as follows:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5fc0caca71de0..9f1f61e7e0c8e 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -4208,6 +4208,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
         struct folio *folio, *next;
         int split = 0, i;
         struct folio_batch fbatch;
+       struct mem_cgroup *memcg;

  #ifdef CONFIG_MEMCG
         if (sc->memcg)
@@ -4217,6 +4218,11 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
         folio_batch_init(&fbatch);
  retry:
         spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
+       if (sc->memcg && css_is_dying(&sc->memcg->css)) {
+               spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
+               memcg = parent_mem_cgroup(sc->memcg);
+               spin_lock_irqsave(&memcg->deferred_split_queue.split_queue_lock, flags);
+       }
         /* Take pin on all head pages to avoid freeing them under us */
         list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
                                                         _deferred_list) {

Of course I'll add helper functions and do some cleanup.

Thanks,
Qi


> solution as list_lru does to fix this.
> 
> Muchun,
> Thanks.
> 
> 
>> 	reparent_shrinker_deferred(memcg);
>> 	wb_memcg_offline(memcg);
>> 	lru_gen_offline_memcg(memcg);
>> -- 
>> 2.20.1
>>
> 



* Re: [PATCH v3 4/4] mm: thp: reparent the split queue during memcg offline
From: Muchun Song @ 2025-09-29  7:38 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, david,
	lorenzo.stoakes, ziy, harry.yoo, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, akpm,
	linux-mm, linux-kernel, cgroups, Qi Zheng



> On Sep 29, 2025, at 15:22, Qi Zheng <qi.zheng@linux.dev> wrote:
> 
> 
> 
> On 9/29/25 2:20 PM, Muchun Song wrote:
>>> On Sep 28, 2025, at 19:45, Qi Zheng <qi.zheng@linux.dev> wrote:
>>> 
>>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>> 
>>> Similar to list_lru, the split queue is relatively independent and does
>>> not need to be reparented along with objcg and LRU folios (holding
>>> objcg lock and lru lock). So let's apply the same mechanism as list_lru
>>> to reparent the split queue separately when memcg is offline.
>>> 
>>> This is also a preparation for reparenting LRU folios.
>>> 
>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>> ---
>>> include/linux/huge_mm.h |  4 ++++
>>> mm/huge_memory.c        | 46 +++++++++++++++++++++++++++++++++++++++++
>>> mm/memcontrol.c         |  1 +
>>> 3 files changed, 51 insertions(+)
>>> 
>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>> index f327d62fc9852..0c211dcbb0ec1 100644
>>> --- a/include/linux/huge_mm.h
>>> +++ b/include/linux/huge_mm.h
>>> @@ -417,6 +417,9 @@ static inline int split_huge_page(struct page *page)
>>> 	return split_huge_page_to_list_to_order(page, NULL, ret);
>>> }
>>> void deferred_split_folio(struct folio *folio, bool partially_mapped);
>>> +#ifdef CONFIG_MEMCG
>>> +void reparent_deferred_split_queue(struct mem_cgroup *memcg);
>>> +#endif
>>> 
>>> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>> 		unsigned long address, bool freeze);
>>> @@ -611,6 +614,7 @@ static inline int try_folio_split(struct folio *folio, struct page *page,
>>> }
>>> 
>>> static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
>>> +static inline void reparent_deferred_split_queue(struct mem_cgroup *memcg) {}
>>> #define split_huge_pmd(__vma, __pmd, __address) \
>>> 	do { } while (0)
>>> 
>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>> index bb32091e3133e..5fc0caca71de0 100644
>>> --- a/mm/huge_memory.c
>>> +++ b/mm/huge_memory.c
>>> @@ -1094,9 +1094,22 @@ static struct deferred_split *folio_split_queue_lock(struct folio *folio)
>>> struct deferred_split *queue;
>>> 
>>> 	memcg = folio_memcg(folio);
>>> +retry:
>>> 	queue = memcg ? &memcg->deferred_split_queue :
>>> 			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
>>> 	spin_lock(&queue->split_queue_lock);
>>> +  /*
>>> +  * Notice:
>>> +  * 1. The memcg could be NULL if cgroup_disable=memory is set.
>>> +  * 2. There is a period between setting CSS_DYING and reparenting
>>> +  *    deferred split queue, and during this period the THPs in the
>>> +  *    deferred split queue will be hidden from the shrinker side.
> 
> The shrinker side can find this deferred split queue by traversing
> memcgs, so we should check CSS_DYING after we acquire child
> split_queue_lock in :
> 
> deferred_split_scan
> --> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>    if (css_is_dying(&memcg->css))
>    --> retry to get parent split_queue_lock
> 
> So during this period, we use parent split_queue_lock to protect
> child deferred split queue. It's a little weird, but it's safe.
> 
>>> + 	 */
>>> +  	if (unlikely(memcg && css_is_dying(&memcg->css))) {
>>> +  		spin_unlock(&queue->split_queue_lock);
>>> +  		memcg = parent_mem_cgroup(memcg);
>>> +  		goto retry;
>>> +  	}
>>> 
>>> return queue;
>>> }
>>> @@ -1108,9 +1121,15 @@ folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
>>> struct deferred_split *queue;
>>> 
>>> 	memcg = folio_memcg(folio);
>>> +retry:
>>> 	queue = memcg ? &memcg->deferred_split_queue :
>>> 			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
>>> 	spin_lock_irqsave(&queue->split_queue_lock, *flags);
>>> +  	if (unlikely(memcg && css_is_dying(&memcg->css))) {
>>> +  		spin_unlock_irqrestore(&queue->split_queue_lock, *flags);
>>> +  		memcg = parent_mem_cgroup(memcg);
>>> +  		goto retry;
>>> +  	}
>>> 
>>> return queue;
>>> }
>>> @@ -4275,6 +4294,33 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>> return split;
>>> }
>>> 
>>> +#ifdef CONFIG_MEMCG
>>> +void reparent_deferred_split_queue(struct mem_cgroup *memcg)
>>> +{
>>> +  	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
>>> +  	struct deferred_split *ds_queue = &memcg->deferred_split_queue;
>>> +  	struct deferred_split *parent_ds_queue = &parent->deferred_split_queue;
>>> +  	int nid;
>>> +
>>> + 	spin_lock_irq(&ds_queue->split_queue_lock);
>>> +  	spin_lock_nested(&parent_ds_queue->split_queue_lock, SINGLE_DEPTH_NESTING);
>>> +
>>> +  	if (!ds_queue->split_queue_len)
>>> +  		goto unlock;
>>> +
>>> +  	list_splice_tail_init(&ds_queue->split_queue, &parent_ds_queue->split_queue);
>>> +  	parent_ds_queue->split_queue_len += ds_queue->split_queue_len;
>>> +  	ds_queue->split_queue_len = 0;
>>> +
>>> +  	for_each_node(nid)
>>> +  		set_shrinker_bit(parent, nid, shrinker_id(deferred_split_shrinker));
>>> +
>>> +unlock:
>>> +  	spin_unlock(&parent_ds_queue->split_queue_lock);
>>> +  	spin_unlock_irq(&ds_queue->split_queue_lock);
>>> +}
>>> +#endif
>>> +
>>> #ifdef CONFIG_DEBUG_FS
>>> static void split_huge_pages_all(void)
>>> {
>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>> index e090f29eb03bd..d03da72e7585d 100644
>>> --- a/mm/memcontrol.c
>>> +++ b/mm/memcontrol.c
>>> @@ -3887,6 +3887,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>>> zswap_memcg_offline_cleanup(memcg);
>>> 
>>> 	memcg_offline_kmem(memcg);
>>> +  	reparent_deferred_split_queue(memcg);
>> Since the dying flag of a memcg is not set under split_queue_lock,
>> two threads holding different split_queue_locks (e.g., one for the
>> parent memcg and one for the child) can concurrently manipulate the
>> same split-queue list of a folio. I think we should take the same
> 
> If we ensure that we will check CSS_DYING every time we take the
> split_queue_lock, then the lock protecting deferred split queue
> must be the same lock.
> 
> To be more clear, consider the following case:
> 
> CPU0              CPU1              CPU2
> 
>                  folio_split_queue_lock
>                  --> get child queue and lock
> 
> set CSS_DYING
> 
>                                    deferred_split_scan
>                  unlock child queue lock
>                                    --> acquire child queue lock
>                                        ***WE SHOULD CHECK CSS_DYING HERE***
> 
> 
> reparent split queue
> 
> The deferred_split_scan() is problematic now, I will fix it as follows:
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5fc0caca71de0..9f1f61e7e0c8e 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -4208,6 +4208,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>        struct folio *folio, *next;
>        int split = 0, i;
>        struct folio_batch fbatch;
> +      struct mem_cgroup *memcg;
> 
> #ifdef CONFIG_MEMCG
>        if (sc->memcg)
> @@ -4217,6 +4218,11 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>        folio_batch_init(&fbatch);
> retry:
>        spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> +      if (sc->memcg && css_is_dying(&sc->memcg->css)) {

There is more than one place where we check whether a memcg is dying,
so it is better to introduce a helper like mem_cgroup_is_dying in
memcontrol.h to do this.
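
Something like this, just as a sketch of the idea (name and exact
placement to be decided):

	/* in memcontrol.h */
	static inline bool mem_cgroup_is_dying(struct mem_cgroup *memcg)
	{
		return memcg && css_is_dying(&memcg->css);
	}

so the css_is_dying() checks in folio_split_queue_lock*() and here are
not open-coded in every caller.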

> +               spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);

Yes, we could fix it this way. But I suggest we introduce another
helper, similar to folio_split_queue_lock, to do the same retry logic. Every user
of split_queue_lock is supposed to use this new helper or folio_split_queue_lock
to take the lock.
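
For example (hypothetical name, sketch only), a variant that takes the
memcg and nid directly instead of deriving them from a folio:

	static struct deferred_split *
	memcg_split_queue_lock_irqsave(int nid, struct mem_cgroup *memcg,
				       unsigned long *flags)
	{
		struct deferred_split *queue;

	retry:
		queue = memcg ? &memcg->deferred_split_queue :
				&NODE_DATA(nid)->deferred_split_queue;
		spin_lock_irqsave(&queue->split_queue_lock, *flags);
		if (unlikely(memcg && css_is_dying(&memcg->css))) {
			spin_unlock_irqrestore(&queue->split_queue_lock, *flags);
			memcg = parent_mem_cgroup(memcg);
			goto retry;
		}

		return queue;
	}

Then folio_split_queue_lock_irqsave() and deferred_split_scan() could both
be built on top of it and share the retry-to-parent logic.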

> +               memcg = parent_mem_cgroup(sc->memcg);
> + 		spin_lock_irqsave(&memcg->deferred_split_queue.split_queue_lock, flags);
> +       }
>        /* Take pin on all head pages to avoid freeing them under us */
>        list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
>                                                        _deferred_list) {
> 
> Of course I'll add helper functions and do some cleanup.

Yes.

> 
> Thanks,
> Qi
> 
> 
>> solution as list_lru does to fix this.
>> Muchun,
>> Thanks.
>>> reparent_shrinker_deferred(memcg);
>>> wb_memcg_offline(memcg);
>>> lru_gen_offline_memcg(memcg);
>>> -- 
>>> 2.20.1




* Re: [PATCH v3 4/4] mm: thp: reparent the split queue during memcg offline
From: Qi Zheng @ 2025-09-29  7:54 UTC (permalink / raw)
  To: Muchun Song
  Cc: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, david,
	lorenzo.stoakes, ziy, harry.yoo, baolin.wang, Liam.Howlett,
	npache, ryan.roberts, dev.jain, baohua, lance.yang, akpm,
	linux-mm, linux-kernel, cgroups



On 9/29/25 3:38 PM, Muchun Song wrote:
> 
> 
>> On Sep 29, 2025, at 15:22, Qi Zheng <qi.zheng@linux.dev> wrote:
>>
>>
>>
>> On 9/29/25 2:20 PM, Muchun Song wrote:
>>>> On Sep 28, 2025, at 19:45, Qi Zheng <qi.zheng@linux.dev> wrote:
>>>>
>>>> From: Qi Zheng <zhengqi.arch@bytedance.com>
>>>>
>>>> Similar to list_lru, the split queue is relatively independent and does
>>>> not need to be reparented along with objcg and LRU folios (holding
>>>> objcg lock and lru lock). So let's apply the same mechanism as list_lru
>>>> to reparent the split queue separately when memcg is offline.
>>>>
>>>> This is also a preparation for reparenting LRU folios.
>>>>
>>>> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
>>>> ---
>>>> include/linux/huge_mm.h |  4 ++++
>>>> mm/huge_memory.c        | 46 +++++++++++++++++++++++++++++++++++++++++
>>>> mm/memcontrol.c         |  1 +
>>>> 3 files changed, 51 insertions(+)
>>>>
>>>> diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
>>>> index f327d62fc9852..0c211dcbb0ec1 100644
>>>> --- a/include/linux/huge_mm.h
>>>> +++ b/include/linux/huge_mm.h
>>>> @@ -417,6 +417,9 @@ static inline int split_huge_page(struct page *page)
>>>> 	return split_huge_page_to_list_to_order(page, NULL, ret);
>>>> }
>>>> void deferred_split_folio(struct folio *folio, bool partially_mapped);
>>>> +#ifdef CONFIG_MEMCG
>>>> +void reparent_deferred_split_queue(struct mem_cgroup *memcg);
>>>> +#endif
>>>>
>>>> void __split_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>>>> 		unsigned long address, bool freeze);
>>>> @@ -611,6 +614,7 @@ static inline int try_folio_split(struct folio *folio, struct page *page,
>>>> }
>>>>
>>>> static inline void deferred_split_folio(struct folio *folio, bool partially_mapped) {}
>>>> +static inline void reparent_deferred_split_queue(struct mem_cgroup *memcg) {}
>>>> #define split_huge_pmd(__vma, __pmd, __address) \
>>>> 	do { } while (0)
>>>>
>>>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>>>> index bb32091e3133e..5fc0caca71de0 100644
>>>> --- a/mm/huge_memory.c
>>>> +++ b/mm/huge_memory.c
>>>> @@ -1094,9 +1094,22 @@ static struct deferred_split *folio_split_queue_lock(struct folio *folio)
>>>> struct deferred_split *queue;
>>>>
>>>> 	memcg = folio_memcg(folio);
>>>> +retry:
>>>> 	queue = memcg ? &memcg->deferred_split_queue :
>>>> 			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
>>>> 	spin_lock(&queue->split_queue_lock);
>>>> +  /*
>>>> +  * Notice:
>>>> +  * 1. The memcg could be NULL if cgroup_disable=memory is set.
>>>> +  * 2. There is a period between setting CSS_DYING and reparenting
>>>> +  *    deferred split queue, and during this period the THPs in the
>>>> +  *    deferred split queue will be hidden from the shrinker side.
>>
>> The shrinker side can find this deferred split queue by traversing
>> memcgs, so we should check CSS_DYING after we acquire child
>> split_queue_lock in :
>>
>> deferred_split_scan
>> --> spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>>     if (css_is_dying(&memcg->css))
>>     --> retry to get parent split_queue_lock
>>
>> So during this period, we use parent split_queue_lock to protect
>> child deferred split queue. It's a little weird, but it's safe.
>>
>>>> + 	 */
>>>> +  	if (unlikely(memcg && css_is_dying(&memcg->css))) {
>>>> +  		spin_unlock(&queue->split_queue_lock);
>>>> +  		memcg = parent_mem_cgroup(memcg);
>>>> +  		goto retry;
>>>> +  	}
>>>>
>>>> return queue;
>>>> }
>>>> @@ -1108,9 +1121,15 @@ folio_split_queue_lock_irqsave(struct folio *folio, unsigned long *flags)
>>>> struct deferred_split *queue;
>>>>
>>>> 	memcg = folio_memcg(folio);
>>>> +retry:
>>>> 	queue = memcg ? &memcg->deferred_split_queue :
>>>> 			&NODE_DATA(folio_nid(folio))->deferred_split_queue;
>>>> 	spin_lock_irqsave(&queue->split_queue_lock, *flags);
>>>> +  	if (unlikely(memcg && css_is_dying(&memcg->css))) {
>>>> +  		spin_unlock_irqrestore(&queue->split_queue_lock, *flags);
>>>> +  		memcg = parent_mem_cgroup(memcg);
>>>> +  		goto retry;
>>>> +  	}
>>>>
>>>> return queue;
>>>> }
>>>> @@ -4275,6 +4294,33 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>>> return split;
>>>> }
>>>>
>>>> +#ifdef CONFIG_MEMCG
>>>> +void reparent_deferred_split_queue(struct mem_cgroup *memcg)
>>>> +{
>>>> +  	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
>>>> +  	struct deferred_split *ds_queue = &memcg->deferred_split_queue;
>>>> +  	struct deferred_split *parent_ds_queue = &parent->deferred_split_queue;
>>>> +  	int nid;
>>>> +
>>>> + 	spin_lock_irq(&ds_queue->split_queue_lock);
>>>> +  	spin_lock_nested(&parent_ds_queue->split_queue_lock, SINGLE_DEPTH_NESTING);
>>>> +
>>>> +  	if (!ds_queue->split_queue_len)
>>>> +  		goto unlock;
>>>> +
>>>> +  	list_splice_tail_init(&ds_queue->split_queue, &parent_ds_queue->split_queue);
>>>> +  	parent_ds_queue->split_queue_len += ds_queue->split_queue_len;
>>>> +  	ds_queue->split_queue_len = 0;
>>>> +
>>>> +  	for_each_node(nid)
>>>> +  		set_shrinker_bit(parent, nid, shrinker_id(deferred_split_shrinker));
>>>> +
>>>> +unlock:
>>>> +  	spin_unlock(&parent_ds_queue->split_queue_lock);
>>>> +  	spin_unlock_irq(&ds_queue->split_queue_lock);
>>>> +}
>>>> +#endif
>>>> +
>>>> #ifdef CONFIG_DEBUG_FS
>>>> static void split_huge_pages_all(void)
>>>> {
>>>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>>>> index e090f29eb03bd..d03da72e7585d 100644
>>>> --- a/mm/memcontrol.c
>>>> +++ b/mm/memcontrol.c
>>>> @@ -3887,6 +3887,7 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css)
>>>> zswap_memcg_offline_cleanup(memcg);
>>>>
>>>> 	memcg_offline_kmem(memcg);
>>>> +  	reparent_deferred_split_queue(memcg);
>>> Since the dying flag of a memcg is not set under split_queue_lock,
>>> two threads holding different split_queue_locks (e.g., one for the
>>> parent memcg and one for the child) can concurrently manipulate the
>>> same split-queue list of a folio. I think we should take the same
>>
>> If we ensure that we will check CSS_DYING every time we take the
>> split_queue_lock, then the lock protecting deferred split queue
>> must be the same lock.
>>
>> To be more clear, consider the following case:
>>
>> CPU0              CPU1              CPU2
>>
>>                   folio_split_queue_lock
>>                   --> get child queue and lock
>>
>> set CSS_DYING
>>
>>                                     deferred_split_scan
>>                   unlock child queue lock
>>                                     --> acquire child queue lock
>>                                         ***WE SHOULD CHECK CSS_DYING HERE***
>>
>>
>> reparent split queue
>>
>> The deferred_split_scan() is problematic now, I will fix it as follows:
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index 5fc0caca71de0..9f1f61e7e0c8e 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -4208,6 +4208,7 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>         struct folio *folio, *next;
>>         int split = 0, i;
>>         struct folio_batch fbatch;
>> +      struct mem_cgroup *memcg;
>>
>> #ifdef CONFIG_MEMCG
>>         if (sc->memcg)
>> @@ -4217,6 +4218,11 @@ static unsigned long deferred_split_scan(struct shrinker *shrink,
>>         folio_batch_init(&fbatch);
>> retry:
>>         spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
>> +      if (sc->memcg && css_is_dying(&sc->memcg->css)) {
> 
> There is more than one place where we check whether a memcg is dying,
> so it is better to introduce a helper like mem_cgroup_is_dying in
> memcontrol.h to do this.

OK. I will try to add a cleanup patch to do this.

> 
>> +               spin_unlock_irqrestore(&ds_queue->split_queue_lock, flags);
> 
> Yes, we could fix it this way. But I suggest we introduce another
> helper, similar to folio_split_queue_lock, to do the same retry logic. Every user
> of split_queue_lock is supposed to use this new helper or folio_split_queue_lock
> to take the lock.

Yes, will do.

> 
>> +               memcg = parent_mem_cgroup(sc->memcg);
>> + 		spin_lock_irqsave(&memcg->deferred_split_queue.split_queue_lock, flags);
>> +       }
>>         /* Take pin on all head pages to avoid freeing them under us */
>>         list_for_each_entry_safe(folio, next, &ds_queue->split_queue,
>>                                                         _deferred_list) {
>>
>> Of course I'll add helper functions and do some cleanup.
> 
> Yes.
> 
>>
>> Thanks,
>> Qi
>>
>>
>>> solution like list_lru does to fix this.
>>> Muchun,
>>> Thanks.
>>>> reparent_shrinker_deferred(memcg);
>>>> wb_memcg_offline(memcg);
>>>> lru_gen_offline_memcg(memcg);
>>>> -- 
>>>> 2.20.1
> 
> 



* Re: [PATCH v3 1/4] mm: thp: replace folio_memcg() with folio_memcg_charged()
From: Harry Yoo @ 2025-10-02  1:23 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, lance.yang, akpm, linux-mm,
	linux-kernel, cgroups, Muchun Song

On Sun, Sep 28, 2025 at 07:16:59PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> folio_memcg_charged() is intended for use when the user is unconcerned
> about the returned memcg pointer. It is more efficient than folio_memcg().
> Therefore, replace folio_memcg() with folio_memcg_charged().
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> Acked-by: David Hildenbrand <david@redhat.com>
> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
> ---

Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

-- 
Cheers,
Harry / Hyeonggon

>  mm/huge_memory.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 1b81680b4225f..6db24b3a57005 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -4014,7 +4014,7 @@ bool __folio_unqueue_deferred_split(struct folio *folio)
>  	bool unqueued = false;
>  
>  	WARN_ON_ONCE(folio_ref_count(folio));
> -	WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg(folio));
> +	WARN_ON_ONCE(!mem_cgroup_disabled() && !folio_memcg_charged(folio));
>  
>  	ds_queue = get_deferred_split_queue(folio);
>  	spin_lock_irqsave(&ds_queue->split_queue_lock, flags);
> -- 
> 2.20.1


* Re: [PATCH v3 2/4] mm: thp: introduce folio_split_queue_lock and its variants
From: Harry Yoo @ 2025-10-02  1:44 UTC (permalink / raw)
  To: Qi Zheng
  Cc: hannes, hughd, mhocko, roman.gushchin, shakeel.butt, muchun.song,
	david, lorenzo.stoakes, ziy, baolin.wang, Liam.Howlett, npache,
	ryan.roberts, dev.jain, baohua, lance.yang, akpm, linux-mm,
	linux-kernel, cgroups, Muchun Song

On Sun, Sep 28, 2025 at 07:17:00PM +0800, Qi Zheng wrote:
> From: Muchun Song <songmuchun@bytedance.com>
> 
> In future memcg removal, the binding between a folio and a memcg may
> change, making the split lock within the memcg unstable when held.
> 
> A new approach is required to reparent the split queue to its parent. This
> patch starts introducing a unified way to acquire the split lock for
> future work.
> 
> It's a code-only refactoring with no functional changes.
> 
> Signed-off-by: Muchun Song <songmuchun@bytedance.com>
> Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com>
> Acked-by: Johannes Weiner <hannes@cmpxchg.org>
> Reviewed-by: Zi Yan <ziy@nvidia.com>
> Acked-by: Shakeel Butt <shakeel.butt@linux.dev>
> Acked-by: David Hildenbrand <david@redhat.com>
> ---

Looks good to me,
Reviewed-by: Harry Yoo <harry.yoo@oracle.com>

-- 
Cheers,
Harry / Hyeonggon

