* [PATCH 0/7] mm: better GUP pin lru_add_drain_all()
From: Hugh Dickins @ 2025-08-31 8:57 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Shivank Garg, Matthew Wilcox,
Christoph Hellwig, Keir Fraser, Jason Gunthorpe, John Hubbard,
Frederick Mayle, Peter Xu, Aneesh Kumar K.V, Johannes Weiner,
Vlastimil Babka, Alexander Krabler, Ge Yang, Li Zhe, Chris Li,
Yu Zhao, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Konstantin Khlebnikov, linux-kernel, linux-mm
Series of lru_add_drain_all()-related patches, arising from a recent
mm/gup migration report from Will Deacon. Based on 6.17-rc3, but they
apply to the latest mm.git. I suggest all but 7/7 be hotfixes going to
6.17 and stable, but you might not be persuaded by the "counter the
increase" ones.
1/7 mm: fix folio_expected_ref_count() when PG_private_2
2/7 mm/gup: check ref_count instead of lru before migration
3/7 mm/gup: local lru_add_drain() to avoid lru_add_drain_all()
4/7 mm: Revert "mm/gup: clear the LRU flag of a page before
5/7 mm: Revert "mm: vmscan.c: fix OOM on swap stress test"
6/7 mm: folio_may_be_cached() unless folio_test_large()
7/7 mm: lru_add_drain_all() do local lru_add_drain() first
include/linux/mm.h | 4 ++--
include/linux/swap.h | 10 ++++++++++
mm/gup.c | 6 +++++-
mm/mlock.c | 6 +++---
mm/swap.c | 53 ++++++++++++++++++++++++++++------------------------
mm/vmscan.c | 2 +-
6 files changed, 50 insertions(+), 31 deletions(-)
Thanks,
Hugh
* [PATCH 1/7] mm: fix folio_expected_ref_count() when PG_private_2
From: Hugh Dickins @ 2025-08-31 9:01 UTC (permalink / raw)
To: Andrew Morton
Cc: David Hildenbrand, Shivank Garg, Matthew Wilcox,
Christoph Hellwig, Keir Fraser, Jason Gunthorpe, John Hubbard,
Frederick Mayle, Peter Xu, Aneesh Kumar K.V, Johannes Weiner,
Vlastimil Babka, Alexander Krabler, Ge Yang, Li Zhe, Chris Li,
Yu Zhao, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Konstantin Khlebnikov, linux-kernel, linux-mm
6.16's folio_expected_ref_count() is forgetting the PG_private_2 flag,
which (like PG_private, but not in addition to PG_private) counts for
1 more reference: it needs to be using folio_has_private() in place of
folio_test_private().
But this went wrong earlier: folio_expected_ref_count() was based on
(and replaced) mm/migrate.c's folio_expected_refs(), which had been
using folio_test_private() since the 6.0 conversion to folios: before
that, expected_page_refs() was correctly using page_has_private().
Just a few filesystems are still using PG_private_2 a.k.a. PG_fscache.
Potentially, this fix re-enables page migration on their folios; but
it would not be surprising to learn that in practice those folios are
not migratable for other reasons.
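For reference, the difference between the two helpers, roughly
paraphrased from include/linux/page-flags.h (exact form varies by
kernel version):

	#define PAGE_FLAGS_PRIVATE \
		(1UL << PG_private | 1UL << PG_private_2)

	/*
	 * folio_test_private() tests PG_private alone, so it misses a
	 * folio holding only PG_private_2; folio_has_private() covers
	 * either flag, and still contributes just the single +1:
	 */
	static inline int folio_has_private(const struct folio *folio)
	{
		return !!(folio->flags & PAGE_FLAGS_PRIVATE);
	}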
Fixes: 86ebd50224c0 ("mm: add folio_expected_ref_count() for reference count calculation")
Fixes: 108ca8358139 ("mm/migrate: Convert expected_page_refs() to folio_expected_refs()")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
include/linux/mm.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1ae97a0b8ec7..ee8e535eadac 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2180,8 +2180,8 @@ static inline int folio_expected_ref_count(const struct folio *folio)
} else {
/* One reference per page from the pagecache. */
ref_count += !!folio->mapping << order;
- /* One reference from PG_private. */
- ref_count += folio_test_private(folio);
+ /* One reference from PG_private or PG_private_2. */
+ ref_count += folio_has_private(folio);
}
/* One reference per page table mapping. */
--
2.51.0
* [PATCH 2/7] mm/gup: check ref_count instead of lru before migration
From: Hugh Dickins @ 2025-08-31 9:05 UTC (permalink / raw)
To: Andrew Morton
Cc: Will Deacon, David Hildenbrand, Shivank Garg, Matthew Wilcox,
Christoph Hellwig, Keir Fraser, Jason Gunthorpe, John Hubbard,
Frederick Mayle, Peter Xu, Aneesh Kumar K.V, Johannes Weiner,
Vlastimil Babka, Alexander Krabler, Ge Yang, Li Zhe, Chris Li,
Yu Zhao, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Konstantin Khlebnikov, linux-kernel, linux-mm
Will Deacon reports:-
When taking a longterm GUP pin via pin_user_pages(),
__gup_longterm_locked() tries to migrate target folios that should not
be longterm pinned, for example because they reside in a CMA region or
movable zone. This is done by first pinning all of the target folios
anyway, collecting all of the longterm-unpinnable target folios into a
list, dropping the pins that were just taken and finally handing the
list off to migrate_pages() for the actual migration.
It is critically important that no unexpected references are held on the
folios being migrated, otherwise the migration will fail and
pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it is
relatively easy to observe migration failures when running pKVM (which
uses pin_user_pages() on crosvm's virtual address space to resolve
stage-2 page faults from the guest) on a 6.15-based Pixel 6 device and
this results in the VM terminating prematurely.
In the failure case, 'crosvm' has called mlock(MLOCK_ONFAULT) on its
mapping of guest memory prior to the pinning. Subsequently, when
pin_user_pages() walks the page-table, the relevant 'pte' is not
present and so the faulting logic allocates a new folio, mlocks it
with mlock_folio() and maps it in the page-table.
Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page()
batch by pagevec"), mlock/munlock operations on a folio (formerly page),
are deferred. For example, mlock_folio() takes an additional reference
on the target folio before placing it into a per-cpu 'folio_batch' for
later processing by mlock_folio_batch(), which drops the refcount once
the operation is complete. Processing of the batches is coupled with
the LRU batch logic and can be forcefully drained with
lru_add_drain_all() but as long as a folio remains unprocessed on the
batch, its refcount will be elevated.
This deferred batching therefore interacts poorly with the pKVM pinning
scenario as we can find ourselves in a situation where the migration
code fails to migrate a folio due to the elevated refcount from the
pending mlock operation.
Hugh Dickins adds:-
!folio_test_lru() has never been a very reliable way to tell if an
lru_add_drain_all() is worth calling, to remove LRU cache references
to make the folio migratable: the LRU flag may be set even while the
folio is held with an extra reference in a per-CPU LRU cache.
5.18 commit 2fbb0c10d1e8 may have made it more unreliable. Then 6.11
commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before adding
to LRU batch") tried to make it reliable, by moving LRU flag clearing;
but missed the mlock/munlock batches, so still unreliable as reported.
And it turns out to be difficult to extend 33dfe9204f29's LRU flag
clearing to the mlock/munlock batches: if they do benefit from batching,
mlock/munlock cannot be so effective when easily suppressed while !LRU.
Instead, switch to an expected ref_count check, which was more reliable
all along: some more false positives (unhelpful drains) than before, and
never a guarantee that the folio will prove migratable, but better.
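Purely as an illustration (the helper name below is hypothetical,
not part of the patch), the new test amounts to:

	/*
	 * After pin_user_pages() has taken its own pin, a folio with no
	 * hidden references satisfies
	 *	folio_ref_count(folio) == folio_expected_ref_count(folio) + 1;
	 * any surplus suggests a per-CPU LRU or mlock/munlock batch still
	 * holds a reference, so lru_add_drain_all() may be worthwhile.
	 */
	static bool folio_may_have_batch_refs(struct folio *folio)
	{
		return folio_ref_count(folio) !=
		       folio_expected_ref_count(folio) + 1;
	}

Unlike !folio_test_lru(), this cannot be fooled by a folio which keeps
its LRU flag while an extra reference sits in a per-CPU batch.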
Note for stable backports: requires 6.16 commit 86ebd50224c0 ("mm:
add folio_expected_ref_count() for reference count calculation") and
6.17 commit ("mm: fix folio_expected_ref_count() when PG_private_2").
Reported-by: Will Deacon <will@kernel.org>
Link: https://lore.kernel.org/linux-mm/20250815101858.24352-1-will@kernel.org/
Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
mm/gup.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/gup.c b/mm/gup.c
index adffe663594d..82aec6443c0a 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2307,7 +2307,8 @@ static unsigned long collect_longterm_unpinnable_folios(
continue;
}
- if (!folio_test_lru(folio) && drain_allow) {
+ if (drain_allow && folio_ref_count(folio) !=
+ folio_expected_ref_count(folio) + 1) {
lru_add_drain_all();
drain_allow = false;
}
--
2.51.0
* [PATCH 3/7] mm/gup: local lru_add_drain() to avoid lru_add_drain_all()
From: Hugh Dickins @ 2025-08-31 9:08 UTC (permalink / raw)
To: Andrew Morton
Cc: Will Deacon, David Hildenbrand, Shivank Garg, Matthew Wilcox,
Christoph Hellwig, Keir Fraser, Jason Gunthorpe, John Hubbard,
Frederick Mayle, Peter Xu, Aneesh Kumar K.V, Johannes Weiner,
Vlastimil Babka, Alexander Krabler, Ge Yang, Li Zhe, Chris Li,
Yu Zhao, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Konstantin Khlebnikov, linux-kernel, linux-mm
In many cases, if collect_longterm_unpinnable_folios() does need to
drain the LRU cache to release a reference, the cache in question is
on this same CPU, and much more efficiently drained by a preliminary
local lru_add_drain(), than the later cross-CPU lru_add_drain_all().
Marked for stable, to counter the increase in lru_add_drain_all()s
from "mm/gup: check ref_count instead of lru before migration".
Note for clean backports: can take 6.16 commit a03db236aebf ("gup:
optimize longterm pin_user_pages() for large folio") first.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
mm/gup.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/mm/gup.c b/mm/gup.c
index 82aec6443c0a..9f7c87f504a9 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2291,6 +2291,8 @@ static unsigned long collect_longterm_unpinnable_folios(
struct folio *folio;
long i = 0;
+ lru_add_drain();
+
for (folio = pofs_get_folio(pofs, i); folio;
folio = pofs_next_folio(folio, pofs, &i)) {
--
2.51.0
* [PATCH 4/7] mm: Revert "mm/gup: clear the LRU flag of a page before adding to LRU batch"
From: Hugh Dickins @ 2025-08-31 9:11 UTC (permalink / raw)
To: Andrew Morton
Cc: Will Deacon, David Hildenbrand, Shivank Garg, Matthew Wilcox,
Christoph Hellwig, Keir Fraser, Jason Gunthorpe, John Hubbard,
Frederick Mayle, Peter Xu, Aneesh Kumar K.V, Johannes Weiner,
Vlastimil Babka, Alexander Krabler, Ge Yang, Li Zhe, Chris Li,
Yu Zhao, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Konstantin Khlebnikov, linux-kernel, linux-mm
This reverts commit 33dfe9204f29b415bbc0abb1a50642d1ba94f5e9:
now that collect_longterm_unpinnable_folios() is checking ref_count
instead of lru, and mlock/munlock do not participate in the revised
LRU flag clearing, those changes are misleading, and enlarge the
window during which mlock/munlock may miss an mlock_count update.
It is possible (I'd hesitate to claim probable) that the greater
likelihood of missed mlock_count updates would explain the "Realtime
threads delayed due to kcompactd0" observed on 6.12 in the Link below.
If that is the case, this reversion will help; but a complete solution
needs also a further patch, beyond the scope of this series.
Included some 80-column cleanup around folio_batch_add_and_move().
The role of folio_test_clear_lru() (before taking per-memcg lru_lock)
is questionable since 6.13 removed mem_cgroup_move_account() etc; but
perhaps there are still some races which need it - not examined here.
Link: https://lore.kernel.org/linux-mm/DU0PR01MB10385345F7153F334100981888259A@DU0PR01MB10385.eurprd01.prod.exchangelabs.com/
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
mm/swap.c | 50 ++++++++++++++++++++++++++------------------------
1 file changed, 26 insertions(+), 24 deletions(-)
diff --git a/mm/swap.c b/mm/swap.c
index 3632dd061beb..6ae2d5680574 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -164,6 +164,10 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
for (i = 0; i < folio_batch_count(fbatch); i++) {
struct folio *folio = fbatch->folios[i];
+ /* block memcg migration while the folio moves between lru */
+ if (move_fn != lru_add && !folio_test_clear_lru(folio))
+ continue;
+
folio_lruvec_relock_irqsave(folio, &lruvec, &flags);
move_fn(lruvec, folio);
@@ -176,14 +180,10 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
}
static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
- struct folio *folio, move_fn_t move_fn,
- bool on_lru, bool disable_irq)
+ struct folio *folio, move_fn_t move_fn, bool disable_irq)
{
unsigned long flags;
- if (on_lru && !folio_test_clear_lru(folio))
- return;
-
folio_get(folio);
if (disable_irq)
@@ -191,8 +191,8 @@ static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
else
local_lock(&cpu_fbatches.lock);
- if (!folio_batch_add(this_cpu_ptr(fbatch), folio) || folio_test_large(folio) ||
- lru_cache_disabled())
+ if (!folio_batch_add(this_cpu_ptr(fbatch), folio) ||
+ folio_test_large(folio) || lru_cache_disabled())
folio_batch_move_lru(this_cpu_ptr(fbatch), move_fn);
if (disable_irq)
@@ -201,13 +201,13 @@ static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
local_unlock(&cpu_fbatches.lock);
}
-#define folio_batch_add_and_move(folio, op, on_lru) \
- __folio_batch_add_and_move( \
- &cpu_fbatches.op, \
- folio, \
- op, \
- on_lru, \
- offsetof(struct cpu_fbatches, op) >= offsetof(struct cpu_fbatches, lock_irq) \
+#define folio_batch_add_and_move(folio, op) \
+ __folio_batch_add_and_move( \
+ &cpu_fbatches.op, \
+ folio, \
+ op, \
+ offsetof(struct cpu_fbatches, op) >= \
+ offsetof(struct cpu_fbatches, lock_irq) \
)
static void lru_move_tail(struct lruvec *lruvec, struct folio *folio)
@@ -231,10 +231,10 @@ static void lru_move_tail(struct lruvec *lruvec, struct folio *folio)
void folio_rotate_reclaimable(struct folio *folio)
{
if (folio_test_locked(folio) || folio_test_dirty(folio) ||
- folio_test_unevictable(folio))
+ folio_test_unevictable(folio) || !folio_test_lru(folio))
return;
- folio_batch_add_and_move(folio, lru_move_tail, true);
+ folio_batch_add_and_move(folio, lru_move_tail);
}
void lru_note_cost_unlock_irq(struct lruvec *lruvec, bool file,
@@ -328,10 +328,11 @@ static void folio_activate_drain(int cpu)
void folio_activate(struct folio *folio)
{
- if (folio_test_active(folio) || folio_test_unevictable(folio))
+ if (folio_test_active(folio) || folio_test_unevictable(folio) ||
+ !folio_test_lru(folio))
return;
- folio_batch_add_and_move(folio, lru_activate, true);
+ folio_batch_add_and_move(folio, lru_activate);
}
#else
@@ -507,7 +508,7 @@ void folio_add_lru(struct folio *folio)
lru_gen_in_fault() && !(current->flags & PF_MEMALLOC))
folio_set_active(folio);
- folio_batch_add_and_move(folio, lru_add, false);
+ folio_batch_add_and_move(folio, lru_add);
}
EXPORT_SYMBOL(folio_add_lru);
@@ -685,13 +686,13 @@ void lru_add_drain_cpu(int cpu)
void deactivate_file_folio(struct folio *folio)
{
/* Deactivating an unevictable folio will not accelerate reclaim */
- if (folio_test_unevictable(folio))
+ if (folio_test_unevictable(folio) || !folio_test_lru(folio))
return;
if (lru_gen_enabled() && lru_gen_clear_refs(folio))
return;
- folio_batch_add_and_move(folio, lru_deactivate_file, true);
+ folio_batch_add_and_move(folio, lru_deactivate_file);
}
/*
@@ -704,13 +705,13 @@ void deactivate_file_folio(struct folio *folio)
*/
void folio_deactivate(struct folio *folio)
{
- if (folio_test_unevictable(folio))
+ if (folio_test_unevictable(folio) || !folio_test_lru(folio))
return;
if (lru_gen_enabled() ? lru_gen_clear_refs(folio) : !folio_test_active(folio))
return;
- folio_batch_add_and_move(folio, lru_deactivate, true);
+ folio_batch_add_and_move(folio, lru_deactivate);
}
/**
@@ -723,10 +724,11 @@ void folio_deactivate(struct folio *folio)
void folio_mark_lazyfree(struct folio *folio)
{
if (!folio_test_anon(folio) || !folio_test_swapbacked(folio) ||
+ !folio_test_lru(folio) ||
folio_test_swapcache(folio) || folio_test_unevictable(folio))
return;
- folio_batch_add_and_move(folio, lru_lazyfree, true);
+ folio_batch_add_and_move(folio, lru_lazyfree);
}
void lru_add_drain(void)
--
2.51.0
* [PATCH 5/7] mm: Revert "mm: vmscan.c: fix OOM on swap stress test"
From: Hugh Dickins @ 2025-08-31 9:13 UTC (permalink / raw)
To: Andrew Morton
Cc: Will Deacon, David Hildenbrand, Shivank Garg, Matthew Wilcox,
Christoph Hellwig, Keir Fraser, Jason Gunthorpe, John Hubbard,
Frederick Mayle, Peter Xu, Aneesh Kumar K.V, Johannes Weiner,
Vlastimil Babka, Alexander Krabler, Ge Yang, Li Zhe, Chris Li,
Yu Zhao, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Konstantin Khlebnikov, linux-kernel, linux-mm
This reverts commit 0885ef4705607936fc36a38fd74356e1c465b023: that
was a fix to the reverted 33dfe9204f29b415bbc0abb1a50642d1ba94f5e9.
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index a48aec8bfd92..674999999cd0 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4507,7 +4507,7 @@ static bool sort_folio(struct lruvec *lruvec, struct folio *folio, struct scan_c
}
/* ineligible */
- if (!folio_test_lru(folio) || zone > sc->reclaim_idx) {
+ if (zone > sc->reclaim_idx) {
gen = folio_inc_gen(lruvec, folio, false);
list_move_tail(&folio->lru, &lrugen->folios[gen][type][zone]);
return true;
--
2.51.0
* [PATCH 6/7] mm: folio_may_be_cached() unless folio_test_large()
From: Hugh Dickins @ 2025-08-31 9:16 UTC (permalink / raw)
To: Andrew Morton
Cc: Will Deacon, David Hildenbrand, Shivank Garg, Matthew Wilcox,
Christoph Hellwig, Keir Fraser, Jason Gunthorpe, John Hubbard,
Frederick Mayle, Peter Xu, Aneesh Kumar K.V, Johannes Weiner,
Vlastimil Babka, Alexander Krabler, Ge Yang, Li Zhe, Chris Li,
Yu Zhao, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Konstantin Khlebnikov, linux-kernel, linux-mm
mm/swap.c and mm/mlock.c agree to drain any per-CPU batch as soon as
a large folio is added: so collect_longterm_unpinnable_folios() just
wastes effort when calling lru_add_drain_all() on a large folio.
But although there is good reason not to batch up PMD-sized folios,
we might well benefit from batching a small number of low-order mTHPs
(though unclear how that "small number" limitation will be implemented).
So ask if folio_may_be_cached() rather than !folio_test_large(), to
insulate those particular checks from future change. Name preferred
to "folio_is_batchable" because large folios can well be put on a batch:
it's just the per-CPU LRU caches, drained much later, which need care.
Marked for stable, to counter the increase in lru_add_drain_all()s
from "mm/gup: check ref_count instead of lru before migration".
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Hugh Dickins <hughd@google.com>
Cc: <stable@vger.kernel.org>
---
include/linux/swap.h | 10 ++++++++++
mm/gup.c | 5 +++--
mm/mlock.c | 6 +++---
mm/swap.c | 2 +-
4 files changed, 17 insertions(+), 6 deletions(-)
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 2fe6ed2cc3fd..b49a61c32238 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -385,6 +385,16 @@ void folio_add_lru_vma(struct folio *, struct vm_area_struct *);
void mark_page_accessed(struct page *);
void folio_mark_accessed(struct folio *);
+static inline bool folio_may_be_cached(struct folio *folio)
+{
+ /*
+ * Holding PMD-sized folios in per-CPU LRU cache unbalances accounting.
+ * Holding small numbers of low-order mTHP folios in per-CPU LRU cache
+ * will be sensible, but nobody has implemented and tested that yet.
+ */
+ return !folio_test_large(folio);
+}
+
extern atomic_t lru_disable_count;
static inline bool lru_cache_disabled(void)
diff --git a/mm/gup.c b/mm/gup.c
index 9f7c87f504a9..e70544c0f958 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -2309,8 +2309,9 @@ static unsigned long collect_longterm_unpinnable_folios(
continue;
}
- if (drain_allow && folio_ref_count(folio) !=
- folio_expected_ref_count(folio) + 1) {
+ if (drain_allow && folio_may_be_cached(folio) &&
+ folio_ref_count(folio) !=
+ folio_expected_ref_count(folio) + 1) {
lru_add_drain_all();
drain_allow = false;
}
diff --git a/mm/mlock.c b/mm/mlock.c
index a1d93ad33c6d..427339dea380 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -255,7 +255,7 @@ void mlock_folio(struct folio *folio)
folio_get(folio);
if (!folio_batch_add(fbatch, mlock_lru(folio)) ||
- folio_test_large(folio) || lru_cache_disabled())
+ !folio_may_be_cached(folio) || lru_cache_disabled())
mlock_folio_batch(fbatch);
local_unlock(&mlock_fbatch.lock);
}
@@ -278,7 +278,7 @@ void mlock_new_folio(struct folio *folio)
folio_get(folio);
if (!folio_batch_add(fbatch, mlock_new(folio)) ||
- folio_test_large(folio) || lru_cache_disabled())
+ !folio_may_be_cached(folio) || lru_cache_disabled())
mlock_folio_batch(fbatch);
local_unlock(&mlock_fbatch.lock);
}
@@ -299,7 +299,7 @@ void munlock_folio(struct folio *folio)
*/
folio_get(folio);
if (!folio_batch_add(fbatch, folio) ||
- folio_test_large(folio) || lru_cache_disabled())
+ !folio_may_be_cached(folio) || lru_cache_disabled())
mlock_folio_batch(fbatch);
local_unlock(&mlock_fbatch.lock);
}
diff --git a/mm/swap.c b/mm/swap.c
index 6ae2d5680574..17438fd1f51a 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -192,7 +192,7 @@ static void __folio_batch_add_and_move(struct folio_batch __percpu *fbatch,
local_lock(&cpu_fbatches.lock);
if (!folio_batch_add(this_cpu_ptr(fbatch), folio) ||
- folio_test_large(folio) || lru_cache_disabled())
+ !folio_may_be_cached(folio) || lru_cache_disabled())
folio_batch_move_lru(this_cpu_ptr(fbatch), move_fn);
if (disable_irq)
--
2.51.0
* [PATCH 7/7] mm: lru_add_drain_all() do local lru_add_drain() first
From: Hugh Dickins @ 2025-08-31 9:18 UTC (permalink / raw)
To: Andrew Morton
Cc: Will Deacon, David Hildenbrand, Shivank Garg, Matthew Wilcox,
Christoph Hellwig, Keir Fraser, Jason Gunthorpe, John Hubbard,
Frederick Mayle, Peter Xu, Aneesh Kumar K.V, Johannes Weiner,
Vlastimil Babka, Alexander Krabler, Ge Yang, Li Zhe, Chris Li,
Yu Zhao, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Konstantin Khlebnikov, linux-kernel, linux-mm
No numbers to back this up, but it seemed obvious to me that if there
are competing lru_add_drain_all()ers, the work will be minimized if each
flushes its own local queues before locking and doing cross-CPU drains.
Signed-off-by: Hugh Dickins <hughd@google.com>
---
mm/swap.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/mm/swap.c b/mm/swap.c
index 17438fd1f51a..4951f4d35e40 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -834,6 +834,9 @@ static inline void __lru_add_drain_all(bool force_all_cpus)
*/
this_gen = smp_load_acquire(&lru_drain_gen);
+ /* It helps everyone if we do our own local drain immediately. */
+ lru_add_drain();
+
mutex_lock(&lock);
/*
--
2.51.0
* Re: [PATCH 1/7] mm: fix folio_expected_ref_count() when PG_private_2
From: Matthew Wilcox @ 2025-08-31 23:37 UTC (permalink / raw)
To: Hugh Dickins
Cc: Andrew Morton, David Hildenbrand, Shivank Garg, Christoph Hellwig,
Keir Fraser, Jason Gunthorpe, John Hubbard, Frederick Mayle,
Peter Xu, Aneesh Kumar K.V, Johannes Weiner, Vlastimil Babka,
Alexander Krabler, Ge Yang, Li Zhe, Chris Li, Yu Zhao,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Konstantin Khlebnikov,
linux-kernel, linux-mm
On Sun, Aug 31, 2025 at 02:01:16AM -0700, Hugh Dickins wrote:
> 6.16's folio_expected_ref_count() is forgetting the PG_private_2 flag,
> which (like PG_private, but not in addition to PG_private) counts for
> 1 more reference: it needs to be using folio_has_private() in place of
> folio_test_private().
No, it doesn't. I know it used to, but no filesystem was actually doing
that. So I changed mm to match how filesystems actually worked.
I'm not sure if there's still documentation lying around that gets
this wrong or if you're remembering how things used to be documented,
but it's never how any filesystem has ever worked.
We're achingly close to getting rid of PG_private_2. I think it's just
ceph and nfs that still use it.
* Re: [PATCH 1/7] mm: fix folio_expected_ref_count() when PG_private_2
From: Hugh Dickins @ 2025-09-01 1:17 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Hugh Dickins, Andrew Morton, Will Deacon, David Hildenbrand,
Shivank Garg, Christoph Hellwig, Keir Fraser, Jason Gunthorpe,
John Hubbard, Frederick Mayle, Peter Xu, Aneesh Kumar K.V,
Johannes Weiner, Vlastimil Babka, Alexander Krabler, Ge Yang,
Li Zhe, Chris Li, Yu Zhao, Axel Rasmussen, Yuanchu Xie, Wei Xu,
Konstantin Khlebnikov, linux-kernel, linux-mm
On Mon, 1 Sep 2025, Matthew Wilcox wrote:
> On Sun, Aug 31, 2025 at 02:01:16AM -0700, Hugh Dickins wrote:
> > 6.16's folio_expected_ref_count() is forgetting the PG_private_2 flag,
> > which (like PG_private, but not in addition to PG_private) counts for
> > 1 more reference: it needs to be using folio_has_private() in place of
> > folio_test_private().
>
> No, it doesn't. I know it used to, but no filesystem was actually doing
> that. So I changed mm to match how filesystems actually worked.
> I'm not sure if there's still documentation lying around that gets
> this wrong or if you're remembering how things used to be documented,
> but it's never how any filesystem has ever worked.
>
> We're achingly close to getting rid of PG_private_2. I think it's just
> ceph and nfs that still use it.
I knew you were trying to get rid of it (hurrah! thank you), so when I
tried porting my lru_add_drainage to 6.12 I was careful to check whether
folio_expected_ref_count() would need to add it to the accounting there:
apparently yes; but then I was surprised to find that it's still present
in 6.17-rc, I'd assumed it gone long ago.
I didn't try to read the filesystems (which could easily have been
inconsistent about it) to understand: what convinced me amidst all
the confusion was this comment and code in mm/filemap.c:
/**
* folio_end_private_2 - Clear PG_private_2 and wake any waiters.
* @folio: The folio.
*
* Clear the PG_private_2 bit on a folio and wake up any sleepers waiting for
* it. The folio reference held for PG_private_2 being set is released.
*
* This is, for example, used when a netfs folio is being written to a local
* disk cache, thereby allowing writes to the cache for the same folio to be
* serialised.
*/
void folio_end_private_2(struct folio *folio)
{
VM_BUG_ON_FOLIO(!folio_test_private_2(folio), folio);
clear_bit_unlock(PG_private_2, folio_flags(folio, 0));
folio_wake_bit(folio, PG_private_2);
folio_put(folio);
}
EXPORT_SYMBOL(folio_end_private_2);
That seems to be clear that PG_private_2 is matched by a folio reference,
but perhaps you can explain it away - worth changing the comment if so.
I was also anxious to work out whether PG_private with PG_private_2
would mean +1 or +2: I don't think I found any decisive statement,
but traditional use of page_has_private() implied +1; and I expect
there's no filesystem which actually could have both on the same folio.
Thanks,
Hugh
* Re: [PATCH 1/7] mm: fix folio_expected_ref_count() when PG_private_2
From: David Hildenbrand @ 2025-09-01 7:52 UTC (permalink / raw)
To: Hugh Dickins, Matthew Wilcox
Cc: Andrew Morton, Will Deacon, Shivank Garg, Christoph Hellwig,
Keir Fraser, Jason Gunthorpe, John Hubbard, Frederick Mayle,
Peter Xu, Aneesh Kumar K.V, Johannes Weiner, Vlastimil Babka,
Alexander Krabler, Ge Yang, Li Zhe, Chris Li, Yu Zhao,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Konstantin Khlebnikov,
linux-kernel, linux-mm
On 01.09.25 03:17, Hugh Dickins wrote:
> On Mon, 1 Sep 2025, Matthew Wilcox wrote:
>> On Sun, Aug 31, 2025 at 02:01:16AM -0700, Hugh Dickins wrote:
>>> 6.16's folio_expected_ref_count() is forgetting the PG_private_2 flag,
>>> which (like PG_private, but not in addition to PG_private) counts for
>>> 1 more reference: it needs to be using folio_has_private() in place of
>>> folio_test_private().
>>
>> No, it doesn't. I know it used to, but no filesystem was actually doing
>> that. So I changed mm to match how filesystems actually worked.
>> I'm not sure if there's still documentation lying around that gets
>> this wrong or if you're remembering how things used to be documented,
>> but it's never how any filesystem has ever worked.
>>
>> We're achingly close to getting rid of PG_private_2. I think it's just
>> ceph and nfs that still use it.
>
> I knew you were trying to get rid of it (hurrah! thank you), so when I
> tried porting my lru_add_drainage to 6.12 I was careful to check whether
> folio_expected_ref_count() would need to add it to the accounting there:
> apparently yes; but then I was surprised to find that it's still present
> in 6.17-rc, I'd assumed it gone long ago.
>
> I didn't try to read the filesystems (which could easily have been
> inconsistent about it) to understand: what convinced me amidst all
> the confusion was this comment and code in mm/filemap.c:
>
> /**
> * folio_end_private_2 - Clear PG_private_2 and wake any waiters.
> * @folio: The folio.
> *
> * Clear the PG_private_2 bit on a folio and wake up any sleepers waiting for
> * it. The folio reference held for PG_private_2 being set is released.
> *
> * This is, for example, used when a netfs folio is being written to a local
> * disk cache, thereby allowing writes to the cache for the same folio to be
> * serialised.
> */
> void folio_end_private_2(struct folio *folio)
> {
> VM_BUG_ON_FOLIO(!folio_test_private_2(folio), folio);
> clear_bit_unlock(PG_private_2, folio_flags(folio, 0));
> folio_wake_bit(folio, PG_private_2);
> folio_put(folio);
> }
> EXPORT_SYMBOL(folio_end_private_2);
>
> That seems to be clear that PG_private_2 is matched by a folio reference,
> but perhaps you can explain it away - worth changing the comment if so.
>
> I was also anxious to work out whether PG_private with PG_private_2
> would mean +1 or +2: I don't think I found any decisive statement,
> but traditional use of page_has_private() implied +1; and I expect
> there's no filesystem which actually could have both on the same folio.
I think it's "+1", like we used to have.
I was seriously confused when discovering (iow, concerned about false
positives):
PG_fscache = PG_private_2,
But in the end PG_fscache is only used in comments and e.g.,
__fscache_clear_page_bits() calls folio_end_private_2(). So both are
really just aliases.
[Either PG_fscache should be dropped and referred to as PG_private_2, or
PG_private_2 should be dropped and PG_fscache used instead. It's even
inconsistently used in that fscache file.
Or both should be dropped, of course, once we can actually get rid of it
...]
So PG_private_2 should not be used for any other purpose.
folio_start_private_2() / folio_end_private_2() indeed pair the flag
with a reference. There are no other callers that would set/clear the
flag without involving a reference.
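(For completeness, the setter side pairs them the same way; roughly
paraphrased here, not quoted verbatim:

	static inline void folio_start_private_2(struct folio *folio)
	{
		/* Reference dropped again only in folio_end_private_2(). */
		folio_get(folio);
		folio_set_private_2(folio);
	}

so any folio with PG_private_2 set carries that one extra reference.)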
The usage of private_2 is declared deprecated all over the place. So the
question is if we really still care.
The ceph usage is guarded by CONFIG_CEPH_FSCACHE, the NFS one by
NFS_FSCACHE, nothing really seems to prevent it from getting configured
in easily.
Now, one problem would be if migration / splitting / ... code where we
use folio_expected_ref_count() cannot deal with that additional
reference properly, in which case this patch would indeed cause harm.
If all folio_expected_ref_count() callers can deal with updating that
reference, all good.
nfs_migrate_folio(), for example, has folio_test_private_2() handling in
there (just wait until it is gone). ceph handles it during
ceph_writepages_start(), but uses ordinary filemap_migrate_folio.
Long story short: this patch is problematic if one of the
folio_expected_ref_count() users is not aware of how to handle that
additional reference.
--
Cheers
David / dhildenb
* Re: [PATCH 2/7] mm/gup: check ref_count instead of lru before migration
From: David Hildenbrand @ 2025-09-01 8:00 UTC (permalink / raw)
To: Hugh Dickins, Andrew Morton
Cc: Will Deacon, Shivank Garg, Matthew Wilcox, Christoph Hellwig,
Keir Fraser, Jason Gunthorpe, John Hubbard, Frederick Mayle,
Peter Xu, Aneesh Kumar K.V, Johannes Weiner, Vlastimil Babka,
Alexander Krabler, Ge Yang, Li Zhe, Chris Li, Yu Zhao,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Konstantin Khlebnikov,
linux-kernel, linux-mm
On 31.08.25 11:05, Hugh Dickins wrote:
> Will Deacon reports:-
>
> When taking a longterm GUP pin via pin_user_pages(),
> __gup_longterm_locked() tries to migrate target folios that should not
> be longterm pinned, for example because they reside in a CMA region or
> movable zone. This is done by first pinning all of the target folios
> anyway, collecting all of the longterm-unpinnable target folios into a
> list, dropping the pins that were just taken and finally handing the
> list off to migrate_pages() for the actual migration.
>
> It is critically important that no unexpected references are held on the
> folios being migrated, otherwise the migration will fail and
> pin_user_pages() will return -ENOMEM to its caller. Unfortunately, it is
> relatively easy to observe migration failures when running pKVM (which
> uses pin_user_pages() on crosvm's virtual address space to resolve
> stage-2 page faults from the guest) on a 6.15-based Pixel 6 device and
> this results in the VM terminating prematurely.
>
> In the failure case, 'crosvm' has called mlock(MLOCK_ONFAULT) on its
> mapping of guest memory prior to the pinning. Subsequently, when
> pin_user_pages() walks the page-table, the relevant 'pte' is not
> present and so the faulting logic allocates a new folio, mlocks it
> with mlock_folio() and maps it in the page-table.
>
> Since commit 2fbb0c10d1e8 ("mm/munlock: mlock_page() munlock_page()
> batch by pagevec"), mlock/munlock operations on a folio (formerly page),
> are deferred. For example, mlock_folio() takes an additional reference
> on the target folio before placing it into a per-cpu 'folio_batch' for
> later processing by mlock_folio_batch(), which drops the refcount once
> the operation is complete. Processing of the batches is coupled with
> the LRU batch logic and can be forcefully drained with
> lru_add_drain_all() but as long as a folio remains unprocessed on the
> batch, its refcount will be elevated.
>
> This deferred batching therefore interacts poorly with the pKVM pinning
> scenario as we can find ourselves in a situation where the migration
> code fails to migrate a folio due to the elevated refcount from the
> pending mlock operation.
>
> Hugh Dickins adds:-
>
> !folio_test_lru() has never been a very reliable way to tell if an
> lru_add_drain_all() is worth calling, to remove LRU cache references
> to make the folio migratable: the LRU flag may be set even while the
> folio is held with an extra reference in a per-CPU LRU cache.
>
> 5.18 commit 2fbb0c10d1e8 may have made it more unreliable. Then 6.11
> commit 33dfe9204f29 ("mm/gup: clear the LRU flag of a page before adding
> to LRU batch") tried to make it reliable, by moving LRU flag clearing;
> but missed the mlock/munlock batches, so still unreliable as reported.
>
> And it turns out to be difficult to extend 33dfe9204f29's LRU flag
> clearing to the mlock/munlock batches: if they do benefit from batching,
> mlock/munlock cannot be so effective when easily suppressed while !LRU.
>
> Instead, switch to an expected ref_count check, which was more reliable
> all along: some more false positives (unhelpful drains) than before, and
> never a guarantee that the folio will prove migratable, but better.
>
> Note for stable backports: requires 6.16 commit 86ebd50224c0 ("mm:
> add folio_expected_ref_count() for reference count calculation") and
> 6.17 commit ("mm: fix folio_expected_ref_count() when PG_private_2").
>
> Reported-by: Will Deacon <will@kernel.org>
> Link: https://lore.kernel.org/linux-mm/20250815101858.24352-1-will@kernel.org/
> Fixes: 9a4e9f3b2d73 ("mm: update get_user_pages_longterm to migrate pages allocated from CMA region")
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>
> ---
> mm/gup.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index adffe663594d..82aec6443c0a 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2307,7 +2307,8 @@ static unsigned long collect_longterm_unpinnable_folios(
> continue;
> }
>
> - if (!folio_test_lru(folio) && drain_allow) {
> + if (drain_allow && folio_ref_count(folio) !=
> + folio_expected_ref_count(folio) + 1) {
> lru_add_drain_all();
> drain_allow = false;
> }
In general, to the fix idea
Acked-by: David Hildenbrand <david@redhat.com>
But as raised in reply to patch #1, we have to be a bit careful about
including private_2 in folio_expected_ref_count() at this point.
If we cannot include it in folio_expected_ref_count(), it's all going to
be a mess until PG_private_2 is removed for good.
So that part still needs to be figured out.
--
Cheers
David / dhildenb
* Re: [PATCH 1/7] mm: fix folio_expected_ref_count() when PG_private_2
From: David Hildenbrand @ 2025-09-01 8:04 UTC (permalink / raw)
To: Hugh Dickins, Matthew Wilcox
Cc: Andrew Morton, Will Deacon, Shivank Garg, Christoph Hellwig,
Keir Fraser, Jason Gunthorpe, John Hubbard, Frederick Mayle,
Peter Xu, Aneesh Kumar K.V, Johannes Weiner, Vlastimil Babka,
Alexander Krabler, Ge Yang, Li Zhe, Chris Li, Yu Zhao,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Konstantin Khlebnikov,
linux-kernel, linux-mm
On 01.09.25 09:52, David Hildenbrand wrote:
> On 01.09.25 03:17, Hugh Dickins wrote:
>> On Mon, 1 Sep 2025, Matthew Wilcox wrote:
>>> On Sun, Aug 31, 2025 at 02:01:16AM -0700, Hugh Dickins wrote:
>>>> 6.16's folio_expected_ref_count() is forgetting the PG_private_2 flag,
>>>> which (like PG_private, but not in addition to PG_private) counts for
>>>> 1 more reference: it needs to be using folio_has_private() in place of
>>>> folio_test_private().
>>>
>>> No, it doesn't. I know it used to, but no filesystem was actually doing
>>> that. So I changed mm to match how filesystems actually worked.
>>> I'm not sure if there's still documentation lying around that gets
>>> this wrong or if you're remembering how things used to be documented,
>>> but it's never how any filesystem has ever worked.
>>>
>>> We're achingly close to getting rid of PG_private_2. I think it's just
>>> ceph and nfs that still use it.
>>
>> I knew you were trying to get rid of it (hurrah! thank you), so when I
>> tried porting my lru_add_drainage to 6.12 I was careful to check whether
>> folio_expected_ref_count() would need to add it to the accounting there:
>> apparently yes; but then I was surprised to find that it's still present
>> in 6.17-rc, I'd assumed it gone long ago.
>>
>> I didn't try to read the filesystems (which could easily have been
>> inconsistent about it) to understand: what convinced me amidst all
>> the confusion was this comment and code in mm/filemap.c:
>>
>> /**
>> * folio_end_private_2 - Clear PG_private_2 and wake any waiters.
>> * @folio: The folio.
>> *
>> * Clear the PG_private_2 bit on a folio and wake up any sleepers waiting for
>> * it. The folio reference held for PG_private_2 being set is released.
>> *
>> * This is, for example, used when a netfs folio is being written to a local
>> * disk cache, thereby allowing writes to the cache for the same folio to be
>> * serialised.
>> */
>> void folio_end_private_2(struct folio *folio)
>> {
>> VM_BUG_ON_FOLIO(!folio_test_private_2(folio), folio);
>> clear_bit_unlock(PG_private_2, folio_flags(folio, 0));
>> folio_wake_bit(folio, PG_private_2);
>> folio_put(folio);
>> }
>> EXPORT_SYMBOL(folio_end_private_2);
>>
>> That seems to be clear that PG_private_2 is matched by a folio reference,
>> but perhaps you can explain it away - worth changing the comment if so.
>>
>> I was also anxious to work out whether PG_private with PG_private_2
>> would mean +1 or +2: I don't think I found any decisive statement,
>> but traditional use of page_has_private() implied +1; and I expect
>> there's no filesystem which actually could have both on the same folio.
>
> I think it's "+1", like we used to have.
>
> I was seriously confused when discovering (iow, concerned about false
> positives):
>
> PG_fscache = PG_private_2,
>
> But in the end PG_fscache is only used in comments and e.g.,
> __fscache_clear_page_bits() calls folio_end_private_2(). So both are
> really just aliases.
>
> [Either PG_fscache should be dropped and referred to as PG_private_2, or
> PG_private_2 should be dropped and PG_fscache used instead. It's even
> inconsistently used in that fscache file.
>
> Or both should be dropped, of course, once we can actually get rid of it
> ...]
>
> So PG_private_2 should not be used for any other purpose.
>
> folio_start_private_2() / folio_end_private_2() indeed pair the flag
> with a reference. There are no other callers that would set/clear the
> flag without involving a reference.
>
> The usage of private_2 is declared deprecated all over the place. So the
> question is if we really still care.
>
> The ceph usage is guarded by CONFIG_CEPH_FSCACHE, the NFS one by
> NFS_FSCACHE, nothing really seems to prevent it from getting configured
> in easily.
>
> Now, one problem would be if migration / splitting / ... code where we
> use folio_expected_ref_count() cannot deal with that additional
> reference properly, in which case this patch would indeed cause harm.
>
> If all folio_expected_ref_count() callers can deal with updating that
> reference, all good.
>
> nfs_migrate_folio(), for example, has folio_test_private_2() handling in
> there (just wait until it is gone). ceph handles it during
> ceph_writepages_start(), but uses ordinary filemap_migrate_folio.
>
> Long story short: this patch is problematic if one of the
> folio_expected_ref_count() users is not aware of how to handle that
> additional reference.
>
Case in point, I just stumbled over
commit 682a71a1b6b363bff71440f4eca6498f827a839d
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date: Fri Sep 2 20:46:46 2022 +0100
migrate: convert __unmap_and_move() to use folios
and
commit 8faa8ef5dd11abe119ad0c8ccd39f2064ca7ed0e
Author: Matthew Wilcox (Oracle) <willy@infradead.org>
Date: Mon Jun 6 09:34:36 2022 -0400
mm/migrate: Convert fallback_migrate_page() to fallback_migrate_folio()
Use a folio throughout. migrate_page() will be converted to
migrate_folio() later.
where we converted from page_has_private() to folio_test_private(). Maybe
that's all sane, but it raises the question of whether migration (and maybe
splitting) as a whole is now incompatible with PG_private_2.
--
Cheers
David / dhildenb
* Re: [PATCH 3/7] mm/gup: local lru_add_drain() to avoid lru_add_drain_all()
From: David Hildenbrand @ 2025-09-01 8:05 UTC (permalink / raw)
To: Hugh Dickins, Andrew Morton
Cc: Will Deacon, Shivank Garg, Matthew Wilcox, Christoph Hellwig,
Keir Fraser, Jason Gunthorpe, John Hubbard, Frederick Mayle,
Peter Xu, Aneesh Kumar K.V, Johannes Weiner, Vlastimil Babka,
Alexander Krabler, Ge Yang, Li Zhe, Chris Li, Yu Zhao,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Konstantin Khlebnikov,
linux-kernel, linux-mm
On 31.08.25 11:08, Hugh Dickins wrote:
> In many cases, if collect_longterm_unpinnable_folios() does need to
> drain the LRU cache to release a reference, the cache in question is
> on this same CPU, and much more efficiently drained by a preliminary
> local lru_add_drain(), than the later cross-CPU lru_add_drain_all().
>
> Marked for stable, to counter the increase in lru_add_drain_all()s
> from "mm/gup: check ref_count instead of lru before migration".
> Note for clean backports: can take 6.16 commit a03db236aebf ("gup:
> optimize longterm pin_user_pages() for large folio") first.
>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>
> ---
> mm/gup.c | 2 ++
> 1 file changed, 2 insertions(+)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 82aec6443c0a..9f7c87f504a9 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -2291,6 +2291,8 @@ static unsigned long collect_longterm_unpinnable_folios(
> struct folio *folio;
> long i = 0;
>
> + lru_add_drain();
> +
> for (folio = pofs_get_folio(pofs, i); folio;
> folio = pofs_next_folio(folio, pofs, &i)) {
>
Do we really want to drain all the time we enter
collect_longterm_unpinnable_folios(), or only if we detect an actual
problem? (unexpected reference?)
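One possible shape of that alternative, as an illustrative sketch only
(not a tested patch):

	/* Drain only when an unexpected reference is actually seen. */
	if (drain_allow && folio_ref_count(folio) !=
			   folio_expected_ref_count(folio) + 1) {
		lru_add_drain();	/* cheap local drain first */
		if (folio_ref_count(folio) !=
		    folio_expected_ref_count(folio) + 1) {
			lru_add_drain_all();
			drain_allow = false;
		}
	}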
--
Cheers
David / dhildenb
* Re: [PATCH 4/7] mm: Revert "mm/gup: clear the LRU flag of a page before adding to LRU batch"
From: David Hildenbrand @ 2025-09-01 8:06 UTC (permalink / raw)
To: Hugh Dickins, Andrew Morton
Cc: Will Deacon, Shivank Garg, Matthew Wilcox, Christoph Hellwig,
Keir Fraser, Jason Gunthorpe, John Hubbard, Frederick Mayle,
Peter Xu, Aneesh Kumar K.V, Johannes Weiner, Vlastimil Babka,
Alexander Krabler, Ge Yang, Li Zhe, Chris Li, Yu Zhao,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Konstantin Khlebnikov,
linux-kernel, linux-mm
On 31.08.25 11:11, Hugh Dickins wrote:
> This reverts commit 33dfe9204f29b415bbc0abb1a50642d1ba94f5e9:
> now that collect_longterm_unpinnable_folios() is checking ref_count
> instead of lru, and mlock/munlock do not participate in the revised
> LRU flag clearing, those changes are misleading, and enlarge the
> window during which mlock/munlock may miss an mlock_count update.
>
> It is possible (I'd hesitate to claim probable) that the greater
> likelihood of missed mlock_count updates would explain the "Realtime
> threads delayed due to kcompactd0" observed on 6.12 in the Link below.
> If that is the case, this reversion will help; but a complete solution
> needs also a further patch, beyond the scope of this series.
>
> Included some 80-column cleanup around folio_batch_add_and_move().
>
> The role of folio_test_clear_lru() (before taking per-memcg lru_lock)
> is questionable since 6.13 removed mem_cgroup_move_account() etc; but
> perhaps there are still some races which need it - not examined here.
>
> Link: https://lore.kernel.org/linux-mm/DU0PR01MB10385345F7153F334100981888259A@DU0PR01MB10385.eurprd01.prod.exchangelabs.com/
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>
> ---
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
* Re: [PATCH 5/7] mm: Revert "mm: vmscan.c: fix OOM on swap stress test"
From: David Hildenbrand @ 2025-09-01 8:07 UTC (permalink / raw)
To: Hugh Dickins, Andrew Morton
Cc: Will Deacon, Shivank Garg, Matthew Wilcox, Christoph Hellwig,
Keir Fraser, Jason Gunthorpe, John Hubbard, Frederick Mayle,
Peter Xu, Aneesh Kumar K.V, Johannes Weiner, Vlastimil Babka,
Alexander Krabler, Ge Yang, Li Zhe, Chris Li, Yu Zhao,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Konstantin Khlebnikov,
linux-kernel, linux-mm
On 31.08.25 11:13, Hugh Dickins wrote:
> This reverts commit 0885ef4705607936fc36a38fd74356e1c465b023: that
> was a fix to the reverted 33dfe9204f29b415bbc0abb1a50642d1ba94f5e9.
>
Likely checkpatch will not like the long commit ids.
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>
> ---
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb
* Re: [PATCH 6/7] mm: folio_may_be_cached() unless folio_test_large()
From: David Hildenbrand @ 2025-09-01 8:13 UTC (permalink / raw)
To: Hugh Dickins, Andrew Morton
Cc: Will Deacon, Shivank Garg, Matthew Wilcox, Christoph Hellwig,
Keir Fraser, Jason Gunthorpe, John Hubbard, Frederick Mayle,
Peter Xu, Aneesh Kumar K.V, Johannes Weiner, Vlastimil Babka,
Alexander Krabler, Ge Yang, Li Zhe, Chris Li, Yu Zhao,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Konstantin Khlebnikov,
linux-kernel, linux-mm
On 31.08.25 11:16, Hugh Dickins wrote:
> mm/swap.c and mm/mlock.c agree to drain any per-CPU batch as soon as
> a large folio is added: so collect_longterm_unpinnable_folios() just
> wastes effort when calling lru_add_drain_all() on a large folio.
>
> But although there is good reason not to batch up PMD-sized folios,
> we might well benefit from batching a small number of low-order mTHPs
> (though unclear how that "small number" limitation will be implemented).
>
> So ask if folio_may_be_cached() rather than !folio_test_large(), to
> insulate those particular checks from future change. Name preferred
> to "folio_is_batchable" because large folios can well be put on a batch:
> it's just the per-CPU LRU caches, drained much later, which need care.
>
> Marked for stable, to counter the increase in lru_add_drain_all()s
> from "mm/gup: check ref_count instead of lru before migration".
>
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Hugh Dickins <hughd@google.com>
> Cc: <stable@vger.kernel.org>
> ---
> include/linux/swap.h | 10 ++++++++++
> mm/gup.c | 5 +++--
> mm/mlock.c | 6 +++---
> mm/swap.c | 2 +-
> 4 files changed, 17 insertions(+), 6 deletions(-)
>
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index 2fe6ed2cc3fd..b49a61c32238 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -385,6 +385,16 @@ void folio_add_lru_vma(struct folio *, struct vm_area_struct *);
> void mark_page_accessed(struct page *);
> void folio_mark_accessed(struct folio *);
>
Two smaller things:
(1) We have other "folio_maybe_*" functions, so this one should likely
better start with that as well.
(2) With things like fscache in mind, the function can be a bit
misleading.
So I wonder if (a) we should just add kerneldoc to document it clearly
(lru cache, mlock cache?) and (b) maybe call it
folio_may_be_lru_cached(). Not sure if we can find a better abstraction
for these two caches.
Thinking again, "maybe_cached" might be a bit misleading because it
implements a very very very bad heuristic for small folios.
Maybe it's more like "supports being cached".
folio_lru_caching_supported()
Something like that, maybe? (again, unclear about lru/mlock cache
abstraction)
--
Cheers
David / dhildenb
* Re: [PATCH 7/7] mm: lru_add_drain_all() do local lru_add_drain() first
From: David Hildenbrand @ 2025-09-01 8:14 UTC (permalink / raw)
To: Hugh Dickins, Andrew Morton
Cc: Will Deacon, Shivank Garg, Matthew Wilcox, Christoph Hellwig,
Keir Fraser, Jason Gunthorpe, John Hubbard, Frederick Mayle,
Peter Xu, Aneesh Kumar K.V, Johannes Weiner, Vlastimil Babka,
Alexander Krabler, Ge Yang, Li Zhe, Chris Li, Yu Zhao,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Konstantin Khlebnikov,
linux-kernel, linux-mm
On 31.08.25 11:18, Hugh Dickins wrote:
> No numbers to back this up, but it seemed obvious to me that if there
> are competing lru_add_drain_all()ers, the work will be minimized if each
> flushes its own local queues before locking and doing cross-CPU drains.
>
Sounds reasonable
Acked-by: David Hildenbrand <david@redhat.com>
--
Cheers
David / dhildenb