[PATCH 0/2] mm: batch TLB flushing for dirty folios in vmscan

public inbox for linux-mm@kvack.org
 help / color / mirror / Atom feed

* [PATCH 0/2] mm: batch TLB flushing for dirty folios in vmscan
@ 2026-03-09  8:17 Zhang Peng via B4 Relay
  2026-03-09  8:17 ` [PATCH 1/2] mm/vmscan: refactor shrink_folio_list for readability and maintainability Zhang Peng via B4 Relay
  2026-03-09  8:17 ` [PATCH 2/2] mm, vmscan: flush TLB for every 31 folios evictions Zhang Peng via B4 Relay
  0 siblings, 2 replies; 6+ messages in thread
From: Zhang Peng via B4 Relay @ 2026-03-09  8:17 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Johannes Weiner, Qi Zheng,
	Shakeel Butt, Axel Rasmussen, Yuanchu Xie, Wei Xu, Michal Hocko
  Cc: linux-mm, linux-kernel, Kairui Song, Zhang Peng

This series introduces batch TLB flushing optimization for dirty folios
during memory reclaim, aiming to reduce IPI overhead on multi-core systems.

Background
----------
Currently, when performing pageout in memory reclaim, try_to_unmap_flush_dirty()
is called for each dirty folio individually. On multi-core systems, this causes
frequent IPIs which can significantly impact performance.

Approach
--------
This patch series accumulates dirty folios into batches and performs a single
TLB flush for the entire batch, rather than flushing for each individual folio.

Changes
-------
Patch 1: Refactor shrink_folio_list() to improve code readability and
         maintainability by extracting common logic into helper functions:
         - folio_active_bounce(): Handle folio activation logic
         - folio_free(): Handle folio freeing logic
         - pageout_one(): Handle single folio pageout logic

Patch 2: Implement batch TLB flushing logic. Dirty folios are accumulated
         in batches and a single TLB flush is performed for each batch
         before calling pageout.

Testing
-------
The benchmark script uses stress-ng to compare TLB shootdown behavior before and
after this patch. It constrains a stress-ng workload via memcg to force reclaim
through shrink_folio_list(), reporting TLB shootdowns and IPIs.

Core benchmark command: stress-ng --vm 16 --vm-bytes 2G --vm-keep --timeout 60

==========================================================================
                 batch_dirty_tlb_flush Benchmark Results
==========================================================================
  Kernel: 7.0.0-rc1+   CPUs: 16
  MemTotal: 31834M   SwapTotal: 8191M
  memcg limit: 512M   alloc: 2G   workers: 16   duration: 60s
--------------------------------------------------------------------------
Metric                 Before        After             Delta (abs / %)
--------------------------------------------------------------------------
bogo ops/s             28238.63      35833.97          +7595.34 (+26.9%)
TLB shootdowns         55428953      17621697          -37807256 (-68.2%)
Function call IPIs     34073695      14498768          -19574927 (-57.4%)
pgscan_anon (pages)    52856224      60252894          7396670 (+14.0%)
pgsteal_anon (pages)   29004962      34054753          5049791 (+17.4%)
--------------------------------------------------------------------------

Suggested-by: Kairui Song <kasong@tencent.com>
Signed-off-by: Zhang Peng <bruzzhang@tencent.com>
---
bruzzhang (2):
      mm/vmscan: refactor shrink_folio_list for readability and maintainability
      mm, vmscan: flush TLB for every 31 folios evictions

 include/linux/vmstat.h |   1 +
 mm/vmscan.c            | 387 +++++++++++++++++++++++++++++++------------------
 2 files changed, 245 insertions(+), 143 deletions(-)
---
base-commit: 49cb736d092aaa856283e33b78ec3afb3964d82f
change-id: 20260309-batch-tlb-flush-893f0e56b496

Best regards,
-- 
Zhang Peng <zippermonkey@icloud.com>




^ permalink raw reply	[flat|nested] 6+ messages in thread

* [PATCH 1/2] mm/vmscan: refactor shrink_folio_list for readability and maintainability
  2026-03-09  8:17 [PATCH 0/2] mm: batch TLB flushing for dirty folios in vmscan Zhang Peng via B4 Relay
@ 2026-03-09  8:17 ` Zhang Peng via B4 Relay
  2026-03-09  8:17 ` [PATCH 2/2] mm, vmscan: flush TLB for every 31 folios evictions Zhang Peng via B4 Relay
  1 sibling, 0 replies; 6+ messages in thread
From: Zhang Peng via B4 Relay @ 2026-03-09  8:17 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Johannes Weiner, Qi Zheng,
	Shakeel Butt, Axel Rasmussen, Yuanchu Xie, Wei Xu, Michal Hocko
  Cc: linux-mm, linux-kernel, Kairui Song, Zhang Peng

From: bruzzhang <bruzzhang@tencent.com>

Refactor shrink_folio_list() by extracting three helper functions to
improve code organization and readability:

- folio_active_bounce(): Handle folio activation logic when pages need to
  be bounced back to the head of the LRU list
- folio_free(): Handle folio freeing logic, including buffer release,
  mapping removal, and batch management
- pageout_one(): Handle single folio pageout logic with proper state
  transition handling

Change shrink_folio_list() return type from unsigned int to void and track
reclaimed pages through stat->nr_reclaimed instead of a local variable.
Add nr_reclaimed field to struct reclaim_stat to support this change.

This refactoring maintains the same functionality while making the code
more modular and easier to understand. The extracted functions encapsulate
specific logical operations, making the main function flow clearer and
reducing code duplication.

No functional change.

Suggested-by: Kairui Song <kasong@tencent.com>
Signed-off-by: bruzzhang <bruzzhang@tencent.com>
---
 include/linux/vmstat.h |   1 +
 mm/vmscan.c            | 323 ++++++++++++++++++++++++++++---------------------
 2 files changed, 186 insertions(+), 138 deletions(-)

diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h
index 3c9c266cf782..f088c5641d99 100644
--- a/include/linux/vmstat.h
+++ b/include/linux/vmstat.h
@@ -26,6 +26,7 @@ struct reclaim_stat {
 	unsigned nr_unmap_fail;
 	unsigned nr_lazyfree_fail;
 	unsigned nr_demoted;
+	unsigned nr_reclaimed;
 };
 
 /* Stat data for system wide items */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3f64a09f415c..a336f7fc7dae 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1076,10 +1076,174 @@ static bool may_enter_fs(struct folio *folio, gfp_t gfp_mask)
 	return !data_race(folio_swap_flags(folio) & SWP_FS_OPS);
 }
 
+/* Mark folio as active and prepare to bounce back to head of LRU */
+static void folio_active_bounce(struct folio *folio, struct reclaim_stat *stat,
+		unsigned int nr_pages)
+{
+	/* Not a candidate for swapping, so reclaim swap space. */
+	if (folio_test_swapcache(folio) &&
+		(mem_cgroup_swap_full(folio) || folio_test_mlocked(folio)))
+		folio_free_swap(folio);
+	VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
+	if (!folio_test_mlocked(folio)) {
+		int type = folio_is_file_lru(folio);
+
+		folio_set_active(folio);
+		stat->nr_activate[type] += nr_pages;
+		count_memcg_folio_events(folio, PGACTIVATE, nr_pages);
+	}
+}
+
+static bool folio_free(struct folio *folio, struct folio_batch *free_folios,
+		struct scan_control *sc, struct reclaim_stat *stat)
+{
+	unsigned int nr_pages = folio_nr_pages(folio);
+	struct address_space *mapping = folio_mapping(folio);
+
+	/*
+	 * If the folio has buffers, try to free the buffer
+	 * mappings associated with this folio. If we succeed
+	 * we try to free the folio as well.
+	 *
+	 * We do this even if the folio is dirty.
+	 * filemap_release_folio() does not perform I/O, but it
+	 * is possible for a folio to have the dirty flag set,
+	 * but it is actually clean (all its buffers are clean).
+	 * This happens if the buffers were written out directly,
+	 * with submit_bh(). ext3 will do this, as well as
+	 * the blockdev mapping.  filemap_release_folio() will
+	 * discover that cleanness and will drop the buffers
+	 * and mark the folio clean - it can be freed.
+	 *
+	 * Rarely, folios can have buffers and no ->mapping.
+	 * These are the folios which were not successfully
+	 * invalidated in truncate_cleanup_folio().  We try to
+	 * drop those buffers here and if that worked, and the
+	 * folio is no longer mapped into process address space
+	 * (refcount == 1) it can be freed.  Otherwise, leave
+	 * the folio on the LRU so it is swappable.
+	 */
+	if (folio_needs_release(folio)) {
+		if (!filemap_release_folio(folio, sc->gfp_mask)) {
+			folio_active_bounce(folio, stat, nr_pages);
+			return false;
+		}
+
+		if (!mapping && folio_ref_count(folio) == 1) {
+			folio_unlock(folio);
+			if (folio_put_testzero(folio))
+				goto free_it;
+			else {
+				/*
+				 * rare race with speculative reference.
+				 * the speculative reference will free
+				 * this folio shortly, so we may
+				 * increment nr_reclaimed here (and
+				 * leave it off the LRU).
+				 */
+				stat->nr_reclaimed += nr_pages;
+				return true;
+			}
+		}
+	}
+
+	if (folio_test_lazyfree(folio)) {
+		/* follow __remove_mapping for reference */
+		if (!folio_ref_freeze(folio, 1))
+			return false;
+		/*
+		 * The folio has only one reference left, which is
+		 * from the isolation. After the caller puts the
+		 * folio back on the lru and drops the reference, the
+		 * folio will be freed anyway. It doesn't matter
+		 * which lru it goes on. So we don't bother checking
+		 * the dirty flag here.
+		 */
+		count_vm_events(PGLAZYFREED, nr_pages);
+		count_memcg_folio_events(folio, PGLAZYFREED, nr_pages);
+	} else if (!mapping || !__remove_mapping(mapping, folio, true,
+							sc->target_mem_cgroup))
+		return false;
+
+	folio_unlock(folio);
+free_it:
+	/*
+	 * Folio may get swapped out as a whole, need to account
+	 * all pages in it.
+	 */
+	stat->nr_reclaimed += nr_pages;
+
+	folio_unqueue_deferred_split(folio);
+	if (folio_batch_add(free_folios, folio) == 0) {
+		mem_cgroup_uncharge_folios(free_folios);
+		try_to_unmap_flush();
+		free_unref_folios(free_folios);
+	}
+	return true;
+}
+
+static void pageout_one(struct folio *folio, struct list_head *ret_folios,
+			struct folio_batch *free_folios,
+			struct scan_control *sc, struct reclaim_stat *stat,
+			struct swap_iocb **plug, struct list_head *folio_list)
+{
+	struct address_space *mapping = folio_mapping(folio);
+	unsigned int nr_pages = folio_nr_pages(folio);
+
+	switch (pageout(folio, mapping, plug, folio_list)) {
+	case PAGE_ACTIVATE:
+		/*
+		 * If shmem folio is split when writeback to swap,
+		 * the tail pages will make their own pass through
+		 * this function and be accounted then.
+		 */
+		if (nr_pages > 1 && !folio_test_large(folio)) {
+			sc->nr_scanned -= (nr_pages - 1);
+			nr_pages = 1;
+		}
+		folio_active_bounce(folio, stat, nr_pages);
+		fallthrough;
+	case PAGE_KEEP:
+		goto locked_keepit;
+	case PAGE_SUCCESS:
+		if (nr_pages > 1 && !folio_test_large(folio)) {
+			sc->nr_scanned -= (nr_pages - 1);
+			nr_pages = 1;
+		}
+		stat->nr_pageout += nr_pages;
+
+		if (folio_test_writeback(folio))
+			goto keepit;
+		if (folio_test_dirty(folio))
+			goto keepit;
+
+		/*
+		 * A synchronous write - probably a ramdisk.  Go
+		 * ahead and try to reclaim the folio.
+		 */
+		if (!folio_trylock(folio))
+			goto keepit;
+		if (folio_test_dirty(folio) ||
+			folio_test_writeback(folio))
+			goto locked_keepit;
+		mapping = folio_mapping(folio);
+		fallthrough;
+	case PAGE_CLEAN:
+		; /* try to free the folio below */
+	}
+	if (folio_free(folio, free_folios, sc, stat))
+		return;
+locked_keepit:
+	folio_unlock(folio);
+keepit:
+	list_add(&folio->lru, ret_folios);
+	VM_BUG_ON_FOLIO(folio_test_lru(folio) ||
+			folio_test_unevictable(folio), folio);
+}
 /*
- * shrink_folio_list() returns the number of reclaimed pages
+ * Reclaimed folios are counted in stat->nr_reclaimed.
  */
-static unsigned int shrink_folio_list(struct list_head *folio_list,
+static void shrink_folio_list(struct list_head *folio_list,
 		struct pglist_data *pgdat, struct scan_control *sc,
 		struct reclaim_stat *stat, bool ignore_references,
 		struct mem_cgroup *memcg)
@@ -1087,7 +1251,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 	struct folio_batch free_folios;
 	LIST_HEAD(ret_folios);
 	LIST_HEAD(demote_folios);
-	unsigned int nr_reclaimed = 0, nr_demoted = 0;
+	unsigned int nr_demoted = 0;
 	unsigned int pgactivate = 0;
 	bool do_demote_pass;
 	struct swap_iocb *plug = NULL;
@@ -1421,126 +1585,15 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			 * starts and then write it out here.
 			 */
 			try_to_unmap_flush_dirty();
-			switch (pageout(folio, mapping, &plug, folio_list)) {
-			case PAGE_KEEP:
-				goto keep_locked;
-			case PAGE_ACTIVATE:
-				/*
-				 * If shmem folio is split when writeback to swap,
-				 * the tail pages will make their own pass through
-				 * this function and be accounted then.
-				 */
-				if (nr_pages > 1 && !folio_test_large(folio)) {
-					sc->nr_scanned -= (nr_pages - 1);
-					nr_pages = 1;
-				}
-				goto activate_locked;
-			case PAGE_SUCCESS:
-				if (nr_pages > 1 && !folio_test_large(folio)) {
-					sc->nr_scanned -= (nr_pages - 1);
-					nr_pages = 1;
-				}
-				stat->nr_pageout += nr_pages;
-
-				if (folio_test_writeback(folio))
-					goto keep;
-				if (folio_test_dirty(folio))
-					goto keep;
-
-				/*
-				 * A synchronous write - probably a ramdisk.  Go
-				 * ahead and try to reclaim the folio.
-				 */
-				if (!folio_trylock(folio))
-					goto keep;
-				if (folio_test_dirty(folio) ||
-				    folio_test_writeback(folio))
-					goto keep_locked;
-				mapping = folio_mapping(folio);
-				fallthrough;
-			case PAGE_CLEAN:
-				; /* try to free the folio below */
-			}
-		}
-
-		/*
-		 * If the folio has buffers, try to free the buffer
-		 * mappings associated with this folio. If we succeed
-		 * we try to free the folio as well.
-		 *
-		 * We do this even if the folio is dirty.
-		 * filemap_release_folio() does not perform I/O, but it
-		 * is possible for a folio to have the dirty flag set,
-		 * but it is actually clean (all its buffers are clean).
-		 * This happens if the buffers were written out directly,
-		 * with submit_bh(). ext3 will do this, as well as
-		 * the blockdev mapping.  filemap_release_folio() will
-		 * discover that cleanness and will drop the buffers
-		 * and mark the folio clean - it can be freed.
-		 *
-		 * Rarely, folios can have buffers and no ->mapping.
-		 * These are the folios which were not successfully
-		 * invalidated in truncate_cleanup_folio().  We try to
-		 * drop those buffers here and if that worked, and the
-		 * folio is no longer mapped into process address space
-		 * (refcount == 1) it can be freed.  Otherwise, leave
-		 * the folio on the LRU so it is swappable.
-		 */
-		if (folio_needs_release(folio)) {
-			if (!filemap_release_folio(folio, sc->gfp_mask))
-				goto activate_locked;
-			if (!mapping && folio_ref_count(folio) == 1) {
-				folio_unlock(folio);
-				if (folio_put_testzero(folio))
-					goto free_it;
-				else {
-					/*
-					 * rare race with speculative reference.
-					 * the speculative reference will free
-					 * this folio shortly, so we may
-					 * increment nr_reclaimed here (and
-					 * leave it off the LRU).
-					 */
-					nr_reclaimed += nr_pages;
-					continue;
-				}
-			}
+			pageout_one(folio, &ret_folios, &free_folios, sc, stat,
+				&plug, folio_list);
+			goto next;
 		}
 
-		if (folio_test_lazyfree(folio)) {
-			/* follow __remove_mapping for reference */
-			if (!folio_ref_freeze(folio, 1))
-				goto keep_locked;
-			/*
-			 * The folio has only one reference left, which is
-			 * from the isolation. After the caller puts the
-			 * folio back on the lru and drops the reference, the
-			 * folio will be freed anyway. It doesn't matter
-			 * which lru it goes on. So we don't bother checking
-			 * the dirty flag here.
-			 */
-			count_vm_events(PGLAZYFREED, nr_pages);
-			count_memcg_folio_events(folio, PGLAZYFREED, nr_pages);
-		} else if (!mapping || !__remove_mapping(mapping, folio, true,
-							 sc->target_mem_cgroup))
+		if (!folio_free(folio, &free_folios, sc, stat))
 			goto keep_locked;
-
-		folio_unlock(folio);
-free_it:
-		/*
-		 * Folio may get swapped out as a whole, need to account
-		 * all pages in it.
-		 */
-		nr_reclaimed += nr_pages;
-
-		folio_unqueue_deferred_split(folio);
-		if (folio_batch_add(&free_folios, folio) == 0) {
-			mem_cgroup_uncharge_folios(&free_folios);
-			try_to_unmap_flush();
-			free_unref_folios(&free_folios);
-		}
-		continue;
-
+		else
+			continue;
 activate_locked_split:
 		/*
 		 * The tail pages that are failed to add into swap cache
@@ -1551,29 +1604,21 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 			nr_pages = 1;
 		}
 activate_locked:
-		/* Not a candidate for swapping, so reclaim swap space. */
-		if (folio_test_swapcache(folio) &&
-		    (mem_cgroup_swap_full(folio) || folio_test_mlocked(folio)))
-			folio_free_swap(folio);
-		VM_BUG_ON_FOLIO(folio_test_active(folio), folio);
-		if (!folio_test_mlocked(folio)) {
-			int type = folio_is_file_lru(folio);
-			folio_set_active(folio);
-			stat->nr_activate[type] += nr_pages;
-			count_memcg_folio_events(folio, PGACTIVATE, nr_pages);
-		}
+		folio_active_bounce(folio, stat, nr_pages);
 keep_locked:
 		folio_unlock(folio);
 keep:
 		list_add(&folio->lru, &ret_folios);
 		VM_BUG_ON_FOLIO(folio_test_lru(folio) ||
 				folio_test_unevictable(folio), folio);
+next:
+		continue;
 	}
 	/* 'folio_list' is always empty here */
 
 	/* Migrate folios selected for demotion */
 	nr_demoted = demote_folio_list(&demote_folios, pgdat, memcg);
-	nr_reclaimed += nr_demoted;
+	stat->nr_reclaimed += nr_demoted;
 	stat->nr_demoted += nr_demoted;
 	/* Folios that could not be demoted are still in @demote_folios */
 	if (!list_empty(&demote_folios)) {
@@ -1613,7 +1658,6 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
 
 	if (plug)
 		swap_write_unplug(plug);
-	return nr_reclaimed;
 }
 
 unsigned int reclaim_clean_pages_from_list(struct zone *zone,
@@ -1647,8 +1691,9 @@ unsigned int reclaim_clean_pages_from_list(struct zone *zone,
 	 * change in the future.
 	 */
 	noreclaim_flag = memalloc_noreclaim_save();
-	nr_reclaimed = shrink_folio_list(&clean_folios, zone->zone_pgdat, &sc,
+	shrink_folio_list(&clean_folios, zone->zone_pgdat, &sc,
 					&stat, true, NULL);
+	nr_reclaimed = stat.nr_reclaimed;
 	memalloc_noreclaim_restore(noreclaim_flag);
 
 	list_splice(&clean_folios, folio_list);
@@ -2017,8 +2062,8 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 	if (nr_taken == 0)
 		return 0;
 
-	nr_reclaimed = shrink_folio_list(&folio_list, pgdat, sc, &stat, false,
-					 lruvec_memcg(lruvec));
+	shrink_folio_list(&folio_list, pgdat, sc, &stat, false, lruvec_memcg(lruvec));
+	nr_reclaimed = stat.nr_reclaimed;
 
 	spin_lock_irq(&lruvec->lru_lock);
 	move_folios_to_lru(lruvec, &folio_list);
@@ -2195,7 +2240,8 @@ static unsigned int reclaim_folio_list(struct list_head *folio_list,
 		.no_demotion = 1,
 	};
 
-	nr_reclaimed = shrink_folio_list(folio_list, pgdat, &sc, &stat, true, NULL);
+	shrink_folio_list(folio_list, pgdat, &sc, &stat, true, NULL);
+	nr_reclaimed = stat.nr_reclaimed;
 	while (!list_empty(folio_list)) {
 		folio = lru_to_folio(folio_list);
 		list_del(&folio->lru);
@@ -4703,7 +4749,8 @@ static int evict_folios(unsigned long nr_to_scan, struct lruvec *lruvec,
 	if (list_empty(&list))
 		return scanned;
 retry:
-	reclaimed = shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
+	shrink_folio_list(&list, pgdat, sc, &stat, false, memcg);
+	reclaimed = stat.nr_reclaimed;
 	sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
 	sc->nr_reclaimed += reclaimed;
 	trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,

-- 
2.43.7




^ permalink raw reply related	[flat|nested] 6+ messages in thread

* [PATCH 2/2] mm, vmscan: flush TLB for every 31 folios evictions
  2026-03-09  8:17 [PATCH 0/2] mm: batch TLB flushing for dirty folios in vmscan Zhang Peng via B4 Relay
  2026-03-09  8:17 ` [PATCH 1/2] mm/vmscan: refactor shrink_folio_list for readability and maintainability Zhang Peng via B4 Relay
@ 2026-03-09  8:17 ` Zhang Peng via B4 Relay
  2026-03-09 12:29   ` Usama Arif
  1 sibling, 1 reply; 6+ messages in thread
From: Zhang Peng via B4 Relay @ 2026-03-09  8:17 UTC (permalink / raw)
  To: Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Johannes Weiner, Qi Zheng,
	Shakeel Butt, Axel Rasmussen, Yuanchu Xie, Wei Xu, Michal Hocko
  Cc: linux-mm, linux-kernel, Kairui Song, Zhang Peng

From: bruzzhang <bruzzhang@tencent.com>

Currently we flush TLB for every dirty folio, which is a bottleneck for
systems with many cores as this causes heavy IPI usage.

So instead, batch the folios, and flush once for every 31 folios (one
folio_batch). These folios will be held in a folio_batch releasing their
lock, then when folio_batch is full, do following steps:

- For each folio: lock - check still evictable - unlock
  - If no longer evictable, return the folio to the caller.
- Flush TLB once for the batch
- Pageout the folios (refcount freeze happens in the pageout path)

Note we can't hold a frozen folio in folio_batch for long as it will
cause filemap/swapcache lookup to livelock. Fortunately pageout usually
won't take too long; sync IO is fast, and non-sync IO will be issued
with the folio marked writeback.

Suggested-by: Kairui Song <kasong@tencent.com>
Signed-off-by: bruzzhang <bruzzhang@tencent.com>
---
 mm/vmscan.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 61 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index a336f7fc7dae..69cdd3252ff8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1240,6 +1240,48 @@ static void pageout_one(struct folio *folio, struct list_head *ret_folios,
 	VM_BUG_ON_FOLIO(folio_test_lru(folio) ||
 			folio_test_unevictable(folio), folio);
 }
+
+static void pageout_batch(struct folio_batch *fbatch,
+			  struct list_head *ret_folios,
+			  struct folio_batch *free_folios,
+			  struct scan_control *sc, struct reclaim_stat *stat,
+			  struct swap_iocb **plug, struct list_head *folio_list)
+{
+	int i = 0, count = folio_batch_count(fbatch);
+	struct folio *folio;
+
+	folio_batch_reinit(fbatch);
+	do {
+		folio = fbatch->folios[i];
+		if (!folio_trylock(folio)) {
+			list_add(&folio->lru, ret_folios);
+			continue;
+		}
+
+		if (folio_test_writeback(folio) || folio_test_lru(folio) ||
+		    folio_mapped(folio))
+			goto next;
+		folio_batch_add(fbatch, folio);
+		continue;
+next:
+		folio_unlock(folio);
+		list_add(&folio->lru, ret_folios);
+	} while (++i != count);
+
+	i = 0;
+	count = folio_batch_count(fbatch);
+	if (!count)
+		return;
+	/* One TLB flush for the batch */
+	try_to_unmap_flush_dirty();
+	do {
+		folio = fbatch->folios[i];
+		pageout_one(folio, ret_folios, free_folios, sc, stat, plug,
+			    folio_list);
+	} while (++i != count);
+	folio_batch_reinit(fbatch);
+}
+
 /*
  * Reclaimed folios are counted in stat->nr_reclaimed.
  */
@@ -1249,6 +1291,8 @@ static void shrink_folio_list(struct list_head *folio_list,
 		struct mem_cgroup *memcg)
 {
 	struct folio_batch free_folios;
+	struct folio_batch flush_folios;
+
 	LIST_HEAD(ret_folios);
 	LIST_HEAD(demote_folios);
 	unsigned int nr_demoted = 0;
@@ -1257,6 +1301,8 @@ static void shrink_folio_list(struct list_head *folio_list,
 	struct swap_iocb *plug = NULL;
 
 	folio_batch_init(&free_folios);
+	folio_batch_init(&flush_folios);
+
 	memset(stat, 0, sizeof(*stat));
 	cond_resched();
 	do_demote_pass = can_demote(pgdat->node_id, sc, memcg);
@@ -1578,15 +1624,19 @@ static void shrink_folio_list(struct list_head *folio_list,
 				goto keep_locked;
 			if (!sc->may_writepage)
 				goto keep_locked;
-
 			/*
-			 * Folio is dirty. Flush the TLB if a writable entry
-			 * potentially exists to avoid CPU writes after I/O
-			 * starts and then write it out here.
+			 * For anon, we should only see swap cache (anon) and
+			 * the list pinning the page. For file page, the filemap
+			 * and the list pins it. Combined with the page_ref_freeze
+			 * in pageout_batch ensure nothing else touches the page
+			 * during lock unlocked.
 			 */
-			try_to_unmap_flush_dirty();
-			pageout_one(folio, &ret_folios, &free_folios, sc, stat,
-				&plug, folio_list);
+			folio_unlock(folio);
+			if (!folio_batch_add(&flush_folios, folio))
+				pageout_batch(&flush_folios,
+							&ret_folios, &free_folios,
+							sc, stat, &plug,
+							folio_list);
 			goto next;
 		}
 
@@ -1614,6 +1664,10 @@ static void shrink_folio_list(struct list_head *folio_list,
 next:
 		continue;
 	}
+	if (folio_batch_count(&flush_folios)) {
+		pageout_batch(&flush_folios, &ret_folios, &free_folios, sc,
+			      stat, &plug, folio_list);
+	}
 	/* 'folio_list' is always empty here */
 
 	/* Migrate folios selected for demotion */

-- 
2.43.7




^ permalink raw reply related	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/2] mm, vmscan: flush TLB for every 31 folios evictions
  2026-03-09  8:17 ` [PATCH 2/2] mm, vmscan: flush TLB for every 31 folios evictions Zhang Peng via B4 Relay
@ 2026-03-09 12:29   ` Usama Arif
  2026-03-09 13:19     ` Kairui Song
  2026-03-09 14:56     ` Zhang Peng
  0 siblings, 2 replies; 6+ messages in thread
From: Usama Arif @ 2026-03-09 12:29 UTC (permalink / raw)
  To: Zhang Peng via B4 Relay
  Cc: Usama Arif, Andrew Morton, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Johannes Weiner, Qi Zheng,
	Shakeel Butt, Axel Rasmussen, Yuanchu Xie, Wei Xu, Michal Hocko,
	linux-mm, linux-kernel, Kairui Song, Zhang Peng

On Mon, 09 Mar 2026 16:17:42 +0800 Zhang Peng via B4 Relay <devnull+zippermonkey.icloud.com@kernel.org> wrote:

> From: bruzzhang <bruzzhang@tencent.com>
> 
> Currently we flush TLB for every dirty folio, which is a bottleneck for
> systems with many cores as this causes heavy IPI usage.
> 
> So instead, batch the folios, and flush once for every 31 folios (one
> folio_batch). These folios will be held in a folio_batch releasing their
> lock, then when folio_batch is full, do following steps:
> 
> - For each folio: lock - check still evictable - unlock
>   - If no longer evictable, return the folio to the caller.
> - Flush TLB once for the batch
> - Pageout the folios (refcount freeze happens in the pageout path)
> 
> Note we can't hold a frozen folio in folio_batch for long as it will
> cause filemap/swapcache lookup to livelock. Fortunately pageout usually
> won't take too long; sync IO is fast, and non-sync IO will be issued
> with the folio marked writeback.
> 
> Suggested-by: Kairui Song <kasong@tencent.com>
> Signed-off-by: bruzzhang <bruzzhang@tencent.com>
> ---
>  mm/vmscan.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
>  1 file changed, 61 insertions(+), 7 deletions(-)
> 
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index a336f7fc7dae..69cdd3252ff8 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -1240,6 +1240,48 @@ static void pageout_one(struct folio *folio, struct list_head *ret_folios,
>  	VM_BUG_ON_FOLIO(folio_test_lru(folio) ||
>  			folio_test_unevictable(folio), folio);
>  }
> +
> +static void pageout_batch(struct folio_batch *fbatch,
> +			  struct list_head *ret_folios,
> +			  struct folio_batch *free_folios,
> +			  struct scan_control *sc, struct reclaim_stat *stat,
> +			  struct swap_iocb **plug, struct list_head *folio_list)
> +{
> +	int i = 0, count = folio_batch_count(fbatch);
> +	struct folio *folio;
> +
> +	folio_batch_reinit(fbatch);
> +	do {
> +		folio = fbatch->folios[i];
> +		if (!folio_trylock(folio)) {
> +			list_add(&folio->lru, ret_folios);
> +			continue;
> +		}
> +
> +		if (folio_test_writeback(folio) || folio_test_lru(folio) ||
> +		    folio_mapped(folio))
> +			goto next;
> +		folio_batch_add(fbatch, folio);
> +		continue;
> +next:
> +		folio_unlock(folio);
> +		list_add(&folio->lru, ret_folios);
> +	} while (++i != count);

Hello!

Instead of using do {} while(++i != count), its better to use for loop,
a standard for loop would be better for code readability.

> +
> +	i = 0;
> +	count = folio_batch_count(fbatch);
> +	if (!count)
> +		return;
> +	/* One TLB flush for the batch */
> +	try_to_unmap_flush_dirty();
> +	do {
> +		folio = fbatch->folios[i];
> +		pageout_one(folio, ret_folios, free_folios, sc, stat, plug,
> +			    folio_list);
> +	} while (++i != count);
> +	folio_batch_reinit(fbatch);
> +}
> +
>  /*
>   * Reclaimed folios are counted in stat->nr_reclaimed.
>   */
> @@ -1249,6 +1291,8 @@ static void shrink_folio_list(struct list_head *folio_list,
>  		struct mem_cgroup *memcg)
>  {
>  	struct folio_batch free_folios;
> +	struct folio_batch flush_folios;
> +
>  	LIST_HEAD(ret_folios);
>  	LIST_HEAD(demote_folios);
>  	unsigned int nr_demoted = 0;
> @@ -1257,6 +1301,8 @@ static void shrink_folio_list(struct list_head *folio_list,
>  	struct swap_iocb *plug = NULL;
>  
>  	folio_batch_init(&free_folios);
> +	folio_batch_init(&flush_folios);
> +
>  	memset(stat, 0, sizeof(*stat));
>  	cond_resched();
>  	do_demote_pass = can_demote(pgdat->node_id, sc, memcg);
> @@ -1578,15 +1624,19 @@ static void shrink_folio_list(struct list_head *folio_list,
>  				goto keep_locked;
>  			if (!sc->may_writepage)
>  				goto keep_locked;
> -
>  			/*
> -			 * Folio is dirty. Flush the TLB if a writable entry
> -			 * potentially exists to avoid CPU writes after I/O
> -			 * starts and then write it out here.
> +			 * For anon, we should only see swap cache (anon) and
> +			 * the list pinning the page. For file page, the filemap
> +			 * and the list pins it. Combined with the page_ref_freeze
> +			 * in pageout_batch ensure nothing else touches the page
> +			 * during lock unlocked.
>  			 */

page_ref_freeze happens inside pageout_one() -> pageout() -> __remove_mapping(),
which runs after the folio is re-locked and after the TLB flush.  During
the unlocked window, the refcount is not frozen. Right?

With this patch, the folio is unlocked before try_to_unmap_flush_dirty() runs
in pageout_batch(). During this window, TLB entries on other CPUs could allow
writes to the folio after it has been selected for pageout. My understanding
is that the original code intentionally flushed TLB while the folio was locked
to prevent this? Could there be data corruption can result if a write through
a stale TLB entry races with the pageout I/O?


> -			try_to_unmap_flush_dirty();
> -			pageout_one(folio, &ret_folios, &free_folios, sc, stat,
> -				&plug, folio_list);
> +			folio_unlock(folio);
> +			if (!folio_batch_add(&flush_folios, folio))
> +				pageout_batch(&flush_folios,
> +							&ret_folios, &free_folios,
> +							sc, stat, &plug,
> +							folio_list);
>  			goto next;
>  		}
>  
> @@ -1614,6 +1664,10 @@ static void shrink_folio_list(struct list_head *folio_list,
>  next:
>  		continue;
>  	}
> +	if (folio_batch_count(&flush_folios)) {
> +		pageout_batch(&flush_folios, &ret_folios, &free_folios, sc,
> +			      stat, &plug, folio_list);
> +	}
>  	/* 'folio_list' is always empty here */
>  
>  	/* Migrate folios selected for demotion */
> 
> -- 
> 2.43.7
> 
> 
> 


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/2] mm, vmscan: flush TLB for every 31 folios evictions
  2026-03-09 12:29   ` Usama Arif
@ 2026-03-09 13:19     ` Kairui Song
  2026-03-09 14:56     ` Zhang Peng
  1 sibling, 0 replies; 6+ messages in thread
From: Kairui Song @ 2026-03-09 13:19 UTC (permalink / raw)
  To: Zhang Peng, Usama Arif
  Cc: Zhang Peng via B4 Relay, Andrew Morton, David Hildenbrand,
	Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko, Johannes Weiner, Qi Zheng,
	Shakeel Butt, Axel Rasmussen, Yuanchu Xie, Wei Xu, Michal Hocko,
	linux-mm, linux-kernel

On Mon, Mar 9, 2026 at 8:42 PM Usama Arif <usama.arif@linux.dev> wrote:
>
> On Mon, 09 Mar 2026 16:17:42 +0800 Zhang Peng via B4 Relay <devnull+zippermonkey.icloud.com@kernel.org> wrote:
>
> > From: bruzzhang <bruzzhang@tencent.com>
> >
> > Currently we flush TLB for every dirty folio, which is a bottleneck for
> > systems with many cores as this causes heavy IPI usage.
> >
> > So instead, batch the folios, and flush once for every 31 folios (one
> > folio_batch). These folios will be held in a folio_batch releasing their
> > lock, then when folio_batch is full, do following steps:
> >
> > - For each folio: lock - check still evictable - unlock
> >   - If no longer evictable, return the folio to the caller.
> > - Flush TLB once for the batch
> > - Pageout the folios (refcount freeze happens in the pageout path)
> >
> > Note we can't hold a frozen folio in folio_batch for long as it will
> > cause filemap/swapcache lookup to livelock. Fortunately pageout usually
> > won't take too long; sync IO is fast, and non-sync IO will be issued
> > with the folio marked writeback.
> >
> > Suggested-by: Kairui Song <kasong@tencent.com>
> > Signed-off-by: bruzzhang <bruzzhang@tencent.com>
> > ---
> >  mm/vmscan.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-------
> >  1 file changed, 61 insertions(+), 7 deletions(-)

...

> >       folio_batch_init(&free_folios);
> > +     folio_batch_init(&flush_folios);
> > +
> >       memset(stat, 0, sizeof(*stat));
> >       cond_resched();
> >       do_demote_pass = can_demote(pgdat->node_id, sc, memcg);
> > @@ -1578,15 +1624,19 @@ static void shrink_folio_list(struct list_head *folio_list,
> >                               goto keep_locked;
> >                       if (!sc->may_writepage)
> >                               goto keep_locked;
> > -
> >                       /*
> > -                      * Folio is dirty. Flush the TLB if a writable entry
> > -                      * potentially exists to avoid CPU writes after I/O
> > -                      * starts and then write it out here.
> > +                      * For anon, we should only see swap cache (anon) and
> > +                      * the list pinning the page. For file page, the filemap
> > +                      * and the list pins it. Combined with the page_ref_freeze
> > +                      * in pageout_batch ensure nothing else touches the page
> > +                      * during lock unlocked.
> >                        */
>
> page_ref_freeze happens inside pageout_one() -> pageout() -> __remove_mapping(),
> which runs after the folio is re-locked and after the TLB flush.  During
> the unlocked window, the refcount is not frozen. Right?
>
> With this patch, the folio is unlocked before try_to_unmap_flush_dirty() runs
> in pageout_batch(). During this window, TLB entries on other CPUs could allow
> writes to the folio after it has been selected for pageout. My understanding
> is that the original code intentionally flushed TLB while the folio was locked
> to prevent this? Could there be data corruption can result if a write through
> a stale TLB entry races with the pageout I/O?

Hi Usama,

Thanks for the review. Yeah the comment here seems wrong, I agree with you.

Hi, Peng, I think you might have used some stall comment, at least
page_ref_freeze doesn't exist here and that doesn't seem to be how
this patch works currently. Can you help double check and update?

These folios are kept in the batch unlocked and unfreeze. Also,
unmapped. They could get mapped again or touched, so the batch flush
should relocks the folios and redo some routines before that unmap
before, and if they are still in a ready to be freed status, then
flush and do the IO, then free.

BTW some checks seem missing in the batch check? eg. folio_maybe_dma_pinned.


^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [PATCH 2/2] mm, vmscan: flush TLB for every 31 folios evictions
  2026-03-09 12:29   ` Usama Arif
  2026-03-09 13:19     ` Kairui Song
@ 2026-03-09 14:56     ` Zhang Peng
  1 sibling, 0 replies; 6+ messages in thread
From: Zhang Peng @ 2026-03-09 14:56 UTC (permalink / raw)
  To: usama.arif
  Cc: Liam.Howlett, akpm, axelrasmussen, bruzzhang, david,
	devnull+zippermonkey.icloud.com, hannes, kasong, linux-kernel,
	linux-mm, ljs, mhocko, mhocko, rppt, shakeel.butt, surenb, vbabka,
	weixugc, yuanchu, zhengqi.arch

Hi Usama,

Thanks for the review!

You are right that the comment is wrong, page_ref_freeze does not exist in
pageout_batch(). I will fix the comment in v2.

Regarding the data corruption concern: try_to_unmap_flush_dirty() is called
before pageout_one(), so all stale writable TLB entries are invalidated 
before
IO starts. Any writes through stale TLB entries during the unlocked 
window will
have completed and landed in physical memory before the flush, and will be
correctly captured by the subsequent IO.

pageout_batch() relocks each folio and rechecks its state (writeback, lru,
mapped, dma_pinned) before proceeding. If any of these conditions have 
changed
during the unlocked window, the folio is not written out and is put back 
to the
LRU list for a future reclaim attempt. So there should be no data corruption
issue.

I will also add folio_maybe_dma_pinned() check in v2 as suggested by 
Kairui Song.

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2026-03-09 14:56 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-03-09  8:17 [PATCH 0/2] mm: batch TLB flushing for dirty folios in vmscan Zhang Peng via B4 Relay
2026-03-09  8:17 ` [PATCH 1/2] mm/vmscan: refactor shrink_folio_list for readability and maintainability Zhang Peng via B4 Relay
2026-03-09  8:17 ` [PATCH 2/2] mm, vmscan: flush TLB for every 31 folios evictions Zhang Peng via B4 Relay
2026-03-09 12:29   ` Usama Arif
2026-03-09 13:19     ` Kairui Song
2026-03-09 14:56     ` Zhang Peng

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox