* [PATCH v2] mm/vmscan: batch TLB flush during memory reclaim
@ 2025-03-28 18:20 Rik van Riel
2025-04-03 22:00 ` Andrew Morton
0 siblings, 1 reply; 5+ messages in thread
From: Rik van Riel @ 2025-03-28 18:20 UTC (permalink / raw)
To: Andrew Morton
Cc: linux-mm, linux-kernel, kernel-team, Vinay Banakar, liuye,
Hugh Dickins, Mel Gorman, Yu Zhao, Shakeel Butt
From: Vinay Banakar <vny@google.com>
The current implementation in shrink_folio_list() performs a full TLB
flush for every individual folio reclaimed. This causes unnecessary
overhead during memory reclaim.
The current code:
1. Clears PTEs and unmaps each page individually
2. Performs a full TLB flush on every CPU the mm is running on
The new code:
1. Clears PTEs and unmaps each page individually
2. Adds each unmapped page to pageout_folios
3. Flushes the TLB once before processing pageout_folios
This reduces the number of TLB flushes issued by the memory reclaim
code to 1/N, where N is the number of mapped folios encountered in
the batch processed by shrink_folio_list().
[riel: forward port to 6.14, adjust code and naming to match surrounding code]
Signed-off-by: Vinay Banakar <vny@google.com>
Signed-off-by: Rik van Riel <riel@surriel.com>
---
v2: remove the folio_test_young() check that broke some 32-bit builds; pages should
already be unmapped when they get to this point anyway, and if somebody mapped them
again they are by definition (very) recently accessed
mm/vmscan.c | 112 +++++++++++++++++++++++++++++++---------------------
1 file changed, 68 insertions(+), 44 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c767d71c43d7..286ff627d337 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1086,6 +1086,7 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
struct folio_batch free_folios;
LIST_HEAD(ret_folios);
LIST_HEAD(demote_folios);
+ LIST_HEAD(pageout_folios);
unsigned int nr_reclaimed = 0, nr_demoted = 0;
unsigned int pgactivate = 0;
bool do_demote_pass;
@@ -1394,51 +1395,10 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
goto keep_locked;
/*
- * Folio is dirty. Flush the TLB if a writable entry
- * potentially exists to avoid CPU writes after I/O
- * starts and then write it out here.
+ * Add to pageout list for batched TLB flushing and IO submission.
*/
- try_to_unmap_flush_dirty();
- switch (pageout(folio, mapping, &plug, folio_list)) {
- case PAGE_KEEP:
- goto keep_locked;
- case PAGE_ACTIVATE:
- /*
- * If shmem folio is split when writeback to swap,
- * the tail pages will make their own pass through
- * this function and be accounted then.
- */
- if (nr_pages > 1 && !folio_test_large(folio)) {
- sc->nr_scanned -= (nr_pages - 1);
- nr_pages = 1;
- }
- goto activate_locked;
- case PAGE_SUCCESS:
- if (nr_pages > 1 && !folio_test_large(folio)) {
- sc->nr_scanned -= (nr_pages - 1);
- nr_pages = 1;
- }
- stat->nr_pageout += nr_pages;
-
- if (folio_test_writeback(folio))
- goto keep;
- if (folio_test_dirty(folio))
- goto keep;
-
- /*
- * A synchronous write - probably a ramdisk. Go
- * ahead and try to reclaim the folio.
- */
- if (!folio_trylock(folio))
- goto keep;
- if (folio_test_dirty(folio) ||
- folio_test_writeback(folio))
- goto keep_locked;
- mapping = folio_mapping(folio);
- fallthrough;
- case PAGE_CLEAN:
- ; /* try to free the folio below */
- }
+ list_add(&folio->lru, &pageout_folios);
+ continue;
}
/*
@@ -1549,6 +1509,70 @@ static unsigned int shrink_folio_list(struct list_head *folio_list,
}
/* 'folio_list' is always empty here */
+ if (!list_empty(&pageout_folios)) {
+ /*
+ * The loop above unmapped the folios from the page tables.
+ * One TLB flush takes care of the whole batch.
+ */
+ try_to_unmap_flush_dirty();
+
+ while (!list_empty(&pageout_folios)) {
+ struct folio *folio = lru_to_folio(&pageout_folios);
+ struct address_space *mapping;
+ list_del(&folio->lru);
+
+ /* Recheck if the page got reactivated */
+ if (folio_test_active(folio) || folio_mapped(folio))
+ goto skip_pageout_locked;
+
+ mapping = folio_mapping(folio);
+ switch (pageout(folio, mapping, &plug, &pageout_folios)) {
+ case PAGE_KEEP:
+ case PAGE_ACTIVATE:
+ goto skip_pageout_locked;
+ case PAGE_SUCCESS:
+ /*
+ * If shmem folio is split when writeback to swap,
+ * the tail pages will make their own pass through
+ * this loop and be accounted then.
+ */
+ stat->nr_pageout += folio_nr_pages(folio);
+
+ if (folio_test_writeback(folio))
+ goto skip_pageout;
+ if (folio_test_dirty(folio))
+ goto skip_pageout;
+
+ /*
+ * A synchronous write - probably a ramdisk. Go
+ * ahead and try to reclaim the folio.
+ */
+ if (!folio_trylock(folio))
+ goto skip_pageout;
+ if (folio_test_dirty(folio) ||
+ folio_test_writeback(folio))
+ goto skip_pageout_locked;
+ mapping = folio_mapping(folio);
+ /* try to free the folio below */
+ fallthrough;
+ case PAGE_CLEAN:
+ /* try to free the folio */
+ if (!mapping ||
+ !remove_mapping(mapping, folio))
+ goto skip_pageout_locked;
+
+ nr_reclaimed += folio_nr_pages(folio);
+ folio_unlock(folio);
+ continue;
+ }
+
+skip_pageout_locked:
+ folio_unlock(folio);
+skip_pageout:
+ list_add(&folio->lru, &ret_folios);
+ }
+ }
+
/* Migrate folios selected for demotion */
nr_demoted = demote_folio_list(&demote_folios, pgdat);
nr_reclaimed += nr_demoted;
--
2.47.1
* Re: [PATCH v2] mm/vmscan: batch TLB flush during memory reclaim
2025-03-28 18:20 [PATCH v2] mm/vmscan: batch TLB flush during memory reclaim Rik van Riel
@ 2025-04-03 22:00 ` Andrew Morton
2025-04-03 22:31 ` Shakeel Butt
` (2 more replies)
0 siblings, 3 replies; 5+ messages in thread
From: Andrew Morton @ 2025-04-03 22:00 UTC (permalink / raw)
To: Rik van Riel
Cc: linux-mm, linux-kernel, kernel-team, Vinay Banakar, liuye,
Hugh Dickins, Mel Gorman, Yu Zhao, Shakeel Butt
On Fri, 28 Mar 2025 14:20:55 -0400 Rik van Riel <riel@surriel.com> wrote:
> The current implementation in shrink_folio_list() performs a full TLB
> flush for every individual folio reclaimed. This causes unnecessary
> overhead during memory reclaim.
>
> The current code:
> 1. Clears PTEs and unmaps each page individually
> 2. Performs a full TLB flush on every CPU the mm is running on
>
> The new code:
> 1. Clears PTEs and unmaps each page individually
> 2. Adds each unmapped page to pageout_folios
> 3. Flushes the TLB once before processing pageout_folios
>
> This reduces the number of TLB flushes issued by the memory reclaim
> code to 1/N, where N is the number of mapped folios encountered in
> the batch processed by shrink_folio_list().
Were any runtime benefits observable?
* Re: [PATCH v2] mm/vmscan: batch TLB flush during memory reclaim
2025-04-03 22:00 ` Andrew Morton
@ 2025-04-03 22:31 ` Shakeel Butt
2025-04-04 13:30 ` Vinay Banakar
2025-04-04 13:37 ` Vinay Banakar
2 siblings, 0 replies; 5+ messages in thread
From: Shakeel Butt @ 2025-04-03 22:31 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, linux-mm, linux-kernel, kernel-team, Vinay Banakar,
liuye, Hugh Dickins, Mel Gorman, Yu Zhao
On Thu, Apr 03, 2025 at 03:00:55PM -0700, Andrew Morton wrote:
> On Fri, 28 Mar 2025 14:20:55 -0400 Rik van Riel <riel@surriel.com> wrote:
>
> > The current implementation in shrink_folio_list() performs a full TLB
> > flush for every individual folio reclaimed. This causes unnecessary
> > overhead during memory reclaim.
> >
> > The current code:
> > 1. Clears PTEs and unmaps each page individually
> > 2. Performs a full TLB flush on every CPU the mm is running on
> >
> > The new code:
> > 1. Clears PTEs and unmaps each page individually
> > 2. Adds each unmapped page to pageout_folios
> > 3. Flushes the TLB once before processing pageout_folios
> >
> > This reduces the number of TLB flushes issued by the memory reclaim
> > code to 1/N, where N is the number of mapped folios encountered in
> > the batch processed by shrink_folio_list().
>
> Were any runtime benefits observable?
Andrew, can you hold off this patch for now? I provided some feedback
privately but let me put it here as well.
This patch is very, very hard to review. shrink_folio_list() has become a
beast over the years. This patch moves a code block within the same
function and skips a lot of the stuff that happens between the old place
and the new place. I still couldn't figure out how the actual freeing of
the folios happens, as the patch completely skips
mem_cgroup_uncharge_folios() and free_unref_folios(). The lazyfree
counters are skipped too, and the buffer head, swap free, and mlocked
handling will be skipped as well.
I think there needs to be an explanation of why this patch is correct
even though it skips all of that functionality.
* Re: [PATCH v2] mm/vmscan: batch TLB flush during memory reclaim
2025-04-03 22:00 ` Andrew Morton
2025-04-03 22:31 ` Shakeel Butt
@ 2025-04-04 13:30 ` Vinay Banakar
2025-04-04 13:37 ` Vinay Banakar
2 siblings, 0 replies; 5+ messages in thread
From: Vinay Banakar @ 2025-04-04 13:30 UTC (permalink / raw)
To: Andrew Morton
Cc: Rik van Riel, linux-mm, linux-kernel, kernel-team, liuye,
Hugh Dickins, Mel Gorman, Yu Zhao, Shakeel Butt
On Thu, Apr 3, 2025 at 5:00 PM Andrew Morton <akpm@linux-foundation.org>
wrote:
> Were any runtime benefits observable?
I had replied as follows on another chain related to this patch:
Yes, the patch reduces IPIs by a factor of 512 by sending one IPI (for TLB
flush) per PMD rather than per page. Since shrink_folio_list()
usually operates on one PMD at a time, I believe we can safely batch these
operations here, but I would appreciate your feedback on this.
Here's a concrete example:
When swapping out 20 GiB (5.2M pages):
- Current: each page triggers an IPI to all cores
  - With 6 cores: 31.4M total interrupts (6 cores × 5.2M pages)
- With patch: one IPI per PMD (512 pages)
  - Only 10.2K IPIs required (5.2M / 512)
  - With 6 cores: 61.4K total interrupts
- Result: ~99% reduction in total interrupts
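For anyone who wants to reproduce the arithmetic, here is a throwaway
user-space check; it just restates the numbers above, assuming 4 KiB
pages and 512 PTEs per PMD (the x86-64 defaults):

/* Sanity-check of the IPI numbers above: 4 KiB pages, 512 PTEs/PMD, 6 cores. */
#include <stdio.h>

int main(void)
{
        unsigned long long pages = (20ULL << 30) / 4096;   /* 20 GiB -> 5242880 pages */
        unsigned long long cores = 6, per_pmd = 512;

        printf("IPIs before:       %llu (one per page)\n", pages);
        printf("interrupts before: %llu (x %llu cores)\n", pages * cores, cores);
        printf("IPIs after:        %llu (one per PMD)\n", pages / per_pmd);
        printf("interrupts after:  %llu (x %llu cores)\n",
               pages / per_pmd * cores, cores);
        return 0;
}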
Application performance impact varies by workload, but here's a
representative test case:
- Thread 1: Continuously accesses a 2 GiB private anonymous map (64B
chunks at random offsets)
- Thread 2: Pinned to different core, uses MADV_PAGEOUT on 20 GiB
private anonymous map to swap it out to SSD
- The threads only access their respective maps.
Results:
- Without patch: Thread 1 sees ~53% throughput reduction during
swap. If there are multiple worker threads (like thread 1), the
cumulative throughput degradation will be much higher
- With patch: Thread 1 maintains normal throughput
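For reference, a minimal sketch of that kind of test is below. It is not
the exact harness the numbers came from: the CPU numbers, the RNG, and the
up-front population step are placeholders, and it assumes a kernel with
MADV_PAGEOUT (5.4+) plus enough RAM and swap to back the two maps.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21                 /* uapi value, for older libc headers */
#endif

#define HOT_SIZE  (2ULL << 30)          /* 2 GiB map kept hot by thread 1 */
#define COLD_SIZE (20ULL << 30)         /* 20 GiB map pushed out to swap */

static atomic_int stop;

/* Thread 1: pinned to CPU 0, writes random 64-byte chunks of the hot map. */
static void *hot_worker(void *arg)
{
        unsigned char *hot = arg;
        uint64_t seed = 88172645463325252ULL, ops = 0;
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(0, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        while (!atomic_load(&stop)) {
                seed ^= seed << 13; seed ^= seed >> 7; seed ^= seed << 17;
                size_t off = (seed % (HOT_SIZE / 64)) * 64;
                memset(hot + off, 0xab, 64);
                ops++;
        }
        /* Throughput during the madvise() window is what degrades without the patch. */
        printf("hot thread did %llu 64B accesses\n", (unsigned long long)ops);
        return NULL;
}

int main(void)
{
        unsigned char *hot = mmap(NULL, HOT_SIZE, PROT_READ | PROT_WRITE,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        unsigned char *cold = mmap(NULL, COLD_SIZE, PROT_READ | PROT_WRITE,
                                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        pthread_t tid;
        cpu_set_t set;

        if (hot == MAP_FAILED || cold == MAP_FAILED) {
                perror("mmap");
                return 1;
        }

        memset(hot, 1, HOT_SIZE);       /* fault both maps in up front */
        memset(cold, 1, COLD_SIZE);

        pthread_create(&tid, NULL, hot_worker, hot);

        /* Thread 2: pinned to CPU 1, asks reclaim to push the cold map to swap. */
        CPU_ZERO(&set);
        CPU_SET(1, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
        if (madvise(cold, COLD_SIZE, MADV_PAGEOUT))
                perror("madvise(MADV_PAGEOUT)");

        atomic_store(&stop, 1);
        pthread_join(tid, NULL);
        return 0;
}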